Grabbing text between two lines with shell variables.

bathtime · February 21, 2018, 12:58pm

I would like to grab complex html text between lines using variables. I am running Debian and using mksh shell.

Here is the part of the html that I want to extract from. I would like to extract the words 'to love,' and I would like to use the above and below lines as reference points.

                <span class="lemma_definition">
                 to love
         </span>

Working script that does not use variables:

#!/bin/sh

URL="perseus.tufts.edu/hopper/morph?l=amo&la=la"

# Working: prints top definition:
wget -q -O- "$URL" | awk '/<span class="lemma_definition">/,/<\/span>/ {{ if (!/>/) {{$1=$1}1; print $0}} }'

NOTES:

(!/>/) = Skip any line that contains a '>'.
{$1=$1}1; = Strips the leading spaces from the result; otherwise it comes out as: ' <several spaces are here> to love'
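
The whitespace trimming can be checked on its own. This is a minimal sketch that uses a made-up sample line rather than the real page output; the variable names line and trimmed are only for illustration.

```shell
#!/bin/sh
# Hypothetical sample line with leading and trailing blanks.
line='                 to love   '

# Assigning to a field makes awk rebuild $0 from its fields, joined by OFS
# (a single space by default). Leading and trailing whitespace is dropped.
trimmed=$(printf '%s\n' "$line" | awk '{ $1 = $1; print }')

# Brackets make any leftover whitespace visible.
printf '[%s]\n' "$trimmed"
```

Inside an action block the trailing 1; is only an expression statement with no effect, so $1=$1 by itself should be enough there.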

Displays the proper text:

to love

Faulty code attempting to use variables:

#!/bin/sh

URL="perseus.tufts.edu/hopper/morph?l=amo&la=la"

wIn='<span class="lemma_definition">'
wOut='</span>'

# Faulty code with variables:
wget -q -O- "$URL" | awk -v vIn="$wIn" -v vOut="$wOut" '/vIn/,/vOut/ {{ if (!/>/) {{$1=$1}1; print $0}} }'

Prints nothing.

How can I properly use the variables to make it work like the non-variable code? I've been reading tutorials but have not come across this situation yet.

vgersh99 · February 21, 2018, 1:02pm

wget -q -O- "$URL" | awk -v vIn="$wIn" -v vOut="$wOut" '$0 ~ vIn,$0 ~ vOut {{ if (!/>/) {{$1=$1}1; print $0}} }'
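
A short sketch of why this form works where /vIn/ does not, run against a made-up stand-in for the page so that no download is needed. The sample text and the variable names literal and dynamic are assumptions for illustration only.

```shell
#!/bin/sh
# Made-up stand-in for the downloaded page (not the real Perseus output).
sample='<div>
                <span class="lemma_definition">
                 to love
         </span>
</div>'

wIn='<span class="lemma_definition">'
wOut='</span>'

# /vIn/ is a regex literal: it searches each line for the characters "vIn",
# not for the contents of the awk variable vIn, so the range never starts.
literal=$(printf '%s\n' "$sample" | awk -v vIn="$wIn" -v vOut="$wOut" \
    '/vIn/,/vOut/ { if (!/>/) { $1 = $1; print } }')

# $0 ~ vIn uses the value of the variable as a dynamic regex.
dynamic=$(printf '%s\n' "$sample" | awk -v vIn="$wIn" -v vOut="$wOut" \
    '$0 ~ vIn, $0 ~ vOut { if (!/>/) { $1 = $1; print } }')

printf 'literal: [%s]\n' "$literal"
printf 'dynamic: [%s]\n' "$dynamic"
```

With this sample, the first command should print nothing and the second should print the trimmed definition line.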

MadeInGermany · February 21, 2018, 3:55pm

Or use a control variable (p).

... | awk -v vIn="$wIn" -v vOut="$wOut" '($0 ~ vOut) {p=0} p {$1=$1; print} ($0 ~ vIn) {p=1}'

The order of the three statements determines whether the boundary lines are included. Here both are excluded.
This version treats vIn and vOut as regular expressions, so special characters in $wIn and $wOut need to be escaped.
The following variant works with plain strings:

... | awk -v vIn="$wIn" -v vOut="$wOut" 'index($0,vOut) {p=0} p  {$1=$1; print} index($0,vIn) {p=1}'
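
To illustrate both points, here is a small sketch with hypothetical markers [start] and [end], chosen only because square brackets are special in a regular expression. The sample text and the variable names regex_out, plain_out and inclusive are made up for the example.

```shell
#!/bin/sh
# Made-up input; the markers are hypothetical and contain regex metacharacters.
sample='header
[start]
 to love
[end]
footer'

wIn='[start]'
wOut='[end]'

# Regex matching: "[start]" is read as a bracket expression (any one of
# s, t, a, r), so it matches unrelated lines and the result is wrong.
regex_out=$(printf '%s\n' "$sample" | awk -v vIn="$wIn" -v vOut="$wOut" \
    '($0 ~ vOut) {p=0} p {$1=$1; print} ($0 ~ vIn) {p=1}')

# Plain string matching with index(): the markers are taken literally.
plain_out=$(printf '%s\n' "$sample" | awk -v vIn="$wIn" -v vOut="$wOut" \
    'index($0,vOut) {p=0} p {$1=$1; print} index($0,vIn) {p=1}')

# Swapping the first and last statements includes both boundary lines.
inclusive=$(printf '%s\n' "$sample" | awk -v vIn="$wIn" -v vOut="$wOut" \
    'index($0,vIn) {p=1} p {$1=$1; print} index($0,vOut) {p=0}')

printf 'regex:     [%s]\n' "$regex_out"
printf 'plain:     [%s]\n' "$plain_out"
printf 'inclusive:\n%s\n' "$inclusive"
```

Only the index() variants should pick out the intended lines here; the regex variant would need the brackets escaped in $wIn and $wOut first.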

bathtime · February 21, 2018, 4:43pm

wget -q -O- "$URL" | awk -v vIn="$wIn" -v vOut="$wOut" '$0 ~ vIn,$0 ~ vOut {{ if (!/>/) {{$1=$1}1; print $0}} }'

This worked perfectly!

...so close, yet so far!

MadeInGermany:

Or use a control variable (p).
... | awk -v vIn="$wIn" -v vOut="$wOut" '($0 ~ vOut) {p=0} p {$1=$1; print} ($0 ~ vIn) {p=1}'
The order of the three statements determines whether the boundary lines are included. Here both are excluded.
This version treats vIn and vOut as regular expressions, so special characters in $wIn and $wOut need to be escaped.
The following variant works with plain strings:
... | awk -v vIn="$wIn" -v vOut="$wOut" 'index($0,vOut) {p=0} p  {$1=$1; print} index($0,vIn) {p=1}'

Thank you. I will try this later tonight or tomorrow when my mind is more fresh (been at this for too long and wouldn't do it justice if I tried now).