Multiline html tag parse shell script

Hello,

I want to parse the contents of a multiline html tag

ex:

<html>
  <body>
    <p>some other text</p>
    <div>
      <p class="margin-bottom-0">
        text1
        <br>
        text2
        <br>
        <br>
        text3
      </p>
    </div>
  </body>
</html>

and I want the output to be:

text1
text2
text3

I tried combinations of grep and sed, and also awk, but I couldn't figure out the right formula.

Thanks!

Show your grep , sed , and awk attempts.

I only have my most recent attempt with awk; I don't remember what I tried with sed.

echo $siteSource | awk 'f{ if (/<\/p>/){printf "%s", buf; f=0; buf=""} else buf = buf $0 ORS}; /<p class="margin-bottom-0">/{f=1}'

which i thought would show at least:

text1
<br>
text2
<br>
<br>
text3

That would only be a half solution, since it would also show the <br>s, but it's not working at all.
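One likely reason the awk attempt prints nothing, assuming $siteSource was filled by curl: the unquoted echo $siteSource lets the shell collapse all newlines into spaces, so awk receives the whole page as a single line; the start pattern matches, next skips that only line, and there is nothing left to print. A minimal sketch of the difference, using a hypothetical sample in place of the real page:

```shell
#!/bin/sh
# Hypothetical stand-in for $siteSource holding the fetched HTML.
siteSource='<p class="margin-bottom-0">
text1
<br>
text2
</p>'

# Unquoted expansion: the shell joins all lines into one, so the open-tag
# pattern matches the single line, "next" skips it, and nothing is printed.
echo $siteSource | awk '/<p class="margin-bottom-0">/{f=1;next} /<\/p>/{f=0} f'

# Quoted via printf: the newlines survive and the block is extracted.
printf '%s\n' "$siteSource" | awk '/<p class="margin-bottom-0">/{f=1;next} /<\/p>/{f=0} f'
```

The second command prints text1, <br>, and text2, each on its own line.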

Please be aware that there are better suited, tailored tools out there when it comes to analysing / handling HTML data. How far would

sed -n '1h; 1!H; ${x; s/ *<[^>]*>\n* *//g; p;}' file
some other texttext1
text2
text3

   

(as a starter) get you?

Not good enough, since the some other text in my situation is much longer; I just simplified it in the example.

I want to get at least what is between <p class="margin-bottom-0"> and </p>
so that the output would be:

text1
<br>
text2
<br>
<br>
text3

I know that there are better tools, but I started out with a simple shell script that grew over time,
and I got everything else that I need... this is the last remaining item that I could not parse.

Thanks.

Another acceptable solution would be to get the next 5 lines of the code after finding <p class="margin-bottom-0">;
I can process that result afterwards.

I tried this but it did not work:

echo $siteSource | grep -A 5 -Eoi '<p class="margin-bottom-0">[^>]+<'
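For the grep route specifically: with -o, grep prints just the matched fragment rather than whole lines, so the -A 5 context is not useful there, and the unquoted echo again flattens the input to one line. A sketch of the "tag plus next 5 lines" idea, assuming the HTML reaches grep with its newlines intact (a hypothetical sample stands in for the real page):

```shell
#!/bin/sh
# Hypothetical sample standing in for the fetched page.
siteSource='<p class="margin-bottom-0">
text1
<br>
text2
<br>
<br>
text3
</p>'

# -F matches the tag literally (no regex escaping needed);
# -A 5 also prints the five lines that follow the matching line.
printf '%s\n' "$siteSource" | grep -F -A 5 '<p class="margin-bottom-0">'
```

That prints the tag line plus text1, <br>, text2, <br>, <br>, which could then be post-processed as described.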
$ sed -n '/<p class="margin-bottom-0">/,/<\/p>/p' myFile
      <p class="margin-bottom-0">
        text1
        <br>
        text2
        <br>
        <br>
        text3
      </p>

vgersh99, I tried that before, but I don't know why it lists the whole file.

Please provide the output of (using code tags): cat -vet myFile

Why didn't you specify that, then, in the first place?

Try

sed -n '/<p/,/<\/p/ {/<p.*\/p>/b; s/ *<[^>]*> *//g; /^$/d;  p}' file
        text1
        text2
        text3
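For anyone who prefers awk for later tweaks, the same extraction can be sketched there as well (a rough equivalent of the sed above, not a drop-in for every input; the sample HTML from the question stands in for the file):

```shell
#!/bin/sh
# Sample input (the HTML from the question); in practice this would be a file.
html='      <p class="margin-bottom-0">
        text1
        <br>
        text2
        <br>
        <br>
        text3
      </p>'

printf '%s\n' "$html" | awk '
  /<p class="margin-bottom-0">/ { f = 1; next }  # block starts after this line
  /<\/p>/                       { f = 0 }        # block ends at </p>
  f {
    gsub(/ *<[^>]*> */, "")                      # strip tags such as <br>
    if ($0 != "") print                          # drop lines that became empty
  }'
```

Like the sed version, this keeps the original leading indentation of the text lines.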

Since HTML is very similar to XML, you may use an XML tool to parse your file.

Since your HTML file is not fully standards compliant, the parser complains about it, and the file either has to be adapted by hand to be compliant or be preprocessed prior to parsing. The <br> is the problematic element. Compliant would be <br/>, with a slash within the tag.

So you can do it with an XML parser like xmlstarlet in three steps:

sed 's|<br>|<br/>|gi' data.html      |
  xmlstarlet sel -t -v '//body/div/p'   |
  sed -e '/^\s*$/d' -e 's/^\s*//'

1. Make the HTML file compliant by replacing the br tags
2. Get the wanted HTML element with xmlstarlet
3. Suppress unwanted empty lines and leading whitespace in the xmlstarlet output

RudiC, thanks, that works just great if I have a file with the HTML code, but I store the HTML code in a variable, not a file:

text=$(sed -n '/<p class="margin-bottom-0">/,/<\/p/ {
            /<p.*\/p>/b
            s/ *<[^>]*> *//g
            /^$/d
            p
            }' htmlfile)

echo $text >> results

This is my final solution:

siteSource=$(curl -L --connect-timeout 14 "$urls" 2> /dev/null)

text=$(printf "%s" "$siteSource" | sed -n '/<p class="margin-bottom-0">/,/<\/p/ {
            /<p.*\/p>/b
            s/ *<[^>]*> *//g
            /^$/d
            p
            }')

echo $text >> results
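Side note on that last line, assuming results is the CSV mentioned later: the unquoted echo $text collapses the newlines in $text into single spaces, while echo "$text" would preserve them. A quick demonstration of the difference:

```shell
#!/bin/sh
# Hypothetical two-line value standing in for the extracted text.
text='line1
line2'

echo $text      # prints: line1 line2
echo "$text"    # prints line1 and line2 on separate lines
```

Here the flattening happens to be convenient for a one-line CSV field, but it is worth knowing it is the unquoted expansion doing it.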

Oh, and I also had to get rid of the semicolons, because I got the error sed: 1: "/<p/,/<\/p/ {/<p.*\/p>/ ...": unexpected EOF (pending }'s),
and found that using newlines instead of semicolons fixes this error.
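That sed: 1: ... unexpected EOF error message is characteristic of BSD sed (e.g. on macOS), which is stricter about how commands inside { } are terminated. Besides literal newlines, splitting the script into separate -e expressions is another form that both BSD and GNU sed accept, as far as I can tell (a sketch, with a hypothetical sample in place of the real page):

```shell
#!/bin/sh
# Hypothetical sample in place of the real page.
siteSource='<p class="margin-bottom-0">
text1
<br>
text2
</p>'

# Each -e expression is treated as its own script line,
# so no semicolons are needed inside the { } block.
printf '%s\n' "$siteSource" | sed -n \
    -e '/<p class="margin-bottom-0">/,/<\/p/{' \
    -e '/<p.*\/p>/b' \
    -e 's/ *<[^>]*> *//g' \
    -e '/^$/d' \
    -e 'p' \
    -e '}'
```

This prints text1 and text2 on separate lines, the same as the newline-separated script.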

Thanks everyone for the help.

--- Post updated at 09:42 PM ---

stomp, I like your solution too, it looks very clean. Unfortunately xmlstarlet is very picky;
in my real-life problem it's not just the <br>s that need to be transformed to be compliant, and it would be overkill to check and transform the whole HTML page for xmlstarlet.
But I'm glad you showed me this; I might use it somewhere else.

Thanks!

How about

text=$(curl -L --connect-timeout 14 "$urls" 2> /dev/null |
       sed -n '/<p class="margin-bottom-0">/,/<\/p/ {
            /<p.*\/p>/b
            s/ *<[^>]*> *//g
            /^$/d
            p
            }')

dropping the intermediate variable?

RudiC, I'm not dropping it, because I need to get other texts out of the HTML, but for the example's sake, yes, that would make it more optimized.
I have 5 more texts that I'm matching, and I'm writing the output into a CSV file.
The HTML that I'm parsing is built very poorly.

Actual code snippet:

<p class="margin-bottom-0">
											Diameter: 2<br>
Width [cm]: 4<br>Accessories: no<br>Material: metal<br>										</p>

So as you can see, it has a lot of spaces, and newlines at the beginning and end.
Your solution doesn't deal with those, so I came up with this:

text=$(printf "%s" "$siteSource" | sed -n '/<p class="margin-bottom-0">/,/<\/p/ {
            /<p class="margin-bottom-0">.*\/div>/b
            s/ *<[^>]*> *//g
            /^$/d
            p
            }' | awk '{$1=$1};1' | tr ',' ';' | tr -d "\n\r" )

Since I need this all on one line or else the CSV file will break (I just realized this), I had to get rid of the newlines with tr -d "\n\r".
I'm removing the extra whitespace at the beginning and end with awk '{$1=$1};1'.
Also, for CSV proofing, I'm replacing the commas with semicolons, because CSV will interpret commas as column separators: tr ',' ';'.
This makes me wonder if that one sed could do all of this on its own.
But I'm happy, because this works now.

Thanks!

Why don't you paint the whole picture with your requirements (including but not limited to "get other texts out of the html", "get rid of the new lines", "replacing the commas with semicolon") and input data, so people could work towards a final, optimal solution? E.g. the sed, awk, and dual tr invocations could be combined into a single run of one of the tools.
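Along those lines, here is a sketch of a single awk run that folds the extraction, tag stripping, trimming, comma-to-semicolon replacement, and joining onto one line together. It is tested only against a small sample resembling the posted snippet; the real page may need tweaks:

```shell
#!/bin/sh
# Hypothetical sample resembling the posted snippet.
siteSource='<p class="margin-bottom-0">
  Diameter: 2<br>
Width [cm]: 4<br>Accessories: no<br>Material: metal<br>  </p>'

printf '%s\n' "$siteSource" | awk '
  /<p class="margin-bottom-0">/ { f = 1 }    # block starts on this line
  f {
    last = /<\/p>/                           # note whether the block ends here
    gsub(/<[^>]*>/, " ")                     # replace tags with spaces
    gsub(/,/, ";")                           # CSV-proof the commas
    gsub(/^[ \t]+|[ \t]+$/, "")              # trim leading/trailing whitespace
    gsub(/[ \t][ \t]+/, " ")                 # squeeze internal runs of blanks
    if ($0 != "") line = line (line == "" ? "" : " ") $0
    if (last) { print line; line = ""; f = 0 }
  }'
```

On the sample above this emits one line: Diameter: 2 Width [cm]: 4 Accessories: no Material: metal.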

Hi,

here's a suggestion using pup, an HTML parser written in Go:

pup 'div p text{}' < data.html

# Output:

        text1
        

        text2
        

        text3

Explanation: get all p elements that have a div element as parent, and output their text data.

To get rid of the empty lines, I suggest a small sed command afterwards:

pup 'div p text{}' < data.html | sed '/^\s*$/d'

# Output
        text1
        text2
        text3
  

Another short demonstration of pup, which I recently used to get the coronavirus case numbers out of a complex website and into variables (for generating this graph: coronavirus statistics) with only one combined command:

 read n n n n infected deceased recovered < <(wget -O- -q https://www.worldometers.info/coronavirus/  \
       | pup 'div[id="maincounter-wrap"]' | pup 'h1,span text{}' | xargs echo)

Pup is found here: pup on Github

As with all Go binaries, it's statically linked and quite large (4 MB). Precompiled binaries are available on GitHub (link above).