Awk/sed HTML extract

p1ne · July 31, 2016, 9:36pm

I'm extracting text between table tags in HTML

<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>

using this:

awk -F "</*th>" '/<\/*th>/ {print $2}' auto2 > auto3

then this (text between a href):

sed -e 's/\(<[^<][^<]*>\)//g' auto3 > auto4

How can I shorten this into one command, preferably just awk or just sed? I've tried the following, where $0 prints the entire a href line, tags included, but $1, $2, $3, etc. just give a blank file.

awk -F "</?a href.*>" '{print $0}' auto3 > auto5

Thanks in advance for help.
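
A possible explanation for the blank fields, offered as a sketch rather than a diagnosis: the `.*` inside that field separator is greedy, so the separator runs from `<a href` through the last `>` on the line and nothing is left after it. Splitting on one tag at a time (`<[^>]*>`) leaves the link text as a field of its own. The sample line is the one from the post; the variable names are only illustrative.

```shell
# Sample line as it appears in auto2 (taken from the post above).
line='<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>'

# Greedy separator: matches from "<a href" to the last ">", so $2 is empty.
greedy=$(printf '%s\n' "$line" | awk -F '</?a href.*>' '{print $2}')

# One tag per separator: the fields are "", "", "Buick LeSabre", "", "".
text=$(printf '%s\n' "$line" | awk -F '<[^>]*>' '/<th>/ {print $3}')

printf 'greedy: [%s]\ntag-split: [%s]\n' "$greedy" "$text"
```

With the tag-wise separator the two-step awk and sed pipeline collapses into a single awk call on auto2.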

rdrtx1 · July 31, 2016, 10:15pm

awk -F"[<>]" '/<\/th>/ {print $5}' auto2

RudiC · August 1, 2016, 1:36am

Given that each <th> element is on a line by itself (which your awk sample requires anyway):

sed -n '/^<th/s/<[^>]*>//gp' file
Buick LeSabre

EDIT: Should that NOT be the case, strip everything outside the <th> element up front:

sed -n '/<th/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp}' file

p1ne · August 1, 2016, 8:45am

Thanks RudiC, those are both very close. I probably should have posted the table structure, because the sed commands also return fields from other table elements. I just need the a href text inside the <th> cells under the "Automobile" heading:

<table class="wikitable sortable" style="font-size:90%">
<tr>
<th style="width:5em">Image</th>
<th style="width:15em">Automobile</th>
<th style="width:10em">Production</th>
<th style="width:15em">Units Sold</th>
<th style="width:10em">Years sold</th>
<th style="width:25em">Notes</th>
</tr>
<tr>
<td>
<div class="center">
<div class="floatnone"><a href="/wiki/File:Late_model_Ford_Model_T.jpg" class="image" title="1927 Ford Model-T."><img alt="1927 Ford Model-T." src="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/100px-Late_model_Ford_Model_T.jpg" width="100" height="91" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/150px-Late_model_Ford_Model_T.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/200px-Late_model_Ford_Model_T.jpg 2x" data-file-width="400" data-file-height="365" /></a></div>
</div>
</td>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td>1908-27</td>
<td><b>16,500,000</b><sup id="cite_ref-ford_7-0" class="reference"><a href="#cite_note-ford-7">[7]</a></sup></td>
<td>1908-27</td>
<td>The first car to achieve one million, five million, ten million and fifteen million units sold. By 1914, it was estimated that nine out of every ten cars in the world were <a href="/wiki/Ford_Motor_Company" title="Ford Motor Company">Fords</a>.<sup id="cite_ref-ford_7-1" class="reference"><a href="#cite_note-ford-7">[7]</a></sup></td>
</tr>

Thanks for your time.

Re: rdrtx1's awk command, thanks. $1 prints the full doc, and anything beyond $1 prints a blank file (I tried up to $6).
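
One thing that may contribute to the blank output, assuming the file looks like the excerpt above: `/<\/th>/` also matches the header cells such as `<th style="width:15em">Automobile</th>`, and on those lines `$5` is empty. A sketch that restricts the match to cells starting with `<th><a href` is shown below; the temporary file and variable names are made up for the demonstration.

```shell
# A few rows from the table above, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<th style="width:15em">Automobile</th>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td>1908-27</td>
EOF

# Unrestricted: the header row matches /<\/th>/ too, and its $5 is empty.
all=$(awk -F'[<>]' '/<\/th>/ {print $5}' "$tmp")

# Restricted to rows that start with <th><a href: only the link text prints.
models=$(awk -F'[<>]' '/^<th><a href/ {print $5}' "$tmp")

printf '%s\n' "$models"
rm -f "$tmp"
```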

RavinderSingh13 · August 1, 2016, 8:55am

Hello p1ne,

Could you please try the following and let me know if it helps?

awk '($1 ~ /<th><a/){sub(/.*\">/,X,$0);sub(/<.*/,X,$0);print $0}'   Input_file

Output will be as follows.

Ford Model T

EDIT: Adding one more solution for the same problem.

 awk '{if($0 ~ /^<th><a href=\"/){match($0,/\">.*/);print substr($0,RSTART+2,RLENGTH-11)}}'  Input_file

Thanks,
R. Singh

p1ne · August 1, 2016, 9:03am

Thanks so much, R. Singh, indeed, that does it!

RudiC, following your example, I'd like to solve also with sed. I'm trying this and variations, which give blank file:

sed -n '/^<th.^<a href.*/s/<[^>]*>//gp' auto2 > auto3

Thanks again.
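
A sketch of why that address probably never matches: in `/^<th.^<a href.*/` the `.` consumes the `>` after `<th`, and the second `^` is not at the start of the pattern, so sed looks for a literal `^` character that the line does not contain. Spelling the tags out and anchoring only once selects the row. The variable names are illustrative.

```shell
row='<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>'

# Original attempt: "." eats ">", then a "^" is required mid-line, so no line matches.
broken=$(printf '%s\n' "$row" | sed -n '/^<th.^<a href.*/s/<[^>]*>//gp')

# Anchor once at the start and write the tags literally.
fixed=$(printf '%s\n' "$row" | sed -n '/^<th><a href/s/<[^>]*>//gp')

printf 'broken: [%s]\nfixed: [%s]\n' "$broken" "$fixed"
```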

RavinderSingh13 · August 1, 2016, 9:38am

Glad to help, p1ne. Could you please try the following code and let us know if it helps?

sed -n '/^<th><a href="/s/\(.*">\)\(.*\)\(<\/a.*\)/\2/p'   Input_file

Output will be as follows.

Ford Model T

Thanks,
R. Singh

RudiC · August 1, 2016, 10:07am

Can't test right now, but would this do:

sed -n '/<th><a href/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp}' file

p1ne · August 1, 2016, 10:33am

Thanks very much RudiC and R. Singh! In each example, it seems either a { } command block or \2 (the second captured group) can be used to pull out the a href text.
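
To make that comparison concrete, here is a small side-by-side sketch of the two sed styles from the thread, run on a trimmed copy of the table (the `<td>` row is abbreviated from the post, and the file and variable names are illustrative). The `;` before `}` is there because some sed implementations require it.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<th style="width:15em">Automobile</th>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td><a href="/wiki/Ford_Motor_Company" title="Ford Motor Company">Fords</a></td>
EOF

# Style 1: address plus a { } block that deletes every tag, then prints.
grouped=$(sed -n '/^<th><a href/{s/<[^>]*>//g;p;}' "$tmp")

# Style 2: capture three groups and keep only the second (the link text).
captured=$(sed -n '/^<th><a href="/s/\(.*">\)\(.*\)\(<\/a.*\)/\2/p' "$tmp")

printf 'grouped: %s\ncaptured: %s\n' "$grouped" "$captured"
rm -f "$tmp"
```

Both variants skip the header cell and the `<td>` link because the address requires the line to start with `<th><a href`.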