Awk/sed HTML extract

I'm extracting the text between table header tags in HTML, for example:

<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>

using this:

awk -F "</*th>" '/<\/*th>/ {print $2}' auto2 > auto3

then this, to get the text inside the a href link:

sed -e 's/\(<[^<][^<]*>\)//g' auto3 > auto4

How can I shorten this into one command, preferably just awk or just sed? I've tried the following, where $0 prints the entire a href line, tags included, but $1, $2, $3, etc. just give a blank file.

awk -F "</?a href.*>" '{print $0}' auto3 > auto5

Thanks in advance for help.

awk -F"[<>]" '/<\/th>/ {print $5}' auto2
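To see why $5 is the field of interest with -F"[<>]", it can help to print every field next to its index. This is only a sketch, assuming the sample line from the first post sits on a single line; the variable names line, fields and name are made up for the demo.

```shell
# Sample line from the original post (assumed to sit on one line).
line='<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>'

# Split on every < or > and list each field with its index.
fields=$(printf '%s\n' "$line" |
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"

# Field 5 is the link text; field 1 is empty because the line starts with <.
name=$(printf '%s\n' "$line" | awk -F'[<>]' '/<\/th>/ { print $5 }')
printf '%s\n' "$name"
```

Note that a plain header line such as <th style="width:15em">Automobile</th> splits into only five fields, with $5 empty, so the same command prints a blank line for it.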

Given those <th> tags are on a line by themselves (which would be required for your awk sample to work anyway),

sed -n '/^<th/s/<[^>]*>//gp' file
Buick LeSabre

EDIT: Should that NOT be the case, remove other tags upfront...

sed -n '/<th/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp}' file

Thanks RudiC, those are both very close. I probably should have posted the table structure, because the sed commands also return fields from other table elements. I just need the a href text inside the <th> cells that fall under the "Automobile" heading:

<table class="wikitable sortable" style="font-size:90%">
<tr>
<th style="width:5em">Image</th>
<th style="width:15em">Automobile</th>
<th style="width:10em">Production</th>
<th style="width:15em">Units Sold</th>
<th style="width:10em">Years sold</th>
<th style="width:25em">Notes</th>
</tr>
<tr>
<td>
<div class="center">
<div class="floatnone"><a href="/wiki/File:Late_model_Ford_Model_T.jpg" class="image" title="1927 Ford Model-T."><img alt="1927 Ford Model-T." src="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/100px-Late_model_Ford_Model_T.jpg" width="100" height="91" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/150px-Late_model_Ford_Model_T.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/200px-Late_model_Ford_Model_T.jpg 2x" data-file-width="400" data-file-height="365" /></a></div>
</div>
</td>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td>1908-27</td>
<td><b>16,500,000</b><sup id="cite_ref-ford_7-0" class="reference"><a href="#cite_note-ford-7">[7]</a></sup></td>
<td>1908-27</td>
<td>The first car to achieve one million, five million, ten million and fifteen million units sold. By 1914, it was estimated that nine out of every ten cars in the world were <a href="/wiki/Ford_Motor_Company" title="Ford Motor Company">Fords</a>.<sup id="cite_ref-ford_7-1" class="reference"><a href="#cite_note-ford-7">[7]</a></sup></td>
</tr>

Thanks for your time.
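The behaviour described above can be reproduced on a short excerpt of this table. The following is a sketch, not a tested fix for the full page: it assumes each cell sits on its own line as posted, and the temp file and the variable names loose and strict are only for illustration.

```shell
# Build a small test file from lines in the posted table.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<th style="width:5em">Image</th>
<th style="width:15em">Automobile</th>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td>1908-27</td>
EOF

# The address /^<th/ also matches the header cells, so their text is printed too.
loose=$(sed -n '/^<th/s/<[^>]*>//gp' "$tmp")

# Tightening the address to <th><a skips the plain header cells.
strict=$(sed -n '/^<th><a /s/<[^>]*>//gp' "$tmp")

printf '%s\n---\n%s\n' "$loose" "$strict"
rm -f "$tmp"
```

The only difference between the two commands is the address in front of the s command, which decides which lines the tag stripping is applied to.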

Re: rdtx1's awk command, thanks. $1 prints the full document, but anything beyond $1 gives a blank file (I tried up to $6).

Hello p1ne,

Could you please try the following and let me know if this helps you.

awk '($1 ~ /<th><a/){sub(/.*\">/,X,$0);sub(/<.*/,X,$0);print $0}'   Input_file

Output will be as follows.

Ford Model T

EDIT: Adding one more solution for the same problem.

awk '{if($0 ~ /^<th><a href=\"/){match($0,/\">.*/);print substr($0,RSTART+2,RLENGTH-11)}}'  Input_file

Thanks,
R. Singh
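For comparison, a variant that avoids the hard-coded offsets (the +2 and -11 above depend on the closing </a></th> being exactly that long) is to delete every tag with gsub. This is a minimal sketch, assuming each <th><a ...> cell sits on its own line as in the posted table; the temp file and the variable name result are made up for the demo.

```shell
# Build a small test file from lines in the posted table.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<th style="width:15em">Automobile</th>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td>1908-27</td>
EOF

# On lines starting with <th><a, remove all tags and print what is left.
result=$(awk '/^<th><a / { gsub(/<[^>]*>/, ""); print }' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With no third argument, gsub works on $0, so print then emits the line with the tags already removed.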

Thanks so much, R. Singh, indeed, that does it!

RudiC, following your example, I'd also like to solve this with sed. I'm trying this and variations of it, which all give a blank file:

sed -n '/^<th.^<a href.*/s/<[^>]*>//gp' auto2 > auto3

Thanks again.

Glad to help you, p1ne. Could you please try the following code and let us know if this helps.

sed -n '/^<th><a href="/s/\(.*">\)\(.*\)\(<\/a.*\)/\2/p'   Input_file

Output will be as follows.

Ford Model T

Thanks,
R. Singh

Can't test right now, but would this do:

sed -n '/<th><a href/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp}' file

Thanks very much, RudiC and R. Singh! From the examples, it seems that either a { } command group or the \2 back-reference (the second captured group) can be used to pull out the a href text.
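To make the two approaches concrete, here is a small side-by-side run on a three-line excerpt of the posted table. It is a sketch under the same one-cell-per-line assumption; the temp file and the variable names grouped and backref are only for illustration, and a ; was added before the closing } because some sed implementations require it.

```shell
# Build a small test file from lines in the posted table.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<th style="width:15em">Automobile</th>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td><b>16,500,000</b></td>
EOF

# { } groups several s commands that all run only on the addressed line.
grouped=$(sed -n '/<th><a href/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp;}' "$tmp")

# \2 keeps only the second \( \) group: the text between "> and </a.
backref=$(sed -n '/^<th><a href="/s/\(.*">\)\(.*\)\(<\/a.*\)/\2/p' "$tmp")

printf '%s\n%s\n' "$grouped" "$backref"
rm -f "$tmp"
```

Both commands select the line by address first; they differ only in whether the unwanted parts are deleted step by step or the wanted part is captured and kept.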