Retrieve a text word from HTML code using awk/sed

sk2code · April 11, 2014, 5:12pm

awk/sed newbie here. I have an HTML file, and I would like to retrieve a text word from it.

<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc045_process.txt>abc045</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/cdf_process.txt>cdf</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version:  4.0 </li>

For example, from the 1st line I would like to retrieve abc, which is placed between txt> and </a>: txt>abc</a>

I have used the following command, but as you can see, the number of letters in the word keeps changing: abc, abc01, abc045, cdf, Manhattan.
awk -F\/ '{print substr($4,0,3)}' list.html

So this command only gets the right output for the 3-letter words. However, I want to extract the same information (abc01, abc045, cdf, Manhattan) from all the lines in the HTML code. Please help.

Yoda · April 11, 2014, 5:14pm

awk -F'[<>]' '{ print $7 }' file.html

sk2code · April 11, 2014, 5:21pm

I just ran this, but it gives me no output, just blank lines. The HTML file has 5 lines, and when I run the command you mentioned I get 5 blank lines.

Yoda · April 11, 2014, 5:28pm

Ok, here is what I got:

$ cat file.html
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc045_process.txt>abc045</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/cdf_process.txt>cdf</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version:  4.0 </li>

$ awk -F'[<>]' '{ print $7 }' file.html
abc
abc01
abc045
cdf
Manhattan
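
If $7 comes back blank on your side, one way to check is to dump every field with its number and see where the word actually lands. This is only a debugging sketch: show_fields is a made-up helper name, and the guess that the real file has extra tags or text in front of <font> (which would shift the field number) is an assumption, not something confirmed here.

```shell
# Hypothetical helper: print each field of every input line with its index,
# using the same field separator as the command above.
show_fields() {
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d: [%s]\n", i, $i }' "$@"
}

# One sample line from this thread, fed on stdin.
printf '%s\n' '<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>' |
  show_fields
```

Run it against the real file with show_fields file.html. Whatever number is printed next to the word you want is the field to use in place of $7.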

You can also try:

sed 's#.*txt>##;s#<.*##' file.html
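
The sed above deletes everything up to txt> and then everything from the next < onward. A possible variant, in case the href does not always end in txt, is to capture the text between the <a ...> tag and </a> instead. This is a sketch only: link_text is a made-up helper name, and it assumes one link per line, as in the sample.

```shell
# Hypothetical helper: print only the text between <a ...> and </a>.
# -n plus the p flag prints just the lines where the substitution matched.
link_text() {
  sed -n 's#.*<a [^>]*>\([^<]*\)</a>.*#\1#p' "$@"
}

# Two sample lines from this thread, fed on stdin.
printf '%s\n' \
  '<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version:  4.0 </li>' \
  '<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version:  4.0 </li>' |
  link_text
```

Lines without a link are skipped rather than printed unchanged. If a line holds several links, only the last one is printed, because the leading .* is greedy.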

Corona688 · April 11, 2014, 5:43pm

Does the HTML actually look like the data you pasted, or did you pretty it up? Many times when XML/HTML comes up, the 5 "lines" later turn out to be tags not necessarily organized into lines at all.
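
If the tags do turn out to be run together on one line, something along these lines might still pull the words out. It is a rough sketch, not a real HTML parser: link_texts is a made-up helper name, it assumes a grep that supports -o (GNU and BSD grep do, POSIX does not require it), and it will miss a tag that is split across two lines.

```shell
# Hypothetical helper: print the text of every <a ...>...</a>, one per line,
# no matter how many of them share an input line.
link_texts() {
  # grep -o emits each matching <a ...>text</a> on its own line;
  # sed then strips the opening and closing tags, leaving only the text.
  grep -o '<a [^>]*>[^<]*</a>' "$@" | sed 's#<[^>]*>##g'
}

# Two list items from the sample squeezed onto a single line.
printf '%s\n' '<li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li><li><a href=/value_for_clients/Tokyo/cdf_process.txt>cdf</a> NDK Version:  4.0 </li>' |
  link_texts
```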

SriniShoo · April 12, 2014, 12:28am

awk 'BEGIN{FS = "</a>"} {n=split($1, a, ">"); print a[n]}' file

srinus@ubuntu:~$ cat sam
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc045_process.txt>abc045</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/cdf_process.txt>cdf</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version:  4.0 </li>
srinus@ubuntu:~$ awk 'BEGIN{FS = "</a>"} {n=split($1, a, ">"); print a[n]}' sam
abc
abc01
abc045
cdf
Manhattan
srinus@ubuntu:~$

sk2code · April 14, 2014, 1:19pm

Yoda and everyone, thanks for your help.
The awk command still results in blank output, but the following command helped me.

sed 's#.*txt>##;s#<.*##' file.html

Thank you!!