Retrieve information Text/Word from HTML code using awk/sed

awk/sed newbie here. I have an HTML file, and I would like to retrieve one word of text from each line of it.

<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc045_process.txt>abc045</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/cdf_process.txt>cdf</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version:  4.0 </li>

For example, from the 1st line I would like to retrieve abc, which sits between txt> and </a>: txt>abc</a>

I have used the following command, but as you can see, the number of letters in the word keeps changing: abc, abc01, abc045, cdf, Manhattan.
awk -F\/ '{print substr($4,0,3)}' list.html

So this command only gets the right output for the 3-letter words. However, I want to extract the same information (abc01, abc045, cdf, Manhattan) from all the lines in the HTML code. Please help.

awk -F'[<>]' '{ print $7 }' file.html

I just ran this, but it gives me no useful output, just blank lines. The HTML file has 5 lines, and when I run the command you mentioned I get 5 blank lines.

Ok, here is what I got:

$ cat file.html
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc045_process.txt>abc045</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/cdf_process.txt>cdf</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version:  4.0 </li>
$ awk -F'[<>]' '{ print $7 }' file.html
abc
abc01
abc045
cdf
Manhattan
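
If $7 comes back empty on your side, it may help to dump every field with its index and see where the name actually lands. A rough sketch, assuming the first sample line from above is the input (the variable names line and fields are just for illustration):

```shell
# First sample line, copied from the data posted above.
line='<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>'

# Split on every < or > and print each field as index=[content].
fields=$(printf '%s\n' "$line" |
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

With this input the name shows up as 7=[abc]. If the real file has more (or fewer) tags in front of the link, the name lands in a different field, and printing $7 gives an empty line.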

You can also try:

sed 's#.*txt>##;s#<.*##' file.html
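
In case it is not obvious what the two substitutions do, here is a sketch that runs them one at a time on the first sample line (step1 and step2 are illustrative names only):

```shell
# First sample line, copied from the data posted above.
line='<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>'

# 1st substitution: delete everything up to and including the last "txt>".
step1=$(printf '%s\n' "$line" | sed 's#.*txt>##')

# 2nd substitution: delete from the first remaining "<" to the end of the line.
step2=$(printf '%s\n' "$step1" | sed 's#<.*##')

printf 'after step 1: %s\n' "$step1"
printf 'after step 2: %s\n' "$step2"
```

Since this keys on the text txt> rather than on a field position, it does not care how many tags come before the link. It does assume one link per line.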

Does the HTML actually look like the data you pasted, or did you pretty it up? Often when XML/HTML comes up, 5 "lines" later turns out to mean 5 tags that are not necessarily organized into lines at all.
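
If it does turn out that several links share one physical line, something along these lines might cope better. This is only a sketch: the html string below is made-up test data, not your file, and grep -o is not in POSIX (GNU and BSD grep both have it).

```shell
# Hypothetical input: two links squeezed onto a single line.
html='<li><a href=/x/abc_process.txt>abc</a> NDK</li><li><a href=/x/cdf_process.txt>cdf</a> NDK</li>'

# grep -o prints each match on its own line; sed then trims the markers.
names=$(printf '%s\n' "$html" |
  grep -o 'txt>[^<]*</a>' |
  sed 's#^txt>##; s#</a>$##')

printf '%s\n' "$names"
```

It still will not handle link text that is broken across a newline; for that a real HTML parser is the safer route.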

awk 'BEGIN{FS = "</a>"} {n=split($1, a, ">"); print a[n]}' file
srinus@ubuntu:~$ cat sam
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc_process.txt>abc</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc01_process.txt>abc01</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/abc045_process.txt>abc045</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/cdf_process.txt>cdf</a> NDK Version:  4.0 </li>
<font face=arial size=-1><li><a href=/value_for_clients/Tokyo/Manhattan_process.txt>Manhattan</a> NDK Version:  4.0 </li>
srinus@ubuntu:~$ awk 'BEGIN{FS = "</a>"} {n=split($1, a, ">"); print a[n]}' sam
abc
abc01
abc045
cdf
Manhattan
srinus@ubuntu:~$ 

Yoda and everyone, thanks for your help.
The awk command still gives me blank output, but the following sed command worked:

sed 's#.*txt>##;s#<.*##' file.html

Thank you!!