Extract data with awk from HTML files

sbobotex · December 17, 2010, 6:51am

Hello everyone, I'm new to this forum and new to shell scripting.

I have a set of HTML files in a directory, and I would like to extract from each of them some data that lies between two known lines.
Here's my situation:

 <td align="default"> oxidizability (mg / l):
 data_to_extract 
 </ td>

This structure is repeated in all of these files.
How do I use awk to extract the data and write it to a file.txt?
Thank you all.

Franklin52 · December 17, 2010, 7:39am

Try this:

awk 'p && /<\/ td>/{p=0}
p
/<td align="default">/{p=1}' htmlfile > file.txt
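For readers new to awk, the command above relies on a flag variable: p is set after the opening line is seen and cleared when the closing line is reached, so only the lines strictly between the two are printed. A minimal sketch of how this behaves, assuming a hypothetical input file named sample.html that mirrors the structure from the question:

```shell
# sample.html is a hypothetical file that mirrors the structure in the question
cat > sample.html <<'EOF'
<tr>
<td align="default"> oxidizability (mg / l):
 34
 </ td>
</tr>
EOF

# Rule order matters:
#   1. clear the flag on the closing line, before the print rule runs
#   2. print the current line while the flag is set
#   3. set the flag on the opening line, after the print rule has run
awk 'p && /<\/ td>/{p=0}
p
/<td align="default">/{p=1}' sample.html > file.txt

cat file.txt
```

With this sample, file.txt should contain only the value line. Note that the closing pattern has to match the closing tag exactly as it appears in the files (with or without the space after the slash).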

sbobotex · December 17, 2010, 9:48am

OK, thanks for the answer, but I need to customize the command.
I have a group of HTML files inside a directory, and each of them contains this structure:

<td align="default"> oxidizability (mg / l):
 data_to_extract 
 </td>

"data_to_extract" is the value that changes, while

<td align="default"> oxidizability (mg / l):

and

</td>

remain the same.

So, assuming I have 3 HTML files, the resulting file.txt should look something like this:

<td align="default"> oxidizability (mg / l):
 34
 </td> <td align="default"> oxidizability (mg / l):
 45 
 </td> <td align="default"> oxidizability (mg / l):
 56
 </td>

This is exactly what I need to do.

Franklin52 · December 17, 2010, 10:43am

You could try something like:

awk '
/<td align="default">/{p=1; s=$0}
p && /<\/td>/{print $0 FS s; s=""; p=0}
p' file >> newfile

sbobotex · December 17, 2010, 11:37am

Sorry, but it still doesn't work. I need to filter on exactly

<td align="default"> oxidizability (mg / l):

not

<td align="default">

ctsgnb · December 17, 2010, 12:23pm

Please give a representative sample of the input file and of the expected output file.

sbobotex · December 20, 2010, 10:39am

OK, I made some edits starting from your example. Now it works! You were very helpful, thank you very much!
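The edited command itself was not posted in the thread. As a minimal sketch of one way to do it, and not necessarily what was actually used, the full opening line can be matched as a literal string with index() instead of a regular expression, which avoids escaping the parentheses and slash. The html_demo directory, its file names, and the extra "other" block are hypothetical and exist only to make the example self-contained:

```shell
# Build three hypothetical sample files; each has one unrelated <td> block
# and one oxidizability block, to show that only the latter is selected
mkdir -p html_demo
for v in 34 45 56; do
cat > "html_demo/f$v.html" <<EOF
<tr>
<td align="default"> other (mg / l):
 1
 </td>
<td align="default"> oxidizability (mg / l):
 $v
 </td>
</tr>
EOF
done

# start holds the exact opening line; index() does a literal substring test
awk -v start='<td align="default"> oxidizability (mg / l):' '
index($0, start) { p = 1 }   # opening line found: start printing
p                            # print opening line, value and closing line
p && /<\/td>/    { p = 0 }   # closing tag printed: stop
' html_demo/*.html > file.txt

cat file.txt
```

Passing all files to a single awk call with a shell glob writes every block into one file.txt, in the order the shell expands the file names, so no loop or append redirection is needed.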