Advanced sed/awk help

I have thousands of HTML files that look like this:

....
....
....
    <!-- table horaire -->             <!-- table horaire -->
        <table border="0" cellspacing="0" cellpadding="0" class="tblHoraires" summary="Table des horaires de la ligne 12">
<tr>
<th scope="row" class="horaireColFill_even">05h</th>
<td class="horaireColFill_even">22</td>
<td class="horaireColFill_even">38</td>
<td class="horaireColFill_even">52</td>
<td class="horaireColEmpty_even">�</td><td class="horaireColEmpty_even">�</td><td class="horaireColEmpty_even">�</td><td class="horaireColEmpty_even">�</td></tr>
<tr>
<th scope="row" class="horaireColFill_odd">06h</th>

<td class="horaireColFill_odd">06</td>
<td class="horaireColFill_odd">19</td>
<td class="horaireColFill_odd">32</td>
<td class="horaireColFill_odd">44</td>
<td class="horaireColFill_odd">55</td>
<td class="horaireColEmpty_odd">�</td><td class="horaireColEmpty_odd">�</td></tr>
<tr>
<th scope="row" class="horaireColFill_even">07h</th>
<td class="horaireColFill_even">06</td>
<td class="horaireColFill_even">16</td>

<td class="horaireColFill_even">26</td>
<td class="horaireColFill_even">36</td>
<td class="horaireColFill_even">47</td>
<td class="horaireColFill_even">58</td>
<td class="horaireColEmpty_even">�</td></tr>
<tr>
</table>
.....
.....
.....

I would like to extract data from all of them so that the output looks like the following:

Filename1#05h22#05h38#05h52#06h06#06h19#...etc....#00h49
Filename2#05h20#05h48#05h55#06h16#06h39#...etc....#00h19
etc
etc 

Each value is the text from a <th> (the hour) followed by the text of each corresponding <td> (the minutes).
Would that be possible using sed/awk?
Thanks.

Where are "filename1", "filename2", etc. located in the input file?

Something along these lines:
nawk -f char.awk myFiles*
char.awk:

BEGIN {
# field separator: either > or <
  FS="[<>]"
}
# first line of the current file? print the FILENAME of the current file,
# preceded by ORS (a newline) for every file except the first
FNR==1 {printf("%s%s", (NR==1)?"":ORS, FILENAME)}

# second field contains the "th scope=" pattern? save the value of the third field (the hour) in the variable "h"
$2 ~ "th scope=" { h=$3;next}

# second field matches "td class=.*ColFill"? print "#", followed by "h", followed by the third field (the minutes)
$2 ~ "td class=.*ColFill" { printf "#%s%s", h, $3}

END {
# print ORS (end of line) to terminate the last output line
  printf "%s", ORS
}
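
To see why the script tests $2 and prints $3, here is a minimal sketch that splits one sample line on "<" or ">" and lists the resulting fields. The variable names line and fields are only for illustration and are not part of the script above.

```shell
# One <td> line taken from the sample table
line='<td class="horaireColFill_even">22</td>'

# Split on either "<" or ">" and print every field with its index
fields=$(printf '%s\n' "$line" |
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

Because the line starts with "<", the first field is empty; the tag with its attributes lands in $2 and the cell text in $3.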

How about Perl?
parsehtml.pl

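#!/usr/bin/perl
use strict;
use warnings;

for my $file (@ARGV) {
    print $file;
    open(my $fh, '<', $file) or die "FAIL - $file: $!\n";
    my $th = '';
    while (my $line = <$fh>) {
        # remember the hour from the row header
        if ($line =~ /^<th[^>]*>([^<]+)<\/th>/) { $th = $1; }
        # print hour + minutes for filled cells only (skip the ColEmpty padding cells)
        if ($line =~ /^<td class="[^"]*ColFill[^"]*">([^<]+)<\/td>/) { print "#$th$1"; }
    }
    print "\n";
    close($fh);
}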

Invocation

perl parsehtml.pl myfiles_*.html

Thank you for your support.

@vgersh99:
The awk script works perfectly fine; however, it does not display the filename.
Would you kindly comment the code or explain in detail what it does, so I can make a few more changes to it?

@pravin27:
The Perl script does not return anything; it only prints a blank line. I believe it would probably be a simple adjustment, but I am not familiar with Perl at all, so I am not sure how to do it.

Commented the code.
How do you call the script? As suggested?
What OS are you on? If on Solaris, use nawk or /usr/xpg4/bin/awk (instead of old/plain/broken awk).

Thank you for the comments.
I am actually using Linux (Kubuntu). I have awk installed; I will try to get nawk and try with it. I am running it from bash.

That's fine - you don't need 'nawk' - you can use 'awk' on Linux.
How do you execute the code for all input files?
Please post the exact execution sequence/script!

I am typing:

awk -f awk.txt Folder/* > result.txt

Anyway, I tried on Cygwin and it seems to work fine.

I just added this line:

 /var point = new / { printf "#" $0  } 

to grab another piece of info. Would it be possible to print only what is between the parentheses on this line? It's not very important, as I can remove the extras later with grep, but it would be nice to have everything in one command :)

Post a sample data file - your previous sample doesn't contain 'var point = new'.

var point = new GLatLng(46.20004142,6.168357236);

I would like to get the numbers for the Lat/Long

/var point = new / { split($0,"[()]",a);printf "#" a[2]  }

I get this error when I run it:

$ awk -f ../awk.awk.txt ligne12_aller_ermt
 ligne12_aller_ermtawk: ../awk.awk.txt:4: (FILENAME=ligne12_aller_ermt FNR=55) fatal: split: second argument is not an array

Sorry, I had the arguments to split() in the wrong order. It is split(string, array, separator):

/var point = new / { split($0,a,"[()]");printf "#" a[2]  }
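
As a quick check of the corrected rule, here is a small sketch that runs it against the sample line posted above. The variable names line and coords are only for illustration, and a trailing newline is added to the printf so the result can be captured on its own.

```shell
# The sample line containing the coordinates
line='var point = new GLatLng(46.20004142,6.168357236);'

# split() breaks the line on "(" or ")"; a[2] is the text between the parentheses
coords=$(printf '%s\n' "$line" |
  awk '/var point = new / { split($0, a, "[()]"); printf "#%s\n", a[2] }')

printf '%s\n' "$coords"
```

split() takes its own separator as the third argument, so this rule works unchanged inside the main script even though FS is set to "[<>]" there.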