Convert columns to row using awk

Hi

I need to convert some columns form a html file to rows.
I do manage to make it works without help (some proud :slight_smile: )
For some reason the offline status is not in bold, so I do need to remove the <b> tag from the other field to make this to work. All fields are not needed, so I test and select only one that I need.

Is there a more simple way to do this or can this code be cleaned some?

infile

				ohter blablabla
                        <TH TITLE="Different services during last 60s">CASC USERS</TH>
                        <TH colspan="3" class="centered">Action</TH>
				</TR>
                <TR class="online">
                        <TD class="usercol1"><SPAN TITLE="">master</SPAN></TD>
                        <TD class="usercol2"><b>online</b></TD>
                        <TD class="usercol3">148.31.202.211</TD>
                        <TD class="usercol4">40</TD>
                        <TD class="usercol5">4380</TD>
                        <TD class="usercol6">55</TD>
                        </TR>						
                <TR class="offline">
                        <TD class="usercol1"><SPAN TITLE="">madrid</SPAN></TD>
                        <TD class="usercol2">offline</TD>
                        <TD class="usercol3"></TD>
                        <TD class="usercol4">0</TD>						
                        <TD class="usercol5">120</TD>						
                        <TD class="usercol6">0</TD>
                        </TR>
                <TR class="connected">
                        <TD class="usercol1"><SPAN TITLE="">london</SPAN></TD>
                        <TD class="usercol2"><b>connected</b></TD>
                        <TD class="usercol3">10.10.10.41</TD>
                        <TD class="usercol4">34</TD>
                        <TD class="usercol5">632</TD>
                        <TD class="usercol6">430</TD>
                        </TR>
				</TABLE><BR>
				more .....

script

sed  's/<b>//g' infile | awk -F"[<>]" '{if ($0~"TR class=") {a=1}};
{if ($0~"/TR" && a==1) {a=0; print ""}};
{if ($0~"col1\"") printf "%s",$5};
{if ($0~"col2\"") printf "%s%s",",",$3};
{if ($0~"col3\"") printf "%s%s",",",$3};
{if ($0~"col5\"") printf "%s%s",",",$3}'

output

master,online,148.31.202.211,4380
madrid,offline,,120
london,connected,10.10.10.41,632

Note that awk and sed are not the right tools for parsing html.
If html2text or lynx are not available, I would use Perl, Python or Ruby.

% cat infile.html
                        <TH TITLE="Different services during last 60s">CASC USERS</TH>
                        <TH colspan="3" class="centered">Action</TH>
                                </TR>
                <TR class="online">
                        <TD class="usercol1"><SPAN TITLE="">master</SPAN></TD>
                        <TD class="usercol2"><b>online</b></TD>
                        <TD class="usercol3">148.31.202.211</TD>
                        <TD class="usercol4">40</TD>
                        <TD class="usercol5">4380</TD>
                        <TD class="usercol6">55</TD>
                        </TR>
                <TR class="offline">
                        <TD class="usercol1"><SPAN TITLE="">madrid</SPAN></TD>
                        <TD class="usercol2">offline</TD>
                        <TD class="usercol3"></TD>
                        <TD class="usercol4">0</TD>
                        <TD class="usercol5">120</TD>
                        <TD class="usercol6">0</TD>
                        </TR>
                <TR class="connected">
                        <TD class="usercol1"><SPAN TITLE="">london</SPAN></TD>
                        <TD class="usercol2"><b>connected</b></TD>
                        <TD class="usercol3">10.10.10.41</TD>
                        <TD class="usercol4">34</TD>
                        <TD class="usercol5">632</TD>
                        <TD class="usercol6">430</TD>
                        </TR>
                                </TABLE><BR>
% lynx -dump infile.html
    CASC USERS Action
   master online 148.31.202.211 40 4380 55
   madrid offline 0 120 0
   london connected 10.10.10.41 34 632 430

Thank you.
I know its not the best, but awk is what I know and its included in nearly all system and it works :slight_smile:

OK,
just as an exercise:

awk 'END {
  print rec
  }
  /col[1-3,5]/ { buildrec() }
  /<TR class="[^"]*"> *$/ && length(rec) {
    print rec
    rec = x
    }
func buildrec() {
  if (match($0, />[^<]*<\//))
    rec = length(rec) ? rec OFS substr($0, RSTART + 1, RLENGTH - 3) : \
      substr($0, RSTART + 1, RLENGTH - 3) 
  }' OFS=, infile
2 Likes

Works fine, thanks :slight_smile:
It will take me some time to understand how this works...

I'll try to explain the script.

func buildrec() {
  if (match($0, />[^<]*<\//))
    rec = length(rec) ? rec OFS substr($0, RSTART + 1, RLENGTH - 3) : \
      substr($0, RSTART + 1, RLENGTH - 3) 

buildrec is a user defined function that I used to avoid to repeat the same code for every match.
The function doesn't require parameters as it directly modifies
the global variable rec.
The function code performs the following actions:

  • search for the pattern: an > followd by 0 or more occurrences of characters different than <, followed by the closing tag sequence </,
    using the following regular expression: >[^<]*<\/
  • when match is found, the value is appended to the variable rec (short for record). RSTART and RLENGTH are automatically set by the match builtin function

After that, the code is simple:

  /col[1-3,5]/ { buildrec() }
  /<TR class="[^"]*"> *$/ && length(rec) {
    print rec
    rec = x
    }

When the records match the pattern represented by the regular expression col[1-3,5] , build the record - append the values.
When the pattern <TR class="[^"]*"> *$ matches for a second time (rec is already build) - print the record and reset it: rec = x .
x is an uninitialized variable, so I'm using it as a shortcut for "".

1 Like