AWK - printing certain fields when field order changes in data file

eric4 · April 15, 2008, 2:32pm

I'm hoping someone can help me on this. I have a data file that greatly simplified might look like this:

sec;src;dst;proto
421;10.10.10.1;10.10.10.2;tcp
426;10.10.10.3;10.10.10.4;udp
442;10.10.10.5;10.10.10.6;tcp
sec;src;fac;dst;proto
521;10.10.10.1;ab;10.10.10.2;tcp
525;10.10.10.5;ac;10.10.10.6;tcp
522;10.10.10.3;ab;10.10.10.4;udp
535;10.10.10.5;ac;10.10.10.6;tcp
...

A header line appears periodically throughout the file, and the lines beneath it are the actual data, with fields in the order that header gives. Sometimes the order of the fields changes, or different fields are used, but every such change is marked by a new header line. The new header lines can be anywhere in the data file. Fields might be next to each other at one point in the file, then separated in a later part of the file.

I'd like to be able to produce a simplified output of just a few of the fields in the data file. For instance, I'd like to extract the src, dst, and proto from the example above. src is normally field 2, but dst is field 3 and then changes to field 4. My desired output would look something like this:

src;dst;proto
10.10.10.1;10.10.10.2;tcp
10.10.10.3;10.10.10.4;udp
10.10.10.5;10.10.10.6;tcp
10.10.10.1;10.10.10.2;tcp
10.10.10.5;10.10.10.6;tcp
10.10.10.3;10.10.10.4;udp
10.10.10.5;10.10.10.6;tcp

I've worked with AWK quite a bit and know how to work with field numbers, if/then, etc, but I can't figure out how to change a field number to a new value as directed by the header line.

Can anyone help me? I'd sure appreciate any advice. Is AWK the right tool to do this with?

Citricut · April 15, 2008, 2:56pm

Hi eric, I think AWK is the right tool for you. Try testing some fields (or the entire line) to detect the headers, so you know that the following lines use that layout until a new header comes. At least the header names are static, aren't they?

Good luck!

eric4 · April 15, 2008, 4:28pm

I think I figured it out:

awk '
BEGIN { FS = ";" }
# Header line: record which column holds each field of interest
/;src;/ {
  for (num = 1; num <= NF; num++) {
    if ($num == "src")   fieldsrc   = num
    if ($num == "dst")   fielddst   = num
    if ($num == "proto") fieldproto = num
  }
  next
}
# Data line: print the chosen fields using the current column numbers
{ print $fieldsrc ";" $fielddst ";" $fieldproto }
'

seems to work for this example:

10.10.10.1;10.10.10.2;tcp
10.10.10.3;10.10.10.4;udp
10.10.10.5;10.10.10.6;tcp
10.10.10.1;10.10.10.2;tcp
10.10.10.5;10.10.10.6;tcp
10.10.10.3;10.10.10.4;udp
10.10.10.5;10.10.10.6;tcp

Now if I can figure it out for the real world data ...

vgersh99 · April 15, 2008, 6:48pm

# default fields: 'src;dst;proto'
nawk -f eric.awk myDataFile.txt

# custom field order: 'proto;sec;src'
nawk -v fields='proto;sec;src' -f eric.awk myDataFile.txt

eric.awk:

BEGIN {
  FS=OFS=";"

  if (fields=="") fields="src;dst;proto"

  n=split(fields, fieldsA, FS)

  # a header line contains "src" as a whole field
  PATheader="(^|;)src(;|$)"
}

FNR==1 { print fields }
$0 ~ PATheader {
   for(i=1; i<=NF; i++)
      header[$i]=i
   next
}

{
   for(i=1; i<=n; i++)
     printf("%s%s", $header[fieldsA[i]], (i==n) ? ORS : OFS)
}
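
One thing to watch with real-world data: the header array above is never cleared, so if a later header drops a field, its old column number is still used, and a field that has never been seen makes $header[...] evaluate to $0 (the whole line). Below is a rough sketch of one way to handle that, not a tested drop-in replacement. The wrapper name extract_fields and the array names want and col are made up for illustration, and it assumes a header is any line that has the literal name "src" as one of its fields. It resets the column map at each header and prints an empty value for a field the current header does not have.

```shell
# Hypothetical wrapper: reads the data on stdin, takes the field list as $1.
extract_fields() {
  awk -v fields="${1:-src;dst;proto}" '
    BEGIN {
      FS = OFS = ";"
      n = split(fields, want, FS)   # want[1..n] = requested field names
      print fields                  # output header
    }
    # Assumption: a header line has the literal name "src" as a whole field.
    {
      is_header = 0
      for (i = 1; i <= NF; i++)
        if ($i == "src") is_header = 1
    }
    is_header {
      split("", col)                # portable way to empty the array
      for (i = 1; i <= NF; i++)
        col[$i] = i                 # field name -> column number
      next
    }
    {
      for (i = 1; i <= n; i++) {
        # Empty string when the current header lacks the requested field.
        val = (want[i] in col) ? $(col[want[i]]) : ""
        printf "%s%s", val, (i == n ? ORS : OFS)
      }
    }
  '
}
```

It would be called as extract_fields < myDataFile.txt for the default fields, or extract_fields 'src;fac' < myDataFile.txt to pick others. With 'src;fac', lines under a header that has no fac column come out with an empty second field.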