Parsing a mixed format (flatfile+xml) logfile

I am trying to parse a file that looks like the below:

There are thousands of lines like the above and the file is expected to run into hundreds of thousands.

The issue I have is the mixed format of the file. If it were just an XML file, I would use xmllint to parse it. As it is, I am using nawk to parse the XML and convert it into a flatfile, piping that into sed to remove unwanted characters, and then piping the result into nawk again to parse the flatfile. A sample of the code is below:

nawk -F'(<)|(>)' '{print $1 "\t" $2 "\n" $8 "\t" $14 .....  $60 }' testfile.log | sed -e s/event_n//g  ......... -e 's/[()]//g' -e s/-/RESULT/g | nawk -f present.awk > output

The awk file present.awk is as follows:

BEGIN {
        FS="[ |:|@||\t]";
}
{
        print "Date & time : " $1, $2":"$3":"$4;
        print "Login : " $6"@"$7;
        print "Ops : " $9;
        print "Mod : " $10;
        print "Det: ";
        print $11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24;
}
END {
        print NR,"Records Processed";
}

The output file looks like the below:

I have the following queries/concerns regarding my work:

  1. In present.awk, if you look at the final print statement (print $11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24) you will see that the output is not neat, as this particular output is separated by tabs. I find that there are multiple tabs between the fields, and sometimes nothing is printed. So my question is this: when using a tab as a field separator, can I ignore multiple tabs and treat them as one tab? How do I do this in the FS assignment?

  2. Going forward, my plan is to extract data daily from the file for the previous day. My plan is to get the previous day's date, grep the log file for that date, and then pipe the result into the above code. Is there a better way to do this and avoid the grep and pipe?

  3. I have refrained from using pipes as much as possible to keep the run time down, but I couldn't avoid the pipes above. Is there any way I can do the parsing without using any pipes? I have this nagging feeling that there should be a better way to do it without going through the painstaking work of finding which fields correspond to the data I need and then printing those particular fields from awk.

  4. Once I get the above output file, is there anything I can do to convert it into a format that would be easily readable from Windows? I would like to add some logos and page breaks to the file. Is this possible?

I will be grateful if you can take some time to help me with my predicament.

Yes, but put the + outside the bracket expression so a run of separators counts as one: FS="[ :@\t]+". Inside [ ], characters like | and + are literal, so your original FS was also splitting on pipe characters.
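
A quick check on throwaway input (not your real data) shows runs of tabs collapsing into a single separator:

printf 'a\t\t\tb\tc\n' | nawk 'BEGIN{FS="[ :@\t]+"} {print NF, $1, $2, $3}'
3 a b c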

Usually you can avoid piping a grep result into awk by using awk's own condition filtering, as in /string to capture/{ ... awk processing ... }.
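
For example (the date here is made up), the grep folds into an awk condition:

nawk -v d="2012/09/10" 'index($0, d) { print }' testfile.log

Putting the same index($0, d) condition in front of the main block of present.awk would make it process only the previous day's records. index() is used here so the slashes in the date don't need escaping.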

Most probably.

I am not a Windows guy, but I would imagine this could be done.


I'm pretty sure Windows can read text.

I have no idea how you get UPDATE on that data.

First I should note that setting FS to a regex like this needs a modern awk (GNU awk, nawk, or /usr/xpg4/bin/awk); old awks such as Solaris /usr/bin/awk can't do that.

For a really complicated line like this, you can change FS on the fly and re-split a line by assigning to $0. You could do this with arrays and split(), but nesting that too deeply gets ugly.
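
Here is the trick in isolation, on throwaway input:

printf 'a:b,c:d\n' | nawk -F, '{ first=$1; FS=":"; $0=first; print $1, $2 }'

The line is first split on commas; reassigning $0 re-splits the saved piece with the new FS, so this prints a b. Note that any fields you still need must be saved into variables before the reassignment.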

First I split on () to extract the XML data, then I split on < to separate the tags from each other. I extract the string data in a loop and cram it into an array.

Then I split on whitespace, dashes, and colons while cramming all the data that wasn't processed before into $0.

Lastly I set FS back to [()] to get ready for the next line.

Not a complete solution since it's not clear where all your data is coming from, but should be enough for you to fill in the blanks:

BEGIN {         OLDFS=FS="[()]" }

{
        for(X in XML) delete XML[X];    # Clear tags saved from the previous line
        # Save some bits, and re-split line using <
        A=$1;   B=$3;   FS="<"; $0=$2
        for(N=1; N<=NF; N++)  # Process "tagname>data" strings only.
        {
                if($N == "")                    continue;
                if(substr($N,1,1) == "/")       continue; # Ignore close-tags
                if(split($N, ARR, ">") == 2)    XML[ARR[1]]=ARR[2];
        }

        # XML["event_n"] would be "blah" for example.
        for(X in XML) print X, XML[X];

        # Split on whitespace, dashes, and colons, and process the rest.
        FS="[ \r\n\t:-]+";      $0=A" "B
        # ...now available in $1 ... $N.
        print $1, $2, $3, $4, $5, $6, $7, $8
        FS=OLDFS        # So the next line splits on  ()
}
$ awk -f xml.awk datafile

column username
new_val
old_val blabla
event_n blah
time 1347270053954
2012/09/10 12 18 18 username@192.168.1.1 OPERATION user succeeded

$

Windows can read the file, but the formatting is usually lost.
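
If the problem is just line endings rather than layout: older Windows editors expect CRLF, and converting is a one-liner (output being the result file from above):

nawk 'BEGIN{ORS="\r\n"} {print}' output > output.txt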

Is there a command like tput that can be used for formatting printed output, as opposed to terminal output?


Thank you Corona. I will try this.

I am trying to do a sed replace for specific occurrences of a character. Is there any way to do this in one sed and avoid repetition?

For example, I am doing the below to replace the first 4 occurrences of ~:

sed -e 's/~/|/1' -e 's/~/|/1' -e 's/~/|/1' -e 's/~/|/1' 

If you're slapping a sed onto the end of an awk, you probably could've just done it in awk. There are no ~'s in your data, though.

I can't make something which works for your data if you don't post a representative sample. I can try, but it's a game of blind man's bluff.

BEGIN {         OLDFS=FS="[()]" }

{
        for(X in XML) delete XML[X];    # Clear tags saved from the previous line
        for(N=1; N<=4; N++) sub(/~/, "|"); 
        # Save some bits, and re-split line using <
        A=$1;   B=$3;   FS="<"; $0=$2
        for(N=1; N<=NF; N++)  # Process "tagname>data" strings only.
        {
                if($N == "")                    continue;
                if(substr($N,1,1) == "/")       continue; # Ignore close-tags
                if(split($N, ARR, ">") == 2)    XML[ARR[1]]=ARR[2];
        }

        # XML["event_n"] would be "blah" for example.
        for(X in XML) print X, XML[X];

        # Split on whitespace, dashes, and colons, and process the rest.
        FS="[ \r\n\t:-]+";      $0=A" "B
        # ...now available in $1 ... $N.
        print $1, $2, $3, $4, $5, $6, $7, $8
        FS=OLDFS        # So the next line splits on  ()
}

It was more of a general question and not specific to my example. I was playing around with the script to make the output better, and it occurred to me to replace the blanks with ~ and then use that as the FS. Then I got to wondering whether I could use a regexp for the substitution range in the sed command.
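
(For reference: there is no single s command for "the first N occurrences". sed's numeric flag picks out one occurrence, and GNU sed lets you combine it with g for "that occurrence and onward":

printf 'a~b~c~d~e\n' | sed 's/~/|/3'      # a~b~c|d~e  -- 3rd occurrence only
printf 'a~b~c~d~e\n' | sed 's/~/|/3g'     # a~b~c|d|e  -- 3rd and later, GNU sed only

So replacing the first N still takes N substitutions in sed, or a sub() loop in awk as above.)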

Hi Corona,

I decided to give this a try as it is much more elegant than what I am doing.

On executing the script, i am getting the below error:

bash-3.00$ awk -f vtest.awk datafile > 1234
awk: XML is not an array
 record number 1

The contents of vtest.awk is:

BEGIN {         OLDFS=FS="[()]" }
{
        for(X in XML) delete XML[X];    # Clear tags saved from the previous line
#        for(N=1; N<=4; N++) sub(/~/, "|");
        # Save some bits, and re-split line using <
        A=$1;   B=$3;   FS="<"; $0=$2
        for(N=1; N<=NF; N++)  # Process "tagname>data" strings only.
        {
                if($N == "")                    continue;
                if(substr($N,1,1) == "/")       continue; # Ignore close-tags
                if(split($N, ARR, ">") == 2)    XML[ARR[1]]=ARR[2];
        }
        for(X in XML) print X, XML[X];
        # Split on whitespace, dashes, and colons, and process the rest.
        FS="[ \r\n\t:-]+";      $0=A" "B
        # ...now available in $1 ... $N.
        print $1, $2, $3, $4, $5, $6, $7, $8
        FS=OLDFS        # So the next line splits on  ()
}
END {
        print NR, "Records Processed";
}

Can you please help me figure out what I am doing wrong?

Sorry, I've been away at a conference and hadn't had time to catch up on these things.

Please post some of the data you ran this with.

Hi Corona,

No worries. It's great that you are helping us out here.

The issue was that I was using the wrong awk. When I used /usr/xpg4/bin/awk, I got the below error:

/usr/xpg4/bin/awk: line 22 (NR=2431): Record too long (LIMIT: 19999 bytes)

The issue is that my original file can get quite big in certain cases. The data is actually read from a DB, and in some cases a context is written into the file. This means that there are multiple XML tags (stored as escaped characters) within our standard XML tags, as below. This makes some lines massive, thus overflowing awk's record limit.

<config><timeout>60</timeout><enable_timeout>true</enable_timeout><overview>false</overview><

...and if you'd posted actual data when asked I could have even warned you about that.

You will need GNU awk to handle lines that huge.
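
Assuming gawk is available on your box, the same script should run unchanged, since gawk has no fixed record-length limit:

gawk -f vtest.awk datafile > output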

Of course, Corona. I'm afraid I didn't have the liberty to post the actual data. That said, I should've taken the time to post a better example so as not to waste your time. I will make sure to do so in the future.