I am trying to parse a file that looks like the following:
There are thousands of lines like the above, and the file is expected to grow to hundreds of thousands of lines.
The issue I have is the mixed format of the file. If it were a pure XML file, I would use xmllint to parse it. Instead, I am using nawk to parse the XML and convert it into a flat file, piping that into sed to remove unwanted characters, and then piping the result into nawk again to parse the flat file. Sample code is below:
nawk -F'(<)|(>)' '{print $1 "\t" $2 "\n" $8 "\t" $14 ..... $60 }' testfile.log | sed -e s/event_n//g ......... -e 's/[()]//g' -e s/-/RESULT/g | nawk -f present.awk > output
The awk file present.awk is as follows:
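BEGIN {
    # Inside a bracket expression "|" is a literal character, not alternation,
    # so this splits on a space, "|", ":", "@" or a tab.
    FS = "[ |:|@||\t]";
}
{
    print "Date & time : " $1, $2 ":" $3 ":" $4;
    print "Login : " $6 "@" $7;
    print "Ops : " $9;
    print "Mod : " $10;
    print "Det: ";
    print $11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24;
}
END {
    print NR, "Records Processed";
}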
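As an aside on the separator used in present.awk: a bracket expression matches exactly one character, so two adjacent delimiters produce an empty field between them, while appending `+` makes a run of delimiters count as one. The following is only a small sketch on a made-up line; the sample text and the variable names are assumptions, and plain `awk` is used in place of `nawk`:

```shell
# Hypothetical record: three tabs between "alpha" and "beta", one before "gamma".
line=$(printf 'alpha\t\t\tbeta\tgamma')

# One-character separator: every tab ends a field, so empty fields appear.
single=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[\t]" } { print NF }')

# Separator with "+": a run of tabs is treated as a single separator.
collapsed=$(printf '%s\n' "$line" |
  awk 'BEGIN { FS = "[\t]+" } { print NF ": " $1 "," $2 "," $3 }')

echo "$single"       # 5
echo "$collapsed"    # 3: alpha,beta,gamma
```

Note that with this kind of separator a line that starts with a tab would still yield an empty first field.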
The output file looks like this:
I have the following queries/concerns regarding my work:
- In the file present.awk, if you look at the final print statement (print $11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24), you will see that the output is not neat, because this part of the record is separated by tabs. There are often multiple tabs between the fields, and sometimes nothing is printed. So my question is this: when using a tab as a field separator, can I ignore multiple tabs and treat them as a single tab? How do I do this in the FS statement?
- Going forward, I plan to extract the previous day's data from the file every day. My plan is to get the previous day's date, grep the log file for that date, and then pipe the result into the above code. Is there a better way to do this that avoids the grep and the pipe?
- I have avoided pipes as much as possible to reduce the running time, but I could not avoid the pipes above. Is there any way I can do this parsing without using any pipes? I have a nagging feeling that there should be a better way to do the parsing than the painstaking work of finding which fields correspond to the data I need and then printing each of those fields from awk.
- Once I have the above output file, is there anything I can do to convert it into a format that is easily readable on Windows? I would like to add some logos and page breaks to the file. Is this possible?
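For the second and third points, one possibility is to let a single awk process do both the date selection and the character clean-up. What follows is a minimal sketch under stated assumptions, not a drop-in replacement: the log lines, the date format, and the `day` variable are invented for illustration, only the three substitutions visible in the sed command above are reproduced, and plain `awk` stands in for `nawk`.

```shell
# Hypothetical log lines; the real file format may differ.
log=$(printf '%s\n' \
  '13/03/2012 event_nlogin (ok) -' \
  '14/03/2012 event_nlogout (fail) -')

# Assumed to hold the previous day's date, computed elsewhere.
day='13/03/2012'

cleaned=$(printf '%s\n' "$log" | awk -v day="$day" '
  index($0, day) {         # keep only lines containing the date, like grep -F
    gsub(/event_n/, "")    # same effect as sed s/event_n//g
    gsub(/[()]/, "")       # same effect as sed s/[()]//g
    gsub(/-/, "RESULT")    # same effect as sed s/-/RESULT/g
    print
  }')

echo "$cleaned"            # 13/03/2012 login ok RESULT
```

The output of such a command could still be fed to present.awk, or the print logic could be moved into the same script so that no pipe is needed at all.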
I would be grateful if you could take some time to help me with my predicament.