Multiple input field separators in awk

kinksville · April 25, 2008, 1:01pm

I saw a couple of posts here referencing how to handle more than one input field separator in awk. I figured I would share how I (just!) figured out how to turn this line in a logfile:

90000000000000000000010001 name D0.90000000000103787900010001QF840840916070000007085814Y216254@D1111111111111111=1107xxxxxxxxxxxxxxxx919MENCHIES

into this format:

90000000000000000000010001,name,840840916070000007085814Y216254,1111111111111111,1107,919MENCHIES

I have an entire script, since this is just one step in a process of turning logs into useful information, but here's the relevant portion.

#Author: kinksville
#Date: April 24, 2008
#Revised: April 24, 2008
#Revision: Revision 1.00
#Other files: cclookup.s, cclookup.rep
#Changelog:
#April 24, 2008: Initial creation of the script.
#
#End changelog.

BEGIN {
    FS = "[ \. QF \@D = x]+"
    OFS = ","
}
#First iteration of the @D search, stripping out the . character and inserting a OFS.
/\@D/ {                             #Search for any line containing the string @D
    report2 = "cclookup.rep2";      #Define report2 variable.
    report = "cclookup.rep";        #Define report variable.
    num_cclookup++;                 #Get number of auth requests.
    print $1, $2, $5, $6, $7, $8 > report;
    print $0 > report2;
}                                   #End of the @D search.

The key is the fact that awk will accept a regular expression as the field separator. This regexp, FS="[ \. QF \@D = x]+", matches spaces, the ., the string QF, the string @D, the =, and the character x. The + after the closing bracket is what makes it work, since it allows for 1 or more instances of any of the characters matched by the regexp.

That means that x and xxxxxx are both treated as a single field separator.
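
As a quick check of the splitting described above, here is a minimal sketch that runs the sample line through the same character set and prints every field with its index. It assumes the backslashes in the original FS are not needed inside a bracket expression, so they are dropped here; the shell variable names are made up for illustration.

```shell
# Sample log line copied from the post above.
line='90000000000000000000010001 name D0.90000000000103787900010001QF840840916070000007085814Y216254@D1111111111111111=1107xxxxxxxxxxxxxxxx919MENCHIES'

# Split on runs of any of the listed characters and print "index=value" per field.
fields=$(printf '%s\n' "$line" | awk '
BEGIN { FS = "[ .QF@D=x]+" }    # same character set as the post, minus the backslashes
{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')

printf '%s\n' "$fields"
```

The line splits into eight fields. Fields 3 and 4 are the pieces around the "." in D0.9000..., which is why the script prints $1, $2 and then skips ahead to $5 through $8.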

I still need to work on the output, since now I need to trim the name off the end of the last field. Unfortunately the number in the last field can range anywhere from 9999999 to 1 and that is the part that I want to preserve. Maybe a [^0-9]+ expression?

aigles · April 25, 2008, 1:48pm

Are you sure that your FS definition is valid for your requirement?
You don't define "@D" and "QF" as separators.
The characters @, D, Q and F are each defined as separators.

The valid syntax is:

FS  = "([[:space:]]|\\.|QF|=|x)+";

To get the last field without its leading digits:

last_field=$NF
sub(/^[0-9]*/, "", last_field);

Jean-Pierre.

kinksville · April 25, 2008, 1:54pm

I was a little confused by the fact that QF and @D were working too. I think it's because [QF]+ matches QQ, QQQ, QF, QQFF, etc.

It's not as clean as I might like but those characters are always at that particular place in the logged message, so it does what I want it to.

I'll sub in your expression and see what happens too.

kinksville · April 25, 2008, 2:19pm

Neither of those snippets worked correctly for me. The FS syntax that you used probably changed the number of fields and so they didn't all get printed out.
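
One possible explanation for the changed field count, sketched under assumptions: the suggested alternation lists QF but not @D, so the two values on either side of @D stay glued together in one field. The comparison below uses a literal space in place of [[:space:]] to stay portable across awk implementations, and the variant with @D added is an illustration, not something posted in the thread.

```shell
line='90000000000000000000010001 name D0.90000000000103787900010001QF840840916070000007085814Y216254@D1111111111111111=1107xxxxxxxxxxxxxxxx919MENCHIES'

# Alternation as suggested (no @D): report the field count and field 5.
without_at=$(printf '%s\n' "$line" | awk '
BEGIN { FS = "( |\\.|QF|=|x)+" }
{ print NF ":" $5 }')

# Hypothetical variant with @D added to the alternation.
with_at=$(printf '%s\n' "$line" | awk '
BEGIN { FS = "( |\\.|QF|@D|=|x)+"; OFS = "," }
{ print $1, $2, $5, $6, $7, $8 }')

printf '%s\n%s\n' "$without_at" "$with_at"
```

Without @D the line has seven fields and $5 still contains the @D and the card number; with @D added there are eight fields again, and $1, $2, $5, $6, $7, $8 line up with the desired output.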

The second snippet just seemed to add the number 1 to the last field, i.e. (,619MENCHIES1).
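
A stray 1 like that is what you would see if the return value of sub() were printed, although the thread does not show how the snippet was actually used, so this is only a guess. sub() changes its third argument in place and returns the number of substitutions it made. The sketch below shows both the suggested pattern, which keeps the name, and a hypothetical trailing pattern, which keeps the number; the variable names are made up for illustration.

```shell
# sub() edits its third argument in place and returns how many substitutions it made (0 or 1).
result=$(echo '919MENCHIES' | awk '{
    name = $1
    amount = $1
    n1 = sub(/^[0-9]*/, "", name)       # drop the leading digits, keeping the name
    n2 = sub(/[^0-9]+$/, "", amount)    # drop the trailing non-digits, keeping the number
    print n1, name, n2, amount
}')

printf '%s\n' "$result"
```

Both calls return 1; the useful results are in name and amount, not in the return values.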

I'll play with it some more and see what happens.

kinksville · April 25, 2008, 5:12pm

#This script scans the appropriate log file and copies lines containing authorization requests to the output.
#All output is comma separated.
#Author:        kinksville
#Date:          April 24, 2008
#Revised:       April 25, 2008
#Revision:      Revision 1.01
#Other files:   cclookup.s, cclookup.rep
#Changelog:
#April 24, 2008: Initial creation of the script.
#April 25, 2008: Updated the regex for the input FS to match multiple characters.
#
#End changelog.

BEGIN {
#Input field separators will match any run of the following characters: blank space, ., Q, F, @, D, =, x.
#The + on the outside of the brackets will allow it to match 1 or more of these characters in any combination.
#%  Any comments with the % sign are temporarily there for testing purposes.
FS="[ \. QF \@D = x]+"
#Output field separator is defined as a comma.
OFS = ","
}
#@D search, stripping out the field separator characters and inserting a OFS.
/\@D/                {                                                          #Search for any line containing the string @D
                        last_field=$8 ;
                        sub(/[^0-9].*$/,"",last_field );                        #Cut everything from the first non-digit onward.
                        dollar_val=last_field/100 ;
                        report="cclookup.rep";                                  #Define report variable.
                        num_cclookup++;                                         #Get number of auth requests.
                        field1=$1 ;
                        field2=$2 ;
                        field3=$5 ;
                        field4=$6 ;
                        field5=$7 ;
                        printf ("%s,%s,%s,%s,%s,$%-.2f\n",field1,field2,field3,field4,field5,dollar_val) > report
                        #print $1, $2, $5, $6, $7, $8 > report;                 #Print fields 1-2 with the OFS between them to report.
                        }                                                       #End of the @D search.

It's a bit of a kludge but it works. I couldn't seem to get the last_field variable to print out no matter what I did using the plain print command, which is why I eventually went with printf instead. That also allowed me to output the results in a decimal format since those numbers before the MENCHIES were dollar amounts.
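
For anyone reusing this, a minimal sketch of why the division gives a sensible dollar amount: when awk uses a string such as 919MENCHIES in arithmetic, it takes the leading numeric prefix and ignores the rest. The explicit variant, which cuts the name off first, is an illustration with made-up variable names and not part of the script above.

```shell
amounts=$(echo '919MENCHIES' | awk '{
    implicit = $1 / 100             # the numeric prefix 919 is used; the name is ignored
    digits = $1
    sub(/[^0-9].*$/, "", digits)    # explicit: cut from the first non-digit onward
    printf "$%.2f $%.2f\n", implicit, digits / 100
}')

printf '%s\n' "$amounts"
```

Both routes print $9.19 for this input. The explicit version makes the intent visible and leaves a clean digits-only string that can also be printed as-is.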