awk script: need help

bhupeshchavan · July 13, 2016, 5:00pm

Hi Team,

I need an awk script for reporting purposes. Below is a sample of the log file:

transid=01
name=admin
time=06.58.51
message=test
eof
transid=02
name=account
time=14.58.51
message=live
eof
transid=03
name=bhu3
time=07.58.51
message=testing
eof

Requirement:
We need to find the transactions between 0 and 12, and the expected output should be:

 transid=03, name=bhu3, time=07.58.51, message=testing

I tried the code below:

awk 'BEGIN{FS="\n";RS="eof";trn=$1;nm=$2;tme=$3;msg=$4;} if($3=="^[0-1][0-9]") {print trn","nm","tme","msg}' log

but it is not working; I am getting a parse error. I tried several different combinations, but none of them worked.

Please help.

Thank you in advance.

Thanks and regards,
Bhupesh

Don_Cragun · July 13, 2016, 6:01pm

What operating system are you using? According to the standards, the record separator is a single character (not a string like eof), although some versions of awk accept an extended regular expression instead of just a single character.

It is nice of you to tell us that you are getting a parsing error. But, it would be a lot more informative if you showed us the diagnostic messages awk was producing (in CODE tags) instead of just saying there is an error.

Don't transid=01 and transid=02 also indicate transactions between 0 and 12? Why isn't the desired output the following?

transid=01,name=admin,time=06.58.51,message=test
transid=02,name=account,time=14.58.51,message=live
transid=03,name=bhu3,time=07.58.51,message=testing

rdrtx1 · July 13, 2016, 7:38pm

awk -F= '
$1 == "transid" {tid=$2}
# On every eof line: print the record if its id is in range, then start a new record.
$1 == "eof" {if (tid >= 0 && tid <= 12) {sub(" *, *$", "", l); print l}; l=""; next}
{l=l $0 ", "}
' log

bhupeshchavan · July 14, 2016, 5:10pm

Hi Don,

Please find the details required:

[bhupesh@RHL9 bhupesh]$ ls -ltr `which awk`
lrwxrwxrwx    1 root     root           14 Oct  9  2011 /usr/bin/awk -> ../../bin/gawk
[bhupesh@RHL9 bhupesh]$ uname -a
Linux RHL9.0 2.4.20-31.9 #1 Tue Apr 13 18:04:23 EDT 2004 i686 i686 i386 GNU/Linux
[bhupesh@RHL9 bhupesh]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
[bhupesh@RHL9 bhupesh]$
[bhupesh@RHL9 bhupesh]$ awk 'BEGIN{FS="\n";RS="eof";trn=$1;nm=$2;tme=$3;msg=$4;} if($3=="^[0-1][0-9]") {print trn","nm","tme","msg}' log
awk: cmd. line:1: BEGIN{FS="\n";RS="eof";trn=$1;nm=$2;tme=$3;msg=$4;} if($3=="^[0-1][0-9]") {print trn","nm","tme","msg}
awk: cmd. line:1:                                                     ^ parse error
[bhupesh@RHL9 bhupesh]$

Don't transid=01 and transid=02 also indicate transactions between 0 and 12? Why isn't the desired output?:
--> To keep the script from considering transid=01 and transid=02, I am trying to use newline as FS and eof as RS, and then I am using "if($3=="^[0-1][0-9]")" so that awk checks only the 3rd field, where the pattern is "time=[0-1][0-9]". I apologize that I did not put time=[0-1][0-2] in the first place.

The output should be generated based on the value in the third field, i.e. time=07.58.51. In this field only the hour should be considered, which is seven here.

Enclosed is the snap of the same.

I made one more attempt at it, but the result is the same. Code:

awk 'BEGIN{FS="\n";RS="eof";trn=$1;nm=$2;tme=$3;msg=$4;} if($3=="time=[0-1][0-9]*") {print trn","nm","tme","msg}' log

Apologies for not explaining it in a better way.

Thank you in advance.

RudiC · July 15, 2016, 3:44am

Hmmm - where to start whining?

"transaction between 0 to 12" doesn't mean 0 <= transid <= 12 but the hour of the entry's time? Why, then, is entry 1 missing in your output?

In your text, you specify time=[0-1][0-2], which matches 00, 01, 02, 10, 11, 12. In your script sample, you write time=[0-1][0-9], which matches 00 - 19. Neither may be what you really want.

The parse error occurs because an if statement is not allowed in the pattern part of an awk command. Either drop the if and use the bare condition as the pattern of a pattern {action} pair, or move the if inside the curly braces.

don't attach pictures - data in there can't be copied and analysed but only visually interpreted.
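
A minimal sketch of the two valid forms mentioned above. The two-line sample input and the shell variable names `as_pattern` and `as_if` are made up for illustration, and the test is applied to whole lines rather than to multi-line records so that the sketch does not depend on multi-character RS support.

```shell
# Two sample lines stand in for the log; only the second has an hour in 00-09.
input='time=14.58.51
time=07.58.51'

# Form 1: the condition is the pattern, with no "if" keyword in front of it.
as_pattern=$(printf '%s\n' "$input" | awk '$0 ~ /^time=0[0-9]/ { print }')

# Form 2: the "if" statement sits inside the braces of an action.
as_if=$(printf '%s\n' "$input" | awk '{ if ($0 ~ /^time=0[0-9]/) print }')

printf '%s\n%s\n' "$as_pattern" "$as_if"
```

Both commands select the same line; which form to use is mostly a matter of taste.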

Don_Cragun · July 15, 2016, 4:03am

Expanding on what RudiC has already said...

Your first post said absolutely nothing about time of day having any effect on the transactions that should be displayed; it said the only requirement was that the transaction ID should be between 00 and 12.

I know that many people like to write awk one-liners, but laying a script out over several lines lets you see the structure of the code and easily spot an invalid condition in an awk

condition { action }

pair. An if statement is not a valid condition.

But, even if we changed:

if($3=="time=[0-1][0-9]*") {print trn","nm","tme","msg}

to:

$3=="time=[0-1][0-9]*" {print trn","nm","tme","msg}

this is a literal string match; not a regular expression match. And, if we changed it to a regular expression match:

$3~"time=[0-1][0-9]*" {print trn","nm","tme","msg}

that would match transactions that had a time value starting with any value in the range 00 through 19, inclusive.

And, you have another logic error, because $1, $2, $3, and $4 do not have any defined value before the 1st input line has been read (and the BEGIN clause in your awk script is executed before the 1st line of input is read).

If the transaction ID doesn't matter and the only selection criteria for printing records is that the time value starts with 07 , you might want something more like:

awk '
BEGIN {	FS = "\n"
	OFS = ","
	RS = "eof"
}
$3 ~ "=07" {
	print $1, $2, $3, $4
}' log

although this is untested (since the awk on my system does not support multi-character record separators) and it isn't obvious to me whether the <newline> following the eof record separator produces an empty 1st field in records after the 1st record.
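
Two of the points above can be checked in isolation with a small sketch. The one-line sample and the shell variable names (`literal`, `regex`, `in_begin`) are assumptions made for illustration only.

```shell
line='time=07.58.51'

# "==" compares against the literal text time=0[0-9], which never occurs in the data.
literal=$(printf '%s\n' "$line" | awk '$0 == "time=0[0-9]" { print "match" }')

# "~" treats the right-hand side as a regular expression.
regex=$(printf '%s\n' "$line" | awk '$0 ~ "^time=0[0-9]" { print "match" }')

# Inside BEGIN no input has been read yet, so $1 is still empty.
in_begin=$(awk 'BEGIN { print "[" $1 "]" }' </dev/null)

printf 'literal=[%s] regex=[%s] begin=%s\n' "$literal" "$regex" "$in_begin"
```

Only the regular-expression form produces a match, and the BEGIN block prints an empty pair of brackets.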

bhupeshchavan · July 15, 2016, 6:31am

Hi Don,

Your comments gave me a hint about where I was going wrong. To remove the confusion about RS and to check whether the "record separator produces an empty 1st field in records after the 1st record", I first changed the file to a different format.

Test file 1 : filename is log2:

transid=01,name=bhu,time=06.58.51,message=testeoftransid=2,name=account,time=14.58.51,message=liveeoftransid=3,name=bhu3,time=07.58.51,message=testingeof

I tried your code with some changes:

awk '
BEGIN {    FS = "\n"
    OFS = ","
    RS = "eof"
}
#$3 ~ "=07" { --> commented out this part since i wanted to check if the formatting is perfect  or not.
{trn=$1;nm=$2;tme=$3;msg=$4;   #declared it in action pattern.
print trn "," nm "," tme "," msg
}' log2 

and

awk '
BEGIN {FS = "\n"
OFS = ","
RS = "eof"
}
{
print $1, $2, $3, $4
}' log2

output:

transid=01,name=bhu,time=06.58.51,message=test,,,,
transid=2,name=chu,time=14.58.51,message=test,,,,
transid=3,name=bhu3,time=07.58.51,message=test,,,,
,,,

This worked as expected, but the area of concern in this output is the commas after the last field and the last output line, which consists only of commas.

If I ignore the variable part and execute the code below:

awk '
BEGIN {    FS = "\n"
    OFS = ","
    RS = "eof"
}
 {
    print $0
}' log2

output:

transid=01,name=bhu,time=06.58.51,message=test
transid=2,name=account,time=14.58.51,message=live
transid=3,name=bhu3,time=07.58.51,message=testing

There are two blank lines at the end in this output.

Then I went one step further and replaced eof with nothing, using the command below:

sed 's/^eof//g' log > log3

and executed the below code :

awk '
BEGIN {FS = "\n"
OFS = ","
RS = ""
}
 {trn=$1;nm=$2;tme=$3;msg=$4;
print $1,$2,$3,$4
}' log3

output :

transid=01,name=admin,time=06.58.51,message=test
transid=02,name=account,time=14.58.51,message=live
transid=03,name=bhu3,time=07.58.51,message=testing

Output is fine now.

The only thing left is evaluating the time frame in the third field (transactions in the time frame 00 to 12), but before doing this I tried $3 ~ "=07" and it worked:

awk '
BEGIN {FS = "\n"
OFS = ","
RS = ""
}
$3 ~ "=07" {
 trn=$1;nm=$2;tme=$3;msg=$4;
print $1,$2,$3,$4
}' log3

output:

transid=03,name=bhu3,time=07.58.51,message=testing

I tried different patterns but was not able to write a "between" condition to get the transactions between 00 and 12. Please help me out with this.

Hi RudiC,

"transaction between 0 to 12" doesn't mean 0 <= transid <= 12 but the hour of the entry's time? Why, then, is entry 1 missing in your output?
--Yes, the first entry was missing.

in your text, you specify time=[0-1][0-2] which evaluates to 00, 01, 02, 10, 11, 12. In your script sample, you write time=[0-1][0-9] which evaluates to 00 - 19, neither of which may be what you really want.

--> This is my mistake, apologies. I am looking for the transactions that fall in the time frame 00-12, and the time is in the 3rd field.

the parser error is due to the if in the pattern part of the awk command is illegal. Either use without if , which is possible in an awk pattern {action} pair, or put it within the curly braces.
--> Please give me an example so that I can understand better. I am not good at awk; I am still learning and exploring.

I have tried to achieve the output, but I am stuck on matching the time range.

Thank you in advance

Regards,
Bhupesh

RudiC · July 15, 2016, 8:05am

Try

awk '
                        {OUT[$1] = $2
                        }
/eof/ && 
OUT["time"] <= 12       {delete OUT["eof"]
                         DL = ""
                         for (o in OUT) {printf "%s%s=%s", DL, o,  OUT[o]
                                         DL = ","
                                        }
                         printf RS
                        }
' FS="=" OFS="," file
name=admin,transid=01,message=test,time=06.58.51
name=bhu3,transid=03,message=testing,time=07.58.51

bhupeshchavan · July 15, 2016, 11:47am

Hi RudiC,

I tried your code but the output is not as expected.

[bhupesh@RHL9 capgemini]$ awk '
                        {OUT[$1] = $2
                        }
/eof/ &&
OUT["time"] <= 12       {delete OUT["eof"]
                         DL = ""
                         for (o in OUT) {printf "%s%s=%s", DL, o,  OUT[o]
                                         DL = ","
                                        }
                         printf RS
                        }
' FS="=" OFS="," log
name=admin,id=01,time=06.58.51,message=test
transid=03,name=bhu3,id=01,time=07.58.51,message=testing

Please suggest on the same.

Thanks.

Regards,
Bhupesh

chill3chee · July 15, 2016, 2:28pm

Hi Bhupesh,
For the comparison, can you please check whether the following does meet your need (untested)

awk '
BEGIN {FS = "\n"
OFS = ","
RS = ""
}
 {trn=$1;nm=$2;tme=$3;msg=$4;
split($3,a,"=");
split(a[2],b,".");
if ((b[1] + 0) >=0 && (b[1] + 0) <= 12)
{ 
print $1,$2,$3,$4
}}' log3

bhupeshchavan · July 15, 2016, 3:10pm

Hi Chill3chee,

It worked, the output is as expected.

My only doubt is about if ((b[1] + 0) >=0 && (b[1] + 0) <= 12). I presume you use b[1] + 0 to ensure that the value is numeric, since adding 0 also turns any non-numeric string into 0. Please correct me if I am wrong.

I tried this and it worked :

awk '
BEGIN {FS = "\n"
OFS = ","
RS = "eof\n"
}
 {trn=$1;nm=$2;tme=$3;msg=$4;
split($3,a,"=");
split(a[2],b,".");
if (b[1] <= 12)
{ 
print $1,$2,$3,$4
}}' log

output :

transid=01,name=admin,time=06.58.51,message=test
transid=03,name=bhu3,time=07.58.51,message=testing

So now I don't have to replace eof with an empty string; RS = "eof\n" worked for me.

Thank you very much Don,RudiC and Chill3chee for your help.

Cheers !!!!

Regards,
Bhupesh

Don_Cragun · July 15, 2016, 3:34pm

That could be simplified a little bit to:

awk '
BEGIN {	FS = "\n"
	OFS = ","
	RS = "eof\n"
}
{	split($3, a, /[=.]/)
	if((a[2] + 0) <= 12)
		print $1, $2, $3, $4
}' log

The reason chill3chee added 0 before doing the comparison is because the default type for a field created as the result of a split is string; not number. If you were looking for times before 9am, using:

	if(a[2] <= 9)

you'd get the wrong results sometimes because in a string comparison, "10" is less than "9". But with:

	if((a[2] + 0) <= 9)

you force a numeric comparison and get the results you want.
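
To see what the split() call in the simplified script produces, here is a small sketch. The shell variable name `parts` and the hard-coded sample string are assumptions for illustration.

```shell
parts=$(awk 'BEGIN {
	# Split on either "=" or "." so that the hour lands in a[2].
	n = split("time=07.58.51", a, /[=.]/)
	# Print the piece count, the key, the raw hour text, and the hour as a number.
	print n, a[1], a[2], a[2] + 0
}')
printf '%s\n' "$parts"
```

The string splits into four pieces, and a[2] + 0 turns the text 07 into the number 7. As a hedged side note: POSIX lists array elements created by split() among the values that are treated as numeric strings when they look like numbers, so a conforming awk may already compare a[2] with a numeric constant numerically. Adding 0 is harmless either way and makes the intent explicit.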

RudiC · July 15, 2016, 3:36pm

Try

awk '
/eof/   {if (HR <= 12)   print TMP
         TMP = DL = ""
         next
        }
        {TMP = TMP DL $0
         DL  = ","
        }
/^time/ {HR = substr ($0, 6, 2)
        }
' file
transid=01,name=admin,time=06.58.51,message=test
transid=03,name=bhu3,time=07.58.51,message=testing

bhupeshchavan · July 15, 2016, 3:59pm

Thank you Don.

Hi RudiC,

Your code is working, but I do not understand it except for the next statement, which tells awk to read the next line, and the substr line, where you generate the value for HR. The comparison appears earlier in the script and the HR variable is set later, which is confusing for me. Please explain the code. Thanks

Don_Cragun · July 15, 2016, 5:28pm

Hi bhupeshchavan,
Here is a copy of RudiC's code with comments added.

# Note that this script reads the file a line at a time (not a record at a time
# with a line containing "eof" as the record terminator).  This will work with
# any standards-conforming version of awk; while the other code you're using
# with RS set to "eof\n" only works on versions of awk that support
# multi-character record separator values (which is not required by the
# standards).
awk '
/eof/   {# For each input line that contains the string "eof"...
	 #	if the HR saved for this group is <= 12, print this record.
	 if (HR <= 12)   print TMP

	 # Clear the hold area for the current record and reset the delimiter to
	 # an empty string.
         TMP = DL = ""

	 # Read the next input line and skip the remaining steps in this script
	 # for this line.
         next
        }
        {# For every line in the input file (other than the "eof" lines we have
	 # discarded), add the current delimiter and the current input line to
	 # the saved text containing the current record and set the delimiter to
	 # be used when adding later lines to this record to a comma.
	 TMP = TMP DL $0
         DL  = ","
        }
/^time/ {# For lines that start with the string "time", save two characters
	 # starting with the sixth character on this line in the variable HR.
	 # (I.e. set HR to the two digit hour from the "time" line.)
	 HR = substr ($0, 6, 2)
        }
' log	# End the awk script and specify that the file named "log" is to be
	# read as input.

Does this clear up your confusion with RudiC's suggestion?

Again, note that this works fine as long as you're looking for times with an end-of-range greater than or equal to 10. If at some point in the future you might want to test for a range like 2am to 9am, you would need to be sure that you're performing numeric comparisons instead of string comparisons. If this becomes an issue for you, the easy way to do that with the above code is to change:

         HR = substr ($0, 6, 2)

to:

         HR = substr ($0, 6, 2) + 0
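
The difference between the two assignments can be seen in a small sketch. The sample time of 10.15.00 and the variable names `hr`, `s`, `n` and `cmp` are made up for illustration; the sketch relies on substr() returning a string.

```shell
cmp=$(awk 'BEGIN {
	hr = substr("time=10.15.00", 6, 2)   # hr holds the string "10"
	s = (hr <= 9)        # string comparison: "10" sorts before "9", so this is 1
	n = (hr + 0 <= 9)    # adding 0 makes it numeric: 10 <= 9 is false, so this is 0
	print "string:" s, "numeric:" n
}')
printf '%s\n' "$cmp"
```

This prints string:1 numeric:0, so without the + 0 a 10 o'clock entry would wrongly pass a "before 9am" test.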

bhupeshchavan · July 16, 2016, 6:53am

Hi Don,

Thank you very much for the explanation.

I got it. Just for my understanding (setting up the values first and then evaluating them), I tried the code below and the output is fine. Please check whether this is a correct way to write it, or whether I should follow RudiC's ordering.

awk '
/^time/ {HR = substr ($0, 6, 2)
        }
/eof/   {if (HR <= 12)   print TMP
         TMP = DL = ""
         next
        }
        {TMP = TMP DL $0
         DL  = ","
        }
' log

I believe that TMP keeps accumulating the content of the other lines until eof is found. But I am still not sure about the statements TMP = TMP DL $0 and DL = ",".
If TMP = TMP already means all the values up to eof, why do we need DL and $0 after it? What do they add to TMP?

Can we just use TMP = TMP and set OFS = "," at the beginning? I tried the code below but got no output.

awk '
 BEGIN{OFS=","}
 /eof/   {if (HR <= 12)   print TMP
          TMP =""
          next
         }
         {TMP = TMP
         }
 /^time/ {HR = substr ($0, 6, 2)
         }
 ' log

Please help.

Thank you.

RudiC · July 16, 2016, 8:09am

awk works in a way that all pattern {action} pairs are executed on the current line until they are exhausted or otherwise terminated. Then the next line is read and the execution is repeated. So the sequence of operations doesn't matter as long as you don't want to influence the flow within a line. That's why the position of the time check is irrelevant. In awk, string concatenation is done by just listing the strings, like str1 str2 str3, so TMP = TMP DL $0 assigns to TMP the old value of TMP plus DL plus the current line. On the first occurrence, TMP and DL are empty strings, so the string starts with $0 only.
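
The concatenation can be watched line by line with a short sketch. The three sample lines are taken from the log above; the `print NR` trace and the shell variable name `trace` are additions for illustration and are not part of the original script.

```shell
trace=$(printf 'transid=01\nname=admin\ntime=06.58.51\n' | awk '
{
	TMP = TMP DL $0     # old TMP, then the delimiter, then the current line
	DL = ","            # empty for the first line, a comma from then on
	print NR ": " TMP   # show what has been collected so far
}')
printf '%s\n' "$trace"
```

Each line of the trace is one field longer than the one before it, and no comma appears before the first field because DL is still empty when the first line is added.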

bhupeshchavan · July 16, 2016, 3:14pm

Hi RudiC,
I seem to get the logic now. My understanding is below:

                           TMP = TMP DL $0

1st line TMP = "" "" transid=01 --> since $0 is this value.
DL=","
2nd line TMP = transid=01,name=admin (the comma is there because DL was set to a comma by the earlier statement)

TMP value now --> transid=01,name=admin

3rd line TMP = transid=01,name=admin, $0-->time=06.58.51 HR value is also set in this line

TMP value now --> transid=01,name=admin,time=06.58.51

4th line TMP --> transid=01,name=admin,time=06.58.51,message=test

Then comes the eof line, where HR is evaluated and print TMP is executed.

Is this right?

Thanks.

Don_Cragun · July 16, 2016, 3:56pm

Hi bhupeshchavan,
Yes. You've got it.