awk Behavior

sumguy · August 3, 2016, 11:46am

Linux Release

Uname details

Data file

I've been at the command line for some time, as far back as SCO and Interactive Unix, and I have always used this construct without issues. I want to isolate the IP (field 1). As you can see, the first line is "skipped".

This works as expected. But again, what's changed?

Thanks!

Yoda · August 3, 2016, 11:54am

A BEGIN rule is executed only once, before the first input record is read. This is why the code below works as expected:

awk 'BEGIN { FS = "," };{print $1}' dafile

But in this code, FS is set only when the first input record is read:-

awk '{FS="," }{print $1}' dafile
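To make the difference concrete, here is a minimal sketch that runs both forms against a throwaway file. Only the first data line (10.10.10.10,house) appears in this thread; the remaining second fields are invented for illustration. In gawk and mawk the first record has already been split on whitespace by the time the main rule assigns FS, so the whole first line comes back as $1. Some older awks behave differently, so the first line of main_out should be treated as implementation-dependent.

```shell
# Sample data: only the first line is taken from the thread; the others are made up.
dafile=$(mktemp)
printf '%s\n' '10.10.10.10,house' '10.10.10.11,garage' \
              '10.10.10.12,office' '10.10.10.13,shed' > "$dafile"

# FS assigned in BEGIN: in effect before record 1 is read and split.
begin_out=$(awk 'BEGIN { FS = "," } { print $1 }' "$dafile")

# FS assigned in the main rule: record 1 has already been read,
# so the new separator is only guaranteed to apply from record 2 on.
main_out=$(awk '{ FS = "," } { print $1 }' "$dafile")

printf 'BEGIN rule:\n%s\n\nmain rule:\n%s\n' "$begin_out" "$main_out"
rm -f "$dafile"
```

From the second line onward both commands should print the same addresses.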

Don_Cragun · August 3, 2016, 2:23pm

yoda:

A BEGIN rule is executed only once, before the first input record is read. This is why the code below works as expected:
awk 'BEGIN { FS = "," };{print $1}' dafile
But in this code, FS is set only when the first input record is read:-
awk '{FS="," }{print $1}' dafile

I would change the statement shown above in red to:

Other ways to make sure that the FS you want is used to split every input line include:

awk -F',' '{print $1}' dafile
awk '{print $1}' FS=',' dafile
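As a quick check, the sketch below compares the -F option, the FS=',' operand, and a BEGIN assignment on the same throwaway file (the data beyond the first line is invented for illustration). It also prints FS from inside a BEGIN rule when FS is given as an operand, because operand assignments are processed only when awk reaches them in the argument list, which is after BEGIN has run.

```shell
# Sample data: only the first line is taken from the thread; the others are made up.
dafile=$(mktemp)
printf '%s\n' '10.10.10.10,house' '10.10.10.11,garage' \
              '10.10.10.12,office' '10.10.10.13,shed' > "$dafile"

opt_out=$(awk -F',' '{ print $1 }' "$dafile")            # -F option
operand_out=$(awk '{ print $1 }' FS=',' "$dafile")       # assignment operand before the file
begin_out=$(awk 'BEGIN { FS = "," } { print $1 }' "$dafile")

# An operand assignment is not yet visible inside BEGIN: FS is still the default blank.
begin_fs=$(awk 'BEGIN { printf "[%s]\n", FS }' FS=',')

if [ "$opt_out" = "$operand_out" ] && [ "$opt_out" = "$begin_out" ]; then
    echo "all three forms print the same first field"
fi
echo "FS seen by BEGIN with an operand assignment: $begin_fs"
rm -f "$dafile"
```

The practical consequence is that the operand form is fine for splitting input lines, but a BEGIN rule that depends on FS needs -F or an assignment inside BEGIN itself.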

Yoda · August 4, 2016, 1:16pm

Thank you, Don. I checked the gawk source in field.c, which holds the routines for dealing with fields and record parsing.

So the first record is parsed with the default field separator, and the new field separator is then used to parse subsequent records.

I also noticed that the function set_NF is called before record parsing, so gawk's behavior for this variable is different.

awk -F, '{NF=1}{print $NF}' dafile
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13

Any idea why the developers didn't do the same with the function set_FS?

MadeInGermany · August 4, 2016, 1:28pm

Old awk and nawk appear inconsistent:

nawk '{print $1; FS=","; print $1}' dafile
10.10.10.10,house
10.10.10.10,house
10.10.10.11
10.10.10.11
10.10.10.12
10.10.10.12
10.10.10.13
10.10.10.13

nawk '{FS=","; print $1}' dafile
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13

It looks like they have a "late field splitting" that occurs when a field is referenced the first time.

bakunin · August 4, 2016, 2:17pm

Even though this discussion about awk intrinsics is fascinating and my horizon was expanded (a collective "thank you" to you all in this thread), just for the record:

Wouldn't using shell means (variable expansion or field splitting) be less costly than calling an external program? I suppose the original poster does something with the values once they are split, something along the lines of:

awk -F',' '{print $1}' datafile | while read IP ; do ..... done

In such a case it might be easier to do:

while IFS=, read IP junk ; do ..... done < datafile

or, depending on what else is done:

while read LINE ; do
     IP="${LINE%%,*}"
     .....
done < datafile
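Both shell-only variants can be tried with the sketch below. The file contents beyond the first line are invented, and printf stands in for whatever the loop body would really do with each address. The -r option on read is an addition here; it keeps backslashes in the input from being interpreted.

```shell
# Sample data: only the first line is taken from the thread; the others are made up.
datafile=$(mktemp)
printf '%s\n' '10.10.10.10,house' '10.10.10.11,garage' \
              '10.10.10.12,office' '10.10.10.13,shed' > "$datafile"

# Variant 1: let read split on commas; everything after the first field lands in junk.
ips_read=$(while IFS=, read -r IP junk; do
    printf '%s\n' "$IP"
done < "$datafile")

# Variant 2: read the whole line, then strip from the first comma to the end.
ips_expand=$(while read -r LINE; do
    IP="${LINE%%,*}"
    printf '%s\n' "$IP"
done < "$datafile")

printf 'read with IFS:\n%s\n\nparameter expansion:\n%s\n' "$ips_read" "$ips_expand"
rm -f "$datafile"
```

Neither loop starts an external process per line, which is the saving suggested above.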

bakunin

Don_Cragun · August 4, 2016, 3:41pm

yoda:

Thank you, Don. I checked the gawk source in field.c, which holds the routines for dealing with fields and record parsing.

So the first record is parsed with the default field separator, and the new field separator is then used to parse subsequent records.

I also noticed that the function set_NF is called before record parsing, so gawk's behavior for this variable is different.
awk -F, '{NF=1}{print $NF}' dafile
10.10.10.10
10.10.10.11
10.10.10.12
10.10.10.13
Any idea why the developers didn't do the same with the function set_FS?

I have not looked at the gawk code (and for legal reasons choose not to do so). But one might guess that a function named set_NF() would set the value of the awk NF variable. Are you really telling me that gawk sets the value of NF for a new input record BEFORE parsing that record into fields??? That makes absolutely no sense to me! How can it set NF before it parses a record into fields to determine what value should be assigned to NF? One might expect that a function like that would be called to parse an input line or AFTER parsing an input line, depending on the context. In the context of reading a new record from an input file at the start of a new cycle and in the context of using the awk command:

getline

with no argument naming a variable to be assigned and with no input redirection, that should happen (as well as setting $x (for 0 <= x <= NF), NR, and FNR). In the context of reading a new record from an input file using the awk command:

getline variable

with a variable, but no input redirection, NR and FNR should be updated, but NF and the current record's fields should not be modified. In the context of reading a new record from an input file using the awk command:

getline variable < file
        or
command | getline variable

with a variable and with input redirection, none of the variables NF, NR, FNR, nor the current record's fields should change.
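The getline cases described above can be observed with a small sketch like the following. The two temporary files, the variable names other and line, and the helper function show() are all made up for this illustration; the output shown by the script is what an awk following the rules above should report.

```shell
# Two throwaway input files; their contents are invented for this illustration.
data=$(mktemp)
other=$(mktemp)
printf '%s\n' 'a b c' 'd e' 'f' > "$data"
printf '%s\n' 'x y z w' > "$other"

getline_demo=$(awk -v other="$other" '
function show(label) {
    printf "%s: NR=%d FNR=%d NF=%d $0=%s\n", label, NR, FNR, NF, $0
}
NR == 1 {
    show("start")
    getline line < other     # sets only the variable line
    show("getline var < file")
    getline line             # sets line and bumps NR and FNR; $0 and NF stay put
    show("getline var")
    getline                  # replaces $0 and the fields; updates NF, NR and FNR
    show("getline")
}' "$data")

printf '%s\n' "$getline_demo"
rm -f "$data" "$other"
```

The command | getline variable form is left out to keep the sketch short; per the description above it should behave like the redirected-file case.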