Tip: alternative for NR==FNR in awk

Example:

$ cat file1
2
3
$ cat file2
1
2
3
4
5
6

The following awk script works like a charm. NR==FNR is true only while file1 is read, so the remainder runs for file2:

awk '
NR==FNR {A[$1]; next}
($1 in A)
' file1 file2
2
3

Now make file1 empty:

>file1

and run the awk script again.
The result is empty as expected.
However, this time the NR==FNR action ran for file2! No records were read from file1, so NR and FNR stay equal throughout file2.
Check with:

awk '
NR==FNR {A[$1]; print FILENAME,$1; next}
($1 in A)
' file1 file2

Here the result was correct, but only by luck.
In some other cases this can lead to misbehavior.
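
To make that concrete, here is a minimal sketch of one such case. The scratch directory from mktemp -d and the negated test !($1 in A) are assumptions added for illustration. The intent is to print the lines of file2 that are not in file1, so with an empty file1 all six lines should appear, yet nothing is printed:

```shell
# Scratch directory so that no existing files are touched (assumes mktemp -d).
tmp=$(mktemp -d) || exit 1
: > "$tmp/file1"                          # empty first file
printf '%s\n' 1 2 3 4 5 6 > "$tmp/file2"

# Intent: print the lines of file2 that are NOT in file1.
out=$(awk '
NR==FNR {A[$1]; next}   # with an empty file1 this fires for every line of file2
!($1 in A)
' "$tmp/file1" "$tmp/file2")

# Nothing appears between the brackets, although six lines were intended.
printf 'output: [%s]\n' "$out"
```
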
The following fix is available:

awk '
FILENAME=="file1" {A[$1]; print FILENAME,$1; next}
($1 in A)
' file1 file2

But this is not always applicable, for example if the file names come from a wildcard such as file*.
So here is a better fix:

awk '
F==0 {A[$1]; print FILENAME,$1; next}
($1 in A) {print}
' file1 F=1 file2

Now F is unset (and therefore compares equal to 0) while file1 is read, and it is set to 1 before file2 is opened.
You can even continue like this: file1 F=1 file2 F=2 file3
so that the script can also distinguish between file2 and file3.
The ultra-short-code-hackers can even use !F .
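
A small sketch of the three-file variant, using !F for the first file. The file contents, the labels in the print statements and the scratch directory are made up for illustration. file1 is deliberately empty to show that file2 is no longer mistaken for it:

```shell
tmp=$(mktemp -d) || exit 1
: > "$tmp/file1"                    # first file is empty on purpose
printf 'b\n' > "$tmp/file2"
printf 'c\n' > "$tmp/file3"

out=$(awk '
!F   {print "first:", $1; next}     # F is still unset: reading file1
F==1 {print "second:", $1; next}    # F=1 was assigned before file2
     {print "third:", $1}           # F=2 was assigned before file3
' "$tmp/file1" F=1 "$tmp/file2" F=2 "$tmp/file3")
printf '%s\n' "$out"
```

Each assignment operand takes effect when awk reaches it in the argument list, that is, just before the next file is opened.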

Another option would be to just check the filename:

awk '
BEGIN {f=FILENAME}
FILENAME==f {A[$1]; f=FILENAME; next}
($1 in A)
' file1 file2

The ultra-short-code-hackers can even use:

awk '
FILENAME==f {A[$1]; f=FILENAME; next}
($1 in A)
' f=file1 file1 file2

Hmm, what is the f=FILENAME in the main loop for?
Also, in your first example, BEGIN {f=FILENAME} only works with nawk and the awks derived from it.

I get the impression that's why that feature exists, so you can process different files with their own default values of some sort.

Sure.
Most useful is FS, like file1 FS="," file2
Then one can as well test with FS!=","
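
A minimal sketch of that idea, assuming a blank-separated file1 and a comma-separated file2 (the contents and the scratch directory are invented for illustration):

```shell
tmp=$(mktemp -d) || exit 1
printf 'a b\n' > "$tmp/file1"       # blank-separated
printf 'c,d\n' > "$tmp/file2"       # comma-separated

out=$(awk '
FS != "," {print "file1:", $2; next}   # default FS while reading file1
          {print "file2:", $2}         # FS="," took effect before file2
' "$tmp/file1" FS=, "$tmp/file2")
printf '%s\n' "$out"
```

Here FS does double duty: it splits each file correctly and also tells the script which file it is in, so no extra flag variable is needed.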

You can also let awk directly examine the arguments given to it:

awk '
BEGIN {	printf("ARGV[0]=%s\n", ARGV[0])
	for(i = 1; i < ARGC; i++)
		if(ARGV[i] ~ /=/)
			printf("ARGV[%d]=%s: assignment\n", i, ARGV[i])
		else {	printf("ARGV[%d]=%s: file operand\n", i, ARGV[i])
			if(!f1)	f1 = ARGV[i]
		}
	print ""
}
FILENAME == f1 {
	# Process lines from 1st file here...
	printf("From 1st file(%s); %s\n", f1, $0)
	next
}
{	# Process remaining files here...
	printf("From subsequent file(%s): %s\n", FILENAME, $0)
}' FS=, empty_file OFS='|' file1 FS='|' file2

If empty_file is an empty file, file1 contains:

f1 line1
f1 line2

and file2 contains:

f2 line1
f2 line2

it produces the output:

ARGV[0]=awk
ARGV[1]=FS=,: assignment
ARGV[2]=empty_file: file operand
ARGV[3]=OFS=|: assignment
ARGV[4]=file1: file operand
ARGV[5]=FS=|: assignment
ARGV[6]=file2: file operand

From subsequent file(file1): f1 line1
From subsequent file(file1): f1 line2
From subsequent file(file2): f2 line1
From subsequent file(file2): f2 line2

and if the last line of the script is changed to:

}' FS=, OFS='|' file1 FS='|' file2

it produces the output:

ARGV[0]=awk
ARGV[1]=FS=,: assignment
ARGV[2]=OFS=|: assignment
ARGV[3]=file1: file operand
ARGV[4]=FS=|: assignment
ARGV[5]=file2: file operand

From 1st file(file1); f1 line1
From 1st file(file1); f1 line2
From subsequent file(file2): f2 line1
From subsequent file(file2): f2 line2

This might work on some systems, but the standards say that the value of FILENAME in a BEGIN clause is undefined.

In awk on OS X, FILENAME expands to an empty string (or 0 depending on context) in a BEGIN action.

Thanks Don, I thought that ARGV was GNU magic.
So my first example should be improved like this:

awk '
BEGIN {
for (i=1; i<ARGC; i++) if (ARGV[i]!~"=") {f1=ARGV[i]; break}
}
FILENAME==f1 {A[$1]; next}
($1 in A)
' file1 file2

And works nicely with shell wildcards like file* !
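
A quick way to try this with a wildcard and an empty first file. This is a sketch only: the scratch directory, the file contents and the labels are assumptions, and it relies on the shell expanding file* in sorted order and on no file name containing "=":

```shell
tmp=$(mktemp -d) || exit 1
: > "$tmp/file1"                    # empty first file
printf 'x\n' > "$tmp/file2"

# Run inside the scratch directory so that file* expands to file1 file2.
out=$(cd "$tmp" && awk '
BEGIN {
    # Remember the first operand that is not a var=value assignment.
    for (i = 1; i < ARGC; i++) if (ARGV[i] !~ "=") {f1 = ARGV[i]; break}
}
FILENAME == f1 {print "first file:", $1; next}
               {print "later file:", $1}
' file*)
printf '%s\n' "$out"
```

The line from file2 is reported as coming from a later file, even though file1 contributed no records.
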
BTW, most awk versions want if (... ~ "=") instead of if (... ~ /=/), even outside a {block}.
They have trouble parsing the characters ( ) = within / / but not within " ".

Thanks for the warning.

The standards say that the right hand operand of the ~ and !~ operators can always be a string containing an ERE or an ERE token (i.e., /ERE/ ). But, if there is an ambiguity as to whether a / is a division operator or part of an ERE token, awk is supposed to assume it is a division operator. In a simple if statement like this, there shouldn't be any ambiguity.