Strange behaviour of arrays in awk

ripat · September 11, 2013, 10:25am

Imagine 2 files f1 f2:

file1_l1_c1 code_to_find file1_l1_c3
file1_l2_c1 file1_code2 file1_l2_c3
file1_l3_c1 file1_code3 file1_l3_c3

file2_l1_c1 file2_l1_c2 code_to_find
file2_l2_c1 file2_l2_c2 file2_code5
file2_l3_c1 file2_l3_c2 file2_code3

Say we want to print the lines of f2 that have "code_to_find" as $3. I go the classical way with:

FNR == NR && /file1_l1/ {
	code[$2] = 1
	next
}

code[$3] {
	print
}

As expected the output is: file2_l1_c1 file2_l1_c2 code_to_find

Now, if I print the code[] array in the END block: END{ for (i in code) print "code[" i "]=" code[i] }

I would have expected that block to print only the one element that has a value, i.e. code[code_to_find]=1

But to my great surprise, it returns this:

code[file1_l3_c3]=
code[code_to_find]=1
code[file2_code3]=
code[file1_l2_c3]=
code[file2_code5]=

How come awk creates array elements with a NULL value, indexed by $3 from both files? Kind of weird to me.

rdrtx1 · September 11, 2013, 11:12am

The array was loaded in the FNR==NR block.

Try:

{ if (code[$3]) { print } else { delete code[$3] } }

or

END{ for (i in code) if (code[i]) print i, "code[" i "]=" code[i] }

disedorgue · September 11, 2013, 11:22am

(If I understand your question correctly:)
If the condition of your first block is false, your second block is evaluated. It is like this:

$ cat te_ak 
XX_1_YY
XX_2_YY
XX_3_YY
XX_4_YY
XX_5_YY
XX_6_YY
XX_7_YY
XX_8_YY
XX_9_YY
XX_10_YY
$ awk -F_ 'code[$1_$2]{ print }; END{ for (i in code) print "code[" i "]=" code[i] }' te_ak
code[XX9]=
code[XX1]=
code[XX2]=
code[XX3]=
code[XX4]=
code[XX5]=
code[XX10]=
code[XX6]=
code[XX7]=
code[XX8]=

Regards.

Scrutinizer · September 11, 2013, 12:13pm

Hi Ripat,

IMO the problem is in this section:

code[$3] {
	print
}

This is not only a condition; it also creates an array element code[$3] with an empty value.

If you use:

$3 in code {
  print
}

Then it should work as expected...
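
For anyone who wants to check this, here is a small sketch that rebuilds the two sample files from the first post and runs the program with the "in" test. The scratch directory and the variable names tmp and result are only my own choices for the illustration, and I am assuming a POSIX shell and awk.

```shell
# Recreate the two sample files from the first post in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' \
  'file1_l1_c1 code_to_find file1_l1_c3' \
  'file1_l2_c1 file1_code2 file1_l2_c3' \
  'file1_l3_c1 file1_code3 file1_l3_c3' > "$tmp/f1"
printf '%s\n' \
  'file2_l1_c1 file2_l1_c2 code_to_find' \
  'file2_l2_c1 file2_l2_c2 file2_code5' \
  'file2_l3_c1 file2_l3_c2 file2_code3' > "$tmp/f2"

# "in" only tests membership, so no element is created as a side effect.
result=$(awk '
FNR == NR && /file1_l1/ { code[$2] = 1; next }
$3 in code              { print }
END { for (i in code) print "code[" i "]=" code[i] }
' "$tmp/f1" "$tmp/f2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data this should print the matching line of f2 followed by the single element code[code_to_find]=1, because the membership test never touches the array.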

ripat · September 11, 2013, 12:46pm

Indeed, and that's exactly what I find weird. With code[$3] in the second block I was expecting awk to *evaluate* the value of code[$3], *not* to assign any value to it, not even NULL.

awk 'foo="bar"{print "block 1"} END{print foo}' f1

Returns bar.

foo="bar" assigns "bar" to foo and evaluates to TRUE. No problem with that. But the condition of the second block, code[$3], contains no assignment operator and it still assigns a value. I can't help finding it weird.

Furthermore, look at my code above and its output:

FNR == NR && /file1_l1/ {
	code[$2] = 1
	next
}

code[$3] {
	print
}

code[file1_l3_c3]=
code[code_to_find]=1
code[file2_code3]=
code[file1_l2_c3]=
code[file2_code5]=

The next instruction should make the program loop over the first file until it reaches its end, and only then continue with the second file, right? I understand that the condition of the second block creates an element while evaluating code[$3], but how come it creates elements with indexes from the first file, when by then NR is already in the second file? See my point?

Scrutinizer · September 11, 2013, 2:33pm

That is standard awk behaviour: arrays are not declared. If you refer to a non-existent array element, awk automatically creates it. It does not assign an empty value, but rather creates an uninitialized array element, which has an empty value. To test for the presence of an array element without creating it, you need the (index in array) expression.
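
A minimal sketch of the difference, not tied to the files in this thread. The array name seen and the helper function count are made up for the illustration; count loops over the array so that it does not depend on length(array) being available.

```shell
result=$(awk '
# Count elements with a loop, which works in any POSIX awk.
function count(arr,   k, n) { n = 0; for (k in arr) n++; return n }
BEGIN {
    seen["a"] = 1
    if ("b" in seen) print "unreachable"   # membership test: creates nothing
    print "after in-test: " count(seen)
    if (seen["c"]) print "unreachable"     # plain reference: creates seen["c"]
    print "after reference: " count(seen)
    print "c exists now: " (("c" in seen) ? "yes" : "no")
}')
printf '%s\n' "$result"
```

The count should stay at 1 after the "in" test and go to 2 after the plain reference, even though neither condition was true.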

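As for the second part: no, not exactly. Because of the first condition (FNR == NR && /file1_l1/), next is only executed for the lines of file1 that match /file1_l1/. The other lines of file1 fall through to the second block, where code[$3] creates the extra elements. Try: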

FNR==NR { 
  if (/file1_l1/) code[$2] = 1
  next
}
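
To see the effect, here is a sketch that runs this restructured first block against the sample files from the first post, keeping the original code[$3] condition and the END dump. The scratch directory, the variable names tmp and result, and the sort (added only because the order of for (i in code) is unspecified) are my own additions.

```shell
# Recreate the two sample files from the first post in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' \
  'file1_l1_c1 code_to_find file1_l1_c3' \
  'file1_l2_c1 file1_code2 file1_l2_c3' \
  'file1_l3_c1 file1_code3 file1_l3_c3' > "$tmp/f1"
printf '%s\n' \
  'file2_l1_c1 file2_l1_c2 code_to_find' \
  'file2_l2_c1 file2_l2_c2 file2_code5' \
  'file2_l3_c1 file2_l3_c2 file2_code3' > "$tmp/f2"

result=$(awk '
FNR == NR {                       # every line of the first file ends here
    if (/file1_l1/) code[$2] = 1
    next
}
code[$3] { print }                # still auto-creates code[$3], but only for f2
END { for (i in code) print "code[" i "]=" code[i] }
' "$tmp/f1" "$tmp/f2" | LC_ALL=C sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

The file1 indexes (file1_l2_c3, file1_l3_c3) should no longer show up, since no line of file1 reaches the second block. The empty elements for file2_code3 and file2_code5 are still created, because code[$3] is still a plain reference; replacing it with $3 in code avoids those as well.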