nawk is truncating output

sdosanjh · October 13, 2012, 8:46am

Legends,

I have two files, f1 and f2. When I use nawk to subtract the 4th column of f2 from the 4th column of f1, the output is truncated.
Can you please help me resolve this?

The subtraction is (4th col of f1 - 4th col of f2), but it prints only the lines below out of 116. I want to print every line of the file, whether or not there is a difference.

san:/tmp> wc -l f1 f2 | grep -v total
     116 f1
     116 f2

san:/tmp> head -3 f1 f2
==> f1 <==
TSCparser1 1irons1 EMEA_01 3
TSCparser12 1irons1 SPAIN_01 0
TSCparser13 1irons1 GERMANY_03 0

==> f2 <==
TSCparser1 1irons1 EMEA_01 3
TSCparser12 1irons1 SPAIN_01 0
TSCparser13 1irons1 GERMANY_03 0

san:/tmp> nawk 'FNR==NR{a[$1,$2,$3]=$4;next}{if(a[$1,$2,$3]){print $1,$2,$3,(a[$1,$2,$3]-$4)" times gapped in past 1 hr."}}' OFS="         " f1 f2
TSCparser1         1irons1         EMEA_01         0 times gapped in past 1 hr.
TSCparser94         1irons1         LSE_01         0 times gapped in past 1 hr.
TSCparser43         4irons1         STUTTGART_04         0 times gapped in past 1 hr.
TSCparser44         4irons1         STUTTGART_05         0 times gapped in past 1 hr.
TSCparser46         4irons1         STUTTGART_07         0 times gapped in past 1 hr.
TSCparser47         4irons1         STUTTGART_08         0 times gapped in past 1 hr.

pamu · October 13, 2012, 9:11am

Try this:

nawk 'FNR==NR{a[$1,$2,$3]=$4;next}{if(a[$1,$2,$3] != ""){print $1,$2,$3,(a[$1,$2,$3]-$4)" times gapped in past 1 hr."}}' OFS="\t" f1 f2

RudiC · October 13, 2012, 9:24am

The "error" is that in two of the three cases in your example, a[$1,$2,$3] exists but is equal to zero. awk treats zero as false, so if(a[$1,$2,$3]) skips the line even though the difference might be non-zero. Test it with $4 != 0 in f1. I'm not sure how to test for the mere existence of an element in awk, but I think pamu has shown you a way to correct your statement.
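The effect described above can be reproduced in isolation. This is a minimal sketch that assumes plain awk behaves like nawk here; the keys k0 and k3 are made up for illustration.

```shell
# Hypothetical two-line input: one key stored with 0, one with 3.
result=$(printf 'k0 0\nk3 3\n' | awk '
    { a[$1] = $2 }                         # store column 2 under the key in column 1
    END {
        s = a["k0"] ? "true" : "false"     # element exists, but 0 tests as false
        print "k0 " s
        s = a["k3"] ? "true" : "false"     # non-zero value tests as true
        print "k3 " s
    }')
printf '%s\n' "$result"
```

The k0 element exists, yet a truth test on its value behaves as if it were missing, which is why lines with a count of 0 in f1 were dropped.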

-------------------------- edit ---------------------------------

 Reading man pages is educational. From the mawk man page:

so,

($1,$2,$3) in a {print $1,$2,$3,(a[$1,$2,$3]-$4)" tim..."}

will do the job.
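Spelled out in full, the pattern form might look like the sketch below. It assumes plain awk in place of nawk and a temporary directory; f1 holds the three lines shown in the question, while the f2 values are made up so that one difference is non-zero.

```shell
# Sample data: f1 as in the question, f2 values are hypothetical.
dir=$(mktemp -d)
printf '%s\n' 'TSCparser1 1irons1 EMEA_01 3' \
              'TSCparser12 1irons1 SPAIN_01 0' \
              'TSCparser13 1irons1 GERMANY_03 0' > "$dir/f1"
printf '%s\n' 'TSCparser1 1irons1 EMEA_01 1' \
              'TSCparser12 1irons1 SPAIN_01 0' \
              'TSCparser13 1irons1 GERMANY_03 0' > "$dir/f2"

result=$(awk '
    FNR == NR { a[$1,$2,$3] = $4; next }   # first file: remember column 4 per key
    ($1,$2,$3) in a {                      # second file: key exists, even if its value is 0
        print $1, $2, $3, (a[$1,$2,$3] - $4) " times gapped in past 1 hr."
    }' OFS='\t' "$dir/f1" "$dir/f2")
printf '%s\n' "$result"
rm -rf "$dir"
```

All three lines are printed, including the two whose stored value is 0.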

elixir_sinari · October 13, 2012, 9:41am

Then why test anything in an if before printing the data? Drop that if:

nawk 'FNR==NR{a[$1,$2,$3]=$4;next}
{print $1,$2,$3,((($1,$2,$3) in a)?(a[$1,$2,$3]-$4):" ") " times gapped in past 1 hr."}' OFS="         " f1 f2

This will output all lines from f2. If a matching line is found in f1, the numerical difference will be shown. Otherwise, a space will be shown in place of the difference.
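To see the unmatched case, here is a small sketch with made-up data: the TSCparserX line exists only in f2. It assumes plain awk in place of nawk, and OFS is set to "|" purely so that the placeholder space is visible.

```shell
# Hypothetical data: the second f2 line has no counterpart in f1.
dir=$(mktemp -d)
printf '%s\n' 'TSCparser1 1irons1 EMEA_01 3' > "$dir/f1"
printf '%s\n' 'TSCparser1 1irons1 EMEA_01 1' \
              'TSCparserX 9irons9 NOWHERE_01 5' > "$dir/f2"

result=$(awk '
    FNR == NR { a[$1,$2,$3] = $4; next }   # first file: remember column 4 per key
    # second file: print every line; show the difference only when the key was seen
    { print $1, $2, $3, ((($1,$2,$3) in a) ? (a[$1,$2,$3] - $4) : " ") " times gapped in past 1 hr." }
' OFS='|' "$dir/f1" "$dir/f2")
printf '%s\n' "$result"
rm -rf "$dir"
```

The matched line shows the difference (2 here), while the unmatched line shows a single space where the number would be.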

Don_Cragun · October 13, 2012, 10:06am

You should also note that the value of SUBSEP varies in different implementations of awk (and I don't remember what value nawk uses). Some systems (for example OS X) default SUBSEP to an empty string. (SUBSEP is used to separate strings in multi-dimensional array subscripts.) If there are any cases in your input where concatenating $1, $2, and $3 could yield a string that is not unique, you should explicitly set SUBSEP to something that doesn't appear in any of those three fields. Since $1 in your input ends with one or more digits and $2 starts with at least one digit, it looks like this could be a possible issue with your input. For your input I would suggest setting SUBSEP to "," or "|" (e.g., add SUBSEP="," in your nawk command line after setting OFS).

RudiC said he didn't know how to test for the sheer existence of an entity in an array. The way to do that in this case would be to use:

if($1 SUBSEP $2 SUBSEP $3 in a) {...}

which would have the same meaning as:

if(a[$1,$2,$3] != "") {...}

in pamu's correction to the nawk script. In this case the test for an empty string is shorter than the test for existence (and for many is easier to read/understand), so I wouldn't make any change here.
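The two tests agree for the data in this thread, where column 4 is never empty, but they are not identical in general. The sketch below, using made-up keys and plain awk, shows two differences: an element that holds an empty string passes the in test but fails the comparison, and merely referencing a missing element in a comparison creates it.

```shell
result=$(awk 'BEGIN {
    a["x", "y"] = ""                             # element exists but holds an empty string
    s = (("x", "y") in a) ? "yes" : "no"
    print "in: " s
    s = (a["x", "y"] != "") ? "yes" : "no"
    print "cmp: " s

    n = 0; for (k in a) n++
    print "before probe: " n
    if (a["p", "q"] != "") print "unexpected"    # referencing a missing key creates it
    n = 0; for (k in a) n++
    print "after probe: " n
}')
printf '%s\n' "$result"
```

For large second files with many unmatched keys, the comparison form therefore grows the array as a side effect, whereas the in test leaves it untouched.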

alister · October 13, 2012, 10:53am

That's incorrect; OS X does not default SUBSEP to an empty string. From opensource.apple.com :: awk-18 :: tran.c (OS X 10.8.2):

char	**SUBSEP;	/* subscript separator for a[i,j,k]; default \034 */
...
SUBSEP = &setsymtab("SUBSEP", "\034", 0.0, STR|DONTFREE, symtab)->sval;

That code is also present in opensource.apple.com :: awk-1.2 :: tran.c (10.0), so it's not a recent change.

Any implementor who chooses an empty string for the value of SUBSEP should be shunned by the AWK community ;). Seriously, though, the chance for collisions would be too great.

OS X's awk is nawk (which is also used by the BSD systems). "\034" is also the value of SUBSEP in the mawk, GNU awk, and busybox awk implementations.

In light of this, fiddling with SUBSEP is usually unnecessary.
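Anyone who wants to check their own awk can do so with a short sketch like the one below. The second key pair is made up to mimic the digit boundary between $1 and $2 in this thread's data, and the empty SUBSEP is set by hand only to simulate the feared case.

```shell
result=$(awk 'BEGIN {
    s = (SUBSEP == "\034") ? "default" : "other"
    print "SUBSEP is " s

    a["TSCparser1", "1irons1"] = 1
    hit = (("TSCparser11", "irons1") in a) ? "collision" : "no collision"
    print "with current SUBSEP: " hit

    SUBSEP = ""                                  # simulate an empty separator
    b["TSCparser1", "1irons1"] = 1
    hit = (("TSCparser11", "irons1") in b) ? "collision" : "no collision"
    print "with empty SUBSEP: " hit
}')
printf '%s\n' "$result"
```

With an empty separator both subscripts collapse to the same string, which is the collision risk described above; with a non-empty separator such as "\034" they stay distinct.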

Regards,
Alister

sdosanjh · October 13, 2012, 12:37pm

Thanks Elixir and pamu, it is working now. The only thing I forgot to mention is that f1 always has a higher count than f2.

Example: if in the 1st run of the script the 4th col of f2 = 123 and f1 = 125,
then in the second run f2 = 125 and f1 = 127; f1 is always greater than the value in f2.

pamu · October 13, 2012, 1:13pm

Yes. We are only computing f1 - f2.