Simple two file compare with twist

jack.bauer · May 17, 2012, 7:53am

I have file1 and file2
I lookup field3 from file2 in field1 of file1 and if there is a match, output field 2,3,5 from file2.
I now want to add field2 of file1 in the output.

I suspect what I have to do is read the entire line of file1 into a 2 dim array? pls help.

here is my code:

cat file1

foo,cmd1
bar,cmd2

cat file2

Hello,World,foo
Alice,Bob,bar
Egg,Spam,ham

output with current awk:

awk -F, 'FNR==NR {arr[$1];next} $3 in arr {OFS=","; print $2,$4,$5}' file1 file2
World,fi,fom
Bob,bie,doll

desired output:

World,fi,fom,cmd1
Bob,bie,doll,cmd2

Scrutinizer · May 17, 2012, 8:07am

Try:

awk -F, 'FNR==NR {arr[$1]=$2; next} $3 in arr {print $2,$4,$5,arr[$3]}' OFS=, file1 file2

jack.bauer · May 17, 2012, 8:42am

indeed that works, thanks.
do you mind explaining this {arr[$1]=$2
and OFS=, at the end (as i thought the comma within the print brackets did the job)

Scrutinizer · May 17, 2012, 8:55am

Ok, since you already used arr[$1] to create an empty array element with the index of field 1, I used that to give it the value of field 2 instead, which then later gets referenced in the second part, when file 2 gets read. You could leave OFS where it was, but that means it would get set to a comma every time a new line is read from file2. This way it is only set once, before the files are being read, it is just more efficient.

jack.bauer · May 17, 2012, 9:45am

sorry still a little confused.

so the sequence is:

load file1, field1 into arr until FNR==NR
perform the match condition i.e $3 in arr

at this point my array still contains the value of field 1, are we saying this then gets replaced with the value of field2?

good tip

Scrutinizer · May 17, 2012, 9:58am

When FNR==NR (the first file is being read), load field 2 ($2) into array with index of field 1 ($1), until FNR!=NR then the second file starts (and the first section of the script is now skipped). While we are reading file 2 line by line, print the fields of file 2 and recall the previously stored array value using field 3 of the second file ($3) as the index..

jack.bauer · May 17, 2012, 12:34pm

Ok I'm with you now.
That means when I load field 1 into the array, I've actually created an empty array element with my values from field 1 as index.
I was under the impression they are actually loaded as elements into the array with an incremental count 0,1,2,3,...n as index.
So in effect by the time the first part of the script completes, we have the following contents in that array:

Index, Value
foo,cmd1
bar,cmd2

Have I understood this correctly?

Scrutinizer · May 17, 2012, 12:46pm

Yes, that is correct. Awk uses so-called associative arrays..

jack.bauer · May 18, 2012, 1:31pm

understood, thanks.

speaking of making the code for efficient, the current method will become expensive soon as file 2 will grow to thousands of lines whereas file1 will only be around 20-30 lines at most. so currently, the code will iterate through each line of file2 and compare against file1 (which is loaded in the array).

it would be more economical to switch this around i.e load file2 into the array and lookup file1 so if file1 only has 5 lines, it will stop at that and hence be quicker. is my logic correct?

jack.bauer · July 25, 2012, 7:28am

found my first little "bug" with this, more like "shortcoming" as i didnt plan for it.
if file one has field1 appearing more than once (on a separate line), i would need that in the output with its corresponding field2. eg:

cat file1

foo,cmd1
bar,cmd2
foo,cmd3

cat file2

Hello,World,foo
Alice,Bob,bar
Egg,Spam,ham

desired output:

World,fi,fom,cmd1
Bob,bie,doll,cmd2
World,fi,fom,cmd3

currently this is not the case as the script looks in file2 and if field3 is found in array A, prints the output. this will only ever match once as field3 will only be in file2 once. hope that makes sense.

i think we need to do a reverse look up perhaps? i.e use field1,file1 and lookup against file2?