Compare the second column of a file with the second column of another in awk

amarn · May 23, 2013, 12:51pm

Hi,

I know that this topic has been discussed in the past and I've tried to follow all the guidelines. Anyhow, I following describe my problem.

I have a file (file1 , no. records = 67) containing pairs of IP addresses as follows (with single space as delimiter between the fields):

example of file1

1.2.3.4 123.45.2.1
1.2.3.4 68.32.1.4
........  ............

and another file (file2) which is much larger in size (no. records = 500 000) that the first two fields are still IP addresses (single space as delimiter in all fields) that they do also provide some other information in the subsequent fields (48 fields on every record/line)...as follows

example of file2

1.2.3.4 123.45.2.1 53 4000 2.5 30 21 43 .....
1.2.3.4 68.32.1.4   500 80 3.7 40 22 10 .....
1.2.3.4 50.44.66.1 40 30 8.5 29 11 10 .....

Since the first field is the same, I want to compare the second field of file1 with the second field of file2 and if matched to print the whole record (with all the 48 fields) from file2 into a new file (i.e. file3)

I've tried to do with the following one-liner

awk 'NR==FNR{a[$2]=$0;next}a[$2]{$0=a[$2]}1'  file2 file1 > file3

which provides me a result but it only gives me 67 records in the newly created file3...by going visually through file2 I could see more matches...any suggestions?

Thanks in advance!

Yoda · May 23, 2013, 12:55pm

Try this:

awk 'NR==FNR{A[$2];next}$2 in A' file1 file2

amarn · May 23, 2013, 1:04pm

thanks Yoda! It worked...now found 277 new records! Anyhow, just out of curiosity, what was the reason I was obtaining only 67 instances from the previous code? It would be great if you can enlighten me on that.

Thanks again and may the force be with you

Yoda · May 23, 2013, 1:11pm

The code that you posted was actually comparing 2nd field in file2 with 2nd field in file1:

awk 'NR==FNR{a[$2]=$0;next}a[$2]{$0=a[$2]}1'  file2 file1

But your requirement was to compare it other way! I believe this must have caused the discrepancy in the output because you said file1 contains 67 records.

Also your code was logically wrong because you put a 1 to print each record regardless the record is found or not:

awk 'NR==FNR{a[$2]=$0;next}a[$2]{$0=a[$2]}1'  file2 file1

I hope this helps.

amarn · May 23, 2013, 1:17pm

this helped a lot, thank you again!

RudiC · May 23, 2013, 3:28pm

You said field1 is the same in both files, so you could include it in the search as well. grep could do it as well, but might be overwhelmed when it comes to really large files:

$ grep -Ff file1 file2
1.2.3.4 123.45.2.1 53 4000 2.5 30 21 43 .....
1.2.3.4 68.32.1.4   500 80 3.7 40 22 10 .....