awk match two fields in two files

Hi, I have two test files, t.xyz and a.xyz, each with three columns. a.xyz has more rows than t.xyz. I would like to output the rows of a.xyz whose $1 and $2 match $1 and $2 of a row in t.xyz. The total number of output rows should equal the number of rows in t.xyz.
My command works fine on the test files, but when I apply it to a large file, the output has more rows than t.xyz.

I use the following:

awk 'FNR==NR{a[$1];b[$2];next} $1 in a && $2 in b'  t.xyz a.xyz > out.xyz

t.xyz
1907.05604682 2983.53399456 -5435.67749023
1908.05607621 2983.53399456 -3593.08154297
1910.05613499 2983.53399456 -1238.71289063
1911.05616438 2983.53399456 -4244.93823242
1912.05619377 2983.53399456 -3595.24414063
1913.05622316 2983.53399456 -2454.96728516
1923.05651706 2983.53399456 NaN

a.xyz
1907.05604682 2983.53399456 35.67749023
1908.05607621 2983.53399456 93.08154297
1910.05613499 2983.53399456 38.71289063
1911.05616438 2983.53399456 44.93823242
1912.05619377 2983.53399456 95.24414063
1913.05622316 2983.53399456 54.96728516
1923.05651706 2983.53399456 NaN
631.018545121 2646.58662319 24.715881348
635.018662681 2646.58662319 27.13696289

expected out.xyz
1907.05604682 2983.53399456 35.67749023
1908.05607621 2983.53399456 93.08154297
1910.05613499 2983.53399456 38.71289063
1911.05616438 2983.53399456 44.93823242
1912.05619377 2983.53399456 95.24414063
1913.05622316 2983.53399456 54.96728516
1923.05651706 2983.53399456 NaN

Any help to fix this will be appreciated.

I tried your script and I get your expected output. Do you have a sample for which the expected output is not produced?

A slightly simplified variation:

awk '{idx=$1 SUBSEP $2} FNR==NR{a[idx];next} idx in a'  t.xyz a.xyz > out.xyz

This works with mawk 1.3.3 on my Linux system:

awk 'FNR==NR {a[$1,$2]; next} ($1,$2) in a'  t.xyz a.xyz 

Yes, I applied it to a large data file and it failed. I don't understand why the output (taken from a.xyz) should have more rows than t.xyz.
I tried this one by vgersh99 and it works fine.

awk '{idx=$1 SUBSEP $2} FNR==NR{a[idx];next} idx in a'  t.xyz a.xyz > out.xyz

I now understand that my original command did not constrain the rows of a.xyz properly, so rows other than the truly matching ones were printed as well.
Thanks.

Did you consider duplicates when the output is larger than t.xyz?
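
One way to check for this, as a rough sketch: count how often each ($1, $2) pair occurs in a.xyz and list the pairs that repeat. The tiny sample files below are made up for illustration and are not the poster's data; the variable names and the `!seen[...]++` filter are likewise only one possible approach, assuming that keeping the first row per pair is acceptable.

```shell
#!/bin/sh
# Hypothetical sample data: the pair "1 2" occurs twice in a.xyz.
tmp=$(mktemp -d)
printf '1 2 x\n3 4 y\n' > "$tmp/t.xyz"
printf '1 2 p\n1 2 q\n3 4 r\n5 6 s\n' > "$tmp/a.xyz"

# List each ($1, $2) pair that occurs more than once in a.xyz, with its count.
dups=$(awk '{c[$1 " " $2]++} END{for (k in c) if (c[k] > 1) print k, c[k]}' "$tmp/a.xyz")

# Count the rows printed by the combined-key match; duplicates inflate this.
matched=$(awk 'FNR==NR{a[$1,$2];next} ($1,$2) in a{n++} END{print n+0}' "$tmp/t.xyz" "$tmp/a.xyz")

# Same match, but keep only the first a.xyz row for each pair.
first=$(awk 'FNR==NR{a[$1,$2];next} ($1,$2) in a && !seen[$1,$2]++' "$tmp/t.xyz" "$tmp/a.xyz")

echo "duplicated pairs: $dups"
echo "matched rows:     $matched"
echo "first per pair:"
echo "$first"
rm -rf "$tmp"
```

With this sample, t.xyz has two rows but the combined-key match prints three, because "1 2" appears twice in a.xyz. Whether dropping the later duplicates is the right fix depends on the data.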

I think I found another "failure mode" in the approach from post #1, NOT in the proposals from the forum:
If $1 of a line in a.xyz matches $1 of any line in t.xyz, and $2 matches $2 of any OTHER line in t.xyz, your code prints that line. The other approaches insist on both matches occurring on one single line of t.xyz before printing!
Example:

file1:

A B C
D E F

file2:

A E X

Your code prints A E X !
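
The example above can be reproduced with a short shell sketch. The file names file1 and file2 follow the example; the temporary directory and the variable names are only for illustration.

```shell
#!/bin/sh
# Recreate the two example files in a temporary directory.
tmp=$(mktemp -d)
printf 'A B C\nD E F\n' > "$tmp/file1"
printf 'A E X\n' > "$tmp/file2"

# Post #1 approach: $1 and $2 are stored in separate arrays,
# so the two matches may come from different lines of file1.
separate=$(awk 'FNR==NR{a[$1];b[$2];next} $1 in a && $2 in b' "$tmp/file1" "$tmp/file2")

# Combined key: $1 and $2 must occur together on one line of file1.
combined=$(awk 'FNR==NR{a[$1,$2];next} ($1,$2) in a' "$tmp/file1" "$tmp/file2")

echo "separate arrays: [$separate]"
echo "combined key:    [$combined]"
rm -rf "$tmp"
```

The first command prints A E X because A is a key of array a (from line 1) and E is a key of array b (from line 2). The second prints nothing, since no single line of file1 starts with A E.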
