awk match two fields in two files

geomarine · June 12, 2018, 11:54am

Hi, I have two TEST files, t.xyz and a.xyz, with three columns each. a.xyz has more rows than t.xyz. I would like to output the rows of a.xyz whose $1 and $2 match $1 and $2 of a row in t.xyz. The total number of output rows should equal the number of rows in t.xyz.
It works fine on the test files, but when I apply it to a large file, the output has more rows than t.xyz.

I use the following:

awk 'FNR==NR{a[$1];b[$2];next} $1 in a && $2 in b'  t.xyz a.xyz > out.xyz

t.xyz
1907.05604682 2983.53399456 -5435.67749023
1908.05607621 2983.53399456 -3593.08154297
1910.05613499 2983.53399456 -1238.71289063
1911.05616438 2983.53399456 -4244.93823242
1912.05619377 2983.53399456 -3595.24414063
1913.05622316 2983.53399456 -2454.96728516
1923.05651706 2983.53399456 NaN

a.xyz
1907.05604682 2983.53399456 35.67749023
1908.05607621 2983.53399456 93.08154297
1910.05613499 2983.53399456 38.71289063
1911.05616438 2983.53399456 44.93823242
1912.05619377 2983.53399456 95.24414063
1913.05622316 2983.53399456 54.96728516
1923.05651706 2983.53399456 NaN
631.018545121 2646.58662319 24.715881348
635.018662681 2646.58662319 27.13696289

expected out.xyz
1907.05604682 2983.53399456 35.67749023
1908.05607621 2983.53399456 93.08154297
1910.05613499 2983.53399456 38.71289063
1911.05616438 2983.53399456 44.93823242
1912.05619377 2983.53399456 95.24414063
1913.05622316 2983.53399456 54.96728516
1923.05651706 2983.53399456 NaN

Any help to fix this will be appreciated.

Scrutinizer · June 12, 2018, 12:58pm

I tried your script and I get your expected output. Do you have a sample where the expected output is not produced?

vgersh99 · June 12, 2018, 1:41pm

a slightly simplified variation:

awk '{idx=$1 SUBSEP $2} FNR==NR{a[idx];next} idx in a'  t.xyz a.xyz > out.xyz

RudiC · June 12, 2018, 6:02pm

This works with mawk 1.3.3 on my Linux system:

awk 'FNR==NR {a[$1,$2]; next} ($1,$2) in a'  t.xyz a.xyz
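For readers comparing the two proposals: in awk, a subscript written as a[$1,$2] is stored under the string $1 SUBSEP $2, so this form and the explicit idx variable above should build the same keys. A minimal check, using a made-up input line "x y" purely for illustration:

```shell
# Store a key with the comma syntax, then look it up with an explicit SUBSEP join.
# The input line "x y" is only an illustration, not data from the thread.
same=$(printf 'x y\n' | awk '{
    a[$1,$2]                      # comma subscript
    k = $1 SUBSEP $2              # explicit concatenation with SUBSEP
    print ((k in a) ? "same key" : "different key")
}')
printf '%s\n' "$same"
```

If both forms index the array identically, this prints "same key".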

geomarine · June 13, 2018, 1:51am

Yes, I applied it to a large data file and it failed. I don't understand why the output (taken from a.xyz) has more rows than t.xyz.
I tried this one by vgersh99 and it works fine:

awk '{idx=$1 SUBSEP $2} FNR==NR{a[idx];next} idx in a'  t.xyz a.xyz > out.xyz

I now understand that my command did not constrain a.xyz properly, so rows other than the truly matching ones were printed as well.
Thanks.

RudiC · June 13, 2018, 4:17am

Did you consider duplicates when the output is larger than t.xyz?
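A small sketch of how duplicates could inflate the output, assuming the duplicated ($1,$2) pairs are in a.xyz. The file contents below are invented for illustration, and keeping only the first row per pair is just one possible policy, not necessarily what is wanted here:

```shell
# Scratch copies with made-up data: the pair "1 2" occurs twice in a.xyz.
tmp=$(mktemp -d)
printf '1 2 10\n3 4 20\n' > "$tmp/t.xyz"
printf '1 2 100\n1 2 101\n3 4 200\n5 6 300\n' > "$tmp/a.xyz"

# Every matching row of a.xyz is printed, so the duplicate pair yields two rows.
all=$(awk 'FNR==NR{a[$1,$2];next} ($1,$2) in a' "$tmp/t.xyz" "$tmp/a.xyz")

# Same match, but only the first a.xyz row for each pair is kept.
first=$(awk 'FNR==NR{a[$1,$2];next} ($1,$2) in a && !seen[$1,$2]++' "$tmp/t.xyz" "$tmp/a.xyz")

# Count rows with awk to avoid platform differences in wc formatting.
n_all=$(printf '%s\n' "$all" | awk 'END{print NR}')
n_first=$(printf '%s\n' "$first" | awk 'END{print NR}')
printf 'all matches: %s rows, first per pair: %s rows\n' "$n_all" "$n_first"
rm -rf "$tmp"
```

With this data the plain match gives 3 rows for a 2-row t.xyz, while the seen[] filter gives 2.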

RudiC · June 13, 2018, 10:37am

I think I found another "failure mode" in your post #1 approach, NOT in the proposals from the forum:
If $1 from a.xyz matches $1 in any line of t.xyz, and $2 matches $2 in any OTHER line of t.xyz, your code prints the row. The other approaches insist on both matches being in one single line to print!
Example:

file1:

A B C
D E F

file2:

A E X

Your code prints A E X!
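To reproduce this, here is a minimal sketch that writes the two example files into a scratch directory (the mktemp directory and the variable names are only for illustration) and runs both the post #1 command and the combined-key variant on them:

```shell
# Scratch directory so the demo does not touch real data.
tmp=$(mktemp -d)
printf 'A B C\nD E F\n' > "$tmp/file1"
printf 'A E X\n' > "$tmp/file2"

# Post #1 approach: $1 and $2 are checked against two independent sets,
# so A (from line 1) and E (from line 2) are both "found".
separate=$(awk 'FNR==NR{a[$1];b[$2];next} $1 in a && $2 in b' "$tmp/file1" "$tmp/file2")

# Combined-key approach: the pair ($1,$2) must occur together on one line of file1.
paired=$(awk 'FNR==NR{a[$1,$2];next} ($1,$2) in a' "$tmp/file1" "$tmp/file2")

printf 'separate arrays: [%s]\n' "$separate"
printf 'combined key:    [%s]\n' "$paired"
rm -rf "$tmp"
```

The first line shows [A E X] and the second shows [], since no single line of file1 has A in $1 and E in $2.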