Merge two columns from two files into one if another column matches

I have two text files that look something like this:

A:B:C 123
D:E:F 234
G:H:I 345
J:K:L 123
M:N:O 456
P:Q:R 567
A:B:C 456
D:E:F 567
G:H:I 678
J:K:L 456
M:N:O 789
P:Q:R 890

I want to find the line where the first column matches and then combine the second columns into a single column so the output looks like this:

A:B:C 123456
D:E:F 234567
G:H:I 345678
J:K:L 123456
M:N:O 456789
P:Q:R 567890

After that, I want to search through the second columns and remove duplicate ones. In this case, that would mean the A:B:C line and the J:K:L line have matching second columns, so I would remove the J:K:L line.

A:B:C 123456
D:E:F 234567
G:H:I 345678
M:N:O 456789
P:Q:R 567890

I've tried some awk, but I am not coming anywhere close with this one. :wall: The files can also get pretty big with hundreds of millions of lines, so a speedier option would be best. Thanks!

join -1 1 -2 1 join1.txt join2.txt | sed 's/\([0-9]\) \([0-9]\)/\1\2/g'

Try:

paste file1 file2 | awk '!A[$2$4]++{print $1,$2 $4}'
1 Like

Hi

$ join -1 1 -2 1  f1  f2 | awk '!a[$2$3]++{print $1,$2$3}'
A:B:C 123456
D:E:F 234567
G:H:I 345678
M:N:O 456789
P:Q:R 567890

Guru.

1 Like
$ awk 'NR==FNR{a[$1]=$2;next}$1 in a{if(!b[a[$1]$2]++)printf("%s %s%s\n",$1,a[$1],$2);}' file1 file2
A:B:C 123456
D:E:F 234567
G:H:I 345678
M:N:O 456789
P:Q:R 567890

1 Like

I'd really appreciate to know how far this solution

can be driven, i.e. how many lines will make the awk arrays explode...

Awk only:

awk '{p=$2; getline<f} !A[$2=p $2]++' f=file2 file1
awk '{getline $1<f; $0=$0} !A[$2]++' OFS= f=file1 file2
2 Likes

Thanks for the help everyone!

All the options worked except for bipinajith's.
It turns out this was the fastest option:

awk '{p=$2; getline<f} !A[$2=p $2]++' f=file2 file1

It churned through two 300+million lines files in about four and a half minutes.

---------- Post updated at 05:54 PM ---------- Previous update was at 05:51 PM ----------

It worked just fine. Took 8.5 minutes and required 33GB of RAM. The best solution from Scrutinizer took half the time and 9GB of RAM.

1 Like

Thanks for reporting back, pbluescript...