I have two text files that look something like this:
A:B:C 123
D:E:F 234
G:H:I 345
J:K:L 123
M:N:O 456
P:Q:R 567
A:B:C 456
D:E:F 567
G:H:I 678
J:K:L 456
M:N:O 789
P:Q:R 890
I want to find the line where the first column matches and then combine the second columns into a single column so the output looks like this:
A:B:C 123456
D:E:F 234567
G:H:I 345678
J:K:L 123456
M:N:O 456789
P:Q:R 567890
After that, I want to search through the second columns and remove duplicate ones. In this case, that would mean the A:B:C line and the J:K:L line have matching second columns, so I would remove the J:K:L line.
A:B:C 123456
D:E:F 234567
G:H:I 345678
M:N:O 456789
P:Q:R 567890
I've tried some awk, but I am not coming anywhere close with this one. :wall: The files can also get pretty big with hundreds of millions of lines, so a speedier option would be best. Thanks!
Yoda
October 29, 2012, 11:36am
2
join -1 1 -2 1 join1.txt join2.txt | sed 's/\([0-9]\) \([0-9]\)/\1\2/g'
Try:
paste file1 file2 | awk '!A[$2$4]++{print $1,$2 $4}'
1 Like
Hi
$ join -1 1 -2 1 f1 f2 | awk '!a[$2$3]++{print $1,$2$3}'
A:B:C 123456
D:E:F 234567
G:H:I 345678
M:N:O 456789
P:Q:R 567890
Guru.
1 Like
$ awk 'NR==FNR{a[$1]=$2;next}$1 in a{if(!b[a[$1]$2]++)printf("%s %s%s\n",$1,a[$1],$2);}' file1 file2
A:B:C 123456
D:E:F 234567
G:H:I 345678
M:N:O 456789
P:Q:R 567890
1 Like
RudiC
October 29, 2012, 12:52pm
6
I'd really appreciate to know how far this solution
can be driven, i.e. how many lines will make the awk arrays explode...
Awk only:
awk '{p=$2; getline<f} !A[$2=p $2]++' f=file2 file1
awk '{getline $1<f; $0=$0} !A[$2]++' OFS= f=file1 file2
2 Likes
Thanks for the help everyone!
All the options worked except for bipinajith's.
It turns out this was the fastest option:
awk '{p=$2; getline<f} !A[$2=p $2]++' f=file2 file1
It churned through two 300+million lines files in about four and a half minutes.
---------- Post updated at 05:54 PM ---------- Previous update was at 05:51 PM ----------
It worked just fine. Took 8.5 minutes and required 33GB of RAM. The best solution from Scrutinizer took half the time and 9GB of RAM.
1 Like
Thanks for reporting back, pbluescript...