Problems with exact merge of 2 columns using awk

Hi there,
I have 2 files: file 1 and file 2. I want to do an exact merge on column 1 in each file. In the match, I want to include all columns in file 1 and column 4 in file 2.

file 1: (140,000,000 rows)

1:964254:T:C 1 0 964254 T C 1:964254 
1:965573:A:C 1 0 965573 A C 1:965573
1:983193:G:A  1 0 983193 G A 1:983193
1:1014228:A:G 1 0 1014228 A G 1:1014228

file 2: (7,000,000 rows)

1:10019:TA:T 1 10019  rs775809821  TA T
1:10039:A:C 1 10039 rs978760828 A C
1:10043:T:A 1 10043 rs1008829651 T A
1:10051:A:G 1 10051 rs1052373574 A G

I developed the code below, but I don't think it is working as there are no exact matches which is very unlikely. In other words, the file that is outputted has 0 rows. Can someone tell me if there's something wrong with my code and how to fix it?

awk 'NR==FNR{a[$1]=$4;next} $1 in a {print $1, a[$1],$2,$3,$4,$5,$6,$7}'  file 2 file 1 >  file 3

Thanks in advance.

Welcome @ellie_story_2020 :+1: ,

There appears to be no overlap between the two sample files, so it seems normal that the output of a merge should be empty, no?

Well, the files are much bigger than the 4 rows I presented. I just wanted to post an example of what the files looked like so one could evaluate whether my code was accurate.

I see, perhaps you could post two samples, where there is overlap?
The code in itself looks fine to me, perhaps there is an issue with the input files?

Thanks, Scrutinzer. After your message, I checked and I actually don't see any exact matches, so that makes sense! I found the problem though. The issue is that the in the first column, the letters are flipped around and that's why there's no exact match. Recall that was the variable I was merging on. I created 2 text files with this example and modified my code below, but it doesn't work. Any advice?

file 1:

22:50779796:C:A	22 50779796 rs9616975 C A

merged_file2.txt:

22:50779796:A:C 22 0 50779796 A C 22:50779796
awk '(NR==FNR){a[$2":"$3":"$5":"$6]=$4; a[$2":"$3":"$6":"$5]=$4;next} ($1 in a) {print $1, a[$1],$2,$3,$4,$5,$6,$7}' file1.txt merged_file2.txt > file3.txt
1 Like

You're welcome @ellie_story_2020 ,

You could try replacing the above part of your original code with:

awk 'NR==FNR{split($1,F,":"); a[F[1]":"F[2]":"F[4]":"F[3]]=$4;next}

But the code that you posted seems to be working as well, where you reconstruct the first record using fields 2, 3, 5 and 6.
So you could also try replacing the above part by this:

awk 'NR==FNR{a[$2":"$3":"$6":"$5]=$4;next}
1 Like

I ended up getting it to work--thank you!

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.