Hi there,
I have 2 files: file 1 and file 2. I want to do an exact merge on column 1 in each file. In the match, I want to include all columns in file 1 and column 4 in file 2.
file 1: (140,000,000 rows)
1:964254:T:C 1 0 964254 T C 1:964254
1:965573:A:C 1 0 965573 A C 1:965573
1:983193:G:A 1 0 983193 G A 1:983193
1:1014228:A:G 1 0 1014228 A G 1:1014228
file 2: (7,000,000 rows)
1:10019:TA:T 1 10019 rs775809821 TA T
1:10039:A:C 1 10039 rs978760828 A C
1:10043:T:A 1 10043 rs1008829651 T A
1:10051:A:G 1 10051 rs1052373574 A G
I wrote the code below, but I don't think it is working: the output file has 0 rows, meaning there are no exact matches, which is very unlikely. Can someone tell me if there's something wrong with my code and how to fix it?
awk 'NR==FNR{a[$1]=$4;next} $1 in a {print $1, a[$1],$2,$3,$4,$5,$6,$7}' file2 file1 > file3
Well, the files are much bigger than the 4 rows I presented. I just wanted to post an example of what the files looked like so one could evaluate whether my code was accurate.
I see. Perhaps you could post two samples where there is overlap?
The code itself looks fine to me; perhaps there is an issue with the input files?
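One way to test the input files before merging is to count how many keys in column 1 the two files share. The snippet below is only a sketch: it uses two of the sample rows from each file above, and the temporary directory and the names file1 and file2 are placeholders for the real paths.

```shell
# Sample rows copied from the posts above; paths are placeholders.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1" <<'EOF'
1:964254:T:C 1 0 964254 T C 1:964254
1:965573:A:C 1 0 965573 A C 1:965573
EOF
cat > "$tmpdir/file2" <<'EOF'
1:10019:TA:T 1 10019 rs775809821 TA T
1:10039:A:C 1 10039 rs978760828 A C
EOF

# Remember every key from file2, then count file1 rows whose key was seen.
overlap=$(awk 'NR==FNR {seen[$1]; next} $1 in seen {n++} END {print n+0}' \
    "$tmpdir/file2" "$tmpdir/file1")
echo "shared keys: $overlap"
```

A count of 0 on the full files would point at the keys themselves (for example a different allele order) rather than at the merge command.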
Thanks, Scrutinzer. After your message, I checked and I actually don't see any exact matches, so that makes sense! I found the problem though. The issue is that in the first column, the letters are flipped around, and that's why there's no exact match. Recall that this is the column I was merging on. I created 2 text files with this example and modified my code as shown below, but it doesn't work. Any advice?
file1.txt:
22:50779796:C:A 22 50779796 rs9616975 C A
merged_file2.txt:
22:50779796:A:C 22 0 50779796 A C 22:50779796
awk '(NR==FNR){a[$2":"$3":"$5":"$6]=$4; a[$2":"$3":"$6":"$5]=$4;next} ($1 in a) {print $1, a[$1],$2,$3,$4,$5,$6,$7}' file1.txt merged_file2.txt > file3.txt
But the code that you posted, where you reconstruct the first field from fields 2, 3, 5 and 6, seems to be working as well.
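As a quick check, the posted command can be run on just the two sample rows. This is a sketch: the temporary directory is a placeholder, and the file contents are the single rows shown in the post.

```shell
# One sample row per file, copied from the post above.
tmpdir=$(mktemp -d)
printf '%s\n' '22:50779796:C:A 22 50779796 rs9616975 C A' > "$tmpdir/file1.txt"
printf '%s\n' '22:50779796:A:C 22 0 50779796 A C 22:50779796' > "$tmpdir/merged_file2.txt"

# Store the rsID under both allele orders, then look up column 1 of the second file.
result=$(awk '(NR==FNR){a[$2":"$3":"$5":"$6]=$4; a[$2":"$3":"$6":"$5]=$4; next}
              ($1 in a){print $1, a[$1], $2, $3, $4, $5, $6, $7}' \
    "$tmpdir/file1.txt" "$tmpdir/merged_file2.txt")
echo "$result"
# prints: 22:50779796:A:C rs9616975 22 0 50779796 A C 22:50779796
```

If the full files still produce no output, Windows line endings or a separator other than whitespace are possible causes. That is a guess, not something confirmed in this thread.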
So you could also try replacing the above part by this: