In the awk below I am trying to output to one file the lines that match on $2, $3, $4 between file1 and file2, with the count in (). I am also trying to output the lines that are missing from each file, compared on the same fields, again with the count in (). Both input files are tab-delimited, but the output is not. I am not sure where to put the counter to get the desired output. Thank you :).
file1
1 955597 G G
1 9773306 T C
1 981931 A G
1 982994 T C
1 984302 T C
file2
1 955597 G G
1 9773306 T C
1 981939 A G
1 982978 T C
1 984302 T C
desired output
Match: (3)
1 955597 G G
1 9773306 T C
1 984302 T C
Missing from file1: (2)
1 981939 A G
1 982978 T C
Missing from file2: (2)
1 981931 A G
1 982994 T C
awk
awk -F'\t' 'FNR==1 { next }
FNR == NR { file1[$2,$3,$4] = $2 " " $3 " " $4 }
FNR != NR { file2[$2,$3,$4] = $2 " " $3 " " $4 }
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' file1 file2 > NA12878_match
What is the role of $1 in all of this? What is the point of not using it for the matching, but still using it for the output? Suppose matching lines in file1 and file2 do have a different $1: how do you decide which $1 to print, or why would you print it at all?
If, on the other hand, the value of column 1 is always equal for file1 and file2, then we might just as well use $0 for matching, which would simplify the script.
$1 is not used in the match because its format is highly variable: sometimes it is chr1 and sometimes it is just 1.
It is very unlikely that matches will have different $1 values, so it just seemed easier not to match on it but print it instead. Thank you :).
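If the only variation in $1 really is an optional chr prefix (an assumption based on the chr1 versus 1 example above), one possible approach is to strip that prefix first, after which matching on $0 becomes an option. A minimal sketch, with made-up sample lines:

```shell
# Assumption: $1 differs only by an optional leading "chr".
# Strip it so that chr1 and 1 compare equal; the other fields are untouched.
norm=$(printf 'chr1\t955597\tG\tG\n1\t981931\tA\tG\n' |
  awk 'BEGIN { FS = OFS = "\t" } { sub(/^chr/, "", $1); print }')
printf '%s\n' "$norm"
```

Assigning to $1 makes awk rebuild the line with OFS, which is why OFS is set to a tab here.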
This is a bit closer to the approach you used, cmccabe.
All I did was build a string in each for loop, count the lines, and then printf the result.
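awk -F'\t' '
# Key each line on $2,$3,$4 and keep the whole line ($0) for printing
{
  if (FNR == NR) file1[$2" "$3" "$4] = $0   # first file on the command line
  else           file2[$2" "$3" "$4] = $0   # second file
}
END {
  # Keys present in both files
  for (k in file1)
    if (k in file2) { X = X "\n" file1[k]; m++ }
  # Keys only in file2, i.e. missing from file1
  for (k in file2)
    if (!(k in file1)) { Y = Y "\n" file2[k]; f1++ }
  # Keys only in file1, i.e. missing from file2
  for (k in file1)
    if (!(k in file2)) { Z = Z "\n" file1[k]; f2++ }
  # A counter that was never incremented prints as 0 with %d
  printf "Match: (%d)%s\n", m, X
  printf "Missing in file1: (%d)%s\n", f1, Y
  printf "Missing in file2: (%d)%s\n", f2, Z
}' file1 file2 > NA12878_match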
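One thing to be aware of: awk does not guarantee the order in which for (k in array) visits keys, so the lines within each section may not come out in input order. If that matters, a possible variant (a sketch only, not the method posted above) records the keys in the order they are first seen. The array names order1, order2, both, miss1 and miss2 are made up for this example, and the temporary files just reproduce the sample data from the question.

```shell
# Recreate the sample inputs from the question as tab-delimited temp files
tmp=$(mktemp -d)
printf '1\t955597\tG\tG\n1\t9773306\tT\tC\n1\t981931\tA\tG\n1\t982994\tT\tC\n1\t984302\tT\tC\n' > "$tmp/file1"
printf '1\t955597\tG\tG\n1\t9773306\tT\tC\n1\t981939\tA\tG\n1\t982978\tT\tC\n1\t984302\tT\tC\n' > "$tmp/file2"

out=$(awk -F'\t' '
# Remember each key the first time it is seen, so output follows input order
FNR == NR {
  k = $2 SUBSEP $3 SUBSEP $4
  if (!(k in file1)) order1[++n1] = k
  file1[k] = $0
  next
}
{
  k = $2 SUBSEP $3 SUBSEP $4
  if (!(k in file2)) order2[++n2] = k
  file2[k] = $0
}
END {
  # Walk file1 keys in input order: either matched, or missing from file2
  for (i = 1; i <= n1; i++) {
    k = order1[i]
    if (k in file2) both[++m] = file1[k]
    else miss2[++f2] = file1[k]
  }
  # Walk file2 keys in input order: whatever file1 lacks
  for (i = 1; i <= n2; i++) {
    k = order2[i]
    if (!(k in file1)) miss1[++f1] = file2[k]
  }
  printf "Match: (%d)\n", m
  for (i = 1; i <= m; i++) print both[i]
  printf "Missing in file1: (%d)\n", f1
  for (i = 1; i <= f1; i++) print miss1[i]
  printf "Missing in file2: (%d)\n", f2
  for (i = 1; i <= f2; i++) print miss2[i]
}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Like the script above, this prints $0, so the data lines stay tab-delimited. It also assumes file1 is not empty, since the FNR == NR test cannot tell the two files apart otherwise.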