awk to output match and mismatch with count using specific fields

cmccabe · December 9, 2016, 6:35pm

In the below awk I am trying output to one file those lines that match between $2,$3,$4 of file1 and file2 with the count in () . I am also trying to output those lines that are missing between $2,$3,$4 of file1 and file2 with the count of in () each. Both input files are tab-delimited , but the output is not. I am not sure where to put the counter to get the desired output . Thank you :).

file1

1    955597    G    G
1    9773306    T    C
1    981931    A    G
1    982994    T    C
1    984302    T    C

file2

1    955597    G    G
1    9773306    T    C
1    981939    A    G
1    982978    T    C
1    984302    T    C

desired output

Match: (3)
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1: (2)
1    981939    A    G
1    982978    T    C
Missing from file2:
1    981931    A    G
1    982994    T    C

awk

awk -F'\t' 'FNR==1 { next }
        FNR == NR { file1[$2,$3,$4] = $2 " " $3 " " $4 }
        FNR != NR { file2[$2,$3,$4] = $2 " " $3 " " $4 }
        END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
              print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
              print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' file1 file2  > NA12878_match

RavinderSingh13 · December 9, 2016, 7:28pm

Hello cmccabe,

Could you please try following and let me know if this helps you.

awk 'FNR==NR{A[$2 FS $3 FS $4]=$0;next} {Q=$2 FS $3 FS $4} !(Q in A){;NON_MATCH2=NON_MATCH2?NON_MATCH2 ORS $0:$0} (Q in A){MATCH=MATCH?MATCH ORS A[Q]:A[Q];delete A[Q];} END{for(i in A){;NON_MATCH1=NON_MATCH1?NON_MATCH1 ORS A:A};print "Match:" ORS MATCH ORS "Missing from file1:" ORS NON_MATCH2 ORS "Missing from file2:" ORS NON_MATCH1}'   Input_file1  Input_file2

Output will be as follows.

Match:
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1:
1    981939    A    G
1    982978    T    C
Missing from file2:
1    982994    T    C
1    981931    A    G

Thanks,
R. Singh

cmccabe · December 9, 2016, 7:39pm

Can the (count) of each be added in? Thank you :).

Match: (3)
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1: (2)
1    981939    A    G
1    982978    T    C
Missing from file2: (2)
1    982994    T    C
1    981931    A    G

RavinderSingh13 · December 9, 2016, 10:50pm

cmccabe:

Can the (count) of each be added in? Thank you :).

Match: (3)
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1: (2)
1    981939    A    G
1    982978    T    C
Missing from file2: (2)
1    982994    T    C
1    981931    A    G

Hello cmccabe,

Could you please try following and let me know if this helps.

awk 'FNR==NR{A[$2 FS $3 FS $4]=$0;next} {Q=$2 FS $3 FS $4} !(Q in A){;NON_MATCH2=NON_MATCH2?NON_MATCH2 ORS $0:$0;p++} (Q in A){MATCH=MATCH?MATCH ORS A[Q]:A[Q];delete A[Q];q++} END{for(i in A){;NON_MATCH1=NON_MATCH1?NON_MATCH1 ORS A:A;r++};print "Match: (" q ")" ORS MATCH ORS "Missing from file1: (" p ")" ORS NON_MATCH2 ORS "Missing from file2: (" r ")"ORS NON_MATCH1}'  Input_file1   Input_file2

Output will be as follows.

Match: (3)
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1: (2)
1    981939    A    G
1    982978    T    C
Missing from file2: (2)
1    982994    T    C
1    981931    A    G

Adding one-line form for solution too now.

awk 'FNR==NR{
                A[$2 FS $3 FS $4]=$0;
                next
            }
            {
                Q=$2 FS $3 FS $4
            }
            !(Q in A){;
                        NON_MATCH2=NON_MATCH2?NON_MATCH2 ORS $0:$0;
                        p++
                     }
             (Q in A){
                        MATCH=MATCH?MATCH ORS A[Q]:A[Q];
                        delete A[Q];
                        q++
                     }
      END   {
                for(i in A){;
                                NON_MATCH1=NON_MATCH1?NON_MATCH1 ORS A:A;
                                r++
                           };
                print "Match: (" q ")" ORS MATCH ORS "Missing from file1: (" p ")" ORS NON_MATCH2 ORS "Missing from file2: (" r ")"ORS NON_MATCH1
            }
    '   Input_file1   Input_file2

Thanks,
R. Singh

Scrutinizer · December 10, 2016, 10:24am

What is the role of $1 in all of this? What is the point of not using it for the matching, but still using it for the output. Suppose matches in file1 and file2 do have a different $1, how do you decide which $1 to print, or why would you print it at all.

If on the other hand the value of column 1 is always equal for file1 and file2, then we might just as well use $0 for matching, which would simplify the script

cmccabe · December 12, 2016, 7:23am

Thank you very much

$1 is not used in the match because the format of that is highly variable, sometimes it is chr1 and sometimes it is just 1 .
It is very unlikely that matches will have different $1 values, so it just seemed easier not to match on it but print it instead. Thank you :).

Chubler_XL · December 12, 2016, 3:19pm

This is a bit closer to the approach you used cmccabe.

All I did was build a string in the for loop and count the lines then printf the result

awk -F'\t' '
{
  if(FNR == NR) file1[$2" "$3" "$4]=$0
  else file2[$2" "$3" "$4]=$0
}
END {
  for (k in file1)
     if (k in file2) { X=X"\n"file1[k]; m++}
  for (k in file2) 
     if (!(k in file1)) {Y=Y"\n"file2[k]; f1++}
  for (k in file1) 
     if (!(k in file2)) {Z=Z"\n"file1[k]; f2++}

  printf "Match: (%d)%s\n", m, X
  printf "Missing in file1: (%d)%s\n", f1, Y
  printf "Missing in file2: (%d)%s\n", f2, Z
}' file1 file2 > NA12878_match

cmccabe · December 13, 2016, 5:45pm

Thank you very much all