Using awk to output matches between two files to one file and mismatches to two others

cmccabe · August 26, 2016, 4:37pm

I am trying to output the matches between $1 of file1 to $3 of file2 into a new file match .

I am also wanting to output the mismatches between those same 2 files and fields to two separate new files called missing from file1 and missing from file2 . The input files are tab-delimited, but the output can be space delimited. The awk below hopefully is a good start. Thank you :).

file1

file2

1    1    1
2    2    2
3    3    
4    4    
5    5    5
6    6    6

desired output
match

1 2 6

missing from file1

missing from file2

3 4

awk tried

awk -F'\t' 'NR==FNR{a[$1]=$3;next}{if (a[$1])print a[$1],$0;else print "Not Found", $0;}' file1 file2

Yoda · August 26, 2016, 6:17pm

I noticed file1 has just 2 columns, so this statement doesn't make any sense!

NR==FNR{a[$1]=$3;next}

cmccabe · August 26, 2016, 6:23pm

That should be $2 , but the output is still only 1 new file. Thank you :).

Yoda · August 26, 2016, 6:25pm

Here is an approach:-

awk -F'\t' '
        NR == FNR {
                A[$1]
                next
        }
        {
                B[$3]
        }
        END {
                print "Match"
                for ( k in A )
                {
                        if ( k && k in B )
                                print k
                }

                print "Missing from file1"
                for ( k in B )
                {
                        if ( ! ( k in A ) )
                                print k
                }

                print "Missing from file2"
                for ( k in A )
                {
                        if ( ! ( k in B ) )
                                print k
                }
        }
' file1 file2

Don_Cragun · August 26, 2016, 8:30pm

cmccabe:

I am trying to output the matches between $1 of file1 to $3 of file2 into a new file match .

I am also wanting to output the mismatches between those same 2 files and fields to two separate new files called missing from file1 and missing from file2 . The input files are tab-delimited, but the output can be space delimited. The awk below hopefully is a good start. Thank you :).

file1
1    1
2    2
3    3
4    4
   5
6    6
file2
1    1    1
2    2    2
3    3    
4    4    
5    5    5
6    6    6
desired output
match
1 2 6
missing from file1
5
missing from file2
3 4
awk tried
awk -F'\t' 'NR==FNR{a[$1]=$3;next}{if (a[$1])print a[$1],$0;else print "Not Found", $0;}' file1 file2

Hi cmccabe,
I don't see any indication in the match output saying that the empty 1st field in the 5th line of file1 matches the 3rd field of the 3rd and 4th lines of file2 ... Are empty fields supposed to be ignored? If not, how are empty fields supposed t be displayed in the <space> or <tab> separated output?

If a single value appears more than once in file1 or in file2 and the number of times that value appears in one file is not the same as the number of times it appears in the other file, should there just be a single entry in the match output or should there one entry for each matched pair and entries in one of the other files for the number of unpaired entries?

cmccabe · August 27, 2016, 1:56pm

@Yoda and @Don Cragun I modified the awk as my real data has the possibility of the same entry being in one of the files with a different pairing. Also, the nulls can remain blank .

For example in the two files below 48719928 AT - is in both files, however 48719928 A G is missing from file1 . So the awk uses a combination array as a key then looks for that.

I am using $19 $21 and $22 of file1 to search $3 $5 and $6 of file2. The header row is skipped and it then outputs a new file with what lines match and if they do not match what file the match is missing from. The awk does run but the output produced seems to be incorrect and I am not able to fix it. Thank you :).

file1

Index    Chromosomal Position    Gene    Inheritance    mRNA    Chromosome    Coverage    Score    A(#F,#R)    C(#F,#R)    G(#F,#R)    T(#F,#R)    Ins(#F,#R)    Del(#F,#R)    SNP    Mutation    Frequency    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    1000G2012APR_ALL    1000G2012APR_AFR    1000G2012APR_AMR    1000G2012APR_ASN    1000G2012APR_EUR    ESP6500si_ALL    ESP6500si_AA    ESP6500si_EA    CG46    common    clinvar    clinvarsubmit    clinvarreference    Homopolymer    Splice    Pseudogene    Classification    HGMD    Disease    Sanger    References
98    48719928    FBN1    AD    NM_000138.4    15    6786    30.3    1184;2152    0;0    25;23    0;1    0;5    1195;2206        c.7039_7040delAT    50.12    15    48719928    48719929    AT    -    exonic    FBN1        frameshift deletion    FBN1:NM_000138.4:exon58:c.7039_7040del:p.M2347fs                                                                        pathogenic    CD020234    Marfan syndrome        1. Korkko (2002) J Med Genet 39: 34 PubMed: 11826022
101    48807637    FBN1    AD    NM_000138.4    15    3792    27.7    0;0    0;4    0;0    1227;2561    0;7    0;0    rs4775765    c.[1415G>A]+[1415G>A]    99.89    15    48807637    48807637    C    T    exonic    FBN1        nonsynonymous SNV    FBN1:NM_000138.4:exon12:c.G1415A:p.C472Y    1    1    1    1    1    1    .    .    .    1                                likely benign    n

file2

R_Index    Chr    Start    End    Ref    Alt    Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene    Inheritence    ExonicFunc.IDP.refGene    AAChange.IDP.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    dpsi_max_tissue    dpsi_zscore    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Phred    Classification    HGMD    Sanger
36    chr15    48719928    48719929    AT    -    exonic    FBN1    0    0    frameshift deletion    FBN1:NM_000138.4:exon58:c.7039_7040del:p.M2347fs    rs794728319    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    Pathogenic|Pathogenic    Thoracic_aortic_aneurysm_and_aortic_dissection|Marfan_syndrome    RCV000181674.1|RCV000208062.1    MedGen:Orphanet|MedGen:OMIM:Orphanet:SNOMED_CT    CN118826:ORPHA91387|C0024796:154700:ORPHA558:19346006    0    0    0    0    0    0    0
37    chr15    48719928    48719928    A    G    exonic    FBN1    0    0    nonsynonymous SNV    FBN1:NM_000138.4:exon58:c.7040T>C:p.M2347T    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    -0.2791    -0.822    0.56    T    0.369    B    0.222    B    0    D    1    D    1.43    L    0    0    0    0    0    0    0    0    0    0    0    0
38    chr15    48807637    48807637    C    T    exonic    FBN1    0    0    nonsynonymous SNV    FBN1:NM_000138.4:exon12:c.1415G>A:p.C472Y    rs4775765    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0    0    0    1    2.0758    1.99    1    T    0    B    0    B    0    N    0    P    -4.395    N    0    0    0    0    0    GOOD    308    hom    87    0    0    0

current output

Match:
48807637 C T
Missing in file1:
48719928 A G
48719928 AT -
  
Missing in file2:
48719929 - exonic

desired output

Match 48719928 AT -, 48807637 C T 
Missing from file1 48719928 A G 
Missing from file2

awk

awk 'FNR==1 { next }
      FNR == NR { file1[$19,$21,$22] = $19 " " $21 " " $22 }
      FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
      END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
            print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
            print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
      }' file1 file2 > list

RudiC · August 27, 2016, 3:05pm

As usual - garbage in garbage out. In file1, line 2, field 15 is missing which is rs4775765 in line 3. With a dummy in field 15 in line2, the result of the awk given is:

Match:
48719928 AT -
48807637 C T
Missing in file1:
48719928 A G
Missing in file2:

Why don't you exercise a bit more care when describing your problems and supplying sample data?

cmccabe · August 27, 2016, 6:57pm

Interesting, I suspected maybe it was missing values, but couldn't understand why, if the values used in the lookup key have values, why the awk wasn't working properly. I guess why do all the columns need something in them. I will make sure my data is cleaner and my questions more precise. Thank you @RudiC:)

Don_Cragun · August 27, 2016, 8:57pm

The problem is that even though you said that your input files are tab separated, the sample files you provided here are separated by multiple spaces (and we have no way of knowing where an empty tab-separated field is when there are no tabs. (You may have noticed that your script had a -F'\t' that is missing in RudiC's script. With the sample data you provided (with no tabs), adding that -F'\t' would mean that every line in both sample input files would be a single field.)

If your real data really is tab separated, adding a -F'\t' to the start of RudiC's script might be all that you need.

RudiC · August 28, 2016, 3:59am

In fact, it was Yoda's flawlessly working script, and the field separator got lost in the OP's copy/paste action.