I am trying to output the matches between $1 of file1 to $3 of file2 into a new file match .
I am also wanting to output the mismatches between those same 2 files and fields to two separate new files called missing from file1 and missing from file2 . The input files are tab-delimited, but the output can be space delimited. The awk below hopefully is a good start. Thank you :).
awk -F'\t' '
NR == FNR {
A[$1]
next
}
{
B[$3]
}
END {
print "Match"
for ( k in A )
{
if ( k && k in B )
print k
}
print "Missing from file1"
for ( k in B )
{
if ( ! ( k in A ) )
print k
}
print "Missing from file2"
for ( k in A )
{
if ( ! ( k in B ) )
print k
}
}
' file1 file2
Hi cmccabe,
I don't see any indication in the match output saying that the empty 1st field in the 5th line of file1 matches the 3rd field of the 3rd and 4th lines of file2 ... Are empty fields supposed to be ignored? If not, how are empty fields supposed t be displayed in the <space> or <tab> separated output?
If a single value appears more than once in file1 or in file2 and the number of times that value appears in one file is not the same as the number of times it appears in the other file, should there just be a single entry in the match output or should there one entry for each matched pair and entries in one of the other files for the number of unpaired entries?
@Yoda and @Don Cragun I modified the awk as my real data has the possibility of the same entry being in one of the files with a different pairing. Also, the nulls can remain blank .
For example in the two files below 48719928 AT - is in both files, however 48719928 A G is missing from file1 . So the awk uses a combination array as a key then looks for that.
I am using $19$21 and $22 of file1 to search $3$5 and $6 of file2. The header row is skipped and it then outputs a new file with what lines match and if they do not match what file the match is missing from. The awk does run but the output produced seems to be incorrect and I am not able to fix it. Thank you :).
file1
Index Chromosomal Position Gene Inheritance mRNA Chromosome Coverage Score A(#F,#R) C(#F,#R) G(#F,#R) T(#F,#R) Ins(#F,#R) Del(#F,#R) SNP Mutation Frequency Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax 1000G2012APR_ALL 1000G2012APR_AFR 1000G2012APR_AMR 1000G2012APR_ASN 1000G2012APR_EUR ESP6500si_ALL ESP6500si_AA ESP6500si_EA CG46 common clinvar clinvarsubmit clinvarreference Homopolymer Splice Pseudogene Classification HGMD Disease Sanger References
98 48719928 FBN1 AD NM_000138.4 15 6786 30.3 1184;2152 0;0 25;23 0;1 0;5 1195;2206 c.7039_7040delAT 50.12 15 48719928 48719929 AT - exonic FBN1 frameshift deletion FBN1:NM_000138.4:exon58:c.7039_7040del:p.M2347fs pathogenic CD020234 Marfan syndrome 1. Korkko (2002) J Med Genet 39: 34 PubMed: 11826022
101 48807637 FBN1 AD NM_000138.4 15 3792 27.7 0;0 0;4 0;0 1227;2561 0;7 0;0 rs4775765 c.[1415G>A]+[1415G>A] 99.89 15 48807637 48807637 C T exonic FBN1 nonsynonymous SNV FBN1:NM_000138.4:exon12:c.G1415A:p.C472Y 1 1 1 1 1 1 . . . 1 likely benign n
Match:
48807637 C T
Missing in file1:
48719928 A G
48719928 AT -
Missing in file2:
48719929 - exonic
desired output
Match 48719928 AT -, 48807637 C T
Missing from file1 48719928 A G
Missing from file2
awk
awk 'FNR==1 { next }
FNR == NR { file1[$19,$21,$22] = $19 " " $21 " " $22 }
FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' file1 file2 > list
As usual - garbage in garbage out. In file1, line 2, field 15 is missing which is rs4775765 in line 3. With a dummy in field 15 in line2, the result of the awk given is:
Match:
48719928 AT -
48807637 C T
Missing in file1:
48719928 A G
Missing in file2:
Why don't you exercise a bit more care when describing your problems and supplying sample data?
Interesting, I suspected maybe it was missing values, but couldn't understand why, if the values used in the lookup key have values, why the awk wasn't working properly. I guess why do all the columns need something in them. I will make sure my data is cleaner and my questions more precise. Thank you @RudiC:)
The problem is that even though you said that your input files are tab separated, the sample files you provided here are separated by multiple spaces (and we have no way of knowing where an empty tab-separated field is when there are no tabs. (You may have noticed that your script had a -F'\t' that is missing in RudiC's script. With the sample data you provided (with no tabs), adding that -F'\t' would mean that every line in both sample input files would be a single field.)
If your real data really is tab separated, adding a -F'\t' to the start of RudiC's script might be all that you need.