@RudiC and @RavinderSingh13, thank you both for all of your help.
It looks like the script reads all the vcf files from REF and puts them in a variable FN. How do the txt files from VAL get used by the awk? The awk looks at each REF file and compares it to each VAL file, looking for what's common and what's different. If a difference is found, it identifies which file the missing data came from. The awk portion works on individual files, but I have over 500 to compare, so a loop would help; however, that is what I need help with :).
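To make the question more concrete, here is a rough sketch of the kind of loop I have in mind. It assumes that a REF file and a VAL file belong together when they share the sample prefix before the first underscore (F13, H19, as in the file names listed below) and that there is one VAL file per sample. The function name pair_files and the script name compare.awk are placeholders I made up, not part of the existing script.

```shell
#!/bin/sh
# Hypothetical helper: print "REF<TAB>VAL" for every reference file that has
# a validation file with the same sample prefix (the text before the first "_").
pair_files() {
    ref_dir=$1
    val_dir=$2
    for ref in "$ref_dir"/*_ref_FP_10bp.txt; do
        [ -e "$ref" ] || continue            # glob matched nothing
        sample=$(basename "$ref")
        sample=${sample%%_*}                 # F13_ref_FP_10bp.txt -> F13
        found=0
        for val in "$val_dir/${sample}"_*.vcf; do
            [ -e "$val" ] || continue
            printf '%s\t%s\n' "$ref" "$val"
            found=1
            break                            # assumption: one VAL file per sample
        done
        [ "$found" -eq 1 ] || echo "no VAL file for $sample" >&2
    done
}

# Possible use with the directories from this post; compare.awk is a
# placeholder name for the existing awk portion:
# pair_files /home/cmccabe/Desktop/comparison/reference/10bp \
#            /home/cmccabe/Desktop/comparison/validation/files |
# while IFS="$(printf '\t')" read -r ref val; do
#     awk -f compare.awk "$ref" "$val" > "${ref%.txt}_comparison.txt"
# done
```

Samples without a matching VAL file are reported on stderr and skipped, so one missing file would not stop the run over the other pairs.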
REF: there are 250 files, all located at /home/cmccabe/Desktop/comparison/reference/10bp, for example:
F13_ref_FP_10bp.txt
H19_ref_FP_10bp.txt
Data structure in REF
Chr Start End Ref Alt Func.refGene Gene.refGene Coverage Score A(#F,#R) C(#F,#R) G(#F,#R) T(#F,#R) Ins(#F,#R) Del(#F,#R) SNP Mutation Frequency Sanger
12 52200340 52200340 A C exonic SCN8A 4129 28.3 1560;1672 413;453 0;0 0;0 0;2 31;0 c.[5070A>C]+[=] 20.97
2 51254914 51254914 C T exonic NRXN1 1562 25.5 0;0 536;218 0;0 574;234 0;0 0;0 c.[498G>A]+[=] 51.73
X 67433722 67433722 C T exonic OPHN1 2747 25.6 0;0 46;37 0;0 1211;1443 1;8 5;5 c.[579G>A]+[579G>A] 96.61
VAL: there are 250 files, all located at /home/cmccabe/Desktop/comparison/validation/files, for example:
F13_epilepsy.vcf
H19_marfan.vcf
Data structure in VAL
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene avsnp147 PopFreqMax 1000G_ALL 1000G_AFR 1000G_AMR 1000G_EAS 1000G_EUR 1000G_SAS ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS ESP6500siv2_ALL ESP6500siv2_AA ESP6500siv2_EA CG46 dpsi_max_tissue dpsi_zscore SIFT_score SIFT_pred Polyphen2_HDIV_score Polyphen2_HDIV_pred Polyphen2_HVAR_score Polyphen2_HVAR_pred LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID Quality Reads Zygosity Phred Classification HGMD Sanger
chr1 43395635 43395635 C T exonic SLC2A1 . synonymous SNV SLC2A1:NM_006516:exon5:c.588G>A:p.P196P rs2229682 0.23 0.12 0.024 0.21 0.08 0.19 0.15 0.18 0.044 0.19 0.074 0.23 0.21 0.19 0.19 0.15 0.049 0.2 0.12 -0.1558 -0.594 . . . . . . . . . . . . Benign not_specified RCV000081436.5 MedGen CN169374 GOOD 399 het 19
chr1 43396414 43396414 G A exonic SLC2A1 . synonymous SNV SLC2A1:NM_006516:exon4:c.399C>T:p.C133C rs11537641 0.24 0.14 0.08 0.21 0.1 0.19 0.16 0.19 0.094 0.2 0.098 0.24 0.21 0.2 0.2 0.16 0.096 0.2 0.14 -0.0227 -0.121 . . . . . . . . . . . . Benign not_specified RCV000081433.6 MedGen CN169374 GOOD 400 het 21
chr1 172410967 172410967 G A exonic PIGC . nonsynonymous SNV PIGC:NM_002642:exon2:c.796C>T:p.P266S,PIGC:NM_153747:exon2:c.796C>T:p.P266S rs1063412 0.66 0.45 0.06 0.54 0.66 0.6 0.57 0.55 0.14 0.64 0.64 0.59 0.58 0.57 0.57 0.42 0.15 0.56 0.41 . . 0.13 T 1.0 D 1.0 D 0.000 D 0.000 P 1.515 L . . . . . GOOD 399 het 19
Desired output (an example, not using these files, that compares a REF file to a VAL file and finds what's in common, what's different, and where the difference comes from; it also includes some additional data from another script):
Match:
Chr Start Ref Alt Func.refGene Gene.refGene Quality Reads Zygosity Phred
chr15 68521889 C T exonic CLN6 GOOD 50 het 4
chr7 147183143 A G intronic CNTNAP2 GOOD 382 het 22
chr2 167099158 A G exonic SCN9A GOOD 210 hom 55
Missing in Reference but found in IDP:
Chr Start Ref Alt Func.refGene Gene.refGene Quality Reads Zygosity Phred
chr2 51666313 T C intergenic NRXN1,NONE GOOD 108 het 7
chr2 166903445 T C exonic SCN1A GOOD 400 het 28
Missing in IDP but found in Reference:
Chr Start Ref Alt Func.refGene Gene.refGene Mutation Call Coverage Score Mutant Allele Frequency A(#F,#R) C(#F,#R) G(#F,#R) T(#F,#R) ins(#F,#R) del(#F,#R) SNP db_ref Region
2 166210776 C T exonic SCN2A c.[2994C>T]+[=] 3095 23.1 24.56 0:0 1158:1177 0;0 457;303 1;0 0;0 No low coverage
7 148106478 - GT intronic CNTNAP2 c.3716-5_3716-4insGT 4168 28.6 51.01 0;0 0;1 0;0 2199;1967 1129;997 0;1 rs60451214 No low
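For reference, the comparison that produces the three sections above could be sketched roughly as below. This is not the existing awk, only my understanding of the logic, and it rests on several assumptions: a variant is identified by Chr, Start, Ref and Alt (fields 1, 2, 4 and 5 in both layouts), the first line of each file is a header, and the "chr" prefix used in VAL is stripped so that "chr2" and "2" compare equal. The function name compare_pair is made up, and the sketch prints whole lines without the per-section column headers.

```shell
#!/bin/sh
# Hypothetical sketch: compare one REF file ($1) with one VAL file ($2).
compare_pair() {
    awk '
        FNR == 1 { file++; next }               # new file: bump counter, skip header line
        NF < 5   { next }                       # ignore blank or truncated lines
        {
            chr = $1; sub(/^chr/, "", chr)      # VAL uses "chr1", REF uses "1"
            key = chr SUBSEP $2 SUBSEP $4 SUBSEP $5
        }
        file == 1  { ref[key] = $0; order[++n] = key; next }             # remember REF lines
        key in ref { matched = matched $0 "\n"; seen[key] = 1; next }    # VAL line also in REF
                   { only_val = only_val $0 "\n" }                       # VAL line not in REF
        END {
            printf "Match:\n%s", matched
            printf "Missing in Reference but found in IDP:\n%s", only_val
            printf "Missing in IDP but found in Reference:\n"
            for (i = 1; i <= n; i++)
                if (!(order[i] in seen)) print ref[order[i]]
        }
    ' "$1" "$2"
}

# Possible call for one pair of the files named in this post:
# compare_pair F13_ref_FP_10bp.txt F13_epilepsy.vcf
```

Only the key fields are split on whitespace, so VAL columns that contain spaces (such as "synonymous SNV") should not affect the comparison.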
I hope this helps, and I apologize for the long post, but I think these are all the details. Thank you :).