I have been reading old posts and trying to come up with a solution for the below: Use a tab-delimited
input file to assign
point to variables that are used to update a specific field, Rank
. I really couldn't find too much in the way of assigning points
to variable, but made an attempt at an awk
to try and accomplish this. I added comments to the code and a description of
what the desired result should be. I apologize for the lengthy post but tried to come up with a solution to my question.
I know that I have made many but I am really trying to learn more and use what I have learned. This question uses the information in the
input file to assign points values to variables, then sum them to update a field. Each condition (there are 6), has a negative or condition that
assign 0 points (or in the case of condition 1, variable points... starting from 5 going down). In my real data there are several hundreds or
thousands of rows but they all follow the same format as below. Thank you :).
Description
Line 1:
condition1 = 0 as FuncrefGene in $7 is not exonic (it is .) and GeneDetailrefGene in $9 has a value > 1 or 2 (it is 84), if $7 was not exonic but $9 was <= 1 or 2 than rank would get 2 points
condition2 = 0 as ExonicFuncrefGene is .
condition3 = 0 as PopFreqMax in $14 is > 0.011
condition4 = 0 as ClinSig in $47 is not Pathogenic or Likely pathogenic (it is a .)
condition5 = 0 as Hgmd in $58 is n or .
condition6 = 1 as Zygosity in $54 is het
0+0+0+0+1 = Rank of 1 in $57
Line 2:
condition1 = 0 as FuncrefGene in $7 is exonic, so this line is skipped as it does not meet this condition and the next condition is processed
condition2 = 5 as ExonicFuncrefGene is stopgain
condition3 = 1 as PopFreqMax in $14 is <= 0.011
condition4 = 2 as ClinSig in $47 is Pathogenic or Likely pathogenic
condition5 = 2 as Hgmd in $58 is not n or .
condition6 = 1 as Zygosity in $54 is het
0+5+1+2+2+1 = Rank of 11 in $57
awk
awk '
BEGIN { # Set input and output field separators:
FS = OFS = "\t"
# Create list of needed field headers:
nfh["FuncrefGene"]
nfh["GeneDetailrefGene"]
nfh["ExonicFuncrefGene"]
nfh["PopFreqMax"]
nfh["CLINSIG"]
nfh["Hgmd"]
nfh["Zygosity"]
nfh["Rank"]
}
NR == 1 {
# Create array to tranlate needed field headers to field numbers
for(i = 1; i <= NF; i++)
if($i in nfh)
f[$i] = i
# Verify that all of the needed field headers were found
for(i in nfh)
if(!(i in f)) {
missing++
printf("Needed field missing: %s\n", i)
}
# If one or more needed fields were not found, give up
if(missing)
exit 1
}
# Condition 1
NR > 1 {
if(index$f["FuncrefGene"] == "." || $f["FuncrefGene"] != "exonic" && ($9 ~ /:NM_/ && match($9,/c..+||-/)) { # search for :NM_ and c..+||-
# Get the substring after +/-
VAL=substr($9,RSTART+1,RLENGTH-2);
for(i=1;i<=VAL;i++){ # start loop
$f["GeneDetailrefGene"] <= 2) # if GeneDetailRefgene <= 2
var1=2 # var1 gets 2 points
}
# Condition 1a
else {
if($f["FuncrefGene"] == "." || $f["FuncrefGene"] != "exonic" && ($9 ~ /:NM_/ && match($9,/c..+||-/)) { # search for :NM_ and c..+||-
# Get the substring after +/-
VAL=substr($9,RSTART+1,RLENGTH-2);
for(i=1;i<=VAL;i++){ # start loop
$f["GeneDetailrefGene"] > 2) # if GeneDetailRefgene > 2
var1= 0 # var1 gets 0 points
}
# Condition 1b
else {
if($f["FuncrefGene"] == "exonic") # if FuncrefGene is exonic
var1=0 # var1 gets 0 points
next # skip and process next condition
}
else { #Condition 2
if($f["ExonicFuncrefGene"] ~ /^stopgain$/ || $f["ExonicFuncrefGene"] ~ /^stoploss$/) # if ExonicFuncrefGene has stopgain or stoploss in it
var2=5 # var2 gets 5 points
}
else { #Condition 2a
if($f["ExonicFuncrefGene"] ~ /^frameshift$/) # if ExonicFuncrefGene has frameshift in it (could be frameshift insertion/deletion/block substitution)
var2=4 # var2 gets 4 points
}
else { #Condition 2b
if($f["ExonicFuncrefGene"] ~ /^nonframeshift$/) # if ExonicFuncrefGene has nonframeshift in it (could be nonframeshift insertion/deletion/block substitution)
var2=3 # var2 gets 3 points
}
else { #Condition 2c
if($f["ExonicFuncrefGene"] ~ /^nonsynonymous$/) # if ExonicFuncrefGene has nonsynonymous in it (could be nonsynonymous SNV/MNV/other)
var2=2 # var2 gets 2 points
}
else { #Condition 2d
if($f["ExonicFuncrefGene"] ~ /^synonymous$/) # if ExonicFuncrefGene has synonymous in it (could be nonsynymous SNV/MNV/other)
var2 =1 # var2 gets 1 point
}
else { #Condition 2e
if($f["ExonicFuncrefGene"] == ".") # if ExonicFuncrefGene is . (dot)
var2=0 # var2 gets 0 points
}
else { #Condition 3
if($f["$f["PopFreqMax"] <= 0.01) # if PopFreqMax <= 0.01
var3=1 # var3 gets 1 point
}
else { #Test #3a#
if($f["$f["PopFreqMax"] > 0.01) # if PopFreqMax > 0.01
var3=0 # var3 gets 0 points
}
else { #Condition 4
if($f["Clinsig"] ~ /^Pathogenic$/ || $f["Clinsig"] ~ /^Likely pathogenic$/) # if ClinSig has Pathogenic or Likely pathogenic in it
var4=2 # var3 gets 2 points
}
else { #Condition 4a
if($f["Clinsig"] !~ /^Pathogenic$/ || $f["Clinsig"] !~ /^Likely pathogenic$/) # if ClinSig does not have Pathogenic or Likely pathogenic in it
var4=0 # var4 gets 0 points
}
else { #Condition 5
if($f["Hgmd"] !~ /^n$/ || $f["Hgmd"] !~ /^.$/) # if Hgmd does not have n or . (dot) in it
var5=2 # var5 gets 2 points
else { #Condition 5a
if($f["Hgmd"] ~ /^n$/ || $f["Hgmd"] !~ /^.$/) # if Hgmd does have n or . (dot) in it
var5=0 # var5 gets 0 points
}
else { #Condition 6
if($f["Zygosity"] == "hom") # if Zygosity is hom
var6=2 # var6 gets 2 points
}
else { #Condition 6a
if($f["Zygosity"] ~ /^het$/ || $f["Zygosity"] ~ /^.$/) # if Zygosity has het or . (dot) in it
var6=1 # var6 gets 1 point
}
{ # Update Rank
printf($f["Rank"] = var1+var2+var3+var4+var5+var6) # add all variables and store value in Rank
}1' input.txt > update.txt # update and define input and output
input.txt tab-delimited
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 PopFreqMax 1000G_ALL 1000G_AFR 1000G_AMR 1000G_EAS 1000G_EUR 1000G_SAS ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS ESP6500siv2_ALL ESP6500siv2_AA ESP6500siv2_EA CG46 SIFT_score SIFT_pred Polyphen2_HDIV_score Polyphen2_HDIV_pred Polyphen2_HVAR_score Polyphen2_HVAR_pred LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred dpsi_max_tissue dpsi_zscore CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID Quality Reads Zygosity Score Classification Rank HGMD Sanger
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101:c.-84C>G . . . rs4615788 1 0.9 0.72 0.93 1. 0.96 0.98 . . . . . . . . 0.89 0.74 0.96 0.79 . . . . . . . . . . . . . . . . . . . GOOD 206 het 12 Likely Benign . . NM_005101:c.-84C>G
1990 chr2 178879181 178879181 G A exonic PDE11A . . stopgain PDE11A:NM_016953:exon2:c.919C>T:p.R307X,PDE11A:NM_001077197:exon3:c.169C>T:p.R57X rs76308115 0.0051 0.0018 . 0.0014 . 0.005 0.0031 0.003 0.0009 0.0013 . 0.0043 0.0045 . 0.0011 0.0038 0.0014 0.0051 . . . . . . . . D 1 A . . -50.4062 -3.532 Pathogenic Pigmented_nodular_adrenocortical_disease,_primary,_2 RCV000005604.3 MedGen:OMIM C1864851:610475 GOOD 86 het 8 VUS . . NM_016953:exon2:c.919C>T:p.R307X
desired update.txt in $57 tab-delimited
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 PopFreqMax 1000G_ALL 1000G_AFR 1000G_AMR 1000G_EAS 1000G_EUR 1000G_SAS ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS ESP6500siv2_ALL ESP6500siv2_AA ESP6500siv2_EA CG46 SIFT_score SIFT_pred Polyphen2_HDIV_score Polyphen2_HDIV_pred Polyphen2_HVAR_score Polyphen2_HVAR_pred LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred dpsi_max_tissue dpsi_zscore CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID Quality Reads Zygosity Score Classification Rank HGMD Sanger
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101:c.-84C>G . . . rs4615788 1 0.9 0.72 0.93 1. 0.96 0.98 . . . . . . . . 0.89 0.74 0.96 0.79 . . . . . . . . . . . . . . . . . . . GOOD 206 het 12 Likely Benign 1 . NM_005101:c.-84C>G
1990 chr2 178879181 178879181 G A exonic PDE11A . . stopgain PDE11A:NM_016953:exon2:c.919C>T:p.R307X,PDE11A:NM_001077197:exon3:c.169C>T:p.R57X rs76308115 0.0051 0.0018 . 0.0014 . 0.005 0.0031 0.003 0.0009 0.0013 . 0.0043 0.0045 . 0.0011 0.0038 0.0014 0.0051 . . . . . . . . D 1 A . . -50.4062 -3.532 Pathogenic Pigmented_nodular_adrenocortical_disease,_primary,_2 RCV000005604.3 MedGen:OMIM C1864851:610475 GOOD 86 het 8 VUS 11 CM125 NM_016953:exon2:c.919C>T:p.R307X