awk to assign points to variables based on conditions and update specific field

I have been reading old posts and trying to come up with a solution for the below: Use a tab-delimited input file to assign
point to variables that are used to update a specific field, Rank . I really couldn't find too much in the way of assigning points
to variable, but made an attempt at an awk to try and accomplish this. I added comments to the code and a description of
what the desired result should be. I apologize for the lengthy post but tried to come up with a solution to my question.
I know that I have made many but I am really trying to learn more and use what I have learned. This question uses the information in the
input file to assign points values to variables, then sum them to update a field. Each condition (there are 6), has a negative or condition that
assign 0 points (or in the case of condition 1, variable points... starting from 5 going down). In my real data there are several hundreds or
thousands of rows but they all follow the same format as below. Thank you :).

Description

Line 1:
condition1 = 0 as FuncrefGene in $7 is not exonic (it is .) and GeneDetailrefGene in $9 has a value > 1 or 2 (it is 84), if $7 was not exonic but $9 was <= 1 or 2 than rank would get 2 points
condition2 = 0 as ExonicFuncrefGene is .
condition3 = 0 as PopFreqMax in $14 is > 0.011
condition4 = 0 as ClinSig in $47 is not Pathogenic or Likely pathogenic (it is a .)
condition5 = 0 as Hgmd in $58 is n or .
condition6 = 1 as Zygosity in $54 is het
0+0+0+0+1 = Rank of 1 in $57

Line 2:
condition1 = 0 as FuncrefGene in $7 is exonic, so this line is skipped as it does not meet this condition and the next condition is processed
condition2 = 5 as ExonicFuncrefGene is stopgain
condition3 = 1 as PopFreqMax in $14 is <= 0.011
condition4 = 2 as ClinSig in $47 is Pathogenic or Likely pathogenic
condition5 = 2 as Hgmd in $58 is not n or .
condition6 = 1 as Zygosity in $54 is het
0+5+1+2+2+1 = Rank of 11 in $57

awk

awk '
BEGIN {    # Set input and output field separators:
    FS = OFS = "\t"
    # Create list of needed field headers:
        nfh["FuncrefGene"]
    nfh["GeneDetailrefGene"]
        nfh["ExonicFuncrefGene"]
    nfh["PopFreqMax"]
    nfh["CLINSIG"]
    nfh["Hgmd"]
    nfh["Zygosity"]
    nfh["Rank"]
}
NR == 1 {
    # Create array to tranlate needed field headers to field numbers
    for(i = 1; i <= NF; i++)
        if($i in nfh)
            f[$i] = i
    # Verify that all of the needed field headers were found
    for(i in nfh)
        if(!(i in f)) {
            missing++
            printf("Needed field missing: %s\n", i)
        }
    # If one or more needed fields were not found, give up
    if(missing)
        exit 1
}
        # Condition 1
NR > 1 {
    if(index$f["FuncrefGene"] == "."  || $f["FuncrefGene"] != "exonic" && ($9 ~ /:NM_/ && match($9,/c..+||-/)) {  # search for :NM_ and c..+||-
    # Get the substring after +/-
    VAL=substr($9,RSTART+1,RLENGTH-2);
                        for(i=1;i<=VAL;i++){  # start loop
            $f["GeneDetailrefGene"] <= 2)  # if GeneDetailRefgene <= 2
                var1=2                                # var1 gets 2 points
        }
        # Condition 1a
    else {
        if($f["FuncrefGene"] == "."  || $f["FuncrefGene"] != "exonic" && ($9 ~ /:NM_/ && match($9,/c..+||-/)) {  # search for :NM_ and c..+||-
    # Get the substring after +/-
    VAL=substr($9,RSTART+1,RLENGTH-2);
                        for(i=1;i<=VAL;i++){  # start loop
            $f["GeneDetailrefGene"] > 2)  # if GeneDetailRefgene > 2
                var1= 0                            # var1 gets 0 points
             }
        # Condition 1b
        else {
        if($f["FuncrefGene"] == "exonic")  # if FuncrefGene is exonic
                var1=0                                # var1 gets 0 points 
                        next                   # skip and process next condition
             }
        else {    #Condition 2
        if($f["ExonicFuncrefGene"] ~ /^stopgain$/ || $f["ExonicFuncrefGene"] ~ /^stoploss$/)  # if ExonicFuncrefGene has stopgain or stoploss in it
                var2=5                                                                                                             # var2 gets 5 points
             }
        else {    #Condition 2a
        if($f["ExonicFuncrefGene"] ~ /^frameshift$/)  # if ExonicFuncrefGene has frameshift in it (could be frameshift insertion/deletion/block substitution)
                var2=4                                                # var2 gets 4 points
             }
        else {    #Condition 2b
        if($f["ExonicFuncrefGene"] ~ /^nonframeshift$/)  # if ExonicFuncrefGene has nonframeshift in it (could be nonframeshift insertion/deletion/block substitution)
                var2=3                                                      # var2 gets 3 points
             }
        else {    #Condition 2c
        if($f["ExonicFuncrefGene"] ~ /^nonsynonymous$/)  # if ExonicFuncrefGene has nonsynonymous in it (could be nonsynonymous SNV/MNV/other)
                var2=2                                                         # var2 gets 2 points
             }
        else {    #Condition 2d
        if($f["ExonicFuncrefGene"] ~ /^synonymous$/)  # if ExonicFuncrefGene has synonymous in it (could be nonsynymous SNV/MNV/other)
                var2 =1                                                   # var2 gets 1 point
             }
        else {    #Condition 2e
        if($f["ExonicFuncrefGene"] == ".")  # if ExonicFuncrefGene is . (dot) 
                var2=0                                 # var2 gets 0 points
             }
    else {    #Condition 3
        if($f["$f["PopFreqMax"] <= 0.01)       # if PopFreqMax <= 0.01
                var3=1                                    # var3 gets 1 point
             }
    else {    #Test #3a#
        if($f["$f["PopFreqMax"] > 0.01)       # if PopFreqMax > 0.01
                var3=0                                  # var3 gets 0 points
             }
    else {    #Condition 4
        if($f["Clinsig"] ~ /^Pathogenic$/ || $f["Clinsig"] ~ /^Likely pathogenic$/)  # if ClinSig has Pathogenic or Likely pathogenic in it
                var4=2                                                                                         # var3 gets 2 points
             }
    else {    #Condition 4a
        if($f["Clinsig"] !~ /^Pathogenic$/ || $f["Clinsig"] !~ /^Likely pathogenic$/)  # if ClinSig does not have Pathogenic or Likely pathogenic in it
                var4=0                                                                                            # var4 gets 0 points
             }
    else {    #Condition 5
        if($f["Hgmd"] !~ /^n$/ || $f["Hgmd"] !~ /^.$/) # if Hgmd does not have n or . (dot) in it
                var5=2                                                   # var5 gets 2 points
    else {    #Condition 5a
        if($f["Hgmd"] ~ /^n$/ || $f["Hgmd"] !~ /^.$/)  # if Hgmd does have n or . (dot) in it
                var5=0                                                   # var5 gets 0 points
             }
    else {    #Condition 6
        if($f["Zygosity"] == "hom")  # if Zygosity is hom
                var6=2                       # var6 gets 2 points
             }
    else {    #Condition 6a
        if($f["Zygosity"] ~ /^het$/ || $f["Zygosity"] ~ /^.$/)  # if Zygosity has het or . (dot) in it
                var6=1                                                            # var6 gets 1 point
             }
     { # Update Rank
    printf($f["Rank"] = var1+var2+var3+var4+var5+var6)  # add all variables and store value in Rank
}1' input.txt > update.txt  # update and define input and output

input.txt tab-delimited

R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    Rank    HGMD    Sanger
2    chr1    948870    948870    C    G    UTR5    ISG15    NM_005101:c.-84C>G    .    .    .    rs4615788    1    0.9    0.72    0.93    1.    0.96    0.98    .    .    .    .    .    .    .    .    0.89    0.74    0.96    0.79    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    GOOD    206    het    12    Likely Benign    .    .    NM_005101:c.-84C>G
1990    chr2    178879181    178879181    G    A    exonic    PDE11A    .    .    stopgain    PDE11A:NM_016953:exon2:c.919C>T:p.R307X,PDE11A:NM_001077197:exon3:c.169C>T:p.R57X    rs76308115    0.0051    0.0018    .    0.0014    .    0.005    0.0031    0.003    0.0009    0.0013    .    0.0043    0.0045    .    0.0011    0.0038    0.0014    0.0051    .    .    .    .    .    .    .    .    D    1    A    .    .    -50.4062    -3.532    Pathogenic    Pigmented_nodular_adrenocortical_disease,_primary,_2    RCV000005604.3    MedGen:OMIM    C1864851:610475    GOOD    86    het    8    VUS    .    .    NM_016953:exon2:c.919C>T:p.R307X

desired update.txt in $57 tab-delimited

R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    Rank    HGMD    Sanger
2    chr1    948870    948870    C    G    UTR5    ISG15    NM_005101:c.-84C>G    .    .    .    rs4615788    1    0.9    0.72    0.93    1.    0.96    0.98    .    .    .    .    .    .    .    .    0.89    0.74    0.96    0.79    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    GOOD    206    het    12    Likely Benign    1    .    NM_005101:c.-84C>G
1990    chr2    178879181    178879181    G    A    exonic    PDE11A    .    .    stopgain    PDE11A:NM_016953:exon2:c.919C>T:p.R307X,PDE11A:NM_001077197:exon3:c.169C>T:p.R57X    rs76308115    0.0051    0.0018    .    0.0014    .    0.005    0.0031    0.003    0.0009    0.0013    .    0.0043    0.0045    .    0.0011    0.0038    0.0014    0.0051    .    .    .    .    .    .    .    .    D    1    A    .    .    -50.4062    -3.532    Pathogenic    Pigmented_nodular_adrenocortical_disease,_primary,_2    RCV000005604.3    MedGen:OMIM    C1864851:610475    GOOD    86    het    8    VUS    11    CM125    NM_016953:exon2:c.919C>T:p.R307X

Did you test your script so far ? have you performed some "unity testing" to validate one by one that your conditions work as you expect ?

In condition 1 , where does this (see in if(index$f["FuncrefGene"]... ) "index$f" come from ?

Why using the dollar sign in variable settings ? when assigning the value 3 to the variable "var2" in awk, shouldn't you use var2=3 rather than $var2=3 ??? same in testing : you are in awk, not in shell so use : if ( var2 == 3 ) { ... } rather than if ( $var2 == 3 ) { ... }

1 Like

if(index$f["FuncrefGene"]

This is $7 or Func.refGene in the tab-delimited input.txt . Thank you :).

I will make the change in the variables, I didn't even catch that. Thank you :).

In condition 2c, what are you trying to do : an assignation ( = ) or a comparison ( == ) ?

Also review your whole code : you stilll have a bunch of dollars that shouldn't be there ( f[... rather than $f[... , f rather than f[$i] ...).

1 Like

Sorry, that's a typo in 2c, should be an assignation... I will cleanup the code. Thank you :).