I am trying to use awk to update the tab-delimited file below based on 5 different rules/conditions. The final output is also tab-delimited, and each line in the file will meet one of the conditions. My attempt is below as well, though I am not very confident in it. Thank you :).
Condition 1: The field Classification has a default value of "VUS" for all lines in the file.
Condition 2: The CLINSIG field updates Classification with its own value if that value has a length < 12; otherwise "Conflicting" is the result. Since this field can hold multiple strings, separated by the | symbol, I based the limit on the longest single value, "Likely Benign": if the value in the field exceeds 12 characters, then "Conflicting" is the result.
Condition 3: If Func.IDP.refGene = UTR, then the value of Classification is "Likely Benign", unless CLINSIG already had a value.
Condition 4: If PopFreqMax > .01, then Classification is "Likely Benign", else it is "VUS", unless CLINSIG already had a value.
Condition 5: If Func.IDP.refGene = splicing AND GeneDetail.IDP.refGene has a +/- value > 10, then the value of Classification is "Likely Benign", unless CLINSIG already had a value.
file
R_Index Chr Start End Ref Alt Func.IDP.refGene GeneDetail.IDP.refGene AAChange.IDP.refGene PopFreqMax CLINSIG CLNDBN Classification Quality
1 chr1 40562993 40562993 T C UTR5 NM_000310.3:c.-83A>G . 0.9 . . . 15
2 chr5 125887685 125887685 C T splicing NM_001201377.1:exon14:c.1233+28G>A . 0.82 . . . 10
3 chr16 2105400 2105400 C T splicing NM_000548.4:exon6:c.482-3C>T . 0.21 not provided|not provided|not provided|not provided|other|Benign TSC . 25
4 chr16 2110805 2110805 G A exonic . TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q .004 Pathogenic TSC . 40
Description of fields
awk 'NR==1{for(i=1;i<=NF;i++){print "Number of field in terms of NF is--> NF-" NF-i", value is-->" $i}}' file
Number of field in terms of NF is--> NF-13, value is-->R_Index
Number of field in terms of NF is--> NF-12, value is-->Chr
Number of field in terms of NF is--> NF-11, value is-->Start
Number of field in terms of NF is--> NF-10, value is-->End
Number of field in terms of NF is--> NF-9, value is-->Ref
Number of field in terms of NF is--> NF-8, value is-->Alt
Number of field in terms of NF is--> NF-7, value is-->Func.IDP.refGene
Number of field in terms of NF is--> NF-6, value is-->GeneDetail.IDP.refGene
Number of field in terms of NF is--> NF-5, value is-->AAChange.IDP.refGene
Number of field in terms of NF is--> NF-4, value is-->PopFreqMax
Number of field in terms of NF is--> NF-3, value is-->CLINSIG
Number of field in terms of NF is--> NF-2, value is-->CLNDBN
Number of field in terms of NF is--> NF-1, value is-->Classification
Number of field in terms of NF is--> NF-0, value is-->Quality
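Counting back from NF works, but the same columns can also be looked up by header name, which may keep the rules easier to read. This is a minimal sketch, assuming the header is the first line and the file is tab-delimited. The array name col and the two-line sample written to a temporary file are made up for illustration:

```shell
# two-line sample (header plus the first record); a temp file stands in for the real file
sample=$(mktemp)
printf 'R_Index\tChr\tStart\tEnd\tRef\tAlt\tFunc.IDP.refGene\tGeneDetail.IDP.refGene\tAAChange.IDP.refGene\tPopFreqMax\tCLINSIG\tCLNDBN\tClassification\tQuality\n' > "$sample"
printf '1\tchr1\t40562993\t40562993\tT\tC\tUTR5\tNM_000310.3:c.-83A>G\t.\t0.9\t.\t.\t.\t15\n' >> "$sample"

by_name=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # header name -> column number
  { print $(col["Func.IDP.refGene"]), $(col["PopFreqMax"]) }
' "$sample")
echo "$by_name"   # UTR5 0.9
rm -f "$sample"
```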
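# default classification to "VUS"
awk -F'\t' -v OFS='\t' 'NR>1{$(NF-1)="VUS"} 1' file > vus
# check clinvar: copy CLINSIG if it is set (not ".") and shorter than 12 characters, else "Conflicting"
awk -F'\t' -v OFS='\t' 'NR>1 && $(NF-3)!="."{$(NF-1)=(length($(NF-3))<12 ? $(NF-3) : "Conflicting")} 1' vus > clinvar
# UTR check: only when CLINSIG is empty (".")
awk -F'\t' -v OFS='\t' 'NR>1 && $(NF-3)=="." && $(NF-7)~/^UTR/{$(NF-1)="Likely Benign"} 1' clinvar > utr
# check PopFreq: only when CLINSIG is empty (".")
awk -F'\t' -v OFS='\t' 'NR>1 && $(NF-3)=="." && $(NF-4)+0>.01{$(NF-1)="Likely Benign"} 1' utr > popfreq
# splicing check: number after the +/- sign must be greater than 10, only when CLINSIG is empty (".")
awk -F'\t' -v OFS='\t' 'NR>1 && $(NF-3)=="." && $(NF-7)=="splicing" && match($(NF-6),/[0-9][+-][0-9]+/) && substr($(NF-6),RSTART+2,RLENGTH-2)+0>10{$(NF-1)="Likely Benign"} 1' popfreq > final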
desired output
R_Index Chr Start End Ref Alt Func.IDP.refGene GeneDetail.IDP.refGene AAChange.IDP.refGene PopFreqMax CLINSIG CLNDBN Classification Quality
1 chr1 40562993 40562993 T C UTR5 NM_000310.3:c.-83A>G . 0.9 . . Likely Benign 15
2 chr5 125887685 125887685 C T splicing NM_001201377.1:exon14:c.1233+28G>A . 0.82 . . Likely Benign 10
3 chr16 2105400 2105400 C T splicing NM_000548.4:exon6:c.482-3C>T . 0.21 not provided|not provided|not provided|not provided|other|Benign TSC Conflicting 25
4 chr16 2110805 2110805 G A exonic . TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q .004 Pathogenic TSC Pathogenic 40
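For comparison, the five conditions can also be sketched as a single awk pass that reproduces the desired output above. It assumes that "." means an empty CLINSIG, that a set CLINSIG always wins, that UTR matches any value starting with UTR (such as UTR5), and that "+/- > 10" means the intronic offset. The temporary file and the variable names are made up for illustration. Note that "Likely Benign" itself is 13 characters, so a lone "Likely Benign" in CLINSIG would come out as "Conflicting" under the < 12 rule; the limit may need to be raised.

```shell
# sample data from the question; ";" is turned into tabs so the temp file is tab-delimited
tmp=$(mktemp)
tr ';' '\t' > "$tmp" <<'EOF'
R_Index;Chr;Start;End;Ref;Alt;Func.IDP.refGene;GeneDetail.IDP.refGene;AAChange.IDP.refGene;PopFreqMax;CLINSIG;CLNDBN;Classification;Quality
1;chr1;40562993;40562993;T;C;UTR5;NM_000310.3:c.-83A>G;.;0.9;.;.;.;15
2;chr5;125887685;125887685;C;T;splicing;NM_001201377.1:exon14:c.1233+28G>A;.;0.82;.;.;.;10
3;chr16;2105400;2105400;C;T;splicing;NM_000548.4:exon6:c.482-3C>T;.;0.21;not provided|not provided|not provided|not provided|other|Benign;TSC;.;25
4;chr16;2110805;2110805;G;A;exonic;.;TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q;.004;Pathogenic;TSC;.;40
EOF

result=$(awk -F'\t' -v OFS='\t' '
  NR == 1 { for (i = 1; i <= NF; i++) c[$i] = i; print; next }   # header name -> column number
  {
    clinsig = $(c["CLINSIG"]); region = $(c["Func.IDP.refGene"]); detail = $(c["GeneDetail.IDP.refGene"])
    class = "VUS"                                                # condition 1: default
    if (clinsig != ".")                                          # condition 2: CLINSIG wins when set
      class = (length(clinsig) < 12) ? clinsig : "Conflicting"
    else if (region ~ /^UTR/)                                    # condition 3
      class = "Likely Benign"
    else if ($(c["PopFreqMax"]) + 0 > 0.01)                      # condition 4
      class = "Likely Benign"
    else if (region == "splicing" && match(detail, /[0-9][+-][0-9]+/) &&
             substr(detail, RSTART + 2, RLENGTH - 2) + 0 > 10)   # condition 5
      class = "Likely Benign"
    $(c["Classification"]) = class
    print
  }
' "$tmp")
rm -f "$tmp"

# show R_Index, Classification and Quality only
printf '%s\n' "$result" | cut -f1,13,14
```

Because the checks form one if/else chain, the first matching condition decides the result, so the intermediate files from the step-by-step version are not needed.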