awk to filter file based on separate conditions

The awk below filters a tab-delimited file of 30,000 lines. What I am having trouble with is adding a condition for SVTYPE=CNV
so that the line is only printed if CI= is > .05.

The other condition to add is for SVTYPE=Fusion: in order to print that line,
READ_COUNT must be > 10. Thank you :).

file

chr1    11184539    MTOR    A    <CNV>    100.0    PASS    FR=.;PRECISE=FALSE;SVTYPE=CNV;END=11217311;LEN=32772;NUMTILES=4;SD=0.18;CDF_MAPD=0.01:1.373797,0.025:1.472018,0.05:1.562112,0.1:1.67288,0.2:1.817619,0.25:1.875834,0.5:2.13,0.75:2.418604,0.8:2.496068,0.9:2.71203,0.95:2.904337,0.975:3.082096,0.99:3.302454;REF_CN=2;CI=0.05:1.56211,0.95:2.90434;RAW_CN=2.13;FUNC=[{'gene':'MTOR'}]    GT:GQ:CN    ./.:0:2.13
chr1    11810242    AGTRAP-BRAF.A5B8.COSF828.1_1    G    G]chr7:140494267]    .    FAIL    SVTYPE=Fusion;READ_COUNT=0;GENE_NAME=AGTRAP;EXON_NUM=5;RPM=0.0000;NORM_COUNT=0.0;ANNOTATION=COSF828;FAIL_REASON=READ_COUNT<=40|NORM_COUNT<=0.0;FUNC=[{'gene':'AGTRAP','exon':'5'}]    GT:GQ    ./.:.
chr7:140494267]    .    PASS     SVTYPE=Fusion;READ_COUNT=16;GENE_NAME=AGTRAP;EXON_NUM=5;RPM=0.0000;NORM_COUNT=0.0;ANNOTATION=COSF828;FAIL_REASON=|NORM_COUNT<=0.0;FUNC=[{'gene':'AGTRAP','exon':'5'}]     GT:GQ    ./.:.

desired output

chr7:140494267]    .    PASS     SVTYPE=Fusion;READ_COUNT=16;GENE_NAME=AGTRAP;EXON_NUM=5;RPM=0.0000;NORM_COUNT=0.0;ANNOTATION=COSF828;FAIL_REASON=|NORM_COUNT<=0.0;FUNC=[{'gene':'AGTRAP','exon':'5'}]     GT:GQ    ./.:.

awk

awk -F'\t' -v OFS='\t' '/SVTYPE=/{print}' file

Hello cmccabe,

Could you please try the following? You could set tab delimiters by using -F"\t" and OFS="\t" if needed.

awk '{match($0,/SVTYPE=[^;]*/);SVTYPE_VALUE=substr($0,RSTART+7,RLENGTH-7);match($0,/READ_COUNT[^;]*/);READ_COUNT_VALUE=substr($0,RSTART+11,RLENGTH-11);match($0,/CI=[^:]*/);CI_VALUE=substr($0,RSTART+3,RLENGTH-3);if(SVTYPE_VALUE == "CNV" && CI_VALUE+0 > 0.05){print};if(SVTYPE_VALUE == "Fusion" && READ_COUNT_VALUE+0 > 10){print}}'  Input_file

EDIT: Adding a non-one-liner form of the solution too.

awk '{
        match($0,/SVTYPE=[^;]*/);
        SVTYPE_VALUE=substr($0,RSTART+7,RLENGTH-7);
        match($0,/READ_COUNT[^;]*/);
        READ_COUNT_VALUE=substr($0,RSTART+11,RLENGTH-11);
        match($0,/CI=[^:]*/);
        CI_VALUE=substr($0,RSTART+3,RLENGTH-3);
        if(SVTYPE_VALUE == "CNV" && CI_VALUE+0 > 0.05){
                print
        };
        if(SVTYPE_VALUE == "Fusion" && READ_COUNT_VALUE+0 > 10){
                print
        }
     }
    '   Input_file
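
The extraction step in the code above relies on match(), which sets RSTART (where the match begins) and RLENGTH (how long it is), so that substr() can cut off the "KEY=" prefix. As a minimal sketch for trying that step on its own, the shell function below wraps it; get_tag is a hypothetical helper name, not part of the solution above, and it assumes the tags are separated by ";".

```shell
# Hypothetical helper: print the value of one KEY=VALUE tag from a ";"-separated string.
# Usage: get_tag KEY "string"
get_tag() {
  printf '%s\n' "$2" | awk -v key="$1" '
    {
      # Pattern: the key, "=", then everything up to the next ";"
      if (match($0, key "=[^;]*")) {
        skip = length(key) + 1                      # length of the "KEY=" prefix
        print substr($0, RSTART + skip, RLENGTH - skip)
      }
    }'
}

get_tag READ_COUNT "SVTYPE=Fusion;READ_COUNT=16;GENE_NAME=AGTRAP"   # prints 16
```

Adding +0 to the extracted text, as done in the conditions above, makes awk compare it as a number rather than as a string.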
 

Thanks,
R. Singh


Another option to try:

awk -F'[\t;]' '
  {
    split(x,V)
    for(i=1; i<=NF; i++) {
      split($i,F,/=/)
      V[F[1]]=F[2]
    }
  }
  (V["SVTYPE"]=="CNV"    && V["CI"]+0 > .05) || 
  (V["SVTYPE"]=="Fusion" && V["READ_COUNT"]+0 > 10)
' file
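
To see how the key/value approach above behaves, here is a rough sketch that runs the same logic on three shortened versions of the posted records, written with real tabs via printf (the pasted sample may show spaces instead). The names filter_sv and sample are hypothetical wrappers added for illustration, and split("", V) is used here as one common way to clear the array for each line.

```shell
# Hypothetical wrapper around the filter; reads tab-delimited records on stdin.
filter_sv() {
  awk -F'[\t;]' '
    {
      split("", V)                 # clear the map left over from the previous line
      for (i = 1; i <= NF; i++) {
        split($i, F, /=/)          # F[1] is the key, F[2] the value (if any)
        V[F[1]] = F[2]
      }
    }
    (V["SVTYPE"] == "CNV"    && V["CI"] + 0 > .05) ||
    (V["SVTYPE"] == "Fusion" && V["READ_COUNT"] + 0 > 10)
  '
}

# Shortened versions of the posted records (assumption: fields are tab-separated)
sample() {
  printf 'chr1\t11184539\tMTOR\tA\t<CNV>\t100.0\tPASS\tSVTYPE=CNV;REF_CN=2;CI=0.05:1.56211,0.95:2.90434;RAW_CN=2.13\n'
  printf 'chr1\t11810242\tAGTRAP-BRAF.A5B8.COSF828.1_1\tG\tG]chr7:140494267]\t.\tFAIL\tSVTYPE=Fusion;READ_COUNT=0;GENE_NAME=AGTRAP\n'
  printf 'chr7:140494267]\t.\tPASS\tSVTYPE=Fusion;READ_COUNT=16;GENE_NAME=AGTRAP\n'
}

sample | filter_sv   # only the READ_COUNT=16 record is printed
```

The CNV record is dropped because "0.05:1.56211,0.95:2.90434"+0 evaluates to 0.05, which is not greater than .05.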

Thank you both very much :).