awk to print line based on two keywords

I am starting to write a multi-line awk script. Using the tab-delimited
file below, I want to print only the line with oncomineGeneClass
and oncomineVariantClass and PASS. The script executes but
seems to print the entire file, not the desired line. Thank you :).

file

##FORMAT=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##FORMAT=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    file
SVTYPE=Fusion;READ_COUNT=1868;GENE_NAME=ETV6;EXON_NUM=4;RPM=1.5825e-09;NORM_COUNT=0.001582480886121524;ANNOTATION=COSF823;FUNC=[{'gene':'ETV6','exon':'4','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Fusion'}]    GT:GQ    ./.:.
chr15    88483984    ETV6-NTRK3.E4N15.COSF823.1_2    T    ]chr12:12006495]T    .    PASS
SVTYPE=Fusion;READ_COUNT=1868;GENE_NAME=NTRK3;EXON_NUM=15;RPM=1.5825e-09;NORM_COUNT=0.001582480886121524;ANNOTATION=COSF823;FUNC=[{'gene':'NTRK3','exon':'15','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Fusion'}]    GT:GQ    ./.:.
chr12    12022903    ETV6-NTRK3.E5N14_1    G    G]chr15:88576276]    .    FAIL
chr17    7577108    COSM10749;COSM43737    C    A,T    149.594    PASS    AF=0.0830415,0.0;AO=372,2;DP=4420;FAO=166,0;FDP=1999;FR=.,.,REALIGNEDx0.0865;FRO=1833;FSAF=82,0;FSAR=84,0;FSRF=952;FSRR=881;FWDB=0.0072184,-0.0207142;FXX=4.99998E-4;HRUN=1,1;LEN=1,1;MLLD=293.795,80.5366;OALT=A,T;OID=COSM10749,COSM43737;OMAPALT=A,T;OPOS=7577108,7577108;OREF=C,C;PB=.,.;PBP=.,.;QD=0.299338;RBI=0.00721997,0.02565;REFB=1.40155E-4,-7.81395E-4;REVB=1.50579E-4,0.0151276;RO=4043;SAF=187,1;SAR=185,1;SRF=2118;SRR=1925;SSEN=0,0;SSEP=0,0;SSSB=-0.0251826,-5.12306E-4;STB=0.52327,0.5;STBP=0.541,1.0;TYPE=snp,snp;VARB=-0.00153404,0.0;HS;FUNC=[{'origPos':'7577108','origRef':'C','normalizedRef':'C','gene':'TP53','normalizedPos':'7577108','normalizedAlt':'A','polyphen':'1.0','gt':'pos','codon':'TTT','coding':'c.830G>T','sift':'0.0','grantham':'205.0','transcript':'NM_000546.5','function':'missense','protein':'p.Cys277Phe','location':'exonic','origAlt':'A','exon':'8','oncomineGeneClass':'Loss-of-Function','oncomineVariantClass':'Hotspot'}]    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT    0/1:149:4420:1999:4043:1833:372,2:166,0:0.0830415,0.0:185,1:187,1:2118:1925:84,0:82,0:952:881:1
chr17    27400788    TIAF1    G    <CNV>    100.0    PASS    HS;FR=.;PRECISE=FALSE;SVTYPE=CNV;END=27495549;LEN=94761;NUMTILES=22;SD=0.33;CDF_MAPD=0.01:1.251248,0.025:1.295465,0.05:1.33475,0.1:1.381537,0.2:1.440411,0.25:1.46343,0.5:1.56,0.75:1.662943,0.8:1.689518,0.9:1.761516,0.95:1.823262,0.975:1.878554,0.99:1.944939;REF_CN=2;CI=0.05:1.33475,0.95:1.82326;RAW_CN=1.56;FUNC=[{'gene':'TIAF1'}]    GT:GQ:CN    ./.:0:1.56

awk

awk -F'\t' '{ # call awk and set FS as tab
        match($0,/oncomineGeneClass=[^:]*/ && /oncomineVariantClass=[^:]*/ && "PASS"); { # match lines on oncomineVariantClass and PASS
        print  # print line
 }
} 
' file   # define input

desired output

SVTYPE=Fusion;READ_COUNT=1868;GENE_NAME=ETV6;EXON_NUM=4;RPM=1.5825e-09;NORM_COUNT=0.001582480886121524;ANNOTATION=COSF823;FUNC=[{'gene':'ETV6','exon':'4','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Fusion'}]     GT:GQ    ./.:.
chr15    88483984    ETV6-NTRK3.E4N15.COSF823.1_2    T    ]chr12:12006495]T    .    PASS

Hello,
The problem is the regular expression: in the input file the keys are enclosed in single quotes (') and followed by a colon, not an equals sign.
This is not very elegant code, but it works:

awk -F'\t' '
BEGIN{ a="\x27oncomineGeneClass\x27:";
       b="\x27oncomineVariantClass\x27:";
       c="PASS"; 
    }
    { if ( match($0, a) && match($0, b) && match($0, c) )
         print;
    }
' file   # define input

Greetings!


So, in other words: match(s, r) takes a single regular expression, not multiple re's.
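To see why the original attempt printed every line, here is a minimal sketch on a hypothetical two-line sample (not the real file). As far as I can tell, two things go wrong: the `&&` expression is evaluated first and collapses to 0 or 1 before match() ever sees it, and the `;` after match() ends that statement, so the `{ print }` block that follows runs for every record.

```shell
# Hypothetical sample, used only for illustration.
sample='one PASS
two FAIL'

# Same shape as the original attempt: the result of match() is discarded,
# and the inner { print } block is unconditional.
broken=$(printf '%s\n' "$sample" | awk '{ match($0, /one/ && /PASS/); { print } }')

# Each regex is tested against $0 on its own; && combines the results.
fixed=$(printf '%s\n' "$sample" | awk '/one/ && /PASS/')

printf 'broken form prints:\n%s\n' "$broken"
printf 'fixed form prints:\n%s\n' "$fixed"
```

The broken form echoes both sample lines, while the fixed form prints only the line that satisfies both tests.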

--
Another example:

awk '$7=="PASS" && /oncomineGeneClass/ && /oncomineVariantClass/' file

Your desired output is comprised of two different records, therefore it cannot be handled by the logic you are using.

Unfortunately, your single sample introduces ambiguity, which makes it hard to guess a possible alternative.
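One way to check this on your own data is to print, per record, the field count and whether each condition holds. The sketch below uses a hypothetical two-record sample shaped like the first two desired lines (the INFO text is shortened); it is a diagnostic, not a solution.

```shell
# Hypothetical sample: the keywords sit on one record, PASS on the next.
sample=$(printf "SVTYPE=Fusion;FUNC=[{'oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Fusion'}]\tGT:GQ\t./.:.\nchr15\t88483984\tETV6-NTRK3.E4N15.COSF823.1_2\tT\t]chr12:12006495]T\t.\tPASS\n")

# For each record, report the field count, whether $7 is PASS,
# and whether both keywords occur somewhere in the record.
report=$(printf '%s\n' "$sample" | awk -F'\t' '{
    printf "record %d: fields=%d pass=%d keys=%d\n", NR, NF, ($7 == "PASS"), (/oncomineGeneClass/ && /oncomineVariantClass/)
}')

printf '%s\n' "$report"
```

On this sample no single record has both pass=1 and keys=1, which is why a per-record test cannot select both lines.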


Thanks, @Scrutinizer.

If I want to match the re's with single quotes and a colon,

awk '$7=="PASS" && /\'oncomineGeneClass\':/ && /\'oncomineVariantClass\':/' file

That doesn't work. How should it be modified?

Regards.


You could place the code in a file such as file.awk to avoid the shell quoting.
You could also use the octal escape \47 for the single quote:

awk '$7=="PASS" && /\47oncomineGeneClass\47:/ && /\47oncomineVariantClass\47:/' file
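A quick way to convince yourself is to run it on a single record. The sketch below uses a hypothetical one-line sample with shortened fields, and also shows the common `'\''` shell idiom (close the quote, add an escaped quote, reopen), which is an addition of mine rather than something suggested above.

```shell
# Hypothetical record: PASS in field 7, keys wrapped in single quotes.
sample=$(printf "chr17\t7577108\tCOSM10749\tC\tA\t149.594\tPASS\tFUNC=[{'oncomineGeneClass':'Loss-of-Function','oncomineVariantClass':'Hotspot'}]\n")

# Octal escape: \47 is the single quote inside an awk regex.
octal=$(printf '%s\n' "$sample" | awk '$7=="PASS" && /\47oncomineGeneClass\47:/ && /\47oncomineVariantClass\47:/')

# Shell idiom: '\'' inserts a literal single quote into the awk program.
idiom=$(printf '%s\n' "$sample" | awk '$7=="PASS" && /'\''oncomineGeneClass'\'':/ && /'\''oncomineVariantClass'\'':/')

printf '%s\n' "$octal"
printf '%s\n' "$idiom"
```

Both commands should print the sample record unchanged.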

You could also use:

awk -v p1="PASS" -v p2="'oncomineGeneClass':" -v p3="'oncomineVariantClass':" '$0 ~ p1 && $0 ~ p2 && $0 ~ p3' file

Thank you all :)

For completeness, some other ways to avoid shell quoting:

awk '$7=="PASS" && $0~ q "oncomineGeneClass" q ":" && $0~ q "oncomineVariantClass" q ":"' q=\' file

Reverse shell quoting:

awk '$7=="PASS" && '"/'oncomineGeneClass':/ && /'oncomineVariantClass':/" file

Put it in a separate file, like aia suggested:

$ cat file.awk
$7=="PASS" && /'oncomineGeneClass':/ && /'oncomineVariantClass':/
$ awk -f file.awk file

or

script=$(cat << "EOS"
$7=="PASS" && /'oncomineGeneClass':/ && /'oncomineVariantClass':/
EOS
)
awk "$script" file

or with process substitution if your shell supports it:

awk -f <(
cat << "EOS"
$7=="PASS" && /'oncomineGeneClass':/ && /'oncomineVariantClass':/
EOS
) file