In the awk below I am trying to print the entire line, along with the header row, if $2 is SNV, MNV, or INDEL. If that condition is true and $3 is less than or equal to 0.05, then the sub-pattern :GMAF= is looked up in $7 and the value after the = sign is checked. If that value is less than or equal to 0.01, the entire line is printed along with the header row.
Since it is possible for $2 to be SNV, MNV, or INDEL while $7 is blank or null, I am not sure how to capture that case as well. Line 1 is an example of this. The assumption is that if there is no value in $7, this is the same as zero, so the variant may be significant and is extracted. I am also not sure how to include the header row minus the # in the print. The ----- are not part of the file; they are just there to indicate the header. I added comments to each line as well. Thank you :).
file.tsv tab-delimited
##reference=hg19
##referenceURI=hg19
# locus type pvalue coverage gene transcript 5000Exomes function ----- header row
chr4:153271308 SNV 1.30E-20 2000 FBXW7 NM_033632.3 intronic
chr1:123456 SNV 0 1800 APC NM_0000 AMAF=0.0041:EMAF=0.0:GMAF=0.0014 exonic
chr2:78555 REF 0 1900 APC NM_0000
chr1:123456 MNV 0 2000 APC NM_0000 AMAF=0.2195:EMAF=0.1378:GMAF=0.1655 exonic
current output
locus type pvalue coverage gene transcript 5000Exomes function ----- header row
chr4:153271308 SNV 1.30E-20 2000 FBXW7 NM_033632.3 intronic
chr1:123456 MNV 0 2000 APC NM_0000 AMAF=0.2195:EMAF=0.1378:GMAF=0.1655 exonic
desired output tab-delimited
locus type pvalue coverage gene transcript 5000Exomes function
chr4:153271308 SNV 1.30E-20 2000 FBXW7 NM_033632.3 intronic
chr1:123456 SNV 0 1800 APC NM_0000 AMAF=0.0041:EMAF=0.0:GMAF=0.0014 exonic
awk
awk 'NR<3{next} # start processing in row 3
NR==3{print gensub(/^# /,"","1");next} # print the third line (header) by removing the leading # and whitespace
$2 == "SNV" || $2 == "MNV" || $2 == "INDEL" && $3 <=0.05 { # if $2 and $3 meet the criteria
if (NF!=7) {val=gensub(/.*GMAF=(.[^:]*).*/,"\\1","g",$7); # isolate the value of GMAF with regex and missing lines
if (val<=0.01) next} print }' file.tsv > out.txt # compare and print
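
For comparison, here is a minimal sketch of how the filter described above might be written. It is not a definitive fix, and it rests on a few assumptions that are not confirmed by the post: the file really uses tab separators, a blank 5000Exomes column is present as an empty tab-delimited field, and a line with no GMAF value is treated as zero. It sets FS to a tab so that an empty $7 does not shift the later columns. It parenthesizes the three type tests because && binds tighter than ||. It prints a line when GMAF is at most 0.01, whereas the attempt above skips those lines with next. The printf lines only recreate the sample data so the sketch can be run on its own.

```shell
# Recreate the sample input with real tab separators (printf expands \t).
# The first data row has an empty 7th field, so it contains two consecutive tabs.
printf '%s\n' '##reference=hg19' '##referenceURI=hg19' > file.tsv
printf '# locus\ttype\tpvalue\tcoverage\tgene\ttranscript\t5000Exomes\tfunction\n' >> file.tsv
printf 'chr4:153271308\tSNV\t1.30E-20\t2000\tFBXW7\tNM_033632.3\t\tintronic\n' >> file.tsv
printf 'chr1:123456\tSNV\t0\t1800\tAPC\tNM_0000\tAMAF=0.0041:EMAF=0.0:GMAF=0.0014\texonic\n' >> file.tsv
printf 'chr2:78555\tREF\t0\t1900\tAPC\tNM_0000\n' >> file.tsv
printf 'chr1:123456\tMNV\t0\t2000\tAPC\tNM_0000\tAMAF=0.2195:EMAF=0.1378:GMAF=0.1655\texonic\n' >> file.tsv

awk '
BEGIN { FS = OFS = "\t" }                     # split on tabs so an empty $7 keeps its position
NR < 3 { next }                               # skip the two ## metadata lines
NR == 3 { sub(/^#[ \t]*/, ""); print; next }  # header: drop the leading # and any whitespace after it
($2 == "SNV" || $2 == "MNV" || $2 == "INDEL") && $3 + 0 <= 0.05 {
    gmaf = 0                                  # assumption: no GMAF value counts as zero
    if (match($7, /GMAF=[^:]*/))              # RSTART and RLENGTH mark the "GMAF=value" span
        gmaf = substr($7, RSTART + 5, RLENGTH - 5) + 0
    if (gmaf <= 0.01) print                   # keep only lines with a rare (or missing) GMAF
}' file.tsv > out.txt

cat out.txt
```

This version uses match() and substr() rather than gensub(), so it should not depend on gawk. If the real header is separated from the # by something other than spaces or tabs, the sub() pattern would need adjusting.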