awk to retain header lines in output

cmccabe · February 27, 2019, 12:56pm

The awk below executes and produces the current output, which is correct, except that I cannot seem to include the header lines (# and ##) in the output as well. I tried adding !/^#/, thinking that it would skip the lines starting with # and output them, but then the entire file prints as is. Thank you :).

file

##bcftools_normVersion=1.9+htslib-1.9
##bcftools_normCommand=norm --do-not-normalize -m -both /path/to/xxxxx.vcf; Date=Tue Feb 26 12:59:30 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	xxxx
chr1	11174372	MTOR	A	<CNV>	100	PASS	FR=.;PRECISE=FALSE;SVTYPE=CNV;END=11217311;LEN=42939;NUMTILES=7;SD=0.47;CDF_MAPD=0.01:1.480581,0.025:1.544948,0.05:1.602554,0.1:1.671659,0.2:1.759366,0.25:1.793881,0.5:1.94,0.75:2.098021,0.8:2.139179,0.9:2.251416,0.95:2.348502,0.975:2.436069,0.99:2.541976;REF_CN=2;CI=0.05:1.60255,0.95:2.3485;RAW_CN=1.94;FUNC=[{'gene':'MTOR'}]	GT:GQ:CN	./.:0:1.94
chr1	11174383	COSM1161896	A	G	264.674	PASS	AF=0;AO=0;DP=4229;FAO=0;FDP=2000;FDVR=5;FR=.;FRO=2000;FSAF=0;FSAR=0;FSRF=1166;FSRR=834;FWDB=-0.0180893;FXX=0;HRUN=1;HS_ONLY=0;LEN=1;MLLD=127.881;OALT=G;OID=COSM1161896;OMAPALT=G;OPOS=11174383;OREF=A;PB=.;PBP=.;QD=0.529347;RBI=0.0955223;REFB=9.08841e-06;REVB=0.0937938;RO=4212;SAF=0;SAR=0;SRF=2442;SRR=1770;SSEN=0;SSEP=0;SSSB=3.6281e-08;STB=0.5;STBP=1;TYPE=snp;VARB=0;HS;FUNC=[{'transcript':'NM_004958.3','gene':'MTOR','location':'exonic','exon':'53'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:264:4229:2000:4212:2000:0:0:0:0:0:2442:1770:0:0:1166:834
chr1	43814978	COSM1342796;COSM86963	A	G	231.262	PASS	AF=0.0010005;AO=4;DP=3351;FAO=2;FDP=1999;FDVR=10;FR=.,.;FRO=1997;FSAF=1;FSAR=1;FSRF=944;FSRR=1053;FWDB=0.00987233;FXX=0.000499998;HRUN=1;HS_ONLY=0;LEN=1,1;MLLD=106.81;OALT=G,T;OID=COSM1342796,COSM86963;OMAPALT=G,T;OPOS=43814978,43814978;OREF=A,A;PB=.;PBP=.;QD=0.462755;RBI=0.014386;REFB=4.80559e-05;REVB=0.010464;RO=3338;SAF=1;SAR=3;SRF=1576;SRR=1762;SSEN=0;SSEP=0;SSSB=-0.0113679;STB=0.526994;STBP=0.848;TYPE=snp;VARB=-0.0370454;HS;FUNC=[{'transcript':'NM_005373.2','gene':'MPL','location':'exonic','exon':'10'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:231:3351:1999:3338:1997:4:2:0.0010005:3:1:1576:1762:1:1:944:1053
chr1	43814978	COSM1342796;COSM86963	A	G	231.262	PASS	AF=0.05;AO=4;DP=3351;FAO=2;FDP=1999;FDVR=10;FR=.,.;FRO=1997;FSAF=1;FSAR=1;FSRF=944;FSRR=1053;FWDB=0.00987233;FXX=0.000499998;HRUN=1;HS_ONLY=0;LEN=1,1;MLLD=106.81;OALT=G,T;OID=COSM1342796,COSM86963;OMAPALT=G,T;OPOS=43814978,43814978;OREF=A,A;PB=.;PBP=.;QD=0.462755;RBI=0.014386;REFB=4.80559e-05;REVB=0.010464;RO=3338;SAF=1;SAR=3;SRF=1576;SRR=1762;SSEN=0;SSEP=0;SSSB=-0.0113679;STB=0.526994;STBP=0.848;TYPE=snp;VARB=-0.0370454;HS;FUNC=[{'transcript':'NM_005373.2','gene':'MPL','location':'exonic','exon':'10'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:231:3351:1999:3338:1997:4:2:0.0010005:3:1:1576:1762:1:1:944:1053

current output

chr1	43814978	COSM1342796;COSM86963	A	G	231.262	PASS	AF=0.05;AO=4;DP=3351;FAO=2;FDP=1999;FDVR=10;FR=.,.;FRO=1997;FSAF=1;FSAR=1;FSRF=944;FSRR=1053;FWDB=0.00987233;FXX=0.000499998;HRUN=1;HS_ONLY=0;LEN=1,1;MLLD=106.81;OALT=G,T;OID=COSM1342796,COSM86963;OMAPALT=G,T;OPOS=43814978,43814978;OREF=A,A;PB=.;PBP=.;QD=0.462755;RBI=0.014386;REFB=4.80559e-05;REVB=0.010464;RO=3338;SAF=1;SAR=3;SRF=1576;SRR=1762;SSEN=0;SSEP=0;SSSB=-0.0113679;STB=0.526994;STBP=0.848;TYPE=snp;VARB=-0.0370454;HS;FUNC=[{'transcript':'NM_005373.2','gene':'MPL','location':'exonic','exon':'10'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:231:3351:1999:3338:1997:4:2:0.0010005:3:1:1576:1762:1:1:944:1053

desired output

##bcftools_normVersion=1.9+htslib-1.9
##bcftools_normCommand=norm --do-not-normalize -m -both /path/to/xxxxx.vcf; Date=Tue Feb 26 12:59:30 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	xxxx
chr1	43814978	COSM1342796;COSM86963	A	G	231.262	PASS	AF=0.05;AO=4;DP=3351;FAO=2;FDP=1999;FDVR=10;FR=.,.;FRO=1997;FSAF=1;FSAR=1;FSRF=944;FSRR=1053;FWDB=0.00987233;FXX=0.000499998;HRUN=1;HS_ONLY=0;LEN=1,1;MLLD=106.81;OALT=G,T;OID=COSM1342796,COSM86963;OMAPALT=G,T;OPOS=43814978,43814978;OREF=A,A;PB=.;PBP=.;QD=0.462755;RBI=0.014386;REFB=4.80559e-05;REVB=0.010464;RO=3338;SAF=1;SAR=3;SRF=1576;SRR=1762;SSEN=0;SSEP=0;SSSB=-0.0113679;STB=0.526994;STBP=0.848;TYPE=snp;VARB=-0.0370454;HS;FUNC=[{'transcript':'NM_005373.2','gene':'MPL','location':'exonic','exon':'10'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:231:3351:1999:3338:1997:4:2:0.0010005:3:1:1576:1762:1:1:944:1053

awk

awk -F'[\t;]' '
  {
    split(x,V)
    for(i=1; i<=NF; i++) {
      split($i,F,/=/)
      V[F[1]]=F[2]
    }
  }
  (V["AF"]+0 > .03) && 
  (V["DP"]+0 > 20)
' file

Scrutinizer · February 27, 2019, 1:01pm

Hi, try:

  (V["AF"]+0 > .03) && 
  (V["DP"]+0 > 20) ||
  /^#/

or

awk -F'[\t;]' '
  /^#/ {
    print
    next
  }
  {
    split(x,V)
    for(i=1; i<=NF; i++) {
      split($i,F,/=/)
      V[F[1]]=F[2]
    }
  }
  (V["AF"]+0 > .03) && 
  (V["DP"]+0 > 20)
' file
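
For reference, the second form can be tried on a reduced sample. This is only a sketch: the sample keeps one ## line, the #CHROM line and two data rows, with the INFO column shortened to AF and DP, and the variable names vcf and filtered are made up for illustration. split("", V) is used in place of split(x,V); both empty the array as long as x is unset.

```shell
# Reduced sample: one ## header, the #CHROM line, and two data rows whose
# INFO column is shortened to AF and DP.
vcf=$(printf '##bcftools_normVersion=1.9+htslib-1.9\n#CHROM\tPOS\tID\tINFO\nchr1\t11174383\tCOSM1161896\tAF=0;DP=4229\nchr1\t43814978\tCOSM1342796;COSM86963\tAF=0.05;DP=3351\n')

filtered=$(printf '%s\n' "$vcf" | awk -F'[\t;]' '
  /^#/ { print; next }          # header lines: print and skip the filter
  {
    split("", V)                # empty the array left over from the previous row
    for (i = 1; i <= NF; i++) { # walk every field of the current row
      split($i, F, /=/)         # F[1] is the tag, F[2] its value
      V[F[1]] = F[2]
    }
  }
  (V["AF"] + 0 > .03) && (V["DP"] + 0 > 20)   # no action block, so matching rows are printed
')

printf '%s\n' "$filtered"
```

Only the two header lines and the AF=0.05 row should come out.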

vgersh99 · February 27, 2019, 1:03pm

awk -F'[\t;]' '
/^#/ { print;next}
  {
    split(x,V)
....
}

cmccabe · February 27, 2019, 1:06pm

Works great, thank you. I am currently learning Python (or trying to) and was going to use the awk as practice, that is, try rewriting it in Python. Could I post back comments on each line to see if my thinking is correct? Thank you :).

awk

awk -F'[\t;]' ' # call awk script and define FS as pattern of tab and semi-colon
  {
    split(x,V) # split each tab and ; and read into array V
    for(i=1; i<=NF; i++) {  # start loop iterating over each line
      split($i,F,/=/)  # split on the = and store in array F
      V[F[1]]=F[2]  # each V is tag=value (example AF=0.05)
    }
  }
  (V["AF"]+0 > .03) && # check AF is greater than 3% and
  (V["DP"]+0 >= 20) || # check DP is greater than or equal to 20
  /^#/  # retain header lines (if AF and DP criteria are met, print line(s) and header
' file  # define output file

/^#/ { print;next} # retains header as well

bakunin · February 27, 2019, 1:19pm

Of course you can do that - in fact you are explicitly encouraged to do so. This forum is all about self-empowerment and learning to help yourself. But you probably knew that already, didn't you?

A major difference between awk and sed is that the latter outputs every line by default, changed or not. For example,

sed 's/old/NEW/g' /some/file

will output not only the lines containing "old" (with "old" changed to "NEW") but also all other lines, unchanged. awk works differently and only outputs what it is explicitly told to output, through the print command or by other means. Therefore, if there is no rule that prints lines starting with "#", these lines will not be printed.
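
The difference is easy to see side by side. A small sketch with a made-up two-line input (the variable names are only for illustration):

```shell
# Two-line sample: only the first line contains "old".
sample=$(printf 'old value\nuntouched line\n')

# sed prints every line, whether or not the substitution matched.
sed_out=$(printf '%s\n' "$sample" | sed 's/old/NEW/g')

# awk only prints what a rule tells it to print; here only lines
# matching /old/ reach the print statement.
awk_out=$(printf '%s\n' "$sample" | awk '/old/ { gsub(/old/, "NEW"); print }')

printf 'sed:\n%s\n\nawk:\n%s\n' "$sed_out" "$awk_out"
```

sed returns both lines, awk only the changed one.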

Regarding your comment on the first line: not quite. FS is defined as either a tab or a semicolon. [...] is a so-called "character class", often used in regexps. It always means "one of the enclosed characters". For example, d[ae]n would match either "dan" or "den" but neither "dean" nor "daen". It is also possible to specify ranges instead of enumerating characters: [a-z] is "any lowercase letter a-z" and [a-zA-Z] is "any letter a-z, upper- or lowercase".

You can also negate these classes by using "^" as first character: [^0-9] is "anything but a digit".
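
A quick way to check these three points from the shell. This is a sketch on made-up input, and the variable names are hypothetical:

```shell
# A record that mixes tabs and semicolons, loosely modeled on the INFO column.
line=$(printf 'chr1\t100\tAF=0.05;DP=3351')

# -F'[\t;]' splits on a single tab OR a single semicolon, so this has 4 fields.
fields=$(printf '%s\n' "$line" | awk -F'[\t;]' '{ print NF ": " $3 " | " $4 }')

# d[ae]n matches "dan" and "den" but neither "dean" nor "daen".
matched=$(printf 'dan\nden\ndean\ndaen\n' |
  awk '/d[ae]n/ { out = out sep $0; sep = " " } END { print out }')

# [^0-9] is the negated class: delete everything that is not a digit.
digits=$(printf 'chr1:11174383\n' | awk '{ gsub(/[^0-9]/, ""); print }')

printf '%s\n%s\n%s\n' "$fields" "$matched" "$digits"
```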

I hope this helps.

bakunin

cmccabe · February 27, 2019, 2:00pm

Thank you :).

nezabudka · February 27, 2019, 3:31pm

Only the last 2 lines are compared correctly with this separator: -F'[\t;]'

awk -F "AF=|DP=" '
/^#/    {print; next}
        {split($2 $3, V, ";")}
( V[1] > 0.03 ) && ( V[3] > 20 )
' file

cmccabe · February 27, 2019, 4:56pm

I am sorry, but I don't understand. Can you please comment the awk if possible? Why are only the last two lines checked/compared? Thank you :).

nezabudka · February 28, 2019, 1:18am

Hi cmccabe,
If you take [\t;] as the field separator, the lines are split in different ways.
Fifth line:
chr1 11174383 COSM1161896 A G 264.674 PASS AF=0
Seventh line:
chr1 43814978 COSM1342796;COSM86963 A G 231.262 PASS AF=0.05
The first parameter for comparison (AF) therefore ends up in different fields: in field 8 in the first case and in field 9 in the second, because the last line contains an additional separator (the ";" in COSM1342796;COSM86963).
The same applies to the second parameter for comparison (DP).
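
The shift can be made visible by printing the field numbers. A minimal sketch, assuming trimmed-down copies of the fifth and seventh lines in which INFO holds only AF and DP (so the DP position differs from the full file); rows and positions are made-up names:

```shell
# Trimmed-down versions of the fifth and seventh lines of the sample file
# (INFO shortened to AF and DP, trailing columns dropped).
rows=$(printf 'chr1\t11174383\tCOSM1161896\tA\tG\t264.674\tPASS\tAF=0;DP=4229\nchr1\t43814978\tCOSM1342796;COSM86963\tA\tG\t231.262\tPASS\tAF=0.05;DP=3351\n')

# Report the field number in which AF= and DP= land when FS is [\t;].
positions=$(printf '%s\n' "$rows" | awk -F'[\t;]' '
  {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^AF=/) af = i
      if ($i ~ /^DP=/) dp = i
    }
    print $2 ": AF in field " af ", DP in field " dp
  }')

printf '%s\n' "$positions"
```

Code that addresses a fixed field such as $8 would read different values on the two rows, whereas a lookup by tag name is not affected by the shift.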