awk to add value and text to specific lines

cmccabe · November 29, 2017, 9:02am

In the awk I have a very large tab-delimeted file that I am trying to extract the DP= value put it in $16 and add
specific text to $16 with . (dot) in $11-$15 and $18 . Only the lines (there are several) that have the formating below in file
will have an empty $16 . Other lines will be in a different format and have a value in $16 and can be skipped. Thank you :).

awk

awk -F'\t' -v OFS="\t" '{if (!$16) {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, ".", ".", ".", ".", ".", /DP=/, "homref", "."}' file

file

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224

desired output tab-delimeted

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	159	homref	.

Don_Cragun · November 29, 2017, 10:50am

Other than adding text to the end of lines that start with a # , what is the code you're trying to use doing wrong?

What are you hoping that /DP=/ will print. I would expect it to print 1 if the string DP= appears somewhere in the current line and print 0 otherwise. Are you looking for the value between the first = and the first ; in field #8? If so, it would seem that you could split field 8 using /[=;]/ as the ERE to use as the subfield delimiter for field 8 and then print subfield 2. But making wild suggestions like this based on a sample size of 1 with no description of that field's contents is extremely dangerous.

cmccabe · November 29, 2017, 12:15pm

...
...
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224

desired output

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	159	homref	.

other format of lines to skip

chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36

If I split on $8 using DP= and print the value up to the ; , since that tag is iin both lines, wont it print for both? However since $16 has a value in it that line can be skipped (the much larger of the two lines).

awk

awk -F'\t' -v OFS="\t" '{if (!$16);{split($8,a,[=;]); {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, ".", ".", ".", ".", ".", a[1], "homref", "."}' file

Not sure is the above is any closer and I did not really know what would print, but understand better know? Thank you :).

I also tried the perl below which executes but does not print any new output the desired output, but seems close.

perl

perl -plae ' 
 BEGIN{ %h = qw(0/0 homref) } /^[^#].*DP=(\d+);.*([0-0]\/[0-0])/ and $_ .= join "\t", ("", "."),".",".",".", ".", $1, $h{$7},"."' file > out

output of line to change --- seems to extract 224 and doesn't print homref though qw =0/0)

chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	224		.
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36  --- this line is correct ----

Don_Cragun · November 29, 2017, 4:15pm

I'm very confused.

Are you just trying to print lines that need to have fields added? Or, do you want to print all lines in the file whether they change or not? Your code makes no attempt to ignore the header/comment lines at the start of your file, but you don't seem to want them to be changed even though none of them have 16 tab separated fields and you try to change every line that doesn't have a field #16.

I suggested using /[=;]/ as an ERE argument for split() ; you used [=;] instead (which I would expect to give you a syntax error).

I suggested printing array[2] IF DP=value; is ALWAYS at the start of field #8. You printed array[1] and showed us an example where DP=value; is the third subfield in field #8.

I know that you like writing awk 1-liners (instead of readable code). But, until you fully understand the syntax of an if statement in akw , you would be much better off writing awk code longhand so you can actually see how the pieces are supposed to fit together. (And please show us the output you get from the code you're running including the diagnostic messages as well as the normal output.)

Those of us who are trying to help you get tired quickly when there is no clear specification of what you are trying to do and no clear specification of the format of the data you're trying to interpret and modify.

cmccabe · November 29, 2017, 6:33pm

I hope the below is a better description:

The DP= is in each line of the input file, however it is always at the start of $8 in the lines that the awk is goiog to update.
$11 will also always be empty (like in line1, in line2 there is a value so nothing is done). All lines are printed that are in the input file, that is the comments and each line is printed. In the example since line2 is good it is only printed as is, but line1 needs to be updated brfore it is printed. Do the comment lines need to be skipped as they do not have more then twwo fields in them? Thank you :).

file 11 tab-delimeted fields --- the actual file is over 22 million lines in the below format ----
line1 and all lines like it are the one that the awk updates
line2 and all lines like it are skipped(nothing needs to be done to them)

file

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36

awk

awk '		# Run awk to process the following awk script...
BEGIN {	OFS = "\t" }            # Set output field separator.
      'NR==FNR {if (!$11) {     # For each line in the input file check if $11 is empty
        split($8,a,"/[=;]/");    # if $11 is empty then split $8 on the = and ; and store the value in a 
                next    # process next line in file
        }   # close block
                                    {   # start print block
                                        print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,".",".",".",".",".",a[2],"homref",".";  # print desired output
                                    } # close print block
                    next  # print next line
                    }1' file   # define input

I have syntax errors in the awk but added comments that I hope will help. I use one-liners cause for whatever reason I have a hard time not using them (as evident by the syntax errors).... never-the-less commented seems to help. Thank you :).

Don_Cragun · November 29, 2017, 10:29pm

Assuming that your sample data is representative, the following seems to do what you want:

awk '
BEGIN {	FS = OFS = "\t"
}
NF == 10 {
	split($8, a, /[=;]/)
	$11 = $12 = $13 = $14 = $15 = $18 = "."
	$16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
	$17 = "homref"
}
1' file

Note that this assumes that lines that are to be modified only have 10 fields (i.e., 9 <tab>s) as in your sample. It verifies that field #8 does start with DP= and assigns a diagnostic message to field #16 if it isn't. This seemed safer to me than blindly assuming that the input file is correctly formatted. (And, I added a line to my sample input file to be sure that it worked as I expected. It didn't the first two times I tried it.)

When the contents of file is:

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36
ALT1	948798	.	D	.	1	PAST	END=948845;MAX_DP=224;MIN_DP=95;DP=159	GT:DP:MIN_DP:MAX_DP	1/1:160:96:225

then the output produced is:

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	159	homref	.
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36
ALT1	948798	.	D	.	1	PAST	END=948845;MAX_DP=224;MIN_DP=95;DP=159	GT:DP:MIN_DP:MAX_DP	1/1:160:96:225	.	.	.	DP=num_Missing	homref	.

Hopefully, this will come close to what you want.

cmccabe · December 4, 2017, 8:04am

Thank you very much, works perfect thank you :).