Modify text file using awk

mehar · May 23, 2013, 7:10am

I have text file with lines as shown here. Each row has 11 columns separated by tab. In each row, i want to split the 8th column such that the output should look like shown below. Here value in the 9th column is DP value and in the 10th column is MQ value followed by the values after resource.EFF=.

chr1 412573 . A C 2758.77 . 71 58.36 INTERGENIC MODIFIER GT:AD:DP:GQ:PL 1/1:0,71:71:99:2787,214,0 GATKSAM

chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5 GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0 GATKSAM

Which means the 8th column has to be cleaned up such that, it has only DP value, MQ value and the information after resource.EFF= separated by tabs.

Could anyone help?

Don_Cragun · May 23, 2013, 7:22am

This sure looks a lot like another thread you started in this forum earlier today:Split column using awk in a text file

Am I missing something, or is this a duplicate posting?

mehar · May 23, 2013, 7:24am

Indeed the question looks similar, but the input format in the 8th column differs from the previous thread. To avoid confusion i have posted it as a new thread.

pamu · May 23, 2013, 7:39am

You can use previous thread solution to solve this

mehar · May 23, 2013, 7:42am

The previous solution

gives the output as

Here i want to split further

in first row into

similary the data in the second row

into

which means i want to replace () and | with tabs.

Jotne · May 23, 2013, 7:58am

If you like to do lots of modification, its better to split it up and do some in one run and some other things in another run.
This reduce the complexity of the program

Running this on output above

awk '/INT/ {split($10,t,"[(|]"); $10=t[1]" "t[2]}{gsub(/[|()]/," ")}1'

gives

chr1 412573 . A C 2758.77 . 71 58.36 INTERGENIC MODIFIER GT:ADP:GQ:PL 1/1:0,71:71:99:2787,214,0

chr1 602567 rs21953190 A G 5481.77 . 152        59.09    SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5  GT:ADP:GQ:PL 1/1:0,151:151:99:5510,430,0

mehar · May 23, 2013, 8:05am

This code is working only with first row. The second row remains the same. I want to replace the () and | in the 10th column in all the rows. i.e.
the data in the second row also should look like,

RudiC · May 23, 2013, 2:15pm

Please use code tags in lieu of quote tags!

Try this:

awk     '       {n  = split ($8,TMP,";")
                 $8 = ""
                 for (i=1; i<=n; i++)
                    if (match (TMP, /^DP=|^MQ=|^resource.EFF/))
                        {sub  (/^.*=/, "", TMP)
                         gsub (/[()\|]+/, " ", TMP)
                         $8=$8 ($8?"\t":"") TMP
                        }
                }
         1
        '  FS="\t" file
chr1 412573 . A C 2758.77 . 71    58.36    INTERGENIC MODIFIER  GT:ADP:GQ:PL 1/1:0,71:71:99:2787,214,0 GATKSAM
       
chr1 602567 rs21953190 A G 5481.77 . 152    59.09    SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5  GT:ADP:GQ:PL 1/1:0,151:151:99:5510,430,0 GATKSAM

mehar · May 23, 2013, 2:33pm

Thanks, i have the similar fix which is doing what i need. Here is the code.

awk 'BEGIN{OFS="\t"}{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP, /^DP=|^MQ=|^resource.EFF=/)) {sub (/^.*=/, "" ,TMP); $12=$12 ($12?"\t":"") TMP} {gsub("[\\|()]", "\t")}}1' file

But i have a new problem,

chr1	    901534	rs21932296	   T	G	34.77	0/1:3,2:5:63:63,0,64	GATKSAM	5	55.21	INTRON	MODIFIER				CTDP1	protein_coding	CODING	ENSCAFT00000000012	11

If we observe after MODIFIER i have a series of empty tabs. When i am piping this input to another awk command to perform s0me other action with the command,

awk 'BEGIN{OFS="\t"}{split ($7,TMP,":"); $7= TMP[1]}1'

it is replacing multiple empty tabs into a single tab and gives the output like below:

chr1	    901534	rs21932296	  T	G	34.77	 0/1	GATKSAM	5	55.21	INTRON	MODIFIER	CTDP1	protein_coding	CODING	ENSCAFT00000000012	11

I don't want the multiple tabs to be replaced by single tab. Could you help where im going wrong?

RudiC · May 23, 2013, 2:52pm

Assigning to/modifying any field will make awk reevaluate $0, eliminating empty fields. I don't know how to circumvent that, except by running through all the fields and assign a space to all empty ones - not sure that will work, btw.

Don_Cragun · May 23, 2013, 2:56pm

mehar:

Thanks, i have the similar fix which is doing what i need. Here is the code.
awk 'BEGIN{OFS="\t"}{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP, /^DP=|^MQ=|^resource.EFF=/)) {sub (/^.*=/, "" ,TMP); $12=$12 ($12?"\t":"") TMP} {gsub("[\\|()]", "\t")}}1' file
But i have a new problem,
chr1	    901534	rs21932296	   T	G	34.77	0/1:3,2:5:63:63,0,64	GATKSAM	5	55.21	INTRON	MODIFIER				CTDP1	protein_coding	CODING	ENSCAFT00000000012	11
If we observe after MODIFIER i have a series of empty tabs. When i am piping this input to another awk command to perform s0me other action with the command,
awk 'BEGIN{OFS="\t"}{split ($7,TMP,":"); $7= TMP[1]}1' 
it is replacing multiple empty tabs into a single tab and gives the output like below:
chr1	    901534	rs21932296	  T	G	34.77	 0/1	GATKSAM	5	55.21	INTRON	MODIFIER	CTDP1	protein_coding	CODING	ENSCAFT00000000012	11
I don't want the multiple tabs to be replaced by single tab. Could you help where im going wrong?

You're just setting the output field separator to tab, you also need to set the input field separator. Change: BEGIN{OFS="\t"} to: BEGIN{FS=OFS="\t"} .

mehar · May 23, 2013, 3:09pm

Well!! I have tried that too, it still gives the same output.

---------- Post updated at 02:05 PM ---------- Previous update was at 01:59 PM ----------

Okay. Then could you help to do in the following way.

chr1	758630	.	T	TC	2221.73	.	AC=2;AF=1.00;AN=2;DP=61;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.14;MQ0=0;QD=36.42;RPA=1,2;RU=C;STR;resource.EFF=INTRON(MODIFIER||||PQLC1|protein_coding|CODING|ENSCAFT00000000011|2)	GT:AD:DP:GQ:PL	1/1:0,55:61:99:2259,165,0	GATKSAM

Could you modify the code given, such that it transforms in the following way:

chr1	758630	.	T	TC	2221.73	.	61 51.14 INTRON MODIFIER NA  NA   NA PQLC1 protein_coding CODING ENSCAFT00000000011 2	GT:AD:DP:GQ:PL	1/1:0,55:61:99:2259,165,0	GATKSAM

i.e replacing the empty fields in between |||| with NA?

RudiC · May 23, 2013, 3:46pm

Why don't you extend your similar fix in post#9 to what you require above? Should be easily doable.

mehar · May 23, 2013, 3:52pm

Sorry if i have put some stupid questions. I have tried doing that but couldn't find the logic as im novice in using awk.

RudiC · May 23, 2013, 4:06pm

You had that fine command {gsub("[\\|()]", "\t")} . Why don't you split it to replace "\|" with "NA" and "[()]" with "" ?

mehar · May 23, 2013, 5:10pm

I tried in the following way

{gsub("[\|]", "NA")}{gsub("[()]", "")}

and it gives

as

INTRONMODIFIERNANANANACTDP1NAprotein_codingNACODINGNAENSCAFT00000000012NA11

But it should keep NA only where there is no text i.e. in between |||||. But the above code replace NA wherever there is |. The output should look like

INTRON MODIFIER NA  NA  NA  CTDP1 protein_coding CODING ENSCAFT00000000012 11

---------- Post updated at 04:10 PM ---------- Previous update was at 03:21 PM ----------

Here is a fix, found in google.

awk 'BEGIN{OFS="\t"}{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP, /^DP=|^MQ=|^resource.EFF=/)) {sub (/^.*=/, "" ,TMP); $12=$12 ($12?"\t":"") TMP} {gsub("[\\|()]", "\t")}}1' file | awk 'BEGIN {FS=OFS="\t"} {for(i = 1; i <= NF; i++) {if(!$i) {$i = "NA" }}}1'