Modify text file using awk

I have text file with lines as shown here. Each row has 11 columns separated by tab. In each row, i want to split the 8th column such that the output should look like shown below. Here value in the 9th column is DP value and in the 10th column is MQ value followed by the values after resource.EFF=.

chr1 412573 . A C 2758.77 . 71 58.36 INTERGENIC MODIFIER GT:AD:DP:GQ:PL 1/1:0,71:71:99:2787,214,0 GATKSAM

chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5 GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0 GATKSAM

Which means the 8th column has to be cleaned up such that, it has only DP value, MQ value and the information after resource.EFF= separated by tabs.

Could anyone help?

This sure looks a lot like another thread you started in this forum earlier today:Split column using awk in a text file

Am I missing something, or is this a duplicate posting?

Indeed the question looks similar, but the input format in the 8th column differs from the previous thread. To avoid confusion i have posted it as a new thread.

You can use previous thread solution to solve this :slight_smile:

The previous solution

gives the output as

Here i want to split further

in first row into

similary the data in the second row

into

which means i want to replace () and | with tabs.

If you like to do lots of modification, its better to split it up and do some in one run and some other things in another run.
This reduce the complexity of the program

Running this on output above

awk '/INT/ {split($10,t,"[(|]"); $10=t[1]" "t[2]}{gsub(/[|()]/," ")}1'

gives

chr1 412573 . A C 2758.77 . 71 58.36 INTERGENIC MODIFIER GT:ADP:GQ:PL 1/1:0,71:71:99:2787,214,0

chr1 602567 rs21953190 A G 5481.77 . 152        59.09    SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5  GT:ADP:GQ:PL 1/1:0,151:151:99:5510,430,0

This code is working only with first row. The second row remains the same. I want to replace the () and | in the 10th column in all the rows. i.e.
the data in the second row also should look like,

Please use code tags in lieu of quote tags!

Try this:

awk     '       {n  = split ($8,TMP,";")
                 $8 = ""
                 for (i=1; i<=n; i++)
                    if (match (TMP, /^DP=|^MQ=|^resource.EFF/))
                        {sub  (/^.*=/, "", TMP)
                         gsub (/[()\|]+/, " ", TMP)
                         $8=$8 ($8?"\t":"") TMP
                        }
                }
         1
        '  FS="\t" file
chr1 412573 . A C 2758.77 . 71    58.36    INTERGENIC MODIFIER  GT:ADP:GQ:PL 1/1:0,71:71:99:2787,214,0 GATKSAM
       
chr1 602567 rs21953190 A G 5481.77 . 152    59.09    SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5  GT:ADP:GQ:PL 1/1:0,151:151:99:5510,430,0 GATKSAM

Thanks, i have the similar fix which is doing what i need. Here is the code.

awk 'BEGIN{OFS="\t"}{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP, /^DP=|^MQ=|^resource.EFF=/)) {sub (/^.*=/, "" ,TMP); $12=$12 ($12?"\t":"") TMP} {gsub("[\\|()]", "\t")}}1' file

But i have a new problem,

chr1	    901534	rs21932296	   T	G	34.77	0/1:3,2:5:63:63,0,64	GATKSAM	5	55.21	INTRON	MODIFIER				CTDP1	protein_coding	CODING	ENSCAFT00000000012	11

If we observe after MODIFIER i have a series of empty tabs. When i am piping this input to another awk command to perform s0me other action with the command,

awk 'BEGIN{OFS="\t"}{split ($7,TMP,":"); $7= TMP[1]}1' 

it is replacing multiple empty tabs into a single tab and gives the output like below:

chr1	    901534	rs21932296	  T	G	34.77	 0/1	GATKSAM	5	55.21	INTRON	MODIFIER	CTDP1	protein_coding	CODING	ENSCAFT00000000012	11

I don't want the multiple tabs to be replaced by single tab. Could you help where im going wrong?

Assigning to/modifying any field will make awk reevaluate $0, eliminating empty fields. I don't know how to circumvent that, except by running through all the fields and assign a space to all empty ones - not sure that will work, btw.

You're just setting the output field separator to tab, you also need to set the input field separator. Change: BEGIN{OFS="\t"} to: BEGIN{FS=OFS="\t"} .

Well!! I have tried that too, it still gives the same output.

---------- Post updated at 02:05 PM ---------- Previous update was at 01:59 PM ----------

Okay. Then could you help to do in the following way.

chr1	758630	.	T	TC	2221.73	.	AC=2;AF=1.00;AN=2;DP=61;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.14;MQ0=0;QD=36.42;RPA=1,2;RU=C;STR;resource.EFF=INTRON(MODIFIER||||PQLC1|protein_coding|CODING|ENSCAFT00000000011|2)	GT:AD:DP:GQ:PL	1/1:0,55:61:99:2259,165,0	GATKSAM

Could you modify the code given, such that it transforms in the following way:

chr1	758630	.	T	TC	2221.73	.	61 51.14 INTRON MODIFIER NA  NA   NA PQLC1 protein_coding CODING ENSCAFT00000000011 2	GT:AD:DP:GQ:PL	1/1:0,55:61:99:2259,165,0	GATKSAM

i.e replacing the empty fields in between |||| with NA?

Why don't you extend your similar fix in post#9 to what you require above? Should be easily doable.

Sorry if i have put some stupid questions. I have tried doing that but couldn't find the logic as im novice in using awk.

You had that fine command {gsub("[\\|()]", "\t")} . Why don't you split it to replace "\|" with "NA" and "[()]" with "" ?

I tried in the following way

{gsub("[\|]", "NA")}{gsub("[()]", "")}

and it gives

as

INTRONMODIFIERNANANANACTDP1NAprotein_codingNACODINGNAENSCAFT00000000012NA11

But it should keep NA only where there is no text i.e. in between |||||. But the above code replace NA wherever there is |. The output should look like

INTRON MODIFIER NA  NA  NA  CTDP1 protein_coding CODING ENSCAFT00000000012 11

---------- Post updated at 04:10 PM ---------- Previous update was at 03:21 PM ----------

Here is a fix, found in google.

awk 'BEGIN{OFS="\t"}{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP, /^DP=|^MQ=|^resource.EFF=/)) {sub (/^.*=/, "" ,TMP); $12=$12 ($12?"\t":"") TMP} {gsub("[\\|()]", "\t")}}1' file | awk 'BEGIN {FS=OFS="\t"} {for(i = 1; i <= NF; i++) {if(!$i) {$i = "NA" }}}1'