mehar
May 23, 2013, 7:10am
1
chr1 412573 . A C 2758.77 . AC=2;AF=1.00;AN=2;DP=71;Dels=0.00;FS=0.000;HaplotypeScore=2.8822;MLEAC=2;MLEAF=1.00;MQ=58.36;MQ0=0;QD=38.86;resource.EFF=INTERGENIC(MODIFIER||||||||) GT:AD:DP:GQ:PL 1/1:0,71:71:99:2787,214,0 GATKSAM
chr1 602567 rs21953190 A G 5481.77 . AC=2;AF=1.00;AN=2;DB;DP=152;Dels=0.00;FS=0.000;HaplotypeScore=6.8385;MLEAC=2;MLEAF=1.00;MQ=59.09;MQ0=0;QD=36.06;resource.EFF=SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D1034|ADNP2|protein_coding|CODING|ENSCAFT00000000008|5) GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0 GATKSAM
I have text file with lines as shown here. Each row has 11 columns separated by tab. In each row, i want to split the 8th column such that the output should look like shown below. Here value in the 9th column is DP value and in the 10th column is MQ value followed by the values after resource.EFF=.
chr1 412573 . A C 2758.77 . 71 58.36 INTERGENIC MODIFIER GT:AD:DP:GQ:PL 1/1:0,71:71:99:2787,214,0 GATKSAM
chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5 GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0 GATKSAM
Which means the 8th column has to be cleaned up such that, it has only DP value, MQ value and the information after resource.EFF= separated by tabs.
Could anyone help?
This sure looks a lot like another thread you started in this forum earlier today:Split column using awk in a text file
Am I missing something, or is this a duplicate posting?
mehar
May 23, 2013, 7:24am
3
Indeed the question looks similar, but the input format in the 8th column differs from the previous thread. To avoid confusion i have posted it as a new thread.
pamu
May 23, 2013, 7:39am
4
You can use previous thread solution to solve this
mehar
May 23, 2013, 7:42am
5
The previous solution
awk '{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP[i], /^DP=|^MQ=|^resource.EFF/)) {sub (/^.*=/,"",TMP[i]); $8=$8 ($8?"\t":"") TMP[i]}}1' FS="\t" file
gives the output as
chr1 412573 . A C 2758.77 . 71 58.36 INTERGENIC(MODIFIER||||||||) GT:AD:DP:GQ:PL 1/1:0,71:71:99:2787,214,0
chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D1034|ADNP2|protein_coding|CODING|ENSCAFT00000000008|5) GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0
Here i want to split further
in first row into
similary the data in the second row
chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D1034|ADNP2|protein_coding|CODING|ENSCAFT00000000008|5) GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0
into
chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5 GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0
which means i want to replace () and | with tabs.
Jotne
May 23, 2013, 7:58am
6
If you like to do lots of modification, its better to split it up and do some in one run and some other things in another run.
This reduce the complexity of the program
Running this on output above
awk '/INT/ {split($10,t,"[(|]"); $10=t[1]" "t[2]}{gsub(/[|()]/," ")}1'
gives
chr1 412573 . A C 2758.77 . 71 58.36 INTERGENIC MODIFIER GT:ADP:GQ:PL 1/1:0,71:71:99:2787,214,0
chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5 GT:ADP:GQ:PL 1/1:0,151:151:99:5510,430,0
mehar
May 23, 2013, 8:05am
7
This code is working only with first row. The second row remains the same. I want to replace the () and | in the 10th column in all the rows. i.e.
the data in the second row also should look like,
chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5 GT:ADP:GQ:PL 1/1:0,151:151:99:5510,430,0
RudiC
May 23, 2013, 2:15pm
8
Please use code tags in lieu of quote tags!
Try this:
awk ' {n = split ($8,TMP,";")
$8 = ""
for (i=1; i<=n; i++)
if (match (TMP, /^DP=|^MQ=|^resource.EFF/))
{sub (/^.*=/, "", TMP)
gsub (/[()\|]+/, " ", TMP)
$8=$8 ($8?"\t":"") TMP
}
}
1
' FS="\t" file
chr1 412573 . A C 2758.77 . 71 58.36 INTERGENIC MODIFIER GT:ADP:GQ:PL 1/1:0,71:71:99:2787,214,0 GATKSAM
chr1 602567 rs21953190 A G 5481.77 . 152 59.09 SYNONYMOUS_CODING LOW SILENT gaT/gaC D1034 ADNP2 protein_coding CODING ENSCAFT00000000008 5 GT:ADP:GQ:PL 1/1:0,151:151:99:5510,430,0 GATKSAM
mehar
May 23, 2013, 2:33pm
9
Thanks, i have the similar fix which is doing what i need. Here is the code.
awk 'BEGIN{OFS="\t"}{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP, /^DP=|^MQ=|^resource.EFF=/)) {sub (/^.*=/, "" ,TMP); $12=$12 ($12?"\t":"") TMP} {gsub("[\\|()]", "\t")}}1' file
But i have a new problem,
chr1 901534 rs21932296 T G 34.77 0/1:3,2:5:63:63,0,64 GATKSAM 5 55.21 INTRON MODIFIER CTDP1 protein_coding CODING ENSCAFT00000000012 11
If we observe after MODIFIER i have a series of empty tabs. When i am piping this input to another awk command to perform s0me other action with the command,
awk 'BEGIN{OFS="\t"}{split ($7,TMP,":"); $7= TMP[1]}1'
it is replacing multiple empty tabs into a single tab and gives the output like below:
chr1 901534 rs21932296 T G 34.77 0/1 GATKSAM 5 55.21 INTRON MODIFIER CTDP1 protein_coding CODING ENSCAFT00000000012 11
I don't want the multiple tabs to be replaced by single tab. Could you help where im going wrong?
RudiC
May 23, 2013, 2:52pm
10
Assigning to/modifying any field will make awk
reevaluate $0, eliminating empty fields. I don't know how to circumvent that, except by running through all the fields and assign a space to all empty ones - not sure that will work, btw.
mehar:
Thanks, i have the similar fix which is doing what i need. Here is the code.
awk 'BEGIN{OFS="\t"}{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP, /^DP=|^MQ=|^resource.EFF=/)) {sub (/^.*=/, "" ,TMP); $12=$12 ($12?"\t":"") TMP} {gsub("[\\|()]", "\t")}}1' file
But i have a new problem,
chr1 901534 rs21932296 T G 34.77 0/1:3,2:5:63:63,0,64 GATKSAM 5 55.21 INTRON MODIFIER CTDP1 protein_coding CODING ENSCAFT00000000012 11
If we observe after MODIFIER i have a series of empty tabs. When i am piping this input to another awk command to perform s0me other action with the command,
awk 'BEGIN{OFS="\t"}{split ($7,TMP,":"); $7= TMP[1]}1'
it is replacing multiple empty tabs into a single tab and gives the output like below:
chr1 901534 rs21932296 T G 34.77 0/1 GATKSAM 5 55.21 INTRON MODIFIER CTDP1 protein_coding CODING ENSCAFT00000000012 11
I don't want the multiple tabs to be replaced by single tab. Could you help where im going wrong?
You're just setting the output field separator to tab, you also need to set the input field separator. Change: BEGIN{OFS="\t"}
to: BEGIN{FS=OFS="\t"}
.
mehar
May 23, 2013, 3:09pm
12
Well!! I have tried that too, it still gives the same output.
---------- Post updated at 02:05 PM ---------- Previous update was at 01:59 PM ----------
Okay. Then could you help to do in the following way.
chr1 758630 . T TC 2221.73 . AC=2;AF=1.00;AN=2;DP=61;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.14;MQ0=0;QD=36.42;RPA=1,2;RU=C;STR;resource.EFF=INTRON(MODIFIER||||PQLC1|protein_coding|CODING|ENSCAFT00000000011|2) GT:AD:DP:GQ:PL 1/1:0,55:61:99:2259,165,0 GATKSAM
Could you modify the code given, such that it transforms in the following way:
chr1 758630 . T TC 2221.73 . 61 51.14 INTRON MODIFIER NA NA NA PQLC1 protein_coding CODING ENSCAFT00000000011 2 GT:AD:DP:GQ:PL 1/1:0,55:61:99:2259,165,0 GATKSAM
i.e replacing the empty fields in between |||| with NA?
RudiC
May 23, 2013, 3:46pm
13
Why don't you extend your similar fix in post#9 to what you require above? Should be easily doable.
mehar
May 23, 2013, 3:52pm
14
Sorry if i have put some stupid questions. I have tried doing that but couldn't find the logic as im novice in using awk.
RudiC
May 23, 2013, 4:06pm
15
You had that fine command {gsub("[\\|()]", "\t")}
. Why don't you split it to replace "\|"
with "NA"
and "[()]"
with ""
?
mehar
May 23, 2013, 5:10pm
16
I tried in the following way
{gsub("[\|]", "NA")}{gsub("[()]", "")}
and it gives
as
INTRONMODIFIERNANANANACTDP1NAprotein_codingNACODINGNAENSCAFT00000000012NA11
But it should keep NA only where there is no text i.e. in between |||||. But the above code replace NA wherever there is |. The output should look like
INTRON MODIFIER NA NA NA CTDP1 protein_coding CODING ENSCAFT00000000012 11
---------- Post updated at 04:10 PM ---------- Previous update was at 03:21 PM ----------
Here is a fix, found in google.
awk 'BEGIN{OFS="\t"}{n=split ($8,TMP,";"); $8=""; for (i=1; i<=n; i++) if (match (TMP, /^DP=|^MQ=|^resource.EFF=/)) {sub (/^.*=/, "" ,TMP); $12=$12 ($12?"\t":"") TMP} {gsub("[\\|()]", "\t")}}1' file | awk 'BEGIN {FS=OFS="\t"} {for(i = 1; i <= NF; i++) {if(!$i) {$i = "NA" }}}1'