In the awk below I am trying to add :p=? to the end of each ;-separated entry in $9 when $9 matches the pattern NM_. The code executes and is close, but I cannot figure out why the :p=? repeats across the split entries, as shown in the current output. I have added comments as well. Thank you :).
file
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritance ExonicFunc.refGene AAChange.refGene
1 chr1 155870416 155870416 G A splicing RIT1 NM_001256821:exon6:c.481-7C>T;NM_001256820:exon5:c.322-7C>T;NM_006912:exon6:c.430-7C>T
9 chr10 112760138 112760138 A - splicing SHOC2 NM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-
11 chr18 53070914 53070914 G A exonic TCF4 . AD nonsynonymous SNV TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V
awk
awk '
BEGIN { FS=OFS="\t" } # define FS and OFS as tab and start processing
$9 ~ /NM/ { # look for pattern NM in $9
    # split $9 by ";" and cycle through them
    out="" # array out is empty
    i=split($9,NM,/;/)
    for (n=1; n<=i; n++) {
        sub(/$/, ":p=", NM[i]) # add :p. to end of each NM before the ;
        out = (out=="" ? "" : out";") NM[i] # add ? to each NM and store in array out
    }
    $9 = out # update with array out
}1' file
desired output
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritance ExonicFunc.refGene AAChange.refGene
1 chr1 155870416 155870416 G A splicing RIT1 NM_001256821:exon6:c.481-7C>T:p=?;NM_001256820:exon5:c.322-7C>T:p=?;NM_006912:exon6:c.430-7C>T:p=?
9 chr10 112760138 112760138 A - splicing SHOC2 NM_007373:exon4:c.842-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?
11 chr18 53070914 53070914 G A exonic TCF4 . AD nonsynonymous SNV TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V
current output
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritance ExonicFunc.refGene AAChange.refGene
1 chr1 155870416 155870416 G A splicing RIT1 NM_006912:exon6:c.430-7C>T:p=?;NM_006912:exon6:c.430-7C>T:p=?:p=?;NM_006912:exon6:c.430-7C>T:p=?:p=?:p=?
9 chr10 112760138 112760138 A - splicing SHOC2 NM_001269039:exon2:c.704-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?:p=?
11 chr18 53070914 53070914 G A exonic TCF4 . AD nonsynonymous SNV TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V
Hi cmccabe,
I agree with rdrtx1 that the code suggested in post #2 should do what you want.
What I don't understand is how the code you showed us in post #1 could produce the output that you labeled as "current output" in that post. Are you absolutely positive that the code you showed us in post #1 produced the output you showed us when file had the contents you showed us in that post?
The code you showed us seems like it would produce the desired number of additions to your field #9, but would omit the desired question marks and just append :p= to each subfield. The code to which you appended the comment:
# add ? to each NM and store in array out
does not add question marks; it rebuilds field number 9 by putting back the semicolons that were removed by split(). Also note that the variable named out in your code is a string, not an array.
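To illustrate the point about split() and the out variable, here is a minimal sketch on a made-up input (the string a;b;c and the names count, parts, joined and summary are only for illustration, not from the original post). split() fills an array and returns the number of pieces, while the rejoined result is an ordinary string built by concatenation:

```shell
# Hypothetical sample input "a;b;c"; not taken from the original file.
summary=$(printf 'a;b;c\n' | awk '{
    count = split($0, parts, /;/)   # parts[1]="a", parts[2]="b", parts[3]="c"; count is 3
    joined = ""                     # a plain string, not an array
    for (n = 1; n <= count; n++)
        joined = (joined == "" ? "" : joined ";") parts[n]   # put the ";" back between pieces
    print count, joined
}')
printf '%s\n' "$summary"
```

This should print 3 a;b;c, that is, the piece count followed by the field reassembled with its semicolons.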
Thank you very much rdrtx1, that works perfectly :).
Don Cragun, you are correct:
I forgot that I had changed the replacement string in the sub() call from ":p=" to ":p=?".
However, the :p=? still seemed to repeat once for each split element. Maybe that is the wrong terminology, but I could not understand why, no matter what I tried. Thank you for the correction that out is a string and not an array; I was confused.
awk
awk '
BEGIN { FS=OFS="\t" } # define FS and OFS as tab and start processing
$9 ~ /NM/ { # look for pattern NM in $9
    # split $9 by ";" and cycle through them
    out=""
    i=split($9,NM,/;/)
    for (n=1; n<=i; n++) {
        sub(/$/, ":p=", NM[i]) # add :p. to end of each NM before the ;
        out = (out=="" ? "" : out";") NM[i]
    }
    $9 = out
}1' file
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritance ExonicFunc.refGene AAChange.refGene
1 chr1 155870416 155870416 G A splicing RIT1 NM_006912:exon6:c.430-7C>T:p=;NM_006912:exon6:c.430-7C>T:p=:p=;NM_006912:exon6:c.430-7C>T:p=:p=:p=
9 chr10 112760138 112760138 A - splicing SHOC2 NM_001269039:exon2:c.704-35A>-:p=;NM_001269039:exon2:c.704-35A>-:p=:p=
11 chr18 53070914 53070914 G A exonic TCF4 . AD nonsynonymous SNV TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V
For rdrtx1's awk, are my comments below close?
awk '
BEGIN { FS=OFS="\t" } # define FS and OFS as tab and start processing
$9 ~ /NM/ { # look for pattern NM in $9
gsub(";", ":p=?;", $9); # split by ; in $9
sub("$", ":p=?", $9); # add :p=? to end of each split by ;
} 1' file # update input
Your code had four minor bugs. If you change what you showed us in post #1 to:
awk '
BEGIN { FS=OFS="\t" } # define FS and OFS as tab and start processing
$9 ~ /NM/ { # look for pattern NM in $9
    # split $9 by ";" and cycle through them
    out="" # output string starts out empty
    i=split($9,NM,/;/) # i is the number of subfields found
    for (n=1; n<=i; n++) {
        sub(/$/, ":p=?", NM[n]) # add ":p=?" to end of each NM
        out = (out=="" ? "" : out";") NM[n] # add updated NM to new output string, restoring ";"s.
    }
    $9 = out # replace field #9 with updated output string
}1' file
you'll get the output you wanted.
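As for why the suffix piled up: a likely explanation, assuming the loop in post #1 subscripted the array with i (the count returned by split()) instead of the loop variable n, is that every pass appended to, and then copied, the last element. The sketch below reproduces that pattern on a made-up three-part field; the sample string and the shell variable names field, buggy and fixed are only for illustration.

```shell
# Hypothetical one-field sample; in the real file this text would sit in $9.
field='A:c.1;B:c.2;C:c.3'

# Assumed buggy loop: i holds the piece count, so NM[i] is always the LAST piece.
buggy=$(printf '%s\n' "$field" | awk '{
    out = ""
    i = split($0, NM, /;/)
    for (n = 1; n <= i; n++) {
        sub(/$/, ":p=?", NM[i])                  # appends to NM[3] again on every pass
        out = (out == "" ? "" : out ";") NM[i]   # and copies NM[3] each time
    }
    print out
}')

# Fixed loop: subscript with the loop variable n so each piece is handled once.
fixed=$(printf '%s\n' "$field" | awk '{
    out = ""
    i = split($0, NM, /;/)
    for (n = 1; n <= i; n++) {
        sub(/$/, ":p=?", NM[n])
        out = (out == "" ? "" : out ";") NM[n]
    }
    print out
}')

printf 'buggy: %s\nfixed: %s\n' "$buggy" "$fixed"
```

The buggy variant should give C:c.3:p=?;C:c.3:p=?:p=?;C:c.3:p=?:p=?:p=?, which has the same shape as the "current output" in post #1, while the fixed variant gives A:c.1:p=?;B:c.2:p=?;C:c.3:p=?.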
But, rdrtx1's code is easier to read and probably faster. Some of your comments on rdrtx1's code are a little bit off. Try changing:
gsub(";", ":p=?;", $9); # split by ; in $9
sub("$", ":p=?", $9); # add :p=? to end of each split by ;
to:
gsub(";", ":p=?;", $9); # prepend ":p=?" to each of the subfield separators.
sub("$", ":p=?", $9); # add ":p=?" to end of the last subfield
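To check those two comments against real data, here is a small sketch that feeds the SHOC2 row from post #1 (without the trailing empty columns) through rdrtx1's two calls and prints only field 9. The shell variable name result is just for illustration.

```shell
# One tab-separated record built from the SHOC2 line in post #1.
result=$(printf '9\tchr10\t112760138\t112760138\tA\t-\tsplicing\tSHOC2\tNM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-\n' |
awk '
BEGIN { FS = OFS = "\t" }
$9 ~ /NM/ {
    gsub(";", ":p=?;", $9)   # put ":p=?" in front of every ";" separator
    sub("$", ":p=?", $9)     # append ":p=?" after the last subfield
}
{ print $9 }')

printf '%s\n' "$result"
```

This should print NM_007373:exon4:c.842-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?, matching the desired output for that row. No split() or loop is needed because the separators themselves mark where each suffix goes.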