awk to add text to matching pattern in field

In the awk below I am trying to add :p=? to the end of each ;-separated entry in $9 when $9 matches the pattern NM_. The script executes and is close, but I cannot figure out why the :p=? repeats across the split pieces, as shown in the current output. I have added comments as well. Thank you :).

file

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_001256821:exon6:c.481-7C>T;NM_001256820:exon5:c.322-7C>T;NM_006912:exon6:c.430-7C>T
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-
11	chr18	53070914	53070914	G	A	exonic	TCF4	.	AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

awk

awk '
  BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
  $9 ~ /NM/ {            # look for pattern NM in $9
       # split $9 by ";" and cycle through them
          out=""   # array out is empty
      i=split($9,NM,/;/)
         for (n=1; n<=i; n++) {
          sub(/$/, ":p=", NM)   # add :p. to end off each NM before the ;
          out = (out=="" ? "" : out";") NM  # add ? to each NM and store in array out
         }
      $9 = out  # update with array out
}1' file

desired output

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_001256821:exon6:c.481-7C>T:p=?;NM_001256820:exon5:c.322-7C>T:p=?;NM_006912:exon6:c.430-7C>T:p=?
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_007373:exon4:c.842-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?
11	chr18	53070914	53070914	G	A	exonic	TCF4	.	AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

current output

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_006912:exon6:c.430-7C>T:p=?;NM_006912:exon6:c.430-7C>T:p=?:p=?;NM_006912:exon6:c.430-7C>T:p=?:p=?:p=?
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_001269039:exon2:c.704-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?:p=?
11	chr18	53070914	53070914	G	A	exonic	TCF4	.AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V
awk '
  BEGIN { FS=OFS="\t" }
  $9 ~ /NM/ {
       gsub(";", ":p=?;", $9);
       sub("$", ":p=?", $9);
  } 1' file

Hi cmccabe,
I agree with rdrtx1 that the code suggested in post #2 should do what you want.

What I don't understand is how the code you showed us in post #1 could produce the output that you labeled as "current output" in that post. Are you absolutely positive that the code you showed us in post #1 produced the output you showed us when file had the contents you showed us in that post?

The code you showed us seems like it would produce the desired number of additions to your field #9, but would omit the desired question marks and just append :p= to each subfield. The code to which you appended the comment:

# add ? to each NM and store in array out

does not add question marks; it rebuilds field number 9 by adding back the semicolons that were removed by split(). Also note that the variable named out in your code is a string, not an array.
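To make the point about split() concrete, here is a small sketch. The helper name split_demo and the variables count, parts, and joined are made up for illustration; the sample value is the SHOC2 entry from the file in post #1. It shows that split() returns the number of pieces, that the semicolons are not kept in the pieces, and that the rebuilt value is an ordinary string:

```shell
# Hypothetical helper: split one value on ";" and rebuild it.
split_demo() {
    printf '%s\n' "$1" | awk '{
        count = split($0, parts, /;/)   # parts[1..count]; the ";" separators are dropped
        joined = ""                     # a plain string, not an array
        for (n = 1; n <= count; n++)
            joined = (n == 1 ? "" : joined ";") parts[n]   # put ";" back between pieces
        print count
        print parts[1]
        print joined
    }'
}

split_demo 'NM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-'
# prints:
# 2
# NM_007373:exon4:c.842-35A>-
# NM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-
```

The array is only a temporary holding place; what ends up back in the field is the string built in the loop.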


Thank you very much rdrtx1, that works perfectly :).

Don Cragun, you are correct. I forgot that I had changed

sub(/$/, ":p=", NM)   # add :p. to end off each NM before the ;

to

sub(/$/, ":p=?", NM)   # add :p. to end off each NM before the ;

However, the :p=? seemed to be repeated based on the number of splits. Maybe that is the wrong terminology, but I didn't understand why, no matter what I tried. Thank you for the correction on out being a string rather than an array; I was confused.

awk

awk '
   BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
   $9 ~ /NM/ {            # look for pattern NM in $9
        # split $9 by ";" and cycle through them
           out=""
       i=split($9,NM,/;/)
          for (n=1; n<=i; n++) {
           sub(/$/, ":p=", NM)   # add :p. to end off each NM before the ;
           out = (out=="" ? "" : out";") NM
          }
       $9 = out
}1' file

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_006912:exon6:c.430-7C>T:p=;NM_006912:exon6:c.430-7C>T:p=:p=;NM_006912:exon6:c.430-7C>T:p=:p=:p=
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_001269039:exon2:c.704-35A>-:p=;NM_001269039:exon2:c.704-35A>-:p=:p=
11	chr18	53070914	53070914	G	A	exonic	TCF4	.AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

Are my comments on rdrtx1's awk below close?

awk '
  BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
  $9 ~ /NM/ { # look for pattern NM in $9
       gsub(";", ":p=?;", $9);  # split by ; in $9
       sub("$", ":p=?", $9);  # add :p=? to end of each split by ;
  } 1' file  # update input

Thank you very much :).

Your code had four minor bugs. If you change what you showed us in post #1 to:

awk '
BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
$9 ~ /NM/ {            # look for pattern NM in $9
	# split $9 by ";" and cycle through them
	out=""   # output string starts empty
	i=split($9,NM,/;/)
	for (n=1; n<=i; n++) {
		sub(/$/, ":p=?", NM[n])   # add ":p=?" to end of each NM
		out = (out=="" ? "" : out";") NM[n]  # add updated NM to new output string, restoring ";"s.
	}
	$9 = out  # replace field #9 with updated output string
}1' file

you'll get the output you wanted.
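On the original question of why the suffix kept repeating: the thread never shows exactly what was run, so this is only a guess. One way to get the pattern in the "current output" (the last subfield repeated, with one more :p=? each time) is to index NM with i, the count returned by split(), instead of the loop counter n. The sketch below assumes that is what happened; the function names and the A;B;C sample are made up for illustration.

```shell
# Assumed reproduction: NM is indexed with i (the piece count), so every
# pass through the loop modifies and appends the LAST piece.
append_with_count_index() {
    printf '%s\n' "$1" | awk '{
        out = ""
        i = split($0, NM, /;/)
        for (n = 1; n <= i; n++) {
            sub(/$/, ":p=?", NM[i])                  # same element each time; suffix piles up
            out = (out == "" ? "" : out ";") NM[i]
        }
        print out
    }'
}

# Same loop indexed with the loop counter n.
append_with_loop_index() {
    printf '%s\n' "$1" | awk '{
        out = ""
        i = split($0, NM, /;/)
        for (n = 1; n <= i; n++) {
            sub(/$/, ":p=?", NM[n])                  # each piece gets the suffix once
            out = (out == "" ? "" : out ";") NM[n]
        }
        print out
    }'
}

append_with_count_index 'A;B;C'   # prints C:p=?;C:p=?:p=?;C:p=?:p=?:p=?
append_with_loop_index  'A;B;C'   # prints A:p=?;B:p=?;C:p=?
```

With three pieces, the first version touches NM[3] three times, which matches the shape of the RIT1 line in the current output.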

That said, rdrtx1's code is easier to read and probably faster. Some of your comments on rdrtx1's code are a little bit off. Try changing:

       gsub(";", ":p=?;", $9);  # split by ; in $9
       sub("$", ":p=?", $9);  # add :p=? to end of each split by ;

to:

       gsub(";", ":p=?;", $9);  # prepend ":p=?" to each of the subfield separators.
       sub("$", ":p=?", $9);  # add ":p=?" to end of the last subfield
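If it helps to see what each of the two calls does on its own, here is a small sketch that prints the value after each step. The helper name show_steps is made up for illustration, and it works on a single value passed as $0 rather than on field 9 of the real file:

```shell
# Hypothetical helper: show the value after each of the two calls.
show_steps() {
    printf '%s\n' "$1" | awk '{
        gsub(";", ":p=?;", $0)   # put ":p=?" in front of every ";" separator
        print "after gsub: " $0
        sub("$", ":p=?", $0)     # append ":p=?" once, at the very end
        print "after sub:  " $0
    }'
}

show_steps 'NM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-'
# prints:
# after gsub: NM_007373:exon4:c.842-35A>-:p=?;NM_001269039:exon2:c.704-35A>-
# after sub:  NM_007373:exon4:c.842-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?
```

No splitting happens here: gsub() handles every subfield that is followed by a ";", and sub() handles the last one, which has no separator after it.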