I am trying to use awk to extract and print the first occurrence of NM_ and NP_ in each line, with a : in between. The input file is tab-delimited, but the output does not need to be. The command below does execute, but it prints all the lines in the file, not just the matched patterns. Thank you :).
file (tab-delimited)
Input Variant HGVS description(s) Errors and warnings
rs41302905 NC_000009.11:g.136131316C>T|NC_000009.12:g.133255929C>T|NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A|XM_005276851.1:c.379G>A|XM_005276850.1:c.379G>A|XM_005276849.1:c.745G>A|XM_005276852.1:c.379G>A|XP_005276908.1:p.Gly127Arg|XP_005276907.1:p.Gly127Arg|XP_005276909.1:p.Gly127Arg|XP_005276906.1:p.Gly249Arg|XP_005276905.1:p.Gly267Arg
rs8176745 NC_000009.11:g.136131347G>A|NC_000009.12:g.133255960G>A|NG_006669.1:g.21708C>T|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|NW_009646201.1:g.82053G>A|XM_005276852.1:c.348C>T|XM_005276848.1:c.768C>T|XM_005276851.1:c.348C>T|XM_005276850.1:c.348C>T|XM_005276849.1:c.714C>T|XP_005276909.1:p.Pro116=|XP_005276908.1:p.Pro116=|XP_005276907.1:p.Pro116=|XP_005276906.1:p.Pro238=|XP_005276905.1:p.Pro256=
desired output
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
awk
awk -F'\t' '/NM_/{f=1} && /NP_/{f=2} f{ if(/{/){count++}; print":"; if(/}/){count--; if(count==0) exit}}' file
maybe
awk -F'\t' 'NR > 1 && /NM_/{ # skip header and find NM_ pattern
match($2,/NM_*]/); # match value for NM
NM=substr($2,RSTART+1,RLENGTH-2); # extract value and read into NM
match($2,/NP_*]/); # match value for NP
NP=substr($2,RSTART+1,RLENGTH-2); # extract value and read into NP
for(i=1;i<=NM;i++){ # start loop and iterate over each line in file
print $1, $NM":"$NP # print output with : in between each
} # close block
}1' input > out # define output
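
One possible approach, offered as a sketch rather than a definitive fix: split the second tab-delimited field on | and keep the first piece that starts with NM_ and the first that starts with NP_. The trailing 1 in the attempt above is an always-true pattern, so awk prints every input line, which would explain the output described. The sample file name sample.tsv, the output name out.txt, the shortened rows, and the way the header is split into three columns are assumptions made only so the example runs on its own.

```shell
# Build a small sample (rows shortened from the data above; file name is hypothetical).
printf '%s\t%s\t%s\n' 'Input Variant' 'HGVS description(s)' 'Errors and warnings' > sample.tsv
printf '%s\t%s\t\n' 'rs41302905' \
  'NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A' >> sample.tsv
printf '%s\t%s\t\n' 'rs8176745' \
  'NC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|NW_009646201.1:g.82053G>A' >> sample.tsv

awk -F'\t' '
NR > 1 {                                  # skip the header line
    nm = np = ""                          # reset for every record
    n = split($2, part, "[|]")            # descriptions are pipe-separated
    for (i = 1; i <= n; i++) {
        if (nm == "" && part[i] ~ /^NM_/)
            nm = part[i]                  # first NM_ entry, kept whole
        if (np == "" && part[i] ~ /^NP_/) {
            np = part[i]
            sub(/^[^:]*:/, "", np)        # drop the accession, keep the p. part
        }
    }
    if (nm != "" && np != "")             # print only when both were found
        print $1, nm ":" np
}' sample.tsv > out.txt

cat out.txt
```

On the shortened sample this prints the two lines shown under "desired output". Lines that lack either an NM_ or an NP_ entry are skipped; whether that is the right behaviour for the real file is an open choice.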