awk to extract and print first occurrence of pattern in each line

I am trying to use awk to extract and print the first occurrence of NM_ and NP_ in each line, with a : between them. The input file is tab-delimited, but the output does not need to be. The command below does execute, but it prints every line in the file, not just the patterns. Thank you :).

file (tab-delimited)

Input Variant	HGVS description(s)	Errors and warnings
rs41302905	NC_000009.11:g.136131316C>T|NC_000009.12:g.133255929C>T|NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A|XM_005276851.1:c.379G>A|XM_005276850.1:c.379G>A|XM_005276849.1:c.745G>A|XM_005276852.1:c.379G>A|XP_005276908.1:p.Gly127Arg|XP_005276907.1:p.Gly127Arg|XP_005276909.1:p.Gly127Arg|XP_005276906.1:p.Gly249Arg|XP_005276905.1:p.Gly267Arg		
rs8176745	NC_000009.11:g.136131347G>A|NC_000009.12:g.133255960G>A|NG_006669.1:g.21708C>T|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|NW_009646201.1:g.82053G>A|XM_005276852.1:c.348C>T|XM_005276848.1:c.768C>T|XM_005276851.1:c.348C>T|XM_005276850.1:c.348C>T|XM_005276849.1:c.714C>T|XP_005276909.1:p.Pro116=|XP_005276908.1:p.Pro116=|XP_005276907.1:p.Pro116=|XP_005276906.1:p.Pro238=|XP_005276905.1:p.Pro256=

desired output

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

awk

awk -F'\t' '/NM_/{f=1} && /NP_/{f=2} f{ if(/{/){count++}; print":"; if(/}/){count--; if(count==0) exit}}' file

maybe

awk -F'\t' 'NR > 1 && /NM_/{     # skip header and find NM_ pattern
            match($2,/NM_*]/);   # match value for NM
            NM=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NM
            match($2,/NP_*]/);  # match value for NP
            NP=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NP
                       for(i=1;i<=NM;i++){  # start loop and iterate over each line in file
                          print $1, $NM":"$NP  # print output with : in between each 
                       }  # close block
}1' input > out  # define output
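
A likely reason every line ends up in the output is the trailing 1 after the closing brace. In awk, a pattern with no action defaults to printing the whole record, and 1 is always true, so each input line is printed in full regardless of what the block before it does. A minimal sketch on made-up data (the two-column sample and the variable names with_one and without_one are only for illustration):

```shell
# Made-up sample: a header line plus two tab-delimited records
sample=$(printf 'id\tdata\nrs1\tfoo\nrs2\tbar\n')

# With the trailing 1: the block prints its own line, then 1 prints the full record too
with_one=$(printf '%s\n' "$sample" | awk -F'\t' 'NR > 1 { print "id=" $1 } 1')

# Without it: only the explicit print inside the block runs
without_one=$(printf '%s\n' "$sample" | awk -F'\t' 'NR > 1 { print "id=" $1 }')

printf '%s\n---\n%s\n' "$with_one" "$without_one"
```

The first command emits five lines (the header, then each id= line followed by its full record), the second only the two id= lines.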

Consider the use of match(), which will put your selected text into an array that you can print. From the awk manual:
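
For illustration only (this is not the manual's text): a rough sketch of the array form of match(), assuming GNU awk (gawk), since the third array argument is a gawk extension and is not available in every awk. The function name nm_np_with_array and the array names nm and np are made up for this example. Element 0 of the array holds the whole match and element 1 holds the first parenthesized group.

```shell
# Assumes GNU awk (gawk): the 3-argument match() is a gawk extension.
# nm_np_with_array is a hypothetical helper name; it reads a file argument or stdin.
nm_np_with_array() {
  gawk -F'\t' '
    # Skip the header; require both an NM_ and an NP_ entry in field 2.
    # [^|]+ stops at the next | so only the first entry is captured.
    NR > 1 && match($2, /NM_[^|]+/, nm) && match($2, /NP_[^|:]+:([^|]+)/, np) {
        # nm[0] = whole NM_ entry, np[1] = text after the colon of the NP_ entry
        print $1, nm[0] ":" np[1]
    }' "$@"
}

# usage: nm_np_with_array input > out
```

Because match() reports the leftmost match, this picks up the first NM_ and the first NP_ entry in each line.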


The awk below uses match() as suggested, but it returns multiple lines when it runs. I am not sure what I am doing wrong. Thank you :).

EDIT: The awk below seems to address the duplicates; however, the entire line still prints. Do I need to split $2 on the | and loop through the pieces? Thank you :).

awk

awk -F'\t' 'NR > 1 && ($2 ~ /NM_/ && match($2,/NP_/)) {  # search for NM_ and NP_
    match($2,/NM_.*:/);   # match value for NM_
    NM=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NM, starting with the N (in purple) - this is RSTART and ending at the : - this is RLENGTH, so NM_020469.2
    # Get its length
    lenNM=length(NM)
           match($2,/NP_.:*|/);   # match value for NP_
           NP=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NP, starting with the p (portion in purple) -this is RSTART and ending at the | -  this is RLENGTH, so NP_065202.2:p.Gly268Arg
       # Get its length
         lenNP=length(NP)
           # Cycle through each line
             for (i=1; i<=$lenNM; i++) {
             print $1, $NM":"$NP  # print output with : in between each 
        }  # close block
}1' input > out
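
On the question of splitting $2 on | and looping: that is one workable route. A minimal sketch in portable awk, assuming the layout shown in the sample input (header on line 1, descriptions in field 2); the function name extract_first_nm_np and the variable names are hypothetical:

```shell
# Hypothetical helper: reads a tab-delimited file (or stdin) laid out like the sample input
extract_first_nm_np() {
  awk -F'\t' 'NR > 1 {                        # skip the header line
      nm = np = ""                             # reset for every record
      n = split($2, part, "|")                 # one HGVS description per array element
      for (i = 1; i <= n; i++) {
          if (nm == "" && part[i] ~ /^NM_/) {
              nm = part[i]                     # first NM_ entry, kept whole
          } else if (np == "" && part[i] ~ /^NP_/) {
              np = part[i]                     # first NP_ entry
              sub(/^[^:]*:/, "", np)           # drop the accession, keep the p. part
          }
      }
      if (nm != "" && np != "")
          print $1, nm ":" np
  }' "$@"
}

# usage: extract_first_nm_np input > out
```

There is no trailing 1 here, so only the explicit print produces output, and $1 is the only field referenced with a $; nm and np are ordinary variables.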

I am still a little unclear on the RSTART and RLENGTH concepts, but using line 1 of the input as an example:

The NM variable would be NM_020469.2
The NP variable would be :p.Gly268Arg
I also updated the awk with comments.
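
On RSTART and RLENGTH: after a successful match(), RSTART is the 1-based position where the matched text begins and RLENGTH is the number of characters matched, so substr(s, RSTART, RLENGTH) returns exactly the matched text. Using RSTART+1 and RLENGTH-2 instead trims one character from each end of it. A small sketch on a shortened piece of line 1 (the function name show_match is made up for this example):

```shell
# Hypothetical helper: print RSTART, RLENGTH and the matched text for the first NM_ entry
show_match() {
  awk '{
      # [^|]+ stops at the next |, whereas .* is greedy and would run to the last : in the field
      if (match($0, /NM_[^|]+/))
          print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
  }'
}

printf 'NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg\n' | show_match
```

This prints 24 20 NM_020469.2:c.802G>A: the match starts at character 24 and is 20 characters long. Note also that in awk $NM means "the field whose number is the value of NM", not the variable itself; to print the extracted text, use NM without the $.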