awk to extract and print first occurrence of pattern in each line

cmccabe · September 26, 2017, 8:33am

I am trying to use awk to extract and print the first occurrence of NM_ and NP_ in each line, joined by a :. The input file is tab-delimited, but the output does not need to be. The below does execute but prints all the lines in the file, not just the patterns. Thank you :).

file (tab-delimited)

Input Variant	HGVS description(s)	Errors and warnings
rs41302905	NC_000009.11:g.136131316C>T|NC_000009.12:g.133255929C>T|NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A|XM_005276851.1:c.379G>A|XM_005276850.1:c.379G>A|XM_005276849.1:c.745G>A|XM_005276852.1:c.379G>A|XP_005276908.1:p.Gly127Arg|XP_005276907.1:p.Gly127Arg|XP_005276909.1:p.Gly127Arg|XP_005276906.1:p.Gly249Arg|XP_005276905.1:p.Gly267Arg		
rs8176745	NC_000009.11:g.136131347G>A|NC_000009.12:g.133255960G>A|NG_006669.1:g.21708C>T|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|NW_009646201.1:g.82053G>A|XM_005276852.1:c.348C>T|XM_005276848.1:c.768C>T|XM_005276851.1:c.348C>T|XM_005276850.1:c.348C>T|XM_005276849.1:c.714C>T|XP_005276909.1:p.Pro116=|XP_005276908.1:p.Pro116=|XP_005276907.1:p.Pro116=|XP_005276906.1:p.Pro238=|XP_005276905.1:p.Pro256=

desired output

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

awk

awk -F'\t' '/NM_/{f=1} && /NP_/{f=2} f{ if(/{/){count++}; print":"; if(/}/){count--; if(count==0) exit}}' file

maybe

awk -F'\t' 'NR > 1 && /NM_/{     # skip header and find NM_ pattern
            match($2,/NM_*]/);   # match value for NM
            NM=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NM
            match($2,/NP_*]/);  # match value for NP
            NP=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NP
                       for(i=1;i<=NM;i++){  # start loop and iterate over each line in file
                          print $1, $NM":"$NP  # print output with : in between each 
                       }  # close block
}1' input > out  # define output

jim_mcnamara · September 26, 2017, 12:15pm

Consider the use of match(), which will put your selected text into an array that you can print. From the awk manual:

match(string, regexp [, array])

Search string for the longest, leftmost substring matched by the regular expression regexp and return the character position (index) at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero.

The regexp argument may be either a regexp constant (/.../) or a string constant ("..."). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.

The order of the first two arguments is the opposite of most other string functions that work with regular expressions, such as sub() and gsub(). It might help to remember that for match(), the order is the same as for the '~' operator: 'string ~ regexp'.

The match() function sets the predefined variable RSTART to the index. It also sets the predefined variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to zero, and RLENGTH to -1.
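As a small illustration of RSTART and RLENGTH, here is a sketch in plain POSIX awk. The sample string is a shortened piece of the second column from the question, and the pattern NM_[^|]* is my own assumption about what "one entry" means (everything up to the next |).

```shell
# Shortened sample taken from the HGVS column in the question.
s='NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T'

result=$(printf '%s\n' "$s" | awk '{
    # [^|]* stops at the next "|", so only a single entry is matched
    if (match($0, /NM_[^|]*/))
        # RSTART = index of the first matched character,
        # RLENGTH = number of characters matched
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')
echo "$result"
```

This should print 24 20 NM_020469.2:c.802G>A. Note that substr($0, RSTART, RLENGTH) returns exactly the matched text; adding to RSTART or subtracting from RLENGTH trims characters from the ends of the match.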

For example:

{
    if ($1 == "FIND")
        regex = $2
    else {
        where = match($0, regex)
        if (where != 0)
            print "Match of", regex, "found at", where, "in", $0
    }
}

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is 'FIND', regex is changed to be the second word on that line. Therefore, if given:

FIND ru+n
My program runs
but not very quickly
FIND Melvin
JF+KM
This line is property of Reality Engineering Co.
Melvin was here.

awk prints:

Match of ru+n found at 12 in My program runs
Match of Melvin found at 1 in Melvin was here.

If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. For example:

$ echo foooobazbarrrrr |
> gawk '{ match($0, /(fo+).+(bar*)/, arr)
> print arr[1], arr[2] }'
-| foooo barrrrr

In addition, multidimensional subscripts are available providing the start index and length of each matched subexpression:

$ echo foooobazbarrrrr |
> gawk '{ match($0, /(fo+).+(bar*)/, arr)
> print arr[1], arr[2]
> print arr[1, "start"], arr[1, "length"]
> print arr[2, "start"], arr[2, "length"]
> }'
-| foooo barrrrr
-| 1 5
-| 9 7

There may not be subscripts for the start and index for every parenthesized subexpression, because they may not all have matched text; thus, they should be tested for with the in operator (see Reference to Elements).

The array argument to match() is a gawk extension. In compatibility mode (see Options), using a third argument is a fatal error.
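Applied to the data in this thread, the array form might look like the sketch below. It needs gawk, so the snippet checks for it first. The input line is shortened from the question, and the two patterns are assumptions on my part: the first captures the whole NM_ entry, the second captures only what follows the : in the NP_ entry.

```shell
if command -v gawk >/dev/null 2>&1; then
    # One shortened, tab-delimited sample line from the question.
    result=$(printf 'rs41302905\tNG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A\n' |
        gawk -F'\t' '{
            # nm[1] and np[1] hold the text matched by each parenthesized group
            if (match($2, /(NM_[^|]*)/, nm) && match($2, /NP_[^:|]*:([^|]*)/, np))
                print $1, nm[1] ":" np[1]
        }')
else
    result='gawk not found'
fi
echo "$result"
```

Because match() reports the leftmost match, only the first NM_ and the first NP_ entry in the field are used, even when later entries would also match.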

cmccabe · September 27, 2017, 10:23am

The below utilizes match as suggested but it returns multiple lines after it executes. I am not sure what I am doing wrong. Thank you :).

EDIT: The awk below seems to address the duplicates, however the entire line still prints. Do I need to split $2 by the | and loop through? Thank you :).

awk

awk -F'\t' 'NR > 1 && ($2 ~ /NM_/ && match($2,/NP_/)) {  # search for NM_ and NP_
    match($2,/NM_.*:/);   # match value for NM_
    NM=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NM, starting with the N (in purple) - this is RSTART and ending at the : - this is RLENGTH, so NM_020469.2
    # Get its length
    lenNM=length(NM)
           match($2,/NP_.:*|/);   # match value for NP_
           NP=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NP, starting with the p (portion in purple) -this is RSTART and ending at the | -  this is RLENGTH, so NP_065202.2:p.Gly268Arg
       # Get its length
         lenNP=length(NP)
           # Cycle through each line
             for (i=1; i<=$lenNM; i++) {
             print $1, $NM":"$NP  # print output with : in between each 
        }  # close block
}1' input > out

I am still a little unclear on the RSTART and RLENGTH concepts but, using line 1 of the input as an example:

The NM variable would be NM_020469.2
The NP variable would be :p.Gly268Arg
I also updated the awk with comments.
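On the question of splitting $2 by the | and looping: one possible sketch, in portable awk without match(), is below. The input is a shortened copy of the sample file (header plus two rows), and the variable names nm, np and entry are just illustrative. It keeps the first entry starting with NM_ whole and, from the first entry starting with NP_, keeps everything from the : onward, which corresponds to the NM and NP values described above.

```shell
result=$(
    {
        printf 'Input Variant\tHGVS description(s)\tErrors and warnings\n'
        printf 'rs41302905\tNC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A\t\t\n'
        printf 'rs8176745\tNC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|XM_005276852.1:c.348C>T\n'
    } | awk -F'\t' 'NR > 1 {              # skip the header line
        nm = np = ""
        n = split($2, entry, "|")          # one array element per HGVS entry
        for (i = 1; i <= n; i++) {
            if (nm == "" && entry[i] ~ /^NM_/)
                nm = entry[i]              # first NM_ entry, kept whole
            else if (np == "" && entry[i] ~ /^NP_/)
                np = substr(entry[i], index(entry[i], ":"))  # ":p...." part only
        }
        if (nm != "" && np != "")
            print $1, nm np
    }'
)
echo "$result"
```

On the shortened sample this should print the two lines shown under "desired output". There is no trailing 1 after the closing brace here: in awk a bare 1 is an always-true pattern whose default action prints the whole line, which would explain why every input line appears in the output of the attempts above.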