awk to update field using matching value in file1 and substring in field in file2

In the awk below I am trying to set/update the value of $14 in file2 in
bold, using the matching NM_ in $12 or $9 in file2
with the NM_ in $2 of file1 .
The lengths of $9 and $12 can be variable but what is consistent is the start pattern
will always be NM_ and the end pattern is always ; (semi-colon). or a break (if it is the last)

What is extracted into $14 is all the text from the start to end (string between the NM_ up to the ; or
break.
The value in $7 determines the field to use,that is if $7 is exonic
then $12 is used to extract from. If $7 is not exonic then
$9 is used to extract from. There will always be a value in $7 and
exonic is there the majority of the time, but not always.
I added comments to each line as well in my attempt as to what I think is happening. I hope it is close or a start. Thank you :).

awk

awk -v OFS='\t' 'NR==FNR{split($2,a,"[.]"); k=a[1]; c[k]++} {   # split $2 in file1 on the . and strore valvue in array k using array c to iterate over each line
                     for(i=1;i<=num;i++){ # start for loop in file2 on fields
                         if($7 ==  /exonic/){ # check value in $7 and if it is exonic
                        k=sub(/NM_*;/,"",$12,array); # match the k array from file1 to the string starting with NM_ in $12 up to the ; and read the value into array i
                        $14=array   # set $14 to array i
                                           };  # close block
                     if($7 !=  /exonic/){  # check value in $7 and if it is not exonic
                     k=sub(/NM_*;/,"",$9,array[i);  match the k array from file1 to the string starting with NM_ in $9 up to the ; and read the value into array i
                     $14=array # set $14 to array i
                                        };  # close block
                                       }
                     next  # process next line
                     }1' file1 file2

file1 space delimited

ATP13A2 NM_022089.3
PPT1 NM_000310.3

file2 tab-delimited

2    chr1    17314702    17314702    C    T    exonic    ATP13A2    .    .    synonymous SNV    ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S    rs3738815     .
3    chr1    40562993    40562993    T    C    UTR5    PPT1    NM_001142604:c.-83A>G;NM_000310:c.-83A>G    .    .    .    rs6600313     .

desired output tab-delimited

2    chr1    17314702    17314702    C    T    exonic    ATP13A2    .    .    synonymous SNV    ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S    rs3738815     NM_022089.3:exon25:c.2790G>A:p.S930S
3    chr1    40562993    40562993    T    C    UTR5    PPT1    NM_001142604:c.-83A>G;NM_000310:c.-83A>G    .    .    .    rs6600313     NM_000310.3:c.-83A>G

Sorry cmccabe, I missed the question in your post.

Can you please clearly post your question?

1 Like

Sorry Neo, does the below help?

The NM_ value of $2 in file1 , after splitting on the . , will match a substring NM_ in $12 (the majority of the time), or $9 (in some cases).
The substring that matches is extracted starting from the NM_ until the ; or break (if it is the last value, like in case 1 in the example).
The text in $7 of file2 determines the field to use/ extract from.... that is if $7=exonic , then use $12 , but if $7 is not =exonic , then use $9 .
The extracted value is used to update $14 from a . to the extracted value. Thank you very much :).