In the awk
below I am trying to set/update the value of $14
in file2
in
bold, using the matching NM_
in $12
or $9
in file2
with the NM_
in $2
of file1
.
The lengths of $9
and $12
can be variable but what is consistent is the start pattern
will always be NM_
and the end pattern is always ;
(semi-colon). or a break (if it is the last)
What is extracted into $14
is all the text from the start to end (string between the NM_ up to the ; or
break.
The value in $7
determines the field to use,that is if $7
is exonic
then $12
is used to extract from. If $7
is not exonic
then
$9
is used to extract from. There will always be a value in $7
and
exonic
is there the majority of the time, but not always.
I added comments to each line as well in my attempt as to what I think is happening. I hope it is close or a start. Thank you :).
awk
awk -v OFS='\t' 'NR==FNR{split($2,a,"[.]"); k=a[1]; c[k]++} { # split $2 in file1 on the . and strore valvue in array k using array c to iterate over each line
for(i=1;i<=num;i++){ # start for loop in file2 on fields
if($7 == /exonic/){ # check value in $7 and if it is exonic
k=sub(/NM_*;/,"",$12,array); # match the k array from file1 to the string starting with NM_ in $12 up to the ; and read the value into array i
$14=array # set $14 to array i
}; # close block
if($7 != /exonic/){ # check value in $7 and if it is not exonic
k=sub(/NM_*;/,"",$9,array[i); match the k array from file1 to the string starting with NM_ in $9 up to the ; and read the value into array i
$14=array # set $14 to array i
}; # close block
}
next # process next line
}1' file1 file2
file1 space delimited
ATP13A2 NM_022089.3
PPT1 NM_000310.3
file2 tab-delimited
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 .
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 .
desired output tab-delimited
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089.3:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310.3:c.-83A>G