Trying to use awk
to store the value of $5
in file1
in array x
. That array x is then used to search $4
of file1
to find aa match (I use x[2] to skip the header in file1). Since $4
can have multiple strings in it seperated by a ,
(comma), I split them and iterate througn each split looking for a match. Each split is then stored in array
and compared to x[2]
for a match. The pattern c.
is extracted and stored as VAL
.
The awk
below will hopefully do that but where I am struggling is in order for $6
in file1
to be updated, the $2 and $3 and array value
must match $5 and $6 and $9
in file2
. If that is true then $6
in file1
is updated with the values of $1 and $3
from file2
. Line 4 is an example of this because the NM_000138.4
matches array x[2]
and the c.
value up to the :
(colon) and that value matches $9
in file2
. So all the conditions are meet to update $6
in file1
. Line2 satisfies all but the c.
value in VAL
as it does not match $9
in file2
, so $6
is not updated. I hope this is a good start and that I didn't over-complicate things (though I may have and there is probably a better way). Thank you :).
file1 tab-delimeted
R_Index Chr Start AAChange.refGeneWithVer MajorTranscript HGMD Sanger
1 chr15 48720526 FBN1:NM_000138.4:exon57:c.6997+17C>G:p.? NM_000138.4 . .
2 chr15 48741091 FBN1:NM_000138.4:exon46:c.5546-1G>A:p.? NM_000138.4 . .
3 chr15 48807637 FBN1:NM_000138.4:exon12:c.1415G>A:p.Cys472Tyr NM_000138.4 . .
4 chr15 48741091 FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.? NM_000138.4 . .
file2 tab-delimeted
HGMD ID Disease Variant Class Gene Symbol chromosome coordinate start coordinate end strand hgvs
CS057006 Marfan syndrome DM FBN1 chr15 48802240 48802240 - c.1714+1G>T
CS057007 Marfan syndrome DM FBN1 chr15 48797346 48797346 - c.1838-2A>G
CS057008 Marfan syndrome DM FBN1 chr15 48741091 48741091 - c.5546-1G>T
desired output tab-delimited
R_Index Chr Start AAChange.refGeneWithVer MajorTranscript HGMD Sanger
1 chr15 48720526 FBN1:NM_000138.4:exon57:c.6997+17C>G:p.? NM_000138.4 . .
2 chr15 48741091 FBN1:NM_000138.4:exon46:c.5546-1G>A:p.? NM_000138.4 . .
3 chr15 48807637 FBN1:NM_000138.4:exon12:c.1415G>A:p.Cys472Tyr NM_000138.4 . .
4 chr15 48741091 FBN1:NM_000138.3:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.? NM_000138.4 CS057008 DM .
awk '
BEGIN { FS=OFS="\t" }
FNR==NR {x[NR]=$5} # store value in $5 in array x
#{print X[2]}
$4 ~ x[2] { # if $4 matches x[2]
match($4,"NM"].*],); # regex match from NM to till , in 4rd field
val=substr($4,RSTART+1,RLENGTH-2); # store substring value that starts from RSTART+1 to RLENGTH-2 using $4 in val
NM=split($4, array,","); # Split $4 on "," and storing it's length(array's length) to variable named num.
for(i=1;i<=NM;i++){ # Starting a loop which will start from value 1 of variable i to till value of variable num
if(array ~ x[2]){ # Check condition if any array's value is equal to array x[2] skipping header
if (match(NM,/c[.].:/)) { # extract pattern c. in each split from c. to :
VAL=substr(NM[n],RSTART+2) # store each c. from split in VAL
}
}
}
}
{a[$5,$6,$9]=$1,$3; next} a[$2,$3]{$6=a[$2,$3]}1' file1 file2 # update $6 in file1 if condition is met