awk to update file based on match in 3 fields

cmccabe · March 22, 2018, 8:57am

Trying to use awk to store the value of $5 in file1 in array x . That array x is then used to search $4 of file1 to find aa match (I use x[2] to skip the header in file1). Since $4 can have multiple strings in it seperated by a , (comma), I split them and iterate througn each split looking for a match. Each split is then stored in array and compared to x[2] for a match. The pattern c. is extracted and stored as VAL .
The awk below will hopefully do that but where I am struggling is in order for $6 in file1 to be updated, the $2 and $3 and array value must match $5 and $6 and $9 in file2 . If that is true then $6 in file1 is updated with the values of $1 and $3 from file2 . Line 4 is an example of this because the NM_000138.4 matches array x[2] and the c. value up to the : (colon) and that value matches $9 in file2 . So all the conditions are meet to update $6 in file1 . Line2 satisfies all but the c. value in VAL as it does not match $9 in file2 , so $6 is not updated. I hope this is a good start and that I didn't over-complicate things (though I may have and there is probably a better way). Thank you :).

file1 tab-delimeted

R_Index	Chr	Start	AAChange.refGeneWithVer	MajorTranscript	HGMD	Sanger
1	chr15	48720526	FBN1:NM_000138.4:exon57:c.6997+17C>G:p.?	NM_000138.4	.	.
2	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?	NM_000138.4	.	.
3	chr15	48807637	FBN1:NM_000138.4:exon12:c.1415G>A:p.Cys472Tyr	NM_000138.4	.	.
4	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.?	NM_000138.4	.	.

file2 tab-delimeted

HGMD ID	Disease	Variant Class	Gene Symbol	chromosome	coordinate start	coordinate end	strand	hgvs
CS057006	Marfan syndrome	DM	FBN1	chr15	48802240	48802240	-	c.1714+1G>T
CS057007	Marfan syndrome	DM	FBN1	chr15	48797346	48797346	-	c.1838-2A>G
CS057008	Marfan syndrome	DM	FBN1	chr15	48741091	48741091	-	c.5546-1G>T

desired output tab-delimited

R_Index	Chr	Start	AAChange.refGeneWithVer	MajorTranscript	HGMD	Sanger
1	chr15	48720526	FBN1:NM_000138.4:exon57:c.6997+17C>G:p.?	NM_000138.4	.	.
2	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?	NM_000138.4	.	.
3	chr15	48807637	FBN1:NM_000138.4:exon12:c.1415G>A:p.Cys472Tyr	NM_000138.4	.	.
4	chr15	48741091	FBN1:NM_000138.3:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.?	NM_000138.4	CS057008 DM	.

awk '
  BEGIN { FS=OFS="\t" }
        FNR==NR {x[NR]=$5}  # store value in $5 in array x
        #{print X[2]} 
         $4 ~ x[2] {      # if $4 matches x[2]
         match($4,"NM"].*],);  # regex match from NM to till , in 4rd field
         val=substr($4,RSTART+1,RLENGTH-2); # store substring value that starts from RSTART+1 to RLENGTH-2 using $4 in val
         NM=split($4, array,",");   # Split $4 on "," and storing it's length(array's length) to variable named num. 
             for(i=1;i<=NM;i++){ # Starting a loop which will start from value 1 of variable i to till value of variable num
              if(array ~  x[2]){  # Check condition if any array's value is equal to array x[2] skipping header
                if (match(NM,/c[.].:/)) {  # extract pattern c. in each split from c. to :
                    VAL=substr(NM[n],RSTART+2) # store each c. from split in VAL
                              }
                             }
                            }
                           }
                              {a[$5,$6,$9]=$1,$3; next} a[$2,$3]{$6=a[$2,$3]}1' file1 file2  # update $6 in file1 if condition is met

Chubler_XL · March 25, 2018, 5:47pm

Few things here.

Process file2 first so you can have a[$5,$6,$9] populated when processing files2
Why store x[NR] when you are only interested in x[2] I replaced with FNR==2 {x=$5} to only store $5 from the 2nd line of file1
If $4 matches x split $4 array and then attempt to match "c." value against the a[] values stored from file2

wk '
BEGIN { FS=OFS="\t" }
FNR==NR {a[$5,$6,$9]=$1" "$3; next}
FNR==2 {x=$5}  # store value of $5 from line 2

$4 ~ x {      # if $4 matches x[2]
    NM=split($4, array,",");   # Split $4 on "," and storing in array[]
    for(i=1;i<=NM;i++){ # Loop thru each split value from above
        if(index(array, x) > 0) {  # If x is in this elelemt
            if (match(array, "c[.].*:")) {  # extract pattern c. from split from c. to :
                VAL=substr(array, RSTART, RLENGTH-1) # extrace c. value from split in VAL
                if($2 SUBSEP $3 SUBSEP VAL in a) $6=a[$2,$3,VAL]  # if this is in a[] array update $6
            }
        }
    }
}
1' file2 file1  # update $6 in file1 if condition is met

cmccabe · March 28, 2018, 9:47am

Thank you very much :).