In the out.txt below I am trying to use awk to update the contents of $9. If $9 contains a + or -, then $8 of out.txt is used as a key to look up in $2 of file. When a match is found (there will always be one), the $3 value of file is appended to $9 of out.txt, separated by a :. So the original +6 value in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors that I cannot seem to fix. Thank you :).
out (tab-delimited)
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
3 chr1 949608 949608 G A exonic ISG15 . . nonsynonymous SNV ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
6 chr2 3653844 3653844 T C intronic COLEC11 >50 . . .
7 chr1 154562623 154562625 CCG - intronic ADAR >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
1. check whether $9 in out has a + or - in it
2. read file and store the value of $3 in array a, keyed by $2
3. match each $8 value in out to a key in a and update $9 in out with $3 of file, separated by a :
4. if $9 of out does not have a + or - in it, that line is skipped
desired out (tab-delimited)
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
Lines 1, 3, and 5: $9 is updated with a : and the value of $3 in file.
Lines 2 and 4 are skipped because $9 does not have a + or - in it.
Your question and expected output still do not match: why are lines 3 and 5 updated when they do not contain negative or positive numbers? If that was a typo, could you please try the following?
awk 'FNR==NR{A[$2]=$3;next} ($9 ~ /-[0-9]+$|+[0-9]+$/){Q=$8;for(Q in A){$9=$9":"A[Q]};print}' Input_file out_file
The value to update is only one with a - or +. I apologize for any typo.
Here is how I read the awk, which is much closer than mine :). Thank you very much :).
awk # Invoke awk
FNR==NR # For each line in the 1st input file (file)...
{A[$2]=$3 # ...store the value in field 3 in array A, keyed by the name in field 2
;next} # Skip to the next line and end the block
($9 ~ /-[0-9]+$|+[0-9]+$/) # Check if field 9 in out ends in a - or + followed by digits
{Q=$8; # If it does, copy the contents of field 8 into variable Q
for(Q in A){$9=$9 # For each key Q in array A...
":"A[Q]};print # ...append : and A[Q] to field 9, then print the updated line
OFS="\t" # Use a tab as the output field separator
' file out > new # Define input and output
awk 'FNR==NR{A[$2]=$3;next} ($9 ~ /-[0-9]+$|+[0-9]+$/){Q=$8;for(Q in A){$9=$9":"A[Q]};print}' OFS="\t" file out > new
current new (just the 3 updated lines, each with multiple NM_ values)
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3:NM_024027.4:NM_001111.4:NM_001006658.2 . . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3:NM_024027.4:NM_001111.4:NM_001006658.2 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3:NM_024027.4:NM_001111.4:NM_001006658.2 . . .
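The repeated NM_ values above are what one would expect from for(Q in A): that construct loops over every key stored in A and reassigns Q on each pass, so every $3 from file gets appended. Below is a minimal sketch of the difference between looping and a single lookup. It uses a made-up two-entry array; GENE_X and ID_X are hypothetical names, and only the ISG15 pair comes from this thread.

```shell
# Looping: for (Q in A) visits every key, no matter what Q held before.
looped=$(awk 'BEGIN {
    A["ISG15"] = "NM_005101.3"
    A["GENE_X"] = "ID_X"          # made-up entry for illustration
    Q = "ISG15"; n = 0
    for (Q in A) n++              # Q is overwritten on each pass
    print n
}')

# Direct lookup: (Q in A) is a membership test, A[Q] fetches one value.
direct=$(awk 'BEGIN {
    A["ISG15"] = "NM_005101.3"
    A["GENE_X"] = "ID_X"          # made-up entry for illustration
    Q = "ISG15"
    if (Q in A) print "+6:" A[Q]
}')

echo "$looped"    # number of keys the loop visited
echo "$direct"    # value built from a single lookup
```

With the loop, the count equals the number of keys in A; with the membership test, only the ID for the one gene in Q is used.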
desired new (all lines printed, but only the 3 from above are updated with a :NM_ value)
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
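One possible way to get this result is sketched below; it is not necessarily the only fix. It swaps the for loop for a direct lookup on $8, moves print into its own block so every line is output, and anchors the test on $9 as /^[+-][0-9]+$/ so that values such as NM_005101.3:c.-84C>G are left alone. The contents of file are not shown in this thread, so the two-line lookup file here is an assumption (the index in $1 is made up, and the CR2 pairing is assumed); the three out rows are copied from the sample above.

```shell
# Work in a scratch directory so no existing file is overwritten.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Assumed lookup file: $2 = gene name, $3 = ID to append ($1 is a made-up index).
printf '1\tISG15\tNM_005101.3\n2\tCR2\tNM_001006658.2\n' > file

# Three rows from out (tab-delimited): two signed values and one >50.
printf '4\tchr1\t949925\t949925\tC\tT\tdownstream\tISG15\t+6\t.\t.\t.\n' > out
printf '5\tchr1\t207646923\t207646923\tG\tA\tintronic\tCR2\t>50\t.\t.\t.\n' >> out
printf '8\tchr1\t948840\t948840\t-\tC\tupstream\tISG15\t-6\t.\t.\t.\n' >> out

awk 'BEGIN { FS = OFS = "\t" }                  # tab in, tab out
FNR == NR { A[$2] = $3; next }                  # 1st file: map gene -> ID
$9 ~ /^[+-][0-9]+$/ && ($8 in A) {              # $9 is a signed number, gene is known
    $9 = $9 ":" A[$8]                           # append the one matching ID
}
{ print }                                       # print every line, changed or not
' file out > new

cat new
```

Setting FS and OFS in BEGIN keeps the tab delimiters intact when $9 is reassigned. The bracket expression [+-] also avoids starting a regex alternative with a bare +, which awk implementations may not all treat the same way.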