awk to update value in field of out file using contents of another file

In the out.txt below I am trying to use awk to update the contents of $9. If $9 contains a + or -, then $8 of out.txt is used as a key to look up in $2 of file. When a match is found (there will always be one), the $3 value of file is used to update $9 of out.txt, separated by a :. So the original +6 value in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors that I cannot seem to fix. Thank you :).

out tab-delimited

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	Gene.IDP.refGene	GeneDetail.IDP.refGene	Inheritence	ExonicFunc.IDP.refGene	AAChange.IDP.refGene
1	chr1	948846	948846	-	A	upstream	ISG15	-0	.	.	.
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.
3	chr1	949608	949608	G	A	exonic	ISG15	.	.	nonsynonymous SNV	ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4	chr1	949925	949925	C	T	downstream	ISG15	+6	.	.	.
5	chr1	207646923	207646923	G	A	intronic	CR2	>50	.	.	.
6	chr2	3653844	3653844	T	C	intronic	COLEC11	>50	.	.	.
7	chr1	154562623	154562625	CCG	-	intronic	ADAR	>50	.	.	.
8	chr1	948840	948840	-	C	upstream	ISG15	-6	.	.	.

file space-delimited

2 ISG15 NM_005101.3 948846-948956 949363-949919

awk

awk 'if($9 == "-" || $9 == "+" {printf ":"} FNR==NR{a[$2]=$3; next} a[$9]{$3=a[$8]}1' OFS'\t' out file > result

awk: cmd. line:1: if($9 == "-" || $9 == "+" {printf ":"} FNR==NR{a[$2]=$3; next} a[$8]{$3=a[$8]}1
awk: cmd. line:1: ^ syntax error

Description:

1. check whether $9 in out has a + or - in it

2. using $2 of file as the key, store the value of $3 in array a

3. match each $8 value in out to a key in a and update $9 in out with $3 of file, separated by a :

4. if $9 of out does not have a + or - in it, the line is skipped

desired out tab-delimited

R_Index Chr Start   End Ref Alt Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene  Inheritence ExonicFunc.IDP.refGene  AAChange.IDP.refGene
1   chr1    948846  948846  -   A   upstream    ISG15   -0:NM_005101.3  .   .   .
2   chr1    948870  948870  C   G   UTR5    ISG15   NM_005101.3:c.-84C>G    .   .
4   chr1    949925  949925  C   T   downstream  ISG15   +6:NM_005101.3  .   .   .
5   chr1    207646923   207646923   G   A   intronic    CR2 >50 .   .   .
8   chr1    948840  948840  -   C   upstream    ISG15   -6:NM_005101.3  .   .   .

lines 1, 3, 5 $9 updated with : and value of $3 in file
line 2 and 4 are skipped as these do not have a + or - in them

if($9 == "-" || $9 == "+" {printf ":"}
# should be:
($9 == "-" || $9 == "+") {printf ":"}

Looks like I did something wrong in the fields to update, but that fixed the syntax error. Thank you :).

Hello cmccabe,

Apart from what Jim has stated, the if condition can also be written inside an action block, {if (...) ...}:

{if($9 == "-" || $9 == "+"){printf ":"}}

Also set OFS="\t" in your code.

Thanks,
R. Singh
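
A side note on the condition itself, since the syntax fix alone does not make the fields update: `$9 == "-"` is true only when the field is exactly `-`, whereas the sample values are `-0`, `+6`, and `-6`. A loose match such as `$9 ~ /-/` would also catch `NM_005101.3:c.-84C>G`. The sketch below compares the three tests on values copied from column 9 of out; the variable names `exact`, `loose`, and `anchored` are made up for illustration, and the anchored pattern assumes the values to update are always a sign followed by digits.

```shell
# Values copied from column 9 of out; one per line, so $1 here plays the role of $9.
result=$(printf '%s\n' '-0' '+6' '>50' 'NM_005101.3:c.-84C>G' '.' |
awk -v OFS='\t' '{
    exact    = ($1 == "-" || $1 == "+")    # 1 only if the field is exactly "-" or "+"
    loose    = ($1 ~ /[+-]/)               # 1 if a sign appears anywhere in the field
    anchored = ($1 ~ /^[+-][0-9]+$/)       # 1 only for a sign followed by digits
    print $1, exact, loose, anchored
}')
printf '%s\n' "$result"
```

Only the anchored test selects `-0` and `+6` while leaving the `NM_005101.3:c.-84C>G` row alone.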

Seems closer and executes but the output is empty:

awk

awk -F'\t' '$9 ~ /-/ || $9 ~ /+/ {print $9":"}' out | awk 'FNR==NR {a[$2]=$3; next} a[$8]{$9=a[$8]}1' OFS="\t" file

If I run each command separately I seem to get the output I need. Thank you :).
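
A likely reason the pipeline prints nothing: when awk is given a file operand it does not read standard input, so the second awk only ever sees `file`. Every record then satisfies `FNR==NR` and is discarded by `next`. The first awk also passes on only `$9":"`, so the rest of each line is gone before the second stage. The sketch below shows the stdin behaviour with a throwaway one-line lookup file (the temporary directory and the `piped line` text are placeholders, not data from the thread); adding `-` as a second operand makes awk read standard input after the file.

```shell
tmp=$(mktemp -d)
printf '2 ISG15 NM_005101.3\n' > "$tmp/file"

# Only "file" is an operand: the piped data is never read,
# and the single record from file is dropped by next.
ignored=$(printf 'piped line\n' | awk 'FNR==NR { next } 1' "$tmp/file")

# "-" as the second operand: awk reads file first, then standard input.
seen=$(printf 'piped line\n' | awk 'FNR==NR { next } 1' "$tmp/file" -)

printf 'without "-": [%s]\nwith "-":    [%s]\n' "$ignored" "$seen"
rm -r "$tmp"
```

Reading both files in a single awk call, as in the original attempt, avoids the pipe altogether.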

Hello cmccabe,

Your question and expected output still do not agree: why are lines 3 and 5 updated when they have no negative or positive numbers in them? If that was a typo, could you please try the following.

 awk 'FNR==NR{A[$2]=$3;next} ($9 ~ /-[0-9]+$|+[0-9]+$/){Q=$8;for(Q in A){$9=$9":"A[Q]};print}'   Input_file out_file
 

Thanks,
R. Singh

Only values with a - or + are to be updated; I apologize for any typo.

Here is how I read the awk, which is much closer than mine :). Thank you very much :).

awk           # Invoke awk
FNR==NR       # For each line in the 1st input file (file)...
{A[$2]=$3     # Store the value in field 3 in array A, keyed by the name in field 2
;next}        # Process the next line and end the block
($9 ~ /-[0-9]+$|+[0-9]+$/)   # Check if field 9 in out has a - or + in it
{Q=$8;        # If it does, read the contents of field 8 into Q
for(Q in A){$9=$9 # For each Q in array A
":"A[Q]};print    # Print the updated line with the NM_ value appended, separated by a :
OFS="\t"  # Use a tab as the output field separator
' file out > new  # Define input and output
awk 'FNR==NR{A[$2]=$3;next} ($9 ~ /-[0-9]+$|+[0-9]+$/){Q=$8;for(Q in A){$9=$9":"A[Q]};print}' OFS="\t" file out > new

current new (just the 3 updated lines with multiple NM_ )

1	chr1	948846	948846	-	A	upstream	ISG15	-0:NM_005101.3:NM_024027.4:NM_001111.4:NM_001006658.2	.	.	.
4	chr1	949925	949925	C	T	downstream	ISG15	+6:NM_005101.3:NM_024027.4:NM_001111.4:NM_001006658.2	.	.	.
8	chr1	948840	948840	-	C	upstream	ISG15	-6:NM_005101.3:NM_024027.4:NM_001111.4:NM_001006658.2	.	.	.

desired new (all lines printed, but only the 3 from above are updated with an :NM_ value)

R_Index Chr Start   End Ref Alt Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene  Inheritence ExonicFunc.IDP.refGene  AAChange.IDP.refGene
1   chr1    948846  948846  -   A   upstream    ISG15   -0:NM_005101.3  .   .   .
2   chr1    948870  948870  C   G   UTR5    ISG15   NM_005101.3:c.-84C>G    .   .
4   chr1    949925  949925  C   T   downstream  ISG15   +6:NM_005101.3  .   .   .
5   chr1    207646923   207646923   G   A   intronic    CR2 >50 .   .   .
8   chr1    948840  948840  -   C   upstream    ISG15   -6:NM_005101.3  .   .   .
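
The two differences between current and desired output can be traced to the last block of that command. `for(Q in A)` loops over every key in A and overwrites Q each time, so every stored NM_ value is appended rather than only the one for $8. And `print` sits inside the block guarded by the $9 test, so lines that do not match are never printed. A minimal sketch of one possible fix follows, assuming the values to update are always a sign followed by digits; it uses a few rows copied from out and the one line of file shown in the thread, written to a temporary directory so it runs on its own.

```shell
tmp=$(mktemp -d)

# Lookup file, space-delimited (the single line shown in the thread).
printf '2 ISG15 NM_005101.3 948846-948956 949363-949919\n' > "$tmp/file"

# A few rows of out; tr turns the spaces into tabs to make it tab-delimited.
tr ' ' '\t' > "$tmp/out" <<'EOF'
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
EOF

# file is read with the default whitespace FS; FS and OFS switch to tab
# just before out is read, because assignments are applied in operand order.
awk '
    FNR == NR { a[$2] = $3; next }                            # gene name -> NM_ id
    $9 ~ /^[+-][0-9]+$/ && ($8 in a) { $9 = $9 ":" a[$8] }    # direct lookup, no loop
    1                                                         # print every line of out
' "$tmp/file" FS='\t' OFS='\t' "$tmp/out" > "$tmp/new"

cat "$tmp/new"
```

The direct lookup `a[$8]` appends only the NM_ value for that row's gene, and the trailing `1` prints every line whether or not it was changed.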