Array V-Lookup using UNIX bash

Hey everyone,

I am trying to extract column values from a column in a tab-delimited text file and overlay them in a 2nd tab-delimited text file using a V-lookup type script in Unix bash.

These are the 1st few rows of the 1st input file IN1:

rsid    chromosome    position    allele1    allele2
rs4471    1    82154    T    T
rs313    1    752721    G    G
rs125    1    768448    G    G
rs112    1    798959    G    G
rs668    1    800007    C    C
rs497    1    838555    A    A
rs447    1    846808    T    C
rs753    1    854250    A    G
rs1330    1    861808    G    G
rs111    1    873558    T    G
rs171    1    887162    T    T

These are the 1st rows of the 2nd input file IN2:

rsid    chromosome    position    alleles
rs4471    1    734462    AA
rs125    1    752721    AG
rs497    1    760998    CC
rs1330    1    776546    AA

Need to bring in IN2 column 4 values into IN1 based on column 1 values, however, the IN2 is smaller than IN1

If IN2 is missing 1st column value of IN1, then retain value in IN1. In this example there is no value for rs313, rs112, rs668, etc, so their original values in IN1 are retained.

Desired output, OUT:

rsid    chromosome    position    allele1    allele2
rs4471    1    82154    A    A
rs313    1    752721    G    G
rs125    1    768448    A    G
rs112    1    798959    G    G
rs668    1    800007    C    C
rs497    1    838555    C    C
rs447    1    846808    T    C
rs753    1    854250    A    G
rs1330    1    861808    A    A
rs111    1    873558    T    G
rs171    1    887162    T    T

I have highlighted the rows for which values were found in IN2. I'm using Ubuntu 16.04.

Thanks in advance.

Hi, try:

awk '
  NR==1 {
    next
  } 

  NR==FNR {
    A[$1]=substr($4,1,1)
    B[$1]=substr($4,2)
    next
  } 

  $1 in A {
    $4=A[$1]
    $5=B[$1]
  }
  1
' OFS='\t' IN2 IN1
1 Like

Thanks Scrutinizer, works beautifully !

Try

awk 'NR == FNR {REP[$1] = $4; next} $1 in REP && FNR > 1 {$4 = substr (REP[$1], 1, 1); $5 = substr ( REP[$1], 2, 1)} 1' OFS="\t" file2 file1
rsid    chromosome    position    allele1    allele2
rs4471   1    82154     A    A
rs313    1    752721    G    G
rs125    1    768448    A    G
rs112    1    798959    G    G
rs668    1    800007    C    C
rs497    1    838555    C    C
rs447    1    846808    T    C
rs753    1    854250    A    G
rs1330   1    861808    A    A
rs111    1    873558    T    G
rs171    1    887162    T    T
1 Like

Hello Geneanalys,

Could you please try following and let me know if this helps you. (tested with GNU awk )

awk 'FNR==NR{num=split($NF,array,"");for(i=1;i<=num;i++){val=val?val OFS array:array};a[$1]=val;$NF="";b[$1]=$0;val="";next} FNR>1{printf("%s",$1 in a?b[$1] a[$1] OFS val ORS:$0 ORS);next} 1'  in2  in1 | column -t

Thanks,
R. Singh

1 Like

Scrutinizer,

Could you please post explanation for the commands.

---------- Post updated at 03:36 PM ---------- Previous update was at 03:28 PM ----------

Scrutinizer or RudiC,

Is there a way to insert a counter to see the number of rows for which there were values in IN2?

Try

awk 'NR == FNR {REP[$1] = $4; next} $1 in REP && FNR > 1 {$4 = substr (REP[$1], 1, 1); $5 = substr ( REP[$1], 2, 2); CNT++} 1; END {print CNT}' OFS="\t" file2 file1
1 Like

RudyC,

Would the following modification to your script be adequate if the input file IN1 was like this instead:

awk 'NR == FNR {REP[$1] = $4; next} $1 in REP && FNR > 1 { $4 = REP[$1]; CNT++} 1; END {print CNT}' OFS="\t"
rsid    chromosome    position    alleles
rs4471    1    82154    TT
rs313    1    752721    GG
rs125    1    768448    GG
rs112    1    798959    GG
rs668    1    800007    CC
rs497    1    838555    AA
rs447    1    846808    TC
rs753    1    854250    AG
rs1330    1    861808    GG
rs111    1    873558    TG
rs171    1    887162    TT

And the desired output OUT needed to be :

rsid    chromosome    position    alleles
rs4471    1    82154    AA
rs313    1    752721    GG
rs125    1    768448    AG
rs112    1    798959    GG
rs668    1    800007    CC
rs497    1    838555    CC
rs447    1    846808    TC
rs753    1    854250    AG
rs1330    1    861808    AA
rs111    1    873558    TG
rs171    1    887162    TT

Hello Geneanalyst,

If your Input_file is same as shown samples then following may help you on same.

awk 'FNR==NR&&FNR>1{a[$1]=$NF;next} FNR>1{if(a[$1]){$NF=$(NF-1)=""} else {$(NF-1)=OFS $(NF-1) $NF;$NF=""};print $0 a[$1]}' OFS="\t" in2 in1

Thanks,
R. Singh

1 Like

What's the result you get when running your adapted code?

Looks like it works.