Array V-Lookup using UNIX bash

Geneanalyst · April 28, 2018, 1:49pm

Hey everyone,

I am trying to extract column values from a column in a tab-delimited text file and overlay them in a 2nd tab-delimited text file using a V-lookup type script in Unix bash.

These are the 1st few rows of the 1st input file IN1:

rsid    chromosome    position    allele1    allele2
rs4471    1    82154    T    T
rs313    1    752721    G    G
rs125    1    768448    G    G
rs112    1    798959    G    G
rs668    1    800007    C    C
rs497    1    838555    A    A
rs447    1    846808    T    C
rs753    1    854250    A    G
rs1330    1    861808    G    G
rs111    1    873558    T    G
rs171    1    887162    T    T

These are the 1st rows of the 2nd input file IN2:

rsid    chromosome    position    alleles
rs4471    1    734462    AA
rs125    1    752721    AG
rs497    1    760998    CC
rs1330    1    776546    AA

Need to bring in IN2 column 4 values into IN1 based on column 1 values, however, the IN2 is smaller than IN1

If IN2 is missing 1st column value of IN1, then retain value in IN1. In this example there is no value for rs313, rs112, rs668, etc, so their original values in IN1 are retained.

Desired output, OUT:

rsid    chromosome    position    allele1    allele2
rs4471    1    82154    A    A
rs313    1    752721    G    G
rs125    1    768448    A    G
rs112    1    798959    G    G
rs668    1    800007    C    C
rs497    1    838555    C    C
rs447    1    846808    T    C
rs753    1    854250    A    G
rs1330    1    861808    A    A
rs111    1    873558    T    G
rs171    1    887162    T    T

I have highlighted the rows for which values were found in IN2. I'm using Ubuntu 16.04.

Thanks in advance.

Scrutinizer · April 28, 2018, 2:39pm

Hi, try:

awk '
  NR==1 {
    next
  } 

  NR==FNR {
    A[$1]=substr($4,1,1)
    B[$1]=substr($4,2)
    next
  } 

  $1 in A {
    $4=A[$1]
    $5=B[$1]
  }
  1
' OFS='\t' IN2 IN1

Geneanalyst · April 28, 2018, 2:55pm

Thanks Scrutinizer, works beautifully !

RudiC · April 28, 2018, 3:13pm

Try

awk 'NR == FNR {REP[$1] = $4; next} $1 in REP && FNR > 1 {$4 = substr (REP[$1], 1, 1); $5 = substr ( REP[$1], 2, 1)} 1' OFS="\t" file2 file1
rsid    chromosome    position    allele1    allele2
rs4471   1    82154     A    A
rs313    1    752721    G    G
rs125    1    768448    A    G
rs112    1    798959    G    G
rs668    1    800007    C    C
rs497    1    838555    C    C
rs447    1    846808    T    C
rs753    1    854250    A    G
rs1330   1    861808    A    A
rs111    1    873558    T    G
rs171    1    887162    T    T

RavinderSingh13 · April 28, 2018, 3:33pm

Hello Geneanalys,

Could you please try following and let me know if this helps you. (tested with GNU awk )

awk 'FNR==NR{num=split($NF,array,"");for(i=1;i<=num;i++){val=val?val OFS array:array};a[$1]=val;$NF="";b[$1]=$0;val="";next} FNR>1{printf("%s",$1 in a?b[$1] a[$1] OFS val ORS:$0 ORS);next} 1'  in2  in1 | column -t

Thanks,
R. Singh

Geneanalyst · April 28, 2018, 4:36pm

Scrutinizer,

Could you please post explanation for the commands.

---------- Post updated at 03:36 PM ---------- Previous update was at 03:28 PM ----------

Scrutinizer or RudiC,

Is there a way to insert a counter to see the number of rows for which there were values in IN2?

RudiC · April 28, 2018, 6:37pm

Try

awk 'NR == FNR {REP[$1] = $4; next} $1 in REP && FNR > 1 {$4 = substr (REP[$1], 1, 1); $5 = substr ( REP[$1], 2, 2); CNT++} 1; END {print CNT}' OFS="\t" file2 file1

Geneanalyst · April 29, 2018, 11:14pm

RudyC,

Would the following modification to your script be adequate if the input file IN1 was like this instead:

awk 'NR == FNR {REP[$1] = $4; next} $1 in REP && FNR > 1 { $4 = REP[$1]; CNT++} 1; END {print CNT}' OFS="\t"

rsid    chromosome    position    alleles
rs4471    1    82154    TT
rs313    1    752721    GG
rs125    1    768448    GG
rs112    1    798959    GG
rs668    1    800007    CC
rs497    1    838555    AA
rs447    1    846808    TC
rs753    1    854250    AG
rs1330    1    861808    GG
rs111    1    873558    TG
rs171    1    887162    TT

And the desired output OUT needed to be :

rsid    chromosome    position    alleles
rs4471    1    82154    AA
rs313    1    752721    GG
rs125    1    768448    AG
rs112    1    798959    GG
rs668    1    800007    CC
rs497    1    838555    CC
rs447    1    846808    TC
rs753    1    854250    AG
rs1330    1    861808    AA
rs111    1    873558    TG
rs171    1    887162    TT

RavinderSingh13 · April 30, 2018, 12:45am

Hello Geneanalyst,

If your Input_file is same as shown samples then following may help you on same.

awk 'FNR==NR&&FNR>1{a[$1]=$NF;next} FNR>1{if(a[$1]){$NF=$(NF-1)=""} else {$(NF-1)=OFS $(NF-1) $NF;$NF=""};print $0 a[$1]}' OFS="\t" in2 in1

Thanks,
R. Singh

RudiC · April 30, 2018, 3:45am

What's the result you get when running your adapted code?

Geneanalyst · May 1, 2018, 12:28am

Looks like it works.