I am trying to extract column values from a column in a tab-delimited text file and overlay them in a 2nd tab-delimited text file using a V-lookup type script in Unix bash.
These are the 1st few rows of the 1st input file IN1:
rsid chromosome position allele1 allele2
rs4471 1 82154 T T
rs313 1 752721 G G
rs125 1 768448 G G
rs112 1 798959 G G
rs668 1 800007 C C
rs497 1 838555 A A
rs447 1 846808 T C
rs753 1 854250 A G
rs1330 1 861808 G G
rs111 1 873558 T G
rs171 1 887162 T T
These are the 1st rows of the 2nd input file IN2:
rsid chromosome position alleles
rs4471 1 734462 AA
rs125 1 752721 AG
rs497 1 760998 CC
rs1330 1 776546 AA
Need to bring in IN2 column 4 values into IN1 based on column 1 values, however, the IN2 is smaller than IN1
If IN2 is missing 1st column value of IN1, then retain value in IN1. In this example there is no value for rs313, rs112, rs668, etc, so their original values in IN1 are retained.
Desired output, OUT:
rsid chromosome position allele1 allele2
rs4471 1 82154 A A
rs313 1 752721 G G
rs125 1 768448 A G
rs112 1 798959 G G
rs668 1 800007 C C
rs497 1 838555 C C
rs447 1 846808 T C
rs753 1 854250 A G
rs1330 1 861808 A A
rs111 1 873558 T G
rs171 1 887162 T T
I have highlighted the rows for which values were found in IN2. I'm using Ubuntu 16.04.
awk 'NR == FNR {REP[$1] = $4; next} $1 in REP && FNR > 1 {$4 = substr (REP[$1], 1, 1); $5 = substr ( REP[$1], 2, 1)} 1' OFS="\t" file2 file1
rsid chromosome position allele1 allele2
rs4471 1 82154 A A
rs313 1 752721 G G
rs125 1 768448 A G
rs112 1 798959 G G
rs668 1 800007 C C
rs497 1 838555 C C
rs447 1 846808 T C
rs753 1 854250 A G
rs1330 1 861808 A A
rs111 1 873558 T G
rs171 1 887162 T T