Match Columns in one file and extract columns from another file

genehunter · August 16, 2017, 5:18pm

Kindly help with merging information from two files with the following data structure.
I want to match CHR and SNP in foo against CHROM and rsID in bar and get the columns from the matching lines.
Fields 1 and 2 of foo may have duplicates; however, a joint key of fields $1, $2, $3 and $4 is unique.
It would also be helpful to clean up the column delimiter so that runs of spaces are converted to a single tab.
awk preferred.
Many thanks
~GH
File foo:

CHR                 SNP   A1   A2          MAF  NCHROBS
   1          rs10005934    A    C       0.0038      452
   1          rs10015934    A    G       0.0038      452
   1            rs710870    A    G       0.4004      452
   1           rs2073105    G    A         0.25      452
   1            rs710871    A    G      0.01549      452
   1            exm25630    0    G            0      452

File bar:

CHROM     POS     rsID    cM      A1      A2   
   1       202183358    rs10005934    200.23    A    C
   1       202183358    rs10015934    200.23    A    G
   1       222445567     rs710870    51.21      A    G
   1       235658554     rs2073105    25.84      G    A
  10          27436462     rs1234566    1.52      D    I

Required file foobar:

CHR                 SNP   A1   A2          MAF  NCHROBS     CHROM     POS     rsID    cM      A1      A2  
   1          rs10005934    A    C       0.0038      452        1       202183358    rs10005934    200.23    A    C            
   1          rs10015934    A    G       0.0038      452        1       202183358    rs10015934    200.23    A    G     
   1            rs710870    A    G       0.4004      452        1       222445567     rs710870    51.21      A    G     
   1           rs2073105    G    A         0.25      452        1       235658554     rs2073105    25.84      G    A

rdrtx1 · August 16, 2017, 9:47pm

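awk '
# First line of each file: print the foo header, a tab, then the bar header and a newline.
FNR==1  { printf "%s%s", $0, ((c++) ? "\n" : "\t") }
# First file (foo): store each line under the key CHR,SNP.
NR==FNR {a[$1,$2]=$0; next}

# Second file (bar): if CHROM,rsID was seen in foo, print both lines joined by a tab.
a[$1,$3] { print a[$1,$3] "\t" $0 }
' foo bar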

RudiC · August 17, 2017, 2:55am

While rdrtx1's proposal works fine for the samples given, it doesn't handle the duplicates mentioned, which the samples don't contain. Nor does it fulfil the request for <TAB> field separators in the result. Try

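awk '
                # Every line: build the four-field key for each file layout,
                # then rebuild $0 so that all separators become OFS (a tab).
                {IX1 = $1 OFS $2 OFS $3 OFS $4
                 IX2 = $1 OFS $3 OFS $5 OFS $6
                 $1 = $1
                }

                # Header lines: first header followed by a tab, second by a newline.
FNR==1          {printf "%s%s", $0, ((c++) ? "\n" : "\t")
                }

                # First file: store the cleaned line under CHR, SNP, A1, A2.
NR==FNR         {a[IX1] = $0
                 next
                }

                # Second file: print the stored line plus this one if the key matches.
a[IX2]          {print a[IX2] "\t" $0
                }
' OFS="\t" file[12]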
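
To see why the four-field key matters, here is a small self-contained sketch. It is a variant of the approach above, not the exact code: it tests `k2 in a` instead of `a[IX2]` and collects the header in a variable. The file names `file1`/`file2` and the `rs1`/`rs2` rows are made up for illustration; `rs1` appears twice in the foo-style file with different alleles.

```shell
# Scratch directory; file1/file2 stand in for foo/bar (hypothetical data).
tmp=$(mktemp -d)

# file1 (foo layout): CHR SNP A1 A2 MAF NCHROBS, with rs1 duplicated.
cat > "$tmp/file1" <<'EOF'
CHR   SNP   A1   A2   MAF  NCHROBS
   1   rs1    A    C   0.10   452
   1   rs1    A    G   0.20   452
EOF

# file2 (bar layout): CHROM POS rsID cM A1 A2
cat > "$tmp/file2" <<'EOF'
CHROM  POS  rsID  cM  A1  A2
   1   1000  rs1   1.5   A   G
   1   2000  rs2   2.5   A   C
EOF

result=$(awk '
        { k1 = $1 OFS $2 OFS $3 OFS $4      # key in the foo layout
          k2 = $1 OFS $3 OFS $5 OFS $6      # key in the bar layout
          $1 = $1 }                         # rebuild $0 with tabs
FNR==1  { hdr = hdr (hdr == "" ? "" : OFS) $0   # join both headers
          if (NR != FNR) print hdr
          next }
NR==FNR { a[k1] = $0; next }                # first file: remember the line
k2 in a { print a[k2], $0 }                 # second file: print on a match
' OFS='\t' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
```

With this input the result is the combined header plus a single data line, the `rs1` A/G row. A key of only CHR and SNP would have let the second `rs1` line overwrite the first in the array.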

genehunter · August 22, 2017, 3:00pm

Can you please explain the code?
Does

$1=$1

help clean up the first column or print the first line?
It would be very useful for understanding the usage and learning awk if you could write a few words of explanation.
Thanks

RudiC · August 22, 2017, 3:46pm

Neither ... nor. The $1 = $1 trick replaces ALL field separators (multiples as well) with the OFS char without modifying the fields' contents. man awk :
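
A small sketch of that effect (an illustration, not quoted from the man page), using one line from foo. The variable names `line`, `raw` and `clean` are just for this example, and the final awk call only makes the tabs visible.

```shell
# One line in the foo layout, with leading and repeated spaces.
line='   1          rs10005934    A    C       0.0038      452'

# No field is assigned, so $0 is printed exactly as read.
raw=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ print }')

# Assigning to any field makes awk rebuild $0 from $1..$NF joined by OFS.
clean=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $1 = $1; print }')

# Show both versions with tabs made visible.
printf '%s\n' "$raw" "$clean" | awk '{ gsub(/\t/, "<TAB>"); print }'
```

The first output line keeps its original spacing; the second reads `1<TAB>rs10005934<TAB>A<TAB>C<TAB>0.0038<TAB>452`. The field values themselves are untouched, only the separators change, and the leading blanks disappear because the default field splitting ignores them.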