Merging multiple files from multiple columns

bartman2099 · October 16, 2012, 12:19pm

Hi guys,
I have very basic linux experience so I need some help with a problem.
I have 3 files from which I want to extract columns based on common fields between them.

File1:

--- rs74078040 NA 51288690 T G 461652 0.99223 0.53611 3
--- rs77209296 NA 51303525 T G 461843 0.98973 0.60837 3
--- rs7985170 NA 51304203 C T 461852 0.99215 0.541 3

File2:

id rsid chromosome pos allele_A allele_B index average_call info
--- rs74078040 NA 51288690 T G 461652 0.99852 0.52725 2
--- rs77209296 NA 51303525 T G 461843 0.99992 0.90562 2
--- rs7985170 NA 51304203 C T 461852 0.99994 0.93283 2
--- rs145780716 NA 19032912 G A 2054321 0.9988 0.026123 2

File3:
id rsid chromosome pos allele_A allele_B index average_call info

--- rs74078040 NA 51288690 T G 461652 0.956 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.9843 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.9323 0.6754 1
--- rs145780716 NA 19032912 G A 2054321 0.9098 0.7632 1

I want to take all the rows from File1 and say columns 1-7 (which are the same in all the files) and then add column 8 from File 2 and columns 9 and 10 from File3.
Note that File1 doesn't have a header but the other files do. Also, File1 has less lines than the other files which have the same number of lines.

The output would look like:

--- rs74078040 NA 51288690 T G 461652 0.99852 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.99992 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.99994 0.6754 1

Any help would be greatly appreciated.
Thanks!

guruprasadpr · October 16, 2012, 12:55pm

Hi

Assuming your files are a1, a2 and a3:

$ join -1 2 -2 2 -o  1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.8 a1 a2  | join -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.9,2.10 - a3
--- rs74078040 NA 51288690 T G 461652 0.99852 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.99992 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.99994 0.6754 1

Guru.

bartman2099 · October 17, 2012, 5:43am

Hi guruprasadpr,
Many thanks for your reply. I have tried the command you suggest but it tells me: "join: File 2 is not in sorted order" which I think probably comes from the header line because column 4 is in sorted order.
Any suggestions how to get around that problem? I don't want to make a copy of the files without headers because they are quite big and our disk space is limited.

pamu · October 17, 2012, 6:15am

If you don't mind using two awk..

try

awk 'FNR==NR{$NF="";$(NF-1)="";$(NF-2)="";a[$2]=$0;next}{if(a[$2]){print a[$2],$8}}' file1 file2 > temp

awk 'FNR==NR{a[$2]=$0;next}{if(a[$2]){print a[$2],$9,$10}}' temp file3

bartman2099 · October 17, 2012, 6:19am

Hi pamu, I wouldn't mind using more than one step even though in reality I have 5 files instead of 3.
Could you broadly explain the awk syntax please? My awk knowledge is almost zero.
Thanks for your reply!

pamu · October 17, 2012, 6:31am

It's Better that you should read some basic of awk then

Read this first..

then read.. this:D

awk 'FNR==NR{$NF="";$(NF-1)="";$(NF-2)="";a[$2]=$0;next}   # Read first file, make $NF,$(NF-1) and $(NF-2) NULL as we don't need those. Create an array a with $2 is index and which stores $0 ( whole line).                                                            

{if(a[$2]){print a[$2],$8}}' file1 file2 > temp   # Here we are reading second file. Check if $2 present in this file or not. If it is present then file1 line matches with file2 line then print line from file1 and append $8 as you have requested. And write this to a new temp file.

awk 'FNR==NR{a[$2]=$0;next}  # Read temp file create an array a, which stores $0(whole line)

{if(a[$2]){print a[$2],$9,$10}}' temp file3 # read second file. If there is match between files then print temp file line and append $9 and $10 from file3 as per your required output.

Scrutinizer · October 17, 2012, 6:32am

One awk:

awk '
  f==2{
    F2[$2]=$8
  } 
  f==3{
    F3a[$2]=$9
    F3b[$2]=$10
  }
  f==1{
    $8=F2[$2]
    $9=F3a[$2]
    $10=F3b[$2]
    print
  }
' f=2 file2 f=3 file3 f=1 file1

bartman2099 · October 17, 2012, 9:40am

Thanks pamu and Scrutinizer. I will try your suggestions and read up a bit on awk and will let you guys know.
Thanks a lot!

---------- Post updated at 02:40 PM ---------- Previous update was at 11:39 AM ----------

pamu, your code actually works very well! I would just need to adapt it because my files don't really have 10 columns, more like 80 but the principle is the same. It would just be quite a long command.

Scrutinizer, your code works as well but what if file1 has more columns? How could I remove some of them if I don't want them?
Also, do I have to run the command with the spacing as is? When I tried to put into one long line I got some syntax errors.

awk: f==2{F2[$2]=$8} f==3{F3a[$2]=$9 F3b[$2]=$10} f==1{$8=F2[$2] $9=F3a[$2] $10=F3b[$2] print}
awk:                                        ^ syntax error
awk: f==2{F2[$2]=$8} f==3{F3a[$2]=$9 F3b[$2]=$10} f==1{$8=F2[$2] $9=F3a[$2] $10=F3b[$2] print}
awk:                                                               ^ syntax error
awk: f==2{F2[$2]=$8} f==3{F3a[$2]=$9 F3b[$2]=$10} f==1{$8=F2[$2] $9=F3a[$2] $10=F3b[$2] print}
awk:                                                                           ^ syntax error
awk: f==2{F2[$2]=$8} f==3{F3a[$2]=$9 F3b[$2]=$10} f==1{$8=F2[$2] $9=F3a[$2] $10=F3b[$2] print}
awk:                                                                                    ^ syntax error

Thanks for all your help.

rdrtx1 · October 17, 2012, 9:45am

If joining the lines in previous awk post then include a ; between joined lines.

Scrutinizer · October 17, 2012, 9:55am

Like rdrtx1 says, use semicolons, so it would look like this:

awk 'f==2{F2[$2]=$8} f==3{F3a[$2]=$9; F3b[$2]=$10} f==1{ $8=F2[$2]; $9=F3a[$2]; $10=F3b[$2]; print}' f=2 file2 f=3 file3 f=1 file1

But that also makes it more difficult to read IMO.

--
To remove columns, there are several methods. What does file1 look like?

bartman2099 · October 17, 2012, 10:21am

Ok, so this is a full row from file1 which has 78 columns:

--- rs144097162 NA 58246214 C A 546966 0.99949 0.73373 1419.4 3.594 0 0 1189.9 0.129 0 0 843.87 2.127 0 0 1307.7 2.272 0 0 1246.7 0.291 0 0 2235.2 0.837 0 0 624.04 6.957 0 0 2696.9 2.146 0 0 11564 18.353 0 0 4134 12.969 0 7429.6 5.384 0 0.00079231 0.0015637 0.00036207 0 1 1 1 1 1 1 1 1 1 1 4.3291 1.5837 11.833 -1 -1 -1 4.3239 1.5824 11.815 9.5513e-07 0.5773 3.389 0.69154

Your command as far as I understand takes the values i want from files 2 and 3 and puts them in the same positions in file1 and prints the rest of file 1 as is. Let's say I want to do all that and on top keep a subset of columns, say column 65 and 75 from the remaining columns. Is that clear? Could I just replace "print" with "print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$65,$75" since now $8,$9 and $10 have the values from the other files?

I Think it does work!

---------- Post updated at 03:21 PM ---------- Previous update was at 03:08 PM ----------

Thanks again for all your help guys!