Merging multiple files from multiple columns

Hi guys,
I have very basic linux experience so I need some help with a problem.
I have 3 files from which I want to extract columns based on common fields between them.

File1:

--- rs74078040 NA 51288690 T G 461652 0.99223 0.53611 3
--- rs77209296 NA 51303525 T G 461843 0.98973 0.60837 3
--- rs7985170 NA 51304203 C T 461852 0.99215 0.541 3

File2:

id rsid chromosome pos allele_A allele_B index average_call info
--- rs74078040 NA 51288690 T G 461652 0.99852 0.52725 2
--- rs77209296 NA 51303525 T G 461843 0.99992 0.90562 2
--- rs7985170 NA 51304203 C T 461852 0.99994 0.93283 2
--- rs145780716 NA 19032912 G A 2054321 0.9988 0.026123 2

File3:
id rsid chromosome pos allele_A allele_B index average_call info

--- rs74078040 NA 51288690 T G 461652 0.956 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.9843 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.9323 0.6754 1
--- rs145780716 NA 19032912 G A 2054321 0.9098 0.7632 1 

I want to take all the rows from File1 and say columns 1-7 (which are the same in all the files) and then add column 8 from File 2 and columns 9 and 10 from File3.
Note that File1 doesn't have a header but the other files do. Also, File1 has less lines than the other files which have the same number of lines.

The output would look like:

--- rs74078040 NA 51288690 T G 461652 0.99852 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.99992 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.99994 0.6754 1

Any help would be greatly appreciated.
Thanks!

Hi

Assuming your files are a1, a2 and a3:

$ join -1 2 -2 2 -o  1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.8 a1 a2  | join -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.9,2.10 - a3
--- rs74078040 NA 51288690 T G 461652 0.99852 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.99992 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.99994 0.6754 1

Guru.

Hi guruprasadpr,
Many thanks for your reply. I have tried the command you suggest but it tells me: "join: File 2 is not in sorted order" which I think probably comes from the header line because column 4 is in sorted order.
Any suggestions how to get around that problem? I don't want to make a copy of the files without headers because they are quite big and our disk space is limited.

If you don't mind using two awk..

try

awk 'FNR==NR{$NF="";$(NF-1)="";$(NF-2)="";a[$2]=$0;next}{if(a[$2]){print a[$2],$8}}' file1 file2 > temp

awk 'FNR==NR{a[$2]=$0;next}{if(a[$2]){print a[$2],$9,$10}}' temp file3

Hi pamu, I wouldn't mind using more than one step even though in reality I have 5 files instead of 3.
Could you broadly explain the awk syntax please? My awk knowledge is almost zero.
Thanks for your reply!

It's Better that you should read some basic of awk then :slight_smile:

Read this first..

then read.. this:D

awk 'FNR==NR{$NF="";$(NF-1)="";$(NF-2)="";a[$2]=$0;next}   # Read first file, make $NF,$(NF-1) and $(NF-2) NULL as we don't need those. Create an array a with $2 is index and which stores $0 ( whole line).                                                            

{if(a[$2]){print a[$2],$8}}' file1 file2 > temp   # Here we are reading second file. Check if $2 present in this file or not. If it is present then file1 line matches with file2 line then print line from file1 and append $8 as you have requested. And write this to a new temp file.

awk 'FNR==NR{a[$2]=$0;next}  # Read temp file create an array a, which stores $0(whole line)

{if(a[$2]){print a[$2],$9,$10}}' temp file3 # read second file. If there is match between files then print temp file line and append $9 and $10 from file3 as per your required output.

One awk:

awk '
  f==2{
    F2[$2]=$8
  } 
  f==3{
    F3a[$2]=$9
    F3b[$2]=$10
  }
  f==1{
    $8=F2[$2]
    $9=F3a[$2]
    $10=F3b[$2]
    print
  }
' f=2 file2 f=3 file3 f=1 file1
1 Like

Thanks pamu and Scrutinizer. I will try your suggestions and read up a bit on awk and will let you guys know.
Thanks a lot!

---------- Post updated at 02:40 PM ---------- Previous update was at 11:39 AM ----------

pamu, your code actually works very well! I would just need to adapt it because my files don't really have 10 columns, more like 80 but the principle is the same. It would just be quite a long command.

Scrutinizer, your code works as well but what if file1 has more columns? How could I remove some of them if I don't want them?
Also, do I have to run the command with the spacing as is? When I tried to put into one long line I got some syntax errors.

awk: f==2{F2[$2]=$8} f==3{F3a[$2]=$9 F3b[$2]=$10} f==1{$8=F2[$2] $9=F3a[$2] $10=F3b[$2] print}
awk:                                        ^ syntax error
awk: f==2{F2[$2]=$8} f==3{F3a[$2]=$9 F3b[$2]=$10} f==1{$8=F2[$2] $9=F3a[$2] $10=F3b[$2] print}
awk:                                                               ^ syntax error
awk: f==2{F2[$2]=$8} f==3{F3a[$2]=$9 F3b[$2]=$10} f==1{$8=F2[$2] $9=F3a[$2] $10=F3b[$2] print}
awk:                                                                           ^ syntax error
awk: f==2{F2[$2]=$8} f==3{F3a[$2]=$9 F3b[$2]=$10} f==1{$8=F2[$2] $9=F3a[$2] $10=F3b[$2] print}
awk:                                                                                    ^ syntax error

Thanks for all your help.

If joining the lines in previous awk post then include a ; between joined lines.

1 Like

Like rdrtx1 says, use semicolons, so it would look like this:

awk 'f==2{F2[$2]=$8} f==3{F3a[$2]=$9; F3b[$2]=$10} f==1{ $8=F2[$2]; $9=F3a[$2]; $10=F3b[$2]; print}' f=2 file2 f=3 file3 f=1 file1

But that also makes it more difficult to read IMO.

--
To remove columns, there are several methods. What does file1 look like?

1 Like

Ok, so this is a full row from file1 which has 78 columns:

--- rs144097162 NA 58246214 C A 546966 0.99949 0.73373 1419.4 3.594 0 0 1189.9 0.129 0 0 843.87 2.127 0 0 1307.7 2.272 0 0 1246.7 0.291 0 0 2235.2 0.837 0 0 624.04 6.957 0 0 2696.9 2.146 0 0 11564 18.353 0 0 4134 12.969 0 7429.6 5.384 0 0.00079231 0.0015637 0.00036207 0 1 1 1 1 1 1 1 1 1 1 4.3291 1.5837 11.833 -1 -1 -1 4.3239 1.5824 11.815 9.5513e-07 0.5773 3.389 0.69154 

Your command as far as I understand takes the values i want from files 2 and 3 and puts them in the same positions in file1 and prints the rest of file 1 as is. Let's say I want to do all that and on top keep a subset of columns, say column 65 and 75 from the remaining columns. Is that clear? Could I just replace "print" with "print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$65,$75" since now $8,$9 and $10 have the values from the other files?

I Think it does work!

---------- Post updated at 03:21 PM ---------- Previous update was at 03:08 PM ----------

Thanks again for all your help guys!