Join files on multiple fields

Hello all,

I want to join 2 tabbed files on the first 2 fields, and filling the missing values with 0. The 3rd column in each file is constant for the entire file.

file1

12658699	ST5	XX2720	0	1	0	1					
53039541	ST5	XX2720	1	0	1.5	1	

file2

53039541	ST5	X23	0	1	0	1					
1267456	ST1	X23	1	0	1.4	1	

Desired output

12658699	ST5	XX2720	0	1	0	1	X23	0	0	0	0
53039541	ST5	XX2720	1	0	1.5	1	X23	0	1	0	1
1267456	ST1	XX2720	0	0	0	0	X23	1	0	1.4	1

Its throwing me an error when I do the following, after sorting..

join  -a1 -a2  -t$'\t'   -1 1,2 -2 1,2 file1 file2
join: invalid field number: `1,2'

Please help !

Hello sheetalk,

I haven't tested in different scenarios but could you please check following if that helps you.

awk -F"\t" 'FNR==NR{A[$1]=$0;next} ($1 in A){Q=$1;$1="";gsub(/^[[:space:]]+/,X,$0);print A[Q] OFS $0;delete A[Q];next} !($1 in A){$3="XX2720" OFS 0 OFS 0 OFS 0 OFS 0 OFS "X23";print $0} END{for(i in A){gsub(/[[:space:]]+$/,X,A);print A OFS "X23" OFS 0 OFS 0 OFS 0 OFS 0}}' OFS="\t" file1 file2
 

Output will be as follows.

53039541 ST5    XX2720  1       0       1.5     1       X23     0       1       0       1
1267456  ST1    XX2720  0       0       0       0       X23     1       0       1.4     1
12658699 ST5    XX2720  0       1       0       1       X23     0       0       0       0
 

EDIT: Adding a non-one liner form of solution on same.

 awk -F"\t" 'FNR==NR{
                        A[$1]=$0;
                        next
                   }
                        ($1 in A){
                                        Q=$1;
                                        $1="";
                                        gsub(/^[[:space:]]+/,X,$0);
                                        print A[Q] OFS $0;
                                        delete A[Q];
                                        next
                                 }
                        !($1 in A){
                                        $3="XX2720" OFS 0 OFS 0 OFS 0 OFS 0 OFS "X23";
                                        print $0
                                  }
            END    {
                         for(i in A){
                                        gsub(/[[:space:]]+$/,X,A);
                                        print A OFS "X23" OFS 0 OFS 0 OFS 0 OFS 0
                                    }
                   }
           ' OFS="\t" file1 file2
 

Thanks,
R. Singh

1 Like

thanks ! can the 3rd column be picked up on the fly and not hard-coded ? The reason is I have multiple pairs of files to join..and I want to do them in a loop rather than hardcoding the names...

The error msg is self explaining: join can have one single field per file only to join on.

yes, is there any alternative? I have to join on the first 2 fields, and not on the first field..also I believe the solution provided is also based on just joining on the first field ?

See if this is of any use.

#!/usr/bin/perl
#
use strict;
use warnings;

# two files must be given at command line
my $first_file = shift or die;
my $second_file = shift or die;

my %data;
my $f;

# process lines from first file
open $f, "<", $first_file or die "$!\n";
while(<$f>) {
    # split line into fields
    my @fields = split;
    # create a key based on first two fields separated by tab
    my $key = join "\t", @fields[0..1];
    # add to data structure and append a list as place holder
    # for second file data
    $data{$key} = [@fields[2..$#fields], ("X23", 0, 0, 0, 0)];
}
close $f;

# process all lines from second file
open $f, "<", $second_file or die "$!\n";
while(<$f>) {
    my @fields = split;
    my $key = join "\t", @fields[0..1];
    # the same key exist in first and second file
    # remove the place holder data
    if(exists $data{$key}){
        $data{$key} = [@{$data{$key}}[0..4], @fields[2..$#fields]];
    # key only exist in second file. Add padding in front.
    }else{
        $data{$key} = [("XX2720", 0, 0, 0, 0), @fields[2..$#fields]]
    }
}
close $f;

for my $k (keys %data) {
    print join "\t", ($k, @{$data{$k}});
    print "\n";
}

Save: mergex.pl
Run: perl mergex.pl file1 file2

1 Like

Note that this is technically not a join operation since you are also merging on fields that the files do not have in common, so the join command would not work anyway.

Also, these are TAB separated files and you are leaving out the last 5 empty fields in each file.

I cut off the last 5 fields is to assume there are no spaces in the fields and there are no other empty fields and use the default FS instead of \t plus use $1=$1 so the empty fields are discarded. If that is OK than this could work:

awk '
  BEGIN {
    OFS="\t"
  }
  FNR==1{
    X[++c]=$3                             # save the third fied; X[1] of the first file X[2] of the second one.
  }
  {
    $1=$1                                 # Discard last 5 trailing fields because FS is default
    i=$1 OFS $2                           # set i to the first two fields, separated by OFS
  } 
  NR==FNR {                               # process first file
    A=$0                               # Put record from first file into array A with first two fields as index
    next
  }
                                          # process second file
  i in A {                                # if the record exist in A, print joined record
    print A,$3,$4,$5,$6,$7
    delete A                           # delete record since index was matched
    next
  }
  {                                       # if the record does not exist in A
    print i,X[1],0,0,0,0,$3,$4,$5,$6,$7   # print it with the remaing fields zeroed and the saved 3rd field of file 1
  } 
  END {
    for(i in A)                           # For the remaining records in file 1 that were not matched
      print A,X[2],0,0,0,0             # print them with the remaing fields zeroed and the saved 3rd field of file 2
  }
' file1 file2                             # process file1 first and then file 2

If that is not OK then the fields need to be assigned differently (or the trailing fields need to be discardedd in a different way) and FS should be set to \t

1 Like