Join 3 or more files using matching column

Dear Forum,

Full title of the topic would be: "Join 3 or more files using matching column without full list in any of these columns"

I have several, typically 3 or 4 files which I need to join, something like FULL JOIN in slq scripts, all combinations of matches should be printed into an output file, including those lines where no match to any other file exists. I used mysql where FULL JOIN statement does not exist but some workarounds do the job, at least in case of 3 files but sometimes I got duplicates or even multiplicates. And most important mysql is slow with big files.

I give a single column examle hoping that I manage to implement it to multicolumn cases:
File 1
col1
aaa
bbb
abb
fff

File2
col1
bbb
abb
ccc
fff

File3
aaa
ccc
dce
fff
Output
col1_file1 col1_file2 col1_file3
aaa null aaa
bbb bbb null
abb abb null
null ccc ccc
null null dce
fff fff fff

The best wold be when I can add more files with a ease to the script!

appreciating your ideas!

cyz

three files version:

awk 'ARGIND==1{a[$1]=$0 " null null";next;} 
    ARGIND==2{ if($1 in a)sub(/ null /, " "$1" ",a[$1]);else a[$1]="null "$1" null";next; }
    ARGIND==3{if($1 in a)sub(/ null$/," "$1" ",a[$1]); else a[$1]="null null "$1;}
END{for(i in a)print a}' file1 file2 file3 

---------- Post updated at 13:17 ---------- Previous update was at 13:06 ----------

multi-files version:

just fill the files in "files" list

#!/usr/bin/python
files=['file1','file2','file3']
dict={}
len = files.__len__()
for s in files:
    idx = files.index(s)
    f = open(s)
    line = [x.replace("\n","") for x in f.readlines()]
    for l in line:
        if(not dict.has_key(l)):
            dict[l] = ["null"]*len
        dict[l][idx] = l
    f.close()
keys = dict.keys()
for k in keys:
    print  reduce(lambda x,y: x + " " + y,dict[k])

Great!

For 4 files with awk I tried:

awk 'ARGIND==1{a[$1]=$0 "null null null";next;}
ARGIND==2{if($1 in a)sub(/ null /, " "$1" ",a[$1]);else a[$1]="null "$1" null null";next; }
ARGIND==3{if($1 in a)sub(/ null$/," "$1" ",a[$1]); else a[$1]="null null "$1" null";}
ARGIND==4{if($1 in a)sub(/ null$/," "$1" ",a[$1]); else a[$1]="null null null "$1"";}
END{for(i in a)print a[i]}' file1 file2 file3 file4

Seems to work, only columns coming from file3 and file 4 a swapped...

Minor question, by some reason there are 13 spaces between columns 1 and 2, how come?

cyz

the red part is not correct for 4 files.
I suggest that you try the python way if you have more than 3 files.

ARGIND==3{if($1 in a)sub(/ null$/," "$1" ",a[$1]); else a[$1]="null null "$1" null";}
    ARGIND==4{if($1 in a)sub(/ null$/," "$1" ",a[$1]); else a[$1]="null null null "$1"";}
1 Like