Combine columns from different files

Hi all,
I found some similar problems and solutions, but none of them work for my case. I have multiple files like this:

file1:

g1.1 2
g2.2 4
g2.1 5
g4.1 3

file2:

g1.1 2
g1.2 3
g4.1 4

file3:

g1.2 3
g5.1 3

I need output like:

gene file1 file2 file3
g1.1 2 2 -
g1.2 - 3 3
g2.2 4 - -
g2.1 5 - -
g4.1 3 4 -
g5.1 - - 3

So the first columns should be compared and the value for each gene collected from every file; if an entry is missing in a file, it should print -. Most of the entries are shared between the files, but some are missing. I have around 8 files like this, each with two columns.

Thanks in advance
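As an aside, for a small, fixed number of files this is just a chain of outer joins, which GNU join can do directly. A sketch only, using the sample data from the question: it assumes GNU join (-e together with -o auto is a GNU extension) and inputs sorted on the key column.

```shell
# A sketch, not a drop-in answer: GNU join, inputs sorted on the key column.
# -a1/-a2 keep unpairable lines from both sides (an outer join),
# -e '-' fills missing fields, -o auto (GNU) keeps every column.
cd "$(mktemp -d)"
printf 'g1.1 2\ng2.2 4\ng2.1 5\ng4.1 3\n' | sort > file1
printf 'g1.1 2\ng1.2 3\ng4.1 4\n'         | sort > file2
printf 'g1.2 3\ng5.1 3\n'                 | sort > file3

join -a1 -a2 -e '-' -o auto file1 file2 \
  | join -a1 -a2 -e '-' -o auto - file3
```

Each extra file adds one more join to the pipe, so for 8 files you would chain 7 joins; a scripted solution scales better, but for two or three files this is hard to beat.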

Take a look at the statements below:

kent$ awk 'ARGIND==1{a[$1]=$0" - -";next}
ARGIND==2{if($1 in a)sub(/ - /," "$2" ",a[$1]);else a[$1]=$1" - "$2" -";next}
ARGIND==3{if($1 in a)sub(/ -$/," "$2,a[$1]);else a[$1]=$1" - - "$2}
END{for(i in a)print a[i]}' file1 file2 file3 | sort
g1.1 2 2 -
g1.2 - 3 3 
g2.1 5 - -
g2.2 4 - -
g4.1 3 4 -
g5.1 - - 3

@sk1418: thanks for the help, but I may have more files, so I think I would have to extend the script with more ARGIND checks. Is it possible to turn it into a loop or something? Do I have to check with ARGIND==0?

Even then I am only getting the counts from the last file, and for the rest just a -.

Hi, this script is written for at most 3 input files. If you have more files, the logic needs to change as well, not only the ARGIND part. Say you have 10 files; then one line could be

g5.1 - - - - - 7 - - - -

The script above cannot give you that output, so "extending" doesn't work; you have to rewrite it.

awk can do this, but not so easily; at least I haven't thought of a shortcut for it, and I'd be glad to hear if someone has one. You could try writing a Python script; I think that wouldn't be complicated.
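For what it's worth, one way to generalize in awk without hard-coding an ARGIND branch per file is to count files with FNR==1 and index the values by (gene, file number). A sketch only, shown here against the sample data from the question; it assumes no input file is empty (FNR==1 never fires for an empty file):

```shell
# Recreate the sample files from the question in a scratch directory.
cd "$(mktemp -d)"
printf 'g1.1 2\ng2.2 4\ng2.1 5\ng4.1 3\n' > file1
printf 'g1.1 2\ng1.2 3\ng4.1 4\n'         > file2
printf 'g1.2 3\ng5.1 3\n'                 > file3

# Portable awk (no GNU-only ARGIND), any number of files on the command line.
awk '
FNR == 1 { nf++ }                  # FNR resets per file: count input files
{ seen[$1]; val[$1, nf] = $2 }     # remember value keyed by (gene, file no.)
END {
  for (g in seen) {
    line = g
    for (i = 1; i <= nf; i++)
      line = line " " (((g, i) in val) ? val[g, i] : "-")
    print line
  }
}' file1 file2 file3 | sort
```

Because the file list is only mentioned on the command line, going from 3 files to 8 just means naming more files; the script itself does not change.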

---------- Post updated at 14:45 ---------- Previous update was at 14:07 ----------

Here is a Python script that does the job.

#!/usr/bin/python
files = ['file1', 'file2', 'file3']
genes = {}                        # gene -> list of values, one slot per file
nfiles = len(files)
for idx, name in enumerate(files):
    with open(name) as f:
        for line in f:
            k, v = line.split()
            if k not in genes:
                genes[k] = ['-'] * nfiles   # start with all columns missing
            genes[k][idx] = v               # fill in this file's column

for k in sorted(genes):
    print(k + ' ' + ' '.join(genes[k]))

Save the above code to a .py file, e.g. t.py,
then:

kent$ python t.py
g1.1 2 2 -
g1.2 - 3 3
g2.1 5 - -
g2.2 4 - -
g4.1 3 4 -
g5.1 - - 3

You can change the "files" list to add as many files as you want. Of course, you can also build the file list from a given path, e.g. os.listdir(path)...

How about Perl?
Invocation:

perl merger.pl file1 .. fileN

merger.pl

#!/usr/bin/perl
use strict;
use warnings;

my %hash;       # gene -> array ref of values, one entry per file
my $i = 0;      # index of the file currently being read
foreach my $file (@ARGV) {
    open(my $fh, '<', $file) || die "FAIL- $!\n";
    while (<$fh>) {
        my @fld = split;
        $hash{$fld[0]} //= [];
        my $have = scalar @{ $hash{$fld[0]} };
        # pad with "-" for every earlier file the gene was missing from
        push @{ $hash{$fld[0]} }, '-' for $have + 1 .. $i;
        push @{ $hash{$fld[0]} }, $fld[1];
    }
    close($fh);
    $i++;
}

# pad genes that were missing from the last file(s)
foreach (keys %hash) {
    push @{ $hash{$_} }, '-' while @{ $hash{$_} } < $i;
}

print $_, " ", "@{$hash{$_}}", "\n" foreach (sort keys %hash);