Algorithm bug in my code

yifangt · October 1, 2011, 5:17pm

Hello,
This problem has bugged me for some time. I want to merge several files of about a hundred rows each into a union, using the ID as the key column (similar to join), and fill in 0 where an ID is absent from a file.

ID File1
A 1
C 3
D 4
M 6

ID File2
A 5
B 10
C 15
Z 26

ID File3
A 2
B 6
O 20
X 9

I want the output as

ID File1 File2 File3
A 1 5 2 
B 0 10 6
C 3 15 0
D 4 0 0
M 6 0 0 
O 0 0 20
X 0 0 9
Z 0 26 0

I searched the site and found some posts about merging two files on a common column, but my case is different. My code below runs, but the output loses some of the information:

#!/usr/bin/perl -w

use strict;

my $Fname1="./path/file1.txt"; # tab-delimited format
my $Fname2="./path/file2.txt";
my $Fname3="./path/file3.txt";

my %combinedfile;
my $key;

open(F1, "<$Fname1") or die "Can't open the input file $Fname1 because of $!";
while (my $line1 = <F1>) {
  chomp ($line1);
  my ($ID1, $count)=split("\t", $line1);
  $key=$ID1;
  $combinedfile{$key}=$count;
}
close (F1);

open(F2, "<$Fname2") or die "Can't open the input file $Fname2 because of $!";
while (my $line2 = <F2>) {
  chomp ($line2);
  my ($ID2, $count2)=split("\t", $line2);
  $key=$ID2;
  if (exists($combinedfile{$key}))
    { $combinedfile{$key}.="\n$count2";}
  else {
    $combinedfile{$key}="0\n$count2";
  }
}
close (F2);

open(F3, "<$Fname3") or die "Can't open the input file $Fname3 because of $!";
while (my $line3 = <F3>) {
  chomp ($line3);
  my ($ID3, $count3)=split("\t", $line3);
  $key=$ID3;
  if (exists($combinedfile{$key}))
    { $combinedfile{$key}.="\n$count3";}
  else {
    $combinedfile{$key}="0\n0\n$count3";
  }
}
close (F3);

foreach my $member (keys %combinedfile) {
  print $member, "\t", join("\t", split("\n", $combinedfile{$member})), "\n";
}

The output is:

ID File1 File2 File3
A 1 5 2 
B 0 10 6
C 3 15 
D 4
M 6
O 0 0 20
X 0 0 9
Z 0 26

I know there is a bug in the algorithm. For example, D is in File1; when File2 is read, D is supposed to be saved as:

D 4\n0

and when reading File3, it should be saved as:

D 4\n0\n0

But D was skipped because it is not in File2 or File3. It seems that only new hash keys are added properly, while existing keys that do not appear in later files (File2 or File3) are skipped.

How can I fix this bug? I run into this occasionally in my work, and it seems to be a common task, similar to join but different. I hope there is a command like "union" for this job (ideally one that could also fill the gaps with NA instead of 0, though that is just a wish).
Thanks a lot in advance!

Yifang
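
The bug described above comes from appending a count only when an ID appears in the current file, which leaves no placeholder for files where the ID is missing. A minimal sketch of one possible fix follows. It assumes the counts are already held in memory, one hash per file; the `merge_counts` helper and the inline sample data are hypothetical stand-ins for reading the three files. Each count is stored by column index, and the gaps are filled with 0 only when the rows are built:

```perl
use strict;
use warnings;

# Hypothetical helper: takes one hash ref (ID => count) per file and
# returns one tab-joined row per ID, with 0 for missing entries.
sub merge_counts {
    my @files = @_;
    my %combined;

    for my $i (0 .. $#files) {
        # Store each count in the column that belongs to file $i.
        for my $id (keys %{ $files[$i] }) {
            $combined{$id}[$i] = $files[$i]{$id};
        }
    }

    my @rows;
    for my $id (sort keys %combined) {
        # Columns that were never assigned are undef; turn them into 0 here.
        my @counts = map { defined $combined{$id}[$_] ? $combined{$id}[$_] : 0 }
                     0 .. $#files;
        push @rows, join("\t", $id, @counts);
    }
    return @rows;
}

# Sample data from the question, without the header lines.
my @rows = merge_counts(
    { A => 1, C => 3,  D => 4,  M => 6  },
    { A => 5, B => 10, C => 15, Z => 26 },
    { A => 2, B => 6,  O => 20, X => 9  },
);
print "$_\n" for @rows;
```

Because the column position comes from the file index rather than from how many values have been appended so far, an ID that skips a file in the middle still lines up correctly.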

binlib · October 1, 2011, 7:47pm

join -a 1 -a 2 -e 0 -o0,1.2,2.2 f1 f2 |join -a 1 -a 2 -e 0 -o0,1.2,1.3,2.2 - f3

MacMonster · October 2, 2011, 10:13am

A hash (associative array) makes it easier to check whether a given key exists.

Since your files all have the same format, duplicating the code for each file is not a good idea. I prefer to pass the filenames as command-line arguments and loop through them. That is, run the command like this:

./yourscript.pl file1.txt file2.txt file3.txt

#!/usr/bin/perl -w

use strict;

my %combinedfile = ();
my @file_list = ();

foreach my $file (@ARGV)
{
    my $basename = substr($file, rindex($file, '/') + 1);
    my $name = uc(substr($basename, 0, rindex($basename, '.')));
    push(@file_list, $name);

    if (open(F, $file))
    {
        while (my $line = <F>)
        {
            chomp($line);

            my @item = split("\t", $line);
            my $id = defined($item[0]) ? $item[0] : '';
            my $count = defined($item[1]) ? $item[1] : 0;
            next if ($id eq '');

            $combinedfile{$id} = () if (!defined($combinedfile{$id}));
            $combinedfile{$id}->{$name} = 0 if (!defined($combinedfile{$id}->{$name}));
            $combinedfile{$id}->{$name} = $count;
        }

        close(F);
    }
}

print "ID";

foreach my $name (@file_list)
{
    print "\t$name";
}

print "\n";

foreach my $id (sort keys %combinedfile)
{
    print "$id";

    foreach my $name (@file_list)
    {
        my $count = defined($combinedfile{$id}->{$name}) ? $combinedfile{$id}->{$name} : 0;
        print "\t$count";
    }

    print "\n";
}

exit(0);
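
The script above derives each column name with substr and rindex. If a file name contains no dot, rindex returns -1 and substr then drops the last character of the name. A hedged alternative sketch uses the core File::Basename module; the `column_name` helper is a hypothetical name introduced only for this illustration:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: derive an upper-case column name from a file path.
sub column_name {
    my ($path) = @_;
    # fileparse returns (name, directory, suffix); the pattern strips
    # the last ".ext" from the name if there is one.
    my ($name) = fileparse($path, qr/\.[^.]*/);
    return uc $name;
}

print column_name("./path/file1.txt"), "\n";
print column_name("file2"), "\n";
```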

yifangt · October 2, 2011, 10:53am

This is a little too advanced for me; I can't follow your algorithm even though I seem to understand each line.

while (my $line = <F>) {
    chomp($line);
    my @item = split("\t", $line);                   # Understood
    my $id = defined($item[0]) ? $item[0] : '';      # I start to get lost here: what is the purpose of the empty string?
    my $count = defined($item[1]) ? $item[1] : 0;    # ??? If there is no count there, how can I assign 0 to $item[1]? This is the biggest trick
    next if ($id eq '');
    $combinedfile{$id} = () if (!defined($combinedfile{$id}));
    $combinedfile{$id}->{$name} = 0 if (!defined($combinedfile{$id}->{$name}));
    $combinedfile{$id}->{$name} = $count;
}

This part seems to me to be where the trick is. Can you explain a little bit more, even in pseudocode? Thanks a lot!

MacMonster · October 2, 2011, 11:04am

Those annoying lines are there to avoid warnings like the following:

Use of uninitialized value in string eq at ...

They initialize the values in case the line has fewer than 2 columns.

If you don't use "-w", the lines can simply be rewritten as:

my ($id, $count) = split("\t", $line);
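
The defaulting described above can also be written more compactly with the defined-or operator. This is only a sketch, assuming Perl 5.10 or later; `parse_line` is a hypothetical helper name, not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: parse one tab-delimited line into (id, count).
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($id, $count) = split /\t/, $line;
    $id    //= '';   # empty line: split returns nothing, so $id is undef
    $count //= 0;    # line with an ID but no count column
    return ($id, $count);
}

for my $line ("D\t4\n", "D\n", "\n") {
    my ($id, $count) = parse_line($line);
    print "id='$id' count=$count\n";
}
```

With the defaults in place, the later `$id eq ''` comparison no longer touches an undefined value, so the warning does not appear even with "-w".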

yifangt · October 5, 2011, 12:43am

Thanks MacMonster!
I am trying to understand this part of your script, which seems to me to be the key to the whole thing.

 $combinedfile{$id} = () if (!defined($combinedfile{$id}));
 $combinedfile{$id}->{$name} = 0 if (!defined($combinedfile{$id}->{$name}));
 $combinedfile{$id}->{$name} = $count;

Could you explain it a little more so that I can fully grasp it? Thanks a lot!
Yifang

MacMonster · October 5, 2011, 10:53am

 

# Add "id" to "$combinedfile" and initialize the element as a hash
$combinedfile{$id} = () if (!defined($combinedfile{$id}));

# Add "name" to "$combinedfile{$id}" and initialize the element as an integer zero
 $combinedfile{$id}->{$name} = 0 if (!defined($combinedfile{$id}->{$name}));

# Assign the value to the element
 $combinedfile{$id}->{$name} = $count;

The "defined" function checks whether the element already has a value. If it does not, a value is initialized for it. This avoids using an undefined value.
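
As a side note on the three lines above: in Perl, assigning an empty list to a scalar element stores undef, and the inner hash is created later by autovivification when the element is dereferenced. The following small sketch, with hypothetical sample values, checks what each step leaves in the hash:

```perl
use strict;
use warnings;

my %combinedfile;
my ($id, $name, $count) = ('D', 'FILE1', 4);   # hypothetical sample values

# Assigning an empty list to a scalar element stores undef, so this
# line creates the key but does not yet create a hash.
$combinedfile{$id} = () if (!defined($combinedfile{$id}));
my $after_first = defined($combinedfile{$id}) ? 'defined' : 'undef';

# Dereferencing the undef element with ->{...} makes Perl create the
# inner hash on the fly (autovivification).
$combinedfile{$id}->{$name} = $count;
my $after_second = ref($combinedfile{$id});

print "after first line: $after_first\n";
print "after assignment: $after_second\n";
```

So the script works either way; the first two lines mainly document the intended structure.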

yifangt · October 5, 2011, 2:36pm

Thanks Mac!
I think I understand more of your code now, but I am still not quite sure of the exact algorithm. As a learner, I'd like to confirm your logic flow with my comments on the tricky parts. If this is too basic to reply to, just ignore my post, although I would appreciate a correction if I have misunderstood.

Thanks a lot!

my %combinedfile = ();
my @file_list = ();

foreach my $file (@ARGV)
{
    my $basename = substr($file, rindex($file, '/') + 1);
    my $name = uc(substr($basename, 0, rindex($basename, '.')));

    # Add the file's name to the array @file_list

    push(@file_list, $name);

    if (open(F, $file))  # Open the first file
    {
        while (my $line = <F>)
        {
            chomp($line);
 
            my @item = split("\t", $line);
            my $id = defined($item[0]) ? $item[0] : '';

            # Split each row into the $id and the $count

            my $count = defined($item[1]) ? $item[1] : 0;
            next if ($id eq '');

# If the $id is NOT defined, define it!

            $combinedfile{$id} = () if (!defined($combinedfile{$id}));

            # This is a hash of hashes, right? $id is the key for the first layer and $name is the key for the second layer. This part is tricky to me.
            # But why does the second layer of the hash not need to be declared at the beginning?

            $combinedfile{$id}->{$name} = 0 if (!defined($combinedfile{$id}->{$name}));

            # Use the hash of hashes to get the value of the second layer, right?

            $combinedfile{$id}->{$name} = $count;
        }

        close(F);
    } # Then iterate to the next file
}

print "ID";

foreach my $name (@file_list)
{
    print "\t$name";
}

print "\n";

# sort is not necessary, right?

foreach my $id (sort keys %combinedfile)
{
    print "$id";

    foreach my $name (@file_list)
    {
        my $count = defined($combinedfile{$id}->{$name}) ? $combinedfile{$id}->{$name} : 0;
        print "\t$count";
    }

    print "\n";
}

exit(0);

If my understanding is correct, I have made progress with Perl data structures. Thanks a lot again!

Yifang

MacMonster · October 6, 2011, 12:42pm

yifangt:

# This is a hash of hashes, right? $id is the key for the first layer and $name is the key for the second layer. This part is tricky to me.
# But why does the second layer of the hash not need to be declared at the beginning?

The second layer stores only a single integer, not another hash, so you don't need to initialize it as an empty hash like the first layer.

yifangt:

# Use the hash of hashes to get the value of the second layer, right?
$combinedfile{$id}->{$name} = $count;

Yes.

yifangt:

# sort is not necessary, right?
foreach my $id (sort keys %combinedfile)

It is not necessary; whether you use it depends on the output style you want.
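
To illustrate the point about sort: Perl does not promise any particular order for keys, so sort matters only when the rows should appear in a predictable order. A short sketch with hypothetical sample IDs:

```perl
use strict;
use warnings;

my %combinedfile = (M => 6, A => 1, Z => 26, D => 4);   # hypothetical sample IDs

# keys returns the IDs in an unspecified order that can change between runs.
my @unsorted = keys %combinedfile;

# sort with no comparator orders them as strings.
my @sorted = sort keys %combinedfile;

print "unsorted: @unsorted\n";
print "sorted:   @sorted\n";
```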

yifangt · October 6, 2011, 1:00pm

This post has really helped me a lot. Thank you very much, Mac!
Yifang