Perl, sorting and eliminating duplicates

gabrysfe · June 4, 2012, 10:28am

Hi guys! I'm trying to eliminate some duplicates from a file, but I keep hitting a wall.

My file looks like this:

ID_1  0.02
ID_2  2.4e-2
ID_2  4.3e-9
ID_3  0.003
ID_4  0.2
ID_4  0.05
ID_5  1.2e-3

I need to eliminate duplicates based on the first column (in this example, ID_2 and ID_4 each appear twice), keeping only one line per ID. Where an ID is duplicated, I want to keep the line with the lowest value in column 2. In this case that means removing ID_2 2.4e-2 and ID_4 0.2.

Can someone help me?

Thank you!!!
cheers

Skrynesaver · June 4, 2012, 11:05am

Something like the following should work, WARNING not tested code.

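#!/usr/bin/perl

use strict;
use warnings;

my %values;    # lowest value seen so far for each ID, as written in the file
open(my $file, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (<$file>) {
  # Capture the ID and its value; skip lines that do not match.
  my ($id, $val) = /^(ID_\d+)\s+(\S+)/ or next;
  # Perl compares strings such as 2.4e-2 numerically with <, so no sprintf is needed.
  if (!defined $values{$id} || $val < $values{$id}) {
    $values{$id} = $val;
  }
}
close($file);
# Keys are sorted as strings, so ID_10 would come before ID_2.
for (sort keys %values) {
  print "$_\t$values{$_}\n";
}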

gabrysfe · June 4, 2012, 11:19am

I tried it, but I get an error like this:

Unknown open () mode 'file.txt' at perl.pl line 4

.

?

balajesuri · June 4, 2012, 12:16pm

[root@host dir]# cat input
ID_1  0.02
ID_2  2.4e-2
ID_2  4.3e-9
ID_3  0.003
ID_4  0.2
ID_4  0.05
ID_5  1.2e-3
[root@host dir]#
[root@host dir]# perl -lane '
if (defined $x{$F[0]}) {
    if ($x{$F[0]} > $F[1]) {
        $x{$F[0]} = $F[1];
    }
}
else {
    $x{$F[0]} = $F[1];
}
END {
    for (sort keys %x) { print "$_ $x{$_}" }
}' input
ID_1 0.02
ID_2 4.3e-9
ID_3 0.003
ID_4 0.05
ID_5 1.2e-3
[root@host dir]#

Corona688 · June 4, 2012, 1:04pm

A simpler way in awk:

$ awk '$1 in A { if(($2+0) > A[$1]) next } { A[$1]=$2+0; B[$1]=$2 } END { for(X in B) print X, B[X] }' data
ID_1 0.02
ID_2 4.3e-9
ID_3 0.003
ID_4 0.05
ID_5 1.2e-3

$
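One caveat with the awk approach: the order in which `for (X in B)` visits the keys is unspecified, so the result is not guaranteed to come out sorted on every awk. Here is a minimal sketch that pipes the result through sort. It assumes the sample data from the first post, with the second ID_2 value written as 4.3e-9, and the variable name `result` is only illustrative.

```shell
# Feed the sample data through a here-document; "result" is an illustrative name.
result=$(awk '
    # Known ID whose new value is larger than the stored minimum: skip the line.
    $1 in A { if (($2 + 0) > A[$1]) next }
    # First sighting or a new minimum: store the numeric value and the original text.
    { A[$1] = $2 + 0; B[$1] = $2 }
    # Print the surviving text for every ID (order is unspecified here).
    END { for (X in B) print X, B[X] }
' <<'EOF' | sort -k1,1
ID_1  0.02
ID_2  2.4e-2
ID_2  4.3e-9
ID_3  0.003
ID_4  0.2
ID_4  0.05
ID_5  1.2e-3
EOF
)
printf '%s\n' "$result"
```

Keeping the original text in B while comparing on the numeric copy in A means values such as 1.2e-3 are printed exactly as they appear in the file.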

binlib · June 4, 2012, 5:37pm

With GNU sort

sort -k1,1 -k2,2g input-file |sort -msu -k1,1
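To spell out how the two sort passes fit together, here is a small sketch on the sample data, again with the second ID_2 value written as 4.3e-9. It assumes GNU sort, since `-g` (general numeric, which understands exponents such as e-9) is an extension, and the variable name `deduped` is only illustrative.

```shell
# Pass 1: order by ID, then by value as a general number, so each ID's lowest value comes first.
# Pass 2: -m merges the already sorted input without re-sorting, -s disables the
#         last-resort whole-line comparison, and -u keeps the first line of each ID.
deduped=$(sort -k1,1 -k2,2g <<'EOF' | sort -m -s -u -k1,1
ID_1  0.02
ID_2  2.4e-2
ID_2  4.3e-9
ID_3  0.003
ID_4  0.2
ID_4  0.05
ID_5  1.2e-3
EOF
)
printf '%s\n' "$deduped"
```

Unlike the awk and Perl versions, this keeps each surviving line exactly as it was in the input, including the original spacing between the columns.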

gabrysfe · June 5, 2012, 5:11am

Thank you all! Very nice suggestions!
The awk works perfectly for my case. Thank you!