Hi guys! I'm trying to eliminate some duplicates from a file, but I'm stuck!
My file looks like this:
ID_1 0.02
ID_2 2.4e-2
ID_2 4.3e-9
ID_3 0.003
ID_4 0.2
ID_4 0.05
ID_5 1.2e-3
I need to remove the duplicates in the first column (in this example ID_2 and ID_4 each appear twice), keeping exactly one line per ID: the one with the lowest value in column 2. In this case I would like to eliminate ID_2 2.4e-2 and ID_4 0.2.
Can someone help me?
Thank you!!!
cheers
Something like the following should work. Warning: untested code.
#!/usr/bin/perl
use strict;
use warnings;
my %values;
open(my $file, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (<$file>) {
    # Skip lines that do not look like "ID_n value"
    my ($id, $val) = /(ID_\d+)\s+(\S+)/ or next;
    # Keep the numerically smallest value seen for each ID
    if (!defined $values{$id} || $val < $values{$id}) {
        $values{$id} = $val;
    }
}
close($file);
for (sort keys %values) {
    print "$_\t$values{$_}\n";
}
I tried it, but I get an error like this:
Unknown open () mode 'file.txt' at perl.pl line 4
[root@host dir]# cat input
ID_1 0.02
ID_2 2.4e-2
ID_2 4.3e-9
ID_3 0.003
ID_4 0.2
ID_4 0.05
ID_5 1.2e-3
[root@host dir]#
[root@host dir]# perl -lane '
if (defined $x{$F[0]}) {
if ($x{$F[0]} > $F[1]) {
$x{$F[0]} = $F[1];
}
}
else {
$x{$F[0]} = $F[1];
}
END {
for (sort keys %x) { print "$_ $x{$_}" }
}' input
ID_1 0.02
ID_2 4.3e-9
ID_3 0.003
ID_4 0.05
ID_5 1.2e-3
[root@host dir]#
A simpler way in awk:
$ awk '$1 in A { if(($2+0) > A[$1]) next } { A[$1]=$2+0; B[$1]=$2 } END { for(X in B) print X, B[X] }' data
ID_1 0.02
ID_2 4.3e-9
ID_3 0.003
ID_4 0.05
ID_5 1.2e-3
$
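Two caveats with the awk approach: `for (X in B)` visits the keys in an unspecified order, and a malformed value (for example `4.3.e-9` with a stray dot) is silently read as 4.3 by `$2+0`. Here is a minimal sketch of a variant that skips values that are not well-formed numbers and sorts the result by ID. The function name `min_per_id` and the validation regex are my own assumptions, not part of the reply above.

```shell
#!/bin/sh
# min_per_id FILE (hypothetical helper): print one line per ID,
# keeping the line whose second field is numerically smallest.
min_per_id() {
    LC_ALL=C awk '
        # Skip lines whose second field is not a decimal or exponent-form number
        $2 !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/ { next }
        # First occurrence of the ID, or a smaller value: remember the original text
        !($1 in min) || ($2 + 0) < min[$1] { min[$1] = $2 + 0; txt[$1] = $2 }
        # Iteration order is unspecified, so the output is sorted afterwards
        END { for (id in txt) print id, txt[id] }
    ' "$1" | LC_ALL=C sort -k1,1
}
```

Call it as `min_per_id data`. The original text of the value is printed, so `1.2e-3` stays `1.2e-3` rather than being reformatted.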
With GNU sort:
sort -k1,1 -k2,2g input-file |sort -msu -k1,1
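In case it helps to see why the two passes work, here is the same pipeline wrapped in a function with comments. This is only a sketch: the name `min_by_sort` is made up for illustration, and `LC_ALL=C` is my addition so that the decimal point and the ID ordering do not depend on the locale.

```shell
#!/bin/sh
# min_by_sort FILE (hypothetical name): keep the smallest value per ID
# using two GNU sort passes.
min_by_sort() {
    # Pass 1: group lines by ID (field 1), then order each group by field 2.
    #         -g compares as general numbers, so 4.3e-9 sorts before 2.4e-2.
    # Pass 2: -m treats the input as already sorted (no re-sorting),
    #         -u keeps only the first line of each run with the same field 1,
    #         -s keeps the comparison limited to that key.
    LC_ALL=C sort -k1,1 -k2,2g "$1" | LC_ALL=C sort -m -s -u -k1,1
}
```

Usage would be `min_by_sort input-file`. Since the first line of each ID group is the one with the smallest value after pass 1, that is the line that survives pass 2.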
thank you all! very nice suggestions!!
the awk works perfectly for my case! thank you!!