Help in modifying a Perl script to sort singletons and duplicates

I have a large database which has the following structure

a=b
where a is a word in one language, b is its equivalent in the other, and = is the delimiter

Since the data deals with language, homographs occur, i.e. the same word on the left-hand side can map, in two different entries, to two different glosses on the right-hand side, as in the example below.

a=b
a=c
b=p
b=q
c=d
r=s

I had written the following script, which does give results when the database is small, but when the database grows (to around 100,000+ entries) it does not perform well. The expected output, in a single file, would be as under:

SINGLETONS
c=d
r=s
DUPES
a=b
a=c
b=p
b=q

My Perl script is given below:

#!/usr/bin/perl

$dupes = $singletons = "";		# This goes at the head of the file

do {
    $dupefound = 0;			# These go at the head of the loop
    $text = $line = $prevline = $name = $prevname = "";
    do {
	$line = <>;
	$line =~ /^(.+)\=.+$/ and $name = $1;
	$prevline =~ /^(.+)\=.+$/ and $prevname = $1;
	if ($name eq $prevname) { $dupefound += 1 }
	$text .= $line;
	$prevline = $line;
    } until ($dupefound > 0 and $text !~ /^(.+?)\=.*?\n(?:\1=.*?\n)+\z/m) or eof;
    if ($text =~ s/(^(.+?)\=.*?\n(?:\2=.*?\n)+)//m) { $dupes .= $1 }
    $singletons .= $text;
} until eof;
print "SINGLETONS\n$singletons\nDUPES\n$dupes";

I suspect the problem lies in the absence of a prior sort of the data, since the script gives better (though still not perfect) results when the input is sorted.
Could someone please suggest a better solution? If the code is modified, could the changes be commented? I am still learning the intricacies of Perl, which explains the comments I added to my script.
Many thanks for your help.

It doesn't perform very well because you're keeping 100,000 things in one string and using regular expressions to find them. This is never going to be efficient -- you're scanning through all 100,000 things every time you add a single one.

Use a hash to organize the data into many small lists keyed by the left-hand word instead of one big string of 100,000 entries; a hash lookup finds the right list directly, with no regex scan at all.

#!/usr/bin/perl

use strict;
use warnings;

my %hg;         # glosses collected per left-hand word
my @s;          # words that turn out to be singletons

while (my $line = <STDIN>) {
        chomp $line;
        my ($word, $gloss) = split(/=/, $line, 2);

        # Append gloss to the space-separated list for this word
        $hg{$word} = defined $hg{$word} ? "$hg{$word} $gloss" : $gloss;
}

print "DUPES\n";
foreach my $word (keys %hg) {
        my @l = split(/ /, $hg{$word});

        # save singletons for later
        if (@l == 1) { push(@s, $word); }
        else {
                foreach my $gloss (@l) {
                        printf("%s=%s\n", $word, $gloss);
                }
        }
}

print "SINGLETONS\n";
foreach my $word (@s) {
        printf("%s=%s\n", $word, $hg{$word});
}

Use as starting point.

#!/usr/bin/perl

# segregate.pl

use strict;
use warnings;

# data structure as a hash
my %db;
# populate a data structure based on first field
while(my $line = <>) {
    # remove new line
    chomp $line;
    # get field one
    my ($id) = split /=/, $line;
    # add field1 => line to the structure
    push @{$db{$id}}, $line;
}

# sorted keys for singletons
my @single;
# sorted keys for duplicates
my @dup;
# iterate through db hash using sorted keys
for my $key (sort keys %db) {
    if(scalar @{$db{$key}} == 1) {
        # add to singles list if one item
        push @single, $key;
    }
    else {
        # more than one item, add to dup list
        push @dup, $key;
    }
}

# Display lists
print "SINGLETONS:\n";
# iterate only through sorted singles
for my $single (@single) {
    print @{$db{$single}}, "\n";
}

print "DUPES:\n";
# iterate only through sorted duplicates
for my $dup (@dup) {
    print join "\n", @{$db{$dup}};
    print "\n";
}

Usage:
perl segregate.pl datafile
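To sanity-check the behaviour without creating a data file, the same grouping logic can be exercised in one self-contained snippet, with the question's sample rows inlined (the @rows array and building the result in a single string are just for this demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sample rows from the question, inlined here only for the demo
my @rows = qw(a=b a=c b=p b=q c=d r=s);

# group each line under its left-hand word, as segregate.pl does
my %db;
for my $line (@rows) {
    my ($id) = split /=/, $line;
    push @{ $db{$id} }, $line;
}

# one stored line per key => singleton; more than one => duplicate group
my $out = "SINGLETONS:\n";
$out .= "$_\n" for map { @{ $db{$_} } } grep { @{ $db{$_} } == 1 } sort keys %db;
$out .= "DUPES:\n";
$out .= "$_\n" for map { @{ $db{$_} } } grep { @{ $db{$_} } > 1 } sort keys %db;
print $out;
```

This prints the two singletons (c=d, r=s) under their header, followed by the duplicate groups for a and b, matching the layout asked for in the question.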


Many thanks for your help and, above all, for taking the pains to comment the code. Walking through it helped me understand what was wrong with my approach. The script worked, was fast, and neatly segregated the two types.


Many thanks. I walked through the code and your comments. I now understand why my script kept giving very few results. Most importantly, I learnt how to hash data in Perl. Aia's script was a welcome add-on. Thanks to the forum members for their help and, above all, for the pains people take to correct and comment on code written by those of us still struggling with it.