finding duplicates with perl

dangral · January 27, 2003, 12:44pm

I have a huge file (over 30mb) that I am processing through with perl. I am pulling out a list of filenames and placing it in an array called @reports.
I am fine up till here. What I then want to do is go through the array and find any duplicates. If there is a duplicate, output it to the screen. But once I find one duplicate of a filename, I want to go through and look for a duplicate of the next filename.
Thanks!

criglerj · January 27, 2003, 1:03pm

Without more specifics about your problem, I think a hash might be more appropriate than an array. Then you can keep a count of where each filename is called out, or a list of callouts or whatever. If you need to preserve the order of the filenames, store the record number each filename was first found in, say, then sort on the record number. But a hash is a fundamental perl idiom for detecting duplicates. It'll work in ruby, too, BTW.

dangral · January 27, 2003, 2:08pm

I'm not really sure what exactly that would entail. I will post what I have thought of (using an array), although it is not working.

open(DUPS, ">duplicates.txt") or die "Can't open duplicates.txt $!";
for $i (0 .. $#reports){
for $j (0 .. $#reports){
if ($i != $j && $reports[$i] eq $reports[$j]){
print DUPS "\n $reports[$i]";
last;
}

}

}
close(DUPS);

criglerj · January 28, 2003, 11:50am

Okay, if you need the list of @reports in an array for some other reason, the next best thing is a hash just to gather the duplicates. You can destroy the hash later if you need to. Warning: untested code

%h = ();
foreach $r (@reports) {
    if (!exists($h{$r}))  {
        # First time we've seen this one
        $h{$r} = 0
    } elsif ($h{$r}) {
        # We've seen this one before and reported
        $h{$r}++
    } else {
        # Second time, so report the duplicate
        print DUPS "\n $r";
        $h{$r} = 1
    }
}
%h = ();   # Destroy %h if you're done with it

Now %h contains the number of "extra" of each member of @reports, i.e, one less than the number that's actually there. If you don't need @reports for anything else, you can embed this logic into the loop thats reading your large data file and save some memory.

Another option is to put off the reporting of duplicates, putting that in another loop after the one shown above so you can report the number of times a report is found in @reports.