Help in modifying existing Perl Script to produce report of dupes

gimley · April 25, 2012, 9:10pm

Hello,
I have a large amount of data with the following structure:
Word=Transliterated word
I have written a Perl script (reproduced below) which goes through the full file and identifies all dupes on the right hand side. It successfully creates a new file with two headers: Singletons and Dupes.
I have tried to modify the script so that it additionally produces a report listing the frequency count of each dupe. In the sample provided, for example, I would like to know how many different ways the dupe Albert has been transliterated. I am providing pseudo-data since the original data is in a foreign script.

The script should give me a report in a separate output with the following structure:

The final output would thus have two files:
The output file listing Singletons and Dupes
The report which would have the dupes listed along with their frequency.
I am not very good at generating reports in Perl, hence the request.
The Perl script follows.
Many thanks for the excellent help and advice given.

#!/usr/bin/perl

$dupes = $singletons = "";		# This goes at the head of the file

do {
    $dupefound = 0;			# These go at the head of the loop
    $text = $line = $prevline = $name = $prevname = "";
    do {
	$line = <>;
	$line =~ /^(.+)\=.+$/ and $name = $1;
	$prevline =~ /^(.+)\=.+$/ and $prevname = $1;
	if ($name eq $prevname) { $dupefound += 1 }
	$text .= $line;
	$prevline = $line;
    } until ($dupefound > 0 and $text !~ /^(.+?)\=.*?\n(?:\1=.*?\n)+\z/m) or eof;
    if ($text =~ s/(^(.+?)\=.*?\n(?:\2=.*?\n)+)//m) { $dupes .= $1 }
    $singletons .= $text;
} until eof;
print "SINGLETONS\n$singletons\nDUPES\n$dupes";

balajesuri · April 26, 2012, 1:16am

[user@cygwin ~]$ cat input.txt
Albert=albt
Albert=albut
Albert=albat
Mary=mari
Mary=meri
Mary=merry
Mary=marey
[user@cygwin ~]$
[user@cygwin ~]$ perl -F= -ane 'BEGIN {open O, "> output.txt"}
chomp $F[1]; $x{$F[0]} .= "$F[1],"; $y{$F[0]}++;
END {
    for (sort keys %x) {
        $x{$_} =~ s/,$//;
        print O "$_,$y{$_},$x{$_}\n";
    }
    close O;
}' input.txt
[user@cygwin ~]$
[user@cygwin ~]$ cat output.txt
Albert,3,albt,albut,albat
Mary,4,mari,meri,merry,marey
[user@cygwin ~]$

gimley · April 26, 2012, 6:14am

Hi,
Many thanks.
Unfortunately I work under Windows, where the "cat" command is not available.
I cut out the code snippet and tried to run it, but it would not work.
Many thanks for the help all the same.

balajesuri · April 26, 2012, 6:55am

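#! C:\Perl\bin\perl.exe
use strict;
use warnings;

my (%x, %y);    # %x: word => comma-joined transliterations, %y: word => count

open my $in, '<', 'input.txt' or die "Cannot open input.txt: $!";
while (<$in>) {
    chomp;
    my @F = split /=/, $_, 2;     # split on the first "=" only
    next unless defined $F[1];    # skip blank or malformed lines
    $x{$F[0]} .= "$F[1],";
    $y{$F[0]}++;
}
close $in;

open my $out, '>', 'output.txt' or die "Cannot open output.txt: $!";
for (sort keys %x) {
    $x{$_} =~ s/,$//;             # drop the trailing comma
    print $out "$_,$y{$_},$x{$_}\n";
}
close $out;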
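
The script above reports every word, including any that occur only once. The original question also asked for a Singletons/Dupes listing alongside the frequency report, so here is a minimal sketch of how the two could be built in one pass. It assumes a dupe is a left-hand word that appears on more than one line. The helper name split_dupes and the John=jon sample line are made up for illustration and do not come from the thread; writing the results to files instead of standard output is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): group "Word=Transliteration"
# lines by the word on the left, then separate singletons from dupes.
sub split_dupes {
    my @lines = @_;
    my (%variants, @order);
    for my $line (@lines) {
        chomp $line;
        my ($word, $translit) = split /=/, $line, 2;
        next unless defined $translit;                  # skip blank or malformed lines
        push @order, $word unless exists $variants{$word};  # remember first-seen order
        push @{ $variants{$word} }, $translit;
    }

    my (@singletons, @dupes, @report);
    for my $word (@order) {
        my @v     = @{ $variants{$word} };
        my @pairs = map { "$word=$_" } @v;
        if (@v > 1) {
            # Assumption: a word seen on more than one line counts as a dupe.
            push @dupes,  @pairs;
            push @report, join(',', $word, scalar(@v), @v);
        }
        else {
            push @singletons, @pairs;
        }
    }
    return (\@singletons, \@dupes, \@report);
}

# Made-up sample data: three Albert lines from the thread plus one singleton.
my ($singletons, $dupes, $report) = split_dupes(
    "Albert=albt\n", "Albert=albut\n", "Albert=albat\n", "John=jon\n",
);

print "SINGLETONS\n";
print "$_\n" for @$singletons;
print "\nDUPES\n";
print "$_\n" for @$dupes;
print "\nREPORT\n";
print "$_\n" for @$report;
```

Keeping the grouping in a hash of arrays means the count and the list of transliterations come from the same data, so they cannot drift apart.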

pravin27 · April 26, 2012, 7:04am

How about this ?

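#!/usr/bin/perl
use strict;
use warnings;

my %hashWord;    # word => reference to an array of its transliterations

while (<DATA>) {
        chomp;
        my ($word, $meaning) = split /=/, $_, 2;
        next unless defined $meaning;    # skip blank or malformed lines
        push @{ $hashWord{$word} }, $meaning;
}

foreach my $KeyWord (sort keys %hashWord) {
        printf "%s,%d", $KeyWord, scalar @{ $hashWord{$KeyWord} };
        foreach (@{ $hashWord{$KeyWord} }) {
                printf ",%s", $_;
        }
        print "\n";
}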



__DATA__
Albert=albt
Albert=albut
Albert=albat
Mary=mari
Mary=meri
Mary=merry
Mary=marey

gimley · April 26, 2012, 9:41am

Many thanks. The script runs like a charm and sorts and identifies dupes.