Hello,
I have a large amount of data with the following structure:
Word=Transliterated word
I have written a Perl script (reproduced below) that goes through the whole file and identifies all dupes on the right-hand side. It successfully creates a new file with two headers: Singletons and Dupes.
I have tried to modify the script so that it additionally produces a record listing the frequency count of each dupe. For example, in the sample provided, I would like to know how many times the dupe Albert has been transliterated in different ways. I am providing pseudo-data since the original data is in a foreign script.
The script should give me a report in a separate output with the following structure:
The final output would thus consist of two files:
The output file listing Singletons and Dupes
The report listing the dupes along with their frequencies
I am not very good at generating reports in Perl, hence this request.
The Perl script follows.
Many thanks for the excellent help and advice given.
#!/usr/bin/perl
$dupes = $singletons = ""; # This goes at the head of the file
do {
    $dupefound = 0; # These go at the head of the loop
    $text = $line = $prevline = $name = $prevname = "";
    do {
        $line = <>;
        $line =~ /^(.+)\=.+$/ and $name = $1;
        $prevline =~ /^(.+)\=.+$/ and $prevname = $1;
        if ($name eq $prevname) { $dupefound += 1 }
        $text .= $line;
        $prevline = $line;
    } until ($dupefound > 0 and $text !~ /^(.+?)\=.*?\n(?:\1=.*?\n)+\z/m) or eof;
    if ($text =~ s/(^(.+?)\=.*?\n(?:\2=.*?\n)+)//m) { $dupes .= $1 }
    $singletons .= $text;
} until eof;
print "SINGLETONS\n$singletons\nDUPES\n$dupes"; # stray backslash before DUPES removed
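
One possible way to get the frequency report is a separate pass that counts keys in a hash. This is only a minimal sketch, not a modification of the script above. It rests on several assumptions: the key being counted is the text before the first `=` (the same side the script's regex captures), the count is the number of lines sharing that key, and the report format is `key<TAB>count`, since the intended report layout is not shown here. The helper names `count_keys` and `dupe_report` and the sample data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many lines share each key.
# Assumption: the key is everything before the first '='.
sub count_keys {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /=/;           # skip lines without a separator
        my ($key) = split /=/, $line, 2;    # text before the first '='
        $count{$key}++;
    }
    return \%count;
}

# Return report lines for keys seen more than once,
# most frequent first, ties broken alphabetically.
sub dupe_report {
    my ($count) = @_;
    my @lines;
    for my $key (sort { $count->{$b} <=> $count->{$a} or $a cmp $b }
                 grep { $count->{$_} > 1 } keys %$count) {
        push @lines, "$key\t$count->{$key}\n";
    }
    return @lines;
}

# Demo on made-up pseudo-data, read through an in-memory filehandle.
my $sample = "Albert=albart\nAlbert=elbert\nAlbert=albert\nJohn=jon\n";
open my $fh, '<', \$sample or die "Cannot open sample: $!";
print dupe_report(count_keys($fh));    # prints "Albert<TAB>3"
close $fh;
```

To run this on real data, the in-memory filehandle could be replaced with `open my $fh, '<', $filename`, and the printed lines redirected to a report file. Unlike the script above, a hash does not need the duplicate lines to be adjacent. If the goal is to count distinct transliterations rather than lines, a nested hash keyed on both sides would be one option.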