Deleting duplicate glosses in a dictionary entry

gimley · August 19, 2013, 8:59pm

I am working on an Urdu to Hindi dictionary and I have created the following file structure:

Headword=Gloss1,Gloss2,Gloss3

i.e. glosses delimited by a comma.

It so happens that in some cases (6000+ in a file of 200,000+) the glosses are duplicated.
Since this is likely to recur, could a macro or a script check the glosses on the right-hand side and, where there are duplicates, remove them so that only a single copy of each gloss remains?
An example will make this clear:
Input

a=b,c,b
d=p,q,p
e=z,y,g,z,g,y

The expected output would be

a=b,c
d=p,q
e=g,y,z

In case live data is needed, here is a sample:

=,
=,
=,
=,
=,
=,
=,,,
=,
=,,,
= ,
= ,
= ,
= ,
=,
=,,

An Awk or Perl script would be of help. I am on Windows Vista and have no access to Unix.
I tried the following script posted on the site, but it does not give the expected results:

{
for (I=1;I<NF;I++)
{
for (J=I+1;J<=NF;J++)
{
if ($I == $J ) { print $I": " $0 }
}
}
}

Many thanks

balajesuri · August 19, 2013, 9:28pm

Here's a Perl program, though I couldn't test it with the actual data (Urdu and Hindi characters). It works for ASCII input (a=b,c,b...).

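#!/usr/bin/perl

use warnings;
use strict;

my ($line, @lr, %hindi_words);
# Three-argument open with an error check; the input file name is hard-coded.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
while ($line = <$in>) {
    chomp($line);
    %hindi_words = ();                  # start with an empty set for each entry
    @lr = split(/=/, $line, 2);         # headword on the left, glosses on the right
    for (split(/,/, $lr[1])) {
        $hindi_words{$_} = 1;           # hash keys keep one copy of each gloss
    }
    # Hash order is arbitrary, so sort the keys for a repeatable result.
    print "$lr[0]=", join(',', sort keys %hindi_words), "\n";
}
close($in);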

By the way, for this program logically similar words like , or , or , are different.

rdcwayx · August 19, 2013, 10:24pm

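awk -F "[=,]" '{delete a                        # empty the gloss set for each line
                printf "%s=", $1                # headword; %s keeps any % in the data literal
                for (i=2;i<=NF;i++) a[$i]       # referencing a[$i] creates each key once
                for (i in a) printf "%s,", i    # the order of this loop is unspecified
                printf "\n"}' infile |sed 's/,$//'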
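
A variant sketch, not from the original posts: the loop over the array above returns the glosses in an unspecified order, which may matter if the first gloss is meant to be the primary one. Assuming the same Headword=Gloss1,Gloss2 layout, the function below keeps each gloss at its first occurrence and builds the line without a trailing comma, so no sed step is needed. The name dedupe_glosses is hypothetical, and lines without an "=" are assumed to be passed through unchanged.

```shell
# dedupe_glosses: hypothetical helper name, not from the thread.
# Reads the named files (or standard input) and prints each entry
# with repeated glosses removed, keeping first-occurrence order.
dedupe_glosses() {
    awk '
    {
        pos = index($0, "=")
        if (pos == 0) { print; next }        # no "=": leave the line as it is
        head = substr($0, 1, pos - 1)        # headword, left of the first "="
        n = split(substr($0, pos + 1), gloss, ",")
        split("", seen)                      # portable way to empty an array
        out = ""
        k = 0                                # number of glosses kept so far
        for (i = 1; i <= n; i++) {
            if (gloss[i] in seen) continue   # already kept on this line
            seen[gloss[i]] = 1
            out = (k++ ? out "," : "") gloss[i]
        }
        print head "=" out
    }' "$@"
}
```

Called as dedupe_glosses infile > outfile. With the sample input from the question, the third line comes out as e=z,y,g (first-occurrence order) rather than the sorted e=g,y,z.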

gimley · August 19, 2013, 10:48pm

Many thanks. The programs worked beautifully. I hope someone else will also find the programs useful.