Merging dupes on different lines in a dictionary

gimley · September 9, 2012, 11:39am

I am working on a homonym dictionary of names, i.e. names which are clustered together according to their "sound-alike" pronunciation.
An example will make this clear:

Since the dictionary is manually constructed, it often happens that two sets of "homonyms" which should be grouped together are inadvertently grouped separately. Thus:

"vishnu" appears in both the first set and the second, so both sets should be reduced to one:

I have written a program which points out such "dupes" and also the line on which they occur in the database. But since I am a newbie in Perl, try as I might, I cannot write a Perl program which will safely merge both sets where there are dupes. I have a script in UltraEdit format which does the job, but it is dreadfully slow.

I am giving below a sample of such dupes:

The expected output should be

Ideally the program should also weed out duplicates within a given row, but I have an awk program that does that job efficiently.

Any help would be really great. Many thanks in advance for a Perl or awk script. I work under Windows and hence sed will not help.

Chubler_XL · September 9, 2012, 10:42pm

Here is an awk script:

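awk -F= '
# For each line, pick a group key and assign it to every name on the line.
{ k=$1
  # If a name was seen before, reuse the key of its existing group.
  for(i=1;i<=NF;i++)
     if($i in same) k=same[$i];
  for(i=1;i<=NF;i++)
     same[$i]=k;
}
END {
   # Collect the names under their group key, then print one group per line.
   for(i in same)
      keys[same[i]]=keys[same[i]] "=" i;
   for(k in keys)
      print substr(keys[k],2);
}' infile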

gimley · September 9, 2012, 11:13pm

Dear Chubler_XL,
It works like magic. For the first time my database has no errors and all the names are perfectly merged. This has saved me days of checking and validation. The diagnostic routine I had written to identify dupes across multiple lines now shows that there are no dupes in any file.
Many thanks.