Hello,
I have a database of name variants with the following structure:
variant=variant=variant
The number of variants can be as many as thirty to forty.
Since the database is quite large (at present around 60,000 lines), duplicate sets of variants creep in. Thus:
John=Johann=Jon
and some hundred lines on:
Jan=Johann
What I need is a script (Perl or AWK, since I work under Windows) which could do the following:
- Identify such duplicates. Thus in the example above
Johann
is a duplicate entry, since it appears in both lines.
- Connect up both entries, resulting in one single entry:
John=Johann=Jon=Jan=Johann
- Remove the duplicate(s) and provide one single set of unique name variants:
John=Johann=Jon=Jan
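To make the steps above concrete, here is a minimal sketch of the connect-and-clean-up step for two lines that are already known to share a name. It is written in Python purely for illustration (the request is for Perl or AWK), and the function name merge_variants is hypothetical.

```python
def merge_variants(*lines):
    """Join '='-separated variant lines, keeping the first occurrence of each name."""
    seen = {}  # a dict preserves insertion order, so the first spelling seen stays first
    for line in lines:
        for name in line.strip().split("="):
            if name:
                seen.setdefault(name, None)
    return "=".join(seen)


print(merge_variants("John=Johann=Jon", "Jan=Johann"))
# prints: John=Johann=Jon=Jan
```

This only covers merging; finding which lines belong together across the whole file is a separate problem.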
The script, I am sure, would also prove useful for others who face similar problems of duplication in their databases.
Below is a made-up example as input:
Peter=Pieter=Piotr
Mary=Mariam
Pierre=Peter
Marium=Mary=Marie=Maria
Shyam=Syam=Siam
Shym=Shyam=Shhyam=Shayam=Sham=Syam=Siam=Sam
The expected output would be:
Marium=Mary=Marie=Maria=Mariam
Peter=Pieter=Piotr=Pierre
Sam=Sham=Shayam=Shhyam=Shyam=Shym=Siam=Syam
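One possible way to get from such an input to such an output is to treat every line as a set of linked names and group all names that are connected, directly or through other lines. The sketch below does this with a union-find (disjoint-set) structure. It is in Python only as an illustration, not the requested Perl or AWK, and the function name merge_variant_sets is hypothetical. It uses the sample names with the spelling Piotr, as in the expected output, and it assumes that alphabetical order within each set is acceptable, so the order of names may differ from the order shown above.

```python
def merge_variant_sets(lines):
    """Group names that share a line, directly or transitively; return sorted lines."""
    parent = {}  # name -> parent name; a root points to itself

    def find(name):
        parent.setdefault(name, name)
        root = name
        while parent[root] != root:
            root = parent[root]
        while parent[name] != root:  # path compression
            parent[name], name = root, parent[name]
        return root

    for line in lines:
        names = [n.strip() for n in line.split("=") if n.strip()]
        if not names:
            continue  # skip blank lines
        first_root = find(names[0])
        for other in names[1:]:
            parent[find(other)] = first_root  # link this name's group to the line's group
            first_root = find(names[0])

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return sorted("=".join(sorted(group)) for group in groups.values())


sample = [
    "Peter=Pieter=Piotr",
    "Mary=Mariam",
    "Pierre=Peter",
    "Marium=Mary=Marie=Maria",
    "Shyam=Syam=Siam",
    "Shym=Shyam=Shhyam=Shayam=Sham=Syam=Siam=Sam",
]
for merged in merge_variant_sets(sample):
    print(merged)
```

For a real file, the lines could be read with open(path) and the result written back out; at around 60,000 lines this approach should stay fast, since each name is looked up only a few times.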
Many thanks in advance for your help