Identifying dupes within a database and creating unique sub-sets

gimley · December 16, 2013, 8:58pm

Hello,
I have a database of name variants with the following structure:

variant=variant=variant

The number of variants can be as many as thirty to forty.
Since the database is quite large (around 60,000 lines at present), duplicate sets of variants creep in. Thus I may have

John=Johann=Jon

and, some hundred lines further on,

Jan=Johann

What I need is a script (Perl or AWK, since I work under Windows) which could do the following:

1. Identify such duplicates. Thus in the example above

Johann

is a duplicate entry.
2. Connect up both entries, resulting in one single entry:

John=Johann=Jon=Jan=Johann

3. Clean up the dupe(s) and provide one single set of unique name variants:

John=Johann=Jon=Jan

The script, I am sure, would also prove useful for others who face similar problems of duplication in their databases.
I am giving a pseudo example as input below:

Peter=Pieter=Piotr
Mary=Mariam
Pierre=Peter
Marium=Mary=Marie=Maria
Shyam=Syam=Siam
Shym=Shyam=Shhyam=Shayam=Sham=Syam=Siam=Sam

The expected output would be:

Marium=Mary=Marie=Maria=Mariam
Peter=Pieter=Piotr=Pierre
Sam=Sham=Shayam=Shhyam=Shyam=Shym=Siam=Syam

Many thanks in advance for your help.

Chubler_XL · December 16, 2013, 10:01pm

You could try this, but I'm not sure how quick it will be:

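awk '
# Return list with repeated (and empty) "="-separated entries removed, order kept.
function remove_dups(list, have, num, keys, i, new) {
    have[""]                      # treat the empty string as already seen
    num=split(list, keys, "=")
    for(i=1;i<=num;i++) {
       if(!(keys[i] in have)) new=new "=" keys[i]
       have[keys[i]]
    }
    return substr(new,2)          # drop the leading "="
}
# Merge one input line into List[]; Found[] maps each variant to the key of its set.
function merge(list, num, keys,i,new) {
   new=remove_dups(list)
   num=split(new, keys, "=")
   master=keys[1]
   for(i=1;i<=num;i++)
      if(keys[i] in Found) {      # variant already known: absorb its set
          new = remove_dups(List[Found[keys[i]]] "=" new)
          delete List[Found[keys[i]]]
      }
   num=split(new, keys, "=")
   List[master]=new
   for(i=1;i<=num;i++) Found[keys[i]]=master
}
{merge($0)}
END { for (l in List) print List[l] }' infile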
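
Since speed on a large file is the open question here, the same grouping can also be written as a union-find (disjoint-set) pass, which avoids re-splitting the merged strings. This is only an alternative sketch, not the script above: the names merge_variants, find, parent, order, roots and group are made up for illustration, and it assumes one "="-separated set per line with no stray whitespace or carriage returns in the names.

```shell
# merge_variants [FILE...]: print one merged line per connected set of variants.
# Hypothetical helper; reads stdin when no file is given.
merge_variants() {
    awk -F= '
    # find(x): return the representative of the set holding x, compressing the path.
    function find(x,    root, up) {
        root = x
        while (parent[root] != root) root = parent[root]
        while (parent[x] != root) { up = parent[x]; parent[x] = root; x = up }
        return root
    }
    {
        first = ""
        for (i = 1; i <= NF; i++) {
            name = $i ""                      # force string type for comparisons
            if (name == "") continue          # ignore empty fields
            if (!(name in parent)) {          # first sighting: new singleton set
                parent[name] = name
                order[++n] = name             # remember first-seen order
            }
            if (first == "") first = name
            else {
                a = find(first); b = find(name)
                if (a != b) parent[b] = a     # union the two sets
            }
        }
    }
    END {
        # Collect every variant under its representative, in first-seen order.
        for (k = 1; k <= n; k++) {
            r = find(order[k])
            if (r in group) group[r] = group[r] "=" order[k]
            else { roots[++m] = r; group[r] = order[k] }
        }
        for (k = 1; k <= m; k++) print group[roots[k]]
    }' "$@"
}
```

It would be called as merge_variants infile > outfile. Unlike the for (l in List) loop, whose order is unspecified in awk, this sketch prints sets and variants in the order they first appear in the input.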

gimley · December 16, 2013, 11:00pm

Many thanks. It was pretty fast. Zipped through 20,000 lines in a few seconds. I doubt that there are any issues, since I tested the output file for dupes and there were none.
Many thanks.
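
For anyone wanting to repeat that kind of check, here is a minimal sketch of one way to do it. It is an assumption about what such a test might look like, not the exact test used: the function name find_cross_line_dupes is made up, and it simply reports any variant that occurs on more than one line of the output file.

```shell
# find_cross_line_dupes FILE: print each variant that appears on two or more lines.
# Hypothetical helper; prints nothing when every variant belongs to exactly one line.
find_cross_line_dupes() {
    awk -F= '
    {
        split("", seen_here)                  # portable way to clear the array
        for (i = 1; i <= NF; i++) {
            if ($i in seen_here) continue     # repeat within the same line: count once
            seen_here[$i]
            if (++lines[$i] == 2) print $i    # report on the second line it shows up
        }
    }' "$1"
}
```

Running find_cross_line_dupes outfile on a correctly merged file should produce no output at all.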