Remove duplicate files based on text string?

Hi

I have been struggling with a script for removing duplicate messages from a shared mailbox.
I would like to search for duplicate messages based on the "Message-ID" string within the message files.

I have managed to find the duplicate "Message-ID" strings and (if I wanted to) could delete the files in which they were found.
My problem is how to preserve one copy of each message.

My script so far:


#!/bin/tcsh
set dir=/my/maildir

# collect every Message-ID line, sort so uniq -d can spot the repeats,
# then list each file that contains a repeated ID
foreach file (`grep -h "Message-ID: <" $dir/* | sort | uniq -d | xargs -i \grep -l "{}" $dir/*`)
    # this removes ALL copies of a duplicated message, not just the extras
    rm -f "$file"
end


Any ideas?

Thanks // Tomas

---------- Post updated at 06:02 PM ---------- Previous update was at 10:18 AM ----------

FYI, solved:
-------------------
#!/bin/tcsh
set maildir=/my/maildir
foreach dupstring ("`grep -m 1 -h -R "^Message-ID:" $maildir/ | sort | uniq -d`")
grep -l -R "$dupstring" $maildir/ |sed 1d |xargs -i \rm -f "{}"
end
-------------------

// Tomas

I need a solution to this exact same problem, but I get an error when I run your script:

#!/bin/tcsh
set maildir=/my/maildir
foreach dupstring ("`grep -m 1 -h -R "^Message-ID:" $maildir/ | sort | uniq -d`")
grep -l -R "$dupstring" $maildir/ |sed 1d |xargs -i \rm -f "{}"
end

I did set the mail archive directory correctly, but that is not the issue here.

Scripts ]$ ./remove_dupes.sh 
./remove_dupes.sh: line 4: syntax error near unexpected token `"('

I'm not sure whether this is because I use the bash shell as opposed to tcsh, or whether it is a nested double-quote issue. I have tried to fix it, but my syntax skills are still developing. Help appreciated.
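(For reference: foreach and the quoted-backticks idiom are tcsh syntax, so that error means bash, not tcsh, ended up interpreting the script -- typically because the #!/bin/tcsh line is missing, not the very first line, or mangled. A rough bash equivalent, offered only as an untested sketch that assumes GNU grep and xargs, with -F added so the Message-ID is matched as a literal string rather than a regex, might be:)

-------------------
#!/bin/bash
maildir=/my/maildir

# list each file's Message-ID header once, keep only the repeated ones
grep -m 1 -h -R "^Message-ID:" "$maildir"/ | sort | uniq -d |
while IFS= read -r dupstring; do
    # keep the first file containing this ID (sed 1d) and delete the rest
    grep -l -R -F "$dupstring" "$maildir"/ | sed 1d | xargs -I{} rm -f "{}"
done
-------------------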

---------- Post updated 08-28-09 at 02:10 AM ---------- Previous update was 08-27-09 at 12:54 PM ----------

It turns out the BSD xargs on Mac OS X doesn't support the -i option at all:

blake [ ~/scratch ]$ xargs -i
xargs: illegal option -- i
#!/bin/tcsh
set maildir=/Users/blake/Library/Mail/Mailboxes/Archive.mbox/Messages
foreach dupstring ("`grep -m 1 -h -R ^Message-ID: $maildir/ | sort | uniq -d`")
    # without -i, xargs appends the filenames itself, so no "{}" placeholder
    # is needed (BSD xargs spells the GNU -i option -I)
    grep -l -R "$dupstring" $maildir/ | sed 1d | xargs \rm -f
end

I removed the -i from the xargs command and tested the script on a directory containing a few sample emails with duplicate Message-IDs, and it worked. So I applied it to my backed-up archive of 76,000+ emails with tons of duplicates and started it last night. It's still running, and it may continue running for days: as constructed, the script compares the "Message-ID:" string from each of the 76,000 emails against all 76,000 others, which is on the order of 76,000 x 76,000, nearly 5.8 billion string comparisons!

Looking at my data, I see that in the vast majority of cases (in fact, in every case I found), the emails with duplicate Message-IDs sit literally right next to each other. Here is a small section of the directory:

-rw-r--r--  1 blake  staff     76634 Jan 30  2008 101576.emlx **
-rw-r--r--  1 blake  staff     76627 Jan 30  2008 101577.emlx **
-rw-r--r--  1 blake  staff     12083 Jan 30  2008 101587.emlx
-rw-r--r--  1 blake  staff    104673 Jan 30  2008 101588.emlx
-rw-r--r--  1 blake  staff     67374 Jan 30  2008 101597.emlx **
-rw-r--r--  1 blake  staff     67374 Jan 30  2008 101598.emlx **

** = duplicate Message-IDs

How do I modify this script so that it only compares a Message-ID in one file to the Message-ID in the next file (or next n files)? That should dramatically speed up this process. Thanks.
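
One way to do that (a minimal sketch in bash, untested; it assumes the *.emlx files glob in the order shown above and that each file contains a single Message-ID header):

-------------------
#!/bin/bash
maildir=/Users/blake/Library/Mail/Mailboxes/Archive.mbox/Messages

previd=""
for file in "$maildir"/*.emlx; do
    # one grep per file (76,000 in total) instead of one recursive grep
    # over the whole tree for every duplicated ID
    id=$(grep -m 1 "^Message-ID:" "$file")
    if [ -n "$id" ] && [ "$id" = "$previd" ]; then
        rm -f "$file"    # same ID as the previous file: a duplicate
    else
        previd=$id
    fi
done
-------------------

As written this only catches duplicates that sit directly next to each other; looking at the next n files would mean remembering the last n IDs instead of just one.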