Remove Duplicate Filenames in Two Very Large Directories

Hello Gurus,

O/S: RHEL4
I have a requirement to compare two Linux directories for duplicate filenames and remove the duplicates. These directories are close to 2 TB each. I have tried running:

Prompt>diff -r data1/ data2/

I have tried this as well:

jason@jason-desktop:~$ cat script.sh
#!/bin/bash

# Print each filename that diff flags as differing between the trees.
for files in $(diff -r data1/ data2/ | awk -F":" '{print $2}'); do
    echo "$files"
done
jason@jason-desktop:~$

I wanted to capture the output of the above command in a variable and use it for deletion. This approach does not work, and the machine's load goes too high for production. I have also thought of trying rsync with the delete flag, but I am unsure whether it will compare both directories correctly.

Can someone please point me in the right direction as to which commands or approaches will best accomplish this task?

I have also searched for this on unix.com as well as the web.

Your support and assistance are greatly appreciated.

Jaysunn

Do your solutions fail because they don't produce the right result, or because they increase the load too much on the production system?

I would test whatever solution you choose on a small pair of directories that you build by hand, so you can see whether it is working.

If it is a load issue, try using the 'nice' command to lower the priority of your process.
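For example (the output file is just an illustration):

nice -n 19 diff -r data1/ data2/ > /tmp/diffs.txt

nice -n 19 runs the command at the lowest CPU priority, so it yields to the production workload.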

rsync would probably work as well, but I would test it thoroughly on sample data to make sure it's doing what you want.
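For instance, a dry run along these lines should list the relative paths present in both trees without touching anything (untested here, so verify on sample data first):

rsync -rvnI --existing data2/ data1/

-n makes it a dry run, --existing restricts rsync to files that already exist under data1/, and -I (--ignore-times) forces even apparently unchanged files to be listed.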

Note that your "diff -r" example actually compares the contents of each file rather than just the filenames.
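If you only care whether files match, -q and -s keep the output manageable, e.g.:

diff -rqs data1/ data2/ | grep identical

-q reports only whether two files differ instead of printing the differences, and -s reports files that are identical, so the grep leaves just the pairs that match in both name and content. Be aware this still reads every file, which is where your load problem comes from.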

How many files are there in each directory tree?

Please expand and explain what constitutes a "duplicate filename". Is it a file in the same relative position in the tree as a file with the same name, or something more complex?

When a "duplicate filename" is found, which one (if any) do you prefer to keep?

Thanks for your reply.

I have never tried to run a wc -l because it takes so long. I would estimate around 2 million files on each partition, none larger than 2 MB.
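If I ever do count them, I assume something like this would work, just slowly:

find /data1 -type f | wc -l
find /data2 -type f | wc -l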

The directory structure consists of two separate partitions that reside on a serial-attached storage system.

The files are all *.mp3 or *.flv files. We are running out of space on this system, and I have confirmed that there are duplicate files, e.g.

/data1/586950.mp3
/data2/586950.mp3

Every file has seven digits followed by either the .mp3 or .flv extension. I would like a script to look at each partition and, if it finds a copy of itself, remove it from the /data1 partition, freeing up space on /data2.

I hope I explained my scenario well enough.

Thanks Again,

Jaysunn

The above sentence does not make sense to me.

Also, is there a directory hierarchy or is there just /data1 and /data2 with no subdirectories?

Wow,
I realized from your questions that I really did not provide much detail. Thanks for attempting to decipher it.

Once the script identifies a duplicate file residing on the /data1 partition, I would like it to run rm to remove the file from /data2, cleaning up space on that partition.

Yes, there is a hierarchy involved. Here is a snippet of it. Each partition has four-to-six-letter subdirectories that are mirrored on the other partition, and files within that structure could be the same.

/data1/wcnn/*.mp3
/data1/wxxr/*.mp3
/data1/trrn/*.mp3

/data2/wcnn/*.mp3
/data2/wxxr/*.mp3
/data2/trrn/*.mp3

So the same mp3 file may exist under the same station-abbreviation directory on both /data1 and /data2. I only need that file on one partition.
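To illustrate what I am after, here is a rough, untested sketch (the /tmp list locations are just placeholders):

#!/bin/bash
# Build sorted lists of paths relative to each partition. The
# seven-digit names contain no spaces, so line-oriented tools are safe.
( cd /data1 && find . -type f | sed 's|^\./||' | sort ) > /tmp/data1.list
( cd /data2 && find . -type f | sed 's|^\./||' | sort ) > /tmp/data2.list

# comm -12 prints only the lines common to both sorted lists, i.e.
# the relative paths that exist on both partitions.
comm -12 /tmp/data1.list /tmp/data2.list |
while IFS= read -r rel; do
    echo "/data2/$rel"    # candidates to remove, once verified
done

I would review (and probably checksum) that list before wiring in any rm.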

If I can provide the output of any commands, please let me know.

Jaysunn

I checked back on your posts and found this one.
Suggestion: use fdupes(1) to find the duplicate files :wink:
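For example (untested on trees this size, and fdupes may need to be installed on RHEL4 first):

# List sets of files with identical content across both partitions:
fdupes -r /data1 /data2

# Or delete interactively, prompting for which copy of each set to keep:
# fdupes -rd /data1 /data2

Bear in mind that fdupes compares file contents, so it will read a lot of data on 2 TB partitions.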

Hey Thanks,

I actually wrote this. I obtained the file paths from a find command, creating the data1.txt and data2.txt files.
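Something like this, from memory (the exact invocation may have differed):

find /data1 -type f > data1.txt
find /data2 -type f > data2.txt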

F1=data1.txt
F2=data2.txt

while IFS= read -r line; do
    # Translate the /data2 path to its /data1 counterpart.
    other=$(echo "$line" | sed 's/data2/data1/')

    # Is the counterpart listed in data1.txt?
    if grep -Fqx "$other" "$F1"; then
        # diff exits 0 when the two files are identical.
        if diff -q "$line" "$other" > /dev/null; then
            echo "$line"
        fi
    fi
done < "$F2"

Then I removed them with the script below. Most likely a poor way to achieve this, but it worked.

#!/bin/bash
# duplicate_data holds the /data2 paths found by the comparison script.

while IFS= read -r file; do
    # Counterpart path on the /data1 partition.
    dFile=$(echo "$file" | sed 's/data2/data1/')

    # Only proceed when both copies exist.
    if [ -f "$file" ] && [ -f "$dFile" ]; then
        # Compare checksums before deleting anything.
        mdFile1=$(md5sum "$file" | awk '{print $1}')
        mdFile2=$(md5sum "$dFile" | awk '{print $1}')

        if [ "$mdFile1" = "$mdFile2" ]; then
            echo "$file"
            rm -f "$dFile"
        fi
    fi
done < duplicate_data

I used an input file (duplicate_data) to drive the removal loop.


Wow,

How simple! I wish I had known about this before.

Thanks so much for your help.

Jaysunn