Remove Duplicate Filenames in Two Very Large Directories

Hello Gurus,

O/S: RHEL4
I have a requirement to compare two Linux directories for duplicate filenames and remove the duplicates. These directories are close to 2 TB each. I have tried running:

Prompt>diff -r data1/ data2/

I have tried this as well:

jason@jason-desktop:~$ cat script.sh
#!/bin/bash

# Print each filename that diff flags as differing between the trees.
for files in $(diff -r data1/ data2/ | awk -F":" '{print $2}'); do
    echo "$files"
done
jason@jason-desktop:~$

I wanted to capture the output of the above command in a variable and use it for deletion. This approach does not work, and the machine's load goes too high for production. I have also thought of trying rsync with the delete flag, but I am unsure whether it will compare both directories correctly.

Can someone please point me in the right direction as to which commands or approaches will best accomplish this task?

I have also searched for this on unix.com as well as the web.

Your support and assistance are greatly appreciated.

Jaysunn

Do your solutions fail because they don't produce the right result, or because they increase the load too much on the production system?

I would test whatever solution you choose on a small pair of directories that you build by hand, so you can see whether it is working.

If it is a load issue, try using the 'nice' command to lower the priority of your process.
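For example (the output file is just an illustration):

nice -n 19 diff -r data1/ data2/ > /tmp/diffs.txt

nice -n 19 runs the command at the lowest CPU priority, so it yields to the production workload.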

rsync would probably work as well, but I would test it thoroughly on sample data to make sure it's doing what you want.
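For instance, a dry run along these lines should list the relative paths present in both trees without touching anything (untested here, so verify on sample data first):

rsync -rvnI --existing data2/ data1/

-n makes it a dry run, --existing restricts rsync to files that already exist under data1/, and -I (--ignore-times) forces even apparently unchanged files to be listed.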

Note that your "diff -r" example actually compares the contents of each file rather than just the filenames.
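If you only care whether files match, -q and -s keep the output manageable, e.g.:

diff -rqs data1/ data2/ | grep identical

-q reports only whether two files differ instead of printing the differences, and -s reports files that are identical, so the grep leaves just the pairs that match in both name and content. Be aware this still reads every file, which is where your load problem comes from.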

How many files are there in each directory tree?

Please expand and explain what constitutes a "duplicate filename". Is it a file in the same relative position in the tree as a file with the same name, or something more complex?

When a "duplicate filename" is found, which one (if any) do you prefer to keep?

Thanks for your reply.

I have never tried to run a wc -l because it takes so long. I would estimate around 2 million files on each partition, none larger than 2 MB.
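If I ever do count them, I assume something like this would work, just slowly:

find /data1 -type f | wc -l
find /data2 -type f | wc -l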

The directory structure consists of two separate partitions that reside on a serial-attached storage system.

The files are all *.mp3 or *.flv files. We are running out of space on this system, and I have confirmed that there are duplicate files, e.g.

/data1/586950.mp3
/data2/586950.mp3

Every file has seven digits followed by either the .mp3 or .flv extension. I would like a script to look at each partition and, if it finds a copy of itself, remove it from the /data1 partition, freeing up space on /data2.

I hope I explained my scenario well enough.

Thanks Again,

Jaysunn

The above sentence does not make sense to me.

Also, is there a directory hierarchy or is there just /data1 and /data2 with no subdirectories?

Wow,
I realized from your questions that I really did not provide much detail. Thanks for attempting to decipher it.

Once the script identifies a duplicate file residing on the /data1 partition, I would like it to run rm to remove the file from /data2, cleaning up space on that partition.

Yes, there is a hierarchy involved. Here is a snippet of it. Each partition has four-to-six-letter subdirectories that are mirrored on the other partition, and files within that structure could be the same.

/data1/wcnn/*.mp3
/data1/wxxr/*.mp3
/data1/trrn/*.mp3

/data2/wcnn/*.mp3
/data2/wxxr/*.mp3
/data2/trrn/*.mp3

So the same mp3 file may exist under the same station-abbreviation directory on both /data1 and /data2. I only need that file on one partition.
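To illustrate what I am after, here is a rough, untested sketch (the /tmp list locations are just placeholders):

#!/bin/bash
# Build sorted lists of paths relative to each partition. The
# seven-digit names contain no spaces, so line-oriented tools are safe.
( cd /data1 && find . -type f | sed 's|^\./||' | sort ) > /tmp/data1.list
( cd /data2 && find . -type f | sed 's|^\./||' | sort ) > /tmp/data2.list

# comm -12 prints only the lines common to both sorted lists, i.e.
# the relative paths that exist on both partitions.
comm -12 /tmp/data1.list /tmp/data2.list |
while IFS= read -r rel; do
    echo "/data2/$rel"    # candidates to remove, once verified
done

I would review (and probably checksum) that list before wiring in any rm.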

If I can provide the output of any commands, please let me know.

Jaysunn

I checked back on your posts and found this one.
Suggestion: use fdupes(1) to find the duplicate files :wink:
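For example (untested on trees this size, and fdupes may need to be installed on RHEL4 first):

# List sets of files with identical content across both partitions:
fdupes -r /data1 /data2

# Or delete interactively, prompting for which copy of each set to keep:
# fdupes -rd /data1 /data2

Bear in mind that fdupes compares file contents, so it will read a lot of data on 2 TB partitions.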

Hey Thanks,

I actually wrote this. I obtained the file paths from a find command, creating the data1.txt and data2.txt files.
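Something like this, from memory (the exact invocation may have differed):

find /data1 -type f > data1.txt
find /data2 -type f > data2.txt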

F1=data1.txt
F2=data2.txt

while IFS= read -r line; do
    # Translate the /data2 path to its /data1 counterpart.
    other=$(echo "$line" | sed 's/data2/data1/')

    # Is the counterpart listed in data1.txt?
    if grep -Fqx "$other" "$F1"; then
        # diff exits 0 when the two files are identical.
        if diff -q "$line" "$other" > /dev/null; then
            echo "$line"
        fi
    fi
done < "$F2"

Then I removed them with the script below. Most likely a poor way to achieve this, but it worked.

#!/bin/bash
# duplicate_data holds the /data2 paths found by the comparison script.

while IFS= read -r file; do
    # Counterpart path on the /data1 partition.
    dFile=$(echo "$file" | sed 's/data2/data1/')

    # Only proceed when both copies exist.
    if [ -f "$file" ] && [ -f "$dFile" ]; then
        # Compare checksums before deleting anything.
        mdFile1=$(md5sum "$file" | awk '{print $1}')
        mdFile2=$(md5sum "$dFile" | awk '{print $1}')

        if [ "$mdFile1" = "$mdFile2" ]; then
            echo "$file"
            rm -f "$dFile"
        fi
    fi
done < duplicate_data

I used an input file (duplicate_data) to drive the removal loop.


Wow,

How simple! I wish I had known about this before.

Thanks so much for your help.

Jaysunn