O/S RHEL4
I have a requirement to compare two Linux-based directory trees for duplicate filenames and remove the duplicates. These directories are close to 2 TB each. I have tried running:
Prompt>diff -r data1/ data2/
I have tried this as well:
jason@jason-desktop:~$ cat script.sh
#!/bin/bash
# -q reports which files differ instead of printing their contents
diff -rq data1/ data2/ | awk -F': ' '/^Only in /{print $2}' |
while IFS= read -r file; do
    echo "$file"
done
jason@jason-desktop:~$
I wanted to capture the output of the above command in a variable for deletion. This scenario does not work, and the machine's load goes too high for production. I have also thought of trying rsync with the delete flag, but I am unsure whether it will compare both directories successfully.
Can someone please point me in the right direction as to which commands or approaches will best accomplish my task?
I have also tried searching for this on unix.com as well as the wider web.
Your support and assistance is greatly appreciated.
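Since the question is about duplicate *filenames*, one low-cost direction is to diff sorted name lists with comm rather than diffing file contents. Below is a minimal sketch, using a throwaway demo tree in place of the real data1/data2 (the directory names and files here are stand-ins; `-printf '%P\n'` is a GNU find extension that prints each path relative to the starting directory):

```shell
#!/bin/bash
# Sketch: compare two trees by relative filename only, so no file
# contents are read. A throwaway tree stands in for the real data1/data2.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir -p data1/abcd data2/abcd
echo a > data1/abcd/1234567.mp3        # same name in both trees
echo a > data2/abcd/1234567.mp3
echo b > data2/abcd/7654321.flv        # only in data2
# GNU find's %P prints the path relative to the starting point
find data1 -type f -printf '%P\n' | sort > list1.txt
find data2 -type f -printf '%P\n' | sort > list2.txt
# comm -12 keeps only lines common to both sorted lists,
# i.e. the duplicate relative filenames
comm -12 list1.txt list2.txt
```

On 2 TB trees this touches only directory metadata, so it avoids the load problem that content-based diffing causes.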
Your example of "diff -r" actually compares the contents of each file rather than the filename.
How many files are there in each directory tree?
Please expand and explain what constitutes a "duplicate filename". Is it a file in the same relative position in the tree as a file with the same name, or something more complex?
Please explain when a "duplicate filename" is found, which one (if any) you prefer to keep.
I have never run a wc -l because it takes so long. I would estimate around 2 million files in each partition, none larger than 2 MB.
The directory structure is 2 separate partitions that reside on a serial-attached storage system.
The files are all *.mp3 or *.flv files. We are running out of space on this system, and I have confirmed that there are duplicate files, e.g.:
/data1/586950.mp3
/data2/586950.mp3
Every filename has seven digits followed by either the .mp3 or .flv extension. I would like a script to look at each partition and, if it finds a copy of a file on the other, remove it from the /data1 partition, freeing up space there.
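For what it's worth, a file count via find piped to wc reads only directory metadata, not file contents, so even at the estimated ~2 million files it is far cheaper than checksumming (though the traversal itself still takes a while). A sketch against a throwaway tree standing in for /data1:

```shell
#!/bin/bash
# Sketch: counting files reads only directory metadata, not contents.
# The temp tree here is a stand-in for the real /data1 partition.
tmp=$(mktemp -d)
mkdir -p "$tmp/data1/abcd"
touch "$tmp/data1/abcd/1234567.mp3" "$tmp/data1/abcd/7654321.flv"
# tr -d ' ' strips any padding some wc implementations add
count=$(find "$tmp/data1" -type f | wc -l | tr -d ' ')
echo "files: $count"    # prints "files: 2"
```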
Wow,
I realized from your questions that I really did not provide much detail. Thanks for attempting to decipher.
Once the script identifies that a duplicate file resides on the /data2 partition, I would like to then pass an rm argument to remove the matching file from /data1, cleaning up space on that partition.
Yes, there is a hierarchy involved. Here is a snippet of it for you: each partition has a 4-to-6-letter subdirectory that is mirrored on the other partition. Files in that structure could be the same.
I actually wrote this. I obtained the filenames and locations from a find command, creating the data1 and data2 txt files.
F1=data1.txt
F2=data2.txt
while IFS= read -r line
do
    # translate the data2 path to its data1 counterpart
    cf=$(echo "$line" | sed 's/data2/data1/g')
    # does the counterpart appear in the data1 listing? (-x: whole-line match)
    if grep -qxF "$cf" "$F1"
    then
        # are the two files identical?
        if diff -q "$line" "$cf" > /dev/null
        then
            echo "$line"
        fi
    fi
done < "$F2"
Then I removed them with the script below. Most likely a poor way to achieve this, but it worked.
#!/bin/bash
while IFS= read -r file
do
    dFile=$(echo "$file" | sed 's/data2/data1/')
    # only act when both copies exist
    if [ -f "$file" ] && [ -f "$dFile" ]
    then
        # compare md5 checksums before deleting anything
        mdFile1=$(md5sum "$file" | awk '{print $1}')
        mdFile2=$(md5sum "$dFile" | awk '{print $1}')
        if [ "$mdFile1" = "$mdFile2" ]
        then
            echo "$file"
            rm -f "$dFile"
        fi
    fi
done < duplicate_data
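One possible refinement (a sketch, not the poster's script): cmp -s stops at the first differing byte, whereas md5sum always reads both files in full, so on large media files it can save a lot of I/O. The demo tree and list below stand in for the real /data1, /data2, and duplicate_data:

```shell
#!/bin/bash
# Sketch: the same removal loop using cmp -s instead of md5sum.
# cmp stops at the first differing byte; md5sum reads every byte of both files.
tmp=$(mktemp -d)
mkdir -p "$tmp/data1" "$tmp/data2"
echo song > "$tmp/data1/1111111.mp3"
echo song > "$tmp/data2/1111111.mp3"          # identical duplicate
echo "$tmp/data2/1111111.mp3" > "$tmp/duplicate_data"
while IFS= read -r file
do
    dFile=$(echo "$file" | sed 's/data2/data1/')
    # cmp -s exits 0 only when the two files are byte-identical
    if [ -f "$file" ] && [ -f "$dFile" ] && cmp -s "$file" "$dFile"
    then
        echo "removing $dFile"
        rm -f "$dFile"
    fi
done < "$tmp/duplicate_data"
```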
I used an input file (the duplicate_data list produced by the first script) to drive the removal loop.