Best way to diff two huge directory trees

Hi

I have a job that will be running nightly incremental backups of a large directory tree.

I've done the initial backup, and now I want to write a script to verify that all the files were transferred correctly. I tried something like this, which works in principle on small trees:

diff -r -q $src_dir $dst_dir  >& diffreport.txt

The problem with this is that it is very slow. The directory I am backing up is about 2 TB.

I also tried using find and sum to dump the checksums to two files, one for the source directory and one for the destination, and comparing them. These are the commands I used:

find $src_dir -type f -print0 | xargs -0 sum > src_dir_checksums.txt
find $dst_dir -type f -print0 | xargs -0 sum > dst_dir_checksums.txt
diff src_dir_checksums.txt dst_dir_checksums.txt

But for some reason find traverses the two directories in different orders (they are on different machines), so the two files don't line up.

Any help would be greatly appreciated.

Thanks in advance,
Sam

Try rsync; you can google for "rsync incremental backup".

What about just comparing the output of

# cd /path/to/directory
# du
16      ./somedir
7200    ./somedir/1
1200    ./somedir/2
80      ./someotherdir
14512   .

This won't check that the files are exact copies, but it would verify the sizes of the files in the directories.
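
If both trees are reachable from one machine, one way to line the two du listings up is to run du from inside each tree (so the paths are relative) and sort by path before diffing. A sketch, with $src_dir and $dst_dir standing for the two roots:

```shell
# Per-directory block totals, with relative paths so the two lists match up.
( cd "$src_dir" && du | sort -k2 ) > src_du.txt
( cd "$dst_dir" && du | sort -k2 ) > dst_du.txt
diff src_du.txt dst_du.txt
```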

Hi,
Thanks for your reply.

I actually already compared the sizes using du. They're quite similar but not the same. I suspect that's because the directory entry sizes are included in the totals, and those differ between the two machines where the trees are stored (that's just a guess).

So I think I need something more reliable.

Sam

Sort the checksum files by filename before you diff them.
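
In case it helps, a sketch of that fix. Running find from inside each tree keeps the recorded paths relative, so matching files produce matching lines, and sorting on the filename (field 3 of sum's output) puts both lists in the same order. Filenames containing whitespace will still confuse the sort key:

```shell
# Checksum every file with paths relative to each tree's root,
# then sort by filename (field 3 of sum's output) before diffing.
( cd "$src_dir" && find . -type f -print0 | xargs -0 sum ) | sort -k3 > src_dir_checksums.txt
( cd "$dst_dir" && find . -type f -print0 | xargs -0 sum ) | sort -k3 > dst_dir_checksums.txt
diff src_dir_checksums.txt dst_dir_checksums.txt
```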

Hi.

There is a script, cmptree, at Unix Review > The Shell Corner: cmptree,
which may be useful. It uses cmp to compare files. The cmp utility reads files as binary, so non-text files can be compared successfully.

If you are solving this problem essentially once, then my feeling is that reading an entire file to compute a checksum may waste cycles if the differences occur early in the files.

In fact, the method I prefer is to check the lengths of the files first. This is a low-overhead operation, using the stat utility on Linux or ls otherwise. If the lengths differ, the files differ. If the lengths are the same, then one can use something like cmp to compare the files.
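
A rough sketch of that two-stage check for one file pair (compare_file is a hypothetical helper; stat -c %s is the GNU form for size-in-bytes, and the flag differs on BSD):

```shell
# Hypothetical helper: cheap size check first, byte-by-byte cmp only if sizes match.
compare_file() {
    src=$1 dst=$2
    s1=$(stat -c %s "$src") || return 2
    s2=$(stat -c %s "$dst") || return 2
    if [ "$s1" -ne "$s2" ]; then
        echo "SIZE DIFFERS: $src ($s1 bytes) vs $dst ($s2 bytes)"
        return 1
    fi
    # Same length, so pay for the full read:
    cmp -s "$src" "$dst" || { echo "CONTENT DIFFERS: $src vs $dst"; return 1; }
}
```

Used as, e.g., `compare_file "$src_dir/some/file" "$dst_dir/some/file"`; an exit status of 0 means the pair matched.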

The one disadvantage I saw in cmptree is that it does not handle filenames with embedded whitespace, so if you have such files, the published version of cmptree will not be useful ... cheers, drl

Hi drl
Thanks for that feedback. That's a good idea. Comparing lengths is probably a sufficient check and a lot quicker, which addresses my main problem (terabytes of data to check).

Sam