Simple directory tree diff script

I have had some issues with a data drive and have copied all of the data to a new drive. The used space is not the same on both drives; there is a 3GB difference (less on the new drive). There are millions of files on the data drive, so it is not an easy task to determine if some files are missing from the new copy. Is there a simple script I can run that will identify any files that are present on the original drive but are missing on the new drive?

I created the copy with cp -Rfp &> logfile, and the logfile did not indicate that there were any files that could not be copied.

I could run rsync in one direction, but there are some issues with the time stamps on the original drive, so I'm not sure how that would work. I'm not looking to correct any discrepancies, just to identify if they exist. I have found some dir diff scripts, but they all seem overcomplicated for what I need.
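
For reference, the rsync idea would have been something like this dry run, where -n only reports what would be copied and -c compares by checksum so the odd timestamps shouldn't matter (the paths are just placeholders); I suspect it would be very slow over millions of files, since -c reads every file on both sides.

rsync -rvnc /path/to/original_drive/ /path/to/new_drive/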

This is ntfs under windows XP and I am running bash under cygwin.

Thanks for the advice.

LMHmedchem

This will work on cygwin - I just tried it. It will take a loooong time.

diff <(find /drivea) <(find /driveb) > /tmp/diff.txt

This will speed it up a little

find /drivea > /tmp/drivea &
find /driveb > /tmp/driveb &
wait
diff /tmp/drivea /tmp/driveb > /tmp/diff.txt

Using /tmp on some architectures really improves performance.

/drivea is the mountpoint of one filesystem, /driveb is the other mountpoint.

This will NOT check file similarity, only existence of names and directories.
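
If you decide later that you need a content check as well, one option (slow, since it reads everything) is per-file checksums compared the same way; a rough sketch, assuming GNU md5sum is available:

( cd /drivea && find . -type f -exec md5sum {} + | sort -k2 > /tmp/a.md5 ) &
( cd /driveb && find . -type f -exec md5sum {} + | sort -k2 > /tmp/b.md5 ) &
wait
diff /tmp/a.md5 /tmp/b.md5 > /tmp/md5diff.txt

The cd keeps the listed paths relative, so the two checksum lists line up.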

[lecture]
And large directory trees (large in the sense of inodes, the UNIX term for file name slots) are inherently inefficient, and they become prone to errors when free inodes become scarce. In other words, 'millions' of files on a single file system are usually a really terrible idea.

Windows filesystems are not immune to this problem.

Develop a means of archiving off fasta files or whatever you are using. Save last month's files on permanent storage - disk is NOT permanent. You just discovered that, I see.

Then remove them from the disk. Just keep recent data. I realize that research or medical testing means keeping data almost forever. Ask your legal guys how long 'almost forever' is. And maybe learn about off-site archival storage in the neighborhood. Having a defined retention policy is better than trying to keep everything. Less costly, too. :slight_smile:
[/lecture]

Thanks for the tip.

On cygwin, I don't see a mount point in the way that you describe it. I usually access drives as /cygdrive/c/ for C:, etc. I made a script using the code you suggested.

#!/usr/bin/bash

find /cygdrive/e/_test > /tmp/e_test &
find /cygdrive/i/_test > /tmp/i_test &
wait
sed 's/\/cygdrive\/e//g' /tmp/e_test > /tmp/check_e
sed 's/\/cygdrive\/i//g' /tmp/i_test > /tmp/check_i
diff  /tmp/check_e /tmp/check_i > /tmp/diff.txt

Because each path starts with a different cygdrive prefix, I had to use sed to remove that part of each path. After adding that, it seems to work fine on the test directories I used. I will try with some larger directories and see if there are any issues.

I will reply to your other comments later when I get this going. You are definitely not going to get a lecture in return. I have had a number of ideas as to what "permanent storage" entails and haven't landed on anything particularly useful in that regard.

LMHmedchem

You don't need to use sed. Just cd and run each find command on the current working directory.

( cd /cygdrive/e/_test && find . > /tmp/e_test ) &
( cd /cygdrive/i/_test && find . > /tmp/i_test ) &

Or, it could just be done all in the same shell (which would be easier if you wanted to add error handling after each cd):

cd /cygdrive/e/_test
find . > /tmp/e_test &
cd /cygdrive/i/_test
find . > /tmp/i_test &

Keep in mind that find is not guaranteed to return the members of a directory in any particular order. Especially given the timestamp differences that you mentioned, if in just one directory a pair of subdirectories are visited in a different order, diff will generate a LOT of noise even though the contents may be identical.

If that's an issue, sort the output of find and then use comm.
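
For example, building on the cd approach above (untested):

( cd /cygdrive/e/_test && find . | sort > /tmp/e_sorted ) &
( cd /cygdrive/i/_test && find . | sort > /tmp/i_sorted ) &
wait
comm -3 /tmp/e_sorted /tmp/i_sorted > /tmp/missing.txt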

Regards,
Alister

Well, I ran this on my main data directory and the diff file is 500MB. This seems far too large for the size discrepancy between the two drives. Is there some particular option I should use with sort? Is there some reason not to use diff on the sorted files, and to use comm instead?

LMHmedchem

No need to use any options with sort. The default full line lexicographical sort is appropriate.

You could use diff, I suppose. In the rare case that some of your filenames begin with a tab, the diff output will be less ambiguous than that of comm -3.
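
As a reminder, comm's three output columns are lines only in the first file, lines only in the second file, and lines common to both; the digit options suppress columns, so with the sorted lists from the earlier example:

comm -23 /tmp/e_sorted /tmp/i_sorted    # names present only in the first tree
comm -13 /tmp/e_sorted /tmp/i_sorted    # names present only in the second tree
comm -3  /tmp/e_sorted /tmp/i_sorted    # both of the above, second column tab-indented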

Regards,
Alister

Well, the sorted find files differ by ~3000 lines. I take this to mean that there are ~3000 files that are missing from the one directory. The output of comm is 3091728 lines, which is the same number of lines as are in the find output for the original directory. I presume this is because the column 3 output of comm is the files that are in both trees, output I don't need to see. I presume I want comm -3 for the output I want, meaning files that are in one directory tree and not in the other?

LMHmedchem

---------- Post updated at 05:28 PM ---------- Previous update was at 02:43 PM ----------

This is the final script that I used. I have brushed it up a bit so that it is more generalized and checks a few things.

#!/usr/bin/bash

# accepts path to two directories and compares the file lists in each
# path should begin with /cygdrive/driveletter/

# assign location for output
TMPDIR='/cygdrive/c/cygwin/tmp/dir_compare'

# assign directory trees to compare
TREE1=$1
TREE2=$2

# check arguments, print help if no arguments are passed
if [ $# -eq 0 ]
then
   echo "this script expects two arguments"
   echo "each argument should be the path to a directory"
   echo "each path should start with /cygdrive/, not a relative path"
   echo "the script will compare the list of files in the directory and subdirectories"
   echo "and will report any instance where a file exists in one directory but not the other"
   echo 'output will be printed to '$TMPDIR'/comm.txt'
   exit
fi

# check if TREE1 exists
if [ ! -d $1 ];
then
   echo " "
   echo "directory " $1 "not found"
   echo "exiting"
   exit
fi
# check if TREE2 exists
if [ ! -d $2 ];
then
   echo " "
   echo "directory " $2 "not found"
   echo "exiting"
   exit
fi

# clean tmp dir if it contains files
cd $TMPDIR
FILES=(*)
FILES=${#FILES[@]}
if (( "$FILES" > 0 )) ; then
   rm *
fi

# echo some information
echo " "
echo "comparing file list of " $TREE1
echo "with file list of " $TREE2
echo " "

# cd to TREE1 and create file list for tree
cd $TREE1
find . > $TMPDIR'/check_1' &

# cd to TREE2 and create file list for tree
cd $TREE2
find . > $TMPDIR'/check_2' &

# wait for find to finish
wait

# sort output of find to keep the file lists from both dir trees in registration
sort $TMPDIR'/check_1' > $TMPDIR'/check_1_sorted'
sort $TMPDIR'/check_1' > $TMPDIR'/check_2_sorted'

# print the number of lines (files) in each directory tree
wc -l $TMPDIR'/check_1_sorted'
wc -l $TMPDIR'/check_2_sorted'

# compare the two files, only print instances where a file exists in one tree but not the other
comm -3  $TMPDIR'/check_1_sorted'  $TMPDIR'/check_2_sorted' > $TMPDIR'/comm.txt'
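
I call it with the two tree roots as arguments, something like this (the script name is just what I happened to save it as):

./dir_compare.sh /cygdrive/e/_test /cygdrive/i/_test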

Running this script indicates that I have 6,189,828 files in each tree, and the script does not find any difference in file names. I found that I had one extra directory in one of the trees. This came from some testing I was doing to see if a copy of my files had the same issue with the time stamps as the original. When I deleted this copy directory, the comm file was empty.

The only problem is that I still have a 3GB size discrepancy between the two partitions.

$ df -h
Filesystem Size Used Avail Use% Mounted on
E: 879G 502G 378G 58% /cygdrive/e
I: 831G 499G 332G 61% /cygdrive/i

The size of the E partition didn't change when I deleted the extra directory, even though the folder was quite large. I expected that to make the sizes the same. I'm not sure what else I can do to check that my copy has all of the data from the original. The results would imply that some of the files exist on both drives, but are not the same size. Is there a reasonable way to check that? It would seem like that would be a non-trivial addition to what I am doing. Is it possible for the exact same files to be on both drives but to take up different amounts of space?

LMHmedchem

---------- Post updated at 05:44 PM ---------- Previous update was at 05:28 PM ----------

I see I had a typo in the script, so I wasn't doing the correct compare. I am running again with the corrected script.

---------- Post updated at 06:51 PM ---------- Previous update was at 05:44 PM ----------

Running the corrected script, there are a few files that are different, but the total size is not much. I keep my browser profiles here and these are different because one is the browser I am using and one is a copy made yesterday.

There is nothing here that accounts for 3GB of data.

Any suggestions on what to do next? I suppose I could use the sorted find files to do a diff between each file pair, but that wouldn't exactly be speedy. The find files don't differentiate between files and directories and I don't know what happens if you feed diff a pair of directories instead of files.

LMHmedchem

If you want to determine if there are any differences, the only way is to read every byte of every file and compare it to its counterpart. Otherwise, even if file sizes match, there could still be a discrepancy. cmp may be useful for that task.

du (and df) measure the amount of storage allocated for files. They do not report file sizes. Two identical files may consume different amounts of storage on different partitions/filesystems. One factor that may affect the storage allocated to a file is the block size of the file system. Another factor is sparseness.

A very large sparse file may occupy very little space on disk even though ls and stat report a large file size. But, if that file is copied to a filesystem that does not support sparse files, or using a tool that doesn't support sparse files, the disk space consumed will balloon to match the file's size as reported by ls/stat.
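
A quick way to see the difference between the reported size and the allocated space, assuming GNU truncate, ls, and du are available (the file name is just an example):

truncate -s 100M sparse_demo    # create a 100 MB file that is all hole
ls -l sparse_demo               # reports a size of 104857600 bytes
du -k sparse_demo               # reports only a few KB actually allocated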

Regards,
Alister

I suspect that there is simply an issue with the reported size, since all of the file names match. Both partitions are windows ntfs, and the copy is smaller than the source. The copy was made using cp -Rfp under cygwin, so it may be that cp stored the copies more efficiently than they were stored in the original versions.

If I was to use the two sorted find files as a starting list for cmp, how would I differentiate the entries in the sorted list that are directories from those that are files? Since cmp is for files, will it just throw an exception if what you pass to it is a directory?

---------- Post updated 03-18-13 at 12:31 AM ---------- Previous update was 03-17-13 at 09:16 PM ----------

I have added the following to the end of my script to check each file pair with cmp. If I get this working, I will add logic to run this part based on an argument.

# further process sorted find list by checking each file pair with cmp
while read input
do
#  remove leading . from each line in find output
   TEMP=$(echo $input | sed 's/^.//g')
#  escape spaces
   LOCALFILE=$(echo $TEMP | sed 's/ /\\ /g')

   echo $TREE1$LOCALFILE
   echo $TREE2$LOCALFILE

   cmp $TREE1$LOCALFILE  $TREE2$LOCALFILE >> $TMPDIR'/byte_compare.txt'

done < $TMPDIR'/check_1_sorted'

This uses one of the sorted find lists to identify each file. If the entry from find is a directory, it seems as if cmp just prints a notification to stderr and moves on. The problem I am having now is that cmp won't accept what I have done above to escape the spaces in file names. I have done echo on the path for each file, and it appears correct, but I am getting an error from cmp,

/cygdrive/e/nlite/Presets/Last\ Session_u.ini
/cygdrive/i/nlite/Presets/Last\ Session_u.ini
cmp: invalid --ignore-initial value `/cygdrive/i/nlite/Presets/Last\'

cmp doesn't seem to be seeing anything past the escape. Am I not escaping this properly? If I don't escape the space, I get a similar error indicating that the space is breaking the input.

LMHmedchem

Keep in mind that disk space is always allocated in clusters, usually 4k. So a one byte file would use 4k of disk space. Still, that would mean there are (not quite) a million files each wasting 4k of disk space.
Are the two disks using the same file system type? There are file systems out there that make intelligent use of inode list space to store incomplete clusters, while others store those incomplete clusters on disk and "spoil" the empty part of the cluster.
And, finally, the two disks have a different size. Not much, but it may suffice to create a difference in the overhead size of the file systems' managing structures.
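
If you want to see those numbers per file, GNU stat can print both the apparent size and the allocation (somefile is just a placeholder):

stat -c '%n: %s bytes apparent, %b blocks of %B bytes allocated' somefile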

Hi.

Observations:

1) I agree with jim mcnamara about "millions of files" (especially if they are in just a few directories -- *nix filesystems can handle that, but at a cost of several indirect lookups), and with RudiC that allocation sizes may be involved.

2) If I were doing this, I would look at a comparison of the lengths of the files. The file size is easy and fast to obtain via stat, with the stat command, perl, C, etc. Then only if the file pairs had different sizes would I investigate further.

3) There is some information about backslash-escapes in cygwin at bash - Cygwin: using a path variable containing a windows path (with a space in it) - Stack Overflow -- I attribute that to the use of "\" as a path separator in MS systems -- basically, the advice seems to be to use quotes (see the small sketch after this list).

4) My vague recollection is that directory sizes are never decreased even when significant numbers of files and sub-directories are removed, at least in *nix. I have no idea if that concept holds in MS systems.

5) This problem may be in a gray area between *nix-like systems and MS systems. About the latter I know very little.
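
For item 3), a small sketch of the cmp loop using quoting instead of backslash escapes (untested on cygwin; TREE1, TREE2, and TMPDIR as in your script), which also skips the directory entries:

while IFS= read -r relpath
do
   rel=${relpath#.}                        # strip the leading "." that find prepends
   f1="$TREE1$rel"
   f2="$TREE2$rel"
   [ -d "$f1" ] && continue                # cmp only makes sense for regular files
   cmp "$f1" "$f2" >> "$TMPDIR/byte_compare.txt"
done < "$TMPDIR/check_1_sorted"

The double quotes keep embedded spaces intact, so no sed pass is needed.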

Best wishes ... cheers, drl

My file system started with a single primary data directory. This was primarily for the purposes of data backup, since it simplified the rsync setup. At this point, I have 4 primary data directories. The vast majority of the files on this drive are chemical structures in electronic format (mol files and SMILES strings). Mission critical data, such as src code, exists on DVD and even in hard copy printout at other locations. Electronic structure data defies some such permanent storage solutions, since it is of little or no value when printed on paper. Archiving such data means moving it to another hard drive, or possibly a DVD. I am skeptical about optical storage, since I have a case of CDs downstairs that I purchased with our silk-screened logo on them. That was a few years ago, but they are already unreadable by any software that I can find. When I put one of them into an optical drive, I get a message that the disk is not in a readable format, so I cannot write to them. I think it is understandable that this kind of thing makes me hesitant about storing data on such a medium.

The solution I have taken to is to have every important file on at least 4 hard drives, over at least two locations. This means two internal hard drives, synced with rsync, and two external hard drives (one off site). I replace hard drives every two years and have had good luck with this solution up to this point. On larger drives, some of my newer setups have two partitions with a smaller "working" partition at the outer edge of the drive and a larger "archive" partition for the rest.

This system does not keep things up to date in real time (like a raid1), but raid has its issues as well. I have lost many more files through my own stupidity of accidentally deleting things than I ever have through hardware or software failures. Not even a raid array can protect you from being a moron from time to time, oh that it could...

I can certainly spread my data over more directories at higher levels, or even add more partitions. All of these partitions are ntfs if that matters. I am actually getting ready to rebuild this rig, so now would be a good time to make changes. I don't often do searches from higher up directories, since there are individual project folders.

I can move things around to whatever extent would be helpful, but there are still millions of files that need to be kept somewhere (several somewheres for backup). I can dump many of them onto external drives and put them in the firesafe, but I don't know if they would be any better preserved there than in an archive partition. How long can you leave a hard drive sitting in the closet and still expect it to fire up? I guess there might not be much data on that at this point, since 1TB drives are only a few years old.

As far as my current script goes, changing to double quotes seems to work,

cmp "$TREE1$LOCALFILE" "$TREE2$LOCALFILE" >> $TMPDIR'/byte_compare.txt'

I know I tried this with quotes, but it must have been single quotes. I removed the code to escape spaces. I tried with stat by doing,

SIZE1=$(stat -c%s "$TREE1$LOCALFILE")
SIZE2=$(stat -c%s "$TREE2$LOCALFILE")

if [ "$SIZE1" != "$SIZE2" ]; then
   echo "$TREE1$LOCALFILE" >> $TMPDIR'/size_compare.txt'
fi

This takes about 20% longer than doing cmp, so unless I have set this up incorrectly, there doesn't seem to be a performance advantage, especially if you are going to do cmp anyway when you find files of different sizes (I am using size in bytes synonymously with length, so let me know if that is not correct).

Thanks for all of the help so far.

LMHmedchem

Hi.

I may be comparing apples (MS systems) to oranges (*nix systems), but here is a timing comparison of stat and cmp on a GNU/Linux box, with 2 identical files:

#!/usr/bin/env bash

# @(#) s1	Demonstrate compare timings for stat and cmp.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C stat cmp

N=${1-10000}

pl " Input data file f1 f2:"
specimen -3 -n f1 f2 | cut -c1-78

pl " Results, time for $N stat calls:"
rm -f f3
time for ((i=1;i<=$N;i++))
do
  s1=$(stat -c%s f1)
  s2=$(stat -c%s f2)
  if [ "$s1" != "$s2" ]
  then 
    pe "f1" >> f3
  fi
done
if [ -e f3 ]
then
  pe " Lines in f3: $(wc -l <f3)"
fi

pl " Results, time for $N cmp calls:"
rm -f f3
time for ((i=1;i<=$N;i++))
do
  if ! cmp f1 f2
  then
    pe "f1" >> f3
  fi
done
if [ -e f3 ]
then
  pe " Lines in f3: $(wc -l <f3)"
fi

exit 0
./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
stat (GNU coreutils) 6.10
cmp (GNU diffutils) 2.8.1

-----
 Input data file f1 f2:
Edges: 3:0:3 of 17777 lines in file "f1"
     1	Preliminary Matter.  
     2	
     3	This text of Melville's Moby-Dick is based on the Hendricks House editi
   ---
 17775	THEY GLIDED BY AS IF WITH PADLOCKS ON THEIR MOUTHS; THE SAVAGE SEA-HAWK
 17776	D WITH SHEATHED BEAKS. +ON THE SECOND DAY, A SAIL DREW NEAR, NEARER, AN
 17777	KED ME UP AT LAST. +IT WAS THE DEVIOUS-CRUISING +RACHEL, THAT IN HER RE

Edges: 3:0:3 of 17777 lines in file "f2"
     1	Preliminary Matter.  
     2	
     3	This text of Melville's Moby-Dick is based on the Hendricks House editi
   ---
 17775	THEY GLIDED BY AS IF WITH PADLOCKS ON THEIR MOUTHS; THE SAVAGE SEA-HAWK
 17776	D WITH SHEATHED BEAKS. +ON THE SECOND DAY, A SAIL DREW NEAR, NEARER, AN
 17777	KED ME UP AT LAST. +IT WAS THE DEVIOUS-CRUISING +RACHEL, THAT IN HER RE

-----
 Results, time for 10000 stat calls:

real	0m39.595s
user	0m10.397s
sys	0m27.694s

-----
 Results, time for 10000 cmp calls:

real	0m55.188s
user	0m27.122s
sys	0m25.958s

So perhaps MS systems require a lot more work to get the size.

For the case of perl, that same amount of work for stat can be done in under 0.1 seconds real time:

#!/usr/bin/env perl

# @(#) p1	Demonstrate stat on open (and un-opened) files.

use strict;
use warnings;

my ($debug);
$debug = 1;
$debug = 0;

my ( $f1, $f2, $f3 );
my ( $s1, $s2, $i, $j, $N );

# Make sure files are around, then close them.
open( $f1, "<", "f1" ) || die " Cannot open file f1\n";
open( $f2, "<", "f2" ) || die " Cannot open file f2\n";
open( $f3, ">", "f3" ) || die " Cannot open file f3 for write\n";
$s1 = ( stat("f1") )[7];
$s2 = ( stat("f2") )[7];
print " Length of f1, f2: $s1, $s2\n";
close $f1;
close $f2;

$j = 0;
$N = 10;
$N = 10000;
for ( $i = 1; $i <= $N; $i++ ) {
  $s1 = rand();
  $s2 = rand();
  $s1 = ( stat("f1") )[7];
  $s2 = ( stat("f2") )[7];
  if ( $s1 != $s2 ) {
    print $f3 " Found mismatch at iteration $i\n";
    $j++;
  }
  print " Length of f1, f2: $s1, $s2\n" if $debug;
}
print STDERR " Called stat $i (-1) times on each file, compared sizes.\n";
if ( $j != 0 ) {
  print STDERR " File f3 was written to $j times.\n";
}

exit(0);

producing:

time ./p1
 Length of f1, f2: 1205404, 1205404
 Called stat 10001 (-1) times on each file, compared sizes.

real	0m0.091s
user	0m0.036s
sys	0m0.040s

However, as I mentioned, I don't know about MS systems. It does seem odd that obtaining the length of a file (in *nix, just pulling the length from the inode) would be so close in cost to reading every byte in two files and comparing them (and the difference is on the wrong side, it seems to me).

Best wishes ... cheers, drl

I'm really surprised that cmp is so close a runner-up, as it needs to read and compare every single byte in both files.
One idea to speed things up might be to run stat once for a couple of files, e.g. for an entire directory, thus not creating a new process for every single file...

One of the reasons why cmp may not be taking so much time in this case is that most of these files are quite small (<4k). Cygwin is not like a native linux install, since it runs on top of windows, and I find that many tasks take much longer to complete under cygwin than in a native linux environment.

One other difference for me is that I did not evaluate the output of cmp in the script as I did for the two stat operations; I just did a redirect of the cmp output to a file. As far as I understand it, cmp does not produce output when the files are the same. For stat, you are doing two calls to stat, assigning the results of both calls to shell variables, evaluating the variables in a conditional, and then printing output if the conditional evaluates true. For cmp, you are just passing the two file names to cmp. All of the lifting for cmp is done in the compiled binary, whereas much of what is done for stat is in the script. It may be that for small files, there isn't much difference. I didn't test this extensively.

My script has run for more than 24 hours so far. I didn't think to put a counter in to print to the terminal, so I don't have any way to know how close it is to being finished. There is no output to the file yet, so nothing has been found to be different so far. At a rate of 10000 per minute, it should have finished long ago, but there are some very large files here that could take quite a while.

I don't know how long I will let this run. I am fairly well convinced that if there were problems, I would have had something report as different by now.

LMHmedchem

Hi.

I found a copy of cygwin as a guest in a virtual machine Windows 7 install. The hardware is a Xeon CPU, the host OS is Debian, the virtualization is VMware, but the memory allocation was decreased because of other VMs running -- down to 300 MB.

I changed my timing script slightly to allow the basic tasks to be run. The results:

./s1
OS, ker|rel, machine: CYGWIN_NT-6.1, 1.7.16(0.262/5/3), i686
bash GNU bash 4.1.10
stat (GNU coreutils) 8.15
cmp (GNU diffutils) 3.2

-----
 Input data file f1 f2:
==> f1 <==
Preliminary Matter.

This text of Melville's Moby-Dick is based on the Hendricks House edition.

==> f2 <==
Preliminary Matter.

This text of Melville's Moby-Dick is based on the Hendricks House edition.

-----
 Results, time for 20 stat calls:

real    0m12.750s
user    0m0.015s
sys     0m5.947s

-----
 Results, time for 20 cmp calls:

real    0m6.281s
user    0m0.152s
sys     0m2.950s

-----
 Results of internal perl stat calls:
 Length of f1, f2: 1205404, 1205404
 Called stat 101 (-1) times on each file, compared sizes, expected 1205404.

real    0m0.938s
user    0m0.046s
sys     0m0.420s

This agrees with LMHmedchem's comparison of stat and cmp. Note that perl does 5 times as much as the stat portion in less than 1/10 the time. I remain amazed that command stat is so slow.

So if the comparison needs to be re-run, I'd suggest a perl script.

Best wishes ... cheers, drl

I would like to give the perl code a try, but I'm not very good with perl. The code you posted runs the same test many times on random values. What I need to do is to read a path from a file, here is a sample of the sorted find file.
.
./_copy.sh
./_database_project
./_database_project/12-11-10
./_database_project/12-11-10/_database_notes_12-11-10.txt
./_database_project/12-11-10/test.db.sqlite
./_database_project/12-11-10/test.db.sqlite$
./_database_project/12-11-10/test_input1.txt
./_database_project/12-11-10/test_input1.xlsx
./_database_project/12-11-10/test_input2.txt

I need to strip the leading "." and then append two different root paths to create a path for each file in a matching pair on the two drives.

/cygdrive/e/_Data_Level/_copy.sh
/cygdrive/i/_Data_Level/_copy.sh

/cygdrive/e/_Data_Level/_database_project (this is a directory)
/cygdrive/i/_Data_Level/_database_project (this is a directory)

/cygdrive/e/_Data_Level/_database_project/12-11-10 (this is a directory)
/cygdrive/i/_Data_Level/_database_project/12-11-10 (this is a directory)

/cygdrive/e/_Data_Level/_database_project/12-11-10/_database_notes_12-11-10.txt
/cygdrive/i/_Data_Level/_database_project/12-11-10/_database_notes_12-11-10.txt

Each pair needs to be compared for length.

$s1 = ( stat("/cygdrive/e/_Data_Level/_copy.sh") )[7];
$s2 = ( stat("/cygdrive/i/_Data_Level/_copy.sh") )[7];
if ( $s1 != $s2 ) {
  print $f3 " Found mismatch at iteration $i\n";
  $j++;
}

I'm not sure what perl will do with the entries in the find file that are directories and not files.
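
Something along these lines is what I am picturing, pieced together from your example (untested, and the paths are just the ones from my setup above); I am hoping the -d test takes care of the directory entries:

#!/usr/bin/env perl

# compare file sizes for each path listed in the sorted find output
use strict;
use warnings;

my $tree1  = "/cygdrive/e/_Data_Level";
my $tree2  = "/cygdrive/i/_Data_Level";
my $tmpdir = "/cygdrive/c/cygwin/tmp/dir_compare";

open( my $list, "<", "$tmpdir/check_1_sorted" )   || die " Cannot open find list\n";
open( my $out,  ">", "$tmpdir/size_compare.txt" ) || die " Cannot open output file\n";

my $mismatches = 0;
while ( my $rel = <$list> ) {
  chomp $rel;
  $rel =~ s/^\.//;                # strip the leading dot from find's output
  my $f1 = $tree1 . $rel;
  my $f2 = $tree2 . $rel;
  next if -d $f1;                 # skip directories
  my $s1 = ( stat($f1) )[7];
  my $s2 = ( stat($f2) )[7];
  next unless defined $s1 && defined $s2;
  if ( $s1 != $s2 ) {
    print $out " Size mismatch: $f1 ($s1) vs $f2 ($s2)\n";
    $mismatches++;
  }
}
close $list;
close $out;
print STDERR " Compared sizes, found $mismatches mismatches.\n";

exit(0);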

LMHmedchem

# time find / -iname \* -exec stat -c"%n  %s" {} + > file

real    0m4.829s
user    0m1.432s
sys    0m3.112s
# wc -l file
310854 file

This was 300000 files only, but it may be extrapolatable. Do this on both file systems and diff the resulting files. The output may need to be sorted, as find doesn't guarantee a certain ordering. Make sure to use the + sign to end the find command.
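
Adapted to the two trees in this thread, that might look something like this (untested; the cd keeps the path prefixes identical so diff lines up, and -type f leaves out the directory entries):

( cd /cygdrive/e/_Data_Level && find . -type f -exec stat -c '%n %s' {} + | sort > /tmp/e_sizes ) &
( cd /cygdrive/i/_Data_Level && find . -type f -exec stat -c '%n %s' {} + | sort > /tmp/i_sizes ) &
wait
diff /tmp/e_sizes /tmp/i_sizes > /tmp/size_diff.txt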