Newbie question: if[command not null]

zangarules · April 7, 2011, 4:52am

hi,

I have to put a command in my script that tells me whether the contents of two different paths are the same.

I thought of writing an "if" that diffs two files containing the `ls` output of each folder and continues with the script if the diff is not empty, but I'm afraid the diff will be non-empty when only the order of the files differs.

Please, can someone write me the best if command to compare the folders?

DGPickett · April 8, 2011, 12:35pm

Contents of paths: that means all (dir, file, link, device, FIFO) entry names in the subtree, permissions, flat file contents, link counts, modified dates, access dates -- well, they are probably never all the same unless you can access both trees identically in each second. Running diff on two directories covers some of this. ls lists in alphabetical order, if I remember the man page right, and you can always sort. find might be better than ls, since it provides the relative path of every entry name. cmp can compare file contents, even binary ones, one pair of files at a time. So, the first trick is good requirement writing!
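
As a minimal sketch of the cmp part only: cmp signals its result through the exit status, so it can sit directly in an if. The function name same_file is made up for illustration.

```shell
# same_file A B: succeed if the two files are byte-for-byte identical.
# -s silences cmp's own output, so only the exit status is used.
same_file() {
    cmp -s "$1" "$2"
}

# Usage (hypothetical file names):
#   if same_file dir1/data.bin dir2/data.bin; then echo same; else echo different; fi
```

This compares one pair of files only; walking a whole tree still needs find or diff -r around it.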

mirni · April 8, 2011, 10:04pm

I'd try:

cd /path/to/dir1 
find . | sort > /path/outside/of/here/log1
cd /path/to/dir2 
find . | sort > /path/outside/of/here/log2
d=$(diff /path/outside/of/here/log{1,2})
if [ -z "$d" ] ; then   # is $d empty?
  echo "Dirs are the same"
else
  echo "Dirs differ in these: "
  echo "$d"   # quoted, so the diff output keeps its line breaks
fi

I sort because, even if both subtrees contain the same files, find may output them in a different order.
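
The same name-only comparison can be sketched without the intermediate log files, assuming the listings are small enough to hold in shell variables. The function name same_names is hypothetical, and like the version above it compares entry names only, not file contents.

```shell
# same_names DIR1 DIR2: succeed if both trees contain the same relative entry names.
# Returns 2 if either directory cannot be entered.
same_names() {
    a=$(cd "$1" && find . | LC_ALL=C sort) || return 2
    b=$(cd "$2" && find . | LC_ALL=C sort) || return 2
    [ "$a" = "$b" ]
}

# Usage:
#   if same_names /path/to/dir1 /path/to/dir2; then echo "Dirs are the same"; fi
```

LC_ALL=C is there only so that both listings are sorted by the same byte-order rule regardless of locale.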

alister · April 8, 2011, 10:19pm

You can use diff recursively on two directories and use the exit status to discern whether the paths are identical (special device files and the like excepted).

if diff -r dir1 dir2 >/dev/null; then
    echo identical
else
    echo different
fi

Regards,
Alister
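
One caveat with a plain if/else on diff: the exit status is 0 for no differences, 1 for differences, and greater than 1 for trouble (for example a missing directory), so an error would be reported as "different". A small sketch that separates the three cases, with compare_trees as a made-up name:

```shell
# compare_trees DIR1 DIR2: print identical, different, or error,
# and return diff's own exit status.
compare_trees() {
    diff -r "$1" "$2" >/dev/null 2>&1
    rc=$?
    case $rc in
        0) echo identical ;;
        1) echo different ;;
        *) echo error ;;      # e.g. a directory does not exist or is unreadable
    esac
    return "$rc"
}
```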

DGPickett · April 11, 2011, 2:35pm

Uses multiple CPUs and finds all flat file differences, even binary files:

#!/usr/bin/bash
 
comm -3 <(
  cd dir1
  find * -type f | xargs -n999 cksum | sed 's/\(.*\) \(.*\)/\2 \1/' | sort
 ) <(
  cd dir2
  find * -type f | xargs -n999 cksum | sed 's/\(.*\) \(.*\)/\2 \1/' | sort
 ) |sed '
  s/^\t/dir2 /
  t
  s/^/dir1 /
 '
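
For readers unfamiliar with the last step: comm -3 drops the lines common to both inputs, prints lines found only in the first input unindented, and prints lines found only in the second input behind a leading tab. The trailing sed turns that tab into a label. A stand-alone sketch of just that labelling, using a literal tab instead of \t because not every sed understands \t (label_sides is a made-up name):

```shell
# label_sides: read comm -3 output on stdin and prefix each line
# with the side it came from.
label_sides() {
    tab=$(printf '\t')
    sed "s/^${tab}/dir2 /
t
s/^/dir1 /"
}

# Usage with two already sorted listings (hypothetical file names):
#   comm -3 list1 list2 | label_sides
```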

mirni · April 11, 2011, 4:13pm

That's a cool solution. I like the cksum part.
Just a little modification to deal with spaces in files:

find * -type f -print0 | xargs -0 -n999 cksum

DGPickett · April 12, 2011, 4:12pm

I never use find -print0, since I can control that with start dir arg(s).

Here, it messes up the comm, making identical files on identical relative paths show different (all show different).

The xargs -0 option is nice when the input is lines of badly behaved file names. Some have hidden their borrowed space using 'mkdir ". " ' to make a hidden directory that is a little hard and scary to remove.

mirni · April 12, 2011, 4:43pm

@DGPickett:

What do you mean by this? How?

The construct with xargs -0 works fine on my machine (centos 5.3; GNU bash, version 3.2.25).

That's exactly why I suggested it. Your original reply failed when I ran it on a dir with files containing spaces.

DGPickett · April 12, 2011, 5:08pm

"find *" is relative paths, "find $PWD" is absolute paths.

Absolute paths means lines never match, dir1 to dir2.

My bro said "file names with spaces, only Microsoft would do something stupid like that!" Actually, they are just about as legal in UNIX, just shunned by the culture. Use a '_', '.', '-' or nothing. I like the -, handy and easy to read past.

mirni · April 12, 2011, 5:18pm

I agree that including spaces in filenames is a very dirty practice, yet it is not uncommon. My version with xargs -0 worked fine -- I ran it with relative path like this:

comm -3 <( cd dir1; find . -type f -print0 | xargs -0 -n999 cksum |
                    sed 's/\(.*\) \(.*\)/\2 \1/' | sort ) \
        <( cd dir2; find . -type f -print0 | xargs -0 -n999 cksum |
                    sed 's/\(.*\) \(.*\)/\2 \1/' | sort ) |
sed '
  s/^\t/dir2 /
  t 
  s/^/dir1 /
'

You used find *, but the result should be the same.
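
A side note on the sed swap, offered as a sketch rather than a correction: because the first \(.*\) is greedy, a name containing a space gets split at its last space, so the reordered line is scrambled. The comparison still works, since both sides are scrambled the same way, but the report is harder to read. Assuming cksum prints "checksum size name" separated by single spaces, anchoring on the two numeric fields keeps the name whole. The name name_first is hypothetical.

```shell
# name_first: read cksum output on stdin and move the file name
# in front of the checksum and size, keeping spaces in the name intact.
name_first() {
    sed 's/^\([0-9]*\) \([0-9]*\) \(.*\)$/\3 \1 \2/'
}

# Usage:
#   find . -type f -print0 | xargs -0 cksum | name_first | sort
```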

alister · April 12, 2011, 6:23pm

I disagree. I personally see nothing "stupid" or "dirty" about spaces in filenames. If a system has problems with spaces in filenames, I'm more inclined to look unfavorably on the system than on whoever named the file.

The problem, in my opinion, is that since UNIX filenames are allowed to contain '\t' and '\n', the output from find(1) is not guaranteed to be a decipherable text stream where newlines delimit a record/line and tabs delimit fields. (It doesn't help that the portable subset of xargs features is a bit lacking as well.)

Perhaps the text stream has outlived its usefulness. Time for everyone to migrate to Powershell?

Regards,
Alister

Corona688 · April 12, 2011, 6:37pm

Don't be hasty. There's one and exactly one character that can never appear in a UNIX path: NUL. If find can be made to print NUL, and xargs can be made to use NUL as a delimiter, that is a 100% safe way to delimit filenames.

As it happens GNU find has the -print0 option to print nulls instead of newlines, and GNU xargs has the --null option to use nulls as separators.

If your argument is just that these arguments aren't portable, perhaps it's time to mandate them in POSIX.

There's always -exec, too, which will always be given a correct filename.
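
A minimal sketch of the -exec route applied to the checksum comparison from earlier in the thread. The function names tree_sums and same_contents are made up for illustration, and the comparison assumes the whole checksum listing fits comfortably in a shell variable.

```shell
# tree_sums DIR: print "checksum size ./relative/path" for every regular
# file under DIR, sorted. find hands the names straight to cksum, so no
# delimiter is involved and spaces in names are harmless.
tree_sums() {
    ( cd "$1" && find . -type f -exec cksum {} + | LC_ALL=C sort )
}

# same_contents DIR1 DIR2: succeed if both trees hold the same regular
# files with the same checksums. Returns 2 if either is not a directory.
same_contents() {
    [ -d "$1" ] && [ -d "$2" ] || return 2
    [ "$(tree_sums "$1")" = "$(tree_sums "$2")" ]
}

# Usage:
#   if same_contents dir1 dir2; then echo identical; else echo different; fi
```

Only regular files are checked here; empty directories, permissions and timestamps are ignored.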

alister · April 12, 2011, 9:03pm

The find/xargs -print0/-0 tandem is useful but it's of no help when the output is coming from ls or stat or lsof or any one of countless tools which output filenames.

I do agree with you though that it would be useful for posix xargs to support some means of specifying a delimiter.

My argument is simply that traditional unix tools were designed to pipe text streams but filenames, a central system abstraction, cannot be part of a text stream without the potential for ambiguity/corruption (Is this newline actually the beginning of a new line or part of a foolish filenaming scheme?). Due to this we often have to choose between a simple, 80% solution (which is often sufficient) or a comparatively cumbersome approach.

The design decision was made long ago; it's not going anywhere; and in practice it isn't usually a problem. Still, I think it would have been a good decision to have disallowed \t and \n in filenames. <insert visions of {file,path}names shuttling through pipes without a care in the world>

Regards,
Alister

DGPickett · April 13, 2011, 4:28pm

Run, high priests on rampage !

Yes, xargs -0 is nice for spaces in file names, if you have them, and similarly, find . is safer than find *, no meta-in-entry-name vulnerability, if you do not mind the './' prefix, or tell sed to toss it.
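
A short sketch of the "tell sed to toss it" variant, with rel_files as a made-up name. Besides avoiding names that begin with '-' being read as find options, find . also picks up top-level dot files, which the * glob does not match by default.

```shell
# rel_files DIR: list every regular file under DIR as a relative path
# without the leading "./", sorted.
rel_files() {
    ( cd "$1" && find . -type f | sed 's|^\./||' | LC_ALL=C sort )
}

# Usage:
#   rel_files dir1 > list1
#   rel_files dir2 > list2
#   comm -3 list1 list2
```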