Excluding directories from a find

I've looked at a few similar threads, but I can't bridge from those examples to what I'm working on, so I'm hoping someone can help.

I want to extend the following statement

find $PathToCheck -type f \( -not -iwholename "$ScriptDir/*" \) -exec md5sum "{}" \;>$NewSigs

to exclude several directories that are below the directory I am searching.

When I used -name like in this post

w w w . unix.com / unix-dummies-questions-answers/16921-question-non-recursive-find-syntax.html
(had to put in spaces... forum wouldn't let me post otherwise)

I got

find: warning: Unix filenames usually don't contain slashes (though pathnames do). (etc.)

and the find didn't work.

The path that I am passing in the variable ($PathToCheck) looks like:

/home/username/

and the paths I want to exclude (also want to pass in variables) will look like

/home/username/public_html/somedirectory
/home/username/public_html/otherdirectory/etc/ignorethis

(There are many other directories under /home/username/public_html/ that I want to search - just in case that information makes a difference.)

I'm trying to write my script so I can easily configure multiple exclude directories in (a) shell variable(s), but as long as I know how to write the general form of the statement with 2 or 3 exclusions, I should be able to work the rest out.

Any help would be most appreciated.

Hi

Try this:

$ X="-name public_html/somedirectory -o -name public_html/otherdirectory/etc/ignorethis"
$ find . -type d  \( $X \) -prune -o  -print  -exec md5sum '{}' \;

Guru.

Thanks for the quick reply guruprasadpr

I tried

ScriptDir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"  # Evaluates to /home/username/wd
X="-name $ScriptDir/ -o -name /home/username/mail"
find $PathToCheck -type d  \( $X \) -prune -o  -print  -exec md5sum '{}' >TestFind.txt

and got the following:

find: warning: Unix filenames usually don't contain slashes (though pathnames do). That means that -name '/home/username/wd/' will probably evaluate to false all the time on this system. You might find the -wholename test more useful, or perhaps -samefile . Alternatively, if you are using GNU grep, you could use find ... -print0 | grep -FzZ '/home/username/wd/' .
find: warning: Unix filenames usually don't contain slashes (though pathnames do). That means that -name '/home/username/mail' will probably evaluate to false all the time on this system. You might find the -wholename test more useful, or perhaps -samefile . Alternatively, if you are using GNU grep, you could use find ... -print0 | grep -FzZ '/home/username/mail' .
find: missing argument to '-exec'
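
To illustrate the warning above: -name is compared only against the last component of each path, so a pattern containing a slash never matches, whereas -wholename (spelled -path in POSIX find) is compared against the whole path as find sees it. A minimal sketch, assuming GNU find and a throwaway directory created with mktemp; the directory and file names are made up for the illustration:

```shell
#!/bin/sh
# Throwaway tree (hypothetical names): $demo/wd/a.txt and $demo/mail/b.txt
demo=$(mktemp -d)
mkdir -p "$demo/wd" "$demo/mail"
touch "$demo/wd/a.txt" "$demo/mail/b.txt"

# -name sees only the basename ("wd"), so a pattern with slashes matches nothing.
# GNU find also prints the warning quoted above; it is discarded here.
by_name=$(find "$demo" -type d -name "$demo/wd" 2>/dev/null | wc -l)

# -path (GNU alias: -wholename) is matched against the full path.
by_path=$(find "$demo" -type d -path "$demo/wd" | wc -l)

echo "directories matched by -name: $by_name, by -path: $by_path"
rm -rf "$demo"
```

The pattern has to agree with the way the starting directory was written on the command line: if find is started on a relative path such as . then the paths it tests also begin with ./ and an absolute pattern will not match.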

Also tried this

ScriptDir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"  # Evaluates to /home/username/wd
X="-name $ScriptDir/ -o -name /home/username/mail"
find $PathToCheck -type d  \( $X \) -prune -o  -print  -exec md5sum "{}" \;>TestFind.txt

and got a ton of error messages of the form md5sum: /../.../../: Is a directory . All the directories under /home/username/mail were still included as well. -print isn't needed because I only need the output from md5sum, and "{}" must be used for the substitution from find to work. The find must only return files, not directories, so I also tried:

ScriptDir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"  # Evaluates to /home/username/wd
X="-name $ScriptDir/ -o -name /home/username/mail"
find $PathToCheck -type f  \( $X \) -prune -o  -exec md5sum "{}" \;>TestFind.txt

which still gave a ton of "Is a directory" errors, and both of the directories that should have been excluded were processed.

Any suggestions?

Well, think this'll work - if you need more complicated directory matching you could just get the entire list of directories and then use grep -v instead of the find -type d:

#!/bin/sh
STARTDIR="/home/XXX/.cache"
EXCLDIRS="Thunar sessions"
exclude=""

for dir in ${EXCLDIRS}; do 
    exclude="${exclude} -name ${dir} -o"
done

# Lil sloppy, but just kill the -o at the end.
exclude="`echo $exclude | sed 's/-o$//'`"

# Run the find - this will produce all the directories we -want-
find ${STARTDIR} -type d -a \! \( ${exclude} \) -print | while read line; do
    # and this will run whatever we want against the files in them...
    find ${line} -type f -exec md5sum {} \;
done

find "$path" -type d \( -iwholename "$ScriptDir" -o -iwholename "$anotherDir" \) -prune \
          -o -type f -exec md5sum {} +

This approach should also run faster because md5sum is no longer called once per file.
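
The difference between the two -exec terminators can be seen by counting how often the command is started. A minimal sketch, assuming a throwaway directory with five made-up files; a small sh -c command that prints one line per start stands in for md5sum:

```shell
#!/bin/sh
demo=$(mktemp -d)
touch "$demo/f1" "$demo/f2" "$demo/f3" "$demo/f4" "$demo/f5"

# With \; the command is started once for every file found.
per_file=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | wc -l)

# With + find appends as many file names as fit onto one command line,
# so a handful of files needs only a single start.
batched=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | wc -l)

echo "starts with semicolon: $per_file, starts with plus: $batched"
rm -rf "$demo"
```

With a very large tree, find may still start the command more than once when using +, because each command line has a maximum length.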

Regards,
Alister

---------- Post updated at 11:33 AM ---------- Previous update was at 11:22 AM ----------

The original solution excludes all files below the excluded directory. Your approach will include files in subdirectories below any excluded directories.
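
The difference described above can be reproduced on a small tree. A minimal sketch, assuming a throwaway directory with hypothetical subdirectories keep, skip and skip/deep: negating the -name test only hides the matching directory itself, while -prune stops find from descending into it at all.

```shell
#!/bin/sh
demo=$(mktemp -d)
mkdir -p "$demo/keep" "$demo/skip/deep"

echo "negated -name test (skip/deep is still listed):"
find "$demo" -type d ! -name skip | sort

echo "with -prune (nothing below skip is visited):"
find "$demo" -type d -name skip -prune -o -type d -print | sort

# Directory counts for the two variants: the top directory, keep and
# skip/deep for the negated test; only the top directory and keep with -prune.
n_negated=$(find "$demo" -type d ! -name skip | wc -l)
n_pruned=$(find "$demo" -type d -name skip -prune -o -type d -print | wc -l)
rm -rf "$demo"
```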

On an unrelated note, we do not know for certain if there are directories with whitespace in their names (increasingly common these days). If there are, your solution won't work.
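
One way to keep configurable exclusions safe for names with whitespace is to build the tests as separate arguments rather than as one string that the shell has to split again. A minimal sketch, assuming a throwaway tree with made-up directory names and using the positional parameters as the argument list (a bash array would serve the same purpose); -path is the POSIX spelling of -wholename:

```shell
#!/bin/sh
demo=$(mktemp -d)
mkdir -p "$demo/my mail" "$demo/wd" "$demo/public_html"
echo one > "$demo/my mail/m.txt"
echo two > "$demo/wd/w.txt"
echo three > "$demo/public_html/index.html"

# Build the exclusion tests one argument at a time, so each path stays a
# single word even when it contains spaces.
set --
for dir in "$demo/my mail" "$demo/wd"; do
    if [ "$#" -gt 0 ]; then
        set -- "$@" -o
    fi
    set -- "$@" -path "$dir"
done

# "$@" now expands to: -path ".../my mail" -o -path ".../wd"
kept=$(find "$demo" -type d \( "$@" \) -prune -o -type f -exec md5sum {} +)
printf '%s\n' "$kept"
n_kept=$(printf '%s\n' "$kept" | wc -l)
rm -rf "$demo"
```

Storing backslash-escaped names in a plain string variable does not help here: when an unquoted variable is expanded, the shell splits the result on whitespace and does not treat backslashes inside it as escapes.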

Regards,
Alister


Thanks for the replies Vryali & Alister

A special thanks to Alister. This:

find "$path" -type d \( -iwholename "$ScriptDir" -o -iwholename "$anotherDir" \) -prune \
          -o -type f -exec md5sum {} +

is exactly what I am looking for - works great & really fast as you said.

A couple of follow ups:

  1. What does the + sign do?
  2. Thanks for the comment about the whitespace; I hadn't thought about that. (I don't believe that whitespace is legal on an apache server, but the MD5 does work on files in the tree that contain whitespace (I tested it)). If I did need to use a directory that includes whitespace, could I just include escape sequences in the exclude variables?

@nixie
1) Post #1 was the worst formatted post I have seen in months.
2) Please use code tags when posting code or data.
3) Please avoid punctuating code with English punctuation - particularly quotes. It makes the code read like nonsense.
4) Please do not use a Microsoft character set when posting unix code. Copy/paste via Windows Notepad to get rid of weird characters which have no meaning in unix scripts.
5) Please mention what Operating System and version you have and what Shell you are using. There is much variation on the find command and you and @alister refer to much obscure syntax which is definitely not from a unix find .

@guruprasadpr
Please avoid posting untested and erroneous code. Where does $X come from? How does your code work for multiple directories?

See man find for your system (whatever that is?). If the + is not there, just use \; as usual. The + can be faster on certain Operating Systems (e.g. modern Solaris).

Apache is a package which you can install on unix or Linux. Whether it be unix or Linux, the Operating System will support space characters in filenames.

The sentence supplement containing "MD5" makes no sense whatsoever to me.

"Escape Sequences" are usually associated with driving special effects on Printers and VDUs. What do you mean?

Please provide examples of awkward directory names which you wish to exclude (blotting anything confidential with X's).
When the list gets awkward the best approach is to use sed with a sedfile to eliminate unwanted file or directory names. Using a sedfile means that the Shell does not see the awkward names and therefore cannot confuse the situation.
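
The sedfile suggestion above could look roughly like this. A minimal sketch, assuming a throwaway tree and a hypothetical sedfile holding one delete command per unwanted directory; it works on a newline-separated list, so it assumes no file name contains a newline:

```shell
#!/bin/sh
demo=$(mktemp -d)
mkdir -p "$demo/my mail" "$demo/public_html"
touch "$demo/my mail/m.txt" "$demo/public_html/index.html"

# Hypothetical sedfile, kept outside the tree being searched.
# \|...| uses | as the regex delimiter so the slashes need no escaping;
# the d command deletes every line (path) that starts with that directory.
# Note that the path is used as a regular expression, so characters such
# as . match loosely.
sedfile="$demo.sed"
printf '%s\n' "\\|^$demo/my mail/|d" > "$sedfile"

# The shell never parses the awkward names; sed filters the list as text.
kept=$(find "$demo" -type f | sed -f "$sedfile")
printf '%s\n' "$kept"
n_kept=$(printf '%s\n' "$kept" | wc -l)

rm -rf "$demo" "$sedfile"
```

Unlike -prune, this filters after the fact: find still walks the excluded directories, and only their entries are dropped from the list.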


Sorry that was my very first post.

Every single option when referring to options on a long command???

It wasn't code... it was a diagnostic message issued by find, copied and pasted from the terminal window. Some of the things marked above are find options, not complete statements. Are you saying you want every tiny fragment put in CODE tags, as was done above?

I wasn't aware it was doing that, I was pasting from a plain text editor.

I don't know how to determine the version of Linux, but find reports itself as find (GNU findutils) 4.4.2. Neither is obscure (although I agree the options could be considered obscure); they are from a very current version of CentOS, which is being run by a major web hosting company (AFAIK many of the major shared hosts use this OS).

---------- Post updated at 11:36 AM ---------- Previous update was at 10:40 AM ----------

Thanks for this - I saw this \; sequence in the original code that I got the statement from, but that code was using \ to break the statement up into multiple lines. I thought \; was an artifact left over from editing (having the meaning of "null/empty line"). I couldn't find \; in the man pages - I went back and checked again and found + which appears to be the preferred syntax. Find is very powerful, but I find it very hard to grok - been programming for 35 years, but very new to shell scripting. I don't do enough bash scripting to keep sharp with the syntax, so I have to look almost everything up as I go, and I really appreciate it when the gurus here can look at a statement and give a quick pointer or hint.

Sorry, maybe I should have said escaping?? i.e. preceding with \ to ignore the special meaning of the next character or include a character like a blank or quote.

I am running this script in the home directory of a shared host. AFAIK whitespace is not allowed in URLs (filenames in public_html), and I certainly don't use whitespace in these names. I'm also not expecting to see them in the files of open source packages like wordpress, drupal, etc. Is module with whitespace.class.php even legal? AFAIK it isn't, and my current belief is that no good programmer would do it.

Am I naive in these assumptions? I didn't expect to find these, so I didn't design the script with that in mind, but after your post I did a test and the line of code in question did work correctly even on a filename with a space in it.

The code tags thing is so everyone can see significant space characters in commands or messages. It is much easier to read a post if code is separated from comment. It is not necessary to put every code fragment in code tags but it can reduce ambiguity when the command is an English word (like "find").

I too do not like space characters in file names or URLs, though I believe that they are valid in both circumstances. When in a URL, a space character is represented by %20 .
