How to form a correct syntax to sift out according to complementary patterns with 'find'?

scrutinizerix · January 13, 2018, 3:30am

I need to find all files and folders containing keyword from the topmost directory deep down the tree but omitting all references to keyword in web-search logs and entries, i.e. excluding search and browsing history made using web-browser1, web-browser2, web-browser3, (bypassing all entries of the type "/Users/myuser/Library/web-browser1, 2, 3/History/keyword/blahblahblah" etc. )

I use

echo MYPASSWORD | sudo -S find -E / -regex '.*/(keyword|KEYWORD)/.*' -and -not -path '.*/(web-browser1|web-browser2|web-browser3)/.*'

So far it only gets right the first half of this expression (ending with keyword) but fails to execute the second half. I tried using grep -v piping the first half of the original expression to it instead with the same argument, I modified it to

\! -path '.*/(web-browser1|web-browser2|web-browser3)/.*'

only to end up with the same result (all entries with the paths, each containing keyword). What am I doing wrong? Is it possible to do a complementary match with find, or did I fail to arrange the regular expression correctly?

drysdalk · January 13, 2018, 7:50pm

Hi,

I think this might be time to break out the -prune flag to find, perhaps. It can be used to exclude from your results everything that matches particular criteria.

So, an example. I created a sort-of-similar directory tree to yours, with one single test file containing the text "FOO" copied to all directories.

So a simple find with an appropriate -exec returns this:

$ find . -type f -exec grep -l FOO \{\} \;
./Users/myuser/Library/browser3/test
./Users/myuser/Library/Stuff/test
./Users/myuser/Library/Things/test
./Users/myuser/Library/browser2/test
./Users/myuser/Library/browser1/test
$

So far, so normal. But now here's what happens if I use prune to exclude directories that match the pattern browser? :

$ find . -type d -name 'browser?' -prune -o -type f -exec grep -l FOO \{\} \;
./Users/myuser/Library/Stuff/test
./Users/myuser/Library/Things/test
$

The files underneath browser1 , browser2 and browser3 were ignored, because even though they themselves matched we'd removed all directories whose name matched the pattern 'browser?' from consideration.

The basic syntax for prune is you set up your conditions first for what you want to exclude (so -type d -name 'browser?' in this case), and then after the -prune -o you put the conditions for what you actually want to match once what's being pruned is excluded.

Hope this helps.

scrutinizerix · January 17, 2018, 3:51pm

Hi,
Thanks for your reply. I tried your syntax but it failed to do what I wanted. Actually what I passed on my previous post for browser1, browser2, browser3 were all different names: Safari, Opera, Firefox - so the regex in this case should be

(Safari|Opera|Firefox)

(is it correct, btw?) and since these are just parts of names, the question is how I define that when arranging my regexes. I'd like directories to be skipped (in which case I composed the regex as

'.*/(Safari|[Oo]pera|Firefox|[Mm]ozilla)/.*'

, though I feel uncertain of its correctness; I used this notation too, with either no output or the opposite of what I wanted) or all types of files containing one of those names (

'.*(Safari|Opera|Firefox).*'

?). I'm not sure how to handle that.

In my case I tried

'.*(Safari|Opera|Firefox).*'

and in this scenario I appended find with -E option and prepended

'.*(Safari|Opera|Firefox).*'

with -regex, so the basic variant to which my other combinations could be reduced is

echo MYPASSWORD | sudo -S find -E / -regex '.*(Safari|[Oo]pera|Firefox|[Mm]ozilla).*' -prune -o -name '.*(keyword|KEYWORD).*'

I've tried so many variants I can barely remember which option produced which output. I tried appending -exec (

echo MYPASSWORD | sudo -S find -E / -regex '.*(Safari|[Oo]pera|Firefox|[Mm]ozilla).*' -prune -o -exec grep '.*(keyword|KEYWORD).*' {} ';'

it printed many lines of the kind "grep: .*(Safari|[Oo]pera|Firefox|[Mm]ozilla).*: no such file or directory") , I tried piping to grep the latter used as an argument to xargs (

echo MYPASSWORD | sudo -S find -E / -regex '.*(keyword|KEYWORD).*' | xargs -I {} grep -RLE {} '.*(Safari|[Oo]pera|Firefox|[Mm]ozilla).*'

and got the message "no termination character" or something like that).
Right now it has hung with no output or error message at all while executing

echo MYPASSWORD | sudo -S find -E / -name  '.*(Safari|[Oo]pera|Firefox|[Mm]ozilla).*' -prune -o -type f -exec grep -il *keyword*  {} ';'

. I tried using {} + at the end of this line instead of {} ';', I added and omitted -type option - no difference either. Interesting that when I used operators -and -not -path (or -name in place of -path) '.*(Safari|[Oo]pera|Firefox|[Mm]ozilla).*' it would return results containing one of those names/paths. That's just weird.

The shell is bash 3.2.48

Don_Cragun · January 17, 2018, 6:36pm

I don't see why you think that is weird. -and instead of -a and -not instead of ! are not an issue. They are, respectively, synonyms. But -name and -path are completely different. The -name primary is not affected by the -E option and the pattern used for filename (not pathname) matches is a shell pathname matching pattern. So:

find -E -name '.*(Safari|[Oo]pera|Firefox|[Mm]ozilla).*'

is looking for a file with a name that starts with a <period> followed by any string of zero or more characters followed by an <open-parenthesis> followed by the string Safari followed by a <vertical-bar> followed by an upper-case or lower-case o followed by the string pera|Firefox| followed by an upper-case or lower-case m followed by the string ozilla followed by a <close-parenthesis> followed by a <period> followed by a string of zero or more characters. My guess would be that you don't have any directories that are matched by that filename matching pattern so no directories are pruned from your search.
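
To make the difference concrete, here is a small sketch on a throwaway directory tree (the directory names are made up for illustration). It assumes only what is described above, namely that find treats the argument of -name as a shell pattern: the regex-style string matches nothing, while the same intent written as shell patterns joined with -o matches both directories.

```shell
# Throwaway tree; the directory names are hypothetical stand-ins.
demo=$(mktemp -d)
mkdir -p "$demo/Safari" "$demo/opera" "$demo/Documents"

# Regex syntax handed to -name is read as a shell pattern, so the
# parentheses, bar and periods are literal characters and nothing matches.
as_regex=$(cd "$demo" && find . -type d -name '.*(Safari|[Oo]pera).*')

# The same intent as shell patterns: one -name per alternative, joined by -o.
as_glob=$(cd "$demo" && find . -type d \( -name 'Safari' -o -name '[Oo]pera' \) | LC_ALL=C sort)

printf 'regex-style pattern matched: [%s]\n' "$as_regex"
printf 'shell patterns matched:\n%s\n' "$as_glob"
rm -rf "$demo"
```

The escaped parentheses group the alternatives so that -type d applies to all of them.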

As you have been told before, if you don't tell us what operating system and shell you're using, questions like this waste a lot of our time and yours guessing at what might or might not be failing on your end because you're using options that are only available on some systems and are using some options that may behave differently on different operating systems.

Furthermore, your specification is not at all clear. Sometimes you're trying to exclude pathnames that contain a directory name that is a case-insensitive spelling of "keyword". Other times you're trying to exclude regular files (no matter what the file's name is) if the file contains a case-insensitive spelling of keyword .

You haven't shown us any sample filenames or pathname nor their contents for files that should and should not have their pathnames printed, what should be printed in addition to their pathnames (if anything), ...

Please give us a clear specification of:

what you are trying to do,
what operating system you're using,
what shell you're using,
what the file hierarchy you're searching looks like,
what files look like that should be searched,
what files look like that should not be searched, and
the output you are trying to produce from that sample file hierarchy.

scrutinizerix · January 17, 2018, 10:12pm

I thought it was obvious that my system is OS X, since I stuck the Apple icon at the top of my message; besides, I indicated at the bottom that my shell is bash 3.2.48.

I have an app whose bundle id contains the word keyword. This word is also a significant part of the names of all the files and folders that were installed or created together with the main bundle when I installed and ran the app. On OS X these files can be installed across the entire system (HFS+), beginning with /private/var/db or /private/var/folders, the /Library folders and, of course, those of ~/Library (plists, cache files, application support folders etc. residing in different places). Unfortunately, when searching with the most basic form of find ( find / -name keyword ) I got a bunch of pathnames that are references to web entries containing the keyword, whose practical value is negligible. I need to filter those out and keep only the files and folders that belong to the items created by the app proper. They contain either keyword or KEYWORD. That's why I used the alternation operator | with the items grouped inside parentheses.

Since the web-search history contains the same keyword, as part of pathnames that each contain the name of one of these web browsers, I wanted to skip every pathname containing the name of any of the browsers. As a browser's name appears both in the pathnames of some of the browser's folders (like /Users/myuser/Library/Safari, /Users/myuser/Library/Cache/Safari, /Users/myuser/Library/Preferences/com.apple.Safari.plist etc.; the same is true for Firefox, which sometimes has "mozilla" or "Mozilla" in its folder names too) and in the names of its smaller regular files, I tried to form the regular expression to meet these criteria with the alternation operator as well, with regard to case sensitivity (find all patterns matching the keyword irrespective of case; ignore all the pathnames containing the name of one of the browsers listed in the parentheses as alternatives).

The thing is I'm confused about purpose and meaning of the syntax that looks similar: do I need to use -exec find , -exec grep (but then in the latter case I need the option -E because | is Extended Set)? Or maybe pipe to grep instead?
I noticed two options to grep: -L, --files-without-match

What does "from which no output would normally have been printed." mean?

"The scanning will stop on the first match" - Match to what? And if it will stop how do I get the output.

Furthermore, -l, --files-with-matches

"from which output would normally have been printed"?

I used --directories=skip because I thought the directories whose pathnames contain the names of these browsers would be skipped.

Let's say I write

find -E /" -regex '.*/(Safari|[Oo]pera|Firefox|[Mm]ozilla)/.*' -prune -o -exec grep -iE './keyword/.*' {} ;

IF the syntax itself is correct then I have no clue what to expect. I cannot be sure which option to pick since I don't understand this:

from man find on -exec utility [argument ...]

What the current file is? What's the deal with ; as a control operator? What does it control?
On the other hand

from man find on -exec utility [argument ...] {} +

``{}'' is replaced with as many pathnames as possible for each invocation of utility .

How would you compose the line to achieve the task? All the explanations are crystal clear as long as they work with simple examples. This one is more advanced, I dare say.

bakunin · January 17, 2018, 11:28pm

find is not only a utility to find files - that is, to produce a list of filenames to be printed - but a "programmable commandline filemanager", so to say.

How is that done? The basic operation is this: find finds all files and directories and prepares an initial "result set". Then you have one or more clauses, each of which returns a logical value, TRUE or FALSE. Each file/directory in the result set is presented to the first clause. If it returns TRUE the file/directory is kept in the result set, otherwise it is dropped. If it is kept, it is presented to the second clause, etc.

An example:

find /some/path -name "foo" -print

The initial result set is all files/directories in /some/path . This list is presented to the clause -name foo and if the name of the file/directory is "foo" it is kept, otherwise dropped. What still is in the result set is then presented to the -print clause, which just prints it, without modifying the result set further.

So far, so basic, but it is necessary to understand this mechanism of presenting one filename/directory name after the other to each of these clauses successively.

Remember i said "file manager" up there? Up to now we only produce - more or less tailored - lists of file-/directorynames. Now we want to actually do something with the files/directories found that way. For this there is a special clause: -exec .

-exec takes a "template commandline" and executes this template commandline with every file/directory in the result set. An example:

find /some/path -name "foo" -type f -print
/some/path/dir1/foo
/some/path/dir2/foo
/some/path/dir2/subdir/foo

Now we replace the -print with -exec in this command:

find /some/path -name "foo" -type f -exec echo file found: {} \;
file found: /some/path/dir1/foo
file found: /some/path/dir2/foo
file found: /some/path/dir2/subdir/foo

What has happened? First, the {} is the placeholder for the filename, which is presented to the clause. That is, -exec executed these commands:

echo file found: /some/path/dir1/foo
echo file found: /some/path/dir2/foo
echo file found: /some/path/dir2/subdir/foo

Second, find needs to be told where the "template commandline" for -exec ends and the normal commandline resumes. This is done by a semicolon, which has to be escaped so that the shell into which you type the whole find command passes it on to find instead of treating it as its own command separator, hence the \; at the end. Here is a more complex example i have annotated:

                                          normal commandline resumes
                                                                   |
                                  template commandline ends here   |
                                                               |   |
<--------normal commandline-------------> <--template cmdl-->  V <---->
find /some/path -name "foo" -type f -exec echo file found: {} \; -print

Notice, that you can use the -exec -clause even to select from the result set: if the template command returns TRUE when executed with the file/dir the file/dir will be further included in the result set, otherwise it will be dropped. You can even have more than one -exec -clauses, where some will only help shape the result set and the final one will actually do the work.
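
A minimal sketch of -exec used as a selecting clause, on a made-up pair of files. It relies on grep -q printing nothing and exiting with status 0 only when it finds a match; find treats that exit status as TRUE, so the -print that follows runs only for files containing the string.

```shell
# Two hypothetical files, only one of which contains the string FOO.
demo=$(mktemp -d)
printf 'this line contains FOO\n' > "$demo/a.txt"
printf 'nothing to see here\n'    > "$demo/b.txt"

# -exec grep -q ... \; acts as a test: -print is reached only when grep succeeds.
hits=$(cd "$demo" && find . -type f -exec grep -q FOO {} \; -print)

printf '%s\n' "$hits"
rm -rf "$demo"
```

Here only ./a.txt is printed, because grep fails for b.txt and find never gets as far as -print for it.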

Finally, some performance considerations: consider the following example:

find . -name "*txt" -exec cat {} \;

This will produce a (potentially long) list of commands cat foo.txt , cat bar.txt , etc.. As this list could grow very long there will be many processes started which might tax the system (starting a process is actually "expensive" resource-wise). But cat could be called this way:

cat file1 file2 file3 [...]

and this way one cat process would be started for a whole group of files and not for each one. This is what the + is for. Use this:

find . -name "*txt" -exec cat {} +

To do exactly that: group the files in the result set and call cat with each group instead of with each file individually.
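
The difference can be made visible by counting how often the command is started. This is only a sketch with three made-up files; the small sh -c wrapper is there purely to print one line per invocation and is not part of the explanation above.

```shell
# Three hypothetical text files in a throwaway directory.
demo=$(mktemp -d)
touch "$demo/one.txt" "$demo/two.txt" "$demo/three.txt"

# Each invocation of the wrapper prints exactly one line, so counting
# lines counts processes. tr strips the padding some wc versions add.
per_file=$(cd "$demo" && find . -name '*txt' -exec sh -c 'echo started' sh {} \; | wc -l | tr -d ' ')
batched=$(cd "$demo" && find . -name '*txt' -exec sh -c 'echo started' sh {} + | wc -l | tr -d ' ')

echo "with \\; : $per_file invocations"
echo "with +  : $batched invocation(s)"
rm -rf "$demo"
```

With only three files the + form needs a single invocation; with a very large result set find would split it into as few groups as fit on a command line.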

I hope this helps.

bakunin

scrutinizerix · January 18, 2018, 6:59am

Thanks, it was a nice explanation. What about

grep -l

or

grep -L

man entries I highlighted? Confusing as hell.

Also, what about

-prune

? How does it work exactly?

Would you mind finding some time to explain it in your comprehensible way, please? Would aid a lot.

drysdalk · January 18, 2018, 12:29pm

Hi,

Taking these in turn:

grep -l basically just prints the names of files that match the pattern you're searching for, rather than printing the matching lines in the files themselves. For example, compare:

$ cat test.txt
This is a test file.
This is the only line that contains the string 'FOO'.
This line doesn't contain it, and is the last line in the file.
$ grep FOO test.txt
This is the only line that contains the string 'FOO'.
$ grep -l FOO test.txt
test.txt
$

So we can see that when we used the -l flag, we just got the filename returned, rather than the matching line within the file.

grep -L prints out only the names of those files which do not match the given string. Again, best demonstrated with an example.

$ cat test.txt
This is a test file.
This is the only line that contains the string 'FOO'.
But because of that, a 'grep -L FOO' won't return the name of this file.
$ cat test2.txt
This file does not contain the string we're searching for.
So its filename will be printed when we do a grep -L
$ grep -L FOO *.txt
test2.txt
$

So here, we only got test2.txt in our output and not test.txt , since test2 did not contain the string we were searching for, whereas test did. Because we wanted to only see the names of those files which did not contain our string, this makes sense.

I don't have time right now for a full write-up of how -prune behaves, but basically it tells find not to consider everything that the arguments before the -prune flag found, more or less. If this isn't clear then I'll try to reply again tonight with a bit more detail on this last point.

Hope this helps.

-

Right, a bit more detail on -prune . As others have mentioned throughout previous replies, in its usual usage find performs a series of tests to ultimately return whatever it is you've asked it to find. By default, all of these tests must pass, and only things which pass all of the tests you've specified will be acted upon by find in the end.

So in the case of the command find . -type f -exec grep -l FOO \{\} \; , there are a few things being tested here.

Firstly, that the thing being considered resides underneath the current working directory, represented by . (the first argument is always the root of the path that find will start from).

Second, we are only interested in things which are files. The -type flag can be used to search for directories, files, symbolic links, device files, all kinds of things. Files are represented by -type f

Now before we go on, remember that in order to proceed to the next test, the previous test must have passed. So at this point we've found all files that reside in or somewhere beneath our current directory, and we move on to the third step.

The third flag is a bit different, or might seem so at first. The purpose of -exec is to execute an external command on whatever it is we've found up 'til now. So here, we execute the command grep -l FOO on all files that reside in or beneath the current working directory. The item currently being processed is represented by two curly brackets, and the end of the command is signified by a semi-colon. So the \{\} is substituted at execution time by whichever of our found-so-far-things is currently up for consideration. And the ; signifies the termination of the command to be executed.

Now, that explains things so far. But what if we first want to exclude a category of things from consideration that would otherwise normally be caught by find ? That's where -prune comes in. It will remove from subsequent consideration anything that has been matched by any flags or actions taken up until this point in the find command. So things matched prior to the -prune flag will not be matched by anything that follows the -prune .

So if we look again at my original proposed solution to your problem, the command find . -type d -name 'browser?' -prune -o -type f -exec grep -l FOO \{\} \; can be broken down as follows:

The path we are running our find beneath. This is the current working directory, .
-type d -name 'browser?' , meaning "all directories whose names match the shell pattern 'browser?'" (the argument to -name is a shell filename pattern, not a regular expression, so the ? stands for exactly one character).
-prune -o , meaning "do not descend into whatever matched the tests before this point, or else (for everything that did not match them) carry on with the tests that follow". So no content within directories in or underneath the current working directory whose names match 'browser?' will be affected by whatever follows this point.
-type f -exec grep -l FOO \{\} \; , meaning "execute grep -l FOO on all files". But because of our previous -prune -o , the end result of this is only to execute the grep -l FOO on all files that do not reside in or underneath a directory whose name matches the pattern 'browser?'.
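
The same structure can be adapted to several differently named directories and to matching on names rather than on file contents. The following is only a sketch on a made-up tree: the browser directory names come from the thread, but the file names, the com.example.keyword.plist entry and the choice of -iname for the case-insensitive name match are assumptions for illustration.

```shell
# Throwaway tree: two browser directories and one unrelated directory.
root=$(mktemp -d)
mkdir -p "$root/Library/Safari/keyword" "$root/Library/Firefox/Profiles" \
         "$root/Library/Preferences"
touch "$root/Library/Safari/keyword/history.db" \
      "$root/Library/Firefox/Profiles/KEYWORD.sqlite" \
      "$root/Library/Preferences/com.example.keyword.plist"

# Left of -o: directories named after a browser are pruned (not descended into).
# Right of -o: everything else whose name contains the keyword, in any case, is printed.
kept=$(cd "$root" && find . -type d \( -name Safari -o -name Opera -o -name Firefox \) -prune \
        -o -iname '*keyword*' -print | LC_ALL=C sort)

printf '%s\n' "$kept"
rm -rf "$root"
```

The explicit -print on the right-hand side matters: without it find would apply its default -print to the whole expression and list the pruned directories as well.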

I hope this helps clear things up. If you have any more questions let me know and I'll be happy to help if I can.

scrutinizerix · January 21, 2018, 10:37pm

Huge thanks, your explanatory skills are impressive indeed. I went online only now after several days of intensive brainwork I tried to go through on my own and was only astonished that we share the similar understanding.
I think I may have nailed the essence of why I was unable to filter out the unneeded entries. After numerous failed tests, in which I searched just for the entries I wanted skipped instead of actually trying to skip them, I stumbled upon the fact that I had missed the logic when using

'.*(Safari|[Oo]pera).*'

in that it output no results when I tried to feed it as an argument to

-and -not -path

parameter in the combination with

'.*(keyword|KEYWORD).*'

. I had overlooked an obvious thing, which was that 1) expressions of the type

'.*(Safari|[Oo]pera).*'

can NOT be arguments to

-path

,

-name

and 2) nor can they be combined with any parameter other than

-regex

because, as the name of the parameter implies, it takes "a regular expression of the Extended Set"; such an expression could NOT do what I sought except in conjunction with it.
So the correct logic of this part of the command line was

-regex '.*(keyword|KEYWORD).*' -and -not -regex '.*(Safari|[Oo]pera).*'

. Using -path and -regex on the same line with Extended Set regexes was like comparing the incomparable. On further investigation I discovered that I could not make much use of

-exec grep

or pipe into either

grep

or

xargs

, because grepping is useful mostly for manipulating strings in target files or in the output of commands such as

ls

so I dropped that option.

I looked closely at description of

-prune

once more, and it was this phrase that caught my attention and enlightened me:

.

So this is what I needed: to omit the entire pathname and get only the highest level for every result matching the pattern. It meant I had to append this option to the command line without any following constructs. Having tested with simpler instances, I glanced through the man page and threw in

-x

to omit constant "/dev/fd 3: not a directory" lines.

So, to sum up, the entire line that I struggled to come up with to do the task had to be:

echo PASSWORD | sudo -S find -E -x / -regex '.*(keyword|KEYWORD).*' -and -not -regex '.*(Safari|[Oo]pera).*' -and -not -path '*OtherAlwaysShowingUpUselessLine*' -prune

Notice that the argument to -path is NOT a regular expression of the extended set; the -E option to

find

is necessary only because

-regex

is used too. In this case it conforms to the logic nicely and assists in the required manner.

That way I was able to reduce the output to only 5 lines, "sandboxing" that app in the search results that I then had opportunity to apply further actions to.
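
For anyone wanting to try the logic of that final line without sudo or a scan of the whole disk, here is a rough sketch on a throwaway tree. The directory layout is made up, and efind is a hypothetical helper, not a real command: it assumes that BSD/macOS find accepts -E while GNU find spells the same thing -regextype posix-extended.

```shell
# Hypothetical helper: run find with extended regexes on either flavour of find.
efind() {
    dir=$1; shift
    if find -E "$dir" -maxdepth 0 >/dev/null 2>&1; then
        find -E "$dir" "$@"                           # BSD / macOS find
    else
        find "$dir" -regextype posix-extended "$@"    # GNU find
    fi
}

# Made-up tree: app data in two places, plus browser history mentioning the keyword.
root=$(mktemp -d)
mkdir -p "$root/Library/Safari/History/keyword" \
         "$root/Library/Application Support/keyword/cache" \
         "$root/private/KEYWORD.d"
touch "$root/Library/Safari/History/keyword/entry" \
      "$root/Library/Application Support/keyword/cache/data"

# Match paths containing the keyword, drop those mentioning a browser, and
# -prune so that only the topmost matching entry of each subtree is listed.
found=$(cd "$root" && efind . -regex '.*(keyword|KEYWORD).*' \
          -not -regex '.*(Safari|[Oo]pera).*' -prune | LC_ALL=C sort)

printf '%s\n' "$found"
rm -rf "$root"
```

On this tree the output is the two topmost keyword entries, ./Library/Application Support/keyword and ./private/KEYWORD.d; the cache/data file below the first one is not listed because -prune stops find from descending into a directory that has already matched.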