List all file names that contain two specific words (follow up)

Being new to the forum, I searched for a way to find files containing two words, not necessarily on the same line.
This thread

"List all file names that contain two specific words."

answered it in part, but I was looking for a more concise solution.

Here's a one-line suggestion using awk:

find . -name "*" -exec awk -v w1=WORD1 -v w2=WORD2 '
    $0 ~ w1 { W1 = 1 }
    $0 ~ w2 { W2 = 1 }
    END { if (W1 + W2 > 1) print FILENAME }
' {} \; 2>/dev/null

This can be embedded in a shell script that takes WORD1 and WORD2 as its two arguments.
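For example, a minimal wrapper sketch (the script name both.sh is made up, and I've added -type f and an early exit; otherwise it's the same logic as the one-liner above):

```shell
#!/bin/sh
# both.sh -- print names of files under the current directory that
# contain both words given as arguments (substring match, as above).
# Usage: both.sh WORD1 WORD2
w1=$1
w2=$2
find . -type f -exec awk -v w1="$w1" -v w2="$w2" '
    $0 ~ w1 { f1 = 1 }
    $0 ~ w2 { f2 = 1 }
    f1 && f2 { print FILENAME; exit }   # stop reading once both seen
' {} \; 2>/dev/null
```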

Okay, good try - let's consider some things.

How about:

for fname in $(find . -type f ) 
do
     grep -Fq WORD1 "$fname" && grep -Fq WORD2 "$fname" && echo "$fname"
done

I think your solution would report directories, for example. It also would report a file on a search for "goo" when the file had the word "good".

My example also has failings.

You decide. It all depends on the exact requirements for the script. Plus, you can script the same solution using multiple tools. In this case awk, grep, or even just plain bash.
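As an illustration of the "plain bash" route, here is a sketch that reads each file once with no external commands at all (substring match like the examples above; globstar needs bash 4+, and WORD1/WORD2 are placeholders):

```shell
#!/bin/bash
# Pure-bash sketch: walk the tree with globstar, scan each regular
# file line by line, and print its name once both words have been seen.
shopt -s globstar nullglob
for fname in ./**/*; do
    [[ -f $fname ]] || continue
    f1= f2=
    while IFS= read -r line; do
        [[ $line == *WORD1* ]] && f1=1
        [[ $line == *WORD2* ]] && f2=1
        [[ $f1 && $f2 ]] && { echo "$fname"; break; }
    done < "$fname"
done
```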

Hi,
If you are on Linux and/or your grep supports these options:

grep -R -Pzl 'WORD1(.*\n)*.*WORD2|WORD2(.*\n)*.*WORD1' .

Regards.

Could you consider a double grep?

find . | xargs grep -l "WORD1" | xargs grep -l "WORD2"

Notes:-

  • There is no need to specify -name "*" on the find command.
  • The -l flag (lower-case L) means you only get filenames out of grep.
  • This assumes that there are only regular files in the current directory. Add -type f to the find if this is not the case.
  • If you need exact word searching (i.e. not matching good when searching for goo) you can add the -w flag to each grep.
  • If the files are large, this will read some files twice, although if the matches are early it will not have to read the whole file.
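One more caveat (my addition, not Robin's): the pipeline as written splits on whitespace, so filenames containing spaces break it. With GNU find, grep, and xargs a null-delimited variant avoids that (-print0, -Z, and -0 are extensions, not POSIX):

```shell
# Null-delimited version of the double grep: safe for filenames
# containing spaces or newlines (needs find -print0, grep -Z, xargs -0).
find . -type f -print0 \
    | xargs -0 grep -lZ "WORD1" \
    | xargs -0 grep -l  "WORD2"
```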

I hope that this helps,
Robin

Hi.

We ran across a need for this some time ago, and wrote a solution that has worked for us.

In between projects, we discuss how we should publish our code: our own website, SourceForge, GitHub, or as a post in a thread (as Corona688 has done here, for example, among others). No consensus so far, sigh.

We have agreed that we can at least post the documentation for our utilities in hopes that it may provide motivation for others to use approaches that have worked (at least for us).

So here are some details on our rapgrep -- this is clearly not a one-line suggestion :-)

rapgrep Require all patterns grep. (what)
Path    : ~/bin/rapgrep
Version : 1.2
Length  : 307 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Help    : probably available with [     ]-h
Modules : (for perl codes)
 warnings       1.23
 strict         1.08
 English        1.09
 Carp           1.3301
 Data::Dumper   2.151_01
 Getopt::Long   2.42

and the help :

Script rapgrep reads files and matches patterns as provided by the
caller.  If all patterns successfully match at least once, then
the file name is printed.  Some details of the matching results
may be requested to be printed.

usage: rapgrep [options] -- [files]

options:
--all
  Force all lines to be searched.  The default is to quit if
  all matches are successful even if EOF is not read yet.

-e=pattern
  Use perl pattern for searching.  More than one -e=p may be used.
  However, if the control statement becomes unwieldy, see -f.

--file=pathname
  Read file at pathname for patterns, one per line.  More than
  one --file=path may be used.  All -e and -f contents are
  collected and used.  A "#" may be used for comment lines in the
  files.

--ignore
  Ignore case in matches.  Default is case is significant.

--reverse
  Invert the sense of success: if a filename normally would 
  not be printed, then print it; if normally printed, omit it.

--list=rx
  List the reasons why a filename is not printed ("r").  List the
  details of the pattern matches ("x"): how many of which pattern
  in what file.

--comment=string
  Change the comment character in the pattern files to any in the
  string.

--h (or -h)
  print this message and quit.

--version
  print the version and quit.

Best wishes ... cheers, drl

Thanks a lot for your contributions, which I compared...

My solution was the slowest (so I guess using -exec within the find command is not very efficient) and, as you predicted, it included files containing any character string, not just whole words. But that can actually be a requirement.

My awk one-liner took   real    0m45.672s : user    0m28.487s : sys     0m15.590s
Jim's loop took         real    0m20.383s : user    0m8.548s  : sys     0m10.992s
Robin's was faster      real    0m4.126s  : user    0m2.883s  : sys     0m0.303s

... and disedorgue's solution caused a core dump as run, so I didn't try fiddling with it too much, as it's not my system! In any case our production system has many non-Linux machines, so bash options won't work.
Not sure where the code is for drl's solution; I didn't find rapgrep on Bing either. What am I missing?
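In the meantime, the "require all patterns" idea behind rapgrep can be approximated in a few lines of awk (my own sketch, not drl's code; the patterns are hard-coded here, whereas rapgrep takes them via -e/--file):

```shell
# Not rapgrep -- a minimal sketch of "require all patterns": print the
# file name once every pattern has matched at least once in that file.
find . -type f -exec awk '
    BEGIN { pats["WORD1"]; pats["WORD2"]; need = 2 }
    {
        for (p in pats)
            if (!seen[p] && $0 ~ p) { seen[p] = 1; need-- }
        if (need == 0) { print FILENAME; exit }
    }
' {} \;
```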

Regards,

---------- Post updated at 04:48 AM ---------- Previous update was at 03:56 AM ----------

Hi again,
You guys have opened my eyes regarding find . -exec ..., which I regularly use.
I know this isn't strictly the post subject, but I just wanted to comment on the difference between

time find . 2>/dev/null | xargs grep -l "$chn1" 2>/dev/null | xargs grep -l "$chn2" 2>/dev/null

real 0m6,38s

and

time find . -exec grep -l "$chn1" {} \; 2>/dev/null | xargs grep -l "$chn2" 2>/dev/null

real 2m15,43s !!!!!!!!!!!

Thanks for this revelation !

The difference you are seeing is probably because your find . -exec grep ... runs the grep command individually for each file. The use of xargs in my suggestion reduces the number of command calls and therefore the number of processes spawned. It may not be the best way, but it works okay.

You might be able to use a + at the end of the -exec section of the find instead of the \; , but it depends on which version of find you have.
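For illustration, the + form of the earlier command would look like this (POSIX find supports -exec ... {} +; grep -l still prints one name per matching file, so the rest of the pipeline is unchanged):

```shell
# The "+" terminator batches many pathnames into each grep invocation,
# so grep is spawned a few times rather than once per file.
find . -type f -exec grep -l "WORD1" {} + 2>/dev/null \
    | xargs grep -l "WORD2" 2>/dev/null
```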

Be aware that times may vary depending on the number of files and their sizes, so searching a very few large files may be slow with my suggestion because it will potentially read the files twice.

Glad to have helped a bit this time, but do keep experimenting if the times get longer.

Kind regards,
Robin

Hi.

We have not published the code yet; we are still deciding how and where to do it.

Best wishes ... cheers, drl

Ok, my solution is still too experimental...
Another solution with perl:

find . -type f -exec perl -ne '$w1 ||= /WORD1/; $w2 ||= /WORD2/; if (eof) { print "$ARGV\n" if $w1 && $w2; $w1 = $w2 = 0; close ARGV }' {} +

Regards.