Can someone please help me optimize my code (script searches subdirectories)?

Here is my code. It reads an input file (input.txt, which contains roughly 2,000 search phrases) and searches a directory for files that contain each search phrase. The directory contains roughly 1,900 files and 84 subdirectories. The output is a file (output.txt) that lists only the names of the files that contain the searched keyword. I timed this code and it took roughly 3.38 hours to run! Can someone help me optimize my code, or provide me with some suggestions?

#!/bin/sh
start=$SECONDS
while read word
do
a=$(find /path/to/files -exec grep -wi $word /dev/null {} \; | sort -u | cut -d \: -f1)
if [ -n "$a" ]; then
echo "$word is found in: $a"
fi
echo ""
done < input.txt >> output.txt
end3=$SECONDS
echo "Total Runtime: $((end3 - start3)) secs."

Suggestions:
1) Correct the Runtime calculation!

start3=$SECONDS

2) Only search regular files (-type f) and use "grep -l" to get each file name once only. Put quotes round "$word" in case it is a "phrase".

a=$(find /path/to/files -type f -exec grep -wil "$word" /dev/null {} \;)

Sorry, I had to modify parts of my code for the forums and that one slipped through the cracks!

In addition to methyl's suggestion, try whether using xargs improves the performance:

a=$(find /path/to/files -type f -print0 | xargs -0 grep -wil "$word" )

How about this using awk:

find /path/to/files -type f -print | awk '
NR==FNR{for(i=1;i<=NF;i++) w[tolower($i)]++ ; next }
{ FILE=$0
  while(getline< FILE) {
     for(i=1;i<=NF;i++) {
         if($i && tolower($i) in w) print tolower($i)" is found in: "FILE
      }
  }
  close(FILE)
}' input.txt - >> output.txt

Sorry, I didn't pick up that the requirement was to find phrases, not individual words. This should work, but not quite as blazingly fast:

Edit: it also avoids printing the result more than once if a phrase appears multiple times in a file.

find /path/to/files -type f -print | awk '
NR==FNR{w[tolower($0)" "]++ ; next}
{ FILE=$0
  delete h
  while(getline< FILE) {
     $0=" "tolower($0)" "
     for(L in w)
           if(!(L in h) && match($0, " "L)) {
              print L "is found in: "FILE
              h[L]++
           }
  }
  close(FILE)
}' input.txt - >> output.txt

Does your grep have recursive capabilities (-r / -R )? Then you could perhaps use this instead of your script:

grep -Frilwf input.txt /path/to/files > output.txt

No recursive capabilities. Now I'm jealous...

Perhaps you can try whether this works instead of grep -r:

find /path/to/files -type f -exec grep -Filwf input.txt {} \; > output.txt

This sort of works. The input file contains phrases found in multiple files under multiple directories, so with your code I'd just get one large file with the output and wouldn't know where one entry finishes and the next starts. I guess I could modify it a bit.

That is how I interpreted the first post.

Can you specify what you are after?

Right now, my script works, but it takes roughly 3 hours to run. I'd like to feed in an input file containing a list of phrases. Those phrases are found in several files in multiple directories. Right now, my output looks like the following:

"SEARCHED TERM" is linked to the following:
/path/to/file1.txt
/different/path/to/file2.txt
/another/path/to/file2.txt

Does your grep have a -o option? What OS are you using?

The -o switch is not available. I am running on an AIX machine.

Which version? Do you have a -H option?

I timed my awk script (from post #5) against a mid-sized Gutenberg collection (5,354,620 lines of text in 203 documents, 20 directories), with a phrase list of 3,939 phrases.

Processing time: 1h 43min

The original script is still running (Over 17h now)

For some reason, I had some trouble running both of those scripts. The only things I really need to change are the input/output files and the path to the directory, right?

The delete statement may also give you some issues on AIX, as I think it might be a GNU extension or only supported in later implementations of awk.

Try:

S=$SECONDS
find /path/to/files -type f -print | awk '
NR==FNR{w[" "tolower($0)" "]++ ; next}
{ FILE=$0;
  split("",h,",");
  while(getline< FILE) {
     $0=" "tolower($0)" "
     for(l in w)
           if(!(l in h) && match($0, l)) {
              print substr(l,2)"is found in: "FILE
              h[l]++
           }
  }
  close(FILE)
}' input.txt - > output.txt
echo "Processing time: "$((SECONDS-S))

The split statement is a slightly more portable way to clear an array.

Just change /path/to/files and input.txt to match your particular setup.
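
For reference, here is a tiny standalone demo (not from the thread) of clearing an array this way; it should behave the same in POSIX awk and in the older AIX awk:

awk 'BEGIN {
  h["a"] = 1; h["b"] = 2
  n = 0; for (i in h) n++; print "before clear: " n    # prints 2
  split("", h)                                         # splitting an empty string empties h
  n = 0; for (i in h) n++; print "after clear: " n     # prints 0
}'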

The delete statement is supported in POSIX awk, but only on array elements, not on whole arrays. So delete h is an extension, but this should work:

for(i in h) delete h[i]

This would mean that only words or phrases between spaces get matched, but there are undoubtedly cases with , . ! ? ; : and at the beginning or end of a line... no?
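
One way round that (just a sketch, untested against your data, and using index() for a literal check rather than match()) would be to map punctuation to spaces before doing the " phrase " comparison, along these lines:

echo "Testing, end of line." | awk '
{ $0 = " " tolower($0) " "
  gsub(/[.,!?;:]/, " ")                  # treat punctuation as a word boundary too
  if (index($0, " end of line ")) print "matched"
}'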

Given the word-matching abilities of grep, I thought the best approach would be to optimize your script. I changed the following parts:

  • Replace all those separate find invocations with a single find and store the result in a variable called filelist.
  • By running grep with the -l option and changing the environment variable IFS so that it only contains a linefeed, the sort, the cut, and the call to /dev/null are no longer needed.
  • Add the -F flag to grep to switch off regex matching and use literal matching; this also ensures no unintended matches occur.

This resulted in the following script, which you could try:

oldIFS=$IFS
IFS="
"
filelist=$(find /path/to/files -type f)
while read word
do
  a=$(grep -Fwil "$word" $filelist)
  if [ -n "$a" ]; then
    echo "$word is found in: "
    echo "$a"
  fi
  echo ""
done < input.txt > output.txt
IFS=$oldIFS

Preliminary testing showed a factor-of-15 speed improvement, YMMV.

AIX 5.3 by default only has 4K for argument expansion, which can result in "Argument/Parameter list too long" errors when processing quite short parameter strings.

I'm almost sure that xargs would be required in the above script, depending on the actual length of /path/to/files and the current value of the ncargs OS parameter.
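
Something along these lines might get around that limit (a sketch only, assuming the file names contain no embedded spaces; /tmp/filelist.$$ is just a scratch file):

filelist=/tmp/filelist.$$
find /path/to/files -type f > "$filelist"

while read word
do
  # xargs splits the file list into batches that fit the argument-length limit
  a=$(xargs grep -Fwil "$word" < "$filelist")
  if [ -n "$a" ]; then
    echo "$word is found in: "
    echo "$a"
  fi
  echo ""
done < input.txt > output.txt

rm -f "$filelist"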

You can try this. Its limit is shell expansion, and it doesn't cope well if the same search string appears two or more times in the same file.
In that case it will give output like:

<search string> is present in :
/path/to/filename1
/path/to/filename1

INPUT=$(tr "\n" "|" < input.txt | sed "s/|$//")
find /path/to/dir -type f -exec egrep -wi "$INPUT" {} /dev/null \; | \
awk -F":" '{ a[$2] = a[$2] "|" $1 } END  { for ( i in a ) print i " is present in :" a } ' | \
tr "|" "\n"