Can someone please help me optimize my code (script searches subdirectories)?

Here is my code. It reads an input file (input.txt, which contains roughly 2,000 search phrases) and searches a directory for files that contain each search phrase. The directory contains roughly 1,900 files and 84 subdirectories. The output is a file (output.txt) that lists only the names of the files that contain the searched keyword. I timed this code and it took roughly 3.38 hours to run! Can someone help me optimize my code, or provide me with some suggestions?

#!/bin/sh
start=$SECONDS
while read word
do
a=$(find /path/to/files -exec grep -wi $word /dev/null {} \; | sort -u | cut -d \: -f1)
if [ -n "$a" ]; then
echo "$word is found in: $a"
fi
echo ""
done < input.txt >> output.txt
end3=$SECONDS
echo "Total Runtime: $((end3 - start3)) secs."

Suggestions:
1) Correct the Runtime calculation!

start3=$SECONDS

2) Only search regular files (-type f) and use "grep -l" to get each file name once only. Put quotes round "$word" in case it is a "phrase".

a=$(find /path/to/files -type f -exec grep -wil "$word" /dev/null {} \;)

Sorry, I had to modify parts of my code for the forums and that one slipped through the cracks!

In addition to methyl's suggestion, try whether using xargs improves the performance:

a=$(find /path/to/files -type f -print0 | xargs -0 grep -wil "$word" )

How about this using awk:

find /path/to/files -type f -print | awk '
NR==FNR{for(i=1;i<=NF;i++) w[tolower($i)]++ ; next }
{ FILE=$0
  while(getline< FILE) {
     for(i=1;i<=NF;i++) {
         if($i && tolower($i) in w) print tolower($i)" is found in: "FILE
      }
  }
  close(FILE)
}' input.txt - >> output.txt

Sorry, I didn't pick up that the requirement was to find phrases, not individual words. This should work, but not quite as blazingly fast:

Edit: it also avoids printing the result more than once if a phrase appears multiple times in a file.

find /path/to/files -type f -print | awk '
NR==FNR{w[tolower($0)" "]++ ; next}
{ FILE=$0
  delete h
  while(getline< FILE) {
     $0=" "tolower($0)" "
     for(L in w)
           if(!(L in h) && match($0, " "L)) {
              print L "is found in: "FILE
              h[L]++
           }
  }
  close(FILE)
}' input.txt - >> output.txt

Does your grep have recursive capabilities (-r / -R )? Then you could perhaps use this instead of your script:

grep -Frilwf input.txt /path/to/files > output.txt

No recursive capabilities. Now I'm jealous...

Perhaps you can try whether this works instead of grep -r:

find /path/to/files -type f -exec grep -Filwf input.txt {} \; > output.txt

This sort of works. The input file contains phrases found in multiple files under multiple directories, so with your code I'd just get one large file with the output and wouldn't know where one entry finishes and the next starts. I guess I could modify it a bit.

That is how I interpreted the first post.

Can you specify what you are after?

Right now, my script works, but it takes roughly 3 hours to run. I'd like to feed in an input file containing a list of phrases. Those phrases are found in several files in multiple directories. Right now, my output looks like the following:

"SEARCHED TERM" is linked to the following:
/path/to/file1.txt
/different/path/to/file2.txt
/another/path/to/file2.txt

Does your grep have a -o option? What OS are you using?

The -o switch is not available. I am running on an AIX machine.

Which version? Do you have a -H option?

I timed my awk script (from post #5) against a mid-sized Gutenberg collection (5,354,620 lines of text in 203 documents, 20 directories), with a phrase list of 3,939 phrases.

Processing time: 1h 43min

The original script is still running (Over 17h now)

For some reason, I had some trouble running both of those scripts. The only things I really need to change are the input/output files and the path to the directory, right?

The delete statement may also give you some issues on AIX, as I think it might be a GNU extension or only supported in later implementations of awk.

Try:

S=$SECONDS
find /path/to/files -type f -print | awk '
NR==FNR{w[" "tolower($0)" "]++ ; next}
{ FILE=$0;
  split("",h,",");
  while(getline< FILE) {
     $0=" "tolower($0)" "
     for(l in w)
           if(!(l in h) && match($0, l)) {
              print substr(l,2)"is found in: "FILE
              h[l]++
           }
  }
  close(FILE)
}' input.txt - > output.txt
echo "Processing time: "$((SECONDS-S))

The split statement is a slightly more portable way to clear an array.

Just change /path/to/files and input.txt to match your particular setup.
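
For reference, here is a tiny standalone demo (not from the thread) of clearing an array this way; it should behave the same in POSIX awk and in the older AIX awk:

awk 'BEGIN {
  h["a"] = 1; h["b"] = 2
  n = 0; for (i in h) n++; print "before clear: " n    # prints 2
  split("", h)                                         # splitting an empty string empties h
  n = 0; for (i in h) n++; print "after clear: " n     # prints 0
}'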

The delete statement is supported in POSIX awk, but only on array elements, not on whole arrays. So delete h is an extension, but this should work:

for(i in h) delete h[i]

This would mean that only words or phrases between spaces get matched, but there are undoubtedly cases with , . ! ? ; : and at the beginning or end of a line... no?
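
One way round that (just a sketch, untested against your data, and using index() for a literal check rather than match()) would be to map punctuation to spaces before doing the " phrase " comparison, along these lines:

echo "Testing, end of line." | awk '
{ $0 = " " tolower($0) " "
  gsub(/[.,!?;:]/, " ")                  # treat punctuation as a word boundary too
  if (index($0, " end of line ")) print "matched"
}'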

Given the word-matching abilities of grep, I thought the best approach would be to optimize your script. I changed the following parts:

  • Replace all those separate find invocations with a single find and store the result in a variable called filelist.
  • By running grep with the -l option and changing the environment variable IFS so that it only contains a linefeed, the sort, the cut, and the call to /dev/null are no longer needed.
  • Add the -F flag to grep to switch off regex matching and use literal matching; this also ensures no unintended matches occur.

This resulted in the following script, which you could try:

oldIFS=$IFS
IFS="
"
filelist=$(find /path/to/files -type f)
while read word
do
  a=$(grep -Fwil "$word" $filelist)
  if [ -n "$a" ]; then
    echo "$word is found in: "
    echo "$a"
  fi
  echo ""
done < input.txt > output.txt
IFS=$oldIFS

Preliminary testing showed a factor-of-15 speed improvement, YMMV.

AIX 5.3 by default only has 4K for argument expansion, which can result in "Argument/Parameter list too long" errors when processing quite short parameter strings.

I'm almost sure that xargs would be required in the above script, depending on the actual length of /path/to/files and the current value of the ncargs OS parameter.
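
Something along these lines might get around that limit (a sketch only, assuming the file names contain no embedded spaces; /tmp/filelist.$$ is just a scratch file):

filelist=/tmp/filelist.$$
find /path/to/files -type f > "$filelist"

while read word
do
  # xargs splits the file list into batches that fit the argument-length limit
  a=$(xargs grep -Fwil "$word" < "$filelist")
  if [ -n "$a" ]; then
    echo "$word is found in: "
    echo "$a"
  fi
  echo ""
done < input.txt > output.txt

rm -f "$filelist"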

You can try this. Its limit is shell expansion, and it doesn't cope well if the same search string appears two or more times in the same file.
In that case it will give output like:

<search string> is present in :
/path/to/filename1
/path/to/filename1

INPUT=$(tr "\n" "|" < input.txt | sed "s/|$//")
find /path/to/dir -type f -exec egrep -wi "$INPUT" {} /dev/null \; | \
awk -F":" '{ a[$2] = a[$2] "|" $1 } END  { for ( i in a ) print i " is present in :" a } ' | \
tr "|" "\n"