Here is my code. It reads an input file (input.txt, which contains roughly 2,000 search phrases) and searches a directory for files that contain each phrase. The directory contains roughly 1,900 files and 84 subdirectories. The output is a file (output.txt) that lists only the names of the files containing the searched keyword. I timed this code and it took roughly 3.38 hours to run!!! Can someone help me optimize my code, or provide me with some suggestions?
#!/bin/sh
start=$SECONDS
while read word
do
    a=$(find /path/to/files -exec grep -wi "$word" /dev/null {} \; | sort -u | cut -d: -f1)
    if [ -n "$a" ]; then
        echo "$word is found in: $a"
    fi
    echo ""
done < input.txt >> output.txt
end=$SECONDS
echo "Total Runtime: $((end - start)) secs."
This sort of works. The input file contains phrases found in multiple files under multiple directories, so with your code I'd just get one large file with only the output and wouldn't know where one entry finishes and the next starts. I guess I could modify it a bit.
Right now my script works, but it takes roughly 3 hours to run. I'd like to feed in an input file containing a list of phrases. Those phrases are found in several files across multiple directories. Right now, my output looks like the following:
"SEARCHED TERM" is linked to the following:
/path/to/file1.txt
/different/path/to/file2.txt
/another/path/to/file2.txt
I timed my awk script (from post #5) against a mid-sized Gutenberg collection (5,354,620 lines of text in 203 documents, 20 directories), using a phrase list of 3,939 phrases.
Processing time: 1h 43min
The original script is still running (over 17h now).
For some reason, I had some trouble running both of those scripts. The only things I really need to change are the input/output files and the path to the directory, right?
The delete statement is supported in POSIX awk, but not on whole arrays. So delete h is an extension, but this should work:
for(i in h) delete h[i]
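A quick way to see the difference, as a minimal sketch (the sample data here is made up):

```shell
# Count distinct words, then clear the array the POSIX-portable way
printf 'a\nb\na\n' | awk '
    { h[$1]++ }
    END {
        n = 0; for (i in h) n++
        print "before: " n
        for (i in h) delete h[i]   # "delete h" alone is a gawk extension
        n = 0; for (i in h) n++
        print "after: " n
    }
'
# prints "before: 2" then "after: 0"
```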
This would mean that only words or phrases between spaces get matched, but there are undoubtedly cases with , . ! ? ; : next to the word, or at the beginning or end of a line, no?
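For what it's worth, grep -w treats any non-word character, including punctuation, as a word boundary, so those cases are covered. A quick check (sample input made up):

```shell
# -w: match whole words only; -c: count matching lines.
# "hello, world!" matches (comma is a boundary), "helloworld" does not.
printf 'hello, world!\nhelloworld\n' | grep -wc 'hello'
# prints 1
```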
---------- Post updated at 13:14 ---------- Previous update was at 07:13 ----------
Given the word-matching abilities of grep, I thought the best approach would be to optimize your script. I changed the following parts:
- replaced all those finds with a single find and stored the result in a variable called filelist.
- by running grep with the -l option and changing the IFS environment variable so that it contains only a linefeed, the sort, the cut, and the call to /dev/null were no longer needed.
- added the -F flag to grep to switch off regex matching and use literal matching; it also ensures no unintended matches occur.
This resulted in this script you could try:
oldIFS=$IFS
IFS="
"
filelist=$(find /path/to/files -type f)
while read word
do
    a=$(grep -Fwil "$word" $filelist)
    if [ -n "$a" ]; then
        echo "$word is found in: "
        echo "$a"
    fi
    echo ""
done < input.txt > output.txt
IFS=$oldIFS
Preliminary testing showed a factor-15 speed improvement; ymmv.
AIX 5.3 by default allows only 4K for argument expansion, which can result in "Argument list too long" errors even when processing quite short parameter strings.
I'm almost sure that xargs would be required in the above script, depending on the actual length of the expanded file list under /path/to/files and the current value of the ncargs OS parameter.
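A sketch of how that might look with xargs (untested on AIX; the paths are placeholders from the thread, and filenames containing whitespace would still need care, since xargs splits on whitespace by default):

```shell
#!/bin/sh
# Build the file list once, in a temp file rather than a shell variable,
# so the expansion never hits the ARG_MAX / ncargs limit
find /path/to/files -type f > /tmp/filelist.$$

while read word
do
    # xargs chops the file list into chunks that fit the argument limit
    # and runs grep once per chunk
    a=$(xargs grep -Fwil -- "$word" < /tmp/filelist.$$)
    if [ -n "$a" ]; then
        echo "$word is found in: "
        echo "$a"
    fi
    echo ""
done < input.txt > output.txt

rm -f /tmp/filelist.$$
```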
You can try this; its limit is shell expansion, and it doesn't handle well the case where the same search string appears two or more times in the same file.
In that case it will give output like:
<search string> is present in :
/path/to/filename1
/path/to/filename1
INPUT=$(tr "\n" "|" < input | sed "s/|$//")
find /path/to/dir -type f -exec egrep -wi "$INPUT" {} /dev/null \; | \
awk -F":" '{ a[$2] = a[$2] "|" $1 } END { for ( i in a ) print i " is present in :" a[i] }' | \
tr "|" "\n"
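The duplicate-filename issue could be handled inside awk by skipping file/match pairs already seen — a sketch keeping the same field layout ($1 = filename, $2 = matched line); the sample input here is made up:

```shell
# Same pipeline shape as above, with a "seen" guard so each file is
# appended at most once per matched line
printf '/p/f1:foo\n/p/f1:foo\n/p/f2:foo\n' |
awk -F":" '
    !seen[$2 FS $1]++ { a[$2] = a[$2] "|" $1 }   # record each file once
    END { for (i in a) print i " is present in :" a[i] }
' |
tr "|" "\n"
```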