Hi Chubler, thanks, of course. That was pretty silly.
Then it gets a bit more complicated, as we would need to chop the file list into snack-size chunks. We could try this:
oldIFS=$IFS
IFS="
"
# Grep one chunk ("snack") of files for the current word,
# printing a header line the first time the word is found.
write_snack()
{
  if a=$(grep -Fwil "$word" $snack); then
    if ! $wordfound; then
      printf "%s\n" "$word is found in: "
      wordfound=true
    fi
    printf "%s\n" "$a"
  fi
  i=0
  snack=""
}
snacksize=25 # Nr of files to feed to grep at a time
i=0 snack=""
filelist=$(find /path/to/files -type f)
while read word
do
  wordfound=false
  for f in $filelist
  do
    # Add the file first, then flush once the chunk is full;
    # otherwise the file that triggers the flush would be skipped.
    snack=${snack}${IFS}${f}
    if [ $((i+=1)) -ge $snacksize ]; then
      write_snack
    fi
  done
  if [ $i -gt 0 ]; then
    write_snack   # flush the remainder
  fi
  printf "\n"
done < input.txt > output.txt
IFS=$oldIFS
The last post by Scrutinizer suggested to me that parallelization might be feasible here.
The OP says nothing about the characteristics of the AIX box, but I seem to recall having used AIX on a 12-CPU dual 3090 that had a lot of processing power (regrettably only 32-bit, but that's another story).
So if the box has enough oomph, then firing off a number of background processes, each handling a number of files, could decrease the real time, which is apparently what concerns the OP.
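A minimal sketch of that idea, using only POSIX split and xargs. The file names, chunk size, and the word being searched for are all made up for illustration; a throwaway temp directory stands in for the real file tree:

```shell
# A toy file tree stands in for the real one; names are illustrative.
dir=$(mktemp -d)
mkdir "$dir/files"
printf 'hello world\n' > "$dir/files/a.txt"
printf 'goodbye\n'     > "$dir/files/b.txt"
printf 'hello again\n' > "$dir/files/c.txt"

word=hello
find "$dir/files" -type f > "$dir/filelist"

# Split the file list into chunks of 2 files each (POSIX split -l).
split -l 2 "$dir/filelist" "$dir/chunk."

# One background grep per chunk; each job writes to its own output file.
for c in "$dir"/chunk.??; do
  xargs grep -Fwil "$word" < "$c" > "$c.out" &
done
wait   # let all background jobs finish before collecting results

cat "$dir"/chunk.*.out
```

Each chunk runs on its own CPU, so on a multi-processor box the wall-clock time should drop roughly with the number of chunks, as long as the disk can keep up.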
The runtime for this code was about 20 minutes vs. my script's runtime of 2 hours. YAY! Unfortunately, when I ran a comparison on the output files, I found several differences. I'll have to look at the individual files to see why they're different. Thanks!
The difference may be because in my script grep is using the -F option, which means a literal match. If you don't do that with arbitrary strings, you may get unintentional matches; for example, a single . (dot) means "any character". If input.txt contains regular expressions instead of plain strings, then you should leave out the -F option...
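For instance, with a hypothetical one-line input:

```shell
# Without -F the pattern is a regular expression: "." matches any character.
printf 'cat\n' | grep -c 'c.t'              # prints 1

# With -F the pattern is a literal string: "c.t" is not "cat".
# (|| true because grep exits non-zero when nothing matches.)
printf 'cat\n' | grep -cF 'c.t' || true     # prints 0
```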