Script takes too long to complete

mohtashims · June 27, 2016, 8:31am

Hi,

I have a lengthy script which i have trimmed down for a test case as below.

more run.sh
#!/bin/bash
paths="allpath.txt"
while IFS= read -r loc
do
  echo "Working on $loc"
  startdir=$loc
  find "$startdir" -type f \( ! -name "*.log*" ! -name "*.class*" \) -print |
  while read file
  do
    echo "TIME-STAMP:"$(date)
    input="alter.txt"
    while IFS= read -r var
    do
      searchterm=$(echo $var | awk -F'=' '{print $1}')
      replaceterm=$(echo $var | awk -F'=' '{print $2}')
      echo "ST:"$searchterm
      echo "RT:"$replaceterm
    done < "$input"
  done
done < "$paths"

more alter.txt

hello=yello
wow=how
ping=pong
seesaw=heehow
bongo=ringo
jazbant=toaster
westowin=restaurant

more allpath.txt

/tmp/web/var/APPLE_DOM
/tmp/bin/var/APPLE_ART

The above script reads each physical path from allpath.txt. It then looks for all files except .log and .class files using the find command find "$startdir" -type f \( ! -name "*.log*" ! -name "*.class*" \) -print.

Note: the find command above returns almost instantly when I run it on its own from the bash shell in that directory. It takes less than 2 seconds to list all the files.

For each file found, the loop while IFS= read -r var goes through all the "search strings" listed in alter.txt and replaces each one with the corresponding text (I have not shared that part of the code, as I did not consider it necessary).

For a folder 4GB in size it takes around 25 minutes to complete.

Can you help me optimize the script so it completes in less time?

RudiC · June 27, 2016, 8:55am

The entire logic and structure of that script seems suboptimal. For every file found, you (re)open "alter.txt", read every single line, invoke awk twice and - I'm guessing based on your other threads - run something like sed to do the replacements.

Depending on the found files' count this IS going to be lengthy.
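To make that cost concrete, here is a rough count of the external processes the script in post #1 starts. The file count is a made-up figure purely for illustration; the 7 is the number of pairs in the sample alter.txt, and any sed call for the actual replacement would come on top.

```shell
files=10000        # hypothetical number of files found (not a figure from this thread)
pairs=7            # search/replace lines in the sample alter.txt
awk_per_pair=2     # one awk for searchterm, one for replaceterm
date_per_file=1    # $(date) for the TIME-STAMP line

per_file=$(( date_per_file + pairs * awk_per_pair ))
total=$(( files * per_file ))
echo "$per_file external processes per file, $total in total"
```

The total grows with files times pairs, which is why moving work out of the innermost loop matters more than tuning any single command.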

I'm not talking of improving the innermost loop here - although there is quite some potential.
Why don't you leave the looping to one single instance of e.g. awk ?
Create a list of all file candidates ( find can have several paths as starting points) and run awk , first reading all the search/replacement pairs, and then working those on all files presented.
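As a minimal sketch of the "several starting points" part only: the start directories can be collected first and handed to one find call. The sandbox directory and the file names in it are hypothetical, set up just so the snippet runs anywhere; real use would read the existing allpath.txt.

```shell
# Hypothetical sandbox; not part of the original script.
demo=$(mktemp -d)
mkdir -p "$demo/APPLE_DOM" "$demo/APPLE_ART"
touch "$demo/APPLE_DOM/a.txt" "$demo/APPLE_DOM/b.log" \
      "$demo/APPLE_ART/c.xml" "$demo/APPLE_ART/D.class"
printf '%s\n' "$demo/APPLE_DOM" "$demo/APPLE_ART" > "$demo/allpath.txt"

# Collect every start directory as one positional parameter each,
# so paths containing spaces stay intact.
set --
while IFS= read -r loc
do
  [ -n "$loc" ] && set -- "$@" "$loc"
done < "$demo/allpath.txt"

# A single find call walks all start directories.
candidates=$(find "$@" -type f ! -name "*.log*" ! -name "*.class*" -print | sort)
printf '%s\n' "$candidates"
```

Note that set -- overwrites the script's own positional parameters; in bash an array would do the same job without that side effect.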

mohtashims · June 27, 2016, 10:35am

Yes, you are right ... I do use sed for the replacement, but I left it out to keep the example simple for others.

I am not sure I understood you correctly, or whether I can work this out.

What I understood is:

You are asking me to keep the find inside the while IFS= read -r var loop?

Is that correct?

---------- Post updated at 09:35 AM ---------- Previous update was at 08:31 AM ----------

Keeping the find inside the while IFS= read -r var loop helps cut down the time taken by more than half !!

Here is the latest code snippet

more run.sh
#!/bin/bash
paths="allpath.txt"
while IFS= read -r loc
do
  echo "Working on $loc"
  startdir=$loc
  input="alter.txt"
  while IFS= read -r var
  do
    searchterm=$(echo $var | awk -F'=' '{print $1}')
    replaceterm=$(echo $var | awk -F'=' '{print $2}')
    find "$startdir" -type f \( ! -name "*.log*" ! -name "*.class*" \) -print |
    while read file
    do
      echo "TIME-STAMP:"$(date)
      echo "ST:"$searchterm
      echo "RT:"$replaceterm
    done
  done < "$input"
done < "$paths"

Can it be optimized further ?

RudiC · June 27, 2016, 12:03pm

No, this is not what I said.

To repeat my statement bluntly: It should be replaced.

MadeInGermany · June 27, 2016, 12:13pm

Yes, your inner loop (in post #1) should fill both variables in one go; use the correct IFS (internal field separator):

    input="alter.txt"
    while IFS="=" read -r searchterm replaceterm
    do
      echo "ST:$searchterm"
      echo "RT:$replaceterm"
    done < "$input"
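A small self-contained check of how this read-based split behaves, with a hypothetical sample file. The third line contains a second "=" on purpose: read puts everything after the first "=" into the last variable, whereas the awk '{print $2}' version would return only the text up to the next "=".

```shell
# Hypothetical sample input, written to a temp file for the demo.
input=$(mktemp)
printf '%s\n' 'hello=yello' 'wow=how' 'key=a=b' > "$input"

pairs=""
while IFS="=" read -r searchterm replaceterm
do
  # No command substitution and no awk: the shell splits the line itself.
  pairs="$pairs[$searchterm|$replaceterm]"
done < "$input"
rm -f "$input"

echo "$pairs"    # prints [hello|yello][wow|how][key|a=b]
```

For the sample alter.txt in post #1 the two approaches give the same values, since no line there has more than one "=".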

RudiC · June 27, 2016, 12:30pm

With the assumptions:

not too many lines in "allpath.txt",
not too many files found,
bash being used,
wouldn't this do?

awk '
FNR == NR       {R[$1] = $2
                 next
                }
                {for (r in R) gsub (r, R[r])
                 print > (FILENAME ".new")
                }
' FS="=" alter.txt $(find $(< allpath.txt) -type f \( ! -name "*.log*" ! -name "*.class*" \))
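The command above depends on the shell splitting the output of $(find ...) into words, so a path containing a space would break it, and a very long file list can exceed the command-line limit. A possible variant, offered as a sketch and not as the poster's method, lets find pass the names to awk itself with -exec ... {} +. The close() call, the empty-line guard and the '*.new' exclusion are additions of this sketch; the sandbox and its file names are hypothetical. As in the original, gsub treats each search term as a regular expression, so terms with characters such as "." or "*" would need escaping.

```shell
# Hypothetical sandbox standing in for the real directories.
demo=$(mktemp -d)
mkdir -p "$demo/APPLE DOM"                    # a space in the name, on purpose
printf 'hello wow\n' > "$demo/APPLE DOM/a.txt"
printf 'hello\n'     > "$demo/APPLE DOM/skip.log"
printf '%s\n' 'hello=yello' 'wow=how' > "$demo/alter.txt"

find "$demo/APPLE DOM" -type f \
  ! -name '*.log*' ! -name '*.class*' ! -name '*.new' \
  -exec awk '
    # First file (alter.txt): load the search/replace pairs, skipping empty lines.
    FNR == NR { if ($1 != "") R[$1] = $2; next }
    # Start of each later file: close the previous output, pick the new one.
    FNR == 1  { if (out != "") close(out); out = FILENAME ".new" }
    # Every line of every data file: apply all pairs, write to the .new file.
              { for (r in R) gsub(r, R[r]); print > out }
  ' FS="=" "$demo/alter.txt" {} +

cat "$demo/APPLE DOM/a.txt.new"    # prints: yello how
```

If find has to split the list into several batches, alter.txt is simply read again at the start of each awk run, which is still far cheaper than re-reading it for every file.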

You will have to rename the ".new" files afterwards.
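One way the rename could be done, sketched on a hypothetical sandbox. It overwrites each original with its ".new" counterpart, so it is worth trying on a copy of the data first.

```shell
# Hypothetical sandbox: one original file and its rewritten ".new" copy.
demo=$(mktemp -d)
printf 'hello\n' > "$demo/a.txt"
printf 'yello\n' > "$demo/a.txt.new"

# Move every *.new file over the file it was generated from.
find "$demo" -type f -name '*.new' -exec sh -c '
  for new
  do
    mv -- "$new" "${new%.new}"    # strip the trailing ".new"
  done
' sh {} +

cat "$demo/a.txt"    # prints: yello
```

Passing the names as arguments to sh -c, instead of pasting {} into the script text, keeps file names with spaces or quotes safe.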

mohtashims · June 27, 2016, 12:31pm

MadeInGermany:

Yes, your inner loop (in post #1) should fill both variables in one go; use the correct IFS (internal field separator):

    input="alter.txt"
    while IFS="=" read -r searchterm replaceterm
    do
      echo "ST:$searchterm"
      echo "RT:$replaceterm"
    done < "$input"

If you look at the modified script in my last post, it takes the same time with or without this suggestion.

I was able to bring the execution time down from 25 minutes to just 7 minutes.

Please suggest whether anything else can be done to optimize this.

@RudiC:

I'm sorry that I could not completely understand your suggestion.

Could you please elaborate on it, but only if it is not already covered by the updated script in my last post?

RudiC · June 27, 2016, 12:35pm

See my post (#6) just above yours.

MadeInGermany · June 27, 2016, 12:47pm

Your post #3 is 4 times faster but produces more output lines, so I guess something is going wrong.
Try my optimization with your post #1; it is 10 times faster and produces identical output.
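For anyone who wants to compare the two splitting methods on their own machine, a rough timing sketch follows. The iteration count is an arbitrary assumption, date +%s only resolves whole seconds, and the numbers will differ from system to system, so this says nothing about the exact speed-up factors quoted above.

```shell
line='hello=yello'
n=1000        # hypothetical iteration count; raise it for clearer numbers

start=$(date +%s)
i=0
while [ "$i" -lt "$n" ]
do
  # Post #1 style: two command substitutions, each running echo | awk.
  s1=$(echo "$line" | awk -F'=' '{print $1}')
  r1=$(echo "$line" | awk -F'=' '{print $2}')
  i=$((i + 1))
done
awk_secs=$(( $(date +%s) - start ))

start=$(date +%s)
i=0
while [ "$i" -lt "$n" ]
do
  # read-based split: done by the shell, no external program is started.
  IFS="=" read -r s2 r2 <<EOF
$line
EOF
  i=$((i + 1))
done
read_secs=$(( $(date +%s) - start ))

echo "awk: ${awk_secs}s  read: ${read_secs}s  (n=$n)"
```

Both loops end with the same values in their variables; only the time spent getting there differs.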