BASH Execution Delay / Speedup

I have a BASH script that runs a continuous loop, reading a line from a file and then spawning a background process to use it. I've placed "date" commands inside it to see where it's slowing down, and everything inside -- including reading the line from the file -- is fast, but the loop bogs down AROUND THE "DONE" STATEMENT. Getting from the end of the loop back to the top often takes 30 to 200 seconds!!

I've tried many things to no avail. I've also limited the number of background processes from 400 all the way down to ZERO (runs the single command in foreground), but the end of the loop is a KILLER.

HELP ME UNDERSTAND THIS (and fix it!!)

As we have no code to view and test, we cannot possibly hazard a guess.

Please give us the script or something similar so that we can emulate your problem.

OS, machine, bash version and any other details would be of help to us too...

TIA...

while [ true ]
do
    (copy a text file over if out of lines)

    while [ pointer is less than end of file ]
    do
        (spawn a job)   # <------ background (&) or foreground both take over 30-100 seconds to get past this
    done
done

---------- Post updated at 05:13 PM ---------- Previous update was at 05:08 PM ----------

Okay -- I posted the code, and here are some details. I'm running BASH on Redhat Linux 4.6 on an HP server, and I'm usually the only person running on the machine.

Thanks for the quick reply, BTW!! MUCH appreciated!

We are going to need to know what (spawn a job) actually is before we can tell you why it's taking 300 seconds to complete.

What the job does is take the line of the file that it's given as an argument and use it to make seven or eight data accesses. Right now they're stubs, but in the future they'll be accessing some hardware that may take a while. Hence, spawning the job to run in the background.

At first, I thought that running over 200 of these jobs was bogging down the machine. So I reduced the number, and reduced and reduced until I had a single job that made some accesses and exited. Still a huge delay at the end of the loop. But there's no testing there -- it's just supposed to return and run the loop again. The test for the line number in the file is in the inner loop. The outer loop continues reading a new copy of the external file *inside* the loop, not where the slowdown is occurring. Not knowing exactly what BASH is doing internally, I don't know what's going on at the head and footer of the loop. But it occasionally slows down a LOT.

And without seeing your actual bash script, we don't know what's going on at the head and footer of your loop.

I can imagine lots of things you could be doing that would cause the symptoms you're seeing. But instead of us guessing at what you're doing, why don't you actually show us your code so we can give you some input that might actually make a difference in the way your code runs???

I am using BASH 4.1.2 on the Linux machine.

Since I've said I spawn the job at that point in the background, but folks are saying it may still be responsible for the slowdown: is there some sort of delay in starting a job like that which could slow down the main loop, regardless of how long the job itself takes once it's running independently?

There are thousands of things that could cause this. SHOW US YOUR CODE!

1 Like

  4 INPUT_POINTER=0
  5 while [ true ]
  6 do
  7     if [ ${INPUT_POINTER} -ge ${SIZE} ]
  8     then 
  9         cp ${REMOTE_FILE} ${LOCAL_FILE}
 10         > ${REMOTE_FILE}
 11         SIZE=`wc -l ${LOCAL_FILE} | sed "s;  *;;" | sed "s; .*;;"`
 12         INPUT_POINTER=1
 13     fi  
 14     
 15     while [ ${INPUT_POINTER} -le ${SIZE} ]
 16     do
 17         INPUT_POINTER=$((INPUT_POINTER + 1));
 18         INPUT=`cat ${LOCAL_FILE} | head -${INPUT_POINTER} | tail -1`
 19         my_job "${INPUT}" &
 20     done # Reading File Lines
 21 done # while TRUE

I've tried redirecting the file into the "done" statement and using a "while read", which worked about the same.
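For reference, the while-read variant mentioned there would look roughly like this (a sketch, not the actual script): the shell's built-in read fetches each line, so there is no cat | head | tail pipeline per line.

# Sketch of the while-read variant: one pass over the file.
while read -r INPUT
do
    my_job "${INPUT}" &
done < "${LOCAL_FILE}"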

So, from this code it seems that your intention is to simultaneously run an infinite number of jobs for every line in some file on your system. You do not care if any of these jobs start successfully. You do not care if any of these jobs complete successfully (or at all). You believe that you should be able to start an infinite number of jobs and all of those jobs should run as though they are the only job running on your system.

Unfortunately, I do not know of any system that will act anything at all like that. Nor do I understand why you need an infinite loop to read lines in a file forever. Presumably your script terminates when the system kills it because it realizes you have exhausted system resources.

You say that a while read loop performs about the same way. You are correct in noting that an infinite loop is an infinite loop, but a while read loop would only have needed one process per line processed, while your current nested loops use four processes per line processed plus 3 processes each time the file is processed. And, a while read loop would read your file once each time you process the file, while your current code reads the entire file n+1 times if your file contains n lines.

If, instead of processing a single file an infinite number of times, you want to process each line in a file once; and if instead of ignoring the success or failure of all of the jobs you start, you'd like to actually log any failures that might occur during processing and not terminate your script until all lines have been processed, tell us more about your system:

  1. Given the expected load on your system and the number of processors available to run your script, how many simultaneous processes should your script expect to be able to run?
  2. What does my_job do? (If it doesn't exit with a zero exit status if it completes successfully and exit with a non-zero exit status if it does not complete successfully, rewrite it so it does! If it exits before all children it has started have completed, rewrite it so it doesn't return until all of its children have finished!)
1 Like

I apologize, Don. I didn't provide all the code because I thought it would obscure things, but it seems I've made things more complicated. I really appreciate your patience here.

First, I run "ps" and look for instances of "my_job" and have a maximum number that's checked before spawning another. I've run as many as 400 to stress things, and it worked (with the exception being the problem I'm talking about here, which doesn't seem to be affected at all by that number). I currently run a maximum of 20, but at this instant, for debugging purposes, I've set the limit at one. There is some proprietary stuff inside "my_job" that I'm hesitant to show (yes, I understand how difficult that makes this!).

As for reading the files continuously, the source of data is always on, populating the source text file. I copy the file over, erase the source copy, and then read each line until a counter exceeds the number of lines in the file, OR until the last line I read has a timestamp that is too old.

As for your suggestion to read the file a single time: yes, I used that successfully, and I only switched back because I suspected that that method (the one you recommend here) was causing my current problem.

Exit status: it executes an "exit 0" on success or failure, but the results are all echoed to a log file. The failures I check for are all from functions that read or write data to and from the hardware, but I still exit 0 and simply report the results of those functions.
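For reference, a minimal sketch of what Don asked for in his second point -- my_job reporting failure through its exit status while still logging everything -- might look like this (read_from_hardware, write_to_hardware and LOG_FILE are placeholders, not the proprietary code):

# Hedged sketch only -- placeholder names, not the real my_job.
my_job() {
    local line=$1 status=0

    read_from_hardware "$line" >> "${LOG_FILE}" 2>&1 || status=1
    write_to_hardware  "$line" >> "${LOG_FILE}" 2>&1 || status=1

    wait                 # don't return until any children my_job started have finished
    return "$status"     # non-zero means something failed
}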

Would you suggest waiting until the process count (of running "my_job" instances) dropped to some lower number or perhaps zero before fetching a new file full of records?

Thanks again!!
Mark

If you don't understand what's wrong, you also don't know what's relevant.

If you think the program is too big/complicated to post, whittle it down into a smaller, still-complete program which still shows the same problem. Sometimes just doing this can find the problem, too.

2 Likes

There are ways to limit the number of parallel processes for stable, robust system use; I've been there before and wrote bctl ("birth control") for exactly this: see the thread "Keep up constant number of parallel processes".

1 Like

Hi gmark99...

<Cough> It would be wiser to have my_job return an error code in real time, detect it in the main script, and still log it as before...
How about commenting out the line in the script that calls my_job, or placing an echo in its place?
Also try a sleep <secs> just before your spawned child. In other words, deliberately slow things down and see how the results compare to what you expect them to be.
LBNL, although you are launching your child in the background, try strategically placing a wait in the loop and observing what happens, or launch it without the & and let the shell temporarily hang until it is finished...
Just a few ideas to play with.
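Concretely, those experiments might look something like this inside the inner loop (a diagnostic sketch only, using the variable names from the posted code):

sleep 2                      # deliberate delay before the child is started
my_job "${CMD_INPUT}" &      # or drop the & to run it in the foreground
wait                         # park here until all background children finish

If the long pause still shows up with the spawn commented out, the delay is coming from somewhere else in the loop.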

2 Likes

I also apologize. I should have gone to bed at midnight this morning instead of trying to help you with your problem. I completely overlooked line #10 in your code, which wipes out the data that you have just copied (and occasionally one or more additional chunks of data that were added to the file between the time the cp on the previous line completes and the time the redirection wipes out the file you copied -- which, with 400 jobs running on your system, could be minutes later).

If you mean that you run ps somewhere in the 1st three lines of your script (which you stripped out of the code you showed us), that won't have any effect on the number of jobs started in the background on line 19 in the loop on lines 5 through 21.

If you mean that you run ps in my_job , that won't affect the number of jobs started in the background on line 19 in your script nor the speed with which they are spawned.

If you mean that you have another loop between lines 18 and 19 in the code you showed us that keeps you from getting to line 19 until some of your background jobs complete, that would be CRUCIAL information that completely changes the way your script works that you have hidden from us.

From what you have shown us, the only thing limiting the number of invocations of my_job that you try to run concurrently is the number of lines available to process in your input file and how fast your "producer" can write data into that file.

As I mentioned above, the way you are copying and erasing the source file will sometimes silently discard some data. But, since you discard data that is too old anyway (something else we can't see in your code), maybe it doesn't matter.
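If losing those lines ever does matter, one common pattern is to rename the file instead of copying and truncating it (a sketch only; it assumes the producer and this script share a filesystem, which may not hold with the sftp transfer mentioned elsewhere in the thread):

mv "${REMOTE_FILE}" "${LOCAL_FILE}"     # rename is atomic on a single filesystem
touch "${REMOTE_FILE}"                  # give the producer a fresh, empty file
SIZE=$(wc -l < "${LOCAL_FILE}")

Note that this only helps if the producer reopens the file for each write; if it holds the file open, it will keep appending to the renamed copy.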

I can assure you that that wasn't your problem unless the problem was that you ran out of disk space due to the size of the file or you exceeded the maximum file size that could be written by the process that is adding data to your source file. (And the description of the symptoms you have provided does not support either of these possibilities.)

You tell us that you limit the number of jobs you are running simultaneously, but you don't show us any code that suggests that this is true. From what you have shown us, there is a high likelihood that attempts to spawn my_job in the background will fail due to exceeding the number of processes a user is allowed to run at once. Since you never wait for any of your background jobs to complete and never check the status of any of your background jobs, you will never know how many attempts to start my_job failed (and in those cases, my_job can't possibly log the fact that it never started).

You have ignored my requests for information about the type of system you're using and the number of threads you might be able to run concurrently. Unless you have a massively parallel processing system, running 400 background jobs is much more likely to cause thrashing and scheduling problems than it is likely to improve throughput.

What you have shown us is logically equivalent to a script like this:

while [ true ]
do      sleep 1&
done

which will bring any system to its knees in seconds.
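For contrast, a hedged sketch of the same loop with a cap on running children, counted by asking the shell itself (jobs -pr lists the PIDs of this shell's running background jobs) instead of grepping ps:

MAX_JOBS=20                                  # illustrative cap
while [ true ]
do
    while [ "$(jobs -pr | wc -l)" -ge "${MAX_JOBS}" ]
    do
        sleep 1                              # wait for a slot to free up
    done
    sleep 10 &                               # stand-in for my_job "${INPUT}" &
done

On bash 4.3 or later, wait -n could replace the sleep-and-poll; the 4.1.2 shell mentioned earlier doesn't have it, so polling is the portable choice here.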

2 Likes

Hope this helps, Don

Let me know what else I can provide, such as more of what these called programs contain. Essentially, "my_job" runs for up to 90 seconds maximum and then deposits whatever results it has in its own file for retrieval by the "unloader".
The Garbage Collection routine periodically checks for "heartbeat" files that haven't been updated in several minutes, tries to kill the process whose PID is stored inside the file (if it's still alive), and then discards the file.

#!/bin/bash
_last_update="Tue Jan 14 12:39:23 CST 2015"
# Linux 2.6.32-431.29.2.el6.x86_64 #1 SMP Sun Jul 27 15:55:46 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
#
    COL_C_TIMESTAMP=1
    COL_C_COMMAND=2

###############################################################
# 3. INSTALLATION-DEPENDENT VARIABLES
###############################################################

### Place on WFE where all this cool stuff happens
# WFE_HOME=`pwd`
WFE_HOME=/home/gmark/rje

### Place on Splunk server where other cool stuff happens
# Server where WFE is running
WFE_SERVER=wfe.ready.com

WFE_CONTROL=${WFE_HOME}/op-control

### Message Buffer directory
WFE_MSGS=${WFE_HOME}/MSGS

# Scheduled Global Abate done today already?
SGA_STAT_FILE=${WFE_HOME}/wfe-sga-donetoday
echo NOT_DONE > ${SGA_STAT_FILE}

### Archive file of CSV commands for SIMULATOR
CSV_INPUT_ARCHIVE_FILE=${WFE_HOME}/csv-command-archive-file

# Common name of CSV file on both systems
CSV_NAME=work.csv

# Type of command used to transfer files
XCOMMAND=sftp

# Path to Heartbeat Timestamp file
HEARTBEAT_FILE=${WFE_HOME}/wfe-heartbeat
touch ${HEARTBEAT_FILE}

###############################################################
# 4. MASTER PROCESS CONTROL FILE READ
###############################################################
#

### WFE ROP used to log debug and for status information
WFE_2_ROP=${WFE_HOME}/wfe-ropfile

### WFE Logfile used for communication to Splunk
WFE_2_SPLUNK=${WFE_HOME}/wfe-logfile

# WFE Process ID used to enforce single System Process
WFE_PID_FILE=${WFE_HOME}/wfe-process-id

# Initialize Process ID to Enforce threading requirements
THIS_WFE_PID=$$
echo ${THIS_WFE_PID} > ${WFE_PID_FILE}

# Initial index of records in local CSV input file
CMD_INPUT_POINTER=9999999

# Initial size of Local CSV Command Buffer
LOCAL_CSV_SIZE=0

# Initial Assumed Oldest ALERT Timestamp
LOCAL_CSV_BIRTHDAY=0
CALC_TIMESTAMP=`date "+%s"`;

###############################################################
# 5. CHECK CLONE STATUS
###############################################################
#
while [ true ]
do
MASTER_WFE_PID=`cat ${WFE_PID_FILE}`
if [ ${THIS_WFE_PID} != ${MASTER_WFE_PID} ]
then
echo "...`date "+%Y-%m-%d %H:%M:%S"`: Execution Stopped ..." >> ${WFE_2_ROP};
exit 0
fi

###############################################################
# 6. WORK TO DO? IF NOT, GET SOME.
###############################################################

wfe_msg_unloader &

wfe_garbage_collection &

NOW_TIME=`date "+%s"`;
CSV_AGE=$(( ${NOW_TIME} - ${LOCAL_CSV_BIRTHDAY} ))

# Out of ALERTS? MOVE CSV from Splunk to WFE - purge any aging ALERTS
if [ ${CMD_INPUT_POINTER} -ge ${LOCAL_CSV_SIZE} -o \
    ${CSV_AGE:=0} -gt ${MAX_ALERT_REQ_AGE} ]
then
    touch ${HEARTBEAT_FILE}
    cat /home/gmark/rje/COMMANDS.csv | grep "A[BL][AE]" > ${LOCAL_CSV}
    > ${REMOTE_CSV}
    LOCAL_CSV_SIZE=`wc -l ${LOCAL_CSV} | sed "s;^ *;;" | sed "s; .*;;"`
    LOCAL_CSV_BIRTHDAY=${NOW_TIME}
    CMD_INPUT_POINTER=0
fi

# while read CMD_INPUT            # disabled while-read version; its matching "done" is commented out below
while [ true ]
do

###############################################################
# 8. VERIFY RUN STATUS, ELSE RESET NOW TIMER
###############################################################

    touch ${HEARTBEAT_FILE}

    # An external "control" file with RUN=YES or RUN=NO to turn this off
    RUN=`wfe_set_control RUN YES`
    if [ ${RUN} != YES ]
    then
        echo "... ${NOW_TIME}: RUN=${RUN}: Execution Stopped by Request ..." >> ${WFE_2_ROP}
        exit 0
    fi

    CMD_INPUT_POINTER=$(( ${CMD_INPUT_POINTER} + 1 ))

    NOW_TIME=`date "+%s"`;
    CSV_AGE=$(( ${NOW_TIME} - ${LOCAL_CSV_BIRTHDAY} ))

    if [ ${CSV_AGE:=0} -gt ${MAX_ALERT_REQ_AGE} ]
    then
        > ${LOCAL_CSV}
    fi

    # This allows "read" statements to be placed in the loop for debugging
    CMD_INPUT=`cat ${LOCAL_CSV} | head -${CMD_INPUT_POINTER} | tail -1`

    touch ${HEARTBEAT_FILE}

    # Better ways to do this, but none as dependable
    C_COMMAND="`echo ${CMD_INPUT} | cut -d, -f${COL_C_COMMAND}`"

    if [ ${C_COMMAND}x == ALERTx -o ${C_COMMAND}x == ABATEx ]
    then
        echo ${CMD_INPUT} >> ${CSV_INPUT_ARCHIVE_FILE}
        C_TIMESTAMP="`echo ${CMD_INPUT} | cut -d, -f${COL_C_TIMESTAMP}`"

        # Heartbeat file checked by another process to make sure this is still running
        touch ${HEARTBEAT_FILE}

        # Uses Modulus function to do only periodic calls to the Unloader
        # The Unloader checks for completed output files to forward to user
        if [ $(( ${CMD_INPUT_POINTER} % ${MAX_UNLOADER_DELAY} )) == 0 ]
        then
            wfe_msg_unloader &
        fi

        NUM_PROCS=`ps -u root | grep wfe_voice_ | wc -l`

        if [ ${NUM_PROCS} -lt ${MAX_NUM_PROCS} ]
        then
            my_job "${CMD_INPUT}" &
        else
            sleep 1
        fi
    else
        echo at ${LINENO} BOGUS COMMAND - SKIPPED >> ${WFE_2_ROP};
    fi

# done < ${LOCAL_CSV}
done # Test Version for setting breakpoints

done # while TRUE

Append a:

sleep 3

before the final done, as already mentioned a few times.

As of now, during each loop you spawn 'check-jobs' into the background, ignoring whether or not they have even finished, while starting new jobs in the same loop..
As already said, even while [ true ]; do sleep 1 & done can bring a machine to its knees; imagine what background jobs that actually do something will do...

hth

PS:
You might want to have a look at: [BASH] Script to manage background scripts (running, finished, exit code)
The mods were kind and provided several working scripts.
And on the 3rd page, its (currently) last post shows my solution using TUI, which runs multiple scripts in the background, limits the number of allowed scripts, and reports their exit status.
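The core of that approach (sketched here from the general idea, not the actual scripts in that thread) is to remember each background PID and wait on it, since wait PID returns that particular job's exit status:

# Sketch: collect each child's exit status instead of fire-and-forget.
pids=()
while read -r CMD_INPUT
do
    my_job "${CMD_INPUT}" &
    pids+=("$!")                         # remember this child's PID
done < "${LOCAL_CSV}"

for pid in "${pids[@]}"
do
    wait "${pid}" || echo "job ${pid} failed" >> "${WFE_2_ROP}"
done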

Yes, Don, it helps a LOT.

Now, this is the code that checks for existing processes (the job name is "my_job") and only sleeps if the number hits MAX_NUM_PROCS (which has been set as high as 400, but is now set at 2).

Does this work?

        NUM_PROCS=`ps -u root | grep my_job | wc -l`

        if [ ${NUM_PROCS} -lt ${MAX_NUM_PROCS} ]
        then
            my_job "${CMD_INPUT}" &
        else
            sleep 1
        fi
    else

---------- Post updated at 01:28 PM ---------- Previous update was at 01:26 PM ----------

Someone asked what "heartbeat" did, since I only "touch" it. I check that with a background watchdog process that just sees if it's been touched recently, and if not, assumes this process isn't well, kills it if it exists, and replaces it.

Make sense?

That was me, and I removed it because you had already answered it, but I had overlooked that.
Though it's not clear to me how it would identify which process to kill, as the file is just touched and you spawn multiple jobs without 'saving' their corresponding PIDs.
(edit: Unless that is handled in that other script.)

OK, so you want to check whether enough processes have already been started; the issue is, MAX_NUM_PROCS is not set anywhere in the code you posted.
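As an aside, if the count really has to come from the process table, pgrep is usually less fragile than ps | grep | wc -l. A sketch (the pattern, user and limit are illustrative, and MAX_NUM_PROCS still has to be set somewhere):

MAX_NUM_PROCS=20
NUM_PROCS=$(pgrep -c -u root -f my_job)      # count running my_job instances
if [ "${NUM_PROCS:-0}" -lt "${MAX_NUM_PROCS}" ]
then
    my_job "${CMD_INPUT}" &
else
    sleep 1
fi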

1 Like

Okay -- I've seen a few references to process-limiting methods. What is "bctl" and how would I use it in my situation? Is my approach of using "ps" and grepping for the function name unusable? Or perhaps something like "bctl" just does it better?

Thanks again!

---------- Post updated at 01:59 PM ---------- Previous update was at 01:57 PM ----------

sea -- thanks again!

The "my_job" function keeps the PID in its own "heartbeat" file, so when the garbage collection routine comes around, it sees how long that file's been untouched, and then tries to kill "old" jobs using the contained PID.

Again, if there's a better way to do this, please, feel free to straighten me out!
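For completeness, a minimal sketch of that kind of garbage-collection pass (the heartbeat.* naming, the 5-minute threshold, and the PID-on-the-first-line layout are assumptions here, not the actual wfe_garbage_collection):

# Sketch: kill and clean up jobs whose heartbeat files have gone stale.
for hb in "${WFE_HOME}"/heartbeat.*          # hypothetical naming scheme
do
    [ -e "${hb}" ] || continue               # glob matched nothing
    if [ -n "$(find "${hb}" -mmin +5)" ]     # untouched for over 5 minutes?
    then
        pid=$(head -1 "${hb}")               # PID stored on the first line
        kill -0 "${pid}" 2>/dev/null && kill "${pid}"
        rm -f "${hb}"
    fi
done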