Help needed on restart-from-point-of-failure in Parallel Processing

Hi Gurus,
Good morning... :slight_smile:
OS Info:
Linux 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

I have a script which takes multiple parameters from a properties file one by one and runs a job for each of them in the background (to do parallel processing). For example:

$ cat properties.file
Account
Customer
Address
Phone

part of customCombinedScripts.sh

 
  
# initialise (may already be done earlier in the full script)
dpPIDs=""
failCnt=0
while IFS= read -r fileLines; do
    /path/to/product/script/data_process "${fileLines}" || exit -1 &
    dpPIDs+=" $!"
done < properties.file
## logic to check background job status (success or failed)
for chkPIDs in $dpPIDs; do
    if ! wait $chkPIDs; then
        failCnt=$((failCnt + 1))
    fi
done
if [ $failCnt -gt 0 ]; then
    echo "[$(date)][FATAL] There are errors in Data Processing, total number of DP jobs failure is $failCnt. Check log."
    exit -1
else
    echo "[$(date)][SUCCESS] Data Processing for received files has been completed..."
fi
  
 

Now my requirement is: suppose "Address" fails. How can I restart the script so that it processes only the failed parameters (i.e. Address)?

I have already implemented the restart-from-point-of-failure concept in another script of mine, which uses "Sequential processing", with help from this article posted by Corona688: Resume from last failed command - Page 3

How can I implement the same concept with "Parallel Processing"?

Kindly help / provide ideas.

Cheers,
Saptarshi

As a quick first pass, I'd suggest that your line that ends with || exit -1 & will actually put an exit -1 into the background if the data_process call fails (returns non-zero) but doesn't cause the overall call to data_process code to run in the background.

Have I got that wrong?

You might want this:-

(/path/to/product/script/data_process "${fileLines}" || exit -1) &

It might be simpler to have a directory called 'Running' that you create a marker file in when you start a job and only remove when you successfully exit. That way, you could write something to read the directory contents for a restart.

Would that logic help?

Robin
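The marker-file idea could be sketched roughly as below. This is only an illustration under assumptions: the data_process function is a hypothetical stand-in that fails for "Address", the work directory comes from mktemp, and each parameter is assumed to be usable as a file name.

```shell
#!/bin/bash
# Hypothetical stand-in for the real data_process: fails only for "Address".
data_process() { [ "$1" != "Address" ]; }

workDir=$(mktemp -d)          # assumption: scratch location for this demo
runDir="$workDir/Running"     # marker directory, as suggested above
mkdir -p "$runDir"
printf '%s\n' Account Customer Address Phone > "$workDir/properties.file"

while IFS= read -r param; do
    : > "$runDir/$param"      # create the marker before the job starts
    # Remove the marker only when the job succeeds.
    ( data_process "$param" && rm -f "$runDir/$param" ) &
done < "$workDir/properties.file"
wait                          # let every background job finish

# Whatever is still in Running/ failed or never completed.
leftover=$(ls "$runDir")
echo "to restart: $leftover"
```

On a restart, the script could loop over the file names left in Running instead of reading properties.file.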

Thank you rbattle1,

That makes sense, and I have corrected the script (though it was running fine earlier, and I don't know how). :confused:

Currently I'm trying to retrieve the command line of each failed job, using the sample test script below:
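On why the earlier version ran fine: as far as I know, in bash and other POSIX shells a trailing & applies to the whole "cmd || exit" list, so both forms run data_process in a background subshell and the exit only ends that subshell. A small check, using false as a stand-in for a failing data_process and an arbitrary exit code of 3:

```shell
#!/bin/bash
# "false" stands in for a failing data_process call (assumption for the demo).
false || exit 3 &
pid=$!

# If only the exit had been backgrounded, or if it had hit the parent,
# we would not get here with the list's exit status available.
wait "$pid"
status=$?
echo "parent still running; background list exited with $status"
```

The explicit parentheses do no harm and make the intent easier to read.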

$ cat bgPIDTest.ksh
# Some function that takes a while to process
longprocess() {
        # Sleep for 0 to 4 seconds
        sleepTime=$((RANDOM % 5))
        # Randomly exit with 0 or 1
        exitCode=$((RANDOM % 2))
        echo "sleeping for: $sleepTime with exit code: $exitCode "
        sleep $sleepTime
        exit $exitCode
}
pids=""
failCnt=0
# Run six concurrent processes
for i in 1 2 3 4 5 6; do
        ( longprocess ) &
        # store PID of process
        pids+=" $!"
        echo PID $pids
done

# Wait for all processes to finish, will take max 4s
echo "initial failCnt is $failCnt"
for p in $pids; do
        if ! wait $p; then
            cmdJobNM=`ps -p $p -o command=`
            failCnt=`expr $failCnt + 1`
            echo "failed command is --> $cmdJobNM, PID: $p"
        fi
done
echo "total failCnt is $failCnt"
if [ $failCnt -gt 0 ]; then
    exit -1
fi

Output:

$ ksh bgPIDTest.ksh
PID 15858
sleeping for: 2 with exit code: 1
PID 15858 15859
sleeping for: 4 with exit code: 0
PID 15858 15859 15860
sleeping for: 0 with exit code: 0
PID 15858 15859 15860 15861
sleeping for: 3 with exit code: 0
PID 15858 15859 15860 15861 15862
sleeping for: 3 with exit code: 1
PID 15858 15859 15860 15861 15862 15863
initial failCnt is 0
sleeping for: 1 with exit code: 0
failed command is --> , PID: 15858
failed command is --> , PID: 15862
total failCnt is 2

I'm still not getting the command. My plan was to extract the passed argument from it with awk and write it to a file. The logic on restart would then be:
if this new file exists
    use this new file
else
    use the old config file

Once the run completes successfully, I'll remove the error file (if any). Please suggest whether this is feasible.

Cheers,
Saps.
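The restart logic described above might look roughly like this. It is a sketch only: the name properties.failed is hypothetical, the files live in a mktemp directory for the demo, and the job-running part is left as a placeholder.

```shell
#!/bin/bash
workDir=$(mktemp -d)                      # assumption: scratch area for the demo
configFile="$workDir/properties.file"
failedFile="$workDir/properties.failed"   # hypothetical name for the error file
printf '%s\n' Account Customer Address Phone > "$configFile"
printf '%s\n' Address > "$failedFile"     # pretend a previous run left this behind

# Use the error file if it exists and is non-empty, else the normal config.
if [ -s "$failedFile" ]; then
    inputFile=$failedFile
else
    inputFile=$configFile
fi
echo "processing from: ${inputFile##*/}"

# ... start one background job per line of "$inputFile" and count failures ...
failCnt=0                                 # pretend every job succeeded this time

# Remove the error file only after a fully successful run.
if [ "$failCnt" -eq 0 ]; then
    rm -f "$failedFile"
fi
```

If the restart itself can fail, it is probably safer to collect the new failures in a temporary file and move it over the error file at the end, so that the file is not rewritten while it is still being read.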

The ps utility only returns information about currently active processes; not those that have exited and been reaped.

If you would tell us what shell and what version of that shell you're using, we could make suggestions about ways to store information about the commands you are running in the background along with the PIDs of those commands.
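Since ps cannot report on a process that has already exited, one option is to record the parameter yourself at the moment each job is started. A minimal sketch, assuming bash 4 or later for associative arrays (ksh93 offers typeset -A for the same purpose); data_process here is a hypothetical stand-in that fails only for "Address":

```shell
#!/bin/bash
# Hypothetical stand-in for the real data_process: fails only for "Address".
data_process() { [ "$1" != "Address" ]; }

declare -A jobParam       # PID -> parameter; needs bash 4+ (ksh93: typeset -A)
failed=""

for param in Account Customer Address Phone; do
    data_process "$param" &
    jobParam[$!]=$param   # remember what this PID was started with
done

# wait still returns the exit status of a child that has already finished.
for pid in "${!jobParam[@]}"; do
    if ! wait "$pid"; then
        failed+="${jobParam[$pid]} "
        echo "failed parameter: ${jobParam[$pid]}, PID: $pid"
    fi
done
echo "failed list: $failed"
```

The failed list could then be written, one parameter per line, to the file that a restart reads.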

Hi Don,

Please find the requested info below:

Sh version:

$ sh --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

Ksh version:

 $ ksh --version
  version         sh (AT&T Research) 93u+ 2012-08-01

 

I gave both because my main script is ksh and all the product scripts are sh.

Please let me know if you need any further information.

Cheers,
Saps.