Issue with tracking successful completion of Child process running in background

Hello All,

I am using Linux. I have two scripts:

  1. inner_script.ksh
  2. main_wrapper_calling_inner.ksh

Below is the code snippet of the main_wrapper_calling_inner.ksh:

#!/bin/ksh

ppids=() ---> Main array for process ids.
fppids=() ---> array to capture failed process ids.
pcnt=0 ---> success count
fpcnt=0 ---> fail count

echo ""
start_time=`date '+%Y/%m/%d:%H:%M:%S'`


for file in `cat ${CONFIG_DIR}/abc.txt`
do
        table_name=`echo ${file}`

        nohup ksh ${BIN_DIR}/inner_script.ksh ${table_name}>${LOG_DIR}/${table_name}_inner_script_${curr_date}.log &
        ppids+=($!)
        echo "Process ID:       $!."
        echo ""
        echo "Log File for ${table_name} is: ${LOG_DIR}/${table_name}_inner_script_${curr_date}.log"
done

echo ""
echo ""
echo "Starting Checking the process completion of all tables:"

export tot_table_cnt=`wc -l ${CONFIG_DIR}/abc.txt|awk '{ print $1 }'`
echo ""
echo "Total Number of tables: ${tot_table_cnt}."

while [ ${pcnt} -lt ${tot_table_cnt} ]; do
ptmp=()
   for p in ${ppids[@]}
   do
        if [[ ! -d /proc/${p} ]]; then
                wait ${p}
                sts=$?
                if [[ $sts -eq 0 || $sts -eq 127 ]]; then
                        echo "Process completed with Process ID ${p}; exit code: $sts; at `date '+%Y/%m/%d:%H:%M:%S'`"
                        pcnt=`expr $pcnt + 1`
                else
                        echo "Process failed for Process ID: ${p}"
                        index=`echo ${ppids[@]/$p//}|cut -d/ -f1 |wc -w |tr -d ' '`
                        unset ppids[$index]
                        pcnt=`expr $pcnt + 1`
                        fpcnt=`expr $fpcnt + 1`
                        fppids+=(${p})
                fi

        else
                zombie_lst=$(ps axo pid=,stat= | awk '$2~/^Z/ { print $1 }'|grep "$p")
                if [[ -z ${zombie_lst} ]]; then
                         ptmp+=(${p})
                else
                        wait ${p}
                        sts=$?
                        if [[ $sts -eq 0 || $sts -eq 127 ]]; then
                                echo "Process completed with Process ID ${p}; exit code: $sts; at `date '+%Y/%m/%d:%H:%M:%S'`"
                                pcnt=`expr $pcnt + 1`
                        elif [[ $sts -ne 0 || $sts -ne 127 ]]; then
                                echo "Process failed for Process ID: ${p}"
                                index=`echo ${ppids[@]/$p//}|cut -d/ -f1 |wc -w |tr -d ' '`
                                unset ppids[$index]
                                pcnt=`expr $pcnt + 1`
                                fpcnt=`expr $fpcnt + 1`
                                fppids+=(${p})
                        else
                                kill -TERM ${p}
                                index=`echo ${ppids[@]/$p//}|cut -d/ -f1 |wc -w |tr -d ' '`
                                unset ppids[$index]
                                pcnt=`expr $pcnt + 1`
                        fi
                fi

        fi

   done
   ppids=(${ptmp[@]})

done


if [[ $pcnt -eq ${tot_table_cnt} ]]; then
        echo "process for all tables is complete for ${curr_date}."

        if [[ $fpcnt -eq 0 ]]; then
                echo ""
                echo "process is successfully completed for all Tables."
                echo "DONE file is touched."
                touch ${TEMP_DIR}/inner_script_completion.done
                echo ""
        else
                echo ""
                echo "process failed for ${fpcnt} tables."
                echo "Failed Process IDs are ${fppids[@]}."
                echo "DONE File is not touched in ${TEMP_DIR} path. Need to verify or re-run the process manually."
        fi
fi

The config file abc.txt has newline-separated values like:

a
b
c
d

Below is the code snippet for inner_script.ksh:

#!/bin/ksh
nohup hive -S -e "do something;" &
pid=$!
wait $pid
status=$?
if [[ $status -eq 0 ]]; then
   echo "Success"
   exit 0
else
   echo "failure"
   exit 1
fi

Scenario:
I am trying to execute inner_script.ksh in parallel for each of the values in the config file abc.txt, and to track the completion of each child process. I want the total execution time to be roughly the maximum execution time of any single child process.

Problem:

  1. The parent is unable to track the successful completion of some of the child processes; at least one child process becomes a zombie (defunct).
  2. I am using zombie_lst=$(ps axo pid=,stat= | awk '$2~/^Z/ { print $1 }'|grep "$p") to identify whether a child has become a zombie, and then I try to wait on it. Does wait work on a zombie process?
  3. At the end, I do a kill -TERM ${p} if the child has become a zombie. Does this kill -TERM ${p} actually kill the process?

Kindly suggest.

Running extra processes is a sure way to slow down execution.

By definition, a zombie process is a process that has terminated and still exists only because its parent has not yet waited for it to gather its exit status. Killing a zombie will not have any effect on that zombie. The only things that will cause a zombie to disappear are:

  1. for its parent to wait for it,
  2. for its parent to terminate, for the zombie to be adopted by the system's zombie collector (known as init on many systems), and for the zombie collector to wait for it, or
  3. rebooting the system.

Having zombies around doesn't have any appreciable effect on a system unless the system's process table is almost full (and if that is a problem for you, all of the extra processes you are creating looking for zombies and rearranging the array of running background jobs will be more of a problem).
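To see this concretely, here is a minimal sketch (plain bash/ksh, not your script): the child terminates immediately, a signal sent to the dead child accomplishes nothing, and only wait actually reaps it and recovers its exit status:

```shell
# The child terminates at once; its exit status is held for us until we
# collect it with wait.
sleep 0 &
pid=$!
sleep 1                               # give the child time to terminate

kill -TERM "$pid" 2>/dev/null || :    # signalling the dead child: a no-op
wait "$pid"                           # this is what actually reaps it
echo "reaped $pid with status $?"
```

The wait builtin returns the remembered status (0 here) even though the child exited a full second earlier, which is why the polling-and-killing machinery in the original script is unnecessary.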

In your code sequence:

                        if [[ $sts -eq 0 || $sts -eq 127 ]]; then
                                echo "Process completed with Process ID ${p}; exit code: $sts; at `date '+%Y/%m/%d:%H:%M:%S'`"
                                pcnt=`expr $pcnt + 1`
                        elif [[ $sts -ne 0 || $sts -ne 127 ]]; then
                                echo "Process failed for Process ID: ${p}"
                                index=`echo ${ppids[@]/$p//}|cut -d/ -f1 |wc -w |tr -d ' '`
                                unset ppids[$index]
                                pcnt=`expr $pcnt + 1`
                                fpcnt=`expr $fpcnt + 1`
                                fppids+=(${p})
                        else
                                kill -TERM ${p}
                                index=`echo ${ppids[@]/$p//}|cut -d/ -f1 |wc -w |tr -d ' '`
                                unset ppids[$index]
                                pcnt=`expr $pcnt + 1`
                        fi

There is absolutely no way that you will ever execute the else clause. If $sts expands to 0 or to 127, you will execute the first then clause. Otherwise (since we already know that $sts does not expand to 0), the first half of the || in the elif double-square-bracket expression must be true, and the second then clause will be executed.
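A quick way to convince yourself of this: no value of sts can be equal to both 0 and 127 at once, so at least one side of the || is always true:

```shell
# [[ $sts -ne 0 || $sts -ne 127 ]] is a tautology: whatever sts holds,
# it differs from 0 or it differs from 127 (or both).
for sts in 0 1 127 255; do
    if [[ $sts -ne 0 || $sts -ne 127 ]]; then
        echo "sts=$sts: elif test is true"
    fi
done
```

All four iterations print, including sts=0 and sts=127, so the trailing else (and the kill -TERM in it) is dead code.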

All of the code looking for zombies and trying to reap the first zombie found is a waste of CPU cycles and memory that could be better spent running the remaining background jobs (and, as noted before, killing zombies the way you are trying to kill them is just an expensive no-op).

Using nohup to run a shell script that does nothing but nohup another job, echo one word, and exit triples the number of processes you need for each background job. This might or might not be part of your problem, but it certainly won't help you. And starting unneeded processes slows down everything running on your system.

Showing us code with syntax errors and undefined variables, and not telling us how many jobs you are trying to run in parallel, makes it hard for us to give any firm suggestions on how to fix your code (or even to determine what might be wrong), but you might consider replacing the two scripts you showed us with a single script similar to the following:

#!/bin/ksh

# Note that the following four lines all fail with syntax errors...
ppids=() ---> Main array for process ids.
fppids=() ---> array to capture failed process ids.
pcnt=0 ---> success count
fpcnt=0 ---> fail count

export tot_table_cnt=0

echo
# start_time=$(date '+%Y/%m/%d:%H:%M:%S')  # Commented out: not used.
# $CONFIG_DIR is used but not set.
# $curr_date is used but not set.
# $LOG_DIR is used but not set.
# $TEMP_DIR is used but not set.


while read -r table_name
do	nohup hive -S -e "do something;" > "${LOG_DIR}/${table_name}_inner_script_${curr_date}.log"&
        ppids+=($!)
        echo "Process ID:       $!."
        echo
        echo "Log File for ${table_name} is: ${LOG_DIR}/${table_name}_inner_script_${curr_date}.log"
	tot_table_cnt=$((tot_table_cnt + 1))
done < "$CONFIG_DIR/abc.txt"

echo
echo
echo 'Starting Checking the process completion of all tables:'
echo
echo "Total Number of tables: ${tot_table_cnt}."

for p in "${ppids[@]}"
do	wait $p
	sts=$?
	if [[ $sts -eq 0 ]]
	then	pcnt=$((pcnt + 1))
		echo Success
		echo "Process completed with Process ID ${p}; exit code: 0; at $(date '+%Y/%m/%d:%H:%M:%S')"
	else	echo failure
		echo "Process failed for Process ID: ${p}; exit code: $sts; at $(date '+%Y/%m/%d:%H:%M:%S')"
		fpcnt=$((fpcnt + 1))
		fppids+=(${p})
	fi
done
echo "process for all tables is complete for ${curr_date}."

if [[ $fpcnt -eq 0 ]]
then	echo
	echo 'process is successfully completed for all Tables.'
	echo 'DONE file is touched.'
	touch ${TEMP_DIR}/inner_script_completion.done
	echo
else	echo
	echo "process failed for ${fpcnt} tables."
	echo "Failed Process IDs are ${fppids[@]}."
	echo "DONE File is not touched in ${TEMP_DIR} path. Need to verify or re-run the process manually."
fi

I have no idea what hive is supposed to do. I have no idea how the variables CONFIG_DIR, curr_date, LOG_DIR, and TEMP_DIR (which are all used by your script but never initialized) are supposed to be set. So, obviously, the above script is totally untested. But it should complete the background jobs you start faster than your current script does (assuming that there aren't any other users consuming the cycles freed up by this simplified version of your code).

Hi.

Apologies for the late post.

After a quick skim of this thread, I'd say GNU parallel might be useful.

Best wishes ... cheers, drl

Could you please elaborate a little more on GNU Parallel?

Is there any sample code snippet you have for this?

Thanks in advance.

Your inner script looks kind of funny: you start a command in the background and then immediately wait for it to end. Maybe you have your reasons, but wouldn't

if hive ... # (without &)
then
     echo Success
     exit 0
fi
echo Failure
exit 1

have done the same with less fuss?

Juha


Also, aren't you making things too complicated? How about:

for i in 1 4 2 3
do
        (
        echo Scripted error message >&2
        ls nonexistent
        sleep $i
        ) 1> out_$i 2> err_$i &
done

wait # all of them

for i in 1 4 2 3
do
        if [ -s err_$i ]
        then
                echo Check errors in err_$i
        fi
done
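A variant of the same idea that keeps each child's exit status instead of checking error files (the jobs here are toy stand-ins): remember every pid, then wait on each one individually. A wait on a specific pid returns that child's status even if it terminated long before you get to it:

```shell
pids=()
for i in 1 2 3 4; do
    ( test "$i" -ne 2 ) &          # toy background job; the i=2 job fails
    pids+=($!)
done

fail=0
for p in "${pids[@]}"; do
    # wait on a specific pid returns that child's exit status, even if
    # the child terminated (and sat unreaped) long before this point
    if ! wait "$p"; then
        echo "pid $p failed"
        fail=$((fail + 1))
    fi
done
echo "$fail job(s) failed"
```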

Hi.

This script drives the script inner 4 times in parallel with values from input file data1:

#!/usr/bin/env bash

# @(#) s1       Demonstrate executing one script several times simultaneously, parallel.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C parallel

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

pl " Script to be run in parallel, \"inner\":"
cat inner

pl " Results:"
cat $FILE |
parallel -j4 ./inner {}

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
parallel GNU parallel 20130922

-----
 Input data file data1:
a
b
c
d

-----
 Script to be run in parallel, "inner":
#!/usr/bin/env bash

# @(#) inner    Demonstrate one process to be run.

echo " Hello world from $$ with input values" \""$*"\"

exit 0

-----
 Results:
 Hello world from 4452 with input values "a"
 Hello world from 4453 with input values "b"
 Hello world from 4454 with input values "c"
 Hello world from 4455 with input values "d"

The heart of the solution is the line:

parallel -j4 ./inner {}

which runs the inner script up to 4 times in parallel, each process with a data item from a line in file data1, which in turn is supplied to parallel by the cat command earlier in the pipeline.
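If you also need the per-table failure tracking the original post asks for, parallel can record that for you with its --joblog option, which writes a header line and then one line per job, with the job's exit value in field 7. A sketch, assuming GNU parallel is installed (the else branch just fabricates a joblog with the same column layout so the snippet runs anywhere; the jobs themselves are toy stand-ins):

```shell
printf '%s\n' a b c d > data1             # stand-in for abc.txt

if command -v parallel >/dev/null 2>&1; then
    # One job per input line, up to 4 at a time; jobs.log records each
    # job's sequence number, runtime, and exit value.
    parallel -j4 --joblog jobs.log 'test {} != c' :::: data1 || :
else
    # No GNU parallel here: fake a joblog with the same column layout.
    echo 'Seq Host Starttime JobRuntime Send Receive Exitval Signal Command' > jobs.log
    n=0
    while read -r x; do
        n=$((n + 1))
        if test "$x" != c; then s=0; else s=1; fi
        echo "$n : 0 0 0 0 $s 0 test" >> jobs.log
    done < data1
fi

# Report the sequence numbers of the jobs that exited non-zero.
awk 'NR > 1 && $7 != 0 { print "failed job seq:", $1 }' jobs.log
```

In this toy run only the job for input "c" fails, so only sequence number 3 is reported; in the real script the failing sequence numbers map back to lines of abc.txt.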

For documentation, see man parallel and the GNU Parallel home page.

Best wishes ... cheers, drl