Shell script runs fine in Solaris, in Linux hangs at wait command

aksaravanan · October 27, 2010, 2:34pm

Hi,

I have a strange problem. A shell script that runs fine on Solaris started hanging when I ported it to Linux.

Here is the core of the script:

CFG_FILE=tab25.cfg
sort -t "!" -k 2 ${CFG_FILE} | egrep -v "^#|^$" |  while IFS="!" read  a b c
do
#echo "jobs output"
#jobs
#echo "jobs -p  "
#jobs -p
# echo "Before entering parallel process  $$ "
#                jobs  -l
#if [[ `jobs | wc -l` -ne 0 ]] ; then
#               jobs|wc -l
#                ps -ef|grep "`jobs -p`"
#fi
 
wait
done

My CFG_FILE contains this line:

this_is_a_test_ttt! lask;djfl;a ljkasfl; as l;jsladfjasldfj aslk;fjas flj;asf fABCDEFGHIJ KLMNOP ABCDELKJ :LKJSDFLKJSDFKLJDFKDJF  DFKSDHFIUI JSF SIDFJISDFISJDIF  ! lkjasdflkjasf

repeated 32 times.

If I uncomment the body of the while loop, here is the output (it shows what "wait" is waiting for):

[2] +  Running                 <command unknown>
[1] -  Running                 <command unknown>
jobs -p  
28418

Before entering parallel process 28416

[2] + 28418      Running                 <command unknown>
       1 
myid   28418 28416  0 14:26 pts/31   00:00:00 egrep -v ^#|^$
myid   28421 28416  0 14:26 pts/31   00:00:00 ps -ef
myid   28422 28416  0 14:26 pts/31   00:00:00 grep 28421?28418

What causes this script to wait for the while loop's grep command on Linux, whereas on Solaris it is fine?

To give further details, I am supposed to start a number (8) of background processes for each line read from the file. So I introduced a wait command; when they all complete, I read the next line, and so on.
Since wait was really waiting for the grep and not for the spawned children, my script broke.

Other cases where it runs fine on Linux:

fewer lines in the config file
in one test, I shrank one row's length from 260 to 250 characters and it worked.

I can't find a consistent condition that triggers this behaviour.

Any help in fixing or explaining this behaviour is appreciated.

thanks
AK

rbatte1 · October 27, 2010, 3:18pm

I regret that I can't decipher what you are actually trying to achieve. However, my bet is that you have a background task still hanging around from earlier testing, and the wait command is watching for all background processes to finish (jobs 1 & 2 running an unknown command). Try starting a new session to make sure that is not your problem.

I hope that this helps.

Robin
Blackburn/Liverpool
UK

jlliagre · October 27, 2010, 5:38pm

What shells are you using on GNU/Linux and on Solaris?

aksaravanan · October 28, 2010, 11:34am

My apologies for not using code tags, and thanks for adding them.

On both machines I use ksh. My program starts with #!/bin/ksh, so that should override whatever shell is otherwise in use.

As for the scenario, Robin's guess is right.
Here is what I am doing:

 
 
while read table server_list
do
    for server in $server_list
    do
        do_something "$table" "$server" &    # run in background
    done
    wait    # wait for all my background jobs to finish
    echo "now I can go and read the next line"
done

Now in my case, wait started waiting for a zombie (the unknown command, which turned out to be the grep or the sort) and never came back to read the next line.

When I was trying to simplify the problem I removed the inner for loop, so my question looks awkward (stupid?).
Technically, when I have a while loop with only a wait command in it, it shouldn't wait. Of course, Linux does something in the background for while loops.

I even used an intermediate file for the sort and grep output and passed that to the while loop, but wait continues to hang.

Again, when I started with only a few lines in my config file, it worked fine on Linux too. When I added more and more lines, the problem appeared.

At one point, when I reduced the row length from 260 characters to 250, it worked too. But the script can still work with lines longer than 250 characters when there are fewer lines, so I can't say this is the only scenario in which the problem shows up.

Could somebody help me understand this?

thanks
AK

jlliagre · October 28, 2010, 1:59pm

What ksh implementation are you using on GNU/Linux, the real one or pdksh?

aksaravanan · October 28, 2010, 2:46pm

I don't know whether I am using pdksh or the real ksh (I had never heard of pdksh before).
Anyway, here is my output:
/bin/ksh --version
version sh (AT&T Labs Research) 1993-12-28 n+

seems to be real?

thx

jlliagre · October 28, 2010, 3:55pm

Yes, that's the real one. What version on the Solaris side ?

DGPickett · October 28, 2010, 4:02pm

I recall on one O/S the ksh would run ksh scripts in the same pid, so you would see interactive background with wait, but in Solaris it is always a child process, and waits only for its own children. I ended up putting () around lines 2-$ to keep my environment from getting scrambled by my scripts.

aksaravanan · October 28, 2010, 7:09pm

SunOS xxxxxx 5.8 Generic_117350-25 sun4u sparc SUNW,Ultra-80

Linux yyyyyy 2.4.21-47.0.1.ELhugemem #1 SMP Fri Oct 13 17:48:02 EDT 2006 i686

---------- Post updated at 02:45 PM ---------- Previous update was at 02:42 PM ----------

I tried enclosing just the while loop, as well as lines 2-$, within (); the behaviour doesn't change.

---------- Post updated at 03:09 PM ---------- Previous update was at 02:45 PM ----------

OK, I found a workaround for this problem.

I started collecting the PIDs of all my child jobs and waiting only for those PIDs. This resolved my zombie wait.

while loop
do
child_pids=

for loop
do
bg_work &
child_pids="$child_pids $!"
done

wait $child_pids
done
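
For anyone who wants to try the workaround above, here is a minimal runnable sketch of it. bg_work, run_table and the two sample input lines are made-up placeholders, not the real job or config file; the only point is that wait is handed the PIDs collected from $! and nothing else.

```shell
#!/bin/sh
# bg_work is a hypothetical placeholder for the real per-server job.
bg_work() {
    sleep 1
    echo "done: $1 on $2"
}

# Start one background job per server, then wait only for those PIDs.
run_table() {
    table=$1
    shift
    child_pids=
    for server in "$@"
    do
        bg_work "$table" "$server" &
        child_pids="$child_pids $!"
    done
    wait $child_pids    # unquoted on purpose: one argument per PID
    echo "finished $table"
}

# Illustrative input: a table name followed by its server list.
printf '%s\n' 'tab1 srvA srvB' 'tab2 srvC' |
while read -r table server_list
do
    run_table "$table" $server_list
done
```

Because wait is given explicit PIDs, any other process the shell happens to track (such as the sort or egrep feeding the loop) is ignored.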

DGPickett · October 29, 2010, 11:39am

I guess wait on LINUX waits for everything. I wonder if nohup helps to move the script away. It might be an interesting man page read or such, to find out whether it is waiting for all processes on the tty or on the process group. But yes, collecting pids and waiting for them one at a time is best, as you get the exit return $? of each child from "wait $child_pid".
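
A small sketch of waiting on the PIDs one at a time and picking up each exit status. The job function and the status values 0 and 3 are invented for illustration only.

```shell
#!/bin/sh
# job is a hypothetical stand-in that exits with the status it is given.
job() ( sleep 1; exit "$1" )

job 0 &
pid_a=$!
job 3 &
pid_b=$!

failures=0
for pid in $pid_a $pid_b
do
    wait "$pid"        # returns the exit status of that one child
    status=$?
    echo "pid $pid exited with $status"
    [ "$status" -eq 0 ] || failures=$((failures + 1))
done
echo "failures: $failures"
```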

If the exit status is not a biggie, or you check that through log files, you can skip the wait and monitor the children through shared stdout and stderr, like this:

(
this&
that&
the_other&
) 2>&1 | cat >>$shared_log

This monitors not only the children but their children and so on, as long as they do not redirect both stdout and stderr. Even when "wait $child_pid" returns, the child may have relatives still running: background or up-pipeline processes that close stdout but do not immediately exit, or someone down-pipeline may exit and cut them off! $! is just the pid of the parent, or of the last process in the pipeline.

sleep 99 | sleep 5 & wait $!    # wait waits for sleep 5 but sleep 99 is still running.

(sleep 99 & sleep 5 ) & wait $!    # wait waits for the subshell, which exits after sleep 5, but sleep 99 is still running.
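
The shared-log approach described above can be tried with a self-contained sketch like the following. The two background groups and the temporary log file are stand-ins I made up; the idea is only that cat cannot see end-of-file until every writer holding the pipe has exited.

```shell
#!/bin/sh
shared_log=$(mktemp)

(
    { sleep 1; echo "first done"; } &
    { sleep 2; echo "second done"; } &
) 2>&1 | cat >>"$shared_log"

# The pipeline returns only after both background writers have closed the pipe.
lines=$(grep -c . "$shared_log")
echo "lines logged: $lines"
rm -f "$shared_log"
```

No wait is used here; the subshell itself exits at once, and it is the open pipe that holds the pipeline until the last child finishes.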

The ability of processes other than $! to get errors not reported on $? is one reason to rely on logs, or write a very attentive wrapper script to keep an eye on the children and report $? for all. Sometimes I get really formal, for money and my job security and all that. This is fine for interactive, but not so wise unattended:

cmd1|cmd2|cmd3

For unattended runs, each stage can report its own exit status instead:

>$fail_log
(
  cmd1
  zret=$?
  if [ $zret != 0 ]
  then
   echo cmd1 returned $zret >>$fail_log
  fi
 ) | (
  cmd2
  zret=$?
  if [ $zret != 0 ]
  then
   echo cmd2 returned $zret >>$fail_log
  fi
 ) | (
  cmd3 . . . .
 )

if [ -s $fail_log ]
then
 exit 1
fi
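
The same pattern, cut down to something that can be run as is. The report helper is a hypothetical wrapper, and false and cat merely stand in for cmd1 and cmd2; treat it as a sketch of the idea rather than the exact script.

```shell
#!/bin/sh
fail_log=$(mktemp)

# report runs a command and appends a line to $fail_log if it exits non-zero.
report() {
    "$@"
    zret=$?
    if [ "$zret" -ne 0 ]
    then
        echo "$1 returned $zret" >>"$fail_log"
    fi
}

# $? after the pipeline reflects only the last stage, so the failure of
# the first stage is visible only through the log.
report false | report cat
echo "pipeline status: $?"

logged=$(cat "$fail_log")
if [ -s "$fail_log" ]
then
    echo "logged: $logged"
fi
rm -f "$fail_log"
```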

fpmurphy · October 29, 2010, 11:46am

The ksh on Solaris is probably the old modified version of ksh88 that ships by default with Solaris.

What happens when you run the script on Solaris using /usr/xpg4/bin/sh?

agama · October 31, 2010, 11:33pm

This might be a bug in the version of ksh that you have, one that is likely fixed in the current release.

I'm running 'Version JM 93t+ 2009-02-02' on some boxes, and 'Version JM 93t+ 2010-06-21' on most of my Linux boxes. Testing on the older of the two, it handled your script without any problems:

>>>start: Sun Oct 31 23:18:46 EDT 2010
jobs output
[2] +  Running                 <command unknown>
[1] -  Running                 <command unknown>
jobs -p  
Before entering parallel process  18154 
>>>finish: Sun Oct 31 23:18:46 EDT 2010

I added start/finish messages to show the delay, if there was any. There is a more recent release than the 6/21/2010 version; it can be pulled directly from the AT&T Labs-Research AST software download page.

As a further test, I put this little script together that reads lines with one or more sleep times and sets that many async sleep processes going. It is similar to the script you are running and it seems to have no issues with a more recent version of ksh.

while read list
do
        for x in $list
        do
                echo "$(date) sleeping $x"
                sleep $x &
        done

        echo "$(date) waiting"
        wait
        echo "$(date) looping"
done <xx

jlliagre · November 1, 2010, 9:27am

Your (agama) test script works fine too with the ksh version shipped with Solaris 10 (Version M-11/16/88i).

DGPickett · November 1, 2010, 10:40am

Sometimes I script parallel processing logging so it looks like it was done sequentially, both to keep from mixing line fragments and so as not to confuse the onlookers in their less sophisticated moments (-:

cmd1 >log1 2>&1 &
pid1=$!
 
cmd2 >log2 2>&1 &
pid2=$!
 
wait $pid1
zret1=$?
echo ==== cmd1 ====
cat log1
echo ===== cmd1 returned $zret1
 
wait $pid2
zret2=$?
echo ==== cmd2 ====
cat log2
echo ===== cmd2 returned $zret2
 
exit $(( $zret1 + $zret2 ))
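
One caveat about summing the statuses for the final exit: exit statuses are normally truncated to 8 bits, so a sum such as 128 + 128 can come out as 0 and look like success. A possible alternative, sketched with a hypothetical combine_status helper, is to collapse the statuses to 0 or 1 first.

```shell
#!/bin/sh
# combine_status is a hypothetical helper: it prints 1 if any argument
# is non-zero, otherwise 0.
combine_status() {
    for s in "$@"
    do
        if [ "$s" -ne 0 ]
        then
            echo 1
            return
        fi
    done
    echo 0
}

# Show what a plain sum turns into when used as an exit status.
( exit $((128 + 128)) )
echo "summed status seen by the caller: $?"

echo "combined status: $(combine_status 128 128)"
```

In the script above this would be used as exit "$(combine_status "$zret1" "$zret2")".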