I have a requirement to serialise various rsh scripts that hit my server from an external scheduler. No matter how many scripts come in via rsh, only one should execute at a time and the others should wait.
I have set up the scheduler to call my shell script with the command to be run as a parameter. This shell script is responsible for queuing the commands and executing them one after the other.
The following is the code that I have written. It runs just fine for a few jobs, but as the number of queued jobs increases, the script fails. I am running ksh on Ubuntu 8.04.
Can anyone please tell me if there is anything obviously wrong? I know my request amounts to a review of my code, but I would be grateful if anyone can share any similar code that I could readily use.
I have read another similar post where various alternatives are suggested, but those will not work for me.
#!/usr/bin/ksh
pid_line=$$_$1
queue_file=/tmp/queue_file

# sleep a pseudo-random 0-4 seconds to stagger arrivals
sleep $(echo $$%5|bc)

# append our entry to the end of the queue
echo $pid_line >> $queue_file

# loop while we are not the top-most job in the queue
while [[ $(grep -n $pid_line $queue_file | cut -d: -f1) -ne 1 ]]
do
    # if by some rare chance someone removed us from the queue, rejoin
    if [[ $(grep $pid_line $queue_file | wc -l) -eq 0 ]]
    then
        echo $pid_line >> $queue_file
    fi

    # if the job at the top of the queue was killed or terminated
    # before it could remove itself, remove it ourselves
    curr_pid_line=$(head -1 $queue_file)
    if [ $(ps -ef | grep $(echo $curr_pid_line|cut -d_ -f1) | grep -v grep | wc -l) -eq 0 ]
    then
        sleep 1
        grep -Ev $curr_pid_line $queue_file | cat > $queue_file
    else
        # some other job is running; wait a while before trying again
        sleep 20
    fi
done

# we are at the head of the queue now; run the command
$1
result=$?

# remove ourselves from the queue
grep -vE $pid_line $queue_file | cat > $queue_file

# exit with the exit code of the command we ran
exit $result
Your script isn't protected against simultaneous execution of its critical sections. You need atomic lock operations to make sure one instance isn't modifying your queue_file while it is being processed by another. Shell scripting is probably not the best approach for that kind of thing. Moreover, the "grep something file | cat > file" construct looks bogus to me: "file" will be truncated before it is read. At least that is what it does with ksh on Solaris.
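For what it's worth, a shell can get an atomic lock without dropping to C: mkdir either creates a directory or fails, as a single operation. Here is a minimal sketch; the lock path and retry interval are my own choices, not anything from the script above:

```shell
#!/bin/sh
# Sketch: serialize jobs with an atomic mkdir lock.
# LOCKDIR is an assumed path; pick one on a local filesystem.
LOCKDIR=${LOCKDIR:-/tmp/serialize.lock}

# mkdir creates the directory or fails atomically if it already
# exists, so there is no window in which two instances both
# believe they own the lock.
until mkdir "$LOCKDIR" 2>/dev/null; do
    sleep 1             # another instance holds the lock; retry
done

# --- critical section: only one instance runs this at a time ---
echo "job running in process $$"
# ---------------------------------------------------------------

rmdir "$LOCKDIR"        # release the lock for the next waiter
```

A trap on exit to remove the lock directory would make this more robust if a job can be killed mid-run, otherwise the stale lock blocks everyone.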
I agree that in spite of my logic to re-add accidentally removed jobs back to the queue, there could be some problems.
But I think this is what I have to go ahead with, given that there will never be more than 20 jobs at any given time - anything more elaborate would be overkill for the functional requirement that I have.
I am thinking of writing a C program with named pipes when I have more time and money.
I am surprised that there is no easy way of doing this on Unix.
edit by bakunin: i provided the code-tags you surely have just forgotten - no problem, but please bring them with you the next time. Thank you.
This command works "by accident". Its behavior is undefined. testfile might be cleared before being read depending on unpredictable factors.
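The usual portable fix (my wording, not something from the thread) is to write the filtered output to a temporary file and rename it over the original, so the live file is never truncated while it is being read:

```shell
#!/bin/sh
# Demonstration of removing a line without piping a file into
# itself. The file name and contents are made up for the example.
queue_file=/tmp/queue_demo.$$
printf '%s\n' alpha beta gamma > "$queue_file"

# filter into a temp file first, then rename; rename is atomic
# within one filesystem, so readers see the old or the new file,
# never a half-written one
grep -v '^beta$' "$queue_file" > "$queue_file.tmp" &&
    mv "$queue_file.tmp" "$queue_file"

cat "$queue_file"       # alpha and gamma remain
rm -f "$queue_file"
```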
What makes you believe there is not? Unix bundles a job scheduler, and "at -q xx", with xx as a custom queue in which simultaneous jobs are forbidden, looks suited to this task, although I haven't really tested that approach.
To be honest, I think the whole design is flawed - not only the file piped into itself, as jiliagre has already pointed out (and rightfully so, I might add).
A better way would be to use the filesystem's ability to sort by date: instead of maintaining a file, maintain a directory of timestamped files to manage the jobs. Here is a sketch of a solution I think might work:
#! /bin/ksh
typeset workdir=/path/to/dir
typeset myself=$$
typeset action="$1"                # the job gets passed from outside

touch ${workdir}/job.${myself}     # enqueue the job

while [ "$(ls -rt ${workdir} | head -1)" != "job.${myself}" ] ; do
    sleep 5                        # wait until our job is the oldest in
                                   # the directory, which means we are
                                   # first in the queue
done

# now that we are the first one, do the action:
$action                            # do whatever the job is supposed to do
result=$?

rm ${workdir}/job.${myself}        # remove ourselves from the queue

exit $result                       # pass on the job's exit code
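If portability beyond Linux is not a concern, util-linux's flock(1) does the queueing and the atomicity in one call. A sketch under that assumption (the lock file path is arbitrary):

```shell
#!/bin/sh
# Sketch using flock(1) from util-linux (Linux-specific; not part
# of a base Solaris ksh environment). Every instance blocks on the
# same lock file and the kernel lets them through one at a time.
LOCKFILE=/tmp/scheduler.lock

# -x takes an exclusive lock; -c runs the command once the lock is
# held, and the lock is dropped when the command exits.
flock -x "$LOCKFILE" -c 'echo "job running as $$"'
```

flock passes through the exit status of the command it ran, so the caller still sees the job's result, like in the scripts above.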