Sleep command did not work

mad_man · August 29, 2017, 9:51am

Hi All,

We have a process that has run well in production for the last 2 years, but yesterday it suddenly hit a problem.

Here is what the process does:

Receive the files in a NAS (network-attached storage) directory.
Trigger the process (some.ksh script) once for each file in the NAS directory, 1 second apart; this happens in a for ... do ... done loop.

This ran fine until yesterday, when suddenly 3 parallel processes were triggered for 2 files in the NAS directory at the same time (for example, 11:30:10 AM). Initial analysis says there is nothing wrong in the code, but somehow the sleep command did not work.

Is there anything that could make the sleep command fail to work?
Any comments or suggestions will be helpful.

I am using an AIX 6.0 UNIX server.

Thanks.

Don_Cragun · August 29, 2017, 10:07am

My crystal ball is a little cloudy this morning. I am unable to see the sleep command in your code that you don't think is working.

Maybe you could show us your code, any logs and diagnostics produced, and explain why you think sleep is failing (and what the symptoms are when sleep fails)???

mad_man · August 29, 2017, 10:21am

  
 echo " Calling system_router.ksh"  `date` " <<<<<\n" >>$temp_log 
 nohup /$TLExxxx/scripts/appldata_proc.ksh >/dev/null 2>&1 & 
 sleep 1

Hi Don, here is the piece of code that creates the gap between files in the NAS directory.

I am also showing sample log fragments which I feel could help here.

Normal run process start timing:
>>>>>	starting	appldata_proc.ksh	Fri	Aug	25	10:29:20	EDT	2017	<<<<<
>>>>>	starting	appldata_proc.ksh	Fri	Aug	25	10:29:19	EDT	2017	<<<<<
>>>>>	starting	appldata_proc.ksh	Fri	Aug	25	10:29:17	EDT	2017	<<<<<
>>>>>	starting	appldata_proc.ksh	Fri	Aug	25	10:29:18	EDT	2017	<<<<<

Problematic run process start timing:
>>>>>	starting	appldata_proc.ksh	Mon	Aug	28	10:45:12	EDT	2017	<<<<<
>>>>>	starting	appldata_proc.ksh	Mon	Aug	28	10:45:12	EDT	2017	<<<<<
>>>>>	starting	appldata_proc.ksh	Mon	Aug	28	10:45:12	EDT	2017	<<<<<

The application's main driver missed the 1 second sleep gap.

Thanks.

Don_Cragun · August 29, 2017, 11:33am

What you have shown us is a script that writes a timestamp into a log file (which you haven't shown us), starts a process running in the background, sleeps for a second and exits. There is absolutely no indication that the sleep you have shown us should have any effect whatsoever on the behavior of this script.

What is producing the sample log fragments you did show us? Is it coming from appldata_proc.ksh ? (It is not coming from the echo in the code you did show us!)

What does the log that you did show us have for several seconds before and after the three lines that show a timestamp of 10:45:12?

Note that there is no reason to believe that a script that is started every second will finish every second. On a busy system, several invocations of appldata_proc.ksh could be running at the same time even if one invocation of appldata_proc.ksh usually runs in much less than a second.
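
One way to make such overlap visible is to have each invocation log its own start and end together with its PID. This is only a minimal sketch: the helper name log_event and the log path are hypothetical and are not taken from the script shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original scripts).
# $1 = log file, $2 = event label such as "start" or "end".
log_event() {
    printf '%s pid=%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$$" "$2" >> "$1"
}

# Possible use inside appldata_proc.ksh (the log path is a placeholder):
#   log_event /path/to/proc.log start
#   ... real work ...
#   log_event /path/to/proc.log end
```

If a "start" line from one PID appears before the "end" line of the previous PID, the two invocations overlapped.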

bakunin · August 29, 2017, 5:14pm

And on top of the questions Don already asked: the first line of your appldata_proc.ksh should show which shell you use. The default shell in AIX 6.0 (out of service, btw.) is ksh88, but AIX also offers ksh93 as /usr/bin/ksh93.

bakunin

mad_man · August 30, 2017, 3:38am

Hi All,

We found the actual issue. These parallel-running scripts share a folder on the server, from which each one picks up the files it should process. When the scripts ran, the server was 100% busy; even with sleep 1 in the driver, the scripts slowed down and ended up running in parallel. As a result, the second process picked up the first process's file names as well as its own (the first process's files were not supposed to still be there at that point). The second process then used the first process's file name and failed.

We plan to pass the file name as an argument instead of picking it up from the shared folder.
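
A rough sketch of what such a driver loop could look like. The function name dispatch_files, its argument order, and the paths in the usage comment are assumptions for illustration; the real driver was not posted in full.

```shell
#!/bin/sh
# Hypothetical driver: dispatch_files DIR GAP_SECONDS COMMAND [ARGS...]
# Starts COMMAND once per regular file in DIR and passes that file's
# path as the last argument, so no worker has to scan the shared folder.
dispatch_files() {
    dir=$1
    gap=$2
    shift 2
    for f in "$dir"/*
    do
        [ -f "$f" ] || continue          # skip subdirectories and an unmatched glob
        nohup "$@" "$f" >/dev/null 2>&1 &
        sleep "$gap"
    done
    wait    # drop this line if the driver should return without waiting
}

# Possible use (paths are placeholders):
#   dispatch_files /path/to/nas 1 /path/to/appldata_proc.ksh
```

Each worker then reads its file name from "$1" and never sees files that belong to another invocation.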

Thanks.

Corona688 · August 30, 2017, 3:07pm

That sounds much more reliable.

bakunin · August 31, 2017, 1:18pm

Either this or you create a semaphore file: consider the following script stub:

typeset fSemaphore="/path/to/file"

# are we already running?
if [ -e "$fSemaphore" ] ; then
     exit 1
fi

touch "$fSemaphore"

<.... your code here ....>

rm "$fSemaphore"
exit 0

This makes sure the script runs only one instance at a time.

I hope this helps.

bakunin

Don_Cragun · August 31, 2017, 11:07pm

Not quite. There is a race condition between the time the existence test ( [ -e "$fSemaphore" ] ) is executed and the time the semaphore file is created ( touch "$fSemaphore" ) that can allow two or more copies of this code to run simultaneously and not realize that another invocation is running.

This can be worked around with the shell's noclobber option:

fSemaphore="/path/to/file"
set -C	# Set noclobber option.

# Are we already running?
if ! date "+File $fSemaphore created @ %c by PID $$" > "$fSemaphore"
then	exit 1
fi

# Set a trap to remove the semaphore when we exit.
trap 'rm -f "$fSemaphore"' EXIT
set +C	# Clear noclobber option.

<.... your code here ....>

exit

Note that both of these run the risk of leaving the semaphore file in place if the script is terminated by a KILL signal, which cannot be trapped (for the code above), or by any signal (for the code bakunin suggested). If this happens, the semaphore file will have to be removed manually before the script will run again successfully.
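
A sketch of the same noclobber idea with explicit signal traps added. The wrapper name run_locked is hypothetical. Whether an EXIT trap also runs when the shell is terminated by a signal differs between shells, so HUP, INT and TERM are trapped explicitly here; SIGKILL still cannot be caught.

```shell
#!/bin/sh
# Hypothetical wrapper: run_locked LOCKFILE COMMAND [ARGS...]
# The body runs in a subshell so the traps and noclobber stay local.
run_locked() (
    lock=$1
    shift

    # Atomic test-and-create: with noclobber set, ">" fails if the file exists.
    # Note: $$ is the PID of the shell that called run_locked.
    ( set -C; echo "locked by PID $$" > "$lock" ) 2>/dev/null || exit 1

    # Install traps only after we know this invocation owns the lock.
    trap 'rm -f "$lock"' EXIT
    trap 'exit 129' HUP
    trap 'exit 130' INT
    trap 'exit 143' TERM

    "$@"
    rc=$?
    exit "$rc"
)
```

A caller that finds the lock already taken gets exit status 1 and leaves the existing lock file untouched.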

RudiC · September 1, 2017, 2:37am

Not sure if this has been discussed before, but how about using a symbolic link or a directory for the semaphore file? touch doesn't care whether a file exists or not, but both ln -s and mkdir check and create the respective object in one atomic operation. Check the exit code and proceed on success or terminate on error.
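
A minimal sketch of both variants. The helper names are made up for illustration, and the lock path is whatever the caller passes in.

```shell
#!/bin/sh
# Hypothetical helpers; each acquire_* returns 0 only for the one caller
# that actually created the lock object.

# mkdir fails if the directory already exists, so test-and-create is one step.
acquire_dir_lock() {
    mkdir "$1" 2>/dev/null
}
release_dir_lock() {
    rmdir "$1"
}

# ln -s fails if the link name already exists; the link text records the owner.
acquire_link_lock() {
    ln -s "pid-$$" "$1" 2>/dev/null
}
release_link_lock() {
    rm -f "$1"
}
```

One caveat for the ln -s variant: if the lock name happens to be an existing directory, ln creates the link inside that directory instead of failing, so the lock name should never be a directory.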

bakunin · September 1, 2017, 4:07am

If I understood the original poster correctly, he starts instances of this script every second (set apart by a sleep 1), so this concern (although valid in general) won't apply here. In the general case you are right: race conditions need to be addressed.

This is true. I wanted to illustrate what a semaphore is in general, so I left out the "implementation details". Yes, trapping signals should be done in production code.

bakunin

Don_Cragun · September 1, 2017, 9:55am

All that is required for a semaphore (or lock) file is that you can test for its presence and create it if it was not already present as an atomic operation. As you said, this can be done in C with any of the following library calls (and on most systems other calls are possible):

	rc = mkdir("filename", 0644);

	rc = mkfifo("filename", 0644);

	rc = open("filename", O_CREAT | O_EXCL, 0644);

	rc = symlink("contents", "filename");

and can be done in the shell command language with the following equivalent code:

mkdir "filename"; rc=$?

mkfifo "filename"; rc=$?

set -C; > "filename"; rc=$?
# "set +C" can be used to remove the O_EXCL flag from all
# future ">" redirections or ">|" can be used to remove the
# O_EXCL flag on individual future redirections.

ln -s "contents" "filename"; rc=$?

In both of the symbolic link cases, filename is the pathname of the link that gets created. If there is more than one component in that pathname, the path prefix of the last component must name an existing directory accessible by the user calling symlink() or ln -s . The contents argument is stored only as a string and does not have to name an existing file.

If you want to use the lock file to document what process created the lock and when (as I did with the:

date "+File $fSemaphore created @ %c by PID $$" > "$fSemaphore"

in the code I suggested in post #9), then I usually find using a regular file to be the easiest. You can do the same thing with a symbolic link by using the output of that date command as the contents of the symlink.

Creating a directory is a more I/O intensive operation than creating a regular file, a symbolic link, or a FIFO (AKA named pipe). Therefore, I seldom use a directory as a lock file. But that is just my personal preference.

Except, as noted in posts #1 and #3, even though there is a one second delay between starting invocations of the script, there are times when three copies of the script are running simultaneously and we have no indication of where in the script earlier invocations are hanging. On a system that is hanging due to heavy I/O load, it is quite possible for all three invocations to be hung waiting to access the directory where the semaphore file is located (i.e., hanging in the test for the existence of $fSemaphore ). In this case, this race condition is not only possible, but likely to occur.

Quite true. I just wanted to show the required sequence of operations. The trap to remove the semaphore file must come AFTER we have determined that this invocation of the script is the one that created the semaphore file. Many beginners mistakenly install their traps at the start of their script and accidentally remove a semaphore file set by another invocation of the script when they exit after finding that another process created the semaphore file.