random script termination

I'm writing a script to archive data. First, the files are all rsync'd to the archive directory via NFS mounts (I know, not the most efficient, but the only choice in this situation), then I use md5sum to validate the transfers. During execution, the script exits for no apparent reason. It runs under ksh, and I've run it with ksh -x for debugging and left myself all kinds of debug messages throughout the script, but I cannot find a logical reason for the program's termination.

At the time of exit, it's in a while loop that reads the archived file names from a list file, then does an md5sum on the source and destination. It exits at different points in the loop. The filenames it's parsing don't seem to be the issue: running over the same files again and again, it exits at different files. The test directory that I'm using has 5700+ files in it. Sometimes, albeit rarely, it will do all 5700, sometimes about 200, other times around 1500. It seems totally random. I've niced the md5sum part and it appears to be working better, but it still doesn't complete most of the time.

My question is: Is there something that will terminate a program if it takes too much processing power, memory, etc.? Like a "runaway" process catcher? I don't see anything of the sort in ps -ef or ps auxc, or in chkconfig --list, and the program itself doesn't seem to be taking too much of the available resources while running. The reason I'm suspicious of this type of thing is that a couple of times it output "Quit" (which is NOT anywhere in the script) upon exit, and once it said "SIGTERM +6". I tried looking at the man pages for signal and couldn't find much.
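On the "Quit" message: as far as I know, that is the text a shell prints when a child process is killed by SIGQUIT, so it hints that something received a signal rather than exiting on its own. A minimal sketch for decoding an exit status into a signal name follows. The helper name describe_status is made up for illustration, and the sketch assumes a POSIX-style shell whose kill -l accepts an exit status.

```shell
# describe_status STATUS: explain a value of $? (hypothetical helper).
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # POSIX: `kill -l <exit status>` prints the name of the signal that
        # terminated the process. ksh93 documents 256+N instead of 128+N for
        # signal deaths; both are above 128, so the same test applies.
        name=$(kill -l "$status" 2>/dev/null) || name="unknown"
        echo "killed by signal $name"
    else
        echo "exited with status $status"
    fi
}

# Example: run a child that sends itself SIGTERM, then decode its status.
rc=0
sh -c 'kill -TERM $$' 2>/dev/null || rc=$?
describe_status "$rc"
```

Logging the status of each md5sum call this way would show whether md5sum itself is being killed or whether the signal lands on the script.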

Any ideas?

Thanks in advance.

Just to be clear, is it getting terminated during the transfers (rsync) or only during the checksum process?

Try tracing the running process with strace; that should give more hints. Looking at an strace log for the first time is irritating, but it's very useful.
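A rough sketch of how that could be done. The script name and log path are placeholders, and signal_lines is a hypothetical helper. The flags -f (follow child processes), -tt (timestamps), -o (write to a file) and -p (attach to a PID) are standard strace options; the exact text of the signal lines varies a little between strace versions, so the patterns below are an assumption to adjust.

```shell
# Typical invocations (script name and log path are placeholders):
#   strace -f -tt -o /tmp/archive.trace ksh ./archive.ksh
#   strace -f -tt -o /tmp/archive.trace -p <pid of the running script>

# signal_lines LOG: show lines where strace recorded a signal delivery
# ("--- SIG...") or a death by signal ("+++ killed by ...").
signal_lines() {
    grep -n -e '--- SIG' -e '+++ killed by' -- "$1"
}
```

With -f the md5sum children end up in the same log, each line prefixed by its PID, so the last few lines before a "killed by" entry are usually the place to start reading.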

Regarding a runaway catcher: we do have something similar, like process_monitor_controller, which is a sub-version of the scheduler in kernels. It saves the current process image and stack frames and puts the process to sleep if it crosses a limit that it had promised not to cross.

Not sure whether this will help you, but these are a few points to think about.

Does your script run to completion if you remove the md5sum validation code?

matrixmadhan,

Rsync runs to completion. At the time of termination, it's in a loop getting the md5sums and checking the source against the destination file. The weird thing is that it exits at random points in the loop. This is where your suggestion of strace will probably come in quite handy. To be honest, I haven't used it in years and didn't consider using it for a script. Every time I'd used it in the past, it was for compiled programs. Thanks for the brain slap!

In almost 20 years of programming, I've never had a script do this before. It's always been traceable to the code itself, i.e. programmer error. :)

fpmurphy:

Yes, it does finish without the md5sum. Thanks, I had highly suspected md5sum, but never ruled it in or out by removing it. The rest of the code in the loop just checks whether the file exists and whether the md5sums are the same. Based on switches, it either removes or leaves the source file.
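For anyone following along, a loop of the kind described might look roughly like the sketch below. This is not the actual script: the function names, the list file and the directory arguments are all placeholders, and the remove-source switch is left out. It is written in POSIX shell so it should behave the same under ksh. The point is that every md5sum exit status is captured and logged, so a killed md5sum shows up instead of passing silently.

```shell
# digest FILE: print the MD5 digest of one file, or fail with md5sum's
# exit status (hypothetical helper).
digest() {
    out=$(md5sum -- "$1") || return $?
    # md5sum prints "<digest>  <name>"; keep only the digest field.
    printf '%s\n' "${out%% *}"
}

# verify_list LIST SRC_DIR DST_DIR  (all three arguments are placeholders)
verify_list() {
    list=$1 src=$2 dst=$3 failures=0
    # Read the list on fd 3 so nothing inside the loop can eat it from stdin.
    while IFS= read -r name <&3; do
        rc_s=0; s=$(digest "$src/$name") || rc_s=$?
        rc_d=0; d=$(digest "$dst/$name") || rc_d=$?
        if [ "$rc_s" -ne 0 ] || [ "$rc_d" -ne 0 ]; then
            # A status above 128 normally means md5sum died from a signal.
            echo "ERROR $name: md5sum status src=$rc_s dst=$rc_d" >&2
            failures=$((failures + 1))
        elif [ "$s" != "$d" ]; then
            echo "MISMATCH $name" >&2
            failures=$((failures + 1))
        else
            echo "OK $name"
        fi
    done 3< "$list"
    # Succeed only if every file verified.
    [ "$failures" -eq 0 ]
}
```

Called as verify_list files.lst /data/src /mnt/archive (made-up paths), it prints one line per file and returns non-zero if anything failed or mismatched.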

Thank you both. I'll let you know what I find with the strace.

MPH

Next thing to check for is that you have the latest version of md5sum.

As far as I know everything is up to snuff. I have the updater running and install updates whenever they show up.

The strace had some interesting output. I'll have to do more research Monday when I get to work.

Strace Output

Update:

This script is running on a Dell PowerEdge 400SC with dual 2.8 GHz processors running CentOS 5. If I disable one of the processors, the script works fine. I can run the script on an AMD quad core running Mandriva without issue. Nothing else on the Dell shows any indication of problems. I would have to deduce that there is a problem either with CentOS and SMP or with the hardware itself. So, I've posted this on the CentOS support forums.

Thanks to all who replied, your information was a great help.

MPH