Script running in parallel failing!

Hi,

I have a single script that was running fine in parallel on

Linux 2.6.9-89 

now it has been upgraded to

Linux 2.6.18-308.24.1.el5

and the script has started to fail unpredictably. Is this an upgrade issue? I ask because the script runs fine for some parallel threads but fails for others. Please advise me on how to resolve this issue.

Thanks.

I hope that you realize that unless you show us the script that is failing, we have absolutely no way to guess at what might be wrong.

What is the script?

What diagnostics does it produce (or what other indication is there that it is failing)?

Hi Don Cragun,

Well, the script fetches files from a remote location into a local directory through SFTP. The script is correct because it still runs fine for some streams while it fails for others.

 
for sftp_file_name in `cat remote_file_list_1` # there are 30 file lists, fetching files from 30 different IPs
do 
fetch_file=`sftp -b $batch_file $remote_user@$remote_IP`
.....
.....
done

The fetch_file variable ends up blank and the script fails to fetch the remote file. Note that the remote system has not changed, and if I run the script again on its own (not while the 29 other streams are running simultaneously), it successfully fetches the file from the remote IP!

Please understand that I won't be able to paste the whole script, so I have posted only the snippet (with the user, IP, etc. masked) where the problem occurs. I hope this is sufficient.

I assumed it to be more of a system issue than a code issue and hence didn't post any code with it.

Thanks.

If you suspect that the kernel upgrade is to blame (which I kind of doubt), have you tried booting the system off the old kernel and running the script again? If the upgrade is to blame, it should run as it did before. Have you also checked the version of sftp? If the system has been upgraded (for example) with a yum update, openssh-clients (or the equivalent for your distribution) may also have been upgraded, and the problem could lie there, or within a myriad of libraries.
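To compare the two environments, something like the following could be run before and after the upgrade (or under each kernel). It is only a sketch: the function name report_versions is made up for illustration, and the rpm query assumes an RPM-based distribution, which the el5 kernel name suggests but does not prove.

```shell
# Hypothetical helper: print the running kernel release and, where available,
# the OpenSSH client version and the installed openssh-clients package.
report_versions() {
    printf 'kernel: %s\n' "$(uname -r)"

    if command -v ssh >/dev/null 2>&1; then
        # ssh -V writes its version string to stderr, so redirect it.
        printf 'ssh: %s\n' "$(ssh -V 2>&1)"
    else
        printf 'ssh: not found\n'
    fi

    # Assumption: RPM-based system; skipped silently if rpm is absent.
    if command -v rpm >/dev/null 2>&1; then
        printf 'package: %s\n' "$(rpm -q openssh-clients 2>&1)"
    fi
}

report_versions
```

If the ssh or package line differs between the two runs, the kernel is probably not the only thing that changed.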

I'm sorry, but in my experience, when a programmer tells me that code is correct because it runs fine some of the time, the most likely problem is that the code makes assumptions that are only true some of the time. And, especially when a programmer says his or her multithreaded code is correct because it works correctly when it is tested using a single thread, I have every reason to believe that the program is timing sensitive and any change to the system (added or reduced system load, new hardware, new software, etc.) may affect timing.

You are asking us to tell you why the assignment:

fetch_file=`sftp -b $batch_file $remote_user@$remote_IP`

is failing without showing us how batch_file, remote_user, and remote_IP are being set. We have no idea what is in remote_file_list_1, but if there are sometimes whitespace characters in that file, it is highly likely that any use of sftp_file_name will fail to do what you intended.
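To illustrate the whitespace concern: a for loop over the output of cat splits on spaces and tabs as well as newlines, so a name such as "a b.txt" would be seen as two separate entries. A minimal sketch of a line-by-line reader is below. The function name read_list is hypothetical, and it assumes the list file holds one file name per line; I can't know whether that matches your real file.

```shell
# Hypothetical helper: emit each non-empty line of a list file exactly as
# written, with no word splitting, so embedded spaces survive intact.
read_list() {
    list_file=$1
    # IFS= keeps leading/trailing blanks; -r keeps backslashes literal.
    # The "|| [ -n ... ]" part handles a last line with no trailing newline.
    while IFS= read -r sftp_file_name || [ -n "$sftp_file_name" ]; do
        # Skip empty lines.
        [ -n "$sftp_file_name" ] || continue
        printf '<%s>\n' "$sftp_file_name"
    done < "$list_file"
}
```

In a real script the printf line would be replaced by whatever is done with each name, always quoting "$sftp_file_name" when it is expanded.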

You are not showing us where the problem occurs; you are telling us a line of code sometimes sets a variable to an empty string based on three variables that may have been set incorrectly by code we can't see.
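One way to get some diagnostics out of the failing streams is to keep the exit status and the error output of each sftp run instead of only capturing its standard output. The wrapper below is a rough sketch under my own assumptions: the name run_logged and the per-IP log file name are invented for illustration, and the usage line reuses your variable names without knowing how they are set.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical wrapper: runs COMMAND, passes its stdout through, appends its
# stderr to LOGFILE, and returns the command's own exit status.
run_logged() {
    log_file=$1
    shift
    "$@" 2>>"$log_file"
    status=$?
    if [ "$status" -ne 0 ]; then
        # Record which command failed and how, for later inspection.
        printf '%s: exit status %d: %s\n' "$(date)" "$status" "$*" >>"$log_file"
    fi
    return "$status"
}

# Possible usage with the variables from the snippet (values not shown here):
# fetch_file=$(run_logged "sftp_$remote_IP.log" sftp -b "$batch_file" "$remote_user@$remote_IP")
# rc=$?   # exit status of the sftp command, via the command substitution
```

With a separate log per stream, the streams that come back with an empty fetch_file should at least leave an error message and a non-zero status behind.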

You have the script. If you're convinced it is working correctly and won't let us see what is going on, there isn't anything we can do to help you.

I wish you luck.