[BASH] signalling

Hi guys,

I am using SLURM to submit calculations to a server at my university.
The time limit for these calculations is 5 days, but sometimes that is not enough. For this reason I need a cleanup function that copies the files of the unfinished calculation before the job ends (so that I can restart it later). I registered this function with trap for the SIGUSR2 signal, and I used the option --signal=SIGUSR2@600 to send the signal 10 minutes before the end of the calculation, but it doesn't seem to work. The code follows.

This is the submission command:

FILE=0
InPath="${inputdir}/*.gjf"
for FILE in $InPath ; do
    if [ -s "${FILE}" ]
    then
        # Strip the directory and the extension from the input file name
        FILENAME=${FILE##*/}
        FILENAME=${FILENAME%.*}

        # The job name is the first five characters of the file name
        JOB=${FILENAME:0:5}
        echo "${FILE} was submitted as ${JOB}"
        mv "${inputdir}/${FILENAME}".*  "${submitdir}/"
        outputpath="${outputdir}/${FILENAME}.out"

        # Launch the executable
        SubFile="${submitdir}/${FILENAME}.gjf"

        sbatch -A jgu-heinze-oshell -p "${QUEUE}" -J "${JOB}" -o "${outputpath}" -n "${PROCS}" -t "${RUN}" --mem-per-cpu=2000 --signal=SIGUSR2@600 ./orca.sh "${SubFile}"

        FILENAME=""
    fi
done

This is the orca.sh file:

# Store working directory to be safe
SAVEDPWD=$(pwd)
 
# We define a bash function to do the cleaning when the signal is caught
cleanup(){
    for FILE3 in "${rundir}"/*.gjf ; do
        if [ -s "${FILE3}" ]
        then
            FILENAME=${FILE3##*/}
            FILENAME=${FILENAME%.*}

            # Drop temporary files, save everything else, then archive the input
            rm "/localscratch/${SLURM_JOB_ID}"/*.tmp
            cp "/localscratch/${SLURM_JOB_ID}"/* "${outputdir}"/
            mv "${rundir}/${FILENAME}".* "${findir}"/
            exit 0
        fi
    done
}
 
# Register the cleanup function when SIGUSR2 is sent,
# ten minutes before the job gets killed
trap 'cleanup' SIGUSR2
 
# Copy input file

#SubFile="${submitdir}/*.gjf"
# echo ${SubFile}
for FILE2 in ${SubFile} ; do
    if [ -s "${FILE2}" ]
    then
        FILENAME=${FILE2##*/}
        FILENAME=${FILENAME%.*}

        # ls /localscratch/
        cp "${submitdir}/${FILENAME}".* "/localscratch/${SLURM_JOB_ID}"

        mv "${submitdir}/${FILENAME}".* "${rundir}/"

        # Go to jobdir and start the program
        cd "/localscratch/${SLURM_JOB_ID}"
        # Restrict Open MPI to the self and shared-memory transports
        export OMPI_MCA_btl=self,sm
        /cluster/Apps/orca/3.0.2/orca "${FILENAME}.gjf"

        # Call the cleanup function when everything went fine
        cleanup

        FILENAME=""
    fi
done
 

I do not understand why it doesn't work!!! :mad::mad::mad:

Thank you for your help

Moderator comments were removed during original forum migration.

Hello,

Unfortunately, there is no error related to the signal.

The cleanup function should just copy some files with the .Hess extension from the calculation computer to mine. It should do so when SIGUSR2 is received (which should be 10 minutes before the end of the 5th day).
Since I do not receive the files on my computer, there are, in my opinion, two possible explanations:

1) The cleanup function is written incorrectly.
2) The SIGUSR2 signal is not received.

Since I can't find an error in the cleanup function, I think it must be the second option.
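
One way to rule out the first explanation would be a dry run of a cleanup-style function on throwaway directories, with no signal involved. The following is only a sketch: the temporary directories and the file names test1.gjf, test1.hess and test1.tmp are made up for illustration, and the variable scratch stands in for /localscratch/${SLURM_JOB_ID}.

```shell
#!/bin/bash
# Throwaway directories (assumptions) standing in for the real ones.
scratch=$(mktemp -d)      # plays the role of /localscratch/${SLURM_JOB_ID}
outputdir=$(mktemp -d)
rundir=$(mktemp -d)
findir=$(mktemp -d)

# Fake files as they might exist when the signal arrives.
echo "input"   > "${rundir}/test1.gjf"
echo "hessian" > "${scratch}/test1.hess"
echo "temp"    > "${scratch}/test1.tmp"

# Same steps as the cleanup function in orca.sh, using the stand-in paths.
cleanup() {
    for f in "${rundir}"/*.gjf; do
        [ -s "$f" ] || continue
        name=${f##*/}
        name=${name%.*}
        rm -f "${scratch}"/*.tmp
        cp "${scratch}"/* "${outputdir}"/
        mv "${rundir}/${name}".* "${findir}"/
        exit 0
    done
}

# Run in a subshell, because the function ends with 'exit 0'.
( cleanup )
cleanup_status=$?

echo "cleanup exit status: ${cleanup_status}"
ls "${outputdir}" "${findir}"
```

If the listing shows the .hess file in the output directory and the .gjf file in the finished directory, the copy logic itself is probably fine and the signal delivery becomes the more likely suspect.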

Have you tested the cleanup function? How do you know it works as expected?

Have you tried manually sending SIGUSR2 to your program?

Have you tried using the sbatch program to send SIGUSR2 to a test program that can confirm whether it receives the signal?

Andrew
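
A test program of the kind suggested above could look roughly like this. It is a minimal sketch, not a SLURM-specific tool: the log file, the variable names and the self-sent signal are assumptions made so that it can be run anywhere. The script only does short sleeps, so bash gets a chance to run the handler between commands.

```shell
#!/bin/bash
# Minimal signal test: records whether SIGUSR2 reaches this shell.
log=$(mktemp)
received=no
self=${BASHPID:-$$}       # PID of the shell that owns the trap

on_usr2() {
    received=yes
    echo "SIGUSR2 received at $(date)" >> "$log"
}
trap on_usr2 USR2

# Self-test only: send the signal after one second.
# Remove this line when the signal should come from kill or from SLURM.
( sleep 1; kill -USR2 "$self" ) &

# Short sleeps, so the handler can run between two commands.
for i in 1 2 3 4 5; do
    [ "$received" = yes ] && break
    sleep 1
done

echo "received=${received}"
cat "$log"
```

For a manual test, the self-test line would be removed and the signal sent from a second terminal with kill -USR2 followed by the PID of the script. For a running batch job, the scancel man page documents scancel --signal=USR2 --batch followed by the job ID as a way to signal the batch shell; whether that is permitted on a given cluster is something to check locally.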

Hi Andrew,

Let me start by saying that I'm not a programmer. I am a PhD student in chemistry and I need to use this computer for some calculations for my doctorate. I have only a little experience with PHP.

The code was already there when I arrived. Recently the environment on the calculation computer was changed (they moved to SLURM), so we had to adapt the code. In particular, the "guide :wall::wall:" provided by the university specifies that SLURM does not automatically send a signal when the calculation is running out of time, and that we should add this new option:

"--signal=SIGUSR2@600"

Since the code was working before the move to SLURM, I thought the problem could be that the signal doesn't arrive.

I actually tried once to send the signal manually, and it didn't work. But since I am not experienced, I don't know whether I made a mistake in sending the signal manually or whether the code is wrong. I even added an

echo signal received

to the cleanup function, but I didn't see anything.

If you have any suggestions, they are very welcome :D:D.

Thank you.
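
A possible explanation for the missing echo, offered as a sketch rather than a confirmed diagnosis: bash runs a trap handler only after the current foreground command has returned, so a handler registered in orca.sh would have to wait until the orca run itself ends. Starting the long command in the background and blocking on wait lets the handler run as soon as the signal arrives. Separately, the sbatch documentation describes --signal as signalling the job steps by default, with a B: prefix (for example --signal=B:USR2@600) to signal the batch shell instead. In the block below, the temporary directories, the sleep that stands in for orca, and the self-sent signal are all assumptions made so that it runs without SLURM.

```shell
#!/bin/bash
# Stand-ins (assumptions) so the sketch runs outside SLURM.
scratch=$(mktemp -d)      # plays the role of /localscratch/${SLURM_JOB_ID}
saved=$(mktemp -d)        # plays the role of ${outputdir}
echo "partial result" > "${scratch}/job.hess"
self=${BASHPID:-$$}       # PID of the shell that owns the trap

cleaned=no
cleanup() {
    echo "signal received, saving files"
    cp "${scratch}"/* "${saved}"/
    cleaned=yes
}
trap cleanup USR2

# Long-running command in the background; 'sleep 30' stands in for orca.
sleep 30 &
calc_pid=$!

# Simulated early-warning signal; under SLURM the scheduler would send it.
( sleep 1; kill -USR2 "$self" ) &

# 'wait' is interrupted by the trapped signal and the handler runs right away.
wait_status=0
wait "${calc_pid}" || wait_status=$?

# Stop the stand-in calculation once the files are saved.
kill "${calc_pid}" 2>/dev/null
wait "${calc_pid}" 2>/dev/null || true

echo "cleaned=${cleaned} wait_status=${wait_status}"
```

With the calculation in the foreground instead, the same handler would only run after the 30 seconds had passed, which on a real job means after the time limit has already killed it.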

Before we all poke around in the dark (SLURM, whatever it is, doesn't seem to be that familiar here), I consider it the responsibility of your university's IT group or department (whoever moved to SLURM) to mitigate the negative consequences of their actions.