Hi,
I am trying to run a program that spawns 8 processes (with a maximum of 2 GB of RAM per process). I want to run it on my cluster, which uses SGE. The cluster has 2 nodes, each with 62 cores and 248 GB of RAM. Currently I use the script below, but the program (softx) dies after a little while (it runs for a bit, then exits without any error message), and I am wondering if I am doing something wrong. Here is my script:
submit.sh:
#!/bin/bash
#$ -l mem_free=32G
softx /home/pc/code/test.sh
==========
I submit this on SGE with the following command:
qsub -q long.q submit.sh
Should I be specifying the number of processors that are generated by softx? If so, how do I do that?
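For context, here is what I think a parallel-environment request would look like. I have not confirmed this on my cluster, and the PE name "smp" is a guess (the actual names can be listed with `qconf -spl`). Would something like this be right?

```shell
#!/bin/bash
#$ -pe smp 8        # request 8 slots; PE name "smp" is an assumption, check with: qconf -spl
#$ -l mem_free=2G   # mem_free is typically requested per slot, so 8 slots x 2G = 16G total
#$ -cwd             # run the job from the submission directory
softx /home/pc/code/test.sh
```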
thanks!
---------- Post updated at 10:20 AM ---------- Previous update was at 06:04 AM ----------
Here are the job details. Which limit was exceeded, and how can I rectify it?
% qacct -j 7408
==============================================================
qname long.q
hostname node02.local
department defaultdepartment
jobname submit.sh
jobnumber 7408
taskid undefined
account sge
priority 0
qsub_time Fri Mar 31 08:44:26 2017
start_time Fri Mar 31 08:44:41 2017
end_time Fri Mar 31 09:11:09 2017
granted_pe NONE
slots 1
failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit
exit_status 137 (Killed)
ru_wallclock 1588s
ru_utime 0.110s
ru_stime 0.190s
ru_maxrss 5.520KB
ru_ixrss 0.000B
ru_ismrss 0.000B
ru_idrss 0.000B
ru_isrss 0.000B
ru_minflt 25267
ru_majflt 0
ru_nswap 0
ru_inblock 0
ru_oublock 176
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 351
ru_nivcsw 95
cpu 10096.930s
mem 429.730GBs
io 76.911GB
iow 0.000s
maxvmem 8.635GB
arid undefined
ar_sub_time undefined
category -q long.q -l h_rt=172800,mem_free=48G
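In case it helps with diagnosis: I suppose I can check which hard limits the queue itself enforces with something like the following (assuming standard SGE tooling; I have not run this yet):

```shell
# show the resource limits configured on the long.q queue
# (h_rt = wallclock, h_cpu = total CPU time, h_vmem = virtual memory)
qconf -sq long.q | grep -E 'h_rt|h_cpu|h_vmem|s_rt|s_cpu|s_vmem'
```

One thing I notice in the output above: cpu is 10096s while ru_wallclock is only 1588s, so the job was clearly using multiple CPUs, which might matter if the queue has an h_cpu limit.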