Stop action in a nested for loop and resume it afterwards

Dear all

I have created a bash script that cds into different subfolders and performs an action (sbatch) in each of them:

# Let's loop over all subfolders
for idx in 100 200 300 400 500 600 700 800 900 1000;
do
    # Let's copy, paste, and rename the relevant folder 
    if (("$idx"<1000));
    then
       cd "$idx-mT";
    else
       cd "1-T";
    fi
    for Geometry in Rectangular-Sample Square-Sample;
    do
        cd "$Geometry";
        for Temperature in 0-K 2-K;
        do
            cd "$Temperature";
            for Dipolar in Dipolar-Hierarchical Dipolar-Tensorial;
            do
                cd "$Dipolar";
                for DMI in Scaled-DMI Unscaled-DMI;
                do
                    cd "$DMI";
                    for DMI_Value in D3-D1-1 D3-D1-1with2 D3-D1-1with4 D3-D1-1with6 D3-D1-1with8 D3-D1-2 D3-D1-2with2 D3-D1-2with4 D3-D1-2with6 D3-D1-2with8 D3-D1-3;
                    do
                        cd "$DMI_Value";
                        if [ -f CrSBr-Field-Cooling.slurm ];
                        then
                                if [ ! -f slurm-* ];
                                then
                                     sbatch CrSBr-Field-Cooling.slurm;
                                fi
                        else
                                mv CrSBr* CrSBr-Field-Cooling.slurm;
                                sbatch CrSBr-Field-Cooling.slurm;
                        fi
                        cd  ..
                    done
                    cd ..
                done
                cd ..
            done
            cd ..
        done
        cd ..
    done
    cd ..
done

which in principle should only run sbatch if no slurm* file exists in that subfolder (I am not yet sure that it does so).

In any case, I am interested in the following. When I launch this script and an sbatch command is accepted, a message like this appears:

Submitted batch job 8359814

which confirms that my job has been submitted to the cluster's queue. However, the queue has a limited capacity, and when that limit is reached, the following message appears in the terminal:

sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

I would like to (i) stop the sbatch calls as soon as the sbatch error appears for the first time, (ii) remember the subfolder in which sbatch failed, and (iii) resume the sbatch calls from that subfolder once the number of jobs in the queue has dropped.

Suggestions are welcome on how to achieve this!

Probably not an answer to this particular problem, but a side note for OP:

  1. You still didn't use shellcheck.net as you were previously requested to. It would have told you, for example:
Line 29:
if [ ! -f slurm-* ];
          ^-- SC2144 (error): -f doesn't work with globs. Use a for loop.
  2. Beware of mv CrSBr* CrSBr-Field-Cooling.slurm; if more than one file matches CrSBr*, it will fail with
    mv: target 'CrSBr-Field-Cooling.slurm' is not a directory
  3. You most likely don't need 5 nested "for loops"; you'd be perfectly fine with a single one (and a correctly used "brace expansion" mechanism). You also don't need the copious number of cds - see my reply to your previous post.
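
A minimal sketch of the first two points, in case it helps. It assumes bash; has_glob_match and rename_single are hypothetical helper names, not existing commands. compgen -G is a bash builtin that prints the paths matching a pattern and fails when there are none, so it can stand in for the [ -f slurm-* ] test, and counting the matches first keeps mv from being handed several source files. The demo runs in a scratch directory created with mktemp, so nothing real is touched.

```shell
#!/bin/bash
# Work in a scratch directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# has_glob_match (hypothetical helper): succeed if PATTERN matches at least one path.
has_glob_match() {
    compgen -G "$1" >/dev/null
}

# rename_single (hypothetical helper): rename only when exactly one path matches PATTERN.
rename_single() {
    local pattern=$1 target=$2 matches
    mapfile -t matches < <(compgen -G "$pattern")
    if (( ${#matches[@]} == 1 )); then
        mv -- "${matches[0]}" "$target"
    else
        echo "expected 1 match for $pattern, found ${#matches[@]}" >&2
        return 1
    fi
}

# Two candidates: the rename is refused instead of failing inside mv.
touch CrSBr-a.slurm CrSBr-b.slurm
rename_single 'CrSBr*' CrSBr-Field-Cooling.slurm || echo "rename skipped"

# One candidate: the rename goes through.
rm CrSBr-b.slurm
rename_single 'CrSBr*' CrSBr-Field-Cooling.slurm && echo "renamed"

# The glob test works with zero, one, or several slurm-* files.
has_glob_match 'slurm-*' || echo "no slurm output yet"
touch slurm-1.out slurm-2.out
has_glob_match 'slurm-*' && echo "slurm output present"
```

Note that the pattern is passed in quotes, so it is compgen that expands it rather than the calling shell.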

The following uses the "collapsed for loop" suggestion.
It retries sbatch in each folder until it exits successfully.

#!/bin/bash
# Succeed if at least one of the arguments is an existing regular file
file_exists() {
  local f
  for f in "$@"; do
    [ -f "$f" ] && return 0
  done
  return 1
}

# Let's loop over all subfolders
for dir in {{100..900..100}-mT,1-T}/{Rectangular-Sample,Square-Sample}/{0-K,2-K}/{Dipolar-Hierarchical,Dipolar-Tensorial}/{Scaled-DMI,Unscaled-DMI}/{D3-D1-1,D3-D1-1with2,D3-D1-1with4,D3-D1-1with6,D3-D1-1with8,D3-D1-2,D3-D1-2with2,D3-D1-2with4,D3-D1-2with6,D3-D1-2with8,D3-D1-3}
do
    pushd "$dir" >/dev/null || continue
    if [ ! -f CrSBr-Field-Cooling.slurm ]
    then
        # Assumes exactly one CrSBr* file exists; mv fails if several match
        mv CrSBr* CrSBr-Field-Cooling.slurm
    fi
    # At most one slurm-* file is expected; the function also copes with several and keeps shellcheck quiet
    if ! file_exists slurm-*
    then
        while ! sbatch CrSBr-Field-Cooling.slurm
        do
            sleep 10
        done
    fi
    popd >/dev/null
done
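
The loop above waits and retries in place. If stopping and resuming later is preferred, as in points (i) to (iii) of the question, one possible approach is to record the failing directory in a small state file. The following is only a sketch under assumptions: submit_all, STATE_FILE and SUBMIT_CMD are hypothetical names, and it assumes that sbatch returns a non-zero exit status when the submit limit is hit (the while loop above relies on the same thing). Directories are compared as plain strings, so the same list has to be passed on every run.

```shell
#!/bin/bash
# Hypothetical settings: where to remember the failing directory, and which
# command submits a job (overridable, e.g. for a dry run).
STATE_FILE=${STATE_FILE:-$PWD/.sbatch-resume}
SUBMIT_CMD=${SUBMIT_CMD:-sbatch}

# submit_all DIR...: submit the job in each directory, in the order given.
# On the first failed submission, save that directory in STATE_FILE and stop.
# On the next run, skip every directory listed before the saved one.
submit_all() {
    local dir resume_from=""
    [ -f "$STATE_FILE" ] && resume_from=$(<"$STATE_FILE")

    for dir in "$@"; do
        if [ -n "$resume_from" ]; then
            # Still before the saved directory: nothing to do here.
            [ "$dir" = "$resume_from" ] || continue
            resume_from=""
        fi
        [ -d "$dir" ] || continue
        # Skip directories that already contain slurm-* output, as in the scripts above.
        compgen -G "$dir/slurm-*" >/dev/null && continue

        # Run the submission in a subshell so the cd does not leak out.
        if ! ( cd "$dir" && "$SUBMIT_CMD" CrSBr-Field-Cooling.slurm ); then
            printf '%s\n' "$dir" > "$STATE_FILE"
            echo "Submission failed in $dir; rerun later to resume from there." >&2
            return 1
        fi
    done

    if [ -n "$resume_from" ]; then
        echo "Saved directory $resume_from is not in the list; remove $STATE_FILE to start over." >&2
        return 1
    fi
    rm -f "$STATE_FILE"   # everything was submitted, nothing left to resume
}
```

It would be called with the same brace-expanded directory list as the for loop above, and simply run again (by hand, or from cron) once the queue has room. The state file is removed only after a run that gets through the whole list.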