Running Many Matlab Jobs

We have a separate resource guide on using GNU Parallel, but combining it with Matlab on Linux adds some syntax quirks. This resource guide is for you if you run task-parallel Matlab jobs that sweep parameters over several variables, each spanning a range of values.

For this tutorial we will work with two files. The first, my_sum.m, is a Matlab script used purely as a placeholder matrix-operation kernel. The second is a job script that uses GNU Parallel to run concurrent Matlab jobs on a single node.

The first file is a clone of the example given in the documentation for Matlab’s timeit function, except that we have changed the size of the matrices A and B to 15,000 by 15,000. As the documentation describes, the example combines several mathematical operations: “matrix transposition, element-by-element multiplication, and summation of columns”:

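% my_sum.m: placeholder workload based on the timeit documentation example
A = rand(15000,15000);
B = rand(15000,15000);
% Transpose A, multiply element-wise by B, then sum each column
f = @() sum(A.'.*B, 1);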

The next file uses GNU Parallel to sweep two positional arguments: the first, {1}, spans 100 to 199, and the second, {2}, spans 200 to 201. ‘parallel’ therefore runs 100 × 2 = 200 cases in total, with at most 24 jobs running at a time, until all 200 jobs are finished.
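
To see how the two ranges combine into 200 cases, the following sketch reproduces the same pairing in plain bash, without GNU Parallel. It assumes only bash brace expansion; the array name `cases` is illustrative and is not part of the job script.

```shell
#!/bin/bash
# Pair every value of the first range with every value of the second,
# the same way two ':::' input sources are combined into cases.
cases=()
for a in {100..199}; do
    for b in {200..201}; do
        cases+=("${a},${b}")   # what '{1},{2}' expands to for one case
    done
done

echo "total cases: ${#cases[@]}"
echo "first case:  ${cases[0]}"
echo "last case:   ${cases[${#cases[@]}-1]}"
```

Each entry of `cases` corresponds to the value that one Matlab job would receive in csv_string.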

Note that double quotes go around the entire MATLAB command, so the double quotes inside it need to be escaped with “\”.
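
The quoting can be checked without starting MATLAB by building the command as a string and printing it. This is a minimal sketch: the variables `a` and `b` are stand-ins for the values GNU Parallel substitutes for {1} and {2}, and `cmd` is an illustrative name that does not appear in the job script.

```shell
#!/bin/bash
# Stand-ins for the values GNU Parallel would substitute for {1} and {2}.
a=100
b=200

# The outer double quotes wrap the whole MATLAB invocation. The inner double
# quotes around the -r argument are escaped with a backslash, so the shell
# keeps them as literal characters instead of ending the outer string.
cmd="matlab -nodisplay -nojvm -nosplash -nodesktop -r \"csv_string='${a},${b}', try, run('$PWD/my_sum.m'), catch, exit(1), end, exit(0);\""

# Print the command instead of running it, to inspect what would be executed.
printf '%s\n' "$cmd"
```

The printed line should show the -r argument enclosed in a single pair of literal double quotes, with csv_string set to the two substituted values.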

The csv_string variable is not used by my_sum.m; it is set only so that a real Matlab script can parse it or use it further.

Both files need to be in the same directory, and the job is submitted via sbatch from that directory, because the script locates my_sum.m through $PWD.

#!/bin/bash
#SBATCH --job-name=matlab
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --partition=parallel
# If the parallel allocation no longer works for you, then use the scavenger
###SBATCH --partition=scavenger
###SBATCH --qos=scavenger
# SLURM job script to run serial MATLAB
# on MARCC using GNU parallel
ml parallel
ml matlab
mkdir -p logs
# --exclusive - distinct CPUs allocated for each job
# -N1 - one node
# -n1 - one task
srun="srun -N1 -n1 --exclusive"
# --delay .2 prevents overloading the controlling node
# -j is the number of jobs parallel runs at the same time, so we set it to 24 (matching --ntasks-per-node)
# --joblog makes parallel create a log of tasks that it has already run
# --resume makes parallel use the joblog to resume from where it has left off
# the combination of --joblog and --resume allow jobs to be resubmitted if
# necessary and continue from where they left off
parallel="parallel --delay .2 -j 24 --joblog logs/runtask.log --resume"
echo "$PWD is the present working directory"
$parallel $srun "matlab -nodisplay -nojvm -nosplash -nodesktop \
    -r \"csv_string='{1},{2}', try, run('$PWD/my_sum.m'), catch, exit(1), end, exit(0);\"" ::: {100..199} ::: {200..201}
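# $? here is the exit status of parallel, not of a single matlab process
echo "parallel exit status (0 means every matlab job succeeded): $?"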

Post Analysis Commands

If you want to re-run all cases, you have to delete the logs/runtask.log file, because --resume skips every case already recorded there.

If you want to sort the log entries by their Seq values:

sort -k1 -n logs/runtask.log

If you want the epoch start times in the log file in human-readable form:

for t in $(tail -n +2 logs/runtask.log | cut -f3 | cut -d'.' -f1); do date -d "@$t"; done
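
As a further sketch, cases that exited with a non-zero status can be listed from the job log. This assumes the tab-separated layout written by --joblog, in which column 1 is Seq and column 7 is Exitval; check the header line of your own log before relying on the column numbers. The sample log written to a temporary file below is made up purely so the sketch runs on its own.

```shell
#!/bin/bash
# Hypothetical two-job sample log (made-up values) for illustration only.
# For real use, set log=logs/runtask.log and drop the printf and rm lines.
log=$(mktemp)
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > "$log"
printf '1\t:\t1500000000.000\t60.000\t0\t0\t0\t0\tcase-a\n' >> "$log"
printf '2\t:\t1500000060.000\t5.000\t0\t0\t1\t0\tcase-b\n' >> "$log"

# Skip the header (NR > 1) and print Seq and Exitval for every failed job.
failed=$(awk -F'\t' 'NR > 1 && $7 != 0 { print $1, $7 }' "$log")
echo "failed jobs (Seq Exitval):"
echo "$failed"

rm -f "$log"
```

An empty list means every recorded case finished with exit status 0.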