MARCC uses SLURM (Simple Linux Utility for Resource Management) to manage resource scheduling and job submission. SLURM is an open-source application with active developers and a growing user community, and it has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing; that is, “interactive” use of the login nodes for job processing is not allowed. Users who need to interact with their codes while they are running can request an interactive session using the “interact” script, which submits a request to the queuing system for interactive access to a compute node.
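
For example, a simple interactive request for a single core on the shared partition might look like this (using the same -n and -p flags shown in the GPU example further below):

$ interact -p shared -n 1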

Partitions:

Slurm uses “partitions” to divide jobs by type (partitions are called queues on other schedulers). MARCC defines a few partitions that allow sequential/shared computing, parallel computing (dedicated or exclusive nodes), GPU jobs, and large-memory jobs. The default partition is set to sequential. There is also a partition called “debug” that can use as many as 2 nodes. The following table describes the attributes of the different partitions:

Note on the scavenger partition

The scavenger partition is meant as a stop-gap once the group has run out of regular allocated hours. It requires the use of --qos=scavenger alongside --partition=scavenger.

Partition | Default/Max time | Default/Max cores per node | Default/Max memory per node | Serial/Parallel | Backfilling
shared | 1 hour / 167 hours (7 days) | 1 / 24 | 5 GB / 128 GB | Serial, multi-thread | Shared
unlimited | Unlimited; jobs not guaranteed to complete | 1 / 24 | 5 GB / 128 GB | Serial, parallel | Shared
parallel | 1 hour / 167 hours (7 days) | 24 / 24 | 5 GB / 128 GB | Parallel | Exclusive
gpu | 1 hour / 167 hours (7 days) | 1 / 24 | 5 GB / 128 GB (CPU), 20 GB (GPU) | Serial, parallel | Shared
lrgmem | 1 hour / 167 hours (7 days) | 1 / 48 | 120 GB / 1024 GB | Serial, parallel | Shared
scavenger | Pre-emptable or 12 hr | 1 / 24 | 5 GB / 128 GB | Serial, parallel | Shared
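
For example, a job script that targets the scavenger partition would include both flags mentioned in the note above:

#SBATCH --partition=scavenger
#SBATCH --qos=scavenger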

SLURM commands:

Here is a list of useful SLURM commands. The equivalent commands for Torque/Maui are given for reference. NOTE: users can still use Torque commands such as qsub, qdel, and qstat.

Description | SLURM command | Torque/PBS
Submit a job script | sbatch script-name (qsub available) | qsub script-name
Queue list and features | sinfo | qstat -Q
Node list | sinfo | pbsnodes -l
List all jobs | squeue | qstat -a
List jobs by user | squeue -u [userid] | qstat -u
Check job status | squeue [job-id] (qstat -a available) | qstat -a [job-id]
Delete a job | scancel [job-id] (qdel available) | qdel [job-id]
Graphical utility | sview | xpbsmon
Hold a job | scontrol hold |
Release a held job | scontrol release |
Change job resources | scontrol update |
Show finished jobs | sacct |
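
A few of these commands in practice, using the example JobID that appears later on this page:

$ squeue -u userid       # list your own jobs
$ scontrol hold 88791    # hold a pending job
$ scontrol release 88791 # release the held job
$ sacct -j 88791         # accounting information for a finished job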

Environment variables:

SLURM sets or presets environment variables that can be used in your script. Here is a table with the most common variables and their Torque/PBS equivalents:

Description | SLURM variable | Torque/PBS
Job ID | $SLURM_JOBID | $PBS_JOBID
Submit directory | $SLURM_SUBMIT_DIR (default) | $PBS_O_WORKDIR
Submit host | $SLURM_SUBMIT_HOST | $PBS_O_HOST
Node list | $SLURM_JOB_NODELIST | $PBS_NODEFILE
Job array index | $SLURM_ARRAY_TASK_ID | $PBS_ARRAYID
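
For instance, a job script can use these variables to report where it is running; a minimal sketch:

cd ${SLURM_SUBMIT_DIR}
echo "Job ${SLURM_JOBID} running on ${SLURM_JOB_NODELIST}"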

Common Flags

This is a list of the most common flags that users may include in their scripts to request different resources and features for jobs.

Description | Job specification | Shortcut
Script directive | #SBATCH | #SBATCH
Job name | #SBATCH --job-name=My-Job_Name |
Wall time | #SBATCH --time=24:0:0 | #SBATCH -t [min] or -t [days-hh:mm:ss]
Number of nodes requested | #SBATCH --nodes=1 | #SBATCH -N [min-max]
Number of cores per node requested | #SBATCH --ntasks-per-node=24 |
Send mail at the end of the job | #SBATCH --mail-type=end |
User's email address | #SBATCH --mail-user=userid@jhu.edu |
Copy user's environment | #SBATCH --export=[ALL|NONE|Variables] |
Working directory | #SBATCH --workdir=dir-name |
Job restart | #SBATCH --requeue |
Share nodes | #SBATCH --shared |
Dedicated nodes | #SBATCH --exclusive |
Memory size | #SBATCH --mem=[mem][M|G|T] or --mem-per-cpu=[mem][M|G|T] |
Account to charge | #SBATCH --account=[account] |
Quality of service | #SBATCH --qos=[name] |
Job arrays | #SBATCH --array=[array_spec] |
Use specific resource | #SBATCH --constraint="XXX" |
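
These flags can also be given directly on the sbatch command line, where they override the corresponding #SBATCH directives in the script; for example (script name and values are illustrative):

$ sbatch --partition=shared --time=2:0:0 --ntasks-per-node=4 my-script.scr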

Important Flags for your jobs

Users need to pay special attention to these flags because proper management will benefit both the user and the scheduler:

Walltime requested

Walltime requested using --time should be larger than, but close to, the actual processing time. If the requested time is not enough, the job will be killed before the program finishes and results may be lost, while SUs will still be charged against your allocation. On the other hand, if the requested time is too long, the job may wait in the queue longer while the scheduler tries to allocate the requested resources. Once resources are allocated to your job they are unavailable to other jobs, which affects the scheduler's ability to allocate resources efficiently for all users.
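
For example, for a run that typically finishes in about two hours, a modest safety margin is enough (the figure is of course job-dependent):

#SBATCH --time=2:30:0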

Nodes, tasks, and cpus

Dedicated nodes can be requested with the --exclusive flag, in which case all CPUs and memory on each allocated node are reserved for the job. Programs that rely heavily on data transfer between tasks may be well suited to exclusive nodes. If exclusive nodes are not needed, whether because the jobs are too small to fill a single node or because they do not leverage shared memory, the --shared flag designates that a fraction of each node may be used.

Parallel processing may be done with multiple processes, multiple threads, or a combination of both. A single process may have multiple threads sharing memory, while multiple processes require some form of communication, for example MPI. In SLURM, the number of processes is controlled by setting the number of “tasks”, while threads are controlled by the number of “cpus” (see below for the relevant flags).

The number of nodes can be specified using the --nodes or -N flag and takes the form min-max (e.g. 2-4). If a single number is given, the scheduler will allocate exactly that number of nodes. You can also specify the resources needed by giving the number of tasks with --ntasks or -n along with --cpus-per-task, in which case the scheduler will decide on the appropriate number of nodes for your job. You may also specify --ntasks-per-node, which is multiplied by --cpus-per-task if both are used. Be aware that if you ask for more CPUs than are available on a single node, the scheduler will reject the request with an error. Finally, you may also request a minimum number of CPUs per node with the --mincpus flag.
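
As a concrete sketch (the sizes are illustrative), a hybrid MPI/OpenMP job could request two nodes with two tasks per node and 12 threads per task, i.e. 24 cores per node:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12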

Memory

The --mem flag specifies the total amount of memory per node. The --mem-per-cpu flag specifies the amount of memory per allocated CPU. The two flags are mutually exclusive.
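
For example, for a job using 24 CPUs per node, the following two requests are roughly equivalent, but only one of them may appear in a given script (values are illustrative):

#SBATCH --mem=48G         # total memory per node
#SBATCH --mem-per-cpu=2G  # per-CPU alternative; do not combine with --mem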

GPUs

Requesting GPUs requires both the gpu partition and the --gres flag. In addition, a total of 6 CPUs per GPU must be requested through a combination of --ntasks-per-node and --cpus-per-task. For example:

#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6

If you would like an interactive session, you can use the provided script:

$ interact -n 6 -p gpu -g 1

Note that the interact script provides the shortcut -g flag for requesting GPUs and does not take “gres” as an input. In the example we requested -n 6 because each GPU is associated with 6 CPUs (cores).
The $CUDA_VISIBLE_DEVICES environment variable lists the device(s) assigned to your job. For example, if you requested 2 GPUs, this variable may be set to “0,1”.
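
Inside the job or interactive session you can confirm which devices were assigned; a small sketch, assuming the nvidia-smi utility is available on the GPU nodes:

echo "Assigned GPUs: ${CUDA_VISIBLE_DEVICES}"
nvidia-smi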

Job arrays

A job array can be specified in a script when submitted using sbatch. For example,

$ sbatch --array=0-15%4 script.sh

would submit script.sh 16 times, with task ids 0 through 15. The %4 is optional and limits the array to 4 jobs running concurrently.

Within script.sh, there are three environment variables that can be used: $SLURM_JOBID is different for each job in the array and is assigned sequentially by the queue; $SLURM_ARRAY_JOB_ID is the same for all jobs in the array and equal to the $SLURM_JOBID of the first job; and $SLURM_ARRAY_TASK_ID is equal to the index specified with the array option (which could be, for example, --array=1,3,5,7 or --array=1-7:2, where 2 is the step size).
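
A minimal sketch of how these variables might be used inside script.sh (the input file naming is hypothetical):

#!/bin/bash -l
# Each array task picks its own input file based on the array index
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat
echo "Job ${SLURM_JOBID} (array ${SLURM_ARRAY_JOB_ID}, task ${SLURM_ARRAY_TASK_ID}) processing ${INPUT}"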

To specify the SLURM stdin, stdout, and stderr files, use %A in place of $SLURM_ARRAY_JOB_ID and %a in place of $SLURM_ARRAY_TASK_ID. For example:

$ sbatch -o slurm-%A_%a.out --array=0-15%4 script.sh

would write output to files named slurm-45_0.out, slurm-45_1.out, slurm-45_2.out, … (assuming 45 is the $SLURM_ARRAY_JOB_ID, i.e. the job id of the first job in the array).

Example Script

A simple script to run an MPI job using 24 cores (a single node) would look like this:

#!/bin/bash -l

#SBATCH --job-name=MyJob
#SBATCH --time=24:0:0
#SBATCH --partition=shared
#SBATCH --nodes=1
# number of tasks (processes) per node
#SBATCH --ntasks-per-node=24
# number of cpus (threads) per task (process)
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=end
#SBATCH --mail-user=userid@jhu.edu

#### load and unload modules you may need
# module unload openmpi/intel
# module load mvapich2/gcc/64/2.0b
module list

#### execute code and write output file to OUT-24log.
# time mpiexec ./code-mvapich.x > OUT-24log
echo "Finished with job $SLURM_JOBID"

#### mpiexec by default launches number of tasks requested

Displaying Jobs

The squeue command displays all jobs that have been submitted to the queues. The output is usually long because of the large number of jobs running or waiting to be executed. The “sqme” script shows only the jobs that belong to the user.

$ sqme
Wed Sep 21 11:53:58 2016
JOBID PARTITION NAME USER     STATE   TIME TIME_LIMI NODES NODELIST(REASON)
88791 parallel  Job1 jcombar1 RUNNING 1:22 1:10:00     4   compute[0301-0304]

The columns are self-explanatory: TIME indicates how long the job has been running, TIME_LIMIT the maximum amount of time requested, and NODELIST the nodes where the job is running.

Submitting/cancelling a Job

Jobs are usually submitted via a script file (see above). The sbatch command is used:

$ sbatch my-script.scr
Submitted batch job 88791

The number reported when the script is submitted is the JobID. It can be used to cancel/kill the job:

$ scancel 88791