Queuing system (SLURM)
MARCC uses SLURM (Simple Linux Utility for Resource Management) to manage resource scheduling and job submission. SLURM is an open source application with active developers and a growing user community. It has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing; “interactive” use of login nodes for job processing is not allowed. Users who need to interact with their codes while they are running can request an interactive session using the “interact” script, which submits a request to the queuing system for interactive access to a compute node.
SLURM uses “partitions” to divide types of jobs (partitions are called queues on other schedulers). MARCC defines several partitions for sequential/shared computing, parallel computing (dedicated or exclusive nodes), GPU jobs, and large memory jobs. The default partition is “shared”.
The following table describes the attributes for the different partitions:
|Partition||Default/Max Time||Default/Max Cores per Node||Default/Max Mem per Node||Serial/Parallel||Backfilling||Limits|
|shared||1 hr / 72 hrs||1 / 24||4.9 GB / 117 GB||Serial (multithread)||Shared||1 node per job|
|unlimited||1 hr / unlimited; jobs not guaranteed||1 / 24||5 GB / 128 GB||Serial, parallel||Shared|
|parallel||1 hr / 72 hrs||24 / 24-28||4 GB / 96-128 GB||Parallel||Exclusive|
|gpuk80||1 hr / 18 hrs||1 / 24||5 GB / 128 GB CPU, 20 GB GPU|
|gpup100||1 hr / 12 hrs||1 / 24||5 GB / 128 GB CPU, 24 GB GPU||Serial||Shared||1 node (2 GPUs) per general user|
|lrgmem||1 hr / 72 hrs||1 / 48||120 GB / 1024 GB||Serial, parallel||Shared|
|scavenger||6 hours||1 / 24-28||5 GB / 128 GB||Serial, parallel||Shared||5 nodes per job|
|express||1 hr / 12 hrs||1 / 6||3.5 GB / 86 GB||Serial (multithread)||Shared||1 node per job|
|skylake||1 hr / 72 hrs||1 / 24||3.5 GB / 86 GB||Serial (multithread)||Shared||1 node per job|
We recommend that users who need more memory request more cores, since memory is allocated automatically with each core, up to the per-core maximum.
The shared partition has a maximum run time of 72 hours (3 days). It is designed to run serial/sequential jobs and small parallel jobs with up to 24 cores. Multiple-node jobs will be rejected. It has 247 Haswell nodes (2.5 GHz, 24 cores per node) connected via an Infiniband FDR fabric. The memory limit per core is 4.9 GB, for a total of 117 GB per node. There is also a 600-job limit on the number of jobs a user/group can run at one time. Jobs from multiple users can share nodes.
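As a rough illustration of the advice to request more cores when a job needs more memory, the sketch below estimates how many cores to ask for on a partition with a fixed per-core memory limit. The function name `cores_for_memory` is hypothetical, and treating 4.9 GB as 4900 MB is an assumption made to keep the arithmetic in integers.

```shell
#!/bin/bash
# Hypothetical helper: estimate how many cores to request for a memory need.
# Assumption: the 4.9 GB per-core limit is approximated as 4900 MB.
cores_for_memory() {
    local mem_mb=$1
    local per_core_mb=${2:-4900}
    # Ceiling division: round up to a whole number of cores.
    echo $(( (mem_mb + per_core_mb - 1) / per_core_mb ))
}

# A job that needs about 20 GB of memory.
cores_for_memory 20000   # prints 5
```

The result could then be passed to the scheduler as the core count for the job.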
The parallel partition has a maximum run time of 72 hours (3 days). This partition is exclusive, meaning its nodes are dedicated to parallel jobs. For multi-node jobs it is recommended that users add the option “--contiguous” for best communication performance. This also means the job may take longer to be scheduled (a trade-off). There are 497 nodes assigned to this partition.
The unlimited partition is intended for jobs that run for a long period of time. Users can set the time request to “unlimited”, but we strongly recommend requesting a finite limit, for example 30 days (30-00:00:00). There are 12 nodes on this partition, a combination of GPU, standard, and large memory nodes. Both serial and parallel jobs can be run on this partition. We also strongly recommend that users develop a strategy to checkpoint their jobs: if the compute nodes or the system go down, there is no guarantee that jobs will be saved. There are two gpuk80 nodes on this partition for jobs that require more than two days of running time.
The debug partition is intended for testing, not for production throughput. Each user is limited to 48 cores, 1 job, and a maximum of 2 nodes. Users may also wish to compile their codes on this partition.
The gpuk80 partition has 67 nodes with 2 Nvidia K80 GPUs per node. Users need to add the flag #SBATCH --gres=gpu:n (n = 1 to 4) to gain access to the GPUs. The maximum time limit is 48 hrs (2 days).
The gpup100 partition has 5 nodes with Nvidia P100 GPUs: gpu073 is a 16 GB P100 GPU node, and nodes 74-77 are 12 GB P100 GPU nodes.
A partition with two Nvidia V100 GPU nodes. gpudev001 is a 4-CPU node (Skylake Gold 6130, 2.1 GHz, 64 cores, 376 GB RAM) with 4 Nvidia V100 GPUs of 32 GB each. The other node, gpudev002 (Skylake 6126, 2.6 GHz, 24 cores, 96 GB RAM), has 2 Nvidia V100 GPUs of 16 GB each.
The scavenger partition is meant as a stop-gap once the group has run out of regular allocated hours. It requires the use of
The express partition is a high-availability partition which offers up to 6 cores total for up to 12 hours. It is designed to reduce bottlenecks on the login nodes and offers a very low wait time. Users who wish to do post-processing tasks and data analysis should use this partition. It is the default partition and is used automatically if you do not ask for a specific partition. Note that codes should not be compiled on this partition because it uses a newer architecture than most of the machine. Use the debug nodes for compiling.
The “skylake” partition is a shared partition with nodes of the architecture of the same name. It has the same limits as the “shared” partition but is kept separate because of its slightly lower memory per node. We recommend that users who need more memory request more cores, since memory is allocated automatically with each core, up to the per-core maximum. Note that codes should not be compiled on this partition because it uses a newer architecture than most of the machine. Use the debug nodes for compiling.
This partition has a maximum run time of 72 hours (3 days).
Here is a list of useful SLURM commands. The equivalent commands for Torque/Maui are given for reference. NOTE: users can still use Torque commands such as qsub, qdel, and qstat.
|Submit a job script||sbatch script-name (qsub available)||qsub script-name|
|Queue list and features||sinfo||qstat -Q|
|Node list||sinfo||pbsnodes -l|
|List all jobs||squeue||qstat -a|
|List jobs by user||squeue -u [userid]||qstat -u [userid]|
|Check job status||squeue -j [job-id] (qstat -a available)||qstat -a [job-id]|
|Delete a job||scancel [job-id] (qdel available)||qdel [job-id]|
|Hold a job||scontrol hold|
|Release a held job||scontrol release|
|Change job resources||scontrol update|
|Show finished jobs||sacct|
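For users moving scripts over from Torque/Maui, the table above can be captured as a small lookup. This is only a sketch: the function name `torque_to_slurm` is hypothetical, and it covers only the rows of the table that list a Torque equivalent.

```shell
#!/bin/bash
# Hypothetical lookup built from the command table above.
# Only rows that list a Torque/Maui equivalent are included.
torque_to_slurm() {
    case "$1" in
        "qsub")        echo "sbatch" ;;
        "qstat -Q")    echo "sinfo" ;;
        "pbsnodes -l") echo "sinfo" ;;
        "qstat -a")    echo "squeue" ;;
        "qstat -u")    echo "squeue -u" ;;
        "qdel")        echo "scancel" ;;
        *)             echo "no equivalent listed"; return 1 ;;
    esac
}

torque_to_slurm "qstat -a"   # prints squeue
```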
SLURM sets environment variables that can be used in your script. Here is a table with the most common variables and a LOG file of the SLURM variables set by a SLURM job.
|Submit Directory||$SLURM_SUBMIT_DIR (default)||$PBS_O_WORKDIR|
|Job Array Index||$SLURM_ARRAY_TASK_ID||$PBS_ARRAYID|
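A script that has to run under either scheduler can read these variables with fallbacks. The sketch below assumes only the variable names from the table; the helper names `job_workdir` and `array_index` are hypothetical, and the fallback values (the current directory and index 0) are choices made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: the submit directory under SLURM or Torque,
# falling back to the current directory outside a job.
job_workdir() {
    echo "${SLURM_SUBMIT_DIR:-${PBS_O_WORKDIR:-$PWD}}"
}

# Hypothetical helper: the job array index under SLURM or Torque,
# falling back to 0 when the job is not part of an array.
array_index() {
    echo "${SLURM_ARRAY_TASK_ID:-${PBS_ARRAYID:-0}}"
}

echo "Working directory: $(job_workdir), array index: $(array_index)"
```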
This is a list of the most common flags that any user may include on scripts to request different resources and features for jobs.
|Wall time hours||#SBATCH -t[min] or -t[days-hh:min:sec]|
|Number of nodes requested||#SBATCH -N min-max|
|Number of cores per node requested|
|Send mail at the end of the job|
|User's email address|
|Copy user's environment|
|Account to Charge|
|Quality of Service|
|Use specific resource|
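As an illustration of how such flags are combined at the top of a job script, here is a minimal sketch using standard sbatch options. Every value (time, counts, email address, account name) is a placeholder rather than a site default, and the options shown are not necessarily the exact forms listed in the original table.

```shell
#!/bin/bash
# All values below are placeholders for illustration only.
# Wall time in days-hh:mm:ss
#SBATCH --time=0-02:00:00
# Number of nodes
#SBATCH --nodes=1
# Number of tasks (cores) per node
#SBATCH --ntasks-per-node=4
# Send mail at the end of the job
#SBATCH --mail-type=end
# Address to send the mail to (placeholder)
#SBATCH --mail-user=user@example.org
# Copy the submitting environment into the job
#SBATCH --export=ALL
# Account to charge (hypothetical account name)
#SBATCH --account=my-account

# SLURM_JOB_ID is set by the scheduler; the fallback is for running by hand.
job_label="${SLURM_JOB_ID:-not-under-SLURM}"
echo "Job ${job_label} started on $(hostname)"
```

The #SBATCH lines are ordinary comments to bash, so the script can be run by hand for a quick syntax check before it is submitted.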
Important Flags for Your Jobs
Users need to pay special attention to these flags because proper management will benefit both the user and the scheduler:
Walltime requested using --time should be larger than, but close to, the actual processing time. If the requested time is not enough, the job will be aborted before the program finishes and results may be lost, while SUs will still be charged to your allocation. On the other hand, if the requested time is too long, the job will remain in the queue longer while the scheduler tries to allocate the resources needed. Once resources are allocated to your job, they are unavailable to other jobs, which affects the scheduler’s ability to allocate resources efficiently for all users.
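One way to pick a value that is larger than, but close to, the real run time is to take a measured run time and add a modest safety margin. The sketch below does this arithmetic and prints the result in days-hh:mm:ss form. The function name `walltime_with_margin` and the 20% default margin are assumptions chosen for illustration, not a site recommendation.

```shell
#!/bin/bash
# Hypothetical helper: convert a measured run time in seconds plus a
# safety margin in percent into a days-hh:mm:ss value for --time.
walltime_with_margin() {
    local measured=$1
    local margin=${2:-20}
    local total=$(( measured * (100 + margin) / 100 ))
    printf '%d-%02d:%02d:%02d\n' \
        $(( total / 86400 )) \
        $(( total % 86400 / 3600 )) \
        $(( total % 3600 / 60 )) \
        $(( total % 60 ))
}

# A test run took one hour; request 20% more.
walltime_with_margin 3600 20   # prints 0-01:12:00
```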
Nodes, tasks, and cpus
Dedicated nodes can be specified with the --exclusive flag, in which case all CPUs and memory on each node will be allocated. Programs that rely heavily on data transfer between tasks may be suited for exclusive nodes. If exclusive nodes are not needed, whether because the jobs are too small for a single node or because they do not leverage shared memory, the --shared flag will designate that a fraction of each node may be used.
Parallel processing may be done with multiple processes, multiple threads, or a combination of both. A single process may have multiple threads sharing memory. Multiple processes require some means of communication, for example MPI. In SLURM, the number of processes is controlled by setting the number of “tasks”, while threads are controlled by the number of “cpus” (see below for relevant flags).
The number of nodes can be specified using the -N flag, which takes the form min-max (e.g. 2-4). If a single number is given, the scheduler will allocate exactly that number of nodes. You can also specify the resources needed by giving the number of tasks with -n along with --cpus-per-task, in which case the scheduler will decide on the appropriate number of nodes for your job. You may also specify --ntasks-per-node, which is multiplied by --cpus-per-task if both are used. Be aware that if you ask for more CPUs than are available in a single node, the scheduler will reject your request with an error. Similarly, you may be denied the use of the “shared” partition if you ask for more than 1 node. Finally, you may also request a minimum number of CPUs with the
The --mem flag specifies the total amount of memory per node. The --mem-per-cpu flag specifies the amount of memory per allocated CPU. The two flags are mutually exclusive.
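The arithmetic behind these flags can be checked before submitting. The sketch below multiplies tasks per node by CPUs per task, compares the result with a node size, and computes the per-node memory implied by a per-CPU request. The function names are hypothetical, and the default of 24 cores per node is an assumption taken from the partition table.

```shell
#!/bin/bash
# Hypothetical helper: CPUs requested on each node when both
# --ntasks-per-node and --cpus-per-task are given.
cpus_per_node() {
    local ntasks_per_node=$1 cpus_per_task=$2
    echo $(( ntasks_per_node * cpus_per_task ))
}

# Hypothetical check: succeeds if the request fits on one node.
# Assumption: 24 cores per node unless a third argument is given.
fits_on_node() {
    local requested
    requested=$(cpus_per_node "$1" "$2")
    [ "$requested" -le "${3:-24}" ]
}

# Hypothetical helper: memory per node (MB) implied by a
# --mem-per-cpu value (MB) and the CPU layout.
mem_per_node_mb() {
    local ntasks_per_node=$1 cpus_per_task=$2 mem_per_cpu_mb=$3
    echo $(( ntasks_per_node * cpus_per_task * mem_per_cpu_mb ))
}

if fits_on_node 4 6; then
    echo "4 tasks x 6 CPUs = $(cpus_per_node 4 6) CPUs, which fits on one node"
fi
```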
If your job does not need a particular amount of memory, that is, it will run within the minimum amount of memory per node (95 GB), use these lines in your script:
#SBATCH -p parallel
#SBATCH --mem=0
This means the job will use all of the memory available on the node.
Requesting GPUs requires both the right partition and the “gres” flag. One must also request a total of 6 CPUs per GPU with a combination of --ntasks-per-node and --cpus-per-task. For example:
#SBATCH -p gpuk80
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
If you would like an interactive session, you can use the provided script:
$ interact -n 6 -p gpuk80 -g 1
Note that the interact script has the shortcut -g flag for requesting GPUs and does not take “gres” as an input. In the example we requested -n 6 because each GPU is associated with 6 CPUs (cores).
The environment variable that lists the devices your job is assigned to is $CUDA_VISIBLE_DEVICES. For example, if you requested 2 GPUs, this variable may be set to “1,3”, which indicates the devices visible to your code. When setting the CUDA device to use, you pass cudaSetDevice() the index of the device (beginning at zero), and it is mapped to the node’s device numbering using $CUDA_VISIBLE_DEVICES. In this example, cudaSetDevice(0) selects device number 1 and cudaSetDevice(1) selects device number 3.
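The mapping from the zero-based index seen by the code to the node’s device number can be sketched in a few lines of shell. The function name `physical_gpu` is hypothetical, and the value “1,3” is set by hand here only to mirror the example in the text; inside a job the variable is set for you.

```shell
#!/bin/bash
# Hypothetical helper: map the zero-based device index used by the code
# to the node's device number listed in CUDA_VISIBLE_DEVICES.
physical_gpu() {
    local logical=$1
    local -a devices
    IFS=',' read -r -a devices <<< "${CUDA_VISIBLE_DEVICES:-}"
    if [ "$logical" -ge "${#devices[@]}" ]; then
        echo "device index $logical is not visible to this job" >&2
        return 1
    fi
    echo "${devices[$logical]}"
}

# Example value from the text; normally set by the scheduler.
CUDA_VISIBLE_DEVICES="1,3"
physical_gpu 0   # prints 1
physical_gpu 1   # prints 3
```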
A job array can be specified when a script is submitted using sbatch. For example,
$ sbatch --array=0-15%4 script.sh
would submit script.sh 16 times, with ids 0 through 15. The %4 is optional and allows only 4 jobs to run concurrently.
Within script.sh, there are three environment variables that can be used:
$SLURM_JOBID is sequential for each job and depends on the queue;
$SLURM_ARRAY_JOB_ID is the same for all jobs in the array and equal to the $SLURM_JOBID of the first job; and
$SLURM_ARRAY_TASK_ID is equal to the index specified with the array option (which could be, for example, --array=1-7:2, where 2 is the step size).
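To see which task ids a given array specification produces, the range, step, and concurrency limit can be unpacked by hand. The sketch below handles only the simple form first-last with an optional :step and an optional %limit; the function name `expand_array_spec` is hypothetical and the parsing is an illustration, not how the scheduler itself is implemented.

```shell
#!/bin/bash
# Hypothetical helper: list the task ids produced by an --array value of
# the form first-last[:step][%limit]. The %limit part only throttles how
# many tasks run at once, so it does not change the list of ids.
expand_array_spec() {
    local spec=${1%%\%*}      # drop an optional %limit suffix
    local range=${spec%%:*}   # the first-last part
    local step=1
    if [[ $spec == *:* ]]; then
        step=${spec##*:}
    fi
    local first=${range%%-*}
    local last=${range##*-}
    local ids=() i
    for (( i = first; i <= last; i += step )); do
        ids+=("$i")
    done
    echo "${ids[*]}"
}

expand_array_spec "1-7:2"   # prints 1 3 5 7
expand_array_spec "0-3%2"   # prints 0 1 2 3
```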
To specify slurm stdin, stdout, and stderr files, use %A instead of SLURM_ARRAY_JOB_ID and %a instead of SLURM_ARRAY_TASK_ID. For example:
$ sbatch -o slurm-%A_%a.out --array=0-15%4 script.sh
would output to files named slurm-45_0.out, slurm-45_1.out, slurm-45_2.out, … (assuming 45 is the id of the first job), because %A is the same for every task in the array.
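The substitution of %A and %a can be mimicked in shell to preview the file names a pattern will generate. This is a sketch only: the function name `array_output_name` is hypothetical and the ids are made-up examples.

```shell
#!/bin/bash
# Hypothetical helper: expand %A (array job id) and %a (array task id)
# in an output file pattern, to preview the resulting file names.
array_output_name() {
    local pattern=$1 array_job_id=$2 task_id=$3
    printf '%s\n' "$pattern" \
        | sed -e "s/%A/${array_job_id}/g" -e "s/%a/${task_id}/g"
}

# Preview the first three file names for an array whose job id is 45.
for task in 0 1 2; do
    array_output_name "slurm-%A_%a.out" 45 "$task"
done
```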
A simple script to run an MPI job using 24 cores (a single node) would look like this:
#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --time=24:0:0
#SBATCH --partition=shared
#SBATCH --nodes=1
# number of tasks (processes) per node
#SBATCH --ntasks-per-node=24
#SBATCH --mail-type=end
#SBATCH --mail-user=firstname.lastname@example.org
#### load and unload modules you may need
# module unload openmpi/intel
# module load mvapich2/gcc/64/2.0b
module list
#### execute code and write output file to OUT-24log.
# time mpiexec ./code-mvapich.x > OUT-24log
echo "Finished with job $SLURM_JOBID"
#### mpiexec by default launches number of tasks requested
The squeue command will display all jobs that have been submitted to the queues. The output is usually long due to the large number of jobs running or waiting to be executed. The “sqme” script will show jobs that belong to the user.
$ sqme
Wed Sep 21 11:53:58 2016
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
88791 parallel Job1 jcombar1 RUNNING 1:22 1:10:00 4 compute[0301-0304]
The columns are self-explanatory. TIME indicates the time the job has consumed, TIME_LIMIT the maximum amount of time requested and NODELIST shows the nodes where the job is running.
Submitting/Canceling a Job
Jobs are usually submitted via a script file (see above). The sbatch command is used:
$ sbatch my-script.scr
88791
The number that shows after the script is submitted corresponds to the JobID. It can be used to cancel/kill the job:
$ scancel 88791
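In scripts it is often convenient to capture the JobID at submission time so it can be passed to scancel later. Stock sbatch normally prints a line such as “Submitted batch job 88791”, so the sketch below takes the last word of whatever sbatch printed. The function name `extract_jobid` is hypothetical, and the assumption that the id is the last word should be checked against the output on your system.

```shell
#!/bin/bash
# Hypothetical helper: pull the job id out of the text printed by sbatch.
# Assumption: the id is the last whitespace-separated word of the output.
extract_jobid() {
    local output=$1
    echo "${output##* }"
}

extract_jobid "Submitted batch job 88791"   # prints 88791

# On a system with SLURM, usage could look like this:
#   jobid=$(extract_jobid "$(sbatch my-script.scr)")
#   scancel "$jobid"
```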