Queuing system (SLURM)
MARCC uses SLURM (Simple Linux Utility for Resource Management) to manage resource scheduling and job submission. SLURM is an open source application with active developers and a growing user community. It has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing; that is, “interactive” use of login nodes for job processing is not allowed. Users who need to interact with their codes while they are running can request an interactive session using the script “interact”, which submits a request to the queuing system for interactive access to a node.
SLURM uses “partitions” to divide types of jobs (partitions are called queues on other schedulers). MARCC defines a few partitions for sequential/shared computing, parallel computing (dedicated or exclusive nodes), GPU jobs, and large-memory jobs. The default partition is set to sequential. There is also a partition called “debug” that can use as many as 2 nodes. The following table describes the attributes of the different partitions:
Note on the scavenger partition
The scavenger partition is meant as a stop-gap once the group has run out of regular allocated hours. It requires the use of
|Partition||Default/Max time (hours)||Default/Max cores per node||Default/Max Mem per node||Serial/Parallel||Backfilling|
|shared||1 hour / 100 hours||1 / 24||5 GB / 128 GB||Serial, multi-thread||Shared|
|unlimited||Unlimited; jobs not guaranteed to complete||1 / 24||5 GB / 128 GB||Serial, parallel||Shared|
|parallel||1 hour / 100 hours||24 / 24||5 GB / 128 GB||Parallel||Exclusive|
|gpu||1 hour / 100 hours||1 / 24||5 GB / 128 GB CPU, 20 GB GPU|
|lrgmem||1 hour / 100 hours||1 / 48||120 GB / 1024 GB||Serial, parallel||Shared|
|scavenger||6 hours||1 / 24||5 GB / 128 GB||Serial, parallel||Shared|
Here is a list of useful SLURM commands. The equivalent commands for Torque/Maui are given for reference. NOTE: users can still use Torque commands like qsub, qdel, and qstat.
|Submit a job script||sbatch script-name (qsub available)||qsub script-name|
|Queue list and features||sinfo||qstat -Q|
|Node list||sinfo||pbsnodes -l|
|List all jobs||squeue||qstat -a|
|List jobs by user||squeue -u [userid]||qstat -u [userid]|
|Check job status||squeue -j [job-id] (qstat -a available)||qstat -a [job-id]|
|Delete a job||scancel [job-id] (qdel available)||qdel [job-id]|
|Hold a job||scontrol hold|
|Release a held job||scontrol release|
|Change job resources||scontrol update|
|Show finished jobs||sacct|
SLURM will set or preset environment variables that can be used in your script. Here is a table with the most common variables and a LOG file of the SLURM variables set by a SLURM job.
|Submit Directory||$SLURM_SUBMIT_DIR (default)||$PBS_O_WORKDIR|
|Job Array Index||$SLURM_ARRAY_TASK_ID||$PBS_ARRAYID|
This is a list of the most common flags that any user may include on scripts to request different resources and features for jobs.
|Wall time hours||#SBATCH -t[min] or -t[days-hh:min:sec]|
|Number of nodes requested||#SBATCH -N min-max|
|Number of cores per node requested|
|Send mail at the end of the job|
|User's email address|
|Copy user's environment|
|Account to Charge|
|Quality of Service|
|Use specific resource|
Important Flags for your jobs
Users need to pay special attention to these flags because proper management will benefit both the user and the scheduler:
Walltime requested using --time should be larger than, but close to, the actual processing time. If the requested time is not enough, the job will be aborted before the program finishes and results may be lost, while SUs will still be charged to your allocation. On the other hand, if the requested time is too long, the job will remain in the queue longer while the scheduler tries to allocate the resources needed. Once resources are allocated to your job, they are unavailable to other jobs, which affects the scheduler’s ability to allocate resources efficiently for all users.
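One way to pick a value that is "larger than, but close to" the real runtime is to time a representative run and pad it by a fixed margin. The sketch below assumes a hypothetical helper name (`walltime_with_margin`) and a default 20% margin, neither of which comes from this guide; it only formats a number of seconds as the days-hh:mm:ss form that --time accepts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn a measured runtime (in seconds) plus a safety
# margin (in percent, default 20) into a days-hh:mm:ss string for --time.
walltime_with_margin() {
    local seconds=$1 margin_pct=${2:-20}
    # Integer arithmetic, rounded up so the padded value is never too small.
    local total=$(( (seconds * (100 + margin_pct) + 99) / 100 ))
    printf '%d-%02d:%02d:%02d\n' \
        $(( total / 86400 )) \
        $(( total % 86400 / 3600 )) \
        $(( total % 3600 / 60 )) \
        $(( total % 60 ))
}

# A test run that took one hour, padded by 20%:
walltime_with_margin 3600 20    # prints 0-01:12:00
```

The printed string could then be used as, for example, `#SBATCH --time=0-01:12:00`.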
Nodes, tasks, and cpus
Dedicated nodes can be specified with the --exclusive flag, in which case all CPUs and memory of each node will be allocated to the job. Programs that rely heavily on data transfer between tasks may be suited for exclusive nodes. If exclusive nodes are not needed, whether because the jobs are too small for a single node or because they do not leverage shared memory, the --shared flag will designate that a fraction of each node may be used.
Parallel processing may be done with multiple processes, multiple threads, or a combination of both. A single process may have multiple threads sharing memory. Multiple processes need some means of communication, for example MPI. In SLURM, the number of processes is controlled by setting the number of “tasks”, while threads are controlled by the number of “cpus” (see below for relevant flags).
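As a rough illustration of the tasks/cpus distinction, a job script can read back what SLURM allocated and pass the thread count on to the program. This is a minimal sketch that assumes the program is threaded with OpenMP (so `OMP_NUM_THREADS` applies; that variable is not mentioned in this guide) and uses a hypothetical function name, `layout_summary`. `SLURM_NTASKS` and `SLURM_CPUS_PER_TASK` are only set when the matching options are given, so both fall back to 1 here.

```bash
#!/usr/bin/env bash
# Sketch: derive the process/thread layout from the job's environment.
layout_summary() {
    local tasks=${SLURM_NTASKS:-1}            # processes (e.g. MPI ranks)
    local threads=${SLURM_CPUS_PER_TASK:-1}   # threads per process
    # Assumption: the program uses OpenMP, which reads OMP_NUM_THREADS.
    export OMP_NUM_THREADS=$threads
    echo "processes=$tasks threads_per_process=$threads total_cpus=$(( tasks * threads ))"
}

# Outside a SLURM job this reports a single process with a single thread.
layout_summary
```

With 4 tasks and 6 cpus per task, the same function would report 24 CPUs in total.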
The number of nodes can be specified using the -N flag, which takes the form min-max (e.g. 2-4). If a single number is given, the scheduler will allocate exactly that number of nodes. You can also specify the resources needed by giving the number of tasks with -n along with --cpus-per-task, in which case the scheduler will decide on the appropriate number of nodes for your job. You may also specify --ntasks-per-node; if it is used together with --cpus-per-task, the number of CPUs requested per node is the product of the two. Be aware that if you ask for more CPUs than are available in a single node, the scheduler will reject your request with an error. Finally, you may also request a minimum number of CPUs with the
The --mem flag specifies the total amount of memory per node, while --mem-per-cpu specifies the amount of memory per allocated CPU. The two flags are mutually exclusive.
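Because --ntasks-per-node and --cpus-per-task multiply, it can help to check the product against the node size before submitting. The following is only a sketch: `check_node_request` is a hypothetical helper, and the default of 24 cores per node is taken from the partition table above (lrgmem nodes have 48).

```bash
#!/usr/bin/env bash
# Hypothetical pre-submission check: does ntasks-per-node x cpus-per-task
# fit on one node? Third argument is the node size (default 24 cores).
check_node_request() {
    local ntasks_per_node=$1 cpus_per_task=$2 cores_per_node=${3:-24}
    local requested=$(( ntasks_per_node * cpus_per_task ))
    if [ "$requested" -gt "$cores_per_node" ]; then
        echo "requests $requested CPUs per node, but only $cores_per_node are available"
        return 1
    fi
    echo "requests $requested of $cores_per_node CPUs per node"
}

check_node_request 4 6            # 4 x 6 = 24 CPUs: fits a 24-core node
check_node_request 6 6 || true    # 6 x 6 = 36 CPUs: too many for one node
```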
Requesting GPUs requires both the right partition and the “gres” flag. Also, one must request a total of 6 CPUs per GPU with a combination of
--cpus-per-task. For example:
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
If you would like an interactive session, you can use the provided script:
$ interact -n 6 -p gpu -g 1
Note that the interact script has the shortcut -g flag for requesting GPUs and does not take “gres” as an input. In the example we requested -n 6 because each GPU is associated with 6 CPUs (cores).
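The 6-CPUs-per-GPU rule can be applied mechanically when building the interact command. The sketch below uses a hypothetical helper, `gpu_interact_cmd`, and assumes the rule scales linearly to more than one GPU; it only prints the command line, using the -n, -p, and -g flags shown above, and does not run it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print an interact command for a given GPU count,
# assuming 6 CPUs (cores) are requested for each GPU.
gpu_interact_cmd() {
    local gpus=$1 cpus_per_gpu=6
    echo "interact -n $(( gpus * cpus_per_gpu )) -p gpu -g $gpus"
}

gpu_interact_cmd 1    # prints: interact -n 6 -p gpu -g 1
gpu_interact_cmd 2    # prints: interact -n 12 -p gpu -g 2
```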
The environment variable with the devices your job is assigned to is $CUDA_VISIBLE_DEVICES. For example, if you requested 2 GPUs, this variable may be set to “1,3”, which indicates the devices visible to your code. When setting the CUDA device to use, you pass cudaSetDevice() the index of the device (beginning at zero), and it maps to the node’s device numbering through $CUDA_VISIBLE_DEVICES. In this example, cudaSetDevice(0) selects device number 1 and cudaSetDevice(1) selects device number 3.
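The index mapping described above can be reproduced in the shell, which is sometimes handy for logging which physical device a job landed on. This is a sketch with a hypothetical helper name (`physical_device`); it takes the device list as an argument so that the example is reproducible, whereas inside a job one would pass "$CUDA_VISIBLE_DEVICES".

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a zero-based device index, as seen by the code,
# to the node's device number, given a CUDA_VISIBLE_DEVICES-style list.
physical_device() {
    local visible=$1 index=$2
    # awk fields are 1-based, device indices are 0-based.
    echo "$visible" | awk -F, -v i="$index" '{ print $(i + 1) }'
}

physical_device "1,3" 0    # prints 1
physical_device "1,3" 1    # prints 3
```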
A job array can be specified when a script is submitted using sbatch. For example,
$ sbatch --array=0-15%4 script.sh
would submit script.sh 16 times, with IDs 0 through 15. The %4 is optional and would allow only 4 jobs to run concurrently.
Within script.sh, there are three environment variables that can be used:
$SLURM_JOBID is sequential for each job and depends on the queue;
$SLURM_ARRAY_JOB_ID is the same for all jobs in the array and equal to the $SLURM_JOBID of the first job; and
$SLURM_ARRAY_TASK_ID is equal to the index specified with the array option (which could be, for example, --array=1-7:2, where 2 is the step size).
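To see which values $SLURM_ARRAY_TASK_ID will take for a given --array specification, the range can be expanded by hand. The sketch below uses a hypothetical helper, `expand_array_spec`, and handles only the single-range form first-last[:step][%limit] used in this section, not comma-separated lists.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the task indices produced by an --array spec
# of the form first-last[:step][%limit].
expand_array_spec() {
    local spec=${1%%\%*}        # drop the optional %limit suffix
    local range=${spec%%:*}     # "first-last"
    local step=1
    case $spec in
        *:*) step=${spec##*:} ;;
    esac
    # seq prints one index per line; paste joins them with spaces.
    seq "${range%%-*}" "$step" "${range##*-}" | paste -sd' ' -
}

expand_array_spec 1-7:2      # prints 1 3 5 7
expand_array_spec 0-15%4     # prints 0 through 15; %4 only limits concurrency
```

Inside script.sh, each task would then typically select its own input from $SLURM_ARRAY_TASK_ID.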
To specify slurm stdin, stdout, and stderr files, use %A instead of SLURM_ARRAY_JOB_ID and %a instead of SLURM_ARRAY_TASK_ID. For example:
$ sbatch -o slurm-%A_%a.out --array=0-15%4 script.sh
would output to files named slurm-45_0.out, slurm-45_1.out, slurm-45_2.out, … (assuming 45 is the id of the first job), because %A expands to the array’s job ID, which is the same for every task in the array.
A simple script to run an MPI job using 24 cores (a single node) would look like this:
#!/bin/bash -l
#SBATCH --job-name=MyJob
#SBATCH --time=24:0:0
#SBATCH --partition=shared
#SBATCH --nodes=1
# number of tasks (processes) per node
#SBATCH --ntasks-per-node=24
# number of cpus (threads) per task (process)
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=end
#SBATCH --mail-user=email@example.com

#### load and unload modules you may need
# module unload openmpi/intel
# module load mvapich2/gcc/64/2.0b
module list

#### execute code and write output file to OUT-24log.
# time mpiexec ./code-mvapich.x > OUT-24log
echo "Finished with job $SLURM_JOBID"

#### mpiexec by default launches number of tasks requested
The squeue command will display all jobs that have been submitted to the queues. The output is usually long due to the large number of jobs running or waiting to be executed. The “sqme” script will show jobs that belong to the user.
$ sqme
Wed Sep 21 11:53:58 2016
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
88791 parallel Job1 jcombar1 RUNNING 1:22 1:10:00 4 compute[0301-0304]
The columns are self-explanatory. TIME indicates the time the job has consumed, TIME_LIMIT (shown truncated as TIME_LIMI in the header) the maximum amount of time requested, and NODELIST shows the nodes where the job is running.
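Since the listing is plain whitespace-separated text, it can be filtered with standard tools. The sketch below assumes the column layout of the sample output (JOBID first, STATE fifth) and uses a hypothetical function name, `running_job_ids`; the sample text is fed in through a here-document so the example runs without the scheduler.

```bash
#!/usr/bin/env bash
# Hypothetical filter: print the IDs of jobs whose STATE column is RUNNING.
# Requiring a numeric first field skips the date and header lines.
running_job_ids() {
    awk '$1 ~ /^[0-9]+$/ && $5 == "RUNNING" { print $1 }'
}

# Sample listing copied from the sqme output shown in this section.
running_job_ids <<'EOF'
Wed Sep 21 11:53:58 2016
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
88791 parallel Job1 jcombar1 RUNNING 1:22 1:10:00 4 compute[0301-0304]
EOF
```

On the cluster, the same filter could be applied as `sqme | running_job_ids`.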
Submitting/cancelling a Job
Jobs are usually submitted via a script file (see above). The sbatch command is used:
$ sbatch my-script.scr
88791
The number that shows after the script is submitted corresponds to the JobID. It can be used to cancel/kill the job:
$ scancel 88791
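When submitting from another script, the JobID can be captured instead of copied by hand. This sketch uses a hypothetical helper, `job_id_from_sbatch`, and assumes the JobID is the last word on the first line of what sbatch prints, which covers both a bare number and the usual "Submitted batch job N" message.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read sbatch's output on stdin and print the JobID,
# assumed to be the last whitespace-separated field of the first line.
job_id_from_sbatch() {
    awk 'NR == 1 { print $NF }'
}

echo "Submitted batch job 88791" | job_id_from_sbatch    # prints 88791
echo "88791" | job_id_from_sbatch                        # prints 88791
```

On the cluster this could be combined as `jobid=$(sbatch my-script.scr | job_id_from_sbatch)`, followed later by `scancel "$jobid"` if the job needs to be cancelled.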