Slurm cheat sheet
Slurm is the job scheduler that we use on Unity. For an introduction to Slurm, see Introduction to Slurm: The Job Scheduler.
Common terms
The following is a list of common terms in Slurm:
- Node - a single computer.
- Socket - a single CPU.
- Core - a single compute unit inside a CPU.
- CPU - one core, except on POWER9 nodes, where it is a thread within a core.
- Job - a schedulable unit; an allocation of resources.
- Job step - a set of related processes within a job. .batch is the script as submitted; .0 ... .X are any srun invocations from within the script.
- Task - a process within a job step.
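Job steps show up as separate rows in sacct output. The listing below is an illustrative sketch (the job ID, job name, and step names are hypothetical): .batch is the submitted script and .0 is the first srun invocation within it.

```
$ sacct -j 123456 --format=JobID,JobName,State
JobID           JobName      State
------------- ---------- ----------
123456        myjob       COMPLETED
123456.batch  batch       COMPLETED
123456.0      hostname    COMPLETED
```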
Submit a job with arguments
To submit a job in Slurm, use the sbatch command. If you want to set parameters for your job, there are many arguments available for you to add to your batch file. See the Introduction to batch jobs page for examples of how to create a batch file with arguments. Note that any arguments specified on the command line when submitting your job override those in the file.
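For example, a command-line argument wins over the matching #SBATCH line inside the script (job.sh is a hypothetical script name):

```
# Overrides any #SBATCH --time=... line inside job.sh
sbatch --time=02:00:00 job.sh
```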
The following table contains a list of arguments you can use to specify parameters for your batch job. For more detailed information on submitting jobs in Slurm, see the sbatch manpage.
Argument | Description |
---|---|
General | |
--time= , -t | Set the worst-case estimate of job run time, in Days-Hours:Minutes:Seconds format. |
--time-min= | Set the minimum amount of time the job can usefully run for. This should be smaller than --time= , and may allow the scheduler to start the job sooner. |
--job-name= , -J | Set the name of the job (default: the script name). Use sacct --name=... to find it later. |
--output= , -o | Specify the file in which to place output. By default, error output is also placed in this file. |
--error= , -e | Specify the filename to place error output. Only use this argument if you want a separate place to store errors. |
--exclusive | Request entire nodes. This results in better performance for jobs that can use multiple cores or most of the memory, but generally results in longer queue times. Recommended with --mem=0 . |
--mail-type=... | Send an email when the job changes state. Usually FAIL,END,TIME_LIMIT_80 are the most useful. See sbatch manpage for a complete list of values. |
Compute Resources | |
--nodes=<n> , -N | Specify the number of nodes to use; should be 1 unless your program supports MPI. |
--cpus-per-task=<n> , -c | Specify the number of cores per task to allocate. |
--ntasks=<n> , -n | Specify the number of tasks to allocate space for (MPI=number of processes). |
--gpus=<n> , --gres=gpu:<n> , --gres=gpu:<type>:<n> | Specify the number of GPUs per job. See Using GPUs. |
--mem=<n>g | Specify the number of gigabytes of memory per node. |
Alternative ways to request resources | Note that some of these do not work together. |
--ntasks-per-node=<n> | Specify the number of tasks per node (considered maximum when used with --ntasks ). |
--ntasks-per-gpu=<n> | Specify the number of tasks per GPU allocated. |
--constraint=... | Specify node constraints. See the sbatch manpage for more information. |
--mem-per-cpu=<n>g | Specify the number of gigabytes of memory per core. |
--mem-per-gpu=<n>g | Specify the number of gigabytes of CPU memory per GPU (NOT VRAM). |
Related Jobs | |
--array=<indices> | Create an array job. See the sbatch manpage for more information. |
--dependency=... | Configure dependencies between jobs. See sbatch manpage for more information. |
Uncommon | |
--account=pi... | Use a given account (most relevant for classes and Gypsum users). |
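As a quick reference, the following is a minimal sketch of a batch file that combines several of the arguments above; the program name and resource values are placeholders, not recommendations:

```
#!/bin/bash
#SBATCH --job-name=example        # name shown by squeue/sacct
#SBATCH --time=0-04:00:00         # worst-case run time (D-HH:MM:SS)
#SBATCH --nodes=1                 # single node unless using MPI
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4         # cores for this task
#SBATCH --mem=16g                 # gigabytes of memory per node
#SBATCH --output=example_%j.out   # %j expands to the job ID
#SBATCH --mail-type=FAIL,END      # email on failure or completion

./my_program --threads "$SLURM_CPUS_PER_TASK"   # placeholder program
```

Submit it with sbatch example.sh.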
Job steps
Inside your batch file, use the srun command to specify the command to run across the allocated nodes.
It’s uncommon to need to specify other arguments with this command, but srun accepts most of the arguments from the table above if necessary, with the exceptions of --array and --dependency. See the srun manpage for more detailed information.
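For example, each srun invocation inside a batch script becomes its own job step (the program names are placeholders):

```
#!/bin/bash
#SBATCH --ntasks=4

srun ./preprocess   # job step .0
srun ./solve        # job step .1
```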
Interactive jobs
To start an interactive job, use the salloc command followed by arguments that specify details about your job.
Similar to srun, salloc takes the same arguments as sbatch, except --array. The Using SALLOC page has more information, as does the official salloc manpage.
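For example, a minimal interactive session might be requested like this (the resource values are illustrative):

```
# Request 2 cores and 8 GB of memory for one hour, then work interactively
salloc --ntasks=1 --cpus-per-task=2 --mem=8g --time=01:00:00
```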
Modify a job
It’s possible to change some job properties while a job is pending, and a few after it starts running, with the scontrol update jobid=<jobid> command. Use <tab> completion to see all the various parameters that can be changed for pending jobs. The following is a list of the most common arguments:
Argument | Description |
---|---|
arraytaskthrottle | Adjust the maximum number of array items that can run concurrently. |
mailtype | Change the events that generate an email for this job. |
mailuser | Set the email address to send notifications to; uses your account email by default. |
timelimit | Adjust the time limit for a job (while pending only). |
partition | Adjust the list of partitions the job is submitted to. |
qos | Set the QOS to use for this job (currently only adding short makes sense). |
nice | Lower the priority of a pending job. |
You can use the separate command scontrol top <jobid_list> to give specific jobs higher priority than your other jobs in the same partition. This command accepts a comma-separated list of job IDs and only works for jobs within a single partition.
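For example (all job IDs here are hypothetical):

```
# Extend the time limit of pending job 123456 to eight hours
scontrol update jobid=123456 timelimit=08:00:00

# Let at most 5 tasks of array job 123457 run concurrently
scontrol update jobid=123457 arraytaskthrottle=5

# Prioritize job 123458 above your other jobs in the same partition
scontrol top 123458
```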
Cancel a job
To cancel running and pending jobs, use scancel <jobid>. To cancel a running step, use scancel <jobid>.<step>.
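For example, with a hypothetical job ID of 123456:

```
scancel 123456     # cancel the whole job
scancel 123456.0   # cancel only job step 0
```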
Check a running job
There are multiple ways to check the progress and efficiency of a running job. See Monitoring a Batch job for details.
Check on recent job status
To check on the status of recent jobs, use the squeue command followed by an argument that specifies what type of information you want to view. Note that only jobs that are currently running or that finished in the last ~5 minutes are available through the squeue command.
The following table shows common argument options. For more details, see the squeue manpage.
Argument | Description |
---|---|
--me | Show only your jobs. |
--start | Show the most pessimistic estimate of when a job can start, if available, and the reason it’s waiting. In some cases the reason may not be available or may be wrong. |
-j <jobid> | Show the job specified. |
--account=pi... , -A | Show only jobs from a list of PI groups. |
--state=pd,r,f , -t | Show only jobs in the pending, running, or failed state. |
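For example (the job ID is hypothetical):

```
# Show only your jobs
squeue --me

# Show the estimated start time of a pending job
squeue -j 123456 --start
```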
Check on older job status
To check on the status of an older job, use the sacct command followed by an argument that specifies what type of information you want to view. The following table shows common argument options. For more detailed information, see the sacct manpage.
Argument | Description |
---|---|
--user=username | List jobs from another user (defaults to your own jobs only). |
-A , --account=pi... | List all jobs from a given group. |
--start=<date/time> --end=<date/time> | Show only jobs started or running between these times. Formats can be YYYY-MM-DDThh:mm:ss (with a literal T between date and time), YYYY-MM-DD , MMDD , or hh:mm . --end defaults to now, and --start defaults to the previous midnight. |
--state=... | Limit output to jobs in a given list of states. You must specify --end for this to work; requeue also requires specifying --duplicates . States include completed, failed, running, pending, node_fail, requeue, timeout . |
--name= | Limit result to jobs with a given name or list of names. |
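For example, to list your jobs that failed during a given window (the dates are illustrative):

```
sacct --start=2024-01-01 --end=now --state=failed
```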
Check on node status
To check on the status of Slurm nodes and partitions, use the sinfo command followed by an argument that specifies what type of information you want to view. The following table shows common argument options. For more details, see the sinfo manpage.
Argument | Description |
---|---|
--summary , -s | Show summary statistics of nodes (Allocated/Idle/Other/Total). |
--partition= , -p | Limit display to a list of partitions. |
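For example (the partition name is a placeholder):

```
# Summarize node availability across all partitions
sinfo --summary

# Limit the display to one partition
sinfo --summary --partition=cpu
```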