Introduction to batch jobs
A batch job refers to a task or a series of tasks that can be executed without user intervention. These jobs are submitted to a job scheduler, which manages resources and executes them when the required resources (such as CPUs, memory, etc.) become available. Unity uses Slurm, a popular open-source job scheduler used in many supercomputing clusters and high-performance computing (HPC) setups.
sbatch is a command within Slurm that is used to submit batch jobs. sbatch is a non-blocking command: it returns as soon as the job has been submitted instead of waiting for the job to run. If the resources requested in the batch job are unavailable, the job is placed in a queue and starts once resources become available. To see the status of all your jobs while they are pending or running, use the squeue --me command. Alternatively, to see the status of a certain job at any time, use the command sacct -j YOUR_JOBID.
sbatch is based around running a single file. You don’t need to specify any parameters in the command other than sbatch <your batch file>, because you can specify all other parameters inside the file itself.
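The submit-then-check workflow described above can be sketched as follows. This is a minimal illustration, not a required pattern: job_id_from_sbatch_output is a hypothetical helper name and my_job.sh is a placeholder for your own batch file. It relies on sbatch printing a confirmation line of the form "Submitted batch job <id>", and it only attempts a submission when sbatch is actually available.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of sbatch's
# "Submitted batch job <id>" confirmation line (read from stdin).
job_id_from_sbatch_output() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Only try to submit when sbatch exists (that is, on the cluster)
# and the placeholder batch file my_job.sh is present.
if command -v sbatch >/dev/null 2>&1 && [ -f my_job.sh ]; then
    job_id=$(sbatch my_job.sh | job_id_from_sbatch_output)
    echo "Submitted job ${job_id}"
    squeue --me              # your pending and running jobs
    sacct -j "${job_id}"     # status of this job, also after it ends
fi
```

Because sbatch returns right away, the squeue and sacct calls run immediately after submission and will typically show the job as pending or just started.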
Create a batch job
A batch script must start with an interpreter line, such as #!/bin/bash, as its first line. If you are unsure of which interpreter to use, use #!/bin/bash. The #SBATCH <param> lines must come after the interpreter line and before any executable command in the script, because sbatch stops processing #SBATCH lines once it reaches the first command.
The following is an example of a batch script with common sbatch parameters and a simple script:
#!/bin/bash
#SBATCH -c 4 # Number of Cores per Task
#SBATCH --mem=8192 # Requested Memory (in MB when no unit is given)
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH -t 01:00:00 # Job time limit
#SBATCH -o slurm-%j.out # %j = job ID
module load cuda/10.1.243
/modules/apps/cuda/10.1.243/samples/bin/x86_64/linux/release/deviceQuery
The last two lines of this example load the required module and run the program. As defined by the parameters, the job is allocated four CPU cores and one GPU in the gpu partition. The program queries the available GPUs and, because only one GPU is allocated to the job, prints only one device to the specified file.
Feel free to remove or modify any of the parameters in the script to suit your needs. Slurm also provides a wide variety of additional parameters for use with sbatch.
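As one possible variation, the same kind of request can be written with the long-form names of the short options used above. The sketch below also prints two environment variables that Slurm sets inside a job (SLURM_JOB_ID and SLURM_CPUS_PER_TASK). The fallback values are an assumption added here so that the script also runs by hand outside of Slurm, and the echo line stands in for your real workload.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=4        # same as -c 4
#SBATCH --mem=8192               # memory in MB when no unit suffix is given
#SBATCH --partition=gpu          # same as -p gpu
#SBATCH --gpus=1                 # same as -G 1
#SBATCH --time=01:00:00          # same as -t 01:00:00
#SBATCH --output=slurm-%j.out    # same as -o slurm-%j.out

# Slurm exports these variables inside a job. The fallbacks after ":-"
# only take effect when the script is run directly, outside of Slurm.
cpus="${SLURM_CPUS_PER_TASK:-1}"
job_id="${SLURM_JOB_ID:-none}"

# Placeholder payload: report what the job was given.
echo "job=${job_id} cpus=${cpus}"
```

Reading the core count from SLURM_CPUS_PER_TASK instead of hard-coding it keeps the workload in step with the #SBATCH request if you later change it.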
Receive emails about your job status
To receive emails based on the status of your job, use the --mail-type argument. Common mail types are BEGIN, END, FAIL, INVALID_DEPEND, and REQUEUE. For more information on which mail type makes the most sense for you, see Slurm’s sbatch page.
To check that the email feature works for you with either salloc or sbatch, use the following code samples:
salloc --mail-type=BEGIN /bin/true
Or:
#!/bin/bash
#SBATCH --mail-type=BEGIN
/bin/true
The BEGIN mail type sends you an email once your job begins.
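Several mail types can be requested at once by separating them with commas. The following is a small sketch under that assumption; you@example.com is a placeholder address for Slurm's --mail-user option, and the five-minute time limit and trivial payload are illustrative choices only.

```shell
#!/bin/bash
#SBATCH --mail-type=BEGIN,END,FAIL      # comma-separated list of events
#SBATCH --mail-user=you@example.com     # placeholder address, replace with your own
#SBATCH -t 00:05:00                     # short limit, this is only a test job

# Trivial payload so the job starts and finishes quickly.
/bin/true
status=$?
echo "payload exit status: ${status}"
```

With these settings you would expect one email when the job starts and another when it ends, with a FAIL email only if the job fails.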
To send the emails to a specific address, use the --mail-user argument.
Receive a time limit email to prevent a loss of work
Your job will be terminated as soon as it reaches its time limit, regardless of how close it was to finishing its task. Without checkpointing, those CPU hours would be lost, and you would have to schedule the job all over again.
Another way to prevent losing your work is to check on your job’s output as it approaches its time limit.
To receive an email as your job approaches its time limit, use the --mail-type=TIME_LIMIT_80 argument. With this argument, Slurm emails you if 80% of the time limit has passed and your job is still running. Then, you can check on the job’s output and determine if it will finish in time. If you do not think your job will finish in time, email us at hpc@umass.edu or ask on the Community Slack and we can extend your job’s time limit.
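To make the 80% threshold concrete, the sketch below works out roughly when the TIME_LIMIT_80 email would be triggered for a given limit. time_limit_80_seconds is a hypothetical helper written for this illustration, and it assumes the limit is given in HH:MM:SS form, which is only one of the formats that -t accepts.

```shell
#!/bin/bash
#SBATCH -t 01:00:00                     # job time limit
#SBATCH --mail-type=TIME_LIMIT_80       # email at 80% of the time limit

# Hypothetical helper: given a time limit as HH:MM:SS, print the number
# of seconds into the job at which 80% of the limit has elapsed.
time_limit_80_seconds() {
    echo "$1" | awk -F: '{ print int(($1 * 3600 + $2 * 60 + $3) * 80 / 100) }'
}

time_limit_80_seconds 01:00:00   # prints 2880, that is, 48 minutes
```

For the one-hour limit above, that leaves about 12 minutes between the email and the job being terminated, so a longer limit gives you more time to react.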