Documentation

  • Requesting An Account
  • Get Started
    • Quick Start
    • Common Terms
    • HPC Resources
    • Theory of HPC
      • Overview of threads, cores, and sockets in Slurm for HPC workflows
    • Git Guide
  • Connecting to Unity
    • SSH
    • Unity OnDemand
    • Connecting to Desktop VS Code
  • Get Help
    • Frequently Asked Questions
    • How to Ask for Help
    • Troubleshooting
  • Cluster Specifications
    • Node List
    • Partition List
      • Gypsum
    • Storage
    • Node Features (Constraints)
      • NVLink and NVSwitch
    • GPU Summary List
  • Managing Files
    • Command Line Interface (CLI)
    • Disk Quotas
    • FileZilla
    • Globus
    • Scratch: HPC Workspace
    • Unity OnDemand File Browser
  • Submitting Jobs
    • Batch Jobs
      • Array Batch Jobs
      • Large Job Counts
      • Monitor a batch job
    • Helper Scripts
    • Interactive CLI Jobs
    • Unity OnDemand
    • Message Passing Interface (MPI)
    • Slurm cheat sheet
  • Software Management
    • Building Software from Scratch
    • Conda
    • Modules
      • Module Usage
    • Renv
    • Unity OnDemand
      • JupyterLab OnDemand
    • Venv
  • Tools & Software
    • ColabFold
    • R
      • R Parallelization
    • Unity GPUs
  • Datasets
    • AI and ML
      • AlpacaFarm
      • audioset
      • bigcode
      • biomed_clip
      • blip_2
      • blip_2
      • coco
      • Code Llama
      • DeepAccident
      • DeepSeek
      • DINO v2
      • epic-kitchens
      • florence
      • gemma
      • glm
      • gpt
      • gte-Qwen2
      • ibm-granite
      • Idefics2
      • Imagenet 1K
      • inaturalist
      • infly
      • instruct-blip
      • internLM
      • intfloat
      • LAION
      • lg
      • linq
      • llama
      • Llama2
      • llama3
      • llama4
      • Llava_OneVision
      • Lumina
      • mixtral
      • msmarco
      • natural-questions
      • objaverse
      • openai-whisper
      • phi
      • playgroundai
      • pythia
      • qwen
      • R1-1776
      • rag-sequence-nq
      • red-pajama-v2
      • s1-32B
      • satlas_pretrain
      • scalabilityai
      • sft
      • SlimPajama
      • t5
      • Tulu
      • V2X
      • video-MAE
      • videoMAE-v2
      • vit
      • wildchat
    • Bioinformatics
      • AlphaFold3 Databases
      • BFD/MGnify
      • Big Fantastic Database
      • checkm
      • ColabFoldDB
      • dfam
      • EggNOG
      • EggNOG
      • gmap
      • GMAP-GSNAP database (human genome)
      • GTDB
      • igenomes
      • Kraken2
      • MGnify
      • NCBI BLAST databases
      • NCBI RefSeq database
      • NCBI RefSeq database
      • Parameters of Evolutionary Scale Modeling (ESM) models
      • params
      • PDB70
      • PDB70 for ColabFold
      • PINDER
      • PLINDER
      • Protein Data Bank
      • Protein Data Bank database in mmCIF format
      • Protein Data Bank database in SEQRES records
      • Tara Oceans 18S amplicon
      • Tara Oceans MATOU gene catalog
      • Tara Oceans MGT transcriptomes
      • Uniclust30
      • UniProtKB
      • UniRef100
      • UniRef30
      • UniRef90
      • Updated databases for ColabFold
    • Using HuggingFace Datasets

Monitor a batch job

There are several ways to monitor a running job’s progress: tailing its log files, using sstat, or connecting to one of its nodes with srun.

Watch job progress by tailing log files

If your job produces output as it runs, you can use tail on the log file to watch its progress. The -F flag makes tail keep watching the file and print new lines as they are written. Use Ctrl-C (or Cmd-C on macOS) to stop monitoring the file. This does not affect the running job.

tail -F slurm-XXX.out
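
If you would like the tail to stop automatically once the job finishes, a minimal sketch like the following works; the job ID 1234567 is a placeholder for the ID printed by sbatch, and the loop simply polls squeue until the job leaves the queue.

# Placeholder job ID; replace with the ID printed by sbatch.
JOBID=1234567
# Follow the log in the background.
tail -F "slurm-${JOBID}.out" &
TAIL_PID=$!
# Poll squeue every 30 seconds; stop tailing once the job is no longer listed.
while squeue -h -j "$JOBID" 2>/dev/null | grep -q .; do
    sleep 30
done
kill "$TAIL_PID"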

Check status quickly using sstat

The sstat command can provide information about a running job’s use of resources. For best formatting, use the following command:

sstat -a -j <jobid> -o JobID%-15,TRESUsageInTot%-85

Ignore the <jobid>.extern step. If you use srun, mpirun, or mpiexec, then the numbered job steps show the usage of that program. The .batch step contains the combined usage of all of the commands in your batch script.
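
To keep an eye on these numbers while the job runs, one convenient option is to wrap the command in watch; this is just a sketch, with 1234567 standing in for your job ID and assuming watch is available on the login node.

# Refresh the sstat summary every 60 seconds (Ctrl-C to stop).
watch -n 60 "sstat -a -j 1234567 -o JobID%-15,TRESUsageInTot%-85"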

The following table shows some sample output from a job submitted with -n 32 -N 4 --ntasks-per-node=8, which spreads 32 tasks across 4 nodes with 8 cores on each (not a recommended layout):

JobID         TRESUsageInTot
jobid.extern  cpu=00:00:00,energy=0,fs/disk=…,mem=0,pages=0,vmem=0
jobid.batch   cpu=8-11:09:22,energy=0,fs/disk=…,mem=4164440K,pages=0,vmem=4289444K
jobid.0       cpu=25-09:24:52,energy=0,fs/disk=…,mem=12502612K,pages=0,vmem=12532576K

This job was running for about 25.5 hours. The usage in .batch only represents the 8 cores on the BatchHost. The .0 step is the usage of the MPI program. The usage here is 609 hours, which is less than the 816 hours expected (32 cores * 25.5 hours), indicating some inefficiency. This could be due to network traffic, I/O wait, or some non-MPI process that ran first, although that alone would not account for all the time.
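
As a rough back-of-the-envelope check (using the example numbers above, not live cluster output), you can compare the reported CPU time with the theoretical maximum:

# 609 core-hours used vs. 32 cores x 25.5 hours requested
echo "scale=1; 100 * 609 / (32 * 25.5)" | bc
# prints 74.6, i.e. roughly 75% CPU efficiency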

Confirm job utilization using srun

To confirm that a job is utilizing all of the nodes, cores, and GPUs requested, you may connect to one of its nodes interactively using the following command:

srun --overlap --nodelist=<nodename> --pty --jobid=<jobid> /bin/bash

The --nodelist argument should only contain one name and is only required if you want to connect to a node other than the first one. Use the following command to see your job’s assigned nodes, cores, and GPUs:

scontrol -d show job <jobid>

In the output, BatchHost is the node your batch script is running on, and NodeList is the list of all nodes allocated to your job. CPU_IDs lists the cores assigned to your job on each node, and the IDX field shows which GPUs are available to it. Sample output:

JobId=... JobName=...
...
   NodeList=uri-gpu003
   BatchHost=uri-gpu003
   JOB_GRES=gpu:a100:4
     Nodes=uri-gpu003 CPU_IDs=0-63 Mem=515000 GRES=gpu:a100:4(IDX:0-3)
...
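
If you want the allocated node names in a form you can feed to --nodelist, one approach is to expand the job’s NodeList with scontrol; the job ID 1234567 below is again a placeholder.

# Expand the job's compact NodeList (e.g. uri-gpu[003-004]) into one hostname per line.
scontrol show hostnames "$(squeue -h -j 1234567 -o %N)"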

Check CPU and memory usage

The recommended way to see CPU utilization is the following command (copy it as-is; there is no need to expand the variables yourself):

systemd-cgtop system.slice/${HOSTNAME}_slurmstepd.scope/job_${SLURM_JOB_ID}

The %CPU column shows the sum of the utilization on all of the cores assigned on this node. This should be close to 100 times the number of cores. The Memory column should show a value close to what you requested. Tools like htop may also work, but make sure the CPU numbers you compare against CPU_IDs are 0 based.
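
If you prefer not to connect to the node at all, sstat can also report peak memory and average CPU time per step from Slurm’s accounting; this is a sketch, and the exact columns available can vary with the Slurm version.

# MaxRSS = peak resident memory per step, AveCPU = average CPU time per task.
sstat -a -j <jobid> -o JobID%-15,MaxRSS%-15,AveCPU%-15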

Check GPU usage

The recommended tool to see GPU utilization is nvitop. See Unity GPUs for more information. Note that it does not show GPUs belonging to other jobs on the node, even if those jobs are also yours.
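
If nvitop is not available in your environment, a plain nvidia-smi query run from inside the srun --overlap shell gives a similar, if less interactive, view; the fields below are standard nvidia-smi query options.

# One-line utilization and memory summary for the GPUs visible to your job.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv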
