Use MPI for multi-node job execution
The following guide provides an overview of the message passing interface (MPI) and how to use MPI to run jobs across multiple nodes.
What is MPI?
The Message Passing Interface (MPI) provides a standard way to pass messages (low-level data) between processes, whether they run on the same node or on different nodes. Almost all languages have bindings for it, but it is most frequently used from C/C++ and Fortran, and to a lesser extent from Python via mpi4py.
Determine if your software supports MPI
When using existing software, review its documentation to see if it supports MPI. If it doesn’t mention MPI explicitly, it probably doesn’t support it (with few exceptions).
If you are compiling software yourself, look at the output of `./configure --help` or `cmake -LAH` to find MPI-related settings. See the Software Installation Guide (Coming soon) for more information on how to build software from scratch.
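For example, a quick way to scan a build system for MPI-related options is the following sketch (the exact option names vary by package, so treat the grep patterns as a starting point):

```bash
# Autotools-based project: list all configure options and look for MPI switches
./configure --help | grep -i mpi

# CMake-based project (run from an already configured build directory):
# list cached variables with their help text and look for MPI entries
cmake -LAH . | grep -i mpi
```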
If you are running ML/AI jobs with PyTorch, you can use `torchrun` to perform a distributed launch, but make sure your code uses `torch.distributed` appropriately (see the PyTorch Distributed Overview for more information).
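As a rough sketch, a two-node `torchrun` launch under Slurm might look like the batch script below. The node/GPU counts, the rendezvous port, and the `train.py` script are placeholders; adjust them to your cluster and code.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1     # one torchrun launcher per node
#SBATCH --gpus-per-node=4       # assumption: 4 GPUs per node

# Use the first allocated node as the rendezvous endpoint (common pattern)
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_endpoint="${head_node}:29500" \
    train.py    # hypothetical training script that uses torch.distributed
```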
MPI implementations
Since MPI only defines the interface for passing messages, there are many possible implementations. A few of them are widely used:
- OpenMPI - The most common implementation with an open source license (BSD-3)
- MPICH - Another widely used implementation with a permissive license - many other MPI stacks derive from this
- Intel MPI - Shipped with the Intel Compiler (now branded with Intel oneAPI) - also derived from MPICH
- MVAPICH - A family of purpose-built implementations (derived from MPICH) optimized for high-performance interconnects such as InfiniBand
In theory, any MPI program should build against any of these implementations. However, once built against a given implementation, the software must run with that same implementation's support libraries. For example, a program compiled against OpenMPI cannot run using Intel's MPI implementation; under such a mismatch it generally either crashes or exits quickly with an error.
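If you are unsure which implementation a binary was built against, inspecting its shared-library dependencies is usually enough. This is a quick check, not foolproof (statically linked programs won't show it), and `my_program` is a placeholder name:

```bash
# Show which MPI shared libraries the executable is linked against
ldd ./my_program | grep -i -E 'libmpi|libmpich'
# OpenMPI 4.x typically shows libmpi.so.40, while MPICH-derived stacks
# (including Intel MPI) typically show libmpi.so.12 or libmpich.so
```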
General compiling instructions
When compiling an MPI program, you may need to use an MPI compiler wrapper instead of the compiler's usual name, for example:
Language | GNU Compiler | Intel Compiler | MPI wrapper (GNU) | MPI wrapper (Intel) |
---|---|---|---|---|
C | cc / gcc | icc | mpicc | mpiicc |
C++ | c++ / g++ | icpc / icx | mpic++ | mpiicpc / mpicxx |
Fortran | gfortran | ifort / ifx | mpif77 / mpif90 / mpifort | mpiifort / mpifc |
The mechanism for selecting the compiler varies by software.
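For example, here is a minimal sketch of compiling directly with a wrapper and of pointing common build systems at one (`hello_mpi.c` is a placeholder source file):

```bash
# Compile a single source file with the MPI wrapper for the GNU C compiler
mpicc -O2 -o hello_mpi hello_mpi.c

# Autotools: tell configure to use the MPI wrappers as the compilers
./configure CC=mpicc CXX=mpic++

# CMake: point the C/C++ compilers at the wrappers
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpic++ ..
```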
General usage instructions
The configuration of an MPI stack varies significantly based on the MPI library's compile-time options, so there isn't a one-size-fits-all set of usage instructions. However, the following sections will guide you through how to start an MPI program based on some common use cases.
Slurm’s srun
Slurm provides the `srun` command to start an MPI program (it can also run non-MPI programs if you want extra accounting via `sacct`).
Generally, `srun` doesn't require any extra parameters or environment variables to run; however, be aware that it inherits its information from the `$SLURM_*` variables that `sbatch` sets. In particular, if you use `#SBATCH --export=NONE`, you may need to run `srun --export=ALL` if your program relies on environment variables being set.
Option | Description |
---|---|
--label | Prefix each line with the Rank it came from |
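Putting this together, a minimal batch script might look like the following sketch (the node and task counts, and the `hello_mpi` binary, are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=mpi-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8    # 16 MPI ranks in total

# srun picks up the job geometry from the SLURM_* variables set by sbatch
srun --label ./hello_mpi
```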
OpenMPI mpirun or mpiexec
If you are using a Slurm-aware OpenMPI (run `orte-info | grep "plm: slurm"` to find out) and `mpirun` is available (some installs insist on `srun`), then running it directly without any parameters should suffice. The more standard `mpiexec` is also available.
OpenMPI has many parameters that affect how it runs. See their Fine Tuning Guide for all the details.
Option | Description |
---|---|
--tag-output | Prefix each line with the Job and Rank it came from |
--timestamp-output | Prefix lines with the current timestamp |
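Inside a Slurm allocation, a Slurm-aware OpenMPI reads the job geometry itself, so a minimal invocation (a sketch, reusing the hypothetical `hello_mpi` binary from earlier) is simply:

```bash
# Launch one rank per allocated task; OpenMPI queries Slurm for the layout
mpirun --tag-output ./hello_mpi

# Equivalent with the more standard launcher name
mpiexec --tag-output ./hello_mpi
```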
MPICH mpiexec
`mpiexec` should not require any parameters.
Option | Description |
---|---|
-prepend-rank | Prefix each line with the Rank it came from |
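For example (a sketch, again using the hypothetical `hello_mpi` binary):

```bash
# MPICH's Hydra launcher typically detects the Slurm allocation on its own;
# -prepend-rank labels each output line with the rank that produced it
mpiexec -prepend-rank ./hello_mpi
```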
Intel mpiexec
`mpiexec` should not require any parameters.
Option | Description |
---|---|
-prepend-rank | Prefix each line with the Rank it came from |
It's also possible to use `srun` with Intel MPI; in that case, you should set the following:
export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/libpmi2.so
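A minimal sketch of the `srun` variant inside a batch script follows; the library path is the one shown above and may differ on your system, and `hello_mpi` remains a placeholder binary.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Tell Intel MPI which PMI library Slurm provides (path may vary by system)
export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/libpmi2.so

# Depending on how Slurm was configured, you may also need: srun --mpi=pmi2
srun ./hello_mpi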
Other variables
Some MPI implementations allow you to use environment variables to control and customize their behavior. Generally, if these MPI variables aren’t explicitly set, the libraries will auto-detect the environment. However, there are some edge cases where manually setting these environment variables may be beneficial.
Libfabric-based (fi_info)
Variable | Value | Meaning |
---|---|---|
FI_PROVIDER | tcp | Use regular communication between nodes; should always work |
FI_PROVIDER | shm | Only use shared memory; only works with --nodes=1 |
FI_PROVIDER | verbs | Use Infiniband low latency network; only works with --constraint=ib |
FI_PROVIDER | shm,verbs | Separate multiple providers with a comma |
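For example, a sketch of checking what libfabric offers and then forcing plain TCP (e.g. to rule out interconnect problems) might look like this; `hello_mpi` is a placeholder binary:

```bash
# List the providers libfabric can see on this node
fi_info -l

# Force TCP only for this job
export FI_PROVIDER=tcp
srun ./hello_mpi
```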
UCX-based (ucx_info)
Variable | Value | Meaning |
---|---|---|
UCX_TLS | tcp | Use regular communication between nodes; should always work |
UCX_TLS | shm | Only use shared memory; only works with --nodes=1 |
UCX_TLS | cuda | Use CUDA support; only works with --gpu=<X> |
UCX_TLS | rc_x | Use Infiniband low latency network; only works with --constraint=ib |
UCX_TLS | shm,rc_x | Separate multiple providers with a comma |
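Similarly for UCX, a sketch (check which transports the node actually offers before pinning them; `hello_mpi` is a placeholder binary):

```bash
# Show the devices and transports UCX detects on this node
ucx_info -d

# Restrict UCX to shared memory plus the InfiniBand transport for this job
export UCX_TLS=shm,rc_x
srun ./hello_mpi
```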