1. Introduction
- A parallel job is a job that uses more than one CPU.
Because the cluster is a shared resource, it is the GE (the Grid Engine, i.e., the job scheduler) that allocates CPUs, known as slots in GE jargon, to jobs; hence a parallel job must request a set of slots (i.e., CPUs) when it is submitted.
A parallel job requests a parallel environment and a number of slots (CPUs) using the
-pe <name> N
specification to qsub:
- the <name> is either mpich, orte or mthread (see below);
- the value of N is the number of requested slots (CPUs) and can be specified as N-M, where N is the minimum and M the maximum number of slots requested; that option can be an embedded directive.
The job script accesses the number of assigned slots via an environment variable (NSLOTS) and, for MPI jobs, gets the list of compute nodes via a so-called machines file.
- There are two types of parallel jobs:
- MPI or distributed jobs: the CPUs can be distributed over multiple compute nodes.
PROs: there is conceptually no limit on how many CPUs can be used, so the cumulative number of CPUs and amount of memory a job can use can get quite large; the GE can find (a lot of) unused CPUs on a busy machine by locating them on different nodes.
CONs: each CPU should assume it is on a separate compute node and thus must communicate with the other CPUs to exchange information (aka message passing); programming can get more complicated and the inter-process communication can become a bottleneck.
- Multi-threaded jobs: all the CPUs must be on the same compute node.
PROs: all CPUs can share a common memory space, inter-process communication can be very efficient (being local), and programming is often simpler.
CONs: you can only use as many CPUs as there are on the largest compute node, and you can get them only if they are not in use by someone else.
- How do I know which type of parallel job to submit?
The author of the software will in most cases specify if and how the application can be parallelized:
- some analyses are parallelized by submitting a slew of independent serial jobs, where using a job array may be the best approach (see the sketch after this list);
- some analyses use explicit message passing (MPI); while
- some analyses use a programming model that can run multiple threads.
- We currently do not support hybrid parallel jobs: jobs that would use both multi-threaded and distributed models.
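As an illustration of the job-array approach mentioned above (a hypothetical sketch; the script name serial_analysis, the input file naming, and the task range are placeholders), each task of the array runs the same serial computation on a different input, selected via the GE variable $SGE_TASK_ID:
#$ -t 1-100
#$ -cwd -j y
./serial_analysis input.$SGE_TASK_ID
Each of the 100 tasks is an independent serial job, so no -pe specification is needed.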
2. MPI, or Distributed Parallel Jobs with Explicit Message Passing
- An MPI job runs code that uses an explicit message passing programming scheme known as MPI.
- There are two distinct implementations of the MPI protocol:
- MPICH and
- ORTE;
- OpenMPI is an ORTE implementation of MPI;
- MVAPICH is an MPICH implementation that explicitly uses InfiniBand as the transport fabric (faster message passing).
- OpenMPI is not OpenMP:
- OpenMPI is the ORTE implementation of MPI, while
- OpenMP is a multi-threading (shared memory) programming model (see section 3 below).
The following grid of modules, each corresponding to a combination of compiler & MPI implementation, is available on Hydra:
| Compiler | ORTE | MPICH | MVAPICH |
|---|---|---|---|
| GNU | gcc/openmpi | gcc/mpich | gcc/mvapich |
| GNU gcc 4.4.7 | gcc/4.4/openmpi | gcc/4.4/mpich | gcc/4.4/mvapich |
| GNU gcc 4.9.1 | gcc/4.9/openmpi | gcc/4.9/mpich | gcc/4.9/mvapich |
| GNU gcc 4.9.2 | gcc/4.9/openmpi-4.9.2 | n/a | n/a |
| Intel | intel/mpi | n/a | n/a |
| Intel v15.x | intel/15/mpi | n/a | n/a |
| Intel v16.x | intel/16/mpi | n/a | n/a |
| PGI | pgi/openmpi | pgi/mpich | pgi/mvapich |
| PGI v14.x | pgi/14/openmpi | pgi/14/mpich | pgi/14/mvapich |
| PGI v15.x | pgi/15/openmpi | pgi/15/mpich | pgi/15/mvapich |
| PGI v15.x | pgi/15/openmpi-15.9 | pgi/15/mpich-15.9 | pgi/15/mvapich-15.9 |
In fact, there are more version-specific modules available; check with
% ( module -t avail ) |& grep pi
for a complete list, then use
% module whatis <module-name>
or
% module help <module-name>
where <module-name>
is one of the listed modules, to get more specific information.
2.a ORTE or OpenMPI
The following example shows how to write an ORTE/OpenMPI job script
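A minimal sketch of such a script follows (the program name hello_mpi, the requested slot count, and the gcc/openmpi module choice are placeholders; adapt them to your case):
#!/bin/csh
#
# embedded GE directives: job name, merged log file, run from current dir,
# and a request for 16 slots in the orte parallel environment
#$ -N hello-orte
#$ -o hello-orte.log -cwd -j y
#$ -pe orte 16
#
echo the job was allocated $NSLOTS slots
# load the matching compiler/MPI module and start the MPI program
module load gcc/openmpi
mpirun -np $NSLOTS ./hello_mpi
echo mpirun is done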
2.b MPICH or MVAPICH
The following example shows how to write an MPICH/MVAPICH job script
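A minimal sketch (again, the program name hello_mpi and the module choice are placeholders; the machines file location shown assumes that the mpich parallel environment writes it under $TMPDIR):
#!/bin/csh
#
#$ -N hello-mpich
#$ -o hello-mpich.log -cwd -j y
#$ -pe mpich 16
#
echo the job was allocated $NSLOTS slots
# load the matching compiler/MPI module and start the MPI program,
# passing the machines file assumed to be written by the PE under $TMPDIR
module load gcc/mpich
mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./hello_mpi
echo mpirun is done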
3. Multi-threaded, or OpenMP, Parallel Jobs
note: OpenMP is not OpenMPI
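For reference, a minimal sketch of a multi-threaded job script (the program name my_openmp_prog is a placeholder; OMP_NUM_THREADS applies to OpenMP programs, other multi-threaded applications have their own way of specifying the number of threads):
#!/bin/csh
#
# request 8 slots on a single compute node via the mthread parallel environment
#$ -N hello-mthread
#$ -o hello-mthread.log -cwd -j y
#$ -pe mthread 8
#
# tell the OpenMP runtime to use as many threads as slots were granted
setenv OMP_NUM_THREADS $NSLOTS
./my_openmp_prog
echo done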
(more to come)