[ some features still need to be updated ]
A total of 8 GPUs on 3 nodes are available on Hydra:
- one node has two dual-GPU cards (NVIDIA K80), and
- two nodes have two GPU cards each (NVIDIA GV100).

Each GPU corresponds to:
|Type|CUDA Cores|Memory|Mem b/w|
|K80|2,496|12GB|480 GB/s|
|GV100|5,120|32GB|870 GB/s|
Note that CUDA cores are not like CPU cores.
- The GPUs are configured as follows:
- persistence mode is ENABLED (NVIDIA driver remains loaded even when there are no active clients)
- compute mode is set to EXCLUSIVE_PROCESS (only one context is allowed per device, usable from multiple threads at a time)
- see man nvidia-smi (accessible by loading the corresponding module).
This means that GPU applications will
- start faster (no need to re-load the driver), and
- get exclusive use of a GPU (only one process can use a given GPU at a time).
Only one process per GPU can run at a time, and each process gets a different GPU. Starting more processes than there are available GPUs will fail with the following error message:
Error: all CUDA-capable devices are busy or unavailable
in file XXX.cu at line no NNN
The CUDA compiler (v10.0) is available on the login nodes, by loading the corresponding module.
The examples have been installed under
/share/apps/nvidia/cuda/samples.
CUDA versions 8.0, 9.0, and 9.2 are also available via the corresponding modules.
NVSMI: The NVIDIA System Management Interface
- NVSMI version 418.67 is available.
- The following tools are available, but only on the nodes with GPUs:
|nvidia-cuda-mps-control|NVIDIA CUDA Multi Process Service management program|
|nvidia-installer|install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set|
|nvidia-modprobe|load the NVIDIA kernel module and create NVIDIA character device files|
|nvidia-persistenced|a daemon to maintain persistent software state in the NVIDIA driver|
|nvidia-settings|configure the NVIDIA graphics driver|
|nvidia-smi|NVIDIA System Management Interface program|
|nvidia-xconfig|manipulate X configuration files for the NVIDIA driver|
- Read the corresponding man pages to learn more.
- The man pages are accessible on the GPU nodes by loading the corresponding module.
Query and Monitor
The most useful tool is
nvidia-smi; it allows you to query & monitor the status of the GPU card(s). Read the man page (man nvidia-smi) to learn more.
GDK/NVML/pynvml
The NVML documentation (part of the GDK) is available at NVIDIA's web site.
pyNVML 7.352.0 is available via the
nvidia/pynvml module, and its documentation is on-line.
- The PGI compilers support OpenACC.
- OpenACC, similarly to OpenMP, instructs a compiler to produce code that will run on the GPU.
- It uses
pragmas, i.e., instructions to the compiler that otherwise look like comments, to specify what part of the computation should be offloaded to the GPU.
- A single pair of such
pragmas produced a >300x speed-up of the Julia set test case.
- This requires an additional license that is available on
Hydra (not at SAO, though).
- The PGI compilers also support CUDA FORTRAN (aka CUF).
- You can write or modify existing FORTRAN code to use the GPU like you can using C/C++ & CUDA.
- Simple examples are available in
We have two local GPU-related tools (accessible by loading the tools/local module):
check-gpuse: checks current GPU usage on Hydra;
get-gpu-info: queries whether a node has GPUs, and returns the GPUs' properties and which process(es) use(s) the GPUs.
It's a simple C-shell wrapper that runs the python script
get-gpu-info.py; that python script uses pyNVML (the python bindings to NVML).
The get-gpu-info wrapper checks whether its first argument is of the form
NN-MM, and if it is, runs the command on the corresponding compute node;
in other words, "get-gpu-info 79-01 0 -d" is equivalent to "ssh -xn compute-79-01 get-gpu-info 0 -d".
- Two test queues, batch and interactive (qGPU.iq), are available to access the GPUs:
- the batch queue has no time limit; the interactive queue has a 24hr elapsed-time limit.
- you must request to use a GPU to run in these queues; you can specify how many GPU cards you want to use:
qsub -l gpu or
qrsh -l gpu
- these are equivalent to
qsub -l gpu,ngpu=1 or
qrsh -l gpu,ngpu=1
- if your job will use two GPU cards:
qsub -l gpu,ngpu=2
- The current resource limits for GPUs are:
- 3 GPUs per user in the batch queue,
- 2 GPUs per user in the interactive queue, and
- only 1 concurrent interactive job.
- These are test queues:
- their configuration may change, and
- access is limited to authorized users;
email us to be added to the authorized users list.
I wrote a trivial test case: computing a Julia set (fractals) and saving the corresponding image. It is derived from NVIDIA's own example.
You can find that example, and equivalent codes, under
/home/hpc/examples/gpu. I wrote:
|CUDA and C++ code (.cu .cpp Makefile)|
|MATLAB, using standalone compiled code (available for now only at SAO)|
|IDL CPU-only equivalent (for comparison)|
- I wrote a more sophisticated alternative to that example, to achieve a 500:1 speed-up compared to the equivalent computation running on a single CPU.
- That's reducing a 7.5-hour-long computation to less than 1 minute, in a case that is intrinsically fully "parallelizable."
- It illustrates the potential gain, compared to the cost of coding using CUDA (an extension of C++).
- You can find the NVIDIA examples under
- I was able to build them, but not on the login nodes, as they use libraries that are currently only available on the GPU nodes
- You can find this under
- I use a simple fractal computation (Julia set) to run timings. The computation is simply
z = z * z + c
where z and c are two complex numbers.
- The assignment is iterated
N times and computed on an
M x M grid, where
z = x + i y, and x and
y are equispaced values between -1.5 and 1.5.
- The final value of
z is converted to an index
iz = 255*exp(-abs(z)), and the index
iz is converted to a color triplet using a rainbow color map.
- The computation is repeated for a range of values of c.
- I wrote various versions using:
- C++ CUDA (a basic one, an optimized one),
- CUDA FORTRAN,
- C++ and FORTRAN (CPU-only, as reference),
- MATLAB (CPU) and MATLAB using the GPU,
- and IDL (CPU only).
- In the single precision versions,
z and c are each represented by 32-bit floats;
- in the double precision versions, they are represented by 64-bit floats.
- In the tests I ran for timing purposes I did not save the resulting image.
- The codes, job files and log files are available in
I ran them on 201 values of
c and for N=150; timing units are
[HH:]MM:SS[.S] (not yet complete)
|single precision||speed||double precision||speed|
|size M x M||4k x 4k||8k x 8k||16k x 16k||ratio||4k x 4k||8k x 8k||16k x 16k||ratio||Comments||Note|
|GPU cases:||GPU specific code|
|CUDA||NVIDIA CUDA - their example|
|Optimized CUDA||NVIDIA CUDA - optimized|
|MATLAB GPU||MATLAB equiv, using GPU||1|
|GPU cases, using ||Regular code with special directives|
|PGI C++ OpenACC||C++ with OpenACC||2,3a|
|PGI C++ OpenACC (optimized)||C++ with OpenACC, optimized||2,3a|
|PGI C++ OpenACC (||uses ||2,3b|
|PGI C++ OpenACC (||uses ||2,3c|
|PGI F90 OpenACC||F90 with OpenACC pragma||2,4|
|PGI (CUF Fortran CUDA)||PGI FORTRAN CUDA (optimized)||2,4|
|CPU only cases:||CPU equivalent cases|
|C++ (PGI)||C++, PGI compiler (v15.9)|
|C++ (Intel)||C++, Intel compiler|
|C++ (gcc)||C++, GNU compiler (v4.9.2)|
|PGI C++ (||uses ||3b|
|PGI C++ (||uses ||3c|
|PGI F90||F90, PGI compiler (v15.9)||4|
|MATLAB (1 thread)||MATLAB on 1 CPU||1|
|MATLAB (multi-threaded)||MATLAB on multiple CPUs||1|
|IDL (1 thread)||IDL on 1 CPU||1|
|IDL (multi-threaded)||IDL on multiple CPUs||1|
- Green means ran faster, red slower, grey is the reference (yellow: not run).
- Optimized CUDA and CUDA FORTRAN ran 371 and 376 times faster, respectively, than single-CPU C++.
- MATLAB and IDL tests use the run-time environment, for now we only have MATLAB licenses at SAO (including compiler).
- Access to the OpenACC and FORTRAN CUDA from PGI requires a license upgrade - our license was upgraded in May 2016.
- C++ does not support complex arithmetic as a built-in type like FORTRAN does, so I used 3 approaches:
- write out the complex arithmetic explicitly, using only floating point operations,
- use a C++ class
myComplex, and let the compiler figure out how to port it to the GPU,
- use the C++
std::complex<> class template, and also let the compiler figure out how to port it to the GPU.
- The Julia set computation generates NaNs; handling these in some of the C++ cases may explain the slow-down.
- FORTRAN supports complex arithmetic as a built-in data type.
Last updated SGK