...
A total of 8 GPUs on 3 nodes are available on Hydra.
- One node has four GPU cards (NVIDIA L40S)
- Two nodes have two GPU cards each (NVIDIA GV100)
Each GPU type corresponds to:

Type | CUDA Cores | Memory | Mem b/w
---|---|---|---
L40S | 18,176 | 48GB | 864 GB/s
GV100 | 5,120 | 32GB | 870 GB/s

Note that CUDA cores are not like CPU cores.
...
- The GPUs are configured as follows:
- The current driver is CUDA 12.4 (Driver Version 550.54.15)
- Persistence mode is ENABLED (the NVIDIA driver remains loaded even when there are no active clients)
- Compute mode is set to EXCLUSIVE_PROCESS (only one context is allowed per device, usable by multiple threads at a time)
- This configuration is reset at reboots (see man nvidia-smi, accessible on the GPU nodes).
This means that GPU applications will have exclusive access to their assigned GPU(s): only one process can use a given GPU at a time.
...
Starting one more process than there are available GPUs will fail with the following error message:
Error: all CUDA-capable devices are busy or unavailable
in file XXX.cu at line no NNN
CUDA
- The CUDA compiler is now part of the NVIDIA compilers, and is accessible by loading the NVIDIA module:
% module load nvidia
which loads by default NVIDIA 23.9 and CUDA 12.2.
- Other versions are available; check with
% module whatis nvidia
The CUDA 10.2 SDK is also available (module whatis cuda10.2).
NVSMI: The NVIDIA System Management Interface
- NVSMI v550.54.15 and dcgm v3.35 are available on the GPU nodes.
- The following tools are available, but only on the nodes with GPUs, and by loading the cuda10.2 module:
nvidia-cuda-mps-control | NVIDIA CUDA Multi Process Service management program
nvidia-installer | install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set
nvidia-modprobe | load the NVIDIA kernel module and create NVIDIA character device files
nvidia-persistenced | a daemon to maintain persistent software state in the NVIDIA driver
nvidia-settings | configure the NVIDIA graphics driver
nvidia-smi | NVIDIA System Management Interface program
nvidia-xconfig | manipulate X configuration files for the NVIDIA driver
- Read the corresponding man pages to learn more; load the cuda-dcgm module to access dcgm.
- The man pages are accessible on the GPU nodes by loading the cuda10.2 module.
Query and Monitor
- The most useful tool is nvidia-smi; it allows you to query and monitor the status of the GPU card(s):
```
hpc@compute-79-01% nvidia-smi -l
hpc@compute-79-01% nvidia-smi dmon -d 30 -s pucm -o DT
hpc@compute-79-01% nvidia-smi pmon -d 10 -s um -o DT
hpc@compute-79-01% nvidia-smi \
  --query-compute-apps=timestamp,gpu_uuid,pid,name,used_memory \
  --format=csv,nounits -l 15
hpc@compute-79-01% nvidia-smi \
  --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
  --format=csv,nounits -l 15
hpc@compute-79-01% nvidia-smi -q -i 0 --display=MEMORY,UTILIZATION,PIDS
```
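The CSV output of the --query-gpu form is easy to post-process in a script. Below is a minimal Python sketch; the sample text is illustrative (not captured from a Hydra node), and the exact header names depend on the fields you query:

```python
import csv
import io

# Illustrative sample of `nvidia-smi --query-gpu=... --format=csv,nounits`
# output; real headers/values depend on the queried fields and the node.
sample = """\
name, index, memory.used, utilization.gpu
NVIDIA L40S, 0, 479, 0
NVIDIA L40S, 1, 479, 37
"""

# List the indices of GPUs that report non-zero utilization.
busy = []
for row in csv.DictReader(io.StringIO(sample), skipinitialspace=True):
    if int(row["utilization.gpu"]) > 0:
        busy.append(int(row["index"]))
print("busy GPUs:", busy)
```

The same parsing works on the live output of the looping (-l 15) invocations above, one block of rows per reading.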
...
- Read the man page (man nvidia-smi).
GDK/NVML/pynvml
The GPU Development Kit (GDK), the NVIDIA Management Library (NVML), and the Python bindings to NVML (pyNVML) are available.
The NVML documentation is available at NVIDIA's web site.
pyNVML 7.352.0 is available via the nvidia/pynvml module, and its documentation is on-line.
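As a sketch of what pyNVML offers, the snippet below lists each GPU's memory use, roughly what the local get-gpu-info tool reports. The fallback branch with made-up numbers is only there so the example also runs on a node without a GPU; the pynvml calls themselves are the standard NVML bindings:

```python
def fmt(idx, name, used, total):
    """One line per GPU: id, name, used/total memory and percent used."""
    return f"{idx} {name} {used / 2**20:.1f}M/{total / 2**30:.2f}G {100.0 * used / total:.1f}%"

try:
    import pynvml
    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        print(fmt(i, pynvml.nvmlDeviceGetName(h), mem.used, mem.total))
    pynvml.nvmlShutdown()
except Exception:
    # no pynvml or no GPU on this host: show the format with made-up numbers
    print(fmt(0, "NVIDIA_L40S", 479 * 2**20, 45 * 2**30))
```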
NVIDIA OpenACC/CUF
- The NVIDIA compilers support OpenACC.
- OpenACC, similarly to OpenMP, instructs a compiler to produce code that will run on the GPU.
- It uses pragmas, i.e., instructions to the compiler that otherwise look like comments, to specify what part of the computation should be offloaded to the GPU.
...
A single pair of such pragmas produced a >300x speed-up of the Julia set test case.
- This requires an additional license that is available on Hydra (not at SAO, though).
- The NVIDIA compilers also support CUDA FORTRAN (aka CUF).
- You can write new or modify existing FORTRAN code to use the GPU, as you can with C/C++ & CUDA.
- Simple examples are available in /home/hpc/examples/gpu/cuda
Local Tools
- We provide two local GPU related tools (accessible by loading the tools/local-admin module):
- check-gpuse: checks current GPU usage:
```
hpc@hydra-login% check-gpuse
hostgroup: @gpu-hosts (3 hosts)
                - --- memory (GB) ---- - #GPU - --------- slots/CPUs ---------
hostname        - total  used   resd   - a/u  - nCPU used load - free unused
compute-50-01   - 503.3  19.6   483.7  - 4/0  -   64    0  0.0 -   64   64.0
compute-79-01   - 125.4  18.1   107.3  - 2/0  -   20    0  0.0 -   20   20.0
compute-79-02   - 125.5  46.2    79.3  - 2/1  -   20    1  2.1 -   19   17.9
Total #GPU=8 used=1 (12.5%)
```
- get-gpu-info: queries whether a node has a GPU, and returns the GPU(s) properties and which process(es) use(s) the GPUs.
It is a simple C-shell wrapper that runs the python script get-gpu-info.py, which uses pyNVML (the python bindings to NVML).
The get-gpu-info wrapper checks whether its first argument is of the form NN-MM; if it is, it runs get-gpu-info.py on compute-NN-MM.
In other words, "get-gpu-info 79-01 0 -d" is equivalent to "ssh -xn compute-79-01 get-gpu-info 0 -d".
```
usage: get-gpu-info.py3 [-h] [-i] [-d] [-l [LOOP]] [-c COUNTS]
                        [--ntstamp NTSTAMP]
                        [id]

get-gpu-info.py3: show info about GPU(s)

positional arguments:
  id                    specify the GPU id, implies --info

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show info for each GPU
  -d, --details         show details of running process, implies --info
  -l [LOOP], --loop [LOOP]
                        repeat every LOOP [in sec: 10 to 3600], default is 30,
                        implies --info
  -c COUNTS, --counts COUNTS
                        limits the no. of times to loop, implies --info
  --ntstamp NTSTAMP     specify how often to put a time stamp, by default puts
                        one every 10 readings

Ver 1.1/0 Oct 2021/SGK
```
For example:

```
hpc@hydra-login-01% get-gpu-info
0 GPU on hydra-login-01

hpc@login01% get-gpu-info 50-01 -d
4 GPUs on 50-01
Thu May 16 15:30:06 2024
 id  --- name ---   ------ memory ------   -------- bar1 --------   ---- usage ----
                         used/total              used/total           gpu  mem  #proc
  0  NVIDIA_L40S    479.1M/44.99G   1.0%    1.688M/64.00G   0.0%      0%   0%    0
  1  NVIDIA_L40S    479.1M/44.99G   1.0%    1.688M/64.00G   0.0%      0%   0%    0
  2  NVIDIA_L40S    479.1M/44.99G   1.0%    1.688M/64.00G   0.0%      0%   0%    0
  3  NVIDIA_L40S    479.1M/44.99G   1.0%    1.688M/64.00G   0.0%      0%   0%    0

hpc@login01% get-gpu-info 79-01 -d
2 GPUs on 79-01
Thu May 16 15:30:12 2024
 id  --- name ---   ------ memory ------   -------- bar1 --------   ---- usage ----
                         used/total              used/total           gpu  mem  #proc
  0  Quadro_GV100   276.5M/32.00G   0.8%    2.688M/256.0M   1.0%      0%   0%    0
  1  Quadro_GV100   276.5M/32.00G   0.8%    2.688M/256.0M   1.0%      0%   0%    0

hpc@login01% get-gpu-info 79-02 -d
2 GPUs on 79-02
Thu May 16 15:30:19 2024
 id  --- name ---   ------ memory ------   -------- bar1 --------   ---- usage ----
                         used/total              used/total           gpu  mem  #proc
  0  Quadro_GV100   6.212G/32.00G  19.4%    5.188M/256.0M   2.0%      0%   0%    1
       1 pid=494189 name=b'python3' used_memory=5.939G
  1  Quadro_GV100   279.1M/32.00G   0.9%    3.188M/256.0M   1.2%      0%   0%    0
```
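The wrapper's argument handling described above can be sketched as follows (in Python for clarity; the real wrapper is a C-shell script, and the names here simply mirror the description):

```python
import re

def build_cmd(args):
    """If the first argument is of the form NN-MM, run get-gpu-info on
    compute-NN-MM via ssh; otherwise run the python script locally."""
    if args and re.fullmatch(r"\d+-\d+", args[0]):
        node, rest = args[0], args[1:]
        return ["ssh", "-xn", f"compute-{node}", "get-gpu-info"] + rest
    return ["get-gpu-info.py"] + args

# "get-gpu-info 79-01 0 -d" becomes an ssh to compute-79-01:
print(" ".join(build_cmd(["79-01", "0", "-d"])))
```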
- Four queues are available to access the GPUs:
  - an interactive queue, qgpu.iq, that has a 24hr elapsed time limit, and
  - three batch queues, sTgpu.q, mTgpu.q & lTgpu.q, corresponding to a short, medium and long time limit.
- There are also limits on (CPU) memory usage and on how many concurrent GPUs can be used; see the Available Queues page.
Note:
- You must request to use a GPU to run in these queues, as follows:
...
- If your job will use two GPUs, use:
% qsub -l gpu,ngpu=2
- You can specify what type of GPU to use with -l gpu,gpu_arch=L40S or -l gpu,gpu_arch=GV100.
Currently we impose the following resource limits for GPUs:
...
...
You can find that example, and equivalent codes, under /home/hpc/examples/gpu:
cuda/ | CUDA and C++ code (.cu .cpp Makefile) |
cuda/gpu | GPU example |
cuda/cpu | CPU equivalent |
matlab/ | MATLAB, using standalone compiled code (available for now only at SAO) |
matlab/gpu | GPU example |
matlab/cpu | CPU equivalent |
idl/ | IDL CPU-only equivalent (for comparison) |
Note:
I wrote a more sophisticated alternative to that example that achieves a 500:1 speed-up compared to the equivalent computation running on a single CPU.
That reduces a 7.5 hour long computation to less than 1 minute, in a case that is intrinsically fully "parallelizable."
It illustrates the potential gain, compared to the cost of coding using CUDA (an extension of C++).
NVIDIA Examples
You can find the NVIDIA examples under /share/apps/nvidia/cuda/samples
and under /cm/shared/apps/cuda10.0/sdk
.
I was able to build them, but not on the login nodes, as they use libraries that are currently only available on the GPU nodes.
You can find these builds under /home/hpc/tests/gpu/cuda/samples/.
Timing
[This section is old and these tests have yet to be re-run]
I use a simple fractal computation (Julia set) to run timings. The computation is simply
z = z * z + c
where z and c are two complex numbers.
The assignment is iterated N times and computed on an M x M grid, where z = x + i y, and x and y are equispaced values between -1.5 and 1.5.
The final value of z is converted to an index iz = 255*exp(-abs(z)), and the index iz is converted to a color triplet using a rainbow color map.
The computation is repeated for a range of values of c.
I wrote various versions using
C++ CUDA (a basic one, an optimized one),
CUDA FORTRAN
C++ and FORTRAN (CPU-only as reference),
MATLAB (CPU) and MATLAB using GPU,
and IDL (CPU only).
In the single precision versions, z and c are each represented by floats; in the double precision versions, they are represented by doubles.
In the tests I ran for timing purposes I did not save the resulting image.
The codes, job files and log files are available in /home/hpc/examples/gpu/timing/
I ran them on 201 values of c and for N=150; timing units are [HH:]MM:SS[.S].
(not yet complete)
single precision | speed | double precision | speed | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
size M x M | 4k x 4k | 8k x 8k | 16k x 16k | ratio | 4k x 4k | 8k x 8k | 16k x 16k | ratio | Comments | Note | |
GPU cases: | GPU specific code | |||||||||||
CUDA | 03:05.0 | 12:20.8 | 49:23.6 | 9.202 | NVIDIA CUDA - their example | |||||||
Optimized CUDA | 00:05.1 | 00:17.6 | 01:09.3 | 371.6 | 00:07.6 | 00:27.8 | 01:50.3 | 238.8 | NVIDIA CUDA - optimized | |||
MATLAB GPU | 01:33.2 | 04:46.0 | 18:56.4 | 22.04 | 02:06.2 | 07:13.0 | 14.69 | MATLAB equiv, using GPU | 1 | |||
GPU cases, using pragmas: | Regular code with special directives | |||||||||||
PGI C++ OpenACC | 00:54.3 | 02:09.4 | 42.30 | C++ with OpenACC | 2,3a | |||||||
PGI C++ OpenACC (optimized) | 00:12.1 | 00:42.2 | 02:46.2 | 155.5 | 00:13.8 | 00:52.1 | 03:27.6 | 128.5 | C++ with OpenACC, optimized | 2,3a | ||
PGI C++ OpenACC (myComplex ) | 10:52:44 | 0.043 | uses myComplex class | 2,3b | ||||||||
PGI C++ OpenAC (std::complex<> ) | 10:52:45 | 0.043 | uses std::complex<float> | 2,3c | ||||||||
PGI F90 OpenACC | 00:06.0 | 00:21.1 | 01:23.0 | 311.8 | 00:09.5 | 00:34.4 | 02:18.6 | 191.4 | F90 with OpenACC pragma | 2,4 | ||
PGI (CUF Fortran CUDA) | 00:05.1 | 00:17.3 | 01:08.2 | 376.0 | 00:07.7 | 00:27.8 | 01:54.6 | 234.8 | PGI FORTRAN CUDA (optimized) | 2,4 | ||
CPU only cases: | CPU equivalent cases | |||||||||||
C++ (PGI) | 28:13.7 | 01:55:10 | 07:30:38 | 1.000 | 27:49.2 | 01:51:15 | 07:25:02 | 1.021 | C++, PGI compiler (v15.9) | |||
C++ (Intel) | 25:35.6 | 01:42:21 | 06:49:39 | 1.109 | | 01:42:08 | 06:48:22 | 1.112 | C++, Intel compiler | | |
C++ (gcc) | 28:24.6 | 01:53:37 | 07:35:41 | 0.999 | 28:36.9 | 01:54:11 | 07:36:53 | 0.994 | C++, GNU compiler (v4.9.2) | |||
PGI C++ | 28:53.6 | 01:55:38 | 0.986 | 28:33.2 | 01:54:01 | 0.999 | (duplicate) | 3a | ||||
PGI C++ (myComplex ) | 28:11.0 | 01:52:02 | 1.015 | uses myComplex class | 3b | |||||||
PGI C++ (std::complex<> ) | 03:53:08 | 15:29:09 | 0.123 | uses std::complex<float> | 3c | |||||||
PGI F90 | 33:36.8 | 02:13:50 | 0.850 | 34:04.9 | 02:31:51 | 0.793 | F90, PGI compiler (v15.9) | 4 | ||||
MATLAB (1 thread) | 01:09:00 | 04:34:24 | 0.414 | 01:03:18 | 04:06:06 | 0.457 | MATLAB on 1 CPU | 1 | ||||
MATLAB (multi-threaded) | 14:05.7 | 56:09.6 | 2.027 | 13:25.4 | 53:25.9 | 2.129 | MATLAB on multiple CPUs | 1 | ||||
IDL (1 thread) | 01:32:13 | 06:24:22 | 0.303 | IDL on 1 CPU | 1 | |||||||
IDL (multi-threaded) | 12:50.0 | 01:32:50 | 1.720 | IDL on multiple CPUs | 1 |
Notes:
Green means ran faster, red slower, grey is the reference (yellow: not run).
Optimized CUDA and CUDA FORTRAN ran 371 and 376 times faster than single-CPU C++.
MATLAB and IDL tests use the run-time environment, for now we only have MATLAB licenses at SAO (including compiler).
Access to the OpenACC and FORTRAN CUDA from PGI requires a license upgrade - our license was upgraded in May 2016.
C++ does not support complex arithmetic like FORTRAN does, so I used 3 approaches:
- write out the complex arithmetic explicitly using only floating-point operations,
- use a C++ class myComplex, and let the compiler figure out how to port it to the GPU,
- use the C++ std::complex<> class template, and also let the compiler figure out how to port it to the GPU.
Using OpenACC with C++ is trickier than using it with FORTRAN.
The Julia set computation generates NaNs; handling these in some of the C++ cases may explain the slowdown.
FORTRAN supports complex arithmetic as a built-in data type COMPLEX.
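The first two approaches amount to the same arithmetic, and the equivalence is easy to check. This is shown in Python purely for illustration, since Python, like FORTRAN, has a built-in complex type:

```python
def step_explicit(zr, zi, cr, ci):
    """One iteration of z = z*z + c written out with only real
    floating-point operations (approach 1)."""
    # (zr + i zi)^2 = zr^2 - zi^2 + i (2 zr zi), then add c
    return zr * zr - zi * zi + cr, 2.0 * zr * zi + ci

def step_builtin(z, c):
    """The same iteration using a built-in complex type."""
    return z * z + c

# for a sample point the two formulations agree exactly
z, c = 0.3 + 0.5j, -0.4 + 0.6j
zr, zi = step_explicit(z.real, z.imag, c.real, c.imag)
print(complex(zr, zi), step_builtin(z, c))
```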
...
Last updated SGK