[ some features still need to be updated ]
A total of 8 GPUs on 3 nodes are available on Hydra.
Each GPU type has the following specifications:
Type | CUDA cores | Memory | Memory bandwidth
---|---|---|---
K80 | 2,496 | 12 GB | 480 GB/s
GV100 | 5,120 | 32 GB | 870 GB/s
Note that CUDA cores are not equivalent to CPU cores.
The GPUs are set to compute-exclusive mode (see `man nvidia-smi`, accessible by loading the nvidia/nvsmi module). This means that GPU applications will not share GPUs: only one process can run on each GPU at any given time, and each process gets a different GPU.
Starting more processes than there are available GPUs will fail with the following error message:

```
Error: all CUDA-capable devices are busy or unavailable
in file XXX.cu at line no NNN
```
The CUDA compiler (v10.0) is available on the login nodes by loading the corresponding module:

```
% module load cuda10.0
% nvcc -o testGPU test.cu
```
The examples have been installed under /share/apps/nvidia/cuda/samples and under /cm/shared/apps/cuda10.0/sdk.
CUDA versions 8.0, 9.0, and 9.2 are also available (modules: nvidia/cuda80, nvidia/cuda90, or nvidia/cuda92).
Tool | Purpose
---|---
nvidia-cuda-mps-control | NVIDIA CUDA Multi-Process Service management program
nvidia-installer | install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set
nvidia-modprobe | load the NVIDIA kernel module and create NVIDIA character device files
nvidia-persistenced | a daemon to maintain persistent software state in the NVIDIA driver
nvidia-settings | configure the NVIDIA graphics driver
nvidia-smi | NVIDIA System Management Interface program
nvidia-xconfig | manipulate X configuration files for the NVIDIA driver
These tools are accessible by loading the nvidia module. The most useful one is nvidia-smi; it allows you to query and monitor the status of the GPU card(s):
```
hpc@compute-79-01% nvidia-smi -l
hpc@compute-79-01% nvidia-smi dmon -d 30 -s pucm -o DT
hpc@compute-79-01% nvidia-smi pmon -d 10 -s um -o DT
hpc@compute-79-01% nvidia-smi \
  --query-compute-apps=timestamp,gpu_uuid,pid,name,used_memory \
  --format=csv,nounits -l 15
hpc@compute-79-01% nvidia-smi \
  --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
  --format=csv,nounits -l 15
hpc@compute-79-01% nvidia-smi -q -i 0 --display=MEMORY,UTILIZATION,PIDS
```
For more details, read the man page (`man nvidia-smi`).
The GPU development kit (GDK), the NVIDIA Management Library (NVML), and the Python bindings to NVML (pyNVML) are available. pyNVML 7.352.0 is available via the nvidia/pynvml module, and its documentation is on-line.

The PGI compilers support OpenACC pragmas, i.e., instructions to the compilers that otherwise look like comments, to specify which part of the computation should be offloaded to the GPU. Using such pragmas produced a >300x speed-up of the Julia set test case. The PGI compilers are available on Hydra (not at SAO, though), and examples can be found under /home/hpc/examples/gpu/cuda.
We have two local GPU-related tools (accessible by loading the tools/local module):

- `check-gpuse`: checks the current GPU usage on Hydra;
```
hpc@hydra-login% check-gpuse
hostgroup: @gpu-hosts (3 hosts)
                - ---- memory (GB) ---- - #GPU - --------- slots/CPUs ---------
  hostname      -  total   used   resd  -  a/u -  nCPU used load  -  free unused
compute-73-01   -  251.7   36.7  215.0  -  4/0 -    64    0  0.0  -    64   64.0
compute-79-01   -  125.3   35.0   90.3  -  2/0 -    20    0  0.0  -    20   20.0
```
- `get-gpu-info`: queries whether a node has GPUs, and returns the GPUs' properties and which process(es) use(s) them. It is a simple C-shell wrapper that runs the Python script get-gpu-info.py, which in turn uses pyNVML (the Python bindings to NVML).
The get-gpu-info wrapper checks whether its first argument is of the form NN-MM; if it is, it runs get-gpu-info.py on compute-NN-MM. In other words, `get-gpu-info 79-01 0 -d` is equivalent to `ssh -xn compute-79-01 get-gpu-info 0 -d`.
```
usage: get-gpu-info.py [-h] [-i] [-d] [-l [LOOP]] [-c COUNTS]
                       [--ntstamp NTSTAMP] [id]

get-gpu-info.py: show info about GPU(s)

positional arguments:
  id                    specify the GPU id, implies --info

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show info for each GPU
  -d, --details         show details of running process, implies --info
  -l [LOOP], --loop [LOOP]
                        repeat every LOOP [in sec: 10 to 3600], default is 30,
                        implies --info
  -c COUNTS, --counts COUNTS
                        limits the no. of times to loop, implies --info
  --ntstamp NTSTAMP     specify how often to put a time stamp, by default puts
                        one every 10 readings

Ver 1.0/0 Feb 2016/SGK
```
```
hpc@hydra-login-01% get-gpu-info
0 GPU on hydra-login-01
hpc@hydra-login-01% ssh -xn compute-79-01 get-gpu-info
2 GPUs on 79-01
hpc@hydra-login-01% get-gpu-info 73-01 -d
4 GPUs on 73-01
Tue Mar  8 13:27:45 2016
id            ------ memory ------  ------ bar1 --------  ---- usage ----
   -- name --   used/total            used/total           gpu  mem  #proc
 0 Tesla_K80  760.0M/11.25G  6.6%   4.562M/16.00G  0.0%     0%   0%   1
     pid=64617 name=./loopjulia2xGpu used_memory=735.9M
 1 Tesla_K80  22.80M/11.25G  0.2%   2.562M/16.00G  0.0%     0%   0%   0
 2 Tesla_K80  22.80M/11.25G  0.2%   2.562M/16.00G  0.0%     0%   0%   0
 3 Tesla_K80  22.80M/11.25G  0.2%   2.562M/16.00G  0.0%     0%   0%   0
hpc@hydra-login01% ssh -xn compute-79-01 get-gpu-info -d
2 GPUs on 79-01
Thu Dec 12 10:02:32 2019
id               ------ memory ------  ------ bar1 --------  ---- usage ----
   -- name --      used/total            used/total           gpu  mem  #proc
 0 Quadro_GV100  64.00k/31.72G  0.0%   2.566M/256.0M  1.0%     0%   0%   0
 1 Quadro_GV100  64.00k/31.72G  0.0%   2.566M/256.0M  1.0%     0%   0%   0
```
Use the uTGPU.tq (batch) or qGPU.iq (interactive) queues to access the GPUs:

- `qsub -l gpu` or `qrsh -l gpu`
- `qsub -l gpu,ngpu=1` or `qrsh -l gpu,ngpu=1`
- `qsub -l gpu,ngpu=2`
I wrote a trivial test case: computing a Julia set (fractals) and saving the corresponding image. It is derived from NVIDIA's own example.
You can find that example, and equivalent codes, under /home/hpc/examples/gpu:
Directory | Contents
---|---
cuda/ | CUDA and C++ code (.cu .cpp Makefile)
cuda/gpu | GPU example
cuda/cpu | CPU equivalent
matlab/ | MATLAB, using standalone compiled code (available for now only at SAO)
matlab/gpu | GPU example
matlab/cpu | CPU equivalent
idl/ | IDL CPU-only equivalent (for comparison)
NVIDIA's own examples are installed under /share/apps/nvidia/cuda/samples, under /cm/shared/apps/cuda10.0/sdk, and under /home/hpc/tests/gpu/cuda/samples/.
The Julia set is computed by iterating

z := z * z + c

where z and c are two complex numbers. The iteration is repeated N times and computed on an M x M grid, where z = x + i y, and x and y are equispaced values between -1.5 and 1.5. The resulting value of z is converted to an index iz = 255*exp(-abs(z)), and that index is converted to a color triplet using a rainbow color map. The shape of the set depends on the value of the constant c. z and c are each represented either by floats (single precision) or by doubles (double precision).
The timing codes are under /home/hpc/examples/gpu/timing/. The timings below are for a given value of c and for N=150; units are [HH:]MM:SS[.S].
(The table is not yet complete.) The first four numeric columns are single precision (run times for three grid sizes, followed by the speed ratio relative to the C++/PGI CPU case); the next four are the same in double precision.

Case | 4k x 4k | 8k x 8k | 16k x 16k | ratio | 4k x 4k | 8k x 8k | 16k x 16k | ratio | Comments | Note
---|---|---|---|---|---|---|---|---|---|---
*GPU cases:* | | | | | | | | | GPU-specific code |
CUDA | 03:05.0 | 12:20.8 | 49:23.6 | 9.202 | | | | | NVIDIA CUDA - their example |
Optimized CUDA | 00:05.1 | 00:17.6 | 01:09.3 | 371.6 | 00:07.6 | 00:27.8 | 01:50.3 | 238.8 | NVIDIA CUDA - optimized |
MATLAB GPU | 01:33.2 | 04:46.0 | 18:56.4 | 22.04 | 02:06.2 | 07:13.0 | | 14.69 | MATLAB equiv., using GPU | 1
*GPU cases, using pragmas:* | | | | | | | | | Regular code with special directives |
PGI C++ OpenACC | 00:54.3 | 02:09.4 | | 42.30 | | | | | C++ with OpenACC | 2,3a
PGI C++ OpenACC (optimized) | 00:12.1 | 00:42.2 | 02:46.2 | 155.5 | 00:13.8 | 00:52.1 | 03:27.6 | 128.5 | C++ with OpenACC, optimized | 2,3a
PGI C++ OpenACC (myComplex) | 10:52:44 | | | 0.043 | | | | | uses myComplex class | 2,3b
PGI C++ OpenACC (std::complex<>) | 10:52:45 | | | 0.043 | | | | | uses std::complex<float> | 2,3c
PGI F90 OpenACC | 00:06.0 | 00:21.1 | 01:23.0 | 311.8 | 00:09.5 | 00:34.4 | 02:18.6 | 191.4 | F90 with OpenACC pragma | 2,4
PGI CUF (Fortran CUDA) | 00:05.1 | 00:17.3 | 01:08.2 | 376.0 | 00:07.7 | 00:27.8 | 01:54.6 | 234.8 | PGI FORTRAN CUDA (optimized) | 2,4
*CPU-only cases:* | | | | | | | | | CPU equivalent cases |
C++ (PGI) | 28:13.7 | 01:55:10 | 07:30:38 | 1.000 | 27:49.2 | 01:51:15 | 07:25:02 | 1.021 | C++, PGI compiler (v15.9) |
C++ (Intel) | 25:35.6 | 01:42:21 | 06:49:39 | 1.109 | | 01:42:08 | 06:48:22 | 1.112 | C++, Intel compiler |
C++ (gcc) | 28:24.6 | 01:53:37 | 07:35:41 | 0.999 | 28:36.9 | 01:54:11 | 07:36:53 | 0.994 | C++, GNU compiler (v4.9.2) |
PGI C++ | 28:53.6 | 01:55:38 | | 0.986 | 28:33.2 | 01:54:01 | | 0.999 | (duplicate) | 3a
PGI C++ (myComplex) | 28:11.0 | 01:52:02 | | 1.015 | | | | | uses myComplex class | 3b
PGI C++ (std::complex<>) | 03:53:08 | 15:29:09 | | 0.123 | | | | | uses std::complex<float> | 3c
PGI F90 | 33:36.8 | 02:13:50 | | 0.850 | 34:04.9 | 02:31:51 | | 0.793 | F90, PGI compiler (v15.9) | 4
MATLAB (1 thread) | 01:09:00 | 04:34:24 | | 0.414 | 01:03:18 | 04:06:06 | | 0.457 | MATLAB on 1 CPU | 1
MATLAB (multi-threaded) | 14:05.7 | 56:09.6 | | 2.027 | 13:25.4 | 53:25.9 | | 2.129 | MATLAB on multiple CPUs | 1
IDL (1 thread) | 01:32:13 | 06:24:22 | | 0.303 | | | | | IDL on 1 CPU | 1
IDL (multi-threaded) | 12:50.0 | 01:32:50 | | 1.720 | | | | | IDL on multiple CPUs | 1
Notes:

- 3b: uses a hand-written complex class, myComplex, and lets the compiler figure out how to port it to the GPU;
- 3c: uses the std::complex<> class template, and also lets the compiler figure out how to port it to the GPU;
- 4: the Fortran versions use the intrinsic COMPLEX type.
Last updated SGK