[ some features still need to be updated ]

Introduction

A total of 8 GPUs on 3 nodes are available on Hydra.

GPU Configuration

(lightbulb) This means that GPU applications will each be given exclusive use of a GPU.

(warning) Only one process per GPU can run at a time; each process gets a different GPU.

(warning) Starting more processes than there are available GPUs will fail with the following error message:

Error: all CUDA-capable devices are busy or unavailable
 in file XXX.cu at line no NNN
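Since only one process can use a GPU at a time, it can help to pick a GPU with no running compute process before starting a job. A minimal Python sketch (a hypothetical helper, not an installed tool) that parses the CSV output of the two nvidia-smi queries named in the docstring:

```python
def free_gpu_indices(gpu_csv, apps_csv):
    """Return the indices of GPUs with no running compute process.

    gpu_csv  is the output of:
      nvidia-smi --query-gpu=index,uuid --format=csv,noheader
    apps_csv is the output of:
      nvidia-smi --query-compute-apps=gpu_uuid --format=csv,noheader
    """
    # each line of apps_csv is the UUID of a GPU with a running process
    busy = {line.strip() for line in apps_csv.splitlines() if line.strip()}
    free = []
    for line in gpu_csv.splitlines():
        if not line.strip():
            continue
        index, uuid = (f.strip() for f in line.split(","))
        if uuid not in busy:
            free.append(int(index))
    return free

# illustrative output (the UUIDs are made up): GPU 0 is busy, GPU 1 is free
gpus = "0, GPU-aaaa\n1, GPU-bbbb\n"
apps = "GPU-aaaa\n"
print(free_gpu_indices(gpus, apps))   # [1]
```

In a batch job one would feed this the live output of the two nvidia-smi commands and export CUDA_VISIBLE_DEVICES accordingly.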

Available Tools

CUDA

The CUDA compiler (v10.0) is available on the login nodes by loading the corresponding module:

% module load cuda10.0
% nvcc -o testGPU test.cu

The examples have been installed under /share/apps/nvidia/cuda/samples and under /cm/shared/apps/cuda10.0/sdk.

CUDA versions 8.0, 9.0, and 9.2 are also available (modules: nvidia/cuda80, nvidia/cuda90, or nvidia/cuda92).

NVIDIA Tools

NVSMI: The NVIDIA System Management Interface

nvidia-cuda-mps-control: NVIDIA CUDA Multi Process Service management program
nvidia-installer: install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set
nvidia-modprobe: load the NVIDIA kernel module and create NVIDIA character device files
nvidia-persistenced: a daemon to maintain persistent software state in the NVIDIA driver
nvidia-settings: configure the NVIDIA graphics driver
nvidia-smi: NVIDIA System Management Interface program
nvidia-xconfig: manipulate X configuration files for the NVIDIA driver

Query and Monitor

The most useful tool is nvidia-smi; it allows you to query and monitor the status of the GPU card(s):

  hpc@compute-79-01% nvidia-smi -l
  hpc@compute-79-01% nvidia-smi dmon -d 30 -s pucm -o DT
  hpc@compute-79-01% nvidia-smi pmon -d 10 -s um -o DT
  hpc@compute-79-01% nvidia-smi \
           --query-compute-apps=timestamp,gpu_uuid,pid,name,used_memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi \
           --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi -q -i 0 --display=MEMORY,UTILIZATION,PIDS

For more details, read the man page (man nvidia-smi).
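The --query-gpu form above emits CSV, which is easy to post-process. A small sketch (the sample line is made up; the header names follow nvidia-smi's CSV convention of appending units in brackets):

```python
import csv
import io

def parse_query_gpu(text):
    """Parse the CSV produced by

      nvidia-smi --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
                 --format=csv,nounits

    into one dict per GPU; the first CSV row is the header."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data if row]

# made-up sample output (field order matches the query above)
sample = (
    "name, serial, index, memory.used [MiB], utilization.gpu [%], utilization.memory [%]\n"
    "Tesla K80, 0324216012345, 0, 760, 63, 8\n"
)
for gpu in parse_query_gpu(sample):
    print(gpu["name"], gpu["memory.used [MiB]"], "MiB used")   # Tesla K80 760 MiB used
```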

GDK/NVML/pynvml

The GPU development kit (GDK), NVIDIA Management Library (NVML) and the python bindings to NVML (pyNVML) are available.

PGI OpenACC/CUF

Local Tools

We have two local GPU-related tools (accessible by loading the tools/local module):

  1. check-gpuse: checks current GPU usage on Hydra;

    hpc@hydra-login% check-gpuse
    hostgroup: @gpu-hosts (3 hosts)
                    - --- memory (GB) ----  -  #GPU - --------- slots/CPUs --------- 
    hostname        -   total   used   resd -  a/u  - nCPU used   load - free unused 
    compute-73-01   -   251.7   36.7  215.0 -  4/0  -   64    0    0.0 -   64   64.0
    compute-79-01   -   125.3   35.0   90.3 -  2/0  -   20    0    0.0 -   20   20.0
    
    


  2. get-gpu-info: queries whether a node has GPUs, returns the GPUs' properties, and shows which process(es) use(s) them.

It is a simple C-shell wrapper that runs the Python script get-gpu-info.py; that script uses pyNVML (the Python bindings to NVML).

(lightbulb) The get-gpu-info wrapper checks whether its first argument is of the form NN-MM; if it is, it runs get-gpu-info.py on compute-NN-MM.

In other words, "get-gpu-info 79-01 0 -d" is equivalent to "ssh -xn compute-79-01 get-gpu-info 0 -d".
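The wrapper's dispatch logic amounts to the following (a Python rendition of the C-shell logic described above, for illustration only; the real wrapper is a csh script):

```python
import re

def build_command(args):
    """If the first argument looks like NN-MM, run get-gpu-info on
    compute-NN-MM via ssh; otherwise run get-gpu-info.py locally."""
    if args and re.fullmatch(r"\d+-\d+", args[0]):
        return ["ssh", "-xn", "compute-" + args[0], "get-gpu-info"] + args[1:]
    return ["get-gpu-info.py"] + args

print(build_command(["79-01", "0", "-d"]))
# i.e. ssh -xn compute-79-01 get-gpu-info 0 -d
```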
 

usage: get-gpu-info.py [-h] [-i] [-d] [-l [LOOP]] [-c COUNTS]
                       [--ntstamp NTSTAMP]
                       [id]
get-gpu-info.py: show info about GPU(s)
positional arguments:
  id                    specify the GPU id, implies --info
optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show info for each GPU
  -d, --details         show details of running process, implies --info
  -l [LOOP], --loop [LOOP]
                        repeat every LOOP [in sec: 10 to 3600], default is 30,
                        implies --info
  -c COUNTS, --counts COUNTS
                        limits the no. of times to loop, implies --info
  --ntstamp NTSTAMP     specify how often to put a time stamp, by default puts
                        one every 10 readings
Ver 1.0/0 Feb 2016/SGK


hpc@hydra-login-01% get-gpu-info
0 GPU on hydra-login-01
hpc@hydra-login-01% ssh -xn compute-79-01 get-gpu-info
2 GPUs on 79-01
hpc@hydra-login-01% get-gpu-info 73-01 -d
4 GPUs on 73-01
Tue Mar  8 13:27:45 2016
id           ------ memory ------ ------ bar1 -------- ---- usage ----
  -- name --   used/total           used/total          gpu  mem #proc
0 Tesla_K80  760.0M/11.25G   6.6% 4.562M/16.00G   0.0%   0%   0% 1
    pid=64617 name=./loopjulia2xGpu used_memory=735.9M
1 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
2 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
3 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
hpc@hydra-login01% ssh -xn compute-79-01 get-gpu-info -d
2 GPUs on 79-01
Thu Dec 12 10:02:32 2019
id           ------ memory ------ ------ bar1 -------- ---- usage ----
  -- name --   used/total           used/total          gpu  mem #proc
0 Quadro_GV100 64.00k/31.72G   0.0% 2.566M/256.0M   1.0%   0%   0% 0
1 Quadro_GV100 64.00k/31.72G   0.0% 2.566M/256.0M   1.0%   0%   0% 0

Available Queues

Available Examples

Trivial Examples

I wrote a trivial test case: computing a Julia set (fractals) and saving the corresponding image. It is derived from NVIDIA's own example.

You can find that example, and equivalent codes, under /home/hpc/examples/gpu. I wrote:

cuda/: CUDA and C++ code (.cu .cpp Makefile)
cuda/gpu: GPU example
cuda/cpu: CPU equivalent
matlab/: MATLAB, using standalone compiled code (available for now only at SAO)
matlab/gpu: GPU example
matlab/cpu: CPU equivalent
idl/: IDL CPU-only equivalent (for comparison)
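The core of all these codes is the same escape-time iteration z -> z*z + c. A minimal pure-Python sketch (not one of the installed examples; the constant c and the iteration limit are arbitrary choices):

```python
def julia(z, c=complex(-0.8, 0.156), max_iter=200, bound=2.0):
    """Escape-time iteration z -> z*z + c; return the number of
    iterations before |z| exceeds the bound (max_iter if it never does)."""
    for n in range(max_iter):
        if abs(z) > bound:
            return n
        z = z * z + c
    return max_iter

# one row of a (tiny) image: iteration counts across -2..2 on the real axis
width = 9
row = [julia(complex(-2.0 + 4.0 * i / (width - 1), 0.0)) for i in range(width)]
print(row)
```

The GPU versions evaluate this same per-pixel loop in parallel, one thread per pixel; the image is then the grid of iteration counts mapped to colors.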


NVIDIA Examples

Timing


All values are elapsed times (mm:ss.s or hh:mm:ss notation) for computing the Julia set image at the given size M x M; each ratio column is the speedup relative to the single-precision C++ (PGI) case (ratio 1.000), averaged over the available sizes; a "-" marks a case that was not run.

                                  ---------- single precision ---------- ---------- double precision ----------
case                              4k x 4k   8k x 8k   16k x 16k  ratio   4k x 4k   8k x 8k   16k x 16k  ratio   comments                       note

GPU cases (GPU-specific code):
CUDA                              03:05.0   12:20.8   49:23.6    9.202   -         -         -          -       NVIDIA CUDA - their example
Optimized CUDA                    00:05.1   00:17.6   01:09.3    371.6   00:07.6   00:27.8   01:50.3    238.8   NVIDIA CUDA - optimized
MATLAB GPU                        01:33.2   04:46.0   18:56.4    22.04   02:06.2   07:13.0   -          14.69   MATLAB equiv, using GPU        1

GPU cases, using pragmas (regular code with special directives):
PGI C++ OpenACC                   00:54.3   02:09.4   -          42.30   -         -         -          -       C++ with OpenACC               2,3a
PGI C++ OpenACC (optimized)       00:12.1   00:42.2   02:46.2    155.5   00:13.8   00:52.1   03:27.6    128.5   C++ with OpenACC, optimized    2,3a
PGI C++ OpenACC (myComplex)       10:52:44  -         -          0.043   -         -         -          -       uses myComplex class           2,3b
PGI C++ OpenACC (std::complex<>)  10:52:45  -         -          0.043   -         -         -          -       uses std::complex<float>       2,3c
PGI F90 OpenACC                   00:06.0   00:21.1   01:23.0    311.8   00:09.5   00:34.4   02:18.6    191.4   F90 with OpenACC pragma        2,4
PGI (CUF Fortran CUDA)            00:05.1   00:17.3   01:08.2    376.0   00:07.7   00:27.8   01:54.6    234.8   PGI FORTRAN CUDA (optimized)   2,4

CPU-only cases (CPU equivalent cases):
C++ (PGI)                         28:13.7   01:55:10  07:30:38   1.000   27:49.2   01:51:15  07:25:02   1.021   C++, PGI compiler (v15.9)
C++ (Intel)                       25:35.6   01:42:21  06:49:39   1.109   25:32.9   01:42:08  06:48:22   1.112   C++, Intel compiler
C++ (gcc)                         28:24.6   01:53:37  07:35:41   0.999   28:36.9   01:54:11  07:36:53   0.994   C++, GNU compiler (v4.9.2)
PGI C++                           28:53.6   01:55:38  -          0.986   28:33.2   01:54:01  -          0.999   (duplicate)                    3a
PGI C++ (myComplex)               28:11.0   01:52:02  -          1.015   -         -         -          -       uses myComplex class           3b
PGI C++ (std::complex<>)          03:53:08  15:29:09  -          0.123   -         -         -          -       uses std::complex<float>       3c
PGI F90                           33:36.8   02:13:50  -          0.850   34:04.9   02:31:51  -          0.793   F90, PGI compiler (v15.9)      4
MATLAB (1 thread)                 01:09:00  04:34:24  -          0.414   01:03:18  04:06:06  -          0.457   MATLAB on 1 CPU                1
MATLAB (multi-threaded)           14:05.7   56:09.6   -          2.027   13:25.4   53:25.9   -          2.129   MATLAB on multiple CPUs        1
IDL (1 thread)                    01:32:13  06:24:22  -          0.303   -         -         -          -       IDL on 1 CPU                   1
IDL (multi-threaded)              12:50.0   01:32:50  -          1.720   -         -         -          -       IDL on multiple CPUs           1

Notes:

  1. MATLAB and IDL tests use the run-time environment; for now, we only have MATLAB licenses (including the compiler) at SAO.
  2. Access to the OpenACC and FORTRAN CUDA from PGI requires a license upgrade - our license was upgraded in May 2016.
  3. C++ does not support complex arithmetic like FORTRAN does, so I used 3 approaches:
    1. write out the complex arithmetic explicitly using only floating points operations,
    2. use a C++ class myComplex, and let the compiler figure out how to port it to the GPU,
    3. use the C++ std::complex<> class template, and also let the compiler figure out how to port it to the GPU.
    Using OpenACC with C++ is trickier than using it with FORTRAN.
    The Julia set computation generates NaNs; handling them in some of the C++ cases may explain the slowdown.
  4. FORTRAN supports complex arithmetic as a built-in data type COMPLEX.
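The elapsed times above use mm:ss.s or hh:mm:ss notation, and the ratios appear to be speedups relative to the single-precision C++ (PGI) case, averaged over the available sizes. A small helper (my reconstruction, not the script used to produce the table) reproduces, for example, the CUDA row's 9.202:

```python
def to_seconds(t):
    """Convert an 'mm:ss.s' or 'hh:mm:ss' elapsed time to seconds."""
    seconds = 0.0
    for part in t.split(":"):
        seconds = seconds * 60.0 + float(part)
    return seconds

def mean_speedup(baseline_times, times):
    """Average, over the image sizes, of baseline/measured elapsed time."""
    ratios = [to_seconds(b) / to_seconds(t) for b, t in zip(baseline_times, times)]
    return sum(ratios) / len(ratios)

cpu_pgi = ["28:13.7", "01:55:10", "07:30:38"]   # single-precision C++ (PGI) baseline
cuda    = ["03:05.0", "12:20.8", "49:23.6"]     # CUDA (NVIDIA's example)
print(round(mean_speedup(cpu_pgi, cuda), 3))    # 9.202
```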


Last updated   SGK