[ some features still need to be updated ]

Introduction

A total of 8 GPUs on 3 nodes are available on Hydra.

GPU Configuration

(lightbulb) This means that GPU applications will each be given exclusive use of a GPU.

(warning) Only one process per GPU can run at a time; each process gets a different GPU.

(warning) Starting more processes than there are available GPUs will fail with the following error message:

Error: all CUDA-capable devices are busy or unavailable
 in file XXX.cu at line no NNN
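Since only one process can use a GPU at a time, it can help to pick a GPU with no running compute process before starting a job. A minimal Python sketch (a hypothetical helper, not an installed tool) that parses the CSV output of the two nvidia-smi queries named in the docstring:

```python
def free_gpu_indices(gpu_csv, apps_csv):
    """Return the indices of GPUs with no running compute process.

    gpu_csv  is the output of:
      nvidia-smi --query-gpu=index,uuid --format=csv,noheader
    apps_csv is the output of:
      nvidia-smi --query-compute-apps=gpu_uuid --format=csv,noheader
    """
    # each line of apps_csv is the UUID of a GPU with a running process
    busy = {line.strip() for line in apps_csv.splitlines() if line.strip()}
    free = []
    for line in gpu_csv.splitlines():
        if not line.strip():
            continue
        index, uuid = (f.strip() for f in line.split(","))
        if uuid not in busy:
            free.append(int(index))
    return free

# illustrative output (the UUIDs are made up): GPU 0 is busy, GPU 1 is free
gpus = "0, GPU-aaaa\n1, GPU-bbbb\n"
apps = "GPU-aaaa\n"
print(free_gpu_indices(gpus, apps))   # [1]
```

In a batch job one would feed this the live output of the two nvidia-smi commands and export CUDA_VISIBLE_DEVICES accordingly.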

Available Tools

CUDA

The CUDA compiler (v10.0) is available on the login nodes by loading the corresponding module:

% module load cuda10.0
% nvcc -o testGPU test.cu

The examples have been installed under /share/apps/nvidia/cuda/samples and under /cm/shared/apps/cuda10.0/sdk.

CUDA versions 8.0, 9.0, and 9.2 are also available (modules: nvidia/cuda80, nvidia/cuda90, or nvidia/cuda92).

NVIDIA Tools

NVSMI: The NVIDIA System Management Interface

nvidia-cuda-mps-control: NVIDIA CUDA Multi Process Service management program
nvidia-installer: install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set
nvidia-modprobe: load the NVIDIA kernel module and create NVIDIA character device files
nvidia-persistenced: a daemon to maintain persistent software state in the NVIDIA driver
nvidia-settings: configure the NVIDIA graphics driver
nvidia-smi: NVIDIA System Management Interface program
nvidia-xconfig: manipulate X configuration files for the NVIDIA driver

Query and Monitor

The most useful tool is nvidia-smi; it allows you to query and monitor the status of the GPU card(s):

  hpc@compute-79-01% nvidia-smi -l
  hpc@compute-79-01% nvidia-smi dmon -d 30 -s pucm -o DT
  hpc@compute-79-01% nvidia-smi pmon -d 10 -s um -o DT
  hpc@compute-79-01% nvidia-smi \
           --query-compute-apps=timestamp,gpu_uuid,pid,name,used_memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi \
           --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi -q -i 0 --display=MEMORY,UTILIZATION,PIDS

For more details, read the man page (man nvidia-smi).
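The --query-gpu form above emits CSV, which is easy to post-process. A small sketch (the sample line is made up; the header names follow nvidia-smi's CSV convention of appending units in brackets):

```python
import csv
import io

def parse_query_gpu(text):
    """Parse the CSV produced by

      nvidia-smi --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
                 --format=csv,nounits

    into one dict per GPU; the first CSV row is the header."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data if row]

# made-up sample output (field order matches the query above)
sample = (
    "name, serial, index, memory.used [MiB], utilization.gpu [%], utilization.memory [%]\n"
    "Tesla K80, 0324216012345, 0, 760, 63, 8\n"
)
for gpu in parse_query_gpu(sample):
    print(gpu["name"], gpu["memory.used [MiB]"], "MiB used")   # Tesla K80 760 MiB used
```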

GDK/NVML/pynvml

The GPU development kit (GDK), NVIDIA Management Library (NVML) and the python bindings to NVML (pyNVML) are available.

PGI OpenACC/CUF

Local Tools

We have two local GPU-related tools (accessible by loading the tools/local module):

  1. check-gpuse: checks current GPU usage on Hydra;

    hpc@hydra-login% check-gpuse
    hostgroup: @gpu-hosts (3 hosts)
                    - --- memory (GB) ----  -  #GPU - --------- slots/CPUs --------- 
    hostname        -   total   used   resd -  a/u  - nCPU used   load - free unused 
    compute-73-01   -   251.7   36.7  215.0 -  4/0  -   64    0    0.0 -   64   64.0
    compute-79-01   -   125.3   35.0   90.3 -  2/0  -   20    0    0.0 -   20   20.0
    
    


  2. get-gpu-info: queries whether a node has GPUs, returns the GPUs' properties, and shows which process(es) use(s) them.

It is a simple C-shell wrapper that runs the Python script get-gpu-info.py; that script uses pyNVML (the Python bindings to NVML).

(lightbulb) The get-gpu-info wrapper checks whether its first argument is of the form NN-MM; if it is, it runs get-gpu-info.py on compute-NN-MM.

In other words, "get-gpu-info 79-01 0 -d" is equivalent to "ssh -xn compute-79-01 get-gpu-info 0 -d".
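The wrapper's dispatch logic amounts to the following (a Python rendition of the C-shell logic described above, for illustration only; the real wrapper is a csh script):

```python
import re

def build_command(args):
    """If the first argument looks like NN-MM, run get-gpu-info on
    compute-NN-MM via ssh; otherwise run get-gpu-info.py locally."""
    if args and re.fullmatch(r"\d+-\d+", args[0]):
        return ["ssh", "-xn", "compute-" + args[0], "get-gpu-info"] + args[1:]
    return ["get-gpu-info.py"] + args

print(build_command(["79-01", "0", "-d"]))
# i.e. ssh -xn compute-79-01 get-gpu-info 0 -d
```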
 

usage: get-gpu-info.py [-h] [-i] [-d] [-l [LOOP]] [-c COUNTS]
                       [--ntstamp NTSTAMP]
                       [id]
get-gpu-info.py: show info about GPU(s)
positional arguments:
  id                    specify the GPU id, implies --info
optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show info for each GPU
  -d, --details         show details of running process, implies --info
  -l [LOOP], --loop [LOOP]
                        repeat every LOOP [in sec: 10 to 3600], default is 30,
                        implies --info
  -c COUNTS, --counts COUNTS
                        limits the no. of times to loop, implies --info
  --ntstamp NTSTAMP     specify how often to put a time stamp, by default puts
                        one every 10 readings
Ver 1.0/0 Feb 2016/SGK


hpc@hydra-login-01% get-gpu-info
0 GPU on hydra-login-01
hpc@hydra-login-01% ssh -xn compute-79-01 get-gpu-info
2 GPUs on 79-01
hpc@hydra-login-01% get-gpu-info 73-01 -d
4 GPUs on 73-01
Tue Mar  8 13:27:45 2016
id           ------ memory ------ ------ bar1 -------- ---- usage ----
  -- name --   used/total           used/total          gpu  mem #proc
0 Tesla_K80  760.0M/11.25G   6.6% 4.562M/16.00G   0.0%   0%   0% 1
    pid=64617 name=./loopjulia2xGpu used_memory=735.9M
1 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
2 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
3 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
hpc@hydra-login01% ssh -xn compute-79-01 get-gpu-info -d
2 GPUs on 79-01
Thu Dec 12 10:02:32 2019
id           ------ memory ------ ------ bar1 -------- ---- usage ----
  -- name --   used/total           used/total          gpu  mem #proc
0 Quadro_GV100 64.00k/31.72G   0.0% 2.566M/256.0M   1.0%   0%   0% 0
1 Quadro_GV100 64.00k/31.72G   0.0% 2.566M/256.0M   1.0%   0%   0% 0

Available Queues

Available Examples

Trivial Examples

I wrote a trivial test case: computing a Julia set (fractals) and saving the corresponding image. It is derived from NVIDIA's own example.

You can find that example, and equivalent codes, under /home/hpc/examples/gpu. I wrote:

cuda/: CUDA and C++ code (.cu .cpp Makefile)
cuda/gpu: GPU example
cuda/cpu: CPU equivalent
matlab/: MATLAB, using standalone compiled code (available for now only at SAO)
matlab/gpu: GPU example
matlab/cpu: CPU equivalent
idl/: IDL CPU-only equivalent (for comparison)
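The core of all these codes is the same escape-time iteration z -> z*z + c. A minimal pure-Python sketch (not one of the installed examples; the constant c and the iteration limit are arbitrary choices):

```python
def julia(z, c=complex(-0.8, 0.156), max_iter=200, bound=2.0):
    """Escape-time iteration z -> z*z + c; return the number of
    iterations before |z| exceeds the bound (max_iter if it never does)."""
    for n in range(max_iter):
        if abs(z) > bound:
            return n
        z = z * z + c
    return max_iter

# one row of a (tiny) image: iteration counts across -2..2 on the real axis
width = 9
row = [julia(complex(-2.0 + 4.0 * i / (width - 1), 0.0)) for i in range(width)]
print(row)
```

The GPU versions evaluate this same per-pixel loop in parallel, one thread per pixel; the image is then the grid of iteration counts mapped to colors.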


NVIDIA Examples

Timing


All values are elapsed times (mm:ss.s or hh:mm:ss notation) for computing the Julia set image at the given size M x M; each ratio column is the speedup relative to the single-precision C++ (PGI) case (ratio 1.000), averaged over the available sizes; a "-" marks a case that was not run.

                                  ---------- single precision ---------- ---------- double precision ----------
case                              4k x 4k   8k x 8k   16k x 16k  ratio   4k x 4k   8k x 8k   16k x 16k  ratio   comments                       note

GPU cases (GPU-specific code):
CUDA                              03:05.0   12:20.8   49:23.6    9.202   -         -         -          -       NVIDIA CUDA - their example
Optimized CUDA                    00:05.1   00:17.6   01:09.3    371.6   00:07.6   00:27.8   01:50.3    238.8   NVIDIA CUDA - optimized
MATLAB GPU                        01:33.2   04:46.0   18:56.4    22.04   02:06.2   07:13.0   -          14.69   MATLAB equiv, using GPU        1

GPU cases, using pragmas (regular code with special directives):
PGI C++ OpenACC                   00:54.3   02:09.4   -          42.30   -         -         -          -       C++ with OpenACC               2,3a
PGI C++ OpenACC (optimized)       00:12.1   00:42.2   02:46.2    155.5   00:13.8   00:52.1   03:27.6    128.5   C++ with OpenACC, optimized    2,3a
PGI C++ OpenACC (myComplex)       10:52:44  -         -          0.043   -         -         -          -       uses myComplex class           2,3b
PGI C++ OpenACC (std::complex<>)  10:52:45  -         -          0.043   -         -         -          -       uses std::complex<float>       2,3c
PGI F90 OpenACC                   00:06.0   00:21.1   01:23.0    311.8   00:09.5   00:34.4   02:18.6    191.4   F90 with OpenACC pragma        2,4
PGI (CUF Fortran CUDA)            00:05.1   00:17.3   01:08.2    376.0   00:07.7   00:27.8   01:54.6    234.8   PGI FORTRAN CUDA (optimized)   2,4

CPU-only cases (CPU equivalent cases):
C++ (PGI)                         28:13.7   01:55:10  07:30:38   1.000   27:49.2   01:51:15  07:25:02   1.021   C++, PGI compiler (v15.9)
C++ (Intel)                       25:35.6   01:42:21  06:49:39   1.109   25:32.9   01:42:08  06:48:22   1.112   C++, Intel compiler
C++ (gcc)                         28:24.6   01:53:37  07:35:41   0.999   28:36.9   01:54:11  07:36:53   0.994   C++, GNU compiler (v4.9.2)
PGI C++                           28:53.6   01:55:38  -          0.986   28:33.2   01:54:01  -          0.999   (duplicate)                    3a
PGI C++ (myComplex)               28:11.0   01:52:02  -          1.015   -         -         -          -       uses myComplex class           3b
PGI C++ (std::complex<>)          03:53:08  15:29:09  -          0.123   -         -         -          -       uses std::complex<float>       3c
PGI F90                           33:36.8   02:13:50  -          0.850   34:04.9   02:31:51  -          0.793   F90, PGI compiler (v15.9)      4
MATLAB (1 thread)                 01:09:00  04:34:24  -          0.414   01:03:18  04:06:06  -          0.457   MATLAB on 1 CPU                1
MATLAB (multi-threaded)           14:05.7   56:09.6   -          2.027   13:25.4   53:25.9   -          2.129   MATLAB on multiple CPUs        1
IDL (1 thread)                    01:32:13  06:24:22  -          0.303   -         -         -          -       IDL on 1 CPU                   1
IDL (multi-threaded)              12:50.0   01:32:50  -          1.720   -         -         -          -       IDL on multiple CPUs           1

Notes:

  1. MATLAB and IDL tests use the run-time environment; for now, we only have MATLAB licenses (including the compiler) at SAO.
  2. Access to the OpenACC and FORTRAN CUDA from PGI requires a license upgrade - our license was upgraded in May 2016.
  3. C++ does not support complex arithmetic like FORTRAN does, so I used 3 approaches:
    1. write out the complex arithmetic explicitly using only floating points operations,
    2. use a C++ class myComplex, and let the compiler figure out how to port it to the GPU,
    3. use the C++ std::complex<> class template, and also let the compiler figure out how to port it to the GPU.
    Using OpenACC with C++ is trickier than using it with FORTRAN.
    The Julia set computation generates NaNs; handling them in some of the C++ cases may explain the slowdown.
  4. FORTRAN supports complex arithmetic as a built-in data type COMPLEX.
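The elapsed times above use mm:ss.s or hh:mm:ss notation, and the ratios appear to be speedups relative to the single-precision C++ (PGI) case, averaged over the available sizes. A small helper (my reconstruction, not the script used to produce the table) reproduces, for example, the CUDA row's 9.202:

```python
def to_seconds(t):
    """Convert an 'mm:ss.s' or 'hh:mm:ss' elapsed time to seconds."""
    seconds = 0.0
    for part in t.split(":"):
        seconds = seconds * 60.0 + float(part)
    return seconds

def mean_speedup(baseline_times, times):
    """Average, over the image sizes, of baseline/measured elapsed time."""
    ratios = [to_seconds(b) / to_seconds(t) for b, t in zip(baseline_times, times)]
    return sum(ratios) / len(ratios)

cpu_pgi = ["28:13.7", "01:55:10", "07:30:38"]   # single-precision C++ (PGI) baseline
cuda    = ["03:05.0", "12:20.8", "49:23.6"]     # CUDA (NVIDIA's example)
print(round(mean_speedup(cpu_pgi, cuda), 3))    # 9.202
```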


Last updated   SGK