...
A total of 8 GPUs on 3 nodes are available on Hydra.
- One node has four GPU cards (NVIDIA L40S)
- Two nodes have two GPU cards each (NVIDIA GV100)
Each GPU type corresponds to:

Type | CUDA Cores | Memory | Mem b/w
---|---|---|---
L40S | 18,176 | 48GB | 864 GB/s
GV100 | 5,120 | 32GB | 870 GB/s

Note that CUDA cores are not like CPU cores.
...
- The GPUs are configured as follows:
- The current driver is CUDA 12.4 (Driver Version 550.54.15)
- Persistence mode is ENABLED (the NVIDIA driver remains loaded even when there are no active clients)
- Compute mode is set to EXCLUSIVE_PROCESS (only one context is allowed per device, usable by multiple threads at a time)
- This configuration is reset at reboots (see man nvidia-smi, accessible on the GPU nodes).
This means that GPU applications will have exclusive access to their assigned GPU(s): only one process can use a given GPU at a time.
...
Starting one more process than there are available GPUs will fail with the following error message:
Error: all CUDA-capable devices are busy or unavailable
in file XXX.cu at line no NNN
CUDA
- The CUDA compiler is now part of the NVIDIA compilers, and is accessible by loading the NVIDIA module:
% module load nvidia
which loads by default NVIDIA 23.9 and CUDA 12.2.
- Other versions are available; check with
% module whatis nvidia
The CUDA 10.2 SDK is also available (module whatis cuda10.2).
NVSMI: The NVIDIA System Management Interface
- NVSMI v550.54.15 and dcgm v3.35 are available on the GPU nodes.
- The following tools are available, but only on the nodes with GPUs, and by loading the cuda10.2 module:
nvidia-cuda-mps-control | NVIDIA CUDA Multi Process Service management program
nvidia-installer | install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set
nvidia-modprobe | load the NVIDIA kernel module and create NVIDIA character device files
nvidia-persistenced | a daemon to maintain persistent software state in the NVIDIA driver
nvidia-settings | configure the NVIDIA graphics driver
nvidia-smi | NVIDIA System Management Interface program
nvidia-xconfig | manipulate X configuration files for the NVIDIA driver
- Read the corresponding man pages to learn more; load the cuda-dcgm module to access dcgm.
- The man pages are accessible on the GPU nodes by loading the cuda10.2 module.
Query and Monitor
- The most useful tool is nvidia-smi; it allows you to query and monitor the status of the GPU card(s):
```
hpc@compute-79-01% nvidia-smi -l
hpc@compute-79-01% nvidia-smi dmon -d 30 -s pucm -o DT
hpc@compute-79-01% nvidia-smi pmon -d 10 -s um -o DT
hpc@compute-79-01% nvidia-smi \
  --query-compute-apps=timestamp,gpu_uuid,pid,name,used_memory \
  --format=csv,nounits -l 15
hpc@compute-79-01% nvidia-smi \
  --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
  --format=csv,nounits -l 15
hpc@compute-79-01% nvidia-smi -q -i 0 --display=MEMORY,UTILIZATION,PIDS
```
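The CSV output of the --query-gpu form is easy to post-process in a script. Below is a minimal Python sketch; the sample text is illustrative (not captured from a Hydra node), and the exact header names depend on the fields you query:

```python
import csv
import io

# Illustrative sample of `nvidia-smi --query-gpu=... --format=csv,nounits`
# output; real headers/values depend on the queried fields and the node.
sample = """\
name, index, memory.used, utilization.gpu
NVIDIA L40S, 0, 479, 0
NVIDIA L40S, 1, 479, 37
"""

# List the indices of GPUs that report non-zero utilization.
busy = []
for row in csv.DictReader(io.StringIO(sample), skipinitialspace=True):
    if int(row["utilization.gpu"]) > 0:
        busy.append(int(row["index"]))
print("busy GPUs:", busy)
```

The same parsing works on the live output of the looping (-l 15) invocations above, one block of rows per reading.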
...
- Read the man page (man nvidia-smi).
GDK/NVML/pynvml
The GPU Development Kit (GDK), the NVIDIA Management Library (NVML), and the Python bindings to NVML (pyNVML) are available.
The NVML documentation is available at NVIDIA's web site.
pyNVML 7.352.0 is available via the nvidia/pynvml module, and its documentation is on-line.
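As a sketch of what pyNVML offers, the snippet below lists each GPU's memory use, roughly what the local get-gpu-info tool reports. The fallback branch with made-up numbers is only there so the example also runs on a node without a GPU; the pynvml calls themselves are the standard NVML bindings:

```python
def fmt(idx, name, used, total):
    """One line per GPU: id, name, used/total memory and percent used."""
    return f"{idx} {name} {used / 2**20:.1f}M/{total / 2**30:.2f}G {100.0 * used / total:.1f}%"

try:
    import pynvml
    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        print(fmt(i, pynvml.nvmlDeviceGetName(h), mem.used, mem.total))
    pynvml.nvmlShutdown()
except Exception:
    # no pynvml or no GPU on this host: show the format with made-up numbers
    print(fmt(0, "NVIDIA_L40S", 479 * 2**20, 45 * 2**30))
```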
NVIDIA OpenACC/CUF
- The NVIDIA compilers support OpenACC.
- OpenACC, similarly to OpenMP, instructs a compiler to produce code that will run on the GPU.
- It uses pragmas, i.e., instructions to the compiler that otherwise look like comments, to specify what part of the computation should be offloaded to the GPU.
...
A single pair of such pragmas produced a >300x speed-up of the Julia set test case.
- This requires an additional license that is available on Hydra (not at SAO, though).
- The NVIDIA compilers also support CUDA FORTRAN (aka CUF).
- You can write new or modify existing FORTRAN code to use the GPU, as you can with C/C++ & CUDA.
- Simple examples are available in /home/hpc/examples/gpu/cuda
Local Tools
- We provide two local GPU related tools (accessible by loading the tools/local-admin module):
- check-gpuse: checks current GPU usage:
```
hpc@hydra-login% check-gpuse
hostgroup: @gpu-hosts (3 hosts)
                - --- memory (GB) ---- - #GPU - --------- slots/CPUs ---------
hostname        - total  used   resd   - a/u  - nCPU used load - free unused
compute-50-01   - 503.3  19.6   483.7  - 4/0  -   64    0  0.0 -   64   64.0
compute-79-01   - 125.4  18.1   107.3  - 2/0  -   20    0  0.0 -   20   20.0
compute-79-02   - 125.5  46.2    79.3  - 2/1  -   20    1  2.1 -   19   17.9
Total #GPU=8 used=1 (12.5%)
```
- get-gpu-info: queries whether a node has a GPU, and returns the GPU(s) properties and which process(es) use(s) the GPUs.
It is a simple C-shell wrapper that runs the python script get-gpu-info.py, which uses pyNVML (the python bindings to NVML).
The get-gpu-info wrapper checks whether its first argument is of the form NN-MM; if it is, it runs get-gpu-info.py on compute-NN-MM.
In other words, "get-gpu-info 79-01 0 -d" is equivalent to "ssh -xn compute-79-01 get-gpu-info 0 -d".
```
usage: get-gpu-info.py3 [-h] [-i] [-d] [-l [LOOP]] [-c COUNTS]
                        [--ntstamp NTSTAMP]
                        [id]

get-gpu-info.py3: show info about GPU(s)

positional arguments:
  id                    specify the GPU id, implies --info

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show info for each GPU
  -d, --details         show details of running process, implies --info
  -l [LOOP], --loop [LOOP]
                        repeat every LOOP [in sec: 10 to 3600], default is 30,
                        implies --info
  -c COUNTS, --counts COUNTS
                        limits the no. of times to loop, implies --info
  --ntstamp NTSTAMP     specify how often to put a time stamp, by default puts
                        one every 10 readings

Ver 1.1/0 Oct 2021/SGK
```
For example:

```
hpc@hydra-login-01% get-gpu-info
0 GPU on hydra-login-01

hpc@login01% get-gpu-info 50-01 -d
4 GPUs on 50-01
Thu May 16 15:30:06 2024
 id  --- name ---   ------ memory ------   -------- bar1 --------   ---- usage ----
                         used/total              used/total           gpu  mem  #proc
  0  NVIDIA_L40S    479.1M/44.99G   1.0%    1.688M/64.00G   0.0%      0%   0%    0
  1  NVIDIA_L40S    479.1M/44.99G   1.0%    1.688M/64.00G   0.0%      0%   0%    0
  2  NVIDIA_L40S    479.1M/44.99G   1.0%    1.688M/64.00G   0.0%      0%   0%    0
  3  NVIDIA_L40S    479.1M/44.99G   1.0%    1.688M/64.00G   0.0%      0%   0%    0

hpc@login01% get-gpu-info 79-01 -d
2 GPUs on 79-01
Thu May 16 15:30:12 2024
 id  --- name ---   ------ memory ------   -------- bar1 --------   ---- usage ----
                         used/total              used/total           gpu  mem  #proc
  0  Quadro_GV100   276.5M/32.00G   0.8%    2.688M/256.0M   1.0%      0%   0%    0
  1  Quadro_GV100   276.5M/32.00G   0.8%    2.688M/256.0M   1.0%      0%   0%    0

hpc@login01% get-gpu-info 79-02 -d
2 GPUs on 79-02
Thu May 16 15:30:19 2024
 id  --- name ---   ------ memory ------   -------- bar1 --------   ---- usage ----
                         used/total              used/total           gpu  mem  #proc
  0  Quadro_GV100   6.212G/32.00G  19.4%    5.188M/256.0M   2.0%      0%   0%    1
       1 pid=494189 name=b'python3' used_memory=5.939G
  1  Quadro_GV100   279.1M/32.00G   0.9%    3.188M/256.0M   1.2%      0%   0%    0
```
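The wrapper's argument handling described above can be sketched as follows (in Python for clarity; the real wrapper is a C-shell script, and the names here simply mirror the description):

```python
import re

def build_cmd(args):
    """If the first argument is of the form NN-MM, run get-gpu-info on
    compute-NN-MM via ssh; otherwise run the python script locally."""
    if args and re.fullmatch(r"\d+-\d+", args[0]):
        node, rest = args[0], args[1:]
        return ["ssh", "-xn", f"compute-{node}", "get-gpu-info"] + rest
    return ["get-gpu-info.py"] + args

# "get-gpu-info 79-01 0 -d" becomes an ssh to compute-79-01:
print(" ".join(build_cmd(["79-01", "0", "-d"])))
```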
- Four queues are available to access the GPUs:
  - an interactive queue, qgpu.iq, that has a 24hr elapsed time limit, and
  - three batch queues, sTgpu.q, mTgpu.q & lTgpu.q, corresponding to a short, medium and long time limit.
- There are also limits on (CPU) memory usage and on how many concurrent GPUs can be used; see the Available Queues page.
Note:
- You must request to use a GPU to run in these queues, as follows:
...
- If your job will use two GPUs, use:
% qsub -l gpu,ngpu=2
- You can specify what type of GPU to use with -l gpu,gpu_arch=L40S or -l gpu,gpu_arch=GV100.
Currently we impose the following resource limits for GPUs:
...
...
You can find that example, and equivalent codes, under /home/hpc/examples/gpu:
cuda/ | CUDA and C++ code (.cu .cpp Makefile) |
cuda/gpu | GPU example |
cuda/cpu | CPU equivalent |
matlab/ | MATLAB, using standalone compiled code (available for now only at SAO) |
matlab/gpu | GPU example |
matlab/cpu | CPU equivalent |
idl/ | IDL CPU-only equivalent (for comparison) |
Note:
I wrote a more sophisticated alternative to that example that achieves a 500:1 speed-up compared to the equivalent computation running on a single CPU.
That reduces a 7.5 hour long computation to less than 1 minute, in a case that is intrinsically fully "parallelizable."
It illustrates the potential gain, compared to the cost of coding using CUDA (an extension of C++).
NVIDIA Examples
You can find the NVIDIA examples under /share/apps/nvidia/cuda/samples
and under /cm/shared/apps/cuda10.0/sdk
.
I was able to build them, but not on the login nodes, as they use libraries that are currently only available on the GPU nodes.
You can find these builds under /home/hpc/tests/gpu/cuda/samples/.
Timing
[This section is old and these tests have yet to be re-run]
I use a simple fractal computation (Julia set) to run timings. The computation is simply
z = z * z + c
where z and c are two complex numbers.
The assignment is iterated N times and computed on an M x M grid, where z = x + i y, and x and y are equispaced values between -1.5 and 1.5.
The final value of z is converted to an index iz = 255*exp(-abs(z)), and the index iz is converted to a color triplet using a rainbow color map.
The computation is repeated for a range of values of c.
I wrote various versions using
C++ CUDA (a basic one, an optimized one),
CUDA FORTRAN
C++ and FORTRAN (CPU-only as reference),
MATLAB (CPU) and MATLAB using GPU,
and IDL (CPU only).
In the single precision versions, z and c are each represented by floats; in the double precision versions, they are represented by doubles.
In the tests I ran for timing purposes I did not save the resulting image.
The codes, job files and log files are available in /home/hpc/examples/gpu/timing/
I ran them on 201 values of c and for N=150; timing units are [HH:]MM:SS[.S].
(not yet complete)
single precision | speed | double precision | speed | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
size M x M | 4k x 4k | 8k x 8k | 16k x 16k | ratio | 4k x 4k | 8k x 8k | 16k x 16k | ratio | Comments | Note | |
GPU cases: | GPU specific code | |||||||||||
CUDA | 03:05.0 | 12:20.8 | 49:23.6 | 9.202 | NVIDIA CUDA - their example | |||||||
Optimized CUDA | 00:05.1 | 00:17.6 | 01:09.3 | 371.6 | 00:07.6 | 00:27.8 | 01:50.3 | 238.8 | NVIDIA CUDA - optimized | |||
MATLAB GPU | 01:33.2 | 04:46.0 | 18:56.4 | 22.04 | 02:06.2 | 07:13.0 | 14.69 | MATLAB equiv, using GPU | 1 | |||
GPU cases, using pragmas: | Regular code with special directives | |||||||||||
PGI C++ OpenACC | 00:54.3 | 02:09.4 | 42.30 | C++ with OpenACC | 2,3a | |||||||
PGI C++ OpenACC (optimized) | 00:12.1 | 00:42.2 | 02:46.2 | 155.5 | 00:13.8 | 00:52.1 | 03:27.6 | 128.5 | C++ with OpenACC, optimized | 2,3a | ||
PGI C++ OpenACC (myComplex ) | 10:52:44 | 0.043 | uses myComplex class | 2,3b | ||||||||
PGI C++ OpenAC (std::complex<> ) | 10:52:45 | 0.043 | uses std::complex<float> | 2,3c | ||||||||
PGI F90 OpenACC | 00:06.0 | 00:21.1 | 01:23.0 | 311.8 | 00:09.5 | 00:34.4 | 02:18.6 | 191.4 | F90 with OpenACC pragma | 2,4 | ||
PGI (CUF Fortran CUDA) | 00:05.1 | 00:17.3 | 01:08.2 | 376.0 | 00:07.7 | 00:27.8 | 01:54.6 | 234.8 | PGI FORTRAN CUDA (optimized) | 2,4 | ||
CPU only cases: | CPU equivalent cases | |||||||||||
C++ (PGI) | 28:13.7 | 01:55:10 | 07:30:38 | 1.000 | 27:49.2 | 01:51:15 | 07:25:02 | 1.021 | C++, PGI compiler (v15.9) | |||
C++ (Intel) | 25:35.6 | 01:42:21 | 06:49:39 | 1.109 | | 01:42:08 | 06:48:22 | 1.112 | C++, Intel compiler | | |
C++ (gcc) | 28:24.6 | 01:53:37 | 07:35:41 | 0.999 | 28:36.9 | 01:54:11 | 07:36:53 | 0.994 | C++, GNU compiler (v4.9.2) | |||
PGI C++ | 28:53.6 | 01:55:38 | 0.986 | 28:33.2 | 01:54:01 | 0.999 | (duplicate) | 3a | ||||
PGI C++ (myComplex ) | 28:11.0 | 01:52:02 | 1.015 | uses myComplex class | 3b | |||||||
PGI C++ (std::complex<> ) | 03:53:08 | 15:29:09 | 0.123 | uses std::complex<float> | 3c | |||||||
PGI F90 | 33:36.8 | 02:13:50 | 0.850 | 34:04.9 | 02:31:51 | 0.793 | F90, PGI compiler (v15.9) | 4 | ||||
MATLAB (1 thread) | 01:09:00 | 04:34:24 | 0.414 | 01:03:18 | 04:06:06 | 0.457 | MATLAB on 1 CPU | 1 | ||||
MATLAB (multi-threaded) | 14:05.7 | 56:09.6 | 2.027 | 13:25.4 | 53:25.9 | 2.129 | MATLAB on multiple CPUs | 1 | ||||
IDL (1 thread) | 01:32:13 | 06:24:22 | 0.303 | IDL on 1 CPU | 1 | |||||||
IDL (multi-threaded) | 12:50.0 | 01:32:50 | 1.720 | IDL on multiple CPUs | 1 |
Notes:
Green means ran faster, red slower, grey is the reference (yellow: not run).
Optimized CUDA and CUDA FORTRAN ran 371 and 376 times faster than single-CPU C++.
MATLAB and IDL tests use the run-time environment, for now we only have MATLAB licenses at SAO (including compiler).
Access to the OpenACC and FORTRAN CUDA from PGI requires a license upgrade - our license was upgraded in May 2016.
C++ does not support complex arithmetic like FORTRAN does, so I used 3 approaches:
- write out the complex arithmetic explicitly using only floating-point operations,
- use a C++ class myComplex, and let the compiler figure out how to port it to the GPU,
- use the C++ std::complex<> class template, and also let the compiler figure out how to port it to the GPU.
Using OpenACC with C++ is trickier than using it with FORTRAN.
The Julia set computation generates NaNs; handling these in some of the C++ cases may explain the slowdown.
FORTRAN supports complex arithmetic as a built-in data type COMPLEX.
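The first two approaches amount to the same arithmetic, and the equivalence is easy to check. This is shown in Python purely for illustration, since Python, like FORTRAN, has a built-in complex type:

```python
def step_explicit(zr, zi, cr, ci):
    """One iteration of z = z*z + c written out with only real
    floating-point operations (approach 1)."""
    # (zr + i zi)^2 = zr^2 - zi^2 + i (2 zr zi), then add c
    return zr * zr - zi * zi + cr, 2.0 * zr * zi + ci

def step_builtin(z, c):
    """The same iteration using a built-in complex type."""
    return z * z + c

# for a sample point the two formulations agree exactly
z, c = 0.3 + 0.5j, -0.4 + 0.6j
zr, zi = step_explicit(z.real, z.imag, c.real, c.imag)
print(complex(zr, zi), step_builtin(z, c))
```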
...
Last updated SGK