
[ some features still need to be updated ]

Introduction

A total of 8 GPUs on 3 nodes are available on Hydra.

  • One node has two dual-GPU cards (NVIDIA K80)
  • Two nodes have two GPU cards each (NVIDIA GV100)
    • the specifications of each GPU type are:

      Type    CUDA cores   Memory   Memory bandwidth
      K80     2,496        12 GB    480 GB/s
      GV100   5,120        32 GB    870 GB/s

      Note that CUDA cores are not equivalent to CPU cores.

GPU Configuration

  • The GPUs are configured as follows:
    • persistence mode is ENABLED (the NVIDIA driver remains loaded even when there are no active clients)
    • compute mode is set to EXCLUSIVE_PROCESS (only one process context is allowed per device at a time, usable from multiple threads of that process)
    (warning) This configuration is reset at reboots.
    (see man nvidia-smi, accessible by loading the nvidia/nvsmi module.)

(lightbulb) This means that GPU applications will

  • start faster (no need to reload the driver), and
  • have exclusive use of their GPU (only one process can use a GPU at a time).

(warning) Only one process per GPU can run at a time; each process gets a different GPU.

(warning) Starting more processes than there are available GPUs will fail with the following error message:

Error: all CUDA-capable devices are busy or unavailable
 in file XXX.cu at line no NNN
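
If you want your own code to detect this situation, here is a minimal, hedged sketch (not part of the Hydra setup or the installed examples) that reports the compute mode and checks the return code of the first CUDA call; the file name and structure are made up for illustration:

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int dev = 0;
    cudaDeviceProp prop;

    // report the compute mode the GPU is set to (EXCLUSIVE_PROCESS on Hydra)
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("device %d compute mode = %d (EXCLUSIVE_PROCESS = %d)\n",
           dev, prop.computeMode, cudaComputeModeExclusiveProcess);

    // the first call that creates a context fails if another process already owns the GPU
    void *p = NULL;
    cudaError_t err = cudaMalloc(&p, 1024);
    if (err != cudaSuccess) {
        // in EXCLUSIVE_PROCESS mode a busy GPU typically reports
        // "all CUDA-capable devices are busy or unavailable"
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(p);
    return 0;
}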

Available Tools

CUDA

The CUDA compiler (v10.0) is available on the login nodes by loading the corresponding module:

% module load cuda10.0
% nvcc -o testGPU test.cu
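
The file test.cu above is whatever CUDA source you wish to compile; as a hedged illustration only (this is not one of the installed examples, and the kernel name is made up), a minimal test.cu could look like this:

#include <cstdio>
#include <cuda_runtime.h>

// trivial kernel: each thread writes its global index into the array
__global__ void fillIndex(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = i;
}

int main(void) {
    const int n = 256;
    int h[n], *d = NULL;
    cudaMalloc((void **)&d, n * sizeof(int));
    fillIndex<<<(n + 127) / 128, 128>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[0]=%d h[%d]=%d\n", h[0], n - 1, h[n - 1]);
    return 0;
}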

The examples have been installed under /share/apps/nvidia/cuda/samples and under /cm/shared/apps/cuda10.0/sdk

CUDA versions 8.0, 9.0, and 9.2 are also available (modules: nvidia/cuda80, nvidia/cuda90, or nvidia/cuda92).

NVIDIA Tools

NVSMI: The NVIDIA System Management Interface

  • NVSMI version 418.67 is available.
  • The following tools are available, but only on the nodes with GPUs:
    nvidia-cuda-mps-control   NVIDIA CUDA Multi Process Service management program
    nvidia-installer          install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set
    nvidia-modprobe           load the NVIDIA kernel module and create NVIDIA character device files
    nvidia-persistenced       a daemon to maintain persistent software state in the NVIDIA driver
    nvidia-settings           configure the NVIDIA graphics driver
    nvidia-smi                NVIDIA System Management Interface program
    nvidia-xconfig            manipulate X configuration files for the NVIDIA driver
  • Read the corresponding man pages to learn more.
  • The man pages are accessible on the GPU nodes by loading the nvidia module.

Query and Monitor

The most useful tool is nvidia-smi; it allows you to query and monitor the status of the GPU card(s).

Try one of the following commands:
  hpc@compute-79-01% nvidia-smi -l
  hpc@compute-79-01% nvidia-smi dmon -d 30 -s pucm -o DT
  hpc@compute-79-01% nvidia-smi pmon -d 10 -s um -o DT
  hpc@compute-79-01% nvidia-smi \
           --query-compute-apps=timestamp,gpu_uuid,pid,name,used_memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi \
           --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi -q -i 0 --display=MEMORY,UTILIZATION,PIDS

Read the man page (man nvidia-smi) for more details.

GDK/NVML/pynvml

The GPU development kit (GDK), NVIDIA Management Library (NVML) and the python bindings to NVML (pyNVML) are available.
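
As a hedged illustration (not one of the installed examples), a minimal NVML query written in C/C++ might look like the following; the nvml.h header and the -lnvidia-ml link flag are the usual ones, but the exact paths depend on the installation:

#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned int i, n;
    if (nvmlInit() != NVML_SUCCESS) { fprintf(stderr, "nvmlInit failed\n"); return 1; }
    nvmlDeviceGetCount(&n);
    for (i = 0; i < n; i++) {
        nvmlDevice_t dev;
        char name[64];
        nvmlMemory_t mem;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetMemoryInfo(dev, &mem);
        // memory figures are returned in bytes; shift to report MB
        printf("GPU %u: %s, %llu/%llu MB used\n", i, name,
               (unsigned long long)(mem.used >> 20), (unsigned long long)(mem.total >> 20));
    }
    nvmlShutdown();
    return 0;
}

This is, in essence, what the local get-gpu-info.py script (described below) does through the pyNVML bindings.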

PGI OpenACC/CUF

  • The PGI compilers support OpenACC.
    • OpenACC, similarly to OpenMP, instructs a compiler to produce code that will run on the GPU (a short sketch follows this list).
    • It uses pragmas, i.e., instructions to the compiler that otherwise look like comments, to specify what part of the computation should be offloaded to the GPU.
    • (warning) A single pair of such pragmas produced a >300x speed-up of the Julia set test case.
    • This requires an additional license that is available on Hydra (not at SAO, though).
  • The PGI compilers also support CUDA FORTRAN (aka CUF).
    • You can write new FORTRAN code or modify existing FORTRAN code to use the GPU, as you can with C/C++ and CUDA.
  • Simple examples are available in /home/hpc/examples/gpu/cuda.
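
As a hedged sketch of what such a pragma looks like (this is not one of the installed examples, and compiler flags may differ slightly), a simple OpenACC-annotated loop in C++ could be:

#include <cstdio>

int main(void) {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // the pragma asks the compiler to offload this loop to the GPU;
    // without OpenACC enabled it is ignored and the loop runs on the CPU
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}

With the PGI compilers such a file is typically built with the -acc flag (e.g., pgc++ -acc); -Minfo=accel reports what the compiler offloaded.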

Local Tools

We have two local GPU-related tools (accessible by loading the tools/local module):

  1. check-gpuse: checks current GPU usage on Hydra;

    hpc@hydra-login% check-gpuse
    hostgroup: @gpu-hosts (3 hosts)
                    - --- memory (GB) ----  -  #GPU - --------- slots/CPUs --------- 
    hostname        -   total   used   resd -  a/u  - nCPU used   load - free unused 
    compute-73-01   -   251.7   36.7  215.0 -  4/0  -   64    0    0.0 -   64   64.0
    compute-79-01   -   125.3   35.0   90.3 -  2/0  -   20    0    0.0 -   20   20.0
    
    
  2. get-gpu-info: queries whether a node has GPUs, and returns the GPUs' properties and which process(es) use(s) them.

It is a simple C-shell wrapper that runs the Python script get-gpu-info.py, which uses pyNVML (the Python bindings to NVML).

(lightbulb) The get-gpu-info wrapper checks whether its first argument is of the form NN-MM; if it is, it runs get-gpu-info.py on compute-NN-MM.

In other words, "get-gpu-info 79-01 0 -d" is equivalent to "ssh -xn compute-79-01 get-gpu-info 0 -d".
 

Here is how to use it:
usage: get-gpu-info.py [-h] [-i] [-d] [-l [LOOP]] [-c COUNTS]
                       [--ntstamp NTSTAMP]
                       [id]
get-gpu-info.py: show info about GPU(s)
positional arguments:
  id                    specify the GPU id, implies --info
optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show info for each GPU
  -d, --details         show details of running process, implies --info
  -l [LOOP], --loop [LOOP]
                        repeat every LOOP [in sec: 10 to 3600], default is 30,
                        implies --info
  -c COUNTS, --counts COUNTS
                        limits the no. of times to loop, implies --info
  --ntstamp NTSTAMP     specify how often to put a time stamp, by default puts
                        one every 10 readings
Ver 1.0/0 Feb 2016/SGK
Examples:
hpc@hydra-login-01% get-gpu-info
0 GPU on hydra-login-01
hpc@hydra-login-01% ssh -xn compute-79-01 get-gpu-info
2 GPUs on 79-01
hpc@hydra-login-01% get-gpu-info 73-01 -d
4 GPUs on 73-01
Tue Mar  8 13:27:45 2016
id           ------ memory ------ ------ bar1 -------- ---- usage ----
  -- name --   used/total           used/total          gpu  mem #proc
0 Tesla_K80  760.0M/11.25G   6.6% 4.562M/16.00G   0.0%   0%   0% 1
    pid=64617 name=./loopjulia2xGpu used_memory=735.9M
1 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
2 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
3 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
hpc@hydra-login01% ssh -xn compute-79-01 get-gpu-info -d
2 GPUs on 79-01
Thu Dec 12 10:02:32 2019
id           ------ memory ------ ------ bar1 -------- ---- usage ----
  -- name --   used/total           used/total          gpu  mem #proc
0 Quadro_GV100 64.00k/31.72G   0.0% 2.566M/256.0M   1.0%   0%   0% 0
1 Quadro_GV100 64.00k/31.72G   0.0% 2.566M/256.0M   1.0%   0%   0% 0

Available Queues

  • Two test queues, one batch and one interactive (uTGPU.tq and qGPU.iq), are available to access the GPUs:
    • the batch queue has no time limit, while the interactive queue has a 24-hour elapsed-time limit
    • you must request a GPU to run in these queues, and you can specify how many GPU cards you want to use:
      qsub -l gpu or qrsh -l gpu
    • these are equivalent to
      qsub -l gpu,ngpu=1 or qrsh -l gpu,ngpu=1
    • If your job will use two GPU cards:
      qsub -l gpu,ngpu=2

    • The current resource limits for GPUs are:
      • 3 GPUs per user in the batch queue,
      • 2 GPUs per user in the interactive queue, and
      • only 1 concurrent interactive job.

    • (warning) These are test queues:
      1. their configuration may change, and
      2. access is limited to authorized users.
        (lightbulb) Email us to be added to the authorized-users list.

Available Examples

Trivial Examples

I wrote a trivial test case: computing a Julia set (fractals) and saving the corresponding image. It is derived from NVIDIA's own example.

You can find that example, and equivalent codes, under /home/hpc/examples/gpu; I wrote:

cuda/        CUDA and C++ code (.cu .cpp Makefile)
cuda/gpu     GPU example
cuda/cpu     CPU equivalent
matlab/      MATLAB, using standalone compiled code (available for now only at SAO)
matlab/gpu   GPU example
matlab/cpu   CPU equivalent
idl/         IDL CPU-only equivalent (for comparison)

Note:

  • I wrote a more sophisticated alternative to that example that achieves a 500:1 speed-up compared to the equivalent computation running on a single CPU.
  • (lightbulb) That is, a 7.5-hour computation is reduced to less than one minute, in a case that is intrinsically fully "parallelizable."
  • It illustrates the potential gain, weighed against the cost of coding with CUDA (an extension of C++).

NVIDIA Examples

  • You can find the NVIDIA examples under /share/apps/nvidia/cuda/samples and under /cm/shared/apps/cuda10.0/sdk.
  • I was able to build them, but not on the login nodes, as they use libraries that are currently only available on the GPU nodes.
  • You can find the resulting builds under /home/hpc/tests/gpu/cuda/samples/.

Timing

  • I use a simple fractal computation (Julia set) to run timings (a simplified CUDA sketch of the kernel follows this list). The computation is simply
    z := z * z + c
  • where z and c are two complex numbers.
  • The assignment is iterated N times and computed on an M x M grid, where z = x + i y and x and y are equispaced values between -1.5 and 1.5.
  • The final value of z is converted to an index iz = 255*exp(-abs(z)), and the index iz is converted to a color triplet using a rainbow color map.
  • The computation is repeated for a range of values of c.
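
As a hedged illustration of the GPU version (a simplified sketch, not the actual code in /home/hpc/examples/gpu; the kernel name, launch parameters, and value of c are made up here), the core of the single-precision CUDA computation could look like:

#include <cstdio>
#include <cuda_runtime.h>

// one thread per grid point: iterate z := z*z + c, N times, then map |z| to an index
__global__ void juliaKernel(unsigned char *iz, int m, int n, float cRe, float cIm) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= m || iy >= m) return;

    // map the pixel to x, y in [-1.5, 1.5]
    float zRe = -1.5f + 3.0f * ix / (m - 1);
    float zIm = -1.5f + 3.0f * iy / (m - 1);

    for (int k = 0; k < n; k++) {               // z := z*z + c, written out with floats
        float re = zRe * zRe - zIm * zIm + cRe;
        float im = 2.0f * zRe * zIm + cIm;
        zRe = re;
        zIm = im;
    }

    // iz = 255*exp(-|z|); the color-map step is done elsewhere
    iz[iy * m + ix] = (unsigned char)(255.0f * expf(-sqrtf(zRe * zRe + zIm * zIm)));
}

int main(void) {
    const int m = 1024, n = 150;                // small grid; the timings below use 4k to 16k
    unsigned char *d_iz = NULL;
    cudaMalloc((void **)&d_iz, (size_t)m * m);
    dim3 block(16, 16), grid((m + 15) / 16, (m + 15) / 16);
    juliaKernel<<<grid, block>>>(d_iz, m, n, -0.8f, 0.156f);
    cudaDeviceSynchronize();
    printf("kernel done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_iz);
    return 0;
}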

  • I wrote various versions using
    • C++ CUDA (a basic one and an optimized one),
    • CUDA FORTRAN,
    • C++ and FORTRAN (CPU-only, as reference),
    • MATLAB (CPU) and MATLAB using the GPU,
    • and IDL (CPU only).
  • In the single-precision versions, z and c are each represented by floats;
  • in the double-precision versions, they are represented by doubles.
  • In the tests I ran for timing purposes I did not save the resulting image.
  • The codes, job files and log files are available in /home/hpc/examples/gpu/timing/

    I ran them on 201 values of c and for N=150; timing units are [HH:]MM:SS[.S] (not yet complete).


                                  ------------ single precision -------------   ------------ double precision -------------
size M x M                        4k x 4k    8k x 8k    16k x 16k  speed ratio   4k x 4k    8k x 8k    16k x 16k  speed ratio   Comments                       Note

GPU cases (GPU specific code):
CUDA                              03:05.0    12:20.8    49:23.6    9.202         -          -          -          -            NVIDIA CUDA - their example
Optimized CUDA                    00:05.1    00:17.6    01:09.3    371.6         00:07.6    00:27.8    01:50.3    238.8        NVIDIA CUDA - optimized
MATLAB GPU                        01:33.2    04:46.0    18:56.4    22.04         02:06.2    07:13.0    -          14.69        MATLAB equiv, using GPU        1

GPU cases, using pragmas (regular code with special directives):
PGI C++ OpenACC                   00:54.3    02:09.4    -          42.30         -          -          -          -            C++ with OpenACC               2,3a
PGI C++ OpenACC (optimized)       00:12.1    00:42.2    02:46.2    155.5         00:13.8    00:52.1    03:27.6    128.5        C++ with OpenACC, optimized    2,3a
PGI C++ OpenACC (myComplex)       10:52:44   -          -          0.043         -          -          -          -            uses myComplex class           2,3b
PGI C++ OpenACC (std::complex<>)  10:52:45   -          -          0.043         -          -          -          -            uses std::complex<float>       2,3c
PGI F90 OpenACC                   00:06.0    00:21.1    01:23.0    311.8         00:09.5    00:34.4    02:18.6    191.4        F90 with OpenACC pragma        2,4
PGI (CUF FORTRAN CUDA)            00:05.1    00:17.3    01:08.2    376.0         00:07.7    00:27.8    01:54.6    234.8        PGI FORTRAN CUDA (optimized)   2,4

CPU only cases (CPU equivalent cases):
C++ (PGI)                         28:13.7    01:55:10   07:30:38   1.000         27:49.2    01:51:15   07:25:02   1.021        C++, PGI compiler (v15.9)
C++ (Intel)                       25:35.6    01:42:21   06:49:39   1.109         25:32.9    01:42:08   06:48:22   1.112        C++, Intel compiler
C++ (gcc)                         28:24.6    01:53:37   07:35:41   0.999         28:36.9    01:54:11   07:36:53   0.994        C++, GNU compiler (v4.9.2)
PGI C++                           28:53.6    01:55:38   -          0.986         28:33.2    01:54:01   -          0.999        (duplicate)                    3a
PGI C++ (myComplex)               28:11.0    01:52:02   -          1.015         -          -          -          -            uses myComplex class           3b
PGI C++ (std::complex<>)          03:53:08   15:29:09   -          0.123         -          -          -          -            uses std::complex<float>       3c
PGI F90                           33:36.8    02:13:50   -          0.850         34:04.9    02:31:51   -          0.793        F90, PGI compiler (v15.9)      4
MATLAB (1 thread)                 01:09:00   04:34:24   -          0.414         01:03:18   04:06:06   -          0.457        MATLAB on 1 CPU                1
MATLAB (multi-threaded)           14:05.7    56:09.6    -          2.027         13:25.4    53:25.9    -          2.129        MATLAB on multiple CPUs        1
IDL (1 thread)                    01:32:13   06:24:22   -          0.303         -          -          -          -            IDL on 1 CPU                   1
IDL (multi-threaded)              12:50.0    01:32:50   -          1.720         -          -          -          -            IDL on multiple CPUs           1

Notes:

  • In the color-coded version of this table, green means ran faster, red slower, grey is the reference (yellow: not run); in the plain-text table above, a "-" marks a case that was not run.
  • Optimized CUDA and CUDA FORTRAN ran 371 and 376 times faster than single-CPU C++.
  1. MATLAB and IDL tests use the run-time environment; for now we have MATLAB licenses (including the compiler) only at SAO.
  2. Access to the OpenACC and FORTRAN CUDA from PGI requires a license upgrade - our license was upgraded in May 2016.
  3. C++ does not support complex arithmetic like FORTRAN does, so I used 3 approaches:
    1. write out the complex arithmetic explicitly using only floating points operations,
    2. use a C++ class myComplex, and let the compiler figure out how to port it to the GPU,
    3. use the C++ std::complex<> class template, and also let the compiler figure out how to port it to the GPU.
    Using OpenACC with C++ is trickier than using it with FORTRAN (a brief sketch of these three approaches follows these notes).
    The Julia set computation generates NaNs; handling them in some of the C++ cases may explain the slowdown.
  4. FORTRAN supports complex arithmetic as a built-in data type COMPLEX.
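
As a hedged illustration of note 3 (the class name myComplex below is a stand-in, not the actual class used in the examples), the same iteration z := z*z + c can be written three ways in C++:

#include <complex>

// (a) explicit float arithmetic: the complex product and sum written out by hand
inline void stepExplicit(float &zRe, float &zIm, float cRe, float cIm) {
    float re = zRe * zRe - zIm * zIm + cRe;
    float im = 2.0f * zRe * zIm + cIm;
    zRe = re;  zIm = im;
}

// (b) a small user-defined complex class the compiler must port to the GPU
struct myComplex {                               // stand-in for the class used in the examples
    float re, im;
    myComplex operator*(const myComplex &o) const {
        return { re * o.re - im * o.im, re * o.im + im * o.re };
    }
    myComplex operator+(const myComplex &o) const { return { re + o.re, im + o.im }; }
};
inline myComplex stepClass(myComplex z, myComplex c) { return z * z + c; }

// (c) the standard library class template
inline std::complex<float> stepStd(std::complex<float> z, std::complex<float> c) {
    return z * z + c;
}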


Last updated   SGK