
[ some features still need to be updated ]


A total of 8 GPUs on 3 nodes are available on Hydra.

  • One node has two dual-GPU cards (NVIDIA K80), for a total of four GPUs
  • Two nodes have two GPU cards each (NVIDIA GV100)
    • each GPU type has the following specifications:

      Type    CUDA cores   Memory   Memory bandwidth
      K80     2,496        12GB     480 GB/s
      GV100   5,120        32GB     870 GB/s

      Note that CUDA cores are not equivalent to CPU cores.

GPU Configuration

  • The GPUs are configured as follows:
    • persistence mode is ENABLED (NVIDIA driver remains loaded even when there are no active clients)
    • compute mode is set to EXCLUSIVE_PROCESS (only one context is allowed per device, usable from multiple threads at a time)
    (warning) This configuration is reset at reboots.
    (see man nvidia-smi, accessible by loading the nvidia/nvsmi module.)

(lightbulb) This means that GPU applications will

  • start faster (no need to re-load the driver), and
  • have exclusive use of a GPU (only one process can use a GPU at a time).

(warning) Only one process per GPU can run at a time; each process gets a different GPU.

(warning) Starting more processes than there are available GPUs will fail with the following error message:

Error: all CUDA-capable devices are busy or unavailable
 in file at line no NNN

Available Tools


The CUDA compiler (v10.0) is available on the login nodes by loading the nvidia module:

% module load cuda10.0
% nvcc -o testGPU testGPU.cu

The examples have been installed under /share/apps/nvidia/cuda/samples and under /cm/shared/apps/cuda10.0/sdk.

CUDA versions 8.0, 9.0, and 9.2 are also available (modules: nvidia/cuda80, nvidia/cuda90, or nvidia/cuda92).


NVSMI: The NVIDIA System Management Interface

  • NVSMI version 418.67 is available.
  • The following tools are available, but only on the nodes with GPUs:
nvidia-cuda-mps-control   NVIDIA CUDA Multi Process Service management program
nvidia-installer          install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set
nvidia-modprobe           load the NVIDIA kernel module and create NVIDIA character device files
nvidia-persistenced       a daemon to maintain persistent software state in the NVIDIA driver
nvidia-settings           configure the NVIDIA graphics driver
nvidia-smi                NVIDIA System Management Interface program
nvidia-xconfig            manipulate X configuration files for the NVIDIA driver
  • Read the corresponding man pages to learn more.
  • The man pages are accessible on the GPU nodes by loading the nvidia module.

Query and Monitor

The most useful tool is nvidia-smi, which allows you to query and monitor the status of the GPU card(s).

Try one of the following commands:
  hpc@compute-79-01% nvidia-smi -l
  hpc@compute-79-01% nvidia-smi dmon -d 30 -s pucm -o DT
  hpc@compute-79-01% nvidia-smi pmon -d 10 -s um -o DT
  hpc@compute-79-01% nvidia-smi \
           --query-compute-apps=timestamp,gpu_uuid,pid,name,used_memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi \
           --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi -q -i 0 --display=MEMORY,UTILIZATION,PIDS

Read the man page (man nvidia-smi) to learn more.


The GPU development kit (GDK), NVIDIA Management Library (NVML) and the python bindings to NVML (pyNVML) are available.


  • The PGI compilers support OpenACC.
    • OpenACC, similarly to OpenMP, instructs a compiler to produce code that will run on the GPU.
    • It uses pragmas, i.e., instructions to the compiler that otherwise look like comments, to specify what part of the computation should be offloaded to the GPU.
    • (warning) A single pair of such pragmas produced a >300x speed-up of the Julia set test case.
    • This requires an additional license that is available on Hydra (though not at SAO).
  • The PGI compilers also support CUDA FORTRAN (aka CUF).
    • You can write new or modify existing FORTRAN code to use the GPU, just as you can with C/C++ and CUDA.
  • Simple examples are available in /home/hpc/examples/gpu/cuda

Local Tools

We have two local GPU-related tools (accessible by loading the tools/local module):

  1. check-gpuse: checks current GPU usage on Hydra;

    hpc@hydra-login% check-gpuse
    hostgroup: @gpu-hosts (3 hosts)
                    - --- memory (GB) ----  -  #GPU - --------- slots/CPUs --------- 
    hostname        -   total   used   resd -  a/u  - nCPU used   load - free unused 
    compute-73-01   -   251.7   36.7  215.0 -  4/0  -   64    0    0.0 -   64   64.0
    compute-79-01   -   125.3   35.0   90.3 -  2/0  -   20    0    0.0 -   20   20.0
  2. get-gpu-info: queries whether a node has GPUs; it returns the GPUs' properties and which process(es) use(s) them.

It is a simple C-shell wrapper that runs a Python script; that Python script uses pyNVML (the Python bindings to NVML).

(lightbulb) The get-gpu-info wrapper checks whether its first argument is of the form NN-MM, and if so runs on compute-NN-MM.

In other words, "get-gpu-info 79-01 0 -d" is equivalent to "ssh -xn compute-79-01 get-gpu-info 0 -d".

Here is how to use it:
usage: get-gpu-info [-h] [-i] [-d] [-l [LOOP]] [-c COUNTS]
                    [--ntstamp NTSTAMP] [id]

show info about GPU(s)
positional arguments:
  id                    specify the GPU id, implies --info
optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show info for each GPU
  -d, --details         show details of running process, implies --info
  -l [LOOP], --loop [LOOP]
                        repeat every LOOP [in sec: 10 to 3600], default is 30,
                        implies --info
  -c COUNTS, --counts COUNTS
                        limits the no. of times to loop, implies --info
  --ntstamp NTSTAMP     specify how often to put a time stamp, by default puts
                        one every 10 readings
Ver 1.0/0 Feb 2016/SGK
hpc@hydra-login-01% get-gpu-info
0 GPU on hydra-login-01
hpc@hydra-login-01% ssh -xn compute-79-01 get-gpu-info
2 GPUs on 79-01
hpc@hydra-login-01% get-gpu-info 73-01 -d
4 GPUs on 73-01
Tue Mar  8 13:27:45 2016
id           ------ memory ------ ------ bar1 -------- ---- usage ----
  -- name --   used/total           used/total          gpu  mem #proc
0 Tesla_K80  760.0M/11.25G   6.6% 4.562M/16.00G   0.0%   0%   0% 1
    pid=64617 name=./loopjulia2xGpu used_memory=735.9M
1 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
2 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
3 Tesla_K80  22.80M/11.25G   0.2% 2.562M/16.00G   0.0%   0%   0% 0
hpc@hydra-login01% ssh -xn compute-79-01 get-gpu-info -d
2 GPUs on 79-01
Thu Dec 12 10:02:32 2019
id           ------ memory ------ ------ bar1 -------- ---- usage ----
  -- name --   used/total           used/total          gpu  mem #proc
0 Quadro_GV100 64.00k/31.72G   0.0% 2.566M/256.0M   1.0%   0%   0% 0
1 Quadro_GV100 64.00k/31.72G   0.0% 2.566M/256.0M   1.0%   0%   0% 0

Available Queues

  • Two test queues, a batch queue (uTGPU.tq) and an interactive queue, are available to access the GPUs:
    • the batch queue has no time limit; the interactive queue has a 24hr elapsed-time limit
    • you must request a GPU to run in these queues, and you can specify how many GPU cards you want to use:
      qsub -l gpu or qrsh -l gpu
    • these are equivalent to
      qsub -l gpu,ngpu=1 or qrsh -l gpu,ngpu=1
    • If your job will use two GPU cards:
      qsub -l gpu,ngpu=2

    • The current resource limits for GPUs are:
      • 3 GPUs per user in the batch queue,
      • 2 GPUs per user in the interactive queue, and
      • only 1 concurrent interactive job.

    • (warning) These are test queues:
      1. their configuration may change, and
      2. access is limited to authorized users.
        (grey lightbulb) Email us to be added to the authorized users list.

Available Examples

Trivial Examples

I wrote a trivial test case: computing a Julia set (fractals) and saving the corresponding image. It is derived from NVIDIA's own example.

You can find that example, and equivalent codes, under /home/hpc/examples/gpu. I wrote:

cuda/        CUDA and C++ code (.cu .cpp Makefile)
cuda/gpu     GPU example
cuda/cpu     CPU equivalent
matlab/      MATLAB, using standalone compiled code (available for now only at SAO)
matlab/gpu   GPU example
matlab/cpu   CPU equivalent
idl/         IDL CPU-only equivalent (for comparison)


  • I wrote a more sophisticated alternative to that example, which achieves a 500:1 speed-up compared to the equivalent computation running on a single CPU.
  • (lightbulb) That reduces a 7.5-hour-long computation to less than 1 minute, in a case that is intrinsically fully "parallelizable."
  • It illustrates the potential gain, compared to the cost of coding using CUDA (an extension of C++).

NVIDIA Examples

  • You can find the NVIDIA examples under /share/apps/nvidia/cuda/samples and under /cm/shared/apps/cuda10.0/sdk.
  • I was able to build them, but not on the login nodes, as they use libraries that are currently only available on the GPU nodes.
  • You can find this under  /home/hpc/tests/gpu/cuda/samples/.


  • I use a simple fractal computation (Julia set) to run timings. The computation is simply
    z = z * z + c
  • where z and c are two complex numbers.
  • The assignment is iterated N times and computed on an M x M grid, where z = x + i y, and x and y are equispaced values between -1.5 and 1.5.
  • The final value of z is converted to an index iz = 255*exp(-abs(z)), and the index iz is converted to a color triplet using a rainbow color map.
  • The computation is repeated for a range of values of c.

  • I wrote various versions using
    • C++ CUDA (a basic one, an optimized one),
    • C++ and FORTRAN (CPU-only as reference),
    • MATLAB (CPU) and MATLAB using GPU,
    • and  IDL (CPU only).
  • In the single-precision versions, z and c are each represented by floats;
  • in the double-precision versions, they are represented by doubles.
  • In the tests I ran for timing purposes I did not save the resulting image.
  • The codes, job files and log files are available in /home/hpc/examples/gpu/timing/
    I ran them on 201 values of c and for N=150; timing units are [HH:]MM:SS[.S] (not yet complete).

Single-precision timings at three grid sizes, with the speed ratio relative to the reference single-CPU C++ case; the "d.p." column is the speed ratio for the double-precision runs (double-precision timings are not yet complete, so only the ratios are listed). Footnote markers [1]-[4] refer to the notes below.

                                                                 speed ratio
case                              4k x 4k   8k x 8k   16k x 16k   s.p.    d.p.   notes

GPU cases (GPU-specific code):
  NVIDIA CUDA - their example        -         -         -          -       -
  Optimized CUDA                  00:05.1   00:17.6   01:09.3    371.6   238.8   NVIDIA CUDA - optimized
  MATLAB GPU                      01:33.2   04:46.0   18:56.4     22.04   14.69  MATLAB equiv., using GPU [1]

GPU cases, using pragmas (regular code with special directives):
  PGI C++ OpenACC                 00:54.3   02:09.4      -          -       -    C++ with OpenACC [2,3a]
  PGI C++ OpenACC (optimized)     00:12.1   00:42.2   02:46.2    155.5   128.5   C++ with OpenACC, optimized [2,3a]
  PGI C++ OpenACC (myComplex)     10:52:44     -          -         -       -    uses myComplex class [2,3b]
  PGI C++ OpenACC (std::complex<>) 10:52:45    -          -         -       -    uses std::complex<float> [2,3c]
  PGI F90 OpenACC                 00:06.0   00:21.1   01:23.0    311.8   191.4   F90 with OpenACC pragma [2,4]
  PGI CUDA FORTRAN (CUF)          00:05.1   00:17.3   01:08.2    376.0   234.8   PGI FORTRAN CUDA (optimized) [2,4]

CPU-only cases (CPU equivalents):
  C++ (PGI)                       28:13.7   01:55:10  07:30:38     1.000   1.021  C++, PGI compiler (v15.9)
  C++ (Intel)                     25:35.6   01:42:21  06:49:39     1.109   1.112  C++, Intel compiler
  C++ (gcc)                       28:24.6   01:53:37  07:35:41     0.999   0.994  C++, GNU compiler (v4.9.2)
  PGI C++                         28:53.6   01:55:38     -           -       -
  PGI C++ (myComplex)             28:11.0   01:52:02     -           -       -    uses myComplex class [3b]
  PGI C++ (std::complex<>)        03:53:08  15:29:09     -           -       -    uses std::complex<float> [3c]
  PGI F90                         33:36.8   02:13:50     -           -     0.793  F90, PGI compiler (v15.9) [4]
  MATLAB (1 thread)               01:09:00  04:34:24     -           -     0.457  MATLAB on 1 CPU [1]
  MATLAB (multi-threaded)         14:05.7   56:09.6      -           -     2.129  MATLAB on multiple CPUs [1]
  IDL (1 thread)                  01:32:13  06:24:22     -           -       -    IDL on 1 CPU [1]
  IDL (multi-threaded)            12:50.0   01:32:50     -           -       -    IDL on multiple CPUs [1]


  • In the original color-coded table, green means ran faster, red slower, grey is the reference (yellow: not run).
  • Optimized CUDA and CUDA FORTRAN ran 371 and 376 times faster than the single-CPU C++ version.
  1. MATLAB and IDL tests use the run-time environment; for now we only have MATLAB licenses at SAO (including the compiler).
  2. Access to the OpenACC and FORTRAN CUDA from PGI requires a license upgrade - our license was upgraded in May 2016.
  3. C++ does not support complex arithmetic natively like FORTRAN does, so I used three approaches:
    a. write out the complex arithmetic explicitly, using only floating-point operations,
    b. use a C++ class myComplex, and let the compiler figure out how to port it to the GPU,
    c. use the C++ std::complex<> class template, and also let the compiler figure out how to port it to the GPU.
    Using OpenACC with C++ is trickier than using it with FORTRAN.
    The Julia set computation generates NaNs; handling these in some of the C++ cases may explain the slow-down.
  4. FORTRAN supports complex arithmetic as a built-in data type COMPLEX.

Last updated   SGK