A total of 8 GPUs on 3 nodes are available on Hydra.

  • One node has four GPU cards (NVIDIA L40S)
  • Two nodes have two GPU cards (NVIDIA GV100)
    • each GPU corresponds to:

      Type    CUDA Cores   Memory   Mem b/w
      L40S    18,176       48GB     864 GB/s
      GV100   5,120        32GB     870 GB/s

      Note that CUDA cores are not like CPU cores.


  • The GPUs are configured as follows:
    • The current driver is CUDA 12.4 (Driver Version 550.54.15)
    • persistence mode is ENABLED (NVIDIA driver remains loaded even when there are no active clients)
    • compute mode is set to EXCLUSIVE_PROCESS (only one context is allowed per device, usable from multiple threads at a time)
    • (warning) This configuration is reset at reboots
      (see man nvidia-smi, accessible on the GPU nodes after loading the cuda10.2 module).

(lightbulb) This means that GPU applications will


(warning) Starting one more process than available GPUs will fail with the following error message:

Error: all CUDA-capable devices are busy or unavailable
 in file at line no NNN

3. Available Tools


  • The CUDA compiler is now part of the NVIDIA compilers, and is accessible by loading the nvidia module:

% module load nvidia

which loads NVIDIA 23.9 and CUDA by default.


  • Other versions are available; check with

% module whatis nvidia

The CUDA 10.2 SDK is also available (module whatis cuda10.2).

NVSMI: The NVIDIA System Management Interface

  • NVSMI v550.54.15 and dcgm v 3.35 are available on the GPU nodes.
  • The following tools are available, but only on the nodes with GPUs, and by loading the cuda10.2 module:

    nvidia-cuda-mps-control   NVIDIA CUDA Multi Process Service management program
    nvidia-installer          install, upgrade, or uninstall the NVIDIA Accelerated Graphics Driver Set
    nvidia-modprobe           load the NVIDIA kernel module and create NVIDIA character device files
    nvidia-persistenced       a daemon to maintain persistent software state in the NVIDIA driver
    nvidia-settings           configure the NVIDIA graphics driver
    nvidia-smi                NVIDIA System Management Interface program
    nvidia-xconfig            manipulate X configuration files for the NVIDIA driver

  • Read the corresponding man pages to learn more; load the cuda-dcgm module to access dcgm.
  • The man pages are accessible on the GPU nodes by loading the cuda10.2 module.

Query and Monitor

  • The most useful tool is nvidia-smi; it allows you to query and monitor the status of the GPU card(s).
    Try one of the following commands:
  hpc@compute-79-01% nvidia-smi -l
  hpc@compute-79-01% nvidia-smi dmon -d 30 -s pucm -o DT
  hpc@compute-79-01% nvidia-smi pmon -d 10 -s um -o DT
  hpc@compute-79-01% nvidia-smi \
           --query-compute-apps=timestamp,gpu_uuid,pid,name,used_memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi \
           --query-gpu=name,serial,index,memory.used,utilization.gpu,utilization.memory \
           --format=csv,nounits -l 15
  hpc@compute-79-01% nvidia-smi -q -i 0 --display=MEMORY,UTILIZATION,PIDS


  • Read the man page (man nvidia-smi)
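
The CSV output of the --query-gpu form shown above lends itself to post-processing. The following is a minimal Python sketch, not one of the Hydra tools: it assumes the noheader option is added to --format so that only data rows are printed, and the names parse_gpu_csv and query_gpus, as well as the sample line, are made up for illustration.

```python
import csv
import io
import subprocess

# Same field list as in the --query-gpu command above.
FIELDS = ["name", "serial", "index", "memory.used",
          "utilization.gpu", "utilization.memory"]


def parse_gpu_csv(text, fields=FIELDS):
    """Turn header-less nvidia-smi CSV rows into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:  # skip blank lines
            continue
        rows.append(dict(zip(fields, (value.strip() for value in rec))))
    return rows


def query_gpus(fields=FIELDS):
    """Run nvidia-smi once (GPU nodes only) and parse what it prints."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(fields),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(out.stdout, fields)


# Hypothetical sample line, for illustration only (not real Hydra output).
sample = "Quadro GV100, 0123456789, 0, 276, 0, 0\n"
gpus = parse_gpu_csv(sample)
```

On a GPU node, query_gpus() would return one dictionary per GPU; the parser itself can be tried anywhere since it only works on text.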


The GPU development kit (GDK), the NVIDIA Management Library (NVML), and the Python bindings to NVML (pyNVML) are available.

The GDK version,

The NVML documentation is available at NVIDIA's web site.

pyNVML 7.352.0 is available via the nvidia/pynvlm module, and its documentation is online.


  • The NVIDIA compilers support OpenACC.
  • OpenACC, similarly to OpenMP, instructs a compiler to produce code that will run on the GPU.
  • It uses pragmas, i.e., instructions to the compiler that otherwise look like comments, to specify what part of the computation should be offloaded to the GPU.


(lightbulb)  A single pair of such pragmas produced a >300x speed up of the Julia set test case.

  • This requires an additional license that is available on Hydra (but not at SAO).
  • The NVIDIA compilers also support CUDA FORTRAN (aka CUF).
  • You can write or modify existing FORTRAN code to use the GPU, as you can with C/C++ and CUDA.
  • Simple examples are available in /home/hpc/examples/gpu/cuda

Local Tool

  • We provide two local GPU-related tools (accessible by loading the tools/local-admin module):
    • check-gpuse: checks current GPU usage:
hpc@hydra-login% check-gpuse
hostgroup: @gpu-hosts (3 hosts)
                - --- memory (GB) ----  -  #GPU - --------- slots/CPUs ---------
hostname        -   total   used   resd -  a/u  - nCPU used   load - free unused
compute-50-01   -       …      …     …7 -  4/0  -   64    0    0.0 -   64   64.0
compute-79-01   -   125.…      …  107.3 -  2/0  -   20    0    0.0 -   20   20.0
compute-79-02   -   125.5   46.2   79.3 -  2/1  -   20    1    2.1 -   19   17.9

Total #GPU=8 used=1 (12.5%)
    • get-gpu-info: queries whether a node has a GPU, returns the GPU(s) properties and which process(es) use(s) the GPUs.

It is a simple C-shell wrapper that runs a Python script, which uses pyNVML (the Python bindings to NVML).

(lightbulb) the get-gpu-info wrapper checks if the first argument is in the form NN-MM, and if it is will run on compute-NN-MM

in other words, "get-gpu-info 79-01 0 -d" is equivalent to "ssh -xn compute-79-01 get-gpu-info 0 -d"
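
The argument rewriting described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the actual wrapper (which is written in C-shell); the function name build_command is made up, and the sketch assumes that NN and MM are purely numeric.

```python
import re

# The NN-MM form, e.g. 79-01 (assumed here to be digits only).
NODE_RE = re.compile(r"^\d+-\d+$")


def build_command(args):
    """Return the command line a get-gpu-info style wrapper would end up running."""
    if args and NODE_RE.match(args[0]):
        # First argument names a node: run remotely on compute-NN-MM.
        node = "compute-" + args[0]
        return ["ssh", "-xn", node, "get-gpu-info"] + list(args[1:])
    # Otherwise run locally with the arguments unchanged.
    return ["get-gpu-info"] + list(args)


remote = build_command(["79-01", "0", "-d"])
local = build_command(["0", "-d"])
```

Here remote holds the ssh form quoted above, while local leaves the arguments untouched because the first one is not of the NN-MM form.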

Here is how to use it:

usage: get-gpu-info.py3 [-h] [-i] [-d] [-l [LOOP]] [-c COUNTS]
                        [--ntstamp NTSTAMP]

get-gpu-info.py3: show info about GPU(s)

positional arguments:
  id                    specify the GPU id, implies --info

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show info for each GPU
  -d, --details         show details of running process, implies --info
  -l [LOOP], --loop [LOOP]
                        repeat every LOOP [in sec: 10 to 3600], default is 30,
                        implies --info
  -c COUNTS, --counts COUNTS
                        limits the no. of times to loop, implies --info
  --ntstamp NTSTAMP     specify how often to put a time stamp, by default puts
                        one every 10 readings

Ver 1.1/0 Oct 2021/SGK

For example:

hpc@hydra-login-01% get-gpu-info
0 GPU on hydra-login-01

hpc@login01% get-gpu-info 50-01 -d
4 GPUs on 50-01
Thu May 16 15:30:06 2024
id               ------ memory ------  ------ bar1 --------  ---- usage ----
   --- name ---    used/total            used/total           gpu  mem #proc
0  NVIDIA_L40S   479.1M/44.99G   1.0%  1.688M/64.00G   0.0%    0%   0% 0
1  NVIDIA_L40S   479.1M/44.99G   1.0%  1.688M/64.00G   0.0%    0%   0% 0
2  NVIDIA_L40S   479.1M/44.99G   1.0%  1.688M/64.00G   0.0%    0%   0% 0
3  NVIDIA_L40S   479.1M/44.99G   1.0%  1.688M/64.00G   0.0%    0%   0% 0

hpc@login01% get-gpu-info 79-01 -d
2 GPUs on 79-01
Thu May 16 15:30:12 2024
id               ------ memory ------  ------ bar1 --------  ---- usage ----
   --- name ---    used/total            used/total           gpu  mem #proc
0  Quadro_GV100  276.5M/32.00G   0.8%  2.688M/256.0M   1.0%    0%   0% 0
1  Quadro_GV100  276.5M/32.00G   0.8%  2.688M/256.0M   1.0%    0%   0% 0

hpc@login01% get-gpu-info 79-02 -d
2 GPUs on 79-02
Thu May 16 15:30:19 2024
id               ------ memory ------  ------ bar1 --------  ---- usage ----
   --- name ---    used/total            used/total           gpu  mem #proc
0  Quadro_GV100  6.212G/32.00G  19.4%  5.188M/256.0M   2.0%    0%   0% 1
    pid=494189 name=b'python3' used_memory=5.939G
1  Quadro_GV100  279.1M/32.00G   0.9%  3.188M/256.0M   1.2%    0%   0% 0

4. Available Queues

  • Four queues are available to access the GPUs:
    • an interactive queue, which has a 24hr elapsed time limit, and
    • three batch queues, sTgpu.q, mTgpu.q & lTgpu.q, corresponding to short, medium and long time limits.
  • There are also limits on (CPU) memory usage and on how many concurrent GPUs can be used; see the Available Queues page.


  • You must request to use a GPU to run in these queues, as follows:


  • If your job will use two GPUs, use:

% qsub -l gpu,ngpu=2

  • You can specify what type of GPU to use with

-l gpu,gpu_arch=L40S


Currently we impose the following resource limits for GPUs:



5. Examples


You can find that example, and equivalent codes, under /home/hpc/examples/gpu:

cuda/         CUDA and C++ code (.cu .cpp Makefile)
cuda/gpu      GPU example
cuda/cpu      CPU equivalent
matlab/       MATLAB, using standalone compiled code (available for now only at SAO)
matlab/gpu    GPU example
matlab/cpu    CPU equivalent
idl/          IDL CPU-only equivalent (for comparison)


I wrote a more sophisticated alternative to that example, which achieves a 500:1 speed-up compared to the equivalent computation running on a single CPU.

(lightbulb) That's reducing a 7.5 hour long computation to less than 1 minute, in a case that is intrinsically fully "parallelizable."

It illustrates the potential gain, compared to the cost of coding using CUDA (an extension of C++).

NVIDIA Examples

You can find the NVIDIA examples under /share/apps/nvidia/cuda/samples and under /cm/shared/apps/cuda10.0/sdk.

I was able to build them, but not on the login nodes, as they use libraries that are currently only available on the GPU nodes.

You can find this under /home/hpc/tests/gpu/cuda/samples/.


[This section is old and these tests have yet to be re-run]

I use a simple fractal computation (Julia set) to run timings. The computation is simply
z := z * z + c

where z and c are two complex numbers.

The assignment is iterated N times and computed on an M x M grid, where the starting value is z = x + i y, and x and y are equispaced values between -1.5 and 1.5.

The final value of z is converted to an index iz = 255*exp(-abs(z)), and the index iz is converted to a color triplet using a rainbow color map.

The computation is repeated for a range of values of c.
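
As an illustration of the computation just described, here is a minimal, unoptimized Python sketch; it is not one of the timed codes. The function names julia_index and julia_grid and the 1e100 cut-off are choices made for this sketch, and the rainbow color map step is left out.

```python
import math


def julia_index(x, y, c, n_iter=150):
    """Iterate z := z*z + c n_iter times from z = x + iy, return 255*exp(-|z|)."""
    z = complex(x, y)
    for _ in range(n_iter):
        z = z * z + c
        # Stop once z has blown up (or turned into NaN): the index would be
        # 0 anyway, since exp(-|z|) underflows. The 1e100 bound is arbitrary.
        if not (abs(z.real) < 1e100 and abs(z.imag) < 1e100):
            return 0.0
    return 255.0 * math.exp(-abs(z))


def julia_grid(m, c, n_iter=150):
    """Evaluate the index on an m x m grid with x and y in [-1.5, 1.5]."""
    step = 3.0 / (m - 1)
    coords = [-1.5 + i * step for i in range(m)]
    return [[int(julia_index(x, y, c, n_iter)) for x in coords] for y in coords]


# Tiny 5 x 5 grid for c = 0, where the outcome is easy to reason about:
# points inside the unit circle collapse to 0 (index 255), points outside diverge.
grid = julia_grid(5, 0j)
```

Every grid point is independent of the others, which is why this computation maps so well onto a GPU.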

I wrote various versions using C++ CUDA (a basic one and an optimized one), C++ and FORTRAN (CPU-only, as reference), and IDL (CPU-only).

In the single precision versions, z and c are each represented by floats; in the double precision versions, they are represented by doubles.

In the tests I ran for timing purposes, I did not save the resulting image.

The codes, job files and log files are available in /home/hpc/examples/gpu/timing/

I ran them on 201 values of c and for N=150; timing units are [HH:]MM:SS[.S] (not yet complete).

                                  ------ single precision ------  speed    double
size M x M                        4k x 4k   8k x 8k   16k x 16k   ratio    precision  description                     notes
GPU cases:                                                                            GPU specific code
                                                                                      NVIDIA CUDA - their example
Optimized CUDA                    00:05.1   00:17.6   01:09.3     371.6    238.8      NVIDIA CUDA - optimized
MATLAB GPU                        01:33.2   04:46.0   18:56.4     22.04    14.69      MATLAB equiv, using GPU         1
GPU cases, using pragmas:                                                             Regular code with special directives
PGI C++ OpenACC                   00:54.3   02:09.4                                   C++ with OpenACC                2,3a
PGI C++ OpenACC (optimized)       00:12.1   00:42.2   02:46.2     155.5    128.5      C++ with OpenACC, optimized     2,3a
PGI C++ OpenACC (myComplex)       10:52:44                                            uses myComplex class            2,3b
PGI C++ OpenACC (std::complex<>)  10:52:45                                            uses std::complex<float>        2,3c
PGI F90 OpenACC                   00:06.0   00:21.1   01:23.0     311.8    191.4      F90 with OpenACC pragma         2,4
PGI (CUF Fortran CUDA)            00:05.1   00:17.3   01:08.2     376.0    234.8      PGI FORTRAN CUDA (optimized)    2,4
CPU only cases:                                                                       CPU equivalent cases
C++ (PGI)                         28:13.7   01:55:10  07:30:38    1.000    1.021      C++, PGI compiler (v15.9)
C++ (Intel)                       25:35.6   01:42:21  06:49:39    1.109    1.112      C++, Intel compiler
C++ (gcc)                         28:24.6   01:53:37  07:35:41    0.999    0.994      C++, GNU compiler (v4.9.2)
PGI C++                           28:53.6   01:55:38
PGI C++ (myComplex)               28:11.0   01:52:02                                  uses myComplex class            3b
PGI C++ (std::complex<>)          03:53:08  15:29:09                                  uses std::complex<float>        3c
PGI F90                           33:36.8   02:13:50                       0.793      F90, PGI compiler (v15.9)       4
MATLAB (1 thread)                 01:09:00  04:34:24                       0.457      MATLAB on 1 CPU                 1
MATLAB (multi-threaded)           14:05.7   56:09.6                        2.129      MATLAB on multiple CPUs         1
IDL (1 thread)                    01:32:13  06:24:22                                  IDL on 1 CPU                    1
IDL (multi-threaded)              12:50.0   01:32:50                                  IDL on multiple CPUs            1



Green means ran faster, red slower, grey is the reference (yellow: not run).

Optimized CUDA and CUDA FORTRAN ran 371 and 376 times faster, respectively, than single CPU C++.
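
The quoted single precision speed ratios can be reproduced from the timings. The averaging rule used below (the mean of the three per-size ratios against the C++ (PGI) reference row) is not stated in the text; it is inferred from the numbers, which it matches to the printed precision. The function names are made up for this Python sketch.

```python
def to_seconds(stamp):
    """Convert a [HH:]MM:SS[.S] string to seconds."""
    secs = 0.0
    for part in stamp.split(":"):
        secs = secs * 60.0 + float(part)
    return secs


def mean_speed_ratio(reference, case):
    """Average, over the grid sizes, of reference time / case time."""
    ratios = [to_seconds(r) / to_seconds(t) for r, t in zip(reference, case)]
    return sum(ratios) / len(ratios)


# Single precision timings for 4k x 4k, 8k x 8k and 16k x 16k, from the table.
cpu_cpp = ["28:13.7", "01:55:10", "07:30:38"]   # C++ (PGI), the reference
cuda_opt = ["00:05.1", "00:17.6", "01:09.3"]    # Optimized CUDA
cuda_cuf = ["00:05.1", "00:17.3", "01:08.2"]    # PGI (CUF Fortran CUDA)

ratio_cuda = mean_speed_ratio(cpu_cpp, cuda_opt)
ratio_cuf = mean_speed_ratio(cpu_cpp, cuda_cuf)
```

With these inputs ratio_cuda and ratio_cuf come out within 0.1 of the 371.6 and 376.0 listed in the table.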

  1. MATLAB and IDL tests use the run-time environment; for now we only have MATLAB licenses at SAO (including the compiler).
  2. Access to OpenACC and FORTRAN CUDA from PGI requires a license upgrade; our license was upgraded in May 2016.
  3. C++ does not support complex arithmetic like FORTRAN does, so I used 3 approaches:
     a. write out the complex arithmetic explicitly, using only floating point operations,
     b. use a C++ class myComplex, and let the compiler figure out how to port it to the GPU,
     c. use the C++ std::complex<> class template, and also let the compiler figure out how to port it to the GPU.
     Using OpenACC with C++ is trickier than using it with FORTRAN.
     The Julia set computation generates NaNs; handling these in some of the C++ cases may explain the slowdown.
  4. FORTRAN supports complex arithmetic as a built-in data type, COMPLEX.
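
The first of the three C++ approaches, writing out the complex arithmetic explicitly, amounts to expanding z*z + c into its real and imaginary parts. Below is a minimal Python sketch of that expansion, checked against Python's built-in complex type; the function names are made up here, and the timed codes are C++, not Python.

```python
def step_explicit(zr, zi, cr, ci):
    """One z := z*z + c update using real arithmetic only.

    (zr + i zi)^2 = (zr*zr - zi*zi) + i (2*zr*zi)
    """
    return zr * zr - zi * zi + cr, 2.0 * zr * zi + ci


def step_complex(z, c):
    """The same update using a complex data type."""
    return z * z + c


# Compare both forms over a few iterations from an arbitrary starting point.
zr, zi = 0.3, 0.4
cr, ci = -0.8, 0.156
z, c = complex(zr, zi), complex(cr, ci)
for _ in range(5):
    zr, zi = step_explicit(zr, zi, cr, ci)
    z = step_complex(z, c)
max_diff = max(abs(zr - z.real), abs(zi - z.imag))
```

The explicit form needs only multiplications and additions on plain floats, which is what makes it straightforward for a compiler to offload.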


Last updated  SGK