- Introduction
- Matrix of Queues
- How to Specify a Queue
- Interactive Queue
- Workflow Manager Queue
- Memory Reservation
- Host Groups
- CPU architecture
- Queue Selection Validation and/or Verification
- Hardware Limits
1. Introduction
Every job running on the cluster is started in a queue.
- The GE will select a queue based on the resources requested and the current usage in each queue.
- If you don't specify the right queue or the right resource(s), your job will either
- not get queued,
- wait forever and never run, or
- start and get killed when it exceeds one of the limits of the queue it ran in.
All jobs run in batch mode unless you use the interactive queue, and the default GE queue (all.q) is not available.
2. Matrix of Queues
The set of available queues is a matrix of queues:
- Five sets of queues:
- a high-CPU set,
- a high-memory set, complemented by
- a very-high-memory restricted queue,
- a GPU set,
- an interactive queue, and
- a special I/O queue.
- The high-CPU, high-memory and GPU sets of queues come with different time limits: short, medium and long, plus unlimited for the high-CPU and high-memory sets.
  - the high-CPU queues are for serial or parallel jobs that do not need a lot of memory (less than 8GB per CPU),
  - the high-memory queues are for serial or multi-threaded parallel jobs that require a lot of memory (more than 8GB per CPU, but limited to 450GB),
  - the very-high-memory queue is reserved for jobs that need a very large amount of memory (over 450GB),
  - the interactive queues let you run interactively on a compute node (without or with a GPU), although they also have limits, and
  - a special queue is reserved for projects that need special resources (I/Os).
The list of queues and their characteristics (time versus memory limits) is:

| Memory: resident / virtual | short (T<7h/14h) | medium (T<6d/12d) | long (T<30d/60d) | unlimited | Available parallel environments | Type of jobs |
|---|---|---|---|---|---|---|
| 8GB / 64GB | sThC.q | mThC.q | lThC.q | uThC.q | | serial or parallel jobs that need less than 8GB of memory per CPU |
| 450GB / 900GB | sThM.q | mThM.q | lThM.q | uThM.q | mthread | serial or multi-threaded, 8GB < memory needed < 450GB |
| 2TB / 2TB | | | | uTxlM.rq | mthread | serial or multi-threaded, memory needed > 450GB, restricted |
| 64GB / 128GB | sTgpu.q | mTgpu.q | lTgpu.q | | mthread | queues to access GPUs |
| 8GB / 64GB | qrsh.iq (T<12h/24h) | | | | mthread | interactive queue, use qrsh or qlogin; 12h of CPU, 24h of wallclock |
| 64GB / 65GB | qgpu.iq (T<12h/24h) | | | | mthread | interactive queue to access GPUs, restricted; 12h of CPU, 24h of wallclock |
| 8GB / 64GB | | | lTIO.sq (T<12h/72h) | | mthread | I/O queue, to access /store |
| 8GB / 64GB | | | lTWFM.sq | | mthread | workflow manager queue, to run a job that will submit jobs (6d/30d CPU/wallclock) |
Notes
- the listed time limits are the soft CPU limit (s_cpu) and the soft elapsed time (s_rt),
  - the soft elapsed time limit (aka real time, s_rt) is twice the soft CPU limit for all the queues,
  - the hard time limits (h_cpu and h_rt) are 15m longer than the soft ones.
  - Namely, in the short-time-limit queues,
    - a serial job is warned if it has exceeded 7 hours of consumed CPU, and killed after it has consumed 7 hours and 15 minutes of CPU, or
    - warned if it has spent 14 hours in the queue and killed after spending 14 hours and 15 minutes in the queue.
  - For parallel jobs the consumed CPU time is scaled by the number of allocated slots; the elapsed time is not (see the worked example after these notes).
- memory limits are per CPU,
  - so a parallel job in a high-CPU queue can use up to NSLOTS x 8 GB, where NSLOTS is the number of allocated slots (CPUs),
  - parallel jobs in the other queues are limited to multi-threaded jobs (no multi-node jobs),
  - the limit on the virtual memory (vmem) is set to be higher than the resident memory (rss).
- memory usage is also limited by the available memory on a given node.
- If you believe that you need access to a restricted or test queue, contact us.
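For instance, here is a hedged sketch of the CPU-time arithmetic for a parallel job in the short high-CPU queue (the job file name and the choice of parallel environment are placeholders; use a parallel environment that is valid for that queue):

```
# illustrative arithmetic for a 10-slot parallel job in sThC.q:
#   soft CPU limit     : 10 slots x 7h = 70h of total consumed CPU
#   soft elapsed limit : 14h of wallclock, regardless of the number of slots
% qsub -q sThC.q -pe mthread 10 my_parallel.job
```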
3. How to Specify a Queue
- By default jobs are most likely to run in the short high-CPU queue (sThC.q).
- To select a different queue you need to either
  - specify the name of the queue (via -q <name>), or
  - pass a requested time limit (via -l s_cpu=<value> or -l s_rt=<value>).
- To use a high-memory queue, you need to
  - specify the memory requirement (with -l mres=X,h_data=X,h_vmem=X),
  - confirm that you need a high-memory queue (-l himem), and
  - select the time limit either by
    - specifying the queue name (via -q <name>), or
    - passing a requested time limit (via -l s_cpu=<value> or -l s_rt=<value>).
- To use a GPU queue, you need to specify -l gpu.
- To use the unlimited queues, i.e., uThC.q or uThM.q, you need to confirm that you request a low-priority queue (-l lopri).
Why do I need to add -l himem, -l gpu or -l lopri?
- This prevents the GridEngine from submitting a job that did not request the associated resources to one of these queues, just because that queue happens to be less used.
  It prevents the scheduler from "wasting" valuable resources.
Examples
| qsub flags | Meaning of the request |
|---|---|
| -l s_cpu=48:00:00 | 48 hours of consumed CPU (per slot) |
| -l s_rt=200:00:00 | 200 hours of elapsed time |
| -q mThC.q | use the mThC.q queue |
| -l mres=120G,h_data=12G,h_vmem=12G -pe mthread 10 | 12GB of memory (per CPU); a 10-CPU parallel job will reserve 120GB |
| -q mThM.q -l mres=12G,h_data=12G,h_vmem=12G,himem | run in the medium-time high-memory queue; this is a correct, i.e., complete, specification (memory use specs and himem) |
| -q uThC.q -l lopri | run in the unlimited high-CPU queue; note the -l lopri |
| -q uTxlM.rq -l himem | unlimited-time, extra-large-memory queue, restricted to a subset of users |
All jobs that use more than 2GB of memory (per CPU) should include a memory reservation and requirement with -l mres=X,h_data=X,h_vmem=X (a sample job file illustrating these flags follows the notes below).
- If you do not, your job(s) may not be able to grab the memory they need at run-time and crash, or crash the node.
- Memory reservation is for serial and mthread jobs, not MPI.
  - MPI jobs can specify the h_data=X and h_vmem=X resources.
- X is a number followed by a unit, like 100M or 10G; if you specify h_vmem=5 (no unit) your job can only use 5 bytes and will die right away.
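Here is a minimal sketch of a job file that embeds these flags as directives (the file name, job name, executable and memory values are illustrative, not prescribed):

```
# sample job file, submitted with: qsub himem_demo.job
#$ -q mThM.q
#$ -l mres=40G,h_data=10G,h_vmem=10G,himem
#$ -pe mthread 4
#$ -cwd -j y -N himem_demo -o himem_demo.log
#
./crunch
#
```

Here mres is the total reservation for the job (4 slots x 10G), while h_data and h_vmem are per CPU (see Memory Reservation below).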
4. Interactive Queue
You can start an interactive session on a compute node using the command qrsh or qlogin (not qsub, nor qsh).
- Some compute nodes are set aside for interactive use,
  - the corresponding queue is named qrsh.iq.
- To start an interactive session, use qrsh or qlogin:
  - qrsh will start an interactive session on one of the interactive nodes,
    - it takes both options and arguments (like qsub),
  - qlogin is similar to qrsh, although
    - it will propagate the $DISPLAY variable, so you can use X-enabled applications (e.g., to plot to the screen) if you've enabled X-forwarding, and
    - it does not take any arguments, but will take options.
- Unless you need X-forwarding, use qrsh.
Limits on the Interactive Queue:
Like any other queue, the interactive queue has its own limits:
| Limit | Value |
|---|---|
| CPU | 12h per slot (CPU/core) |
| Elapsed Time | 24h per session |
| Memory | 8GB/64GB per slot (CPU/core) |

- Like for qsub, you can request more than one slot (CPU/thread) with the -pe mthread N option,
  - where N is a number between 2 and 16, as in: qrsh -pe mthread 4
  - requesting more slots also allows you to use more memory (4 slots means up to 4 x 8G = 32G).
- Each user is limited to four (4) concurrent interactive sessions, and up to 16 slots (CPUs/cores).
- The overall limit on slots per user includes all the queues (so if you use them all in batch mode, you won't be able to get an interactive session).
The NSLOTS Variable
- As of Feb 3, 2020, $NSLOTS is properly propagated by qrsh, but not by qlogin.
  - There is no mechanism to propagate $NSLOTS with qlogin and enable X-forwarding.
Remember, Hydra is a shared resource, do not "waste" interactive slots by keeping your qlogin or qrsh session idle.
5. Workflow Manager Queue
Users who want to use a workflow manager (like NextFlow or Snakemake) can submit a job to the special workflow manager (WFM) queue lTWFM.sq.
- This queue allows you to run a job that will in turn submit jobs.
- You can use that queue for any other WFM, including your own scripts.
- Jobs can be submitted from the hosts in the workflow manager queue (@wfm-hosts).
To run a job in the workflow manager queue, you will need to specify -q lTWFM.sq -l wfmq to qsub, or add these as embedded directives in your job file.
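As a hedged illustration, a workflow-manager job file could look like the sketch below (the file name and the nextflow command line are placeholders for whatever workflow manager or script you actually run):

```
# sample WFM job file, submitted with: qsub wfm_demo.job
#$ -q lTWFM.sq -l wfmq
#$ -cwd -j y -N wfm_demo -o wfm_demo.log
#
# the workflow manager started here will, in turn, submit jobs to the cluster
nextflow run my_pipeline.nf
#
```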
Limits on the Workflow Manager queue:
Like any other queue, the WFM queue has its own limits:
| Limit | Value |
|---|---|
| CPU | 144h (6 days) per slot (CPU/core) |
| Elapsed Time | 720h (30 days) of elapsed time |
| Memory | 8GB/64GB per slot (CPU/core) |

- You can request more than one slot (CPU/thread) with the -pe mthread N option,
  - where N is currently limited to 2;
  - requesting more slots also allows you to use more memory (2 slots means up to 2 x 8G/64G = 16G/128G res/vmem).
- Each user is limited to one concurrent job in this queue, and up to 2 slots (CPUs/cores).
6. Memory Reservation
- We have implemented a memory reservation mechanism (via -l mres=XT).
  - This allows the job scheduler to guarantee that the requested amount of memory is available for your job on the compute node(s), by keeping track of the reserved memory and not scheduling jobs that would reserve more memory than is available.
  - Hence, reserving more than you will use prevents others (including your own other jobs) from accessing the available memory and, indirectly, the available CPUs (e.g., if you use one, or even just a few, CPUs but grab most of the memory of a given compute node).
- We have at least 2GB/CPU, and more often 4GB/CPU, on the compute nodes; still, it is recommended to reserve memory if your job will use more than 2GB/CPU, and to set h_data=X and h_vmem=X to match the value used in mres=XT.

Remember:
- The memory specification is
  - per JOB in mres=XT (total),
  - per CPU in h_data=X and h_vmem=X; it should be XT divided by the number of requested slots (see the short example after these notes).
- Memory is a scarce and expensive resource, compared to CPU, as we have fewer nodes with a lot of memory.
  - Try to guesstimate your memory usage as best you can, and
  - monitor the memory use of your job(s) (see Job Monitoring).
Note:
- Do not hesitate to re-queue a job if it uses a lot more or a lot less memory than your initial guess, especially if you plan to queue a slew of jobs;
  - often, memory usage scales with the problem in predictable ways, or the documentation might indicate memory requirements;
  - consider running some test cases to help you guesstimate, and
  - trim down your memory reservation whenever possible: a job that requests oodles of memory may wait a long time in the queue for that resource to free up.
- Consider breaking down a long task into separate jobs if different steps need different types of resources.
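To illustrate the per-job versus per-CPU distinction, here is a hedged sketch (the job file name and memory values are illustrative):

```
# a 10-slot multi-threaded job expected to use ~80GB in total:
#   mres   = 80G  (total, per job)
#   h_data =  8G  (per CPU, i.e., 80G / 10 slots)
#   h_vmem =  8G  (per CPU)
% qsub -q mThM.q -pe mthread 10 -l mres=80G,h_data=8G,h_vmem=8G,himem big_mem.job
```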
Resources Format Specifications
- The format for
  - a memory specification is a positive (decimal) number followed by a unit (aka a multiplier), like 13.4G for 13.4GB;
  - a CPU or RT time specification is h:m:s, like 100:00:00 for 100 hours (or "100::", while "100" means 100 seconds).
- See man sge_types for more details.
7. Host Groups
The GridEngine supports the concept of a host group, i.e., a list of hosts (compute nodes).
- You can request that a job run only on computers in a given host group, and we use these host groups to restrict queues to a specific list of hosts.
- To request computers of a given host group, use something like "-q mThC.q@@ib-hosts" (yes, there is a double "@").
- In fact, the queue specification passed to qsub can be a RE, so -q '?ThC.q@@ib-hosts' means any high-CPU queue, but only hosts on IB.
- You can get the list of all the host groups with (show host group list)
% qconf -shgrpl
- and get the list of hosts for a specific host group with
% qconf -shgrp <host-group-name>
Here is the list of host groups:
| Name | Description |
|---|---|
| @all-hosts | all the hosts |
| @hicpu-hosts | high-CPU hosts |
| @himem-hosts | high-memory hosts (512GB/host) |
| @xlmem-hosts | extra-large-memory hosts (>=1TB/host) |
| @io-hosts | hosts in the IO queue |
| @wfm-hosts | hosts in the WFM queue |
| @gpu-hosts | hosts with GPUs |
| @ssd-hosts | hosts with local SSD |
| @24c-hosts | hosts with 24 CPUs |
| @NNc-hosts | hosts with NN CPUs (NN=value up to 192) |
| @avx-hosts | hosts with AVX-capable CPUs |
| @avx2-hosts | hosts with AVX2-capable CPUs |
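For example, a host group can be combined with a queue name at submission time (the job file name is hypothetical):

```
# run a medium-time high-CPU job only on nodes that have a local SSD
% qsub -q mThC.q@@ssd-hosts my_job.job
```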
8. CPU Architecture
- The cluster is composed of compute nodes with different CPU architectures.
- You can tell the scheduler to run your job on specific CPU architecture(s) using the cpu_arch resource.
Composition
- The cluster's CPU architecture composition is as follows:
Compute Nodes | CPU Arch. | CPU Model Name |
---|---|---|
compute-64-xx | skylake | Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz |
compute-65-xx | zen | AMD EPYC 7713P 64-Core @1.94GHz |
compute-75-xx | zen | AMD EPYC 7H12 64-Core @ 2.53GHz |
compute-76-xx | zen | AMD EPYC 9654 64-Core |
| | zen | AMD EPYC 9534 64-Core |
compute-79-xx | skylake | Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz |
compute-84-xx | skylake | Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz |
compute-93-xx | haswell | Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz |
| | broadwell | Intel(R) Xeon(R) CPU E7-8860 v4 @ 2.20GHz |
Usage
Grid Engine/Qsub
- You can use the cpu_arch resource to request a specific architecture, a set of architectures, or to avoid specific architecture(s), like in
  qsub -l cpu_arch=haswell
  such a job will only run on nodes with the haswell CPU architecture.
- Alternatively, you can use logic constructs as follows:
  - qsub -l cpu_arch=\!broadwell - any CPU, except broadwell.
  - qsub -l cpu_arch='haswell|skylake' - either haswell OR skylake CPUs.
Tools
- You can retrieve the node's CPU architecture with the command get-cpu_arch, accessible after loading the module tools/local.
- You can query the cpu_arch list within a queue with, for example: qstat -F cpu_arch -q sThC.q
- You can set the environment variable cpu_arch by loading the module tools/cpu_arch.
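A brief sketch of these tools in use (assuming the modules are available as described above):

```
# print the CPU architecture of the node you are on
% module load tools/local
% get-cpu_arch
# or set the $cpu_arch environment variable instead
% module load tools/cpu_arch
% echo $cpu_arch
```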
Compilers
- Each compiler allows you to target specific processors (aka CPU architectures).
- The syntax is different for each compiler; read the compiler's user guide carefully.
Examples
- Run only on two types of architectures:
    #
    #$ -cwd -j y -N demo1 -o demo1.log
    #$ -l cpu_arch='haswell|skylake'
    #
    ./crunch
    #
- Run a different executable depending on the node's CPU architecture, and tell the scheduler to avoid broadwell CPUs:
    #
    #$ -cwd -j y -N demo2 -o demo2.log
    #$ -l cpu_arch='!broadwell'
    #
    module load tools/cpu_arch
    bin/$cpu_arch/crunch
    #
9. Queue Selection Validation and/or Verification
- You can submit a job script and verify if the GE can run it, i.e., can the GE find an adequate queue and allocate the requested resources?
- The qsub flag "-w v" will run a verification, while "-w p" will poke whether the job can run; in either case the job will not be submitted:
  % qsub -w v my_test.job
  or
  % qsub -w p my_test.job
  The difference being that "-w v" checks against an empty cluster, while "-w p" validates against the cluster's current status.
  By default all jobs are submitted with "-w e", producing an error for invalid requests.
  Overriding it with "-w w" or "-w n" can result in jobs that are queued but will never run, as they request more resources than will ever be available.
- You can also use the -verify flag to print detailed information about the would-be job, as though qstat -j was used, including the effects of command-line parameters and the external environment, instead of submitting the job:
  % qsub -verify my_test.job
10. Hardware Limits
The following table shows the current hardware limits:
| Queue name | Number of nodes | Number of slots | Number of CPUs per node | Available Memory | Comment |
|---|---|---|---|---|---|
| ?ThC.q | 60 | 5000 | 40 to 128 | >4GB/CPU | high-CPU queues |
| ?ThM.q | 50 | 4552 | 32 to 192 | >512GB per node | high-memory queues |
| uTxlM.rq | 3 | 480 | 96 to 192 | >1TB per node | extra-large-memory queue, restricted |
| ?Tgpu.q | 3 | 8 GPUs | - | - | GPU queues, need -l gpu |
| lTIO.sq | 2 | 8 | - | - | I/O queue to access /store |
| qrsh.iq | 2 | 40 | - | 256GB per node | interactive queue, use qrsh or qlogin |
| qgpu.iq | 3 | 8 GPUs | - | - | GPU interactive queue, use qrsh -l gpu |
The values in this table change as we modify the hardware configuration; you can verify them with either
% qstat -g c
or
% qstat+ -gc
and
% qhost
or
% qhost+
Notes
- We also impose software limits, namely how much of the cluster a single user can grab (see the resource limits under Submitting Jobs).
- If your pending requests exceed these limits, your queued jobs will wait.
- If you request inconsistent or unavailable resources, you will get the following error message:
  Unable to run job: error: no suitable queues.
  You can use "-w v" or "-verify" to track down why the GE can't find a suitable queue, as described elsewhere on this page.
Last updated SGK