- Introduction
- Matrix of Queues
- How to Specify a Queue
- Examples
- Interactive Queue
- Memory Reservation
- Format Specifications
- Host Groups
- Queue Selection Validation and/or Verification
- Hardware Limits
1. Introduction
Every job running on the cluster is started in a queue.
- The GE will select a queue based on the resources requested and the current usage in each queue.
- If you don't specify the right queue or the right resource(s), your job will either
  - not get queued,
  - wait forever and never run, or
  - start and get killed when it exceeds one of the limits of the queue it was started in.

All jobs run in batch mode; the default Rocks queue, all.q, is disabled and not available.
2. Matrix of Queues
The set of available queues forms a matrix:
- two sets of queues, each with four time limits (short, medium, long, and unlimited):
  - a high-cpu set, and
  - a high-memory set,
- complemented by
  - a very-high-memory restricted queue,
  - an interactive queue, and
  - a test queue.
- The high-cpu queues are for serial or parallel jobs that do not need a lot of memory (less than 6GB per CPU);
- the high-memory queues are for serial or multi-threaded parallel jobs that require a lot of memory (more than 6GB, but limited to 450GB);
- the very-high-memory queue is reserved for jobs that need a very large amount of memory (over 450GB); and
- a special queue is reserved for special projects that need particular resources (lots of memory, or that benefit from SSDs).
- There is also an interactive queue, to run interactively on a compute node, although it too has limits.

Here is the list of queues and their characteristics (time versus memory limits):
| Memory limit per CPU | short (T<7h) | medium (T<6d) | long (T<30d) | unlimited | Available parallel environments | Type of jobs |
|---|---|---|---|---|---|---|
| 6 GB | sThC.q | mThC.q | lThC.q | uThC.q | mpich, orte, mthread | serial or parallel, that need less than 6GB of memory per CPU |
| 450 GB | sThM.q | mThM.q | lThM.q | uThM.q | mthread | serial or multi-threaded, 6GB < memory needed < 450GB |
| 1 TB | | | | uTxlM.rq | mthread | serial or multi-threaded, memory needed > 450GB, restricted |
| 1 TB | | | | ssd.tq | mthread | test queue to access SSDs |
| 8 GB | qrsh.iq (T<12h) | | | | mthread | interactive queue; use qrsh instead of qsub (qsh is not supported) |
| 2 GB | | lTIO.sq (T<36h) | | | mthread | I/O queue (to access /store) |
Notes
- The listed time limit is the soft CPU limit (`s_cpu`); you can check the configured values yourself, as shown after this list.
- The soft elapsed time limit (aka real time, `s_rt`) is twice the soft CPU limit for all the queues, except for the medium-T queues, whose elapsed time limit is 9 days (1.5 times the CPU limit).
- The hard time limits (`h_cpu` and `h_rt`) are 15 minutes longer than the soft ones. Namely, in the short-time-limit queues, a serial job is
  - warned once it has consumed 7 hours of CPU, and killed after it has consumed 7 hours and 15 minutes of CPU, or
  - warned once it has spent 14 hours in the queue, and killed after spending 14 hours and 15 minutes in the queue.
- For parallel jobs the consumed CPU time is scaled by the number of allocated slots; the elapsed time is not.
- Memory limits are per CPU:
  - a parallel job in a high-cpu queue can thus use up to `NSLOTS` x 6 GB, where `NSLOTS` is the number of allocated slots (CPUs);
  - parallel jobs in the other queues are limited to multi-threaded jobs (no multi-node jobs).
- Memory usage is also limited by the available memory on a given node.
- If you believe that you need access to the restricted or test queue, contact us.
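
To check the configured limits yourself, you can query the queue configuration. A quick look, assuming the standard SGE commands are available on the login node (the `egrep` filter simply picks out the time-limit fields):

% qconf -sql                          # list all the defined queues
% qconf -sq sThC.q | egrep '_cpu|_rt' # show the CPU and elapsed time limits of sThC.q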
How to Specify a Queue
- By default, jobs are most likely to run in the short high-cpu queue (`sThC.q`).
- To select a different queue you need to either
  - specify the name of the queue (via `-q <name>`), or
  - pass a requested time limit (via `-l s_cpu=<value>` or `-l s_rt=<value>`).
- To use a high-memory queue, you need to
  - specify the memory requirement (with `-l mres=X,h_data=X,h_vmem=X`),
  - confirm that you need a high-memory queue (`-l himem`), and
  - select the time limit, either by specifying the queue name (via `-q <name>`) or by passing a requested time limit (via `-l s_cpu=<value>` or `-l s_rt=<value>`).
- To use the unlimited queues, i.e., uThC.q or uThM.q, you need to confirm that you request a low-priority queue (`-l lopri`); a sketch combining these flags follows this list.
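
Putting these flags together, the three cases above might look like this on the command line (a minimal sketch; `my.job` is a placeholder for your own job script):

% qsub -l s_rt=100:00:00 my.job                              # select a queue by elapsed-time request
% qsub -q mThM.q -l mres=8G,h_data=8G,h_vmem=8G,himem my.job # a high-memory queue
% qsub -q uThC.q -l lopri my.job                             # an unlimited (low-priority) queue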
Why do I need to add `-l himem` or `-l lopri`?

- This prevents the GE from dispatching jobs that requested little or no resources to the high-memory or unlimited queues, just because one of these queues happens to be less used.
- It prevents the scheduler from "wasting" valuable resources.
Examples
| qsub flags | Meaning of the request |
|---|---|
| `-l s_cpu=48:00:00` | 48 hours of consumed CPU (per slot) |
| `-l s_rt=200:00:00` | 200 hours of elapsed time |
| `-q mThC.q` | use the mThC.q queue |
| `-l mres=12G,h_data=12G,h_vmem=12G` | 12GB of memory (per CPU); a 10-CPU parallel job will reserve 120GB |
| `-q mThM.q -l mres=12G,h_data=12G,h_vmem=12G,himem` | run in the medium-time high-memory queue; this is a correct, i.e., complete, specification (memory use specs and `himem`) |
| `-q uThC.q -l lopri` | run in the unlimited high-cpu queue; note the `-l lopri` |
| `-q uTxlM.rq -l himem` | unlimited-time, extra-large-memory queue, restricted to a subset of users |
| `-q uTspM.rq -l himem` | unlimited-time, special-memory queue, restricted to a subset of users |
All jobs that use more than 2 GB of memory (per CPU) should include a memory reservation and requirement with `-l mres=X,h_data=X,h_vmem=X`.

- If you do not, your job(s) may not be able to grab the memory they need at run time and crash, or crash the node. A job-script version of a complete specification is sketched below.
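
The same flags can also be embedded in the job script itself; a minimal sketch, where the job name and the program it runs (`./my_analysis`) are hypothetical placeholders:

#!/bin/sh
# Embedded qsub flags: medium-time high-memory queue, with a complete
# memory specification (mres/h_data/h_vmem) and the himem confirmation.
#$ -q mThM.q
#$ -l mres=12G,h_data=12G,h_vmem=12G,himem
# -cwd: run in the current directory; -j y: merge stderr into stdout.
#$ -N my_analysis -cwd -j y
./my_analysis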
Interactive Queue
You can start an interactive session on a compute node using the command `qrsh` (and not `qsub`, nor `qsh`).

- There are currently two compute nodes, each with 64 cores and 256GB of memory, set aside for interactive use.
  - `qrsh` will log you on one of them, to use one CPU (core or slot) and 8GB of memory;
  - to increase these values, use `qrsh -pe mthread N`, where N is between 2 and 32, to use 2 to 32 slots, or 16 to 256 GB of memory.
- The CPU limit is 12h per requested CPU (core or slot), while the elapsed time limit is set to 24 hours.
- A single user can't use more than 1 job and 16 slots in the interactive queue, and can't keep it for longer than 24 hours.
- By default, the variable `NSLOTS` is not set by `qrsh`. If you want/need it to be set to the value of `N` passed to `-pe mthread N`, use either `qrsh -pe mthread N -pty y bash` or `qrsh -pe mthread N -pty y csh`, depending on which shell you want to launch (more details available here). If you forget the "`-pty y`", you won't get a prompt; simply type `exit`.
- Since we have not installed X11 on the compute nodes, `qsh` will always fail; if you need X on the interactive nodes, contact us.
- Since Hydra is a shared resource, do not "waste" interactive slots by keeping your `qrsh` session idle.
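
For example, a typical interactive session requesting 4 slots (i.e., 32GB of memory), with bash so that `NSLOTS` gets set:

% qrsh -pe mthread 4 -pty y bash
$ echo $NSLOTS    # on the interactive node; should print 4
$ exit            # log out as soon as you are done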
Memory Reservation
- We have implemented a memory reservation mechanism (via `-l mres=XX`).
  - This allows the job scheduler to guarantee that the requested amount of memory is available for your job on the compute node(s), by keeping track of the reserved memory and not scheduling jobs that reserve more memory than is available.
  - Hence, reserving more than you will use prevents others (including your own other jobs) from accessing the available memory, and indirectly the available CPUs (as when you use one, or even just a few, CPUs but grab most of the memory of a given compute node).
- We have at least 2GB/CPU, and more often 4GB/CPU, on the compute nodes; still, it is recommended to reserve memory if your job will use more than 2GB/CPU, and to set `h_data=X` and `h_vmem=X` consistently with the value used in `mres=XX` (a worked example follows this list).
- Remember:
  - The memory specification is
    - per JOB in `mres=XX` (it is no longer scaled by the number of allocated slots/CPUs/threads), and
    - per CPU in `h_data=X` and `h_vmem=X`, so it should be `XX` divided by the number of requested slots.
  - Memory is a scarce and expensive resource, compared to CPU, as we have fewer nodes with a lot of memory.
    - Try to guesstimate your memory usage as best you can, and
    - monitor the memory use of your job(s) (see Monitoring your Jobs).
- Note:
  - Do not hesitate to re-queue a job if it uses a lot more, or a lot less, than your initial guess, esp. if you plan to queue a slew of jobs;
  - memory usage scales with the problem in most often predictable ways: consider running some test cases to help you guesstimate, and trim down your memory reservation whenever possible, since a job that requests oodles of memory may wait a long time in the queue for that resource to free up.
  - Consider breaking down a long task into separate jobs if different steps need different types of resources.
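
To make the per-job versus per-CPU distinction concrete, consider a hypothetical 10-slot multi-threaded job that needs 8GB per CPU, i.e., 80GB in total (`my.job` is a placeholder):

% qsub -q mThM.q -pe mthread 10 -l mres=80G,h_data=8G,h_vmem=8G,himem my.job

Here `mres=80G` reserves the total (per-job) amount, while `h_data` and `h_vmem` are set to 80G divided by 10 slots, i.e., 8G per CPU.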
Format Specifications
- The format for
  - a memory specification is a positive (decimal) number followed by a unit (aka a multiplier), like `13.4G` for 13.4 GB;
  - a CPU or RT time specification is `h:m:s`, like `100:00:00` for 100 hours (or "`100::`", while "`100`" means 100 seconds).
- See `man sge_types` for the complete syntax.
Host Groups
The GE supports the concept of a host group, i.e., a list of hosts (compute nodes).

- You can request that a job run only on computers in a given host group, and we use these host groups to restrict queues to a specific list of hosts.
- To request computers of a given host group, use something like "`-q mThC.q@@ib-hosts`" (yes, there is a double "`@`").
- In fact, the queue specification passed to `qsub` can be a regular expression (RE), so `-q '?ThC.q@@ib-hosts'` means any high-cpu queue, but only on hosts on IB.
- You can get the list of all the host groups with (show host group list)
% qconf -shgrpl
- and get the list of hosts for a specific host group with
% qconf -shgrp <host-group-name>
Here is the list of host groups:

| Name | Description |
|---|---|
| @allhosts | all the hosts |
| @hicpu-hosts | high-CPU hosts |
| @himem-hosts | high-memory hosts (512GB/host) |
| @xlmem-hosts | extra-large-memory hosts (1TB/host) |
| @supermicro-hosts | SM hosts (for special projects) |
| @ib-hosts | hosts on the IB |
| @12c-hosts | hosts with 12 CPUs |
| @16c-hosts | hosts with 16 CPUs |
| @24c-hosts | hosts with 24 CPUs |
| @40c-hosts | hosts with 40 CPUs |
| @64c-hosts | hosts with 64 CPUs |
| @avx-hosts | hosts with AVX-capable CPUs |
| @tcp-hosts | hosts on TCP/IP only |
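
For instance, to see which hosts make up a group and then target that group (with `my.job` a placeholder for your job script):

% qconf -shgrp @ib-hosts          # list the hosts in @ib-hosts
% qsub -q mThC.q@@ib-hosts my.job # run only on the IB hosts of mThC.q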
Queue Selection Validation and/or Verification
- You can submit a job script and verify whether the GE can run it, i.e., can the GE find an adequate queue and allocate the requested resources?
- The qsub flag "`-w v`" will run a verification, while "`-w p`" will poke whether the job can run; in either case the job is not submitted:
  % qsub -w v my_test.job
  or
  % qsub -w p my_test.job
  The difference is that "`-w v`" checks against an empty cluster, while "`-w p`" validates against the cluster in its current state.
- By default, all jobs are submitted with "`-w e`", which produces an error for invalid requests. Overriding it with "`-w w`" or "`-w n`" can result in jobs that are queued but will never run, as they request more resources than will ever be available.
- You can also use the `-verify` flag to print detailed information about the would-be job, as though `qstat -j` were used, including the effects of command-line parameters and the external environment, instead of submitting the job:
  % qsub -verify my_test.job
3. Hardware limits
The following table shows the current hardware limits (to be reviewed):

| Queue type | Number of nodes | Number of slots | CPUs per node | Memory per node | Comment |
|---|---|---|---|---|---|
| ?ThC.q | 78 | 2884 | 12 to 64 | 2GB to 4GB/CPU | |
| ?ThM.q | 14 | 816 | 24 or 64 | 512GB per node | |
| uTxlM.rq | 2 | 80 | 40 | 1TB per node | |
| ssd.tq | 2 | (misc) | (misc) | (misc) | local SSDs (test queue) |
| qrsh.iq | | | | | `qrsh` (not `qsh` or `qsub`) |
| lTIO.sq | | | | | |
The values in this table change as we modify the hardware configuration; you can verify them with either

% qstat -g c

or

% module load tools/local
% qstat+ -gc

and with

% qhost
Notes
- We also impose software limits, namely how much of the cluster a single user can grab (see the resource limits in Submitting Jobs).
- If your pending requests exceed these limits, your queued jobs will wait.
- If you request inconsistent or unavailable resources, you will get the following error message:
  Unable to run job: error: no suitable queues.
  You can use "`-w v`" or "`-verify`" to track down why the GE can't find a suitable queue, as described elsewhere on this page.
Last updated SGK