A job is submitted with the command qsub and a job file: qsub, with options (or embedded directives), followed by the name of the file containing the job script.
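For example, a minimal sketch (the job file myjob.job and the program ./mycode are hypothetical):
% qsub -N myjob -cwd -j y -o myjob.log myjob.job
or, equivalently, with the options embedded as directives (lines starting with #$) in the job script itself:
#!/bin/csh
# file myjob.job (hypothetical)
#$ -N myjob
#$ -cwd -j y -o myjob.log
echo started on `hostname` at `date`
./mycode
echo done at `date`
which can then be submitted simply with
% qsub myjob.job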
The Conceptual Examples page explains how to submit jobs and how to write simple job scripts.
Note:
Instead of queuing, let's say, 100 jobs (like 100 bootstraps, 100 light curves to analyze, or 100 models to build) one at a time, you can submit one job and request to have the GE run it 100 times, i.e., for 100 tasks.
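For example, a sketch (the job file analyze.job and the program ./analyze are hypothetical): the -t flag of qsub requests a range of tasks,
% qsub -t 1-100 analyze.job
and within analyze.job each task can use the task index that the GE stores in the environment variable SGE_TASK_ID:
# each task analyzes its own input file
./analyze input.$SGE_TASK_ID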
Because the cluster is a shared resource, it is the GE that allocates CPUs, known as slots in GE jargon; hence a parallel job must request a set of slots when it is submitted, by passing the option
-pe <pe-name> N
to qsub, where <pe-name> is the name of the parallel environment (PE) and N is the number of requested slots.
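For example, a sketch (the PE name mthread and the job file crunch.job are given only as an illustration; use a PE defined on the cluster):
% qsub -pe mthread 16 crunch.job
requests 16 slots from the PE called mthread.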
Every job running on the cluster is started in a queue. The set of available queues is a matrix of queues: the high-CPU and high-memory sets of queues each come in different time limits (short, medium, long and unlimited).
Type | Description |
---|---|
high-CPU | for serial or parallel jobs that do not need a lot of memory |
high-memory | for serial or multi-threaded parallel jobs that require a lot of memory |
very-high-memory | reserved for jobs that need a very large amount of memory |
other | for interactive use or projects that need special resources (SSD, ...) |
The Available Queues page describes the available queues in detail.
While each queue has a set of limits (CPU, memory), the cluster also has some global limits.
There are limits on the number of slots and on the amount of reserved memory, both for all users combined and per user, for each set of queues (e.g., lThM.q). The actual limits are subject to change depending on the cluster usage and the hardware configuration.
To check the global limits:
% qconf -sconf global | grep max
and the explanation of these parameters can be found in
% man 5 sge_conf
To check the queue specific resource limits, use
% qconf -srqs
{
   name         slots
   description  Limit slots for all users together
   enabled      TRUE
   limit        users * to slots=3964
   limit        users * queues {sThC.q,lThC.q,mThC.q,uThC.q} to slots=3524
   limit        users * queues {sThM.q,mThM.q,lThM.q,uThM.q} to slots=968
   limit        users * queues {uTxlM.rq} to slots=176
}
{
   name         u_slots
   description  Limit slots/user for all queues
   enabled      TRUE
   limit        users {*} to slots=512
}
{
   name         hiCPU_u_slots
   description  Limit slots/user in hiCPU queues
   enabled      TRUE
   limit        users {*} queues {sThC.q} to slots=881
   limit        users {*} queues {mThC.q} to slots=622
   limit        users {*} queues {lThC.q} to slots=293
   limit        users {*} queues {uThC.q} to slots=97
}
{
   name         hiMem_u_slots
   description  Limit slots/user for hiMem queues
   enabled      TRUE
   limit        users {*} queues {sThM.q} to slots=242
   limit        users {*} queues {mThM.q} to slots=121
   limit        users {*} queues {lThM.q} to slots=60
   limit        users {*} queues {uThM.q} to slots=15
}
{
   name         xlMem_u_slots
   description  Limit slots/user for XlMem and special restricted queues
   enabled      TRUE
   limit        users {*} queues {uTxlM.rq} to slots=176
}
{
   name         qrsh_u_slots
   description  Limit slots/user for interactive (qrsh) queues
   enabled      TRUE
   limit        users {*} queues {qrsh.iq} to slots=32
}
{
   name         mem_res
   description  Limit total reserved memory for all users per queue type
   enabled      TRUE
   limit        users * queues {sThC.q,lThC.q,mThC.q,uThC.q} to mem_res=18641G
   limit        users * queues {sThM.q,mThM.q,lThM.q,uThM.q} to mem_res=8077G
   limit        users * queues {uTxlM.rq} to mem_res=4039G
}
{
   name         u_mem_res
   description  Limit reserved memory per user for specific queues
   enabled      TRUE
   limit        users {*} queues {sThC.q,lThC.q,mThC.q,uThC.q} to mem_res=4660G
   limit        users {*} queues {sThM.q,mThM.q,lThM.q,uThM.q} to mem_res=2019G
   limit        users {*} queues {uTxlM.rq} to mem_res=4039G
}
{
   name         gpu
   description  Limit GPUs for all users in GPU queue
   enabled      TRUE
   limit        users * queues {uTGPU.tq} to ngpu=4
}
{
   name         u_gpu
   description  Limit GPUs per user in GPU queue
   enabled      TRUE
   limit        users {*} queues {uTGPU.tq} to ngpu=3
}
{
   name         blast2GO
   description  Limit to set aside a slot for blast2GO
   enabled      TRUE
   limit        users * queues !lTb2g.q hosts {@b2g-hosts} to slots=62
   limit        users * queues lTb2g.q hosts {@b2g-hosts} to slots=1
}
Note that these values get adjusted as needed.
The explanation of the resource quota set (rqs) can be found in
% man 5 sge_resource_quota
To check how much of these resources (queue quotas) is used overall, or by your job(s), use:
% qquota
or
% qquota -u $USER
You can also inquire about a specific resource (qquota -l mem_res), and use the local tool qquota+ (module load tools/local), like in
% qquota+ +% -l slots -u hpc
(more info via qquota+ -help or man qquota+).
To check the limits of a specific queue (CPU and memory), use
% qconf -sq sThC.q
and the explanation of these parameters can be found in
% man 5 queue_conf
under the RESOURCE LIMITS
heading.
You can submit a job that will be held until another job has completed, by passing the -hold_jid <jobid> flag to qsub:
% qsub -N FirstOne pre-process.job
Your job 12345678 ("FirstOne") has been submitted
% qsub -hold_jid 12345678 -N SecondOne post-process.job
Your job 12345679 ("SecondOne") has been submitted
This can be scripted, using the -terse flag of qsub to capture each job ID (or use the local tool qchain, see below):
#!/bin/csh
#
set parameter = $1
set name = $2
#
set jid1 = `qsub -terse -N "pre-process-$name" pre-process.job $parameter`
echo $jid1 submitted '("'pre-process-$name'")'
set jid2 = `qsub -terse -hold_jid $jid1 -N "process-$name" process.job $parameter`
echo $jid2 submitted '("'process-$name'")'
set jid3 = `qsub -terse -hold_jid $jid2 -N "post-process-$name" post-process.job $parameter`
echo $jid3 submitted '("'post-process-$name'")'
This example will submit three jobs, pre-process.job, process.job and post-process.job, to be run sequentially; each job takes one argument (the parameter) and is named using the second argument (the name).
You can use the qchain tool (available after loading the tools/local module) to submit jobs that must run sequentially.
module load tools/local
qchain *.job
will submit the job files that match "*.job" in the order given by "echo *.job".
By using quotes, qchain allows you to pass arguments to both qsub and the job scripts, as follows:
module load tools/local
qchain '-N start first.job 123' '-N crunch second.job 123' '-N post-process finish.job 123'
You can limit how many jobs you submit with the following trick:
# define how many jobs to queue
@ NMAX = 250
#
loop:
  @ N = `qstat -u $USER | tail --lines=+3 | wc -l`
  if ($N >= $NMAX) then
    sleep 180
    goto loop
  endif
#
This example counts how many jobs you have in the queue (running and waiting) using the command qstat (with tail and wc -l), and pauses for 3 minutes (180 seconds) if that count is 250 or higher. You would include these lines in a script that submits a slew of jobs but should not queue more than a given number at any time (to count only the queued jobs, add -s p to qstat).
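For instance, a sketch of such a submission script (the input file names and the job file crunch.job are hypothetical); it uses a while loop instead of the goto shown above, but implements the same throttling:
#!/bin/csh
# submit one job per input file, but never keep more than NMAX jobs queued or running
@ NMAX = 250
foreach file (input-*.dat)
  @ N = `qstat -u $USER | tail --lines=+3 | wc -l`
  while ($N >= $NMAX)
    sleep 180
    @ N = `qstat -u $USER | tail --lines=+3 | wc -l`
  end
  qsub -N crunch-$file crunch.job $file
end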
You can also use the local tool q-wait (needs the module tools/local), which takes an argument and two options:
% q-wait blah
waits until no jobs whose name holds 'blah' are left queued or running.
% q-wait -N 125 -wait 3600 crunch
waits until fewer than 125 jobs whose name holds 'crunch' are left queued or running, checking once an hour.
Remember the -V flag to qsub if your job needs your current environment: the -V flag passes all the active environment variables to the script.
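For example, a sketch (the variable MY_THRESHOLD and the job file my-analysis.job are hypothetical):
% setenv MY_THRESHOLD 0.25
% qsub -V my-analysis.job
makes MY_THRESHOLD, along with the rest of your current environment, available to the job script.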
You can find examples of simple/trivial test cases with source code, Makefile, job script files and resulting log files under ~hpc/examples.
The examples are organized as follows:
Directory | Content |
---|---|
 | example using the C++11 extension |
gpu/ | GPU examples and (old) timings |
gsl/ | simple test that uses GSL |
hybrid/ | examples for using the hybrid PE |
idl/ | examples for running IDL/GDL/FL jobs |
java/ | example running JAVA |
lapack/ | example linking with LAPACK and Intel's MKL |
memtest/ | examples for large memory use and reservation |
 | miscellaneous |
mpi/ | examples using MPI |
openmp/ | example using OpenMP |
python/ | list of different implementations of PYTHON available on Hydra |
serial/ | simple (hello world) serial job, for each compiler (gcc, Intel, PGI) |
You can use the command find to get a list of all the subdirectories under ~hpc/examples, i.e.:
% find ~hpc/examples -type d -print
There is a web page with an app to help you choose a queue and write a job script, in particular how to write the embedded directives and how to load modules.
To choose a queue, you need to know what resources your computation will use (number of CPUs, amount of memory, and CPU time). Indeed:
If your computation will use | your job script needs to | qsub option needed/recommended |
---|---|---|
more than one CPU (i.e., a parallel job) | request a PE and N slots | -pe <pe-name> N or -pe <pe-name> N-M |
more than 2GB/CPU of memory | reserve the required memory | -l mres=X,h_data=X,h_vmem=X |
more than 6GB/CPU of memory | use a high-memory queue, and reserve the required memory | -l mres=X,h_data=X,h_vmem=X,himem |
up to T hours of CPU (per CPU) | specify the required amount | -l s_cpu=T:0:0 |
 | or specify the queue | -q mThC.q |
no idea how much CPU | use an unlimited, low priority queue | -q uThC.q -l lopri |
where X can be something like 2GB, and T can be something like 240 (for 240 hours, i.e., 10 days).
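For example, a sketch (the job file big-crunch.job is hypothetical): to reserve 4GB of memory and request up to 120 hours of CPU time for a serial job, you could submit
% qsub -l mres=4G,h_data=4G,h_vmem=4G -l s_cpu=120:0:0 big-crunch.job
following the options listed in the table above.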
The local tool check-jobs allows you to check the resources consumed by jobs that have completed.
There can be different reasons why a job is rejected:
Use the -w v or the -verify flag to qsub (see queue selection validation and/or verification) to check a job script file.
There can be different reasons why a job remains in the queue:
Use the command qconf -srqs or qquota (see how to check under resource limits).
The local tool check-qwait
allows you to visualize the queue quota resources and which jobs are waiting.
Last Updated SGK