1. Introduction
  2. Conceptual Examples
  3. Serial Jobs
  4. Job Arrays
  5. Parallel Jobs
    1. MPI Jobs
    2. Multi-Threaded Jobs
  6. Available Queues
    1. High-CPUs Queues
    2. High-Memory Queues
    3. Very-High Memory Queues
    4. Other Queues (interactive, access to SSD, ...)
  7. Resource Limits
  8. Examples
  9. Help me Choose a Queue and Write a Job Script
    1. QSubGen: a Job Script Generator
    2. What Queue Should I Use?
    3. Why Can't I Queue that Job?
    4. Why Is my Job Queued but not Running?

1. Introduction

2. Conceptual Examples

(info) The Conceptual Examples page explains how to submit jobs and how to write simple job scripts.

The page presents:

  1. Trivial Example
  2. Better Example
  3. Example with Embedded Directive
  4. Example with Embedded Directive and Arguments
  5. Note
    1. Miscellaneous
    2. Environment Variables
    3. Catching Time Limits

(lightbulb) Note

  1. A job runs in a queue, and each queue has some form of limit:
    • in most cases, a job is not allowed to run forever, nor to grab as much memory as it may want;
    • how to specify resources and which queues to use is explained in the Available Queues page.

  2. There is some overhead in starting a job, so it is bad practice to submit a slew of very small jobs.
    While you may find it convenient to submit 10,000 five-minute-long jobs, the system will end up taking as much time starting the jobs as the jobs will take to run.
    To prevent clobbering the system, there is a limit on how many jobs a single user can submit to any of the queues (see the explanations in the sections about hardware limits and resource limits).

  3. The cluster is a shared resource:
    • there are limits on how much of the cluster resources (CPUs, memory, etc.) a single user can grab at any time.

  4. In most cases your job script also needs to load a module or a set of modules.
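
As a rough illustration of point 2, the sketch below (written for the Bourne shell, not csh) groups a number of short tasks into fewer, longer jobs by printing one task range per job. The function name task_ranges and the convention of numbering tasks from 1 are assumptions made for this example, not something the cluster provides.

```shell
#!/bin/sh
# Print one "first-last" task range per job, so that n_tasks short tasks
# are grouped into jobs of at most per_job tasks each.
# Both arguments are assumed to be positive integers (hypothetical helper).
task_ranges() {
    n_tasks=$1
    per_job=$2
    first=1
    while [ "$first" -le "$n_tasks" ]; do
        last=$((first + per_job - 1))
        # The last job may hold fewer tasks than the others.
        if [ "$last" -gt "$n_tasks" ]; then
            last=$n_tasks
        fi
        echo "$first-$last"
        first=$((last + 1))
    done
}
```

For instance, task_ranges 10000 100 prints 100 ranges (1-100, 101-200, and so on), so 10,000 short tasks could be run as 100 jobs that each loop over their own range.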

3. Serial Jobs

4. Job Arrays

5. Parallel Jobs

6. Available Queues

Every job running on the cluster is started in a queue.

The set of available queues is a matrix of queues:

7. Resource Limits

(warning) While each queue has a set of limits (CPU, memory), the cluster also has some global limits.

What are the Resource Limits

There are limits on

  1. how many jobs can be queued simultaneously,
  2. how many jobs can run simultaneously,
  3. how much memory can be reserved simultaneously,
  4. and, for some queues, how many concurrent jobs a user can have.

(warning) The actual limits are subject to change depending on the cluster usage and the hardware configuration.

How to Check the Resource Limits

To check the global limits:

   % qconf -sconf global | grep max

and the explanation of these parameters can be found in

   % man 5 sge_conf


To check the queue specific resource limits, use

   % qconf -srqs

{
   name         slots
   description  Limit slots for all users together
   enabled      TRUE
   limit        users * to slots=3964
   limit        users * queues {sThC.q,lThC.q,mThC.q,uThC.q} to slots=3524
   limit        users * queues {sThM.q,mThM.q,lThM.q,uThM.q} to slots=968
   limit        users * queues {uTxlM.rq} to slots=176
}
{
   name         u_slots
   description  Limit slots/user for all queues
   enabled      TRUE
   limit        users {*} to slots=512
}
{
   name         hiCPU_u_slots
   description  Limit slots/user in hiCPU queues
   enabled      TRUE
   limit        users {*} queues {sThC.q} to slots=881
   limit        users {*} queues {mThC.q} to slots=622
   limit        users {*} queues {lThC.q} to slots=293
   limit        users {*} queues {uThC.q} to slots=97
}
{
   name         hiMem_u_slots
   description  Limit slots/user for hiMem queues
   enabled      TRUE
   limit        users {*} queues {sThM.q} to slots=242
   limit        users {*} queues {mThM.q} to slots=121
   limit        users {*} queues {lThM.q} to slots=60
   limit        users {*} queues {uThM.q} to slots=15
}
{
   name         xlMem_u_slots
   description  Limit slots/user for XlMem and special restricted queues
   enabled      TRUE
   limit        users {*} queues {uTxlM.rq} to slots=176
}
{
   name         qrsh_u_slots
   description  Limit slots/user for interactive (qrsh) queues
   enabled      TRUE
   limit        users {*} queues {qrsh.iq} to slots=32
}
{
   name         mem_res
   description  Limit total reserved memory for all users per queue type
   enabled      TRUE
   limit        users * queues {sThC.q,lThC.q,mThC.q,uThC.q} to mem_res=18641G
   limit        users * queues {sThM.q,mThM.q,lThM.q,uThM.q} to mem_res=8077G
   limit        users * queues {uTxlM.rq} to mem_res=4039G
}
{
   name         u_mem_res
   description  Limit reserved memory per user for specific queues
   enabled      TRUE
   limit        users {*} queues {sThC.q,lThC.q,mThC.q,uThC.q} to mem_res=4660G
   limit        users {*} queues {sThM.q,mThM.q,lThM.q,uThM.q} to mem_res=2019G
   limit        users {*} queues {uTxlM.rq} to mem_res=4039G
}
{
   name         gpu
   description  Limit GPUs for all users in GPU queue
   enabled      TRUE
   limit        users * queues {uTGPU.tq} to ngpu=4
}
{
   name         u_gpu
   description  Limit GPUs per user in GPU queue
   enabled      TRUE
   limit        users {*} queues {uTGPU.tq} to ngpu=3
}
{
   name         blast2GO
   description  Limit to set aside a slot for blast2GO
   enabled      TRUE
   limit        users * queues !lTb2g.q hosts {@b2g-hosts} to slots=62
   limit        users * queues lTb2g.q hosts {@b2g-hosts} to slots=1
}

Note that these values get adjusted as needed.

The explanation of the resource quota set (rqs) can be found in

    % man 5 sge_resource_quota
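
Since the qconf -srqs listing is long, a small filter can help pick out the rules that name one queue. The sketch below (Bourne shell with awk, not csh) reads a listing in the format shown above from standard input; the function name rqs_for_queue is hypothetical. It only reports rules that list the queue explicitly, so rules without a queues clause (which apply to every queue) and negated entries such as !lTb2g.q are not reported.

```shell
#!/bin/sh
# Read "qconf -srqs" output on standard input and print, for the queue
# given as first argument, "<rule set name>: <limit>" for every limit
# line whose queues list names that queue explicitly.
rqs_for_queue() {
    awk -v q="$1" '
        $1 == "name" { set = $2 }              # remember the current rule set
        $1 == "limit" {
            for (i = 2; i < NF; i++) {
                if ($i != "queues") continue
                list = $(i + 1)
                gsub(/[{}]/, "", list)         # {a.q,b.q} becomes a.q,b.q
                n = split(list, names, ",")
                for (j = 1; j <= n; j++)
                    if (names[j] == q) print set ": " $NF
            }
        }
    '
}
```

On a login node this would be used as qconf -srqs | rqs_for_queue sThC.q, which prints lines such as hiCPU_u_slots: slots=881 for the listing shown above.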


To check how much of these resources (queue quotas) is used overall, or by your job(s), use:

   % qquota

or

   % qquota -u $USER


You can also inquire about a specific resource (qquota -l mem_res), and use the local tool qquota+ (module load tools/local), as in

   % qquota+ +% -l slots -u hpc

(more info via qquota+ -help or man qquota+.)

To check the limits of a specific queue (CPU and memory), use

   % qconf -sq sThC.q

and the explanation of these parameters can be found in

   % man 5 queue_conf

under the RESOURCE LIMITS heading.

NOTES

#!/bin/csh
#
# Arguments: a parameter passed to each job script, and a name used in the job names
set parameter = $1
set name = $2
#
# qsub -terse prints only the job ID; -hold_jid makes a job wait for that ID
set jid1 = `qsub -terse -N "pre-process-$name" pre-process.job $parameter`
echo $jid1 submitted '("'pre-process-$name'")'
set jid2 = `qsub -terse -hold_jid $jid1 -N "process-$name" process.job $parameter`
echo $jid2 submitted '("'process-$name'")'
set jid3 = `qsub -terse -hold_jid $jid2 -N "post-process-$name" post-process.job $parameter`
echo $jid3 submitted '("'post-process-$name'")'

This example submits 3 jobs, pre-process.job, process.job and post-process.job, to be run sequentially: each job is held (-hold_jid) until the previous one has finished.
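
The same pattern can be generalized to any number of job scripts. The following is a minimal sketch, written for the Bourne shell rather than csh, that uses the same qsub -terse and -hold_jid options as the script above; the function name submit_chain is hypothetical, and the job names and parameter are left out to keep it short.

```shell
#!/bin/sh
# Submit the given job scripts as a chain: each job is held until the
# previously submitted one has finished.
# Prints "<job id> <script>" for every submission (hypothetical helper).
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job: nothing to wait for.
            jid=$(qsub -terse "$script") || return 1
        else
            # Later jobs: hold until the previous job ID has finished.
            jid=$(qsub -terse -hold_jid "$prev" "$script") || return 1
        fi
        echo "$jid $script"
        prev=$jid
    done
}
```

It would be called as submit_chain pre-process.job process.job post-process.job, and it stops at the first submission that qsub rejects.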

8. Examples

You can find examples of simple/trivial test cases with source code, Makefile, job script files and resulting log files under ~hpc/examples.

9. Help me Choose a Queue and Write a Job Script

QSubGen: a Job Script Generator

There is a web page with an app to help you choose a queue and write a job script, mostly the embedded directives and the module load commands.

What Queue Should I Use?

To choose a queue, you need to know

  1. whether it is a serial (single CPU) or parallel (multiple CPUs) job,
  2. if it is a parallel job, what kind,
  3. how much memory this job will need,
  4. how much CPU time it will require.

Indeed:

If your computation will use       your job script needs to                qsub option needed/recommended
---------------------------------  --------------------------------------  ------------------------------------
more than one CPU (parallel jobs)  request a PE and N slots                -pe <pe-name> N or -pe <pe-name> N-M
more than 2GB/CPU of memory        reserve the required memory             -l mres=X,h_data=X,h_vmem=X
more than 6GB/CPU of memory        use a high-memory queue, and            -l mres=X,h_data=X,h_vmem=X,himem
                                   reserve the required memory
up to T hours of CPU (per CPU)     specify the required amount             -l s_cpu=T:0:0
                                   or specify the queue                    -q mThC.q
no idea how much CPU               use an unlimited, low priority queue    -q uThC.q -l lopri
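
The memory rows of the table above can be read as a simple decision rule. The sketch below (Bourne shell, not csh) prints the corresponding qsub option for a serial job; the function name mem_option is hypothetical, and writing X as a whole number of gigabytes with a G suffix is an assumption, since the table only gives X.

```shell
#!/bin/sh
# Print the qsub memory option for a serial job that needs mem_gb GB of
# memory, following the 2GB and 6GB thresholds of the table above.
# mem_gb is assumed to be a whole number (hypothetical helper).
mem_option() {
    mem_gb=$1
    opts="mres=${mem_gb}G,h_data=${mem_gb}G,h_vmem=${mem_gb}G"
    if [ "$mem_gb" -le 2 ]; then
        # 2GB or less: the table lists no memory option.
        echo ""
    elif [ "$mem_gb" -le 6 ]; then
        # More than 2GB: reserve the required memory.
        echo "-l $opts"
    else
        # More than 6GB: also request a high-memory queue.
        echo "-l $opts,himem"
    fi
}
```

For example, mem_option 8 prints -l mres=8G,h_data=8G,h_vmem=8G,himem, which can be pasted into a qsub command or an embedded directive.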

Why Can't I Queue that Job?

There can be different reasons why a job is rejected:

To check a job script file, use the -w v or the -verify flag of qsub (see queue selection validation and/or verification).

Why Is my Job Queued but not Running?

There can be different reasons why a job remains in the queue:

Use the command qconf -srqs or qquota (see How to Check the Resource Limits, under Resource Limits).

The local tool check-qwait allows you to visualize the queue quota resources and which jobs are waiting.


Last Updated   SGK