
  1. Introduction
  2. Conceptual Examples
  3. Serial Jobs
  4. Job Arrays
  5. Parallel Jobs
    1. MPI Jobs
    2. Multi-Threaded Jobs
  6. Available Queues
    1. High-CPU Queues
    2. High-Memory Queues
    3. Very-High-Memory Queues
    4. Other Queues (interactive, access to SSD, ...)
  7. Resource Limits
  8. Examples
  9. Help me Choose a Queue and Write a Job Script
    1. QSubGen: a Job Script Generator
    2. What Queue Should I Use?
    3. Why Can't I Queue that Job?
    4. Why Is my Job Queued but not Running?

1. Introduction

  • Jobs are submitted from either login node to the job scheduler (the Grid Engine, or GE), using the command qsub and a job file;
  • submitted jobs may wait in the queue until the requested resource(s) is/are available, or, if a user has reached a resource usage limit, until that limit has cleared.

  • The scheduler will eventually run each job, starting it on one or several compute nodes; the job will run in batch, not interactive, mode;
  • it is the scheduler that selects which compute node(s) to run a job on, and
  • if the job exceeds a limit, like using too much memory or consuming too much CPU time, the scheduler will kill the job.

  • To run a computation users must write a list of instructions, aka a job script, that specifies the steps needed to perform that computation; and
  • pass directives to the scheduler as to which resources are required (like the amount of memory, CPU time, number of CPUs, etc).

  • The job script can also contain these directives, aka embedded directives.
  • A job is submitted with the command qsub, with options (or embedded directives), followed by the name of the file containing the job script (a minimal sketch is shown at the end of this section).

  • A few compute nodes, currently 2, are set aside for interactive use; see the section below on the interactive queue.
  • Do not use the login nodes to run any substantive computation: the login nodes are monitored, and a task running on a login node that consumes too many resources will first have its priority reduced and will eventually be terminated.

  • The following web page gives you easy access to the GE's man pages.
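
For illustration, here is a minimal sketch of a job script with embedded directives (the file name myjob.job and the script body are hypothetical placeholders; the Conceptual Examples page below has the authoritative examples):

A minimal job script, using C-shell syntax
#!/bin/csh
#
# lines starting with "#$" are embedded directives for the scheduler
#$ -N myjob                # name of the job
#$ -cwd -j y -o myjob.log  # run in the current directory, join stderr with stdout, write the log to myjob.log
#
echo started on `hostname` at `date`
# the actual computation goes here
echo done.

Such a job is submitted with:

   % qsub myjob.job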

2. Conceptual Examples

The Conceptual Examples page explains how to submit jobs and how to write simple job scripts.

The page presents:

  1. Trivial Example
  2. Better Example
  3. Example with Embedded Directive
  4. Example with Embedded Directive and Arguments
  5. Note
    1. Miscellaneous
    2. Environment Variables
    3. Catching Time Limits

Note

  1. A job will run in a queue. Each queue has some form of limit:
    • in most cases, a job won't be allowed to run forever, nor grab as much memory as it may want to.
    • How to specify resources and what queues to use is explained elsewhere, in the Available Queues page.

  2. There is some overhead in starting a job, so it is bad practice to submit a slew of very small jobs.
    While you may find it convenient to submit 10,000 five-minute-long jobs, the system will end up taking as much time starting the jobs as the jobs will take to run.
    As a precaution, to prevent clobbering the system, there is a limit on how many jobs a single user can submit to any of the queues (see the explanations in the sections about hardware limits and resource limits).

  3. The cluster is a shared resource:
    • there are limits on how much of the cluster's resources (CPUs, memory, etc) a single user can grab at any time.

  4. In most cases your job script also needs to load a module or a set of modules.
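
    For instance, a job script that needs a specific software package would load the corresponding module before running the computation (a sketch; the module and program names are placeholders, check module avail for the real names):

    A job script that loads a module, using C-shell syntax
    #!/bin/csh
    #$ -N my-analysis -cwd -j y -o my-analysis.log
    #
    # load the required module(s) first (placeholder name)
    module load some/package
    # then run the computation
    ./my-analysis-program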

3. Serial Jobs

  • A serial job is a job that uses only one CPU.
  • It is either started by using a dedicated job script file (i.e.  a different one for each job you need to run),
    or a more sophisticated job script that takes one or several arguments.
  • Refer to the conceptual examples to learn how to submit jobs.
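
For example, a single job script can take the name of an input file as its argument, so the same script file serves every run (a sketch; the script, program and file names are placeholders):

A serial job script that takes one argument, using C-shell syntax
#!/bin/csh
#$ -N crunch -cwd -j y
#
# the value given after the job file name on the qsub line is available as $1
set input = $1
./crunch < $input > $input:r.out

It would then be submitted once per case:

   % qsub crunch.job case1.dat
   % qsub crunch.job case2.dat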

4. Job Arrays

  • Conceptually a job array can be described as something like this:
    "run my computation for 100 cases, identified by a task number ranging from 1 to 100 in steps of 1"; hence
  • a job array is a set of computations, known as tasks, that can run using a single job script file and a single number that identifies each task to be performed.

  • The job script must thus be written to start a specific task using a simple integer as identifier;
    there are plenty of ways to convert a single number into a specific (slew of) parameters.

  • A job array doesn't need to be a serial job, and the job script can take arguments.

  • So instead of queuing, let's say 100 jobs (like 100 bootstraps, 100 light curves to analyze, 100 models to build), one at a time,
    you submit one job and request to have the GE run it 100 times, or for 100 tasks.

  • In fact, you can specify:
    • the starting task number,
    • the ending task number,
    • the task number increment, and, if needed,
    • the maximum number of tasks that should run concurrently.

    The Job Arrays page explains how to submit job arrays.
    • It also shows some tricks to convert a task identifier into a (slew of) parameters, and
    • how to consolidate a large number of small jobs into fewer larger ones.
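
    For instance, the 100-case example above could be submitted as a single job array (a sketch; -t specifies the task range and increment, -tc, where supported, caps the number of tasks running concurrently, and the GE hands each task its number via $SGE_TASK_ID):

    A job array script, using C-shell syntax
    #!/bin/csh
    #$ -N my-array -cwd -j y -o my-array.$TASK_ID.log
    #$ -t 1-100:1 -tc 25
    #
    # $SGE_TASK_ID holds this task's number (1, 2, ..., 100)
    ./crunch case$SGE_TASK_ID.dat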

5. Parallel Jobs

  • A parallel job is a job that uses more than one CPU.
  • Because the cluster is a shared resource, it is the GE that allocates CPUs, known as slots in GE jargon, hence a parallel job must request a set of slots when it is submitted;

  • this is accomplished by specifying -pe <pe-name> N to qsub, where
    • <pe-name> is the name of the parallel environment (PE); and
    • N the number of requested slots.
  • The choice of a parallel environment will determine how the GE will allocate CPUs and how the job gets started.
    The Parallel Jobs page explains how to submit parallel jobs and describes what parallel environments are available.
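
    For example, an MPI job requesting 40 slots could look like this sketch (the PE name mpich and the MPI module name are assumptions; check the Parallel Jobs page for the PEs actually available):

    An MPI job script, using C-shell syntax
    #!/bin/csh
    #$ -N my-mpi-job -cwd -j y -o my-mpi-job.log
    #$ -pe mpich 40    # request 40 slots from the (assumed) mpich PE
    #
    # load the MPI module the program was built with (placeholder name)
    module load some/mpi
    # $NSLOTS is set by the GE to the number of slots allocated
    mpirun -np $NSLOTS ./my-mpi-program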

6. Available Queues

Every job running on the cluster is started in a queue.

  • The GE will select a queue based on the resources requested and the usage in each queue.
  • If you don't specify the right queue or the right resource(s), your job will either
    • not get queued,
    • wait forever and never run, or
    • start and get killed when it exceeds one of the limits of the queue it was started in.

The set of available queues forms a matrix:

  • four sets of queues: a high-CPU and a high-memory set of queues, complemented by a restricted very-high-memory queue and a special restricted queue;
  • the high-CPU and high-memory sets of queues have different time limits: short, medium, long and unlimited.

    Type              Description
    high-CPU          for serial or parallel jobs that do not need a lot of memory
    high-memory       for serial or multi-threaded parallel jobs that require a lot of memory
    very-high-memory  reserved for jobs that need a very large amount of memory
    other             for interactive use or projects that need special resources (SSD, ...)

    The Available Queues page describes the available queues in detail.
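
    You can also list the names of all the queues defined on the cluster with the standard GE command:

       % qconf -sql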

7. Resource Limits

While each queue has a set of limits (CPU, memory), the cluster also has some global limits.

What Are the Resource Limits

There are limits on

  1. how many jobs can be queued simultaneously:
    • there can't be more than 25,000 jobs queued at any time,
    • a single user can't queue more than 2,500 jobs, and
    • a job array can't request more than 10,000 tasks.
  2. how many jobs can run simultaneously, in particular there is:
    • a limit on how many slots a single user can use (name=u_slots value=512), and
    • a limit on how many slots a user can grab in each queue, with fewer slots allowed in queues with longer time limits.
  3. and how much memory can be simultaneously reserved, in particular
    • a limit on how much memory can be reserved by a single user in each queue.

For example, per the quota listing below, a single user can't grab more than 60 slots (or CPUs) in the long-time high-memory queue (lThM.q), nor reserve more than 2,019GB (about 2TB) of memory across the high-memory queues.

The actual limits are subject to change depending on the cluster usage and the hardware configuration.

The more resources a job uses (more CPU time, more memory), the fewer similar jobs a single user can run concurrently; in other words, you can run a lot of small jobs at the same time, but fewer very big/long jobs.

How to Check the Resource Limits

To check the global limits:

   % qconf -sconf global | grep max

and the explanation of these parameters can be found in

   % man 5 sge_conf


To check the queue specific resource limits, use

   % qconf -srqs

qconf -srqs returns something like the following:
{
   name         slots
   description  Limit slots for all users together
   enabled      TRUE
   limit        users * to slots=3964
   limit        users * queues {sThC.q,lThC.q,mThC.q,uThC.q} to slots=3524
   limit        users * queues {sThM.q,mThM.q,lThM.q,uThM.q} to slots=968
   limit        users * queues {uTxlM.rq} to slots=176
}
{
   name         u_slots
   description  Limit slots/user for all queues
   enabled      TRUE
   limit        users {*} to slots=512
}
{
   name         hiCPU_u_slots
   description  Limit slots/user in hiCPU queues
   enabled      TRUE
   limit        users {*} queues {sThC.q} to slots=881
   limit        users {*} queues {mThC.q} to slots=622
   limit        users {*} queues {lThC.q} to slots=293
   limit        users {*} queues {uThC.q} to slots=97
}
{
   name         hiMem_u_slots
   description  Limit slots/user for hiMem queues
   enabled      TRUE
   limit        users {*} queues {sThM.q} to slots=242
   limit        users {*} queues {mThM.q} to slots=121
   limit        users {*} queues {lThM.q} to slots=60
   limit        users {*} queues {uThM.q} to slots=15
}
{
   name         xlMem_u_slots
   description  Limit slots/user for XlMem and special restricted queues
   enabled      TRUE
   limit        users {*} queues {uTxlM.rq} to slots=176
}
{
   name         qrsh_u_slots
   description  Limit slots/user for interactive (qrsh) queues
   enabled      TRUE
   limit        users {*} queues {qrsh.iq} to slots=32
}
{
   name         mem_res
   description  Limit total reserved memory for all users per queue type
   enabled      TRUE
   limit        users * queues {sThC.q,lThC.q,mThC.q,uThC.q} to mem_res=18641G
   limit        users * queues {sThM.q,mThM.q,lThM.q,uThM.q} to mem_res=8077G
   limit        users * queues {uTxlM.rq} to mem_res=4039G
}
{
   name         u_mem_res
   description  Limit reserved memory per user for specific queues
   enabled      TRUE
   limit        users {*} queues {sThC.q,lThC.q,mThC.q,uThC.q} to mem_res=4660G
   limit        users {*} queues {sThM.q,mThM.q,lThM.q,uThM.q} to mem_res=2019G
   limit        users {*} queues {uTxlM.rq} to mem_res=4039G
}
{
   name         gpu
   description  Limit GPUs for all users in GPU queue
   enabled      TRUE
   limit        users * queues {uTGPU.tq} to ngpu=4
}
{
   name         u_gpu
   description  Limit GPUs per user in GPU queue
   enabled      TRUE
   limit        users {*} queues {uTGPU.tq} to ngpu=3
}
{
   name         blast2GO
   description  Limit to set aside a slot for blast2GO
   enabled      TRUE
   limit        users * queues !lTb2g.q hosts {@b2g-hosts} to slots=62
   limit        users * queues lTb2g.q hosts {@b2g-hosts} to slots=1
}

Note that these values get adjusted as needed.

The explanation of the resource quota set (rqs) can be found in

    % man 5 sge_resource_quota


To check how much of these resources (queue quotas) are used overall, or by your job(s), use:

   % qquota

or

   % qquota -u $USER


You can also inquire about a specific resource (qquota -l mem_res), and use the local tool qquota+ (module load tools/local) to

    • get a nicer printout of the reserved memory, and
    • get the % of usage with respect to its limit,

as in

   % qquota+ +% -l slots -u hpc

(more info via qquota+ -help or man qquota+.)

To check the limits of a specific queue (CPU and memory), use

   % qconf -sq sThC.q

and the explanation of these parameters can be found in

   % man 5 queue_conf

under the RESOURCE LIMITS heading.

NOTES

  • You can submit a job and tell the GE to let it start only after another job has completed, using the -hold_jid <jobid> flag to qsub:
    % qsub -N FirstOne pre-process.job
    Your job 12345678 ("FirstOne") has been submitted
    % qsub -hold_jid 12345678 -N SecondOne post-process.job
    Your job 12345679 ("SecondOne") has been submitted 
  • You can be more sophisticated (or use qchain, see below):
Script that submits 3 jobs that must run sequentially, using C-shell syntax
#!/bin/csh
#
set parameter = $1
set name = $2
#
set jid1 = `qsub -terse -N "pre-process-$name" pre-process.job $parameter`
echo $jid1 submitted '("'pre-process-$name'")'
set jid2 = `qsub -terse -hold_jid $jid1 -N "process-$name" process.job $parameter`
echo $jid2 submitted '("'process-$name'")'
set jid3 = `qsub -terse -hold_jid $jid2 -N "post-process-$name" post-process.job $parameter`
echo $jid3 submitted '("'post-process-$name'")'

This example will submit 3 jobs, pre-process.job, process.job and post-process.job, to be run sequentially,

    • each takes one argument, the parameter,

    • and is given a compounded name.
    • The embedded directives in the three job scripts may request different resources, like
      • lots of memory for pre-processing,
      • lots of CPUs for processing, and
      • neither for post processing.
      This way a task is broken up to avoid grabbing more resources than needed at each step.

  • You can use the qchain tool (available after loading the tools/local module) to submit jobs that must run sequentially.

    module load tools/local
    qchain *.job

    will submit the job files that match "*.job" in the order given by "echo *.job".

    By using quotes, as follows:

    module load tools/local
    qchain '-N start first.job 123' '-N crunch second.job 123' '-N post-process finish.job 123'

    qchain allows you to pass arguments to both qsub and the job scripts.

  • You can limit how many jobs you submit with the following trick:

    How to limit the number of jobs submitted, using C-shell syntax
    # define how many jobs to queue
    @ NMAX = 250
    #
    # loop until fewer than $NMAX jobs are left queued or running
    loop:
      @ N = `qstat -u $USER | tail --lines=+3 | wc -l`
      if ($N >= $NMAX) then
        sleep 180
        goto loop
      endif
    #

    This example counts how many jobs you have in the queue (running and waiting) using the command qstat (with tail and wc -l), and pauses for 3 minutes (180 seconds) if that count is 250 or higher.

    You would include these lines in a script that submits a slew of jobs but should not queue more than a given number at any time (to count only the waiting jobs, add -s p to qstat).

  • Or you can use the tool q-wait (needs the module tools/local), which takes an argument and two options:
       % q-wait blah
    will pause until you have no job whose name contains the string 'blah' left queued or running.
  • The options allow you to specify the number of jobs and how often to check, e.g.:
       % q-wait -N 125 -wait 3600 crunch
    will pause until there are 125 or fewer jobs whose name contains the string 'crunch' left queued or running, checking once an hour.

  • Avoid using the -V flag to qsub
    • The -V flag passes all the active environment variables to the script.
    • While it may be convenient in some instances, it creates a dependency on the precise environment configuration at the time the job is submitted,
      so the same job script may fail when it is submitted at a later time (or from a different login) under a different configuration.

8. Examples

You can find examples of simple/trivial test cases with source code, Makefile, job script files and resulting log files under ~hpc/examples.

  • The examples are organized as follows:

    c++11/    example using the C++11 extension
    gpu/      GPU examples and (old) timings
    gsl/      simple test that uses GSL
    hybrid/   examples for using the hybrid PE
    idl/      examples for running IDL/GDL/FL jobs
    java/     example running JAVA
    lapack/   example linking with LAPACK and Intel's MKL
    memtest/  examples for large memory use and reservation
    misc/     miscellaneous
    mpi/      examples using MPI, with each compiler (gcc, Intel, PGI)
              and for various implementations (MVAPICH, OPENMPI)
    openmp/   example using OpenMP
    python/   list of different implementations of PYTHON available on Hydra
    serial/   simple (hello world) serial job, for each compiler (gcc, Intel, PGI)

    You can use the command find to get a list of all the subdirectories under ~hpc/examples, e.g.:

       % find ~hpc/examples -type d -print

9. Help me Choose a Queue and Write a Job Script

QSubGen: a Job Script Generator

There is a web page with an app to help you choose a queue and write a job script, namely how to write the embedded directives and which modules to load.

  • QSubGen: the job script generator (accessible only from a trusted computer or with the VPN on).

What Queue Should I Use?

To choose a queue, you need to know

  1. whether it is a serial (single CPU) or parallel (multiple CPUs) job,
  2. if it is a parallel job, what kind,
  3. how much memory this job will need,
  4. how much CPU time it will require.

Indeed:

If your computation will use      your job script needs to          qsub option needed/recommended
more than one CPU (parallel)      request a PE and N slots          -pe <pe-name> N  or  -pe <pe-name> N-M
more than 2GB/CPU of memory       reserve the required memory       -l mres=X,h_data=X,h_vmem=X
more than 6GB/CPU of memory       use a high-memory queue, and      -l mres=X,h_data=X,h_vmem=X,himem
                                  reserve the required memory
up to T hours of CPU (per CPU)    specify the required amount,      -l s_cpu=T:0:0
                                  or specify the queue              -q mThC.q
no idea how much CPU              use an unlimited, low priority    -q uThC.q -l lopri
                                  queue
  • X can be something like 2GB
  • T can be something like 240 (for 240 hours, or 10 days)
  • You may need to combine PE, memory and CPU resource requests.
  • Remember that the more resources your job requests, the fewer similar jobs can run concurrently at any time.
  • Similar jobs will need similar resources, so when in doubt, and before queuing a slew of similar jobs:
    • run one job and monitor its resource usage, then
    • queue the other jobs after trimming the requested resources (CPU and memory).
      The local tool check-jobs (from the tools/local module) allows you to check the resources consumed by jobs that have completed.
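
Putting it together, a serial high-memory job could combine a queue request and a memory reservation as in this sketch (the values are illustrative and the program name is a placeholder; trim the requests to what a test run actually used):

A high-memory job script, using C-shell syntax
#!/bin/csh
#$ -N big-mem -cwd -j y -o big-mem.log
#$ -q mThM.q                                # medium-time, high-memory queue
#$ -l mres=64G,h_data=64G,h_vmem=64G,himem  # reserve 64GB of memory
#
./my-big-program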

Why Can't I Queue that Job?

There can be different reasons why a job is rejected:

  • an inconsistency in your resource requests, like asking for more CPUs or memory than the limit of a given queue;
  • unavailable resources, like asking for more CPUs or more memory on a single node than exists on any compute node;
  • exceeding resource limits, like asking for more CPUs than are allowed per user in a given queue.

Use the -w v or the -verify flag to qsub (see queue selection validation and/or verification) to check a job script file.

Why Is my Job Queued but not Running?

There can be different reasons why a job remains in the queue:

  • the requested resources are not available, like no compute node currently having the requested number of CPUs or amount of memory free;
  • the user's resource quota has been reached, like the total number of CPUs or amount of memory a single user is allowed to use.

Use the command qconf -srqs or qquota (see how to check them under resource limits).

The local tool check-qwait allows you to visualize the queue quota resources and which jobs are waiting.

