A job is submitted with the command qsub and a job file: qsub, with options (or embedded directives), followed by the name of the file containing the job script.
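For example, a minimal sketch (the job file myjob.job and the program ./mycode are hypothetical):
% qsub -N myjob -cwd -j y -o myjob.log myjob.job
or, equivalently, with the options embedded as directives (lines starting with #$) in the job script itself:
#!/bin/csh
# file myjob.job (hypothetical)
#$ -N myjob
#$ -cwd -j y -o myjob.log
echo started on `hostname` at `date`
./mycode
echo done at `date`
which can then be submitted simply with
% qsub myjob.job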
The Conceptual Examples page explains how to submit jobs and how to write simple job scripts.
Note:
Instead of queuing, let's say, 100 jobs (like 100 bootstraps, 100 light curves to analyze, or 100 models to build) one at a time, you can submit one job and request to have the GE run it 100 times, i.e., for 100 tasks.
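For example, a sketch (the job file analyze.job and the program ./analyze are hypothetical): the -t flag of qsub requests a range of tasks,
% qsub -t 1-100 analyze.job
and within analyze.job each task can use the task index that the GE stores in the environment variable SGE_TASK_ID:
# each task analyzes its own input file
./analyze input.$SGE_TASK_ID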
Because the cluster is a shared resource, it is the GE that allocates CPUs, known as slots in GE jargon; hence a parallel job must request a set of slots when it is submitted, by passing the option
-pe <pe-name> N
to qsub, where <pe-name> is the name of the parallel environment (PE) and N is the number of requested slots.
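For example, a sketch (the PE name mthread and the job file crunch.job are given only as an illustration; use a PE defined on the cluster):
% qsub -pe mthread 16 crunch.job
requests 16 slots from the PE called mthread.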
Every job running on the cluster is started in a queue. The set of available queues is a matrix of queues: the high-CPU and high-memory sets of queues each come in different time limits (short, medium, long and unlimited).
Type | Description |
---|---|
high-CPU | for serial or parallel jobs that do not need a lot of memory |
high-memory | for serial or multi-threaded parallel jobs that require a lot of memory |
very-high-memory | reserved for jobs that need a very large amount of memory |
other | for interactive use or projects that need special resources (SSD, ...) |
The Available Queues page describes the available queues in detail.
While each queue has a set of limits (CPU, memory), the cluster also has some global limits.
There are limits on the number of slots and on the amount of reserved memory, both for all users combined and per user, for each set of queues (e.g., lThM.q). The actual limits are subject to change depending on the cluster usage and the hardware configuration.
To check the global limits:
% qconf -sconf global | grep max
and the explanation of these parameters can be found in
% man 5 sge_conf
To check the queue specific resource limits, use
% qconf -srqs
{
   name         slots
   description  Limit slots for all users together
   enabled      TRUE
   limit        users * to slots=3964
   limit        users * queues {sThC.q,lThC.q,mThC.q,uThC.q} to slots=3524
   limit        users * queues {sThM.q,mThM.q,lThM.q,uThM.q} to slots=968
   limit        users * queues {uTxlM.rq} to slots=176
}
{
   name         u_slots
   description  Limit slots/user for all queues
   enabled      TRUE
   limit        users {*} to slots=512
}
{
   name         hiCPU_u_slots
   description  Limit slots/user in hiCPU queues
   enabled      TRUE
   limit        users {*} queues {sThC.q} to slots=881
   limit        users {*} queues {mThC.q} to slots=622
   limit        users {*} queues {lThC.q} to slots=293
   limit        users {*} queues {uThC.q} to slots=97
}
{
   name         hiMem_u_slots
   description  Limit slots/user for hiMem queues
   enabled      TRUE
   limit        users {*} queues {sThM.q} to slots=242
   limit        users {*} queues {mThM.q} to slots=121
   limit        users {*} queues {lThM.q} to slots=60
   limit        users {*} queues {uThM.q} to slots=15
}
{
   name         xlMem_u_slots
   description  Limit slots/user for XlMem and special restricted queues
   enabled      TRUE
   limit        users {*} queues {uTxlM.rq} to slots=176
}
{
   name         qrsh_u_slots
   description  Limit slots/user for interactive (qrsh) queues
   enabled      TRUE
   limit        users {*} queues {qrsh.iq} to slots=32
}
{
   name         mem_res
   description  Limit total reserved memory for all users per queue type
   enabled      TRUE
   limit        users * queues {sThC.q,lThC.q,mThC.q,uThC.q} to mem_res=18641G
   limit        users * queues {sThM.q,mThM.q,lThM.q,uThM.q} to mem_res=8077G
   limit        users * queues {uTxlM.rq} to mem_res=4039G
}
{
   name         u_mem_res
   description  Limit reserved memory per user for specific queues
   enabled      TRUE
   limit        users {*} queues {sThC.q,lThC.q,mThC.q,uThC.q} to mem_res=4660G
   limit        users {*} queues {sThM.q,mThM.q,lThM.q,uThM.q} to mem_res=2019G
   limit        users {*} queues {uTxlM.rq} to mem_res=4039G
}
{
   name         gpu
   description  Limit GPUs for all users in GPU queue
   enabled      TRUE
   limit        users * queues {uTGPU.tq} to ngpu=4
}
{
   name         u_gpu
   description  Limit GPUs per user in GPU queue
   enabled      TRUE
   limit        users {*} queues {uTGPU.tq} to ngpu=3
}
{
   name         blast2GO
   description  Limit to set aside a slot for blast2GO
   enabled      TRUE
   limit        users * queues !lTb2g.q hosts {@b2g-hosts} to slots=62
   limit        users * queues lTb2g.q hosts {@b2g-hosts} to slots=1
}
Note that these values get adjusted as needed.
The explanation of the resource quota set (rqs) can be found in
% man 5 sge_resource_quota
To check how much of these resources (queue quotas) is used overall, or by your job(s), use:
% qquota
or
% qquota -u $USER
You can also inquire about a specific resource (qquota -l mem_res), and use the local tool qquota+ (module load tools/local), like in
% qquota+ +% -l slots -u hpc
(more info via qquota+ -help or man qquota+).
To check the limits of a specific queue (CPU and memory), use
% qconf -sq sThC.q
and the explanation of these parameters can be found in
% man 5 queue_conf
under the RESOURCE LIMITS
heading.
You can submit a job that will be held until another job has completed, by passing the -hold_jid <jobid> flag to qsub:
% qsub -N FirstOne pre-process.job
Your job 12345678 ("FirstOne") has been submitted
% qsub -hold_jid 12345678 -N SecondOne post-process.job
Your job 12345679 ("SecondOne") has been submitted
This can be scripted, using the -terse flag of qsub to capture each job ID (or use the local tool qchain, see below):
#!/bin/csh
#
set parameter = $1
set name = $2
#
set jid1 = `qsub -terse -N "pre-process-$name" pre-process.job $parameter`
echo $jid1 submitted '("'pre-process-$name'")'
set jid2 = `qsub -terse -hold_jid $jid1 -N "process-$name" process.job $parameter`
echo $jid2 submitted '("'process-$name'")'
set jid3 = `qsub -terse -hold_jid $jid2 -N "post-process-$name" post-process.job $parameter`
echo $jid3 submitted '("'post-process-$name'")'
This example will submit three jobs, pre-process.job, process.job and post-process.job, to be run sequentially; each job takes one argument (the parameter) and is named using the second argument (the name).
You can use the qchain tool (available after loading the tools/local module) to submit jobs that must run sequentially.
module load tools/local
qchain *.job
will submit the job files that match "*.job" in the order given by "echo *.job".
By using quotes, qchain allows you to pass arguments to both qsub and the job scripts, as follows:
module load tools/local
qchain '-N start first.job 123' '-N crunch second.job 123' '-N post-process finish.job 123'
You can limit how many jobs you submit with the following trick:
# define how many jobs to queue
@ NMAX = 250
#
loop:
  @ N = `qstat -u $USER | tail --lines=+3 | wc -l`
  if ($N >= $NMAX) then
    sleep 180
    goto loop
  endif
#
This example counts how many jobs you have in the queue (running and waiting) using the command qstat (with tail and wc -l), and pauses for 3 minutes (180 seconds) if that count is 250 or higher. You would include these lines in a script that submits a slew of jobs but should not queue more than a given number at any time (to count only the queued jobs, add -s p to qstat).
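For instance, a sketch of such a submission script (the input file names and the job file crunch.job are hypothetical); it uses a while loop instead of the goto shown above, but implements the same throttling:
#!/bin/csh
# submit one job per input file, but never keep more than NMAX jobs queued or running
@ NMAX = 250
foreach file (input-*.dat)
  @ N = `qstat -u $USER | tail --lines=+3 | wc -l`
  while ($N >= $NMAX)
    sleep 180
    @ N = `qstat -u $USER | tail --lines=+3 | wc -l`
  end
  qsub -N crunch-$file crunch.job $file
end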
You can also use the local tool q-wait (needs the module tools/local), which takes an argument and two options:
% q-wait blah
waits until no jobs whose name holds 'blah' are left queued or running.
% q-wait -N 125 -wait 3600 crunch
waits until fewer than 125 jobs whose name holds 'crunch' are left queued or running, checking once an hour.
Remember the -V flag to qsub if your job needs your current environment: the -V flag passes all the active environment variables to the script.
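For example, a sketch (the variable MY_THRESHOLD and the job file my-analysis.job are hypothetical):
% setenv MY_THRESHOLD 0.25
% qsub -V my-analysis.job
makes MY_THRESHOLD, along with the rest of your current environment, available to the job script.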
You can find examples of simple/trivial test cases with source code, Makefile, job script files and resulting log files under ~hpc/examples.
The examples are organized as follows:
Directory | Content |
---|---|
 | example using the C++11 extension |
gpu/ | GPU examples and (old) timings |
gsl/ | simple test that uses GSL |
hybrid/ | examples for using the hybrid PE |
idl/ | examples for running IDL/GDL/FL jobs |
java/ | example running JAVA |
lapack/ | example linking with LAPACK and Intel's MKL |
memtest/ | examples for large memory use and reservation |
 | miscellaneous |
mpi/ | examples using MPI |
openmp/ | example using OpenMP |
python/ | list of different implementations of PYTHON available on Hydra |
serial/ | simple (hello world) serial job, for each compiler (gcc, Intel, PGI) |
You can use the command find to get a list of all the subdirectories under ~hpc/examples, i.e.:
% find ~hpc/examples -type d -print
There is a web page with an app to help you choose a queue and write a job script, in particular how to write the embedded directives and how to load modules.
To choose a queue, you need to know what resources your computation will use (number of CPUs, amount of memory, and CPU time). Indeed:
If your computation will use | your job script needs to | qsub option needed/recommended |
---|---|---|
more than one CPU (i.e., a parallel job) | request a PE and N slots | -pe <pe-name> N or -pe <pe-name> N-M |
more than 2GB/CPU of memory | reserve the required memory | -l mres=X,h_data=X,h_vmem=X |
more than 6GB/CPU of memory | use a high-memory queue, and reserve the required memory | -l mres=X,h_data=X,h_vmem=X,himem |
up to T hours of CPU (per CPU) | specify the required amount | -l s_cpu=T:0:0 |
 | or specify the queue | -q mThC.q |
no idea how much CPU | use an unlimited, low priority queue | -q uThC.q -l lopri |
where X can be something like 2GB, and T can be something like 240 (for 240 hours, i.e., 10 days).
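For example, a sketch (the job file big-crunch.job is hypothetical): to reserve 4GB of memory and request up to 120 hours of CPU time for a serial job, you could submit
% qsub -l mres=4G,h_data=4G,h_vmem=4G -l s_cpu=120:0:0 big-crunch.job
following the options listed in the table above.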
The local tool check-jobs allows you to check the resources consumed by jobs that have completed.
There can be different reasons why a job is rejected:
Use the -w v or the -verify flag to qsub (see queue selection validation and/or verification) to check a job script file.
There can be different reasons why a job remains in the queue:
Use the command qconf -srqs or qquota (see how to check under resource limits).
The local tool check-qwait
allows you to visualize the queue quota resources and which jobs are waiting.
Last Updated SGK