- How to Check on Jobs
- How to Delete/Kill a Job or Jobs
- Modifying Queued Jobs
- Checking on a Completed Job
- Additional Tools
After submitting a job, or a set of jobs, with qsub, you can:
- check on your job(s) with the command qstat,
- kill your job(s) with the command qdel,
- alter the requested resources of a queued job with qalter,
- check on a finished job with qacct, or
- use Hydra-specific home-grown tools.
2. How to Check on Jobs
The qstat command returns the status of jobs in the queue (man qstat); here are a few usage examples:
|qstat|shows only your jobs.|
|qstat -u '*'|shows everybody's jobs.|
|qstat -s r|shows only running jobs.|
|qstat -s p|shows only pending jobs.|
|qstat -r|shows also the requested resources and the full job name.|
|qstat -g t|shows master/slave info for parallel jobs.|
|qstat -g d|shows the task ID for array jobs.|
|qstat -j <jobid>|produces a more detailed output, for a specific job ID.|
|qstat -explain E -j <jobid>|produces the explanation for the error state of a specific job, specified by its job ID.|
The job status abbreviations, returned by qstat, correspond to:
|r|running|
|qw|pending (waiting in the queue)|
|t|in transfer (typically from qw to r)|
|Eqw|error and waiting in the queue (for ever)|
|d|marked for deletion|
- Jobs in the Eqw status will not run, and the reason for the error status can be found via qstat -explain E -j <jobid>. These jobs can be deleted with qdel.
- Jobs in the qw status are waiting for the requested resources to become available, or because you have reached some resource limit. These jobs can be deleted with qdel or modified with qalter.
- If your job is lingering in the queue (remains in qw for a long time, i.e., hours to days) and you have not reached a resource limit, you have most likely requested a scarce resource, or more of a resource than is available. In this case, feel free to contact us.
- Jobs in the t status are in most cases about to get started.
- Jobs in the d status are about to be deleted and/or killed.
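To get a quick tally of your jobs by state, you can count the state column of the qstat output; this is a sketch that assumes the default qstat layout (two header lines, state in the fifth column):

```shell
# count your jobs per state (r, qw, Eqw, ...);
# assumes the state is in column 5 of the default qstat output
qstat -u $USER | awk 'NR > 2 {count[$5]++} END {for (s in count) print s, count[s]}'
```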
We provide a tool, q+, that uses
qstat to provide an easier way to monitor jobs and the status of the queue.
3. How to Delete/Kill a Job or Jobs
The qdel command (man qdel) allows you to either:
- delete a job from the queue (for jobs in the "qw" status), or
- kill a running job (for jobs in the "r" status).
You delete/kill a specific job by its job id as follows:
% qdel 4615585
or you can kill all your jobs (queued and running) with
% qdel -u $USER
Here is a trick to kill a slew of jobs, but not all of them:
- use grep to filter the lines from qstat that have your username,
- use awk to produce a list of lines like "qdel <jobid>",
- save the result to a file,
- edit that file to keep only the lines you want, and
- use source to execute each line of that edited file as if you had typed them.
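These steps can be sketched as follows (assuming the job ID is in the first column of the qstat output):

```shell
# 1. keep only the qstat lines with your username,
# 2. turn each into a "qdel <jobid>" line (job ID assumed in column 1),
# 3. save to a file:
qstat | grep $USER | awk '{print "qdel", $1}' > kill.list
# 4. edit kill.list to keep only the jobs you want to kill, then
# 5. execute each remaining line:
source kill.list
```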
You can use grep to further filter the output of qstat, like this:
% qstat | grep $USER | grep rax
This filters the output of qstat for lines with your user name and with the string "rax" to identify a subset of jobs via their job name. You can add "-s r" or "-s p" to qstat to limit the list of job IDs to only your running or pending (queued) jobs.
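For instance, to narrow the list to your running jobs whose name contains the string "rax" (a sketch combining the pieces above):

```shell
# running jobs only (-s r), owned by you, with "rax" in the line
qstat -s r | grep $USER | grep rax
```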
4. Modifying a Queued Job
The qalter command allows you to alter (modify) the properties of a job, namely its requested resources or parameters.
You can alter:
- most of the properties of a queued job, and
- a few of its properties once it has started.
|qalter -q <queue> <jobid>|move a queued job to a different queue.|
|qalter -o <filename> <jobid>|change the name of the job's output file.|
|qalter -m <options> <jobid>|change whether you want email notifications.|
|qalter -l <resources> <jobid>|change the requested amount of CPU.|
where <jobid> is the job ID (as shown by qstat).
5. Checking on a Completed Job
The qacct command shows the GE accounting information and can be used to check on:
- the resources used by a given job that has completed, and
- its exit status.
% qacct -j <jobid>
will list the accounting information for a finished job with the given job ID.
It will show the following useful information:
|qname|name of the queue the job ran in|
|hostname|name of the (master) compute node the job ran on|
|taskid|the task ID (for job arrays)|
|qsub_time|when the job was queued|
|start_time|when the job started|
|end_time|when the job ended|
|granted_pe|what parallel environment was used|
|slots|how many slots were allocated|
|failed|did the job fail to complete (1 means the job did fail, i.e., was killed by the GE because a memory or time limit was exceeded)|
|exit_status|the job script exit status (0 means the job script completed OK)|
|ru_wallclock|wall clock time elapsed, in seconds|
|ru_utime|consumed user time, as reported by the O/S (actual CPU time), in seconds|
|ru_stime|consumed system time, as reported by the O/S (non-CPU time, usually related to I/O or other system waits), in seconds|
|cpu|CPU time, as measured by the GE (ru_utime+ru_stime may not add up to the cpu time), in seconds|
|mem|total memory*time used, in GB seconds: the mean memory usage is mem/cpu, in GB|
|io|a measure of the I/O operations executed by the job|
|maxvmem|the maximum amount of memory used by the job at any time during its execution|
Users who run jobs in the high-memory queues can use this to check whether they used close to the amount of memory they reserved.
The qacct command can also be used to check on the past usage of a given user. For example:
% qacct -d <ndays> -o <username>
will return the usage statistics of the user specified in
<username>, over the past
<ndays> days (use
man qacct for more details).
You can get detailed information for each job that ran with the "-j" option, as follows:
% qacct -d <ndays> -o <username> -j > qacct.log
and save its output, if long, to a file (here qacct.log). You can parse that file with the command egrep.
For example, to check all the jobs the user
hpc ran over the past 3 days:
% qacct -d 3 -o hpc -j > qacct.log
You then parse the output with:
% egrep 'jobname|jobnumber|failed|exit_status|cpu|ru_w|===' qacct.log > qacct-filtered.log
egrep is used to print any line that has one of the strings in the quoted list, separated by '|' ('|' means "or" in this context; see man egrep).
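As a quick sanity check of the alternation, you can try the same kind of pattern on a few sample lines (a hypothetical snippet of qacct-style output):

```shell
# egrep keeps any line matching one of the '|'-separated strings;
# here the "owner" line is dropped, the other two are kept
printf 'jobname  myjob\nowner    hpc\nexit_status 0\n' | egrep 'jobname|exit_status'
```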
This has been streamlined with the local tool check-jobs.pl - see below.
5b. A "better" qacct: qacct+
The accounting stats are ingested into a DB and the DB can be queried with qacct+.
The ingestion runs every few minutes, so the DB is not instantaneously in sync.
6. Additional Tools
There is a set of tools written for Hydra to help monitor jobs and the cluster and do some simple operations.
To access them, you must load the module tools/local:
% module load tools/local
The command
% module help tools/local
lists the available tools (see table below). Each tool has a man page.
A few of these tools are described here in more detail.
6.a A "better" qstat: q+
qstat+.pl is a Perl wrapper around qstat: it runs qstat for you and parses its output to display it in a more friendly format. q+ is a shorthand for qstat+.pl; simply type:
% q+ -help
% q+ -examples
It can be used to
- get useful info like the age and/or the cpu usage (in % of the job age) of running jobs,
- get the list of nodes a parallel job is running on (if you haven't saved it),
- get a filtered version of
qstat -j <jobid>, and
- get an overview of the cluster queues usage.
You can access more information about
q+ with man
qstat+ (once I get to writing the man page).
6.b Checking a Compute Node
rtop+: a script to run the command top on a compute node (aka remote top).
- The Un*x command top can be used to look at what processes are running on a given machine (it reports the "top" processes running at any time).
- To check what processes are running on a compute node (to check CPU and/or memory usage), you can use:
% rtop+ [-u <username>] [-<number>] N-M
% rtop+ -u hpc -50 2-5
In the first case, you will see the first <number> lines listing the processes owned by <username> on compute node N-M; in the second example, the first 50 lines of top on compute node 2-5, limited to user hpc.
If you omit -<number>, you will see only the first 10 lines; if you omit -u <username>, you will see everybody's processes.
Use man top to better understand the output of top.
6.c Checking Memory and CPU Usage of Jobs in the High Memory Queues
plot-qmemuse.pl: a tool to plot the memory and CPU usage of jobs that ran recently or are running in the high memory queues.
We monitor the jobs running in the high-memory queues, taking a usage snapshot every five minutes. This tool only applies to jobs in the high-memory queues.
The resulting statistics can be used to visualize the resources usage of a given job with the command
% plot-qmemuse.pl <jobid>
% plot-qmemuse.pl <jobid>.<taskid>
For that command to run, you must first load the gnuplot module. By default, this tool produces a plot in an 850x850-pixel png file.
You can specify the following options:
|to add your own label on the plot|
|-s <size>|to specify the plot size, in pixels (-s 1200 for a 1200x1200 plot)|
|to specify the name of the png file|
|to plot on the screen|
You can view the plot in the png file with the command display <filename>, assuming that your connection to Hydra allows X-windows; otherwise, you can copy that file to your local machine and view it with your browser or a png-compatible image viewer.
6.d List of Local Tools
The following tools are available when loading the module tools/local:
- backup a file
- format a number with a fixed number of digits
- show the most recent files
- count the number of files in directories
- wait until a given PID has completed
- print the properties of the local machine (#CPUs, memory, OS)
- run tail [options] on a set of files
- report disk usage
- run and parse
- show quota values
- parse the daily quota report
- chain jobs to run sequentially, using the
- a shortcut (link) to
- a wrapper around
- alias to "
- alias to ", will show how long it took to do the stuff
- alias to "source /home/hpc/sbin/noX.xxx", where xxx is sou or sh depending on your shell
- alias to "
- tool to configure
- print the cluster usage (memory and CPUs), for hosts in a specific host group (def:
- print the cluster usage (memory and CPUs), for all or only hosts in given racks
- print the cluster usage (memory and CPUs), aggregated by rack
- display a snapshot of cluster usage, using characters
- plot a snapshot of cluster usage, using
- check the jobs in a (set of) queue(s) for memory use versus reservation and CPU usage efficiency
- show statistics of resource usage for completed jobs
- show job(s) waiting in the queue and the associated queue quota limits
- show how many slots are available for a given host group or a type of queue
- log statistics for
- plot memory use (and efficiency) for recent jobs in the hi-mem queue
- monitor your use of compute-N-M for a given program every TT minutes
- pretty print the content of
Each tool has a man page, accessible after you load the module tools/local.
Some MPI codes, when killed, leave behind zombie processes.
(more on how to kill them)
Last Updated SGK.