- Introduction
- How to Check on Jobs
- How to Delete/Kill a Job or Jobs
- Modifying a Queued Job
- Checking on a Completed Job
- Additional Tools
1. Introduction
After submitting a job, or a set of jobs, with qsub, you can
- check on your job(s) with the command qstat,
- kill your job(s) with the command qdel,
- alter the requested resources of a queued job with qalter,
- check on a finished job with qacct, and
- use Hydra-specific home-grown tools.
2. How to Check on Jobs
The qstat command returns the status of jobs in the queue (see man qstat). Here are a few usage examples:
|qstat -u $USER|shows only your jobs.|
|qstat -u '*'|shows everybody's jobs.|
|qstat -s r|shows only running jobs.|
|qstat -s p|shows only pending jobs.|
|qstat -r|also shows requested resources and the full job name.|
|qstat -g t|shows master/slave info for parallel jobs.|
|qstat -g d|shows the task ID for array jobs.|
|qstat -j <jobid>|produces a more detailed output for a specific job ID.|
|qstat -explain E -j <jobid>|explains the error state of a specific job, specified by its job ID.|
The job status abbreviations returned by qstat correspond to:
|qw|pending (waiting in queue)|
|t|in transfer (typically from pending to running)|
|Eqw|error and waiting in queue (forever)|
|d|marked for deletion|
- Jobs in the Eqw status will not run, and the reason for the error status can be found via
qstat -explain E -j <jobid>
- Jobs in the qw status are waiting, either for the requested resources to become available or because you have reached some resource limit.
- These jobs can be
- If your job is lingering in the queue (remains in qw for a long time, i.e., hours to days) and you have not reached some resource limit, you have most likely requested a scarce resource, or more of a resource than is available. In this case, feel free to contact us.
- These jobs can be
- Jobs in the t status are in most cases about to get started.
- Jobs in the d status are about to be deleted and/or killed.
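The status column can also be inspected from a script. The following is a minimal sketch, assuming the default qstat layout (two header lines, then one job per line with the job ID in column 1 and the state in column 5); the function names jobs_in_state and count_states are made up for this illustration and are not Hydra tools.

```shell
# Print the job IDs whose state column equals the requested state.
# Reads qstat output on stdin; assumes two header lines, the job ID in
# column 1 and the state in column 5 (default qstat layout).
jobs_in_state() {
    awk -v state="$1" 'NR > 2 && $5 == state { print $1 }'
}

# Count the jobs in each state, one "state count" pair per line.
count_states() {
    awk 'NR > 2 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster (not executed here):
#   qstat -u "$USER" | count_states
#   for id in $(qstat -u "$USER" | jobs_in_state Eqw); do
#       qstat -explain E -j "$id"
#   done
```

The commented loop shows one way to ask for the error explanation of every job stuck in Eqw in a single pass.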
We provide a tool, q+, that uses
qstat to provide an easier way to monitor jobs and the status of the queue.
3. How to Delete/Kill a Job or Jobs
The qdel command (see man qdel) allows you to either:
- delete a job from the queue (for jobs in the "qw" status), or
- kill a running job (for jobs in the "r" status).
You delete/kill a specific job by its job id as follows:
% qdel 4615585
or you can kill all your jobs (queued and running) with
% qdel -u $USER
Here is a trick to kill a slew of jobs, but not all of them:
- use grep to filter the lines from qstat that have your username, and then awk to produce a list of lines like "qdel <jobid>",
- save the result to a file,
- edit that file to keep the lines you want, and
- use source to execute each line of that edited file as if you had typed them.
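The steps above can be sketched as a small shell function. This is an illustration under assumptions, not the exact command from this page: the function name make_qdel_list and the file name qdel.sou are hypothetical, and the job ID is assumed to be in column 1 of the qstat output.

```shell
# Turn qstat output (read on stdin) into one "qdel <jobid>" line per line
# that contains the given user name as a whole word.
# Note: grep matches anywhere on the line, so a job name equal to the
# user name would also be selected.
make_qdel_list() {
    grep -w -- "$1" | awk '{ print "qdel", $1 }'
}

# Usage on the cluster (not executed here):
#   qstat -u "$USER" | make_qdel_list "$USER" > qdel.sou
#   (edit qdel.sou and keep only the lines you want)
#   source qdel.sou
```

Writing the list to a file first, rather than piping it straight into a shell, leaves room to review exactly which jobs will be killed.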
You can use grep to better filter the output of qstat: a first grep keeps the lines with your user name, and a second grep, with the string "rax", identifies a subset of jobs via their job name.
You can add "-s r" or "-s p" to qstat to limit the list of job IDs to only your running or pending (queued) jobs.
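A sketch of this two-stage filter follows, assuming the same default qstat layout with the job ID in column 1. The function name select_jobs is hypothetical, and "rax" is only the example job-name string used in the text.

```shell
# Print the IDs of jobs whose qstat line contains both the user name ($1)
# and a job-name string ($2). Reads qstat output on stdin.
# Both greps are plain substring filters on the whole line.
select_jobs() {
    grep -- "$1" | grep -- "$2" | awk '{ print $1 }'
}

# Usage on the cluster (not executed here):
#   qstat -u "$USER" -s r | select_jobs "$USER" rax   # running jobs only
#   qstat -u "$USER" -s p | select_jobs "$USER" rax   # pending jobs only
```

Printing only the job IDs keeps the result easy to check before passing any of them to qdel.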
4. Modifying a Queued Job
The qalter command allows you to alter (modify) the properties of a job, namely its requested resources or parameters.
You can alter
- most of the properties of a queued job;
- and a few of its properties once it has started.
|move a queued job to a different queue with|
|change the name of the job's output file|
|change whether you want email notifications|
|change the requested amount of CPU|
<jobid> is the job ID (see
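As a hedged illustration of the four changes listed above, the sketch below prints candidate qalter command lines instead of running them. The options -q, -o, -m and -l are standard Grid Engine options shared with qsub; the queue name other.q, the file name new_output.log, the resource s_cpu=24:00:00 and the helper names maybe_run and qalter_examples are assumptions made for this example, so check man qalter and the Hydra queue documentation before using real values.

```shell
# Echo a command by default; run it only when RUN=1 is set in the environment.
maybe_run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Candidate qalter calls for one job ID (all values are placeholders).
qalter_examples() {
    jobid=$1
    maybe_run qalter -q other.q "$jobid"          # move to a different queue
    maybe_run qalter -o new_output.log "$jobid"   # change the output file name
    maybe_run qalter -m n "$jobid"                # no email notifications
    maybe_run qalter -l s_cpu=24:00:00 "$jobid"   # change the requested CPU time
}

# Usage (prints the four command lines without contacting the scheduler):
#   qalter_examples 4615585
```

Previewing the command lines first is a cautious way to confirm the job ID and values before altering a queued job.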
5. Checking on a Completed Job
The qacct command shows the GE accounting information and can be used to check on
- the resources used by a given job that has completed, and
- its exit status.
% qacct -j <jobid>
will list the accounting information for a finished job with the given job ID.
It will show the following useful information:
|qname|name of the queue the job ran in|
|hostname|name of the (master) compute node the job ran on|
|taskid|the task ID (for job arrays)|
|qsub_time|when the job was queued|
|start_time|when the job started|
|end_time|when the job ended|
|granted_pe|what parallel environment was used|
|slots|how many slots were allocated|
|failed|did the job fail to complete (1 means the job did fail, i.e., was killed by GE because a memory or time limit was exceeded)|
|exit_status|the job script exit status (0 means the job script completed OK)|
|ru_wallclock|wall clock time elapsed, in seconds|
|ru_utime|consumed user time, as reported by the O/S (actual CPU time), in seconds|
|ru_stime|consumed system time, as reported by the O/S (non-CPU time, usually related to I/O or other system wait), in seconds|
|cpu|CPU time, as measured by the GE (ru_utime+ru_stime may not add up to the cpu time), in seconds|
|mem|total memory*time used, in GB seconds; the mean memory usage is mem/cpu, in GB|
|io|a measure of the I/O operations executed by the job|
|maxvmem|the maximum amount of memory used by the job at any time during its execution|
Users who run jobs in the high memory queues can use maxvmem to check whether they used close to the amount of memory they reserved.
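The mean memory usage mentioned above (mem/cpu) can be computed directly from the accounting record. The sketch below assumes the usual "name value" layout of qacct -j output and a record for a single job; the function name summarize_job is hypothetical.

```shell
# Summarize one job from `qacct -j <jobid>` output read on stdin.
# Prints failed, exit_status, maxvmem and, when cpu is positive, the mean
# memory usage mem/cpu in GB.
summarize_job() {
    awk '
        $1 == "failed"      { failed = $2 }
        $1 == "exit_status" { status = $2 }
        $1 == "cpu"         { cpu = $2 }
        $1 == "mem"         { mem = $2 }
        $1 == "maxvmem"     { maxvmem = $2 }
        END {
            printf "failed=%s exit_status=%s maxvmem=%s", failed, status, maxvmem
            if (cpu > 0) printf " mean_mem=%.3fGB", mem / cpu
            printf "\n"
        }'
}

# Usage on the cluster (not executed here):
#   qacct -j <jobid> | summarize_job
```

Comparing mean_mem and maxvmem against the reserved memory gives a quick sense of whether the reservation was oversized.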
The qacct command can also be used to check on the past usage of a given user. For example:
% qacct -d <ndays> -o <username>
will return the usage statistics of the user specified in
<username>, over the past
<ndays> days (use
man qacct for more details).
You can get detailed information for each job that ran with the "-j" option, as follows:
% qacct -d <ndays> -o <username> -j > qacct.log
(this saves the output, which can be long, to the file qacct.log). You can then parse that file with the command egrep.
For example, to check all the jobs the user
hpc ran over the past 3 days:
% qacct -d 3 -o hpc -j > qacct.log
You then parse the output with:
% egrep 'jobname|jobnumber|failed|exit_status|cpu|ru_w|===' qacct.log > qacct-filtered.log
Here egrep prints any line that contains one of the strings in the quoted list, separated by '|' (which means "or" in this context, see
This has been streamlined with the local tool check-qacct - see additional tools.
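For a quick look at which jobs in such a log did not finish cleanly, the records can be scanned with awk. This is a minimal sketch, not the check-qacct tool: it assumes each record starts with a line of "=" characters and uses the jobname, jobnumber, failed and exit_status fields; the function name list_failed_jobs is hypothetical.

```shell
# Print "jobnumber jobname failed exit_status" for every record in a qacct
# log whose failed or exit_status value is non-zero.
# Takes the log file name(s) as arguments, or reads stdin if none is given.
list_failed_jobs() {
    awk '
        function emit() {
            if (num != "" && (failed != 0 || status != 0))
                print num, name, failed, status
            num = ""; name = ""; failed = 0; status = 0
        }
        /^=+/               { emit() }
        $1 == "jobname"     { name = $2 }
        $1 == "jobnumber"   { num = $2 }
        $1 == "failed"      { failed = $2 }
        $1 == "exit_status" { status = $2 }
        END                 { emit() }
    ' "$@"
}

# Usage on the cluster (not executed here):
#   qacct -d 3 -o hpc -j > qacct.log
#   list_failed_jobs qacct.log
```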
5b. A "better" qacct: qacct+
- The accounting stats are ingested into a database, and that database can be queried with qacct+.
- Look at the relevant section in the additional tools page for instructions on how to use qacct+.
6. Additional Tools
See the additional tools page.
Some MPI codes, when killed, leave behind zombie processes.
(more on how to kill them)
Last Updated SGK.