  1. Introduction
  2. How to Check on Jobs
  3. How to Delete/Kill a Job or Jobs
  4. Modifying a Queued Job
  5. Checking on a Completed Job
  6. Additional Tools
  7. Zombies  

1. Introduction

After submitting a job, or a set of jobs, with qsub, you can

  • check on your job(s) with the command qstat,
  • kill your job(s) with the command qdel,
  • alter the requested resources of a queued job with qalter ,
  • check on a finished job with qacct ,
  • use Hydra-specific home-grown tools.

2. How to Check on Jobs

The qstat command returns the status of jobs in the queue (see man qstat). Here are a few usage examples:

qstat -u $USER                   shows only your jobs.
qstat -u '*'                     shows everybody's jobs.
qstat -s r                       shows only running jobs.
qstat -s p                       shows only pending jobs.
qstat -r                         also shows the requested resources and the full job name.
qstat -s r -u $USER -g t         shows master/slave info for parallel jobs.
qstat -s r -u $USER -g d         shows the task ID for array jobs.
qstat -j 4615585                 produces more detailed output for a specific job ID.
qstat -explain E -j 4615585      explains the error state of a specific job, specified by its job ID.


The job status abbreviations, returned by qstat, correspond to

qw     pending (waiting in queue)
r      running
t      in transfer (typically from qw to r)
Eqw    error and waiting in queue (forever)
d      marked for deletion
  • Jobs in the Eqw status will not run, and the reason for the error status can be found via qstat -explain E -j <jobid>
  • Jobs in the qw status are waiting, either for the requested resources to become available or because you have reached some resource limit.
    • These jobs can be modified with qalter.
    • If your job is lingering in the queue (remains in qw for a long time, i.e., hours to days) and you have not reached some resource limit,
      you have most likely requested a scarce resource, or more of a resource than is available. In this case, feel free to contact us.
  • Jobs in the t status are in most cases about to get started.
  • Jobs in the d status are about to be deleted and/or killed.

We provide a tool, q+, that uses qstat to provide an easier way to monitor jobs and the status of the queue.
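
As a rough illustration of how the status abbreviations can be used in a script, the sketch below tallies jobs by state. It assumes the default qstat layout (two header lines, with the state in the fifth column) and a POSIX shell such as bash; the function name count_states is hypothetical and is not one of the Hydra tools.

```shell
# Hypothetical helper: count jobs per state from qstat output read on stdin.
# Assumes the default qstat layout: two header lines, state in column 5.
count_states() {
  awk 'NR > 2 { n[$5]++ }                      # skip header, tally column 5
       END    { for (s in n) print s, n[s] }' "$@" | sort
}

# Example use (assumption: qstat is in your PATH):
#   qstat -u "$USER" | count_states
```

Note that this counts lines, not tasks: a pending job array appears as a single line in the default qstat output.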

3. How to Delete/Kill a Job or Jobs

The qdel command (man qdel) allows you to either:

  • delete a job from the queue (for jobs in the "qw" status), or
  • kill a running job (for jobs in the "r" status).

You delete/kill a specific job by its job id as follows:

   % qdel 4615585

or you can kill all your jobs (queued and running) with

   % qdel -u $USER


(lightbulb) Here is a trick to kill a slew of jobs, but not all of them:

qstat -u $USER | grep $USER | awk '{print "qdel", $1}' > qdel.sou
[edit the file qdel.sou]
source qdel.sou

This example

  • uses grep to filter the lines from qstat that have your username, and then
  • uses awk to produce a list of lines like "qdel <jobid>", and
  • saves the result to a file (qdel.sou);

you then

  • edit that file to keep the lines you want, and
  • use source to execute each line of that edited file as if you had typed them.

(lightbulb) You can use grep to better filter the output of qstat, like this:

qstat -u $USER | grep $USER | grep rax | awk '{print "qdel", $1}' > qdel.sou
[edit the file qdel.sou]
source qdel.sou

This example

  • filters the output of qstat for lines with your user name and
  • with the string "rax" to identify a subset of jobs via their job name.

(lightbulb) You can add "-s r" or "-s p" to qstat to limit the list of job IDs to only your running or pending (queued) jobs.
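
A variation on the tricks above, offered as a sketch only: instead of matching "rax" anywhere on the line with grep, awk can match it against the job name column alone, so a user name or queue name that happens to contain the string does not produce a match. It assumes the default qstat layout (two header lines, job ID in column 1, job name in column 3) and a POSIX shell such as bash; the function name qdel_by_name is hypothetical.

```shell
# Hypothetical helper: print "qdel <jobid>" lines for jobs whose name matches
# a regular expression. Nothing is deleted; the output is meant to be
# reviewed first, as with qdel.sou above.
# Assumes the default qstat layout: 2 header lines, job ID in $1, name in $3.
qdel_by_name() {
  pattern=$1
  awk -v pat="$pattern" '
    NR > 2 && $3 ~ pat && !seen[$1]++ { print "qdel", $1 }  # one line per job ID
  '
}

# Example use (assumption: qstat is in your PATH):
#   qstat -u "$USER" -s p | qdel_by_name rax > qdel.sou
```

The !seen[$1]++ test keeps one line per job ID, which matters when several tasks of a running job array are listed under the same ID. As before, you would edit qdel.sou and then execute it.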

4. Modifying a Queued Job

The qalter command allows you to alter (modify) the properties of a job, namely its requested resources or parameters.

You can alter

  • most of the properties of a queued job;
  • and a few of its properties once it has started.

For example:

you can                                            with
move a queued job to a different queue             % qalter -q mThC.q <jobid>
change the name of the job's output file           % qalter -o run-3.log <jobid>
change whether you want email notifications        % qalter -m abe <jobid>
change the requested amount of CPU                 % qalter -l s_cpu=240:: <jobid>

where <jobid> is the job ID (see man qalter).

5. Checking on a Completed Job

The qacct command shows the GE accounting information and can be used to check on

  • the resources used by a given job that has completed, and
  • its exit status.

For example:

   % qacct -j <jobid>

will list the accounting information for a finished job with the given job ID.


It will show the following useful information:

qname          name of the queue the job ran in
hostname       name of the (master) compute node the job ran on
taskid         the task ID (for job arrays)
qsub_time      when the job was queued
start_time     when the job started
end_time       when the job ended
granted_pe     what parallel environment was used
slots          how many slots were allocated
failed         whether the job failed to complete (1 means the job did fail, i.e., was killed by GE because a memory or time limit was exceeded)
exit_status    the job script exit status (0 means the job script completed OK)
ru_wallclock   wall clock time elapsed, in seconds
ru_utime       consumed user time, as reported by the O/S (actual CPU time), in seconds
ru_stime       consumed system time, as reported by the O/S (non-CPU time, usually related to I/O, or other system wait), in seconds
cpu            CPU time computed, as measured by the GE (ru_utime+ru_stime may not add to the cpu time), in seconds
mem            total memory*time used, in GB seconds: the mean memory usage is mem/cpu in GB
io             a measure of the I/O operations executed by the job
maxvmem        the maximum amount of memory used by the job at any time during its execution

(lightbulb) Users that run jobs in the high memory queues can use this to check if they used close to the amount of memory they have reserved.
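
The mean memory usage mentioned for the mem field (mem/cpu, in GB) can be computed directly from the qacct output. The following is a minimal sketch, assuming that qacct prints one "name value" pair per line with cpu and mem as plain numbers, and that you use a POSIX shell such as bash; the function name mean_mem is hypothetical.

```shell
# Hypothetical helper: print the mean memory usage (mem/cpu, in GB) of one
# finished job, reading "qacct -j <jobid>" output on stdin or from a file.
# Assumes each field is printed as "name value" on its own line.
mean_mem() {
  awk '$1 == "cpu" { cpu = $2 }
       $1 == "mem" { mem = $2 }
       END { if (cpu > 0) printf "%.3f\n", mem / cpu }' "$@"
}

# Example use (assumption: qacct is in your PATH):
#   qacct -j <jobid> | mean_mem
```

If the output holds several records (for instance one per task of a job array), only the last cpu and mem values are used. Comparing the result to maxvmem gives an idea of how much the peak memory use differs from the average.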


The qacct command can also be used to check on the past usage of a given user. For example:

   % qacct -d <ndays> -o <username>

will return the usage statistics of the user specified in <username>, over the past <ndays> days (use man qacct for more details).


You can get detailed information for each job that ran, with the "-j" option, as follows:

   % qacct -d <ndays> -o <username> -j > qacct.log

and save its output, if long, to a file (qacct.log). You can parse that file with the command egrep. 


For example, to check all the jobs the user hpc ran over the past 3 days:

   % qacct -d 3 -o hpc -j > qacct.log

You then parse the output with:

   % egrep 'jobname|jobnumber|failed|exit_status|cpu|ru_w|===' qacct.log > qacct-filtered.log

(star) The command egrep prints any line that contains one of the strings in the quoted list; the strings are separated by '|', which means "or" in this context (see man egrep).

(lightbulb) This has been streamlined with the local tool check-qacct - see additional tools.
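
To go one step further than egrep, a saved log can be reduced to the jobs that did not finish cleanly. This is a sketch under stated assumptions, not a replacement for check-qacct: it assumes that each record in the log starts with a line of "=" characters and that the jobnumber, failed and exit_status fields each appear as "name value" on their own line. The function name failed_jobs is hypothetical, and the code is written for a POSIX shell such as bash.

```shell
# Hypothetical helper: list the jobs in a saved "qacct -j" log whose failed
# or exit_status field is non-zero.
# Assumes records are separated by lines of "=" characters.
failed_jobs() {
  awk '
    # Print the record collected so far if it shows a problem, then reset.
    function report() {
      if (job != "" && (failed != 0 || status != 0))
        print job, "failed=" failed, "exit_status=" status
      job = ""; failed = 0; status = 0
    }
    /^=+/               { report() }      # a separator starts a new record
    $1 == "jobnumber"   { job = $2 }
    $1 == "failed"      { failed = $2 }   # keep only the numeric code
    $1 == "exit_status" { status = $2 }
    END                 { report() }      # do not forget the last record
  ' "$@"
}

# Example use, with the log file created above:
#   failed_jobs qacct.log
```

Each output line gives the job number followed by its failed and exit_status values; jobs where both are 0 are not listed.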

5b. A "better" qacct: qacct+

  • The accounting stats are ingested into a database, and that database can be queried with qacct+.
  • Look at the relevant section in the additional tools page for instructions on how to use qacct+.

6. Additional Tools

See the additional tools page

7. Zombies

Some MPI codes, when killed, leave behind zombie processes.

(more on how to kill them)



Last Updated   SGK.
