  1. Introduction
  2. How to Check on Jobs
  3. How to Delete/Kill a Job or Jobs
  4. Modifying a Queued Job
  5. Checking on a Completed Job
  6. Additional Tools
  7. Zombies

1. Introduction

After submitting a job, or a set of jobs, with qsub, you can

  • check on your job(s) with the command qstat,
  • kill your job(s) with the command qdel,
  • alter the requested resources of a queued job with qalter,
  • check on a finished job with qacct,
  • use Hydra-specific home-grown tools.

2. How to Check on Jobs

The qstat command returns the status of jobs in the queue (man qstat); here are a few usage examples:

   qstat -u $USER                  shows only your jobs.
   qstat -u '*'                    shows everybody's jobs.
   qstat -s r                      shows only running jobs.
   qstat -s p                      shows only pending jobs.
   qstat -r                        also shows requested resources and the full job name.
   qstat -s r -u $USER -g t        shows master/slave info for parallel jobs.
   qstat -s r -u $USER -g d        shows the task ID for array jobs.
   qstat -j 4615585                produces more detailed output for a specific job ID.
   qstat -explain E -j 4615585     produces the explanation for the error state of a specific job, specified by its job ID.

 

The job status abbreviations, returned by qstat, correspond to

   qw     pending (waiting in the queue)
   r      running
   t      in transfer (typically from qw to r)
   Eqw    error and waiting in the queue (for ever)
   d      marked for deletion
  • Jobs in the Eqw status will not run, and the reason for the error status can be found via qstat -explain E -j <jobid>
  • Jobs in the qw status are waiting for the requested resources to become available, or because you have reached some resource limit
    • These jobs can be qalter'ed
    • If your job is lingering in the queue (remains in qw for a long time, i.e., hours to days) and you have not reached some resource limit,
      you have most likely requested a scarce resource, or more of a resource than is available. In this case, feel free to contact us.
  • Jobs in the t status are in most cases about to get started.
  • Jobs in the d status are about to be deleted and/or killed.

We provide a tool, q+, that uses qstat to offer an easier way to monitor jobs and the status of the queue.

3. How to Delete/Kill a Job or Jobs

The qdel command (man qdel) allows you to either:

  • delete a job from the queue (for jobs in the "qw" status), or
  • kill a running job (for jobs in the "r" status).

You delete/kill a specific job by its job ID as follows:

   % qdel 4615585

or you can kill all your jobs (queued and running) with

   % qdel -u $USER
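
(lightbulb) qdel also accepts several job IDs at once, so you can delete a handful of specific jobs in one call (the job IDs below are just illustrative):

   % qdel 4615585 4615586 4615590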


(lightbulb) Here is a trick to kill a slew of jobs, but not all of them:

qstat -u $USER | grep $USER | awk '{print "qdel", $1}' > qdel.sou
[edit the file qdel.sou]
source qdel.sou

This example

  • uses grep to filter the lines from qstat that have your username, and then
  • uses awk to produce a list of lines like "qdel <jobid>", and
  • saves the result to a file (qdel.sou);

you then

  • edit that file to keep the lines you want, and
  • use source to execute each line of that edited file as if you had typed them.
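
After editing, the file qdel.sou might look like this (the job IDs are just illustrative):

   qdel 4615585
   qdel 4615588
   qdel 4615591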

(lightbulb) You can use grep to better filter the output of qstat, like this:

qstat -u $USER | grep $USER | grep rax | awk '{print "qdel", $1}' > qdel.sou
[edit the file qdel.sou]
source qdel.sou

This example

  • filters the output of qstat for lines with your user name and
  • with the string "rax" to identify a subset of jobs via their job name.

(lightbulb) You can add "-s r" or "-s p" to qstat to limit the list of job IDs to only your running or pending (queued) jobs.
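
(lightbulb) If you are confident that you do not need to review the list first, a one-liner sketch of the same idea pipes the job IDs straight into qdel via xargs:

   qstat -u $USER -s r | grep $USER | grep rax | awk '{print $1}' | xargs qdel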

4. Modifying a Queued Job

The qalter command allows you to alter (modify) the properties of a job, namely its requested resources or parameters.

You can alter

  • most of the properties of a queued job;
  • and a few of its properties once it has started.

For example:

   you can                                          with
   move a queued job to a different queue           % qalter -q mThC.q <jobid>
   change the name of the job's output file         % qalter -o run-3.log <jobid>
   change whether you want email notifications      % qalter -m abe <jobid>
   change the requested amount of CPU               % qalter -l s_cpu=240:: <jobid>

where <jobid> is the job ID (see man qalter).
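
You can also combine several of these changes in a single call; for example (the queue name, file name, and CPU time below are just illustrative):

   % qalter -q mThC.q -o run-4.log -l s_cpu=480:: <jobid>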

5. Checking on a Completed Job

The qacct command shows the GE accounting information and can be used to check on

  • the resources used by a given job that has completed, and
  • its exit status.

For example:

   % qacct -j <jobid>

will list the accounting information for a finished job with the given job ID.

 

It will show the following useful information:

   qname           name of the queue the job ran in
   hostname        name of the (master) compute node the job ran on
   taskid          the task ID (for job arrays)
   qsub_time       when the job was queued
   start_time      when the job started
   end_time        when the job ended
   granted_pe      what parallel environment was used
   slots           how many slots were allocated
   failed          whether the job failed to complete (1 means the job did fail, i.e., was killed by the GE because it exceeded a memory or time limit)
   exit_status     the job script's exit status (0 means the job script completed OK)
   ru_wallclock    wall clock time elapsed, in seconds
   ru_utime        consumed user time, as reported by the O/S (actual CPU time), in seconds
   ru_stime        consumed system time, as reported by the O/S (non-CPU time, usually related to I/O or other system waits), in seconds
   cpu             CPU time as measured by the GE (ru_utime+ru_stime may not add up to the cpu value), in seconds
   mem             total memory*time used, in GB seconds: the mean memory usage is mem/cpu, in GB
   io              a measure of the I/O operations executed by the job
   maxvmem         the maximum amount of memory used by the job at any time during its execution

(lightbulb) Users who run jobs in the high-memory queues can use this to check whether they used close to the amount of memory they reserved.
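
For example, to pull out just the memory- and status-related fields of a completed job, you can combine qacct with egrep (a sketch; the field names are those listed above):

   % qacct -j <jobid> | egrep 'maxvmem|ru_wallclock|exit_status|failed'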

 

The qacct command can also be used to check on the past usage of a given user. For example:

   % qacct -d <ndays> -o <username>

will return the usage statistics of the user specified in <username>, over the past <ndays> days (use man qacct for more details).

 

You can get detailed information for each job that ran with the "-j" option, as follows:

   % qacct -d <ndays> -o <username> -j > qacct.log

and save its output, which can be long, to a file (qacct.log). You can then parse that file with the command egrep.


For example, to check all the jobs the user hpc ran over the past 3 days:

   % qacct -d 3 -o hpc -j > qacct.log

You then parse the output with:

   % egrep 'jobname|jobnumber|failed|exit_status|cpu|ru_w|===' qacct.log > qacct-filtered.log

(star) The command egrep prints any line that contains one of the strings in the quoted list, separated by '|' (which means "or" in this context; see man egrep).

 

(lightbulb) This has been streamlined with the local tool check-jobs.pl - see below.

5b. A "better" qacct: qacct+

The accounting stats are ingested into a DB and the DB can be queried with qacct+.

The ingestion is run every few minutes, so the DB is not instantaneously synch'd.

qacct+ -help
qacct+ [options] where options are

  -j     <jobid>    limit to a given job ID
  -t     <taskid>   limit to a given task ID, or a range of task ID (if n:m)
  -o     <owner>    limit to a given owner
  -q     <queue>    limit to given queue        (regexp ok, like ".ThC.q")
  -pe    <PE>       limit to given PE           (like mthread)
  -from  <date>     limit to submitted >= date,   like "4/27/2016 10:00AM"
  -to    <date>     limit to submitted <= date,   like "4/27/2016 11:00AM"

  -max   <number>   max number of entries to show
  -wdt   <number>   width of columns in tabular mode (def: 15)
  -f     <dbname>   use the given DB file name, (def: /home/hpc/cron/qacct/dbs/qacct.db)

  -v                verbose mode
  -n                dry run, use w/ -v to check the resulting SQL search
  -help             shows this help

  -show  <string>   specify what to show        (def.: -show simple)
                    where <string> can be "simple", "simple+", "tab", or "raw",
                                       or "stats", "help" or "fields", or a custom specification
                    use -show help to get additional help on this option
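
For example, based on the options listed above, a plausible invocation to look at your own recent jobs would be (the date and the number of entries are just illustrative):

   % qacct+ -o $USER -from "4/27/2016 10:00AM" -max 20 -show simple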


6. Additional Tools

There is a set of tools written for Hydra to help monitor jobs and the cluster, and to perform some simple operations.

To access them you must load the module tools/local. The command

   % module help tools/local

lists the available tools (see the table below). Each tool has a man page.
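
For example, to load the module and then read the man page of one of the tools (the tool name here is just an example):

   % module load tools/local
   % man rtop+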

A few of these tools are described here in more detail.

6.a A "better" qstat: q+

qstat+.pl is a Perl wrapper around qstat: it runs qstat for you and parses its output to display it in a more friendly format.

q+ is a shorthand for qstat+.pl. To learn how to use qstat+.pl simply type:

   % q+ -help

or

   % q+ -examples

It can be used to

  • get useful info like the age and/or the cpu usage (in % of the job age) of running jobs,
  • get the list of nodes a parallel job is running on (if you haven't saved it),
  • get a filtered version of qstat -j <jobid>, and
  • get an overview of the cluster's queue usage.

You can access more information about q+ with man qstat+ (once I get to writing the man page).

6.b Checking a Compute Node

rtop+: a script to run the command top on a compute node (aka remote top).

  • The Un*x command top can be used to look at what processes are running on a given machine (it reports the "top" processes running at any time).
  • To check what processes are running on a compute node (e.g., to check CPU and/or memory usage), you can use: 
       % rtop+ [-u <username>] [-<number>] N-M
    like in 
       % rtop+ -u hpc -50 2-5

    and you will see <number> lines listing the processes owned by <username> on the compute node compute-N-M, or,
    as in the second example, the first 50 lines of top output, limited to user hpc, on compute-2-5.
    If you omit -<number> you will see only the first 10 lines; if you omit -u <username> you will see everybody's processes.

Check man top to better understand the output of top.

6.c Checking Memory and CPU Usage of Jobs in the High Memory Queues

plot-qmemuse.pl: a tool to plot the memory and CPU usage of jobs that ran recently or are running in the high memory queues.

We monitor the jobs running in the high-memory queues, taking a usage snapshot every five minutes. This tool only applies to jobs in the high-memory queues.

The resulting statistics can be used to visualize the resource usage of a given job with the command plot-qmemuse.pl, using

   % plot-qmemuse.pl <jobid>

 or

   % plot-qmemuse.pl <jobid>.<taskid>

For that command to run, you must first load the gnuplot module. By default this tool produces a plot in an 850x850 pixel png file.

You can specify the following options:

   -l <label>       to add your own label on the plot
   -s <size>        to specify the plot size, in pixels (-s 1200 for a 1200x1200 plot)
   -o <filename>    to specify the name of the png file
   -x               to plot on the screen (using X11, assuming your connection to hydra allows X11)
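
For example, loading gnuplot and then producing a larger plot saved under a custom name (the exact gnuplot module name and the output file name are illustrative; check module avail gnuplot):

   % module load gnuplot
   % plot-qmemuse.pl -s 1200 -o my-job-memuse.png <jobid>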

You can view the plot in the png file with the command display <filename>, assuming that your connection to hydra allows X11,
or you can copy that file to your local machine and view it with your browser or a png-compatible image viewer (like xv).

6.d List of Local Tools

The following tools are available after loading the module tools/local:

   backup                  backup a file
                           ie: mv or cp file to file.<n>
   fixFmt                  format a number with a fixed number of digits
                           ie: (csh):    @ n = 1; set N = `fixFmt 3 $n`
                               ([ba]sh): n=1; N=`fixFmt 3 $n`
                               echo n=$n N=$N → n=1 N=001
   lsth                    show most recent files
                           ie: lsth [-40] [<spec>] → ls -lt <spec> | head -40
   lswc                    count number of files in directories
                           ie: lswc dir/ → ls dir/ | wc -l
   pawk                    print with awk
                           ie: pawk 1,hello,3 → awk '{print $1,"hello",$3}'
   p-wait                  wait until the given PID has completed
                           ie: p-wait [check-time] <PID>
   q-wait                  wait until N jobs in queue
                           ie: q-wait [options] <egrep-string>
   procinfo                print properties of the local machine (#CPUs, memory, OS)
   tails                   run tail [options] on a set of files
                           ie: "tail -3 *pl" fails, "tails -3 *pl" OK
   disk-usage.pl           report disk usage
   dus-report.pl           run and parse du to produce a disk usage report
   show-quotas.pl          show quota values
   parse-quota-report.pl   parse the daily quota report
   qchain                  chain jobs to run sequentially, using the -hold_jid mechanism of qsub
                           ie: qchain *.job
                           or  qchain '-N start first.job 123' '-N crunch reduce.job 123' '-N finish post-process.job 123'
   qstat+.pl               a "better" qstat
   q+                      a shortcut (link) to qstat+.pl
   qhost+                  wrapper around qhost to use the shorthand N-M
   qquota.pl               a wrapper around qquota
   qacct+                  a "better" qacct
   rkill                   remote kill
                           ie: rkill 2-3 4325 → ssh -x compute-2-3 kill 4325
   rkillall                remote killall
                           ie: rkillall 2-3 crunch → ssh -x compute-2-3 killall crunch
   rtop+                   remote top
                           ie: rtop+ -u hpc -30 2-3 → ssh -x compute-2-3 top -b -n 1 -u hpc | head -30
   gethr                   alias to "source /home/hpc/sbin/get-hr.xxx",
                           where xxx is sou or sh depending on your shell
   get-hr.sou              csh tool to retrieve a job's hard resources (mem_res h_data h_vmem) in a job script
   get-hr.sh               [ba]sh version
   elapsed                 alias to "source /home/hpc/sbin/elapsed.xxx",
                           where xxx is sou or sh depending on your shell
                           ie: elapsed; [do stuff]; elapsed
                           will show how long it took to do the stuff
   elapsed.sou             csh tool to monitor elapsed time
   elapsed.sh              [ba]sh tool to monitor elapsed time
   noX                     alias to "source /home/hpc/sbin/noX.xxx",
                           where xxx is sou or sh depending on your shell
   noX.sou                 csh tool to unset DISPLAY (saved in XDISPLAY)
   noX.sh                  [ba]sh tool to unset DISPLAY (saved in XDISPLAY)
   useX                    alias to "source $bin/useX.xxx",
                           where xxx is sou or sh depending on your shell
   useX.sou                csh tool to reset DISPLAY (from XDISPLAY)
   useX.sh                 [ba]sh tool to reset DISPLAY (from XDISPLAY)
   xterm-config.pl         tool to configure xterm window properties
   check-memuse.pl         print the cluster usage (memory and CPUs), for hosts in a specific host group (def: @himem-hosts)
   ckmem.pl                print the cluster usage (memory and CPUs), for all hosts or only hosts in given racks
   check-hosts.pl          print the cluster usage (memory and CPUs), aggregated by rack
   qsnapshot.pl            display a snapshot of cluster usage, using characters
   plot-qsnapshot.pl       plot a snapshot of cluster usage, using gnuplot
   qckmem.pl               check the jobs in a (set of) queue(s) for memory use versus reservation and CPU usage efficiency
   check-jobs.pl           show statistics of resource usage for completed jobs
   check-qwait             show job(s) waiting in the queue and the associated queue quota limits
   qavail.pl               show how many slots are available for a given host group or a type of queue
   qmemuse.pl              log statistics for plot-qmemuse.pl
   plot-qmemuse.pl         plot memory use (and efficiency) for recent jobs in the hi-mem queues
   monitor-mycode.csh      monitor your use of compute-N-M for a given program every TT minutes
   print-proc-memory.pl    pretty print the content of /proc/memory

Each tool has a man page, accessible after you load the module tools/local.

7. Zombies

Some MPI codes, when killed, leave behind zombie processes.

(more on how to kill them)


Last Updated  SGK.
