  1. Monitoring your Jobs
  2. How to Check on Jobs
  3. How to Delete/Kill a Job or Jobs
  4. Modifying Queued Jobs
  5. Checking on a Completed Job
  6. Additional Tools
  7. List of Local Tools
  8. Zombies  

1. Introduction

...

   qstat -u $USER                 shows only your jobs.
   qstat -u '*'                   shows everybody's jobs.
   qstat -s r                     shows only running jobs.
   qstat -s p                     shows only pending jobs.
   qstat -r                       also shows the requested resources and the full job name.
   qstat -s r -u $USER -g t       shows master/slave info for parallel jobs.
   qstat -s r -u $USER -g d       shows the task ID for array jobs.
   qstat -j 4615585               produces more detailed output for a specific job ID.
   qstat -explain E -j 4615585    explains the error state of a specific job, specified by its job ID.
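
As an illustration of working with the qstat output listed above, here is a minimal sketch that counts jobs per state. It assumes the default qstat layout (two header lines, with the job state in the fifth column) and job names without spaces; the function name count_job_states is made up for this example and is not one of the local tools.

```shell
# Count jobs per state from qstat output read on stdin.
# Assumption: default layout, i.e. two header lines and the state in column 5.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Typical use on the cluster (skipped when qstat is not installed):
if command -v qstat >/dev/null 2>&1; then
    qstat -u "$USER" | count_job_states
fi
```

Each output line gives a state abbreviation followed by the number of jobs in that state.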

...


The job status abbreviations, returned by qstat, correspond to

...

will list the accounting information for a finished job with the given job ID.

 


It will show the following useful information:

...

(lightbulb) Users who run jobs in the high memory queues can use this to check whether they used close to the amount of memory they reserved.


The qacct command can also be used to check on the past usage of a given user. For example:

...

will return the usage statistics of the user specified in <username>, over the past <ndays> days (use man qacct for more details). 


You can get detailed information for each job that ran with the "-j" option, as follows:

...

(star) The command egrep prints every line that contains at least one of the strings in the quoted list; the strings are separated by '|', which means "or" in this context (see man egrep).
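
To make the '|' alternation concrete, here is a small self-contained sketch. It uses grep -E, which is the equivalent modern spelling of egrep. The sample lines and the chosen field names are illustrative only and are not real accounting output; filter_fields is a hypothetical helper name.

```shell
# Keep only the lines that contain one of several strings.
# The field names below are chosen for illustration.
filter_fields() {
    grep -E 'jobname|maxvmem|ru_wallclock'
}

# Illustrative input: four lines, three of which match the pattern.
printf '%s\n' \
    'jobname      test.job' \
    'owner        someuser' \
    'ru_wallclock 120' \
    'maxvmem      1.5G' | filter_fields
```

Only the line starting with owner is dropped, since it matches none of the three alternatives.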


(lightbulb) This has been streamlined with the local tool check-jobs.pl - see below.

5b. A "better" qacct: qacct+

...

6. Additional Tools

There is a set of tools written for Hydra to help monitor jobs and the cluster and do some simple operations.

To access them you must load the module tools/local. The command

   % module help tools/local

lists the available tools (see table below). Each tool has a man page.

A few of these tools are described here in more detail.

6.a A "better" qstat: q+

qstat+.pl is a Perl wrapper around qstat: it runs qstat for you and parses its output to display it in a friendlier format.

...

   % q+ -help

or

   % q+ -examples

It can be used to

  • get useful info like the age and/or the cpu usage (in % of the job age) of running jobs,
  • get the list of nodes a parallel job is running on (if you haven't saved it),
  • get a filtered version of qstat -j <jobid>, and
  • get an overview of the cluster queues usage.

You can access more information about q+ with man qstat+ (once I get to writing the man page).
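
As a worked example of "cpu usage in % of the job age", the sketch below divides the CPU time consumed by the wall-clock age of the job. This is an assumption about what the percentage means, not a description of how q+ computes it; the function name cpu_percent_of_age is hypothetical.

```shell
# Express CPU time as a percentage of the job age.
# Assumption: a simple ratio, with both values given in seconds.
cpu_percent_of_age() {
    # $1: CPU time used (s); $2: job age, i.e. wall-clock time since start (s)
    awk -v cpu="$1" -v age="$2" \
        'BEGIN { if (age > 0) printf "%.1f\n", 100 * cpu / age; else print "n/a" }'
}

# A job that is 2 hours old and has used 1.5 hours of CPU time:
cpu_percent_of_age 5400 7200    # prints 75.0
```

For a single-threaded job, a value well below 100 suggests the job spends much of its time waiting (for example on I/O).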

6.b Checking a Compute Node

rtop+: a script to run the command top on a compute node (aka remote top).

...

See the additional tools page


Check man top to better understand the output of top.

6.c Checking Memory and CPU Usage of Jobs in the High Memory Queues

plot-qmemuse.pl: a tool to plot the memory and CPU usage of jobs that ran recently or are running in the high memory queues.

We monitor the jobs running in the high-memory queues, taking a usage snapshot every five minutes. This tool only applies to jobs in the high-memory queues.

The resulting statistics can be used to visualize the resource usage of a given job with the command plot-qmemuse.pl, using

   % plot-qmemuse.pl <jobid>

 or

   % plot-qmemuse.pl <jobid>.<taskid>

For that command to run, you must load the gnuplot module first. By default this tool produces a plot in an 850x850 pixel png file.

You can specify the following options:

   -l <label>       to add your own label on the plot
   -s <size>        to specify the plot size, in pixels (-s 1200 for a 1200x1200 plot)
   -o <filename>    to specify the name of the png file
   -x               to plot on the screen (using X11, assuming your connection to hydra allows X11)

You can view the plot in the png file with the command display <filename>, assuming that your connection to hydra allows X11, or you can copy that file to your local machine and view it with your browser or a png-compatible image viewer (like xv).
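
The options above can be combined on one command line. The sketch below is a dry-run helper that only prints such a command line, so it can be tried anywhere. It assumes that options go before the job ID and that the gnuplot module is simply named gnuplot; the output file name memuse-<jobid>.png and the function name plot_cmd are hypothetical choices made for this example.

```shell
# Dry-run helper: print a plot-qmemuse.pl command line for one job.
# Assumptions: options are placed before the job ID; the output file
# name memuse-<jobid>.png is a hypothetical naming choice.
plot_cmd() {
    # $1: job ID (or <jobid>.<taskid>); $2: plot size in pixels (default 850)
    echo "plot-qmemuse.pl -s ${2:-850} -o memuse-$1.png $1"
}

# On the cluster you would first load the modules, e.g.
#   module load tools/local
#   module load gnuplot        (module name assumed)
# and then run the printed command.
plot_cmd 4615585 1200
```

Check man plot-qmemuse.pl for the exact syntax before relying on this form.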

...

The following tools are available when loading the module tools/local:

...

   backup                   back up a file
   fixFmt                   format a number with a fixed number of digits

   lsth                     show most recent files
   lswc                     count number of files in directories
   pawk                     print with awk
   p-wait                   wait until given PID has completed
   q-wait                   wait until N jobs in queue
   procinfo                 print properties of local machine (#CPUs, memory, OS)
   tails                    run tail [options] on a set of files
   disk-usage.pl            report disk usage
   dus-report.pl            run and parse du to produce a disk usage report
   show-quotas.pl           show quota values
   parse-quota-report.pl    parse the daily quota report

...

                            e.g.: qchain *.job
                            or qchain '-N start first.job 123' '-N crunch reduce.job 123' '-N finish post-process.job 123'

...

   qstat+.pl                a "better" qstat
   q+                       a shortcut (link) to qstat+.pl
   qhost+                   wrapper around qhost to use the shorthand N-M
   qquota.pl                a wrapper around qquota
   qacct+                   a "better" qacct
   rkill                    remote kill
   rkillall                 remote killall
   rtop+                    remote top
   gethr                    alias to "source /home/hpc/sbin/get-hr.xxx",
                            where xxx is sou or sh depending on your shell
   get-hr.sou               csh tool to retrieve a job's hard resources (mem_res h_data h_vmem) in a job script
   get-hr.sh                [ba]sh version
   elapsed                  alias to "source /home/hpc/sbin/elapsed.xxx",
                            where xxx is sou or sh depending on your shell
                            e.g.: elapsed; [do stuff]; elapsed
                            will show how long it took to do the stuff
   elapsed.sou              csh tool to monitor elapsed time
   elapsed.sh               [ba]sh tool to monitor elapsed time
   noX                      alias to "source /home/hpc/sbin/noX.xxx",
                            where xxx is sou or sh depending on your shell
   noX.sou                  csh tool to unset DISPLAY (saved in XDISPLAY)
   noX.sh                   [ba]sh tool to unset DISPLAY (saved in XDISPLAY)
   useX                     alias to "source $bin/useX.xxx",
                            where xxx is sou or sh depending on your shell
   useX.sou                 to reset DISPLAY (from XDISPLAY)
   useX.sh                  to reset DISPLAY (from XDISPLAY)
   xterm-config.pl          tool to configure xterm window properties
   check-memuse.pl          print the cluster usage (memory and CPUs), for hosts in a specific host group (def: @himem-hosts)
   ckmem.pl                 print the cluster usage (memory and CPUs), for all hosts or only hosts in given racks
   check-hosts.pl           print the cluster usage (memory and CPUs), aggregated by rack
   qsnapshot.pl             display a snapshot of cluster usage, using characters
   plot-qsnapshot.pl        plot a snapshot of cluster usage, using gnuplot
   qckmem.pl                check the jobs in a (set of) queue(s) for memory use versus reservation and CPU usage efficiency
   qavail.pl                show how many slots are available for a given host group or a type of queue
   qmemuse.pl               log statistics for plot-qmemuse.pl
   plot-qmemuse.pl          plot memory use (and efficiency) for recent jobs in the hi-mem queues
   monitor-mycode.csh       monitor your use of compute-N-M for a given program every TT minutes
   print-proc-memory.pl     pretty-print the content of /proc/memory

...

Each tool has a man page, accessible after you load the module tools/local.
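
The "elapsed; [do stuff]; elapsed" pattern listed for the elapsed tool can be approximated in a plain sh/bash script when the module is not loaded. This is a minimal sketch, not the implementation of elapsed.sh; the function name elapsed_since is made up for this example, and it assumes your date command supports the +%s format.

```shell
# Print the number of whole seconds elapsed since a saved start time.
elapsed_since() {
    # $1: start time, in seconds since the epoch (from `date +%s`)
    now=$(date +%s)
    echo "$((now - $1))"
}

t0=$(date +%s)                  # remember when we started
sleep 1                         # stand-in for "[do stuff]"
echo "elapsed: $(elapsed_since "$t0") s"
```

The resolution is one second, which is usually enough for timing steps inside a job script.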

7. Zombies

Some MPI codes, when killed, leave behind zombie processes.

(more on how to kill them)



...

Last Updated  SGK.