Introduction

We offer a set of tools, written for Hydra, to help monitor jobs and the cluster, and to perform some simple operations.
With the Hydra-5 upgrade, the names of these tools were rationalized and the tools were split into two groups.
To access them you must load the module tools/local and/or the module tools/local+.
The commands

% module help tools/local

and

% module help tools/local+

list the available tools (see the tables below).

Each tool has a man page.
A few of these tools are described here in more detail.
The module tools/local-bc offers backward compatibility (hence the -bc) with the local tools on Hydra-4.

1 A "better" qstat: qstat+

qstat+ is a Perl wrapper around qstat: it runs qstat for you and parses its output to display it in a friendlier format.
q+ is a shorthand for qstat+.
How to use qstat+ is explained in the man page (man qstat+) and is described by:

% qstat+ -help

or

% qstat+ -examples

It can be used to
- get useful info like the age and/or the cpu usage (in % of the job age) of running jobs,
- get the list of nodes a parallel job is running on (if you haven't saved it),
- get a filtered version of qstat -j <jobid>, and
- get an overview of the cluster queues usage.

Namely:

qstat+ -help

usage: qstat+ [mode] [options]
  where modes are (exclusive list):
    -s|-sx       show summary (two diff. format)
    -a           show all jobs
    -q           show queued jobs
    -r           show running jobs
    -X           show extra jobs (dr, Eqw, t)
    -j      JID  show info on a specific job
    -nlist  JID  show node list for a (set of) job(s)
    -hiload N    show node(s) with high load
    -gc          show global cluster status
    -es[x]       show empty slots, -esx: expanded info
    -down        show node(s) that are down
    -ores        show overreserved jobs, i.e.: resMem/maxVMem > 2.5, & age > 1 hr
    -osub        show oversubscribed jobs, i.e.: cpu% > 133%, & age > 1 hr
    -ineff       show inefficient jobs, i.e.: cpu% < 33%, & age > 1 hr
    -help        show help
    -examples    show examples

where options are:
    -u USER       limit to the specified user, you can use "*" or "all" to list everybody's jobs
                  or a coma-separated list
    -njobs        show counts in no. of jobs
    -npes         show counts in no. of PEs (slots)
    -nqpe         show jobs/PEs not jobs/tasks for queued jobs
    -nqxx         show jobs/tasks/PEs for queued jobs
    -load         show the nodes' load
    -sua          show the nodes' used/avail no of slots
    -age          show the age of the jobs, i.e., elapsed time vs starting/submit time
    -cpu          show the amount of CPU used by running job(s)
    -cpu%         show the amount of CPU/age/#PE in % (job efficiency)
    -cpur         show the ratio CPU/AGE, not scaled by #PE
    -mem          show the mean memory usage for running jobs, the total requested memory for queued jobs, in GB
    -memx         show more memory info for running jobs: reserved, mean, vmem and maxvmem (slow)
    -memr         show more memory info for running jobs: reserved, mean, maxvmem and res/mxvmem (slow)
    -io           show the I/O usage
    +lic          show license info (default), although not in detailed outputs
    -lic          do not show license info

    -noheader     do not show the header
    -nofooter     do not show the footer
    -raw          print raw values (for easier parsing)
    -queue QSPEC  limit to jobs in queue QSPEC (RE ok)
    -oresF  VAL   change the overreserved factor to VAL
    -osubT  VAL   change the oversubscribed threshold to VAL, in %
    -osubE  VAL   set the oversubscribtion excess threshold to VAL, in hr x slots
    -ineffT VAL   change the inefficient threshold to VAL, in %
    -ageT   VAL   change the minimum age threshold to VAL, in hour

    -v            verbose
    -warn         show warnings
    -wide         wide output (132 cols, implies -notty)
    -notty        do not use the width of the terminal, as returned by stty
    -check-load   check the instantanous load (-hiload only)

shorthands:
    +a   is expanded to -a -u $USER -nofooter             show all your jobs
    +a%                 -a -u $USER -nofooter -age -cpu%  ibidem, with cpu on % of age
    +ax%                -a -u $USER -nofooter -age -cpu% -load -sua -mem -io
    +ar%                -a -u $USER -nofooter -age -cpu% -memr -io
    +r                  -r -u $USER -nofooter             show all your running jobs
    +r%                 -r -u $USER -nofooter -age -cpu%  ibidem, with cpu on % of age
    +rx%                -r -u $USER -nofooter -age -cpu% -load -sua -memx -io
    +rr%                -r -u $USER -nofooter -age -cpu% -memr -io
    +q                  -q -u $USER -nofooter             show all your queued jobs
    +q%                 -q -u $USER -nofooter -age        ibidem but show age
    +qx%                -q -u $USER -nofooter -age -mem   ibidem plus mem info
    +X                  -X -u $USER -nofooter             show all your extra jobs
    +n                  -notty -wide -noheader -nofooter
qstat+: Ver 3.9/2 - Oct 2019
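
To make the -ineff and -osub criteria listed in the help above concrete, the sketch below recomputes the cpu% value (CPU/age/#PE, in %) and applies the default 33%, 133% and 1 hr thresholds. The function names cpu_pct and classify_job are hypothetical, and expressing CPU time and age in seconds is an assumption made for this illustration; qstat+ computes these values for you.

```shell
#!/usr/bin/env bash
# Illustration only: these helpers are hypothetical and not part of tools/local.
# Assumption: CPU time and age are both given in seconds, and slots >= 1.

# cpu_pct CPU_SECONDS AGE_SECONDS SLOTS: print CPU/age/#PE in %
cpu_pct() {
  awk -v cpu="$1" -v age="$2" -v pe="$3" \
      'BEGIN { printf "%.1f\n", 100 * cpu / age / pe }'
}

# classify_job CPU_SECONDS AGE_SECONDS SLOTS: apply the default thresholds
classify_job() {
  awk -v cpu="$1" -v age="$2" -v pe="$3" 'BEGIN {
    if (age <= 3600) { print "too young"; exit }   # age must be > 1 hr
    pct = 100 * cpu / age / pe
    if      (pct <  33) print "inefficient"        # would be listed by -ineff
    else if (pct > 133) print "oversubscribed"     # would be listed by -osub
    else                print "ok"
  }'
}

# A 4-slot job, 10 hours old, that used 10 hours of CPU in total
cpu_pct      36000 36000 4
classify_job 36000 36000 4
```

With these numbers the job reaches only 25.0% of what its four slots could deliver, so it is reported as inefficient. The thresholds themselves can be changed with -ineffT, -osubT and -ageT.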

qstat+ -examples

examples:
 qstat+ -a -u hpc                   show all of hpc's jobs
 qstat+ -r -cpu% -u hpc             show all of hpc's running jobs, cpu in %
 qstat+ -r -cpu% -load -sua -u hpc  show all of hpc's running jobs, cpu in %, and
                                the nodes' load and slot usage/availability
 qstat+ -q                          show all the queued jobs
 qstat+ -j 8683280,8683285          show info on specific job IDs
 qstat+ -nlist 8683280,8683285      show nodes list for specific job IDs
 qstat+ -hiload 1.5                 show nodes whose load is 1.5 greater than the number of slots used
 qstat+ -ineffT 50 -ineff           show jobs that are below a 50% efficiency threshold
 qstat+ -osubT 200 -ageT 48 -osub   show jobs that are above a 200% usage threshold and are older than 48 hours
 qstat+ -gc                         show the global cluster status
 qstat+ -es                         show the empty slots
 qstat+ -down                       show which nodes are down

 +ax% -u all        is equiv to -a -u all -cpu% -load -sua -mem
  etc... for the +XXX shorthands

qstat+: Ver 3.9/2 - Oct 2019
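
The +XXX shorthands are only abbreviations for longer option lists. As a reading aid, the following sketch prints the expansion of a shorthand, using the list shown by qstat+ -help. The function expand_qstat_shorthand is a hypothetical helper written for this page, not one of the local tools.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of tools/local: print the options a qstat+
# shorthand stands for, copied from the "shorthands" list of qstat+ -help.
expand_qstat_shorthand() {
  local me="-u ${USER:-$(id -un)} -nofooter"
  case "$1" in
    +a)   printf '%s\n' "-a $me" ;;
    +a%)  printf '%s\n' "-a $me -age -cpu%" ;;
    +ax%) printf '%s\n' "-a $me -age -cpu% -load -sua -mem -io" ;;
    +ar%) printf '%s\n' "-a $me -age -cpu% -memr -io" ;;
    +r)   printf '%s\n' "-r $me" ;;
    +r%)  printf '%s\n' "-r $me -age -cpu%" ;;
    +rx%) printf '%s\n' "-r $me -age -cpu% -load -sua -memx -io" ;;
    +rr%) printf '%s\n' "-r $me -age -cpu% -memr -io" ;;
    +q)   printf '%s\n' "-q $me" ;;
    +q%)  printf '%s\n' "-q $me -age" ;;
    +qx%) printf '%s\n' "-q $me -age -mem" ;;
    +X)   printf '%s\n' "-X $me" ;;
    +n)   printf '%s\n' "-notty -wide -noheader -nofooter" ;;
    *)    echo "unknown shorthand: $1" >&2; return 1 ;;
  esac
}

# Show what "qstat+ +r%" stands for, for the current user
expand_qstat_shorthand +r%
```

In other words, typing qstat+ +r% (or q+ +r%) saves you from typing the -r, -u, -nofooter, -age and -cpu% options each time.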

2 A "better" qacct: qacct+

The accounting information is ingested at regular intervals into a database.
The ingestion runs every few minutes, so the database is not instantaneously synchronized.
That database can be queried with qacct+:
- the output of qacct+ can be better controlled than the output of qacct, and
- qacct+ computes derived values.
How to use qacct+ is explained in the man page (man qacct+) and is described by:

% qacct+ -help
or
% qacct+ -show help

Namely:

qacct+ -help

usage: qacct+ [options] where options are  
  
  -j     <jobid>    limit to a given job ID (comma separated list OK)  
  -t     <taskid>   limit to a given task ID, or a range of task ID (if n:m)  
  -o     <owner>    limit to a given owner  
  -q     <queue>    limit to given queue        (regexp ok, like ".ThC.q")  
  -pe    <PE>       limit to given PE           (like mthread)  
  -from  <date>     limit to submitted >= date,   like "4/27/2016 10:00AM"  
  -to    <date>     limit to submitted <= date,   like "4/27/2016 11:00AM"  
  
  -max   <number>   max number of entries to show  
  -wdt   <number>   width of columns in tabular mode (def: 15)  
  -f     <dbname>   use the given DB file name, (def: /home/hpc/cron/data/dbs/qacct+.db)  
  
  -all              show all that match  
  -v                verbose mode (repeat to increase verbosity)  
  -n                dry run, use w/ -v to check the resulting SQL search  
  -dateFmt <n>      how to format dates, n=-1,0,1 for YYYY/MM/DD, Mon DD YYY, MM/DD/YYYY, def.0  
  -help             shows this help  
  
  -show  <string>   specify what to show        (def.: -show simple)  
                    where <string> can be "simple", "simple+", "tab", "tab+", or "raw",  
                                       or "help", "fields" or "stats", or a custom specification  
                    use -show help to get additional help on this option  
  
  Ver 1.5/1 Oct 22 2019

and by

qacct+ -show help

  -show  <string> specify what to show  
  
  i.e.  
    -show simple  for simple format (def.)  
    -show simple+ for extension of the simple format  
    -show tab     for tabular format  
    -show tab+    for extension of the tabular format  
    -show raw     for raw format  
    -show stats   show only the DB stats  
  
  or list what keyword to show and an optional format where string is either   
    +field1[=format1][,field2[=format2]]  for keyed format  
  or  
    @field1=[format1][,field2[=format2]]  for tabular format  
  
  the format is either a C format specification like %d, %s, %.1f,  
  or the values &DATE, &MEM, or &AGE to convert to date, memory or age  
  use -show fields to list all the available fields  
  
  like in:  
    -show @qname,slots,ru_wallclock,cpu=%-15.1f,granted_pe  
  or   
    -show +qname,slots,ru_wallclock=\&AGE,cpu=%.1f
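
A custom -show specification is a leading + (keyed) or @ (tabular) followed by a comma-separated list of field[=format] items. The sketch below assembles such a string and prints, without running it, a qacct+ command that would use it. The helper show_spec is hypothetical, and the owner and -max values are arbitrary choices for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of tools/local: join field[=format] items
# into a -show specification. The first argument is the leading character,
# "+" for the keyed format or "@" for the tabular format.
show_spec() {
  local lead="$1"
  shift
  local IFS=,
  printf '%s\n' "${lead}$*"
}

spec=$(show_spec @ qname slots 'ru_wallclock=&AGE' 'cpu=%.1f')
# Print the resulting command rather than running it
printf '%s\n' "qacct+ -o ${USER:-hpc} -max 20 -show '$spec'"
```

Putting the specification in single quotes keeps the shell from interpreting the & of &AGE, which is why no backslash is needed here. You can add -n -v to a qacct+ command to check the resulting SQL search without executing it.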

Retrieving information from a lot of jobs is currently slow. We plan to upgrade the back-end database server to speed it up.

3 A "better" quota: quota+

The Linux command quota does not work on the GPFS or the NAS file systems.
quota+ reports disk quota information for all the disks (NFS, GPFS or NAS).

quota+ help

quota+ [options]
 where options are:
  -u|--user user        return quotas for given user (must be root)
  -v|--verbose          display quotas on filesystems where no storage is allocated
  -%                    show Use% instead of Used
  +%                    show Use% as well as Used
  -f filesys            return quotas for given file system only
  -device               show device name only
  +device               show mount point and device name
  -terse                show only Used/Use% and Quota Limit, disable +%

Ver 2.2/1 Nov 2019

4 Checking a Compute Node: rtop+

rtop+: a script to run the command top on a compute node (aka remote top.)

The Un*x command top can be used to look at what processes are running on a given machine (it reports the "top" processes running at any time).
To check what processes are running on a compute node (to check CPU and/or memory usage), you can use:
% rtop+ [-u <username>] [-<number>] NN-MM
like in
% rtop+ -u hpc -50 43-05

and you will see the first <number> lines listing the processes owned by <username> on the compute node compute-NN-MM, or
(second example) the first 50 lines when running top, limited to user hpc, on compute-43-05.
If you omit -<number> you will see only the first 10 lines; if you omit -u <username> you will see everybody's processes.

Check man top to better understand the output of top.
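
To illustrate how the rtop+ arguments map onto a remote top, here is a sketch that only prints an equivalent command line. It is not the rtop+ source: the use of ssh, top in batch mode (-b -n 1) and head is an assumption about how such a command could be built, and remote_top_cmd is a hypothetical name.

```shell
#!/usr/bin/env bash
# Assumption-laden sketch, NOT the rtop+ source: it only prints the command
# that a "remote top" could run, using ssh, top in batch mode, and head.
remote_top_cmd() {
  local user="" lines=10            # defaults: first 10 lines, all users
  while [ $# -gt 1 ]; do
    case "$1" in
      -u)      user="$2"; shift 2 ;;
      -[0-9]*) lines="${1#-}"; shift ;;
      *)       echo "unknown option: $1" >&2; return 1 ;;
    esac
  done
  if [ $# -ne 1 ]; then
    echo "usage: remote_top_cmd [-u <username>] [-<number>] NN-MM" >&2
    return 1
  fi
  local top="top -b -n 1"
  if [ -n "$user" ]; then top="$top -u $user"; fi
  # NN-MM is shorthand for the host compute-NN-MM
  printf '%s\n' "ssh compute-$1 $top | head -n $lines"
}

remote_top_cmd -u hpc -50 43-05
```

The point of the tool is that you only type the NN-MM part of the node name and do not have to remember the batch-mode options of top.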

5 Checking Memory and CPU Usage of Jobs in the High Memory Queues

plot-qmemuse: a tool to plot the memory and CPU usage of jobs that ran recently or are running in the high memory queues.
We monitor the jobs running in the high-memory queues, taking a usage snapshot every five minutes; this tool therefore only applies to jobs in those queues.
The resulting statistics can be used to visualize the resource usage of a given job with the command plot-qmemuse, using

% plot-qmemuse <jobid>

or

% plot-qmemuse <jobid>.<taskid>

For that command to run, you must load the gnuplot module first. By default this tool produces a plot in an 850x850 pixel png file.

You can specify the following options:

`-l <label>`	to add your own label on the plot
`-s <size>`	to specify the plot size, in pixels (-s 1200 for a 1200x1200 plot)
`-o <filename>`	to specify the name of the `png` file
`-x`	to plot on the screen (using `X11`, assuming your connection to hydra allows `X11`)

You can view the plot in the png file with the command display <filename>, assuming that your connection to hydra allows X11, or you can copy that file to your local machine and view it with your browser or a png-compatible image viewer (like xv).

6 List of Local Tools

The following tools are available when loading the module tools/local:

`chage+`	substitute for chage, to query LDAP properties	ie: `chage+ $USER`
`check-disks-usage`	check disks usage, and print a warning when usage exceeds a threshold	ie: `check-disks-usage -w 10`
`check-gpuse`	print the cluster GPUs usage, for the hosts in the GPU host group
`check-hi-memuse`	check for memory use versus reservation and CPU usage efficiency, for a specific job or jobs
`check-memres`	check for jobs that use less memory than the amount reserved
`check-memuse`	print the cluster usage (memory and CPUs), for the hosts in a specific host group
`check-qlogs`	return a report on either oversubscribed or inefficient jobs, by parsing existing log files
`check-qwait`	show job(s) waiting in the queue and associated queue quota limits
`disk-usage`	return information on disk usage
`find-all-zombies`	find zombies
`get-gpu-info`	print information about GPU on the local host or a specific compute node
`monitor-code-usage`	helps you monitor your code usage
`parse-disk-quota-reports`	return report by parsing the disk quota report
`plot-disk-usage`	plot the usage of a given disk (volume)
`plot-qmemuse`	plot memory and cpu usage as a function of time for a given jobID
`plot-qsnapshot`	plot a snapshot of cluster usage
`plot-qssduse`	plot SSD usage as a function of time for a given jobID
`plot-qssduse-summary`	plot SSD usage summary as a function of time
`plot-uptime`	plot the load (uptime) of the head and login nodes
`qacct+`	a "better" qacct	show accounting information for completed jobs
`qchain`	chains a set of jobs by adding the -hold_jid <jobID> to qsub for you
`qhost+`	convenience wrapper around qhost to use the shorthand "NN-MM" instead of typing "-h compute-NN-MM"
`qquota+`	a "better" qquota	show queue quota wrt queue limits
`qstat+`	a "better" qstat	show queue status
`quota+`	a "better" quota	show disk quota information for all types (NFS, GPFS, NAS)
`q-wait`	wait until some jobs are not found in the queue
`rkill`	remote kill
`rkillall`	remote killall
`rtop+`	remote top
`show-ge-reporting`	parse and print part of the GE reporting file
`show-qmemuse`	show memory use statistics
`show-qslots`	shows how many slots are available in the queue(s)
`show-qssduse`	log statistics for plot-qssduse

Each tool has a man page, accessible after you load the module tools/local.
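
As an illustration of the mechanism that qchain automates, the sketch below submits job scripts one after the other, each one held until the previous job has completed. It does not reproduce qchain's own syntax (see man qchain for that): chain_jobs is a hypothetical function, the QSUB variable is only there so that the submit command can be swapped out, and it relies on the qsub options -terse (print only the job ID) and -hold_jid.

```shell
#!/usr/bin/env bash
# Sketch of job chaining, NOT the qchain source.
# chain_jobs first.job second.job ...: each job waits for the previous one.
chain_jobs() {
  local submit="${QSUB:-qsub}" prev="" script jid
  for script in "$@"; do
    if [ -z "$prev" ]; then
      # first job: no dependency
      jid=$($submit -terse "$script") || return 1
    else
      # later jobs: hold until the previous job ID has finished
      jid=$($submit -terse -hold_jid "$prev" "$script") || return 1
    fi
    echo "$script -> job $jid"
    prev="$jid"
  done
}

# Usage, on a machine where qsub is available:
#   chain_jobs step1.job step2.job step3.job
```

The script names in the usage comment are placeholders. With array jobs the ID printed by -terse includes the task range, which this sketch does not handle.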

7 List of Local+ Tools

The following tools are available when loading the module tools/local+:

`backup`	backup a file	ie: `mv` or `cp file to file.<n>`
`centos-version`	print the CentOS version
`check-hosts`	print the cluster usage (memory and CPUs), aggregated by logical rack
`check-qacct`	show statistics of resources usage for completed jobs
`dus-report`	run and parse `du` to produce a disk usage report
`elapsed`	print elapsed time between each call	ie: `elapsed; ...do something...; elapsed`
`fixFmt`	format a number with fixed number of digits	csh: @ n = 1; set N = `fixFmt 3 $n` [ba]sh: n=1; N=`fixFmt 3 $n` `echo n=$n N=$N → n=1 N=001`
`get-jobhr`	tool to retrieve a job hard resources (`mem_res h_data h_vmem`) in a job script
`lsth`	show most recent files	ie: `lsth [-40] [<spec>]` → `ls -lt <spec> \| head -40`
`lswc`	count number of files in directories	ie: `lswc dir/` → `ls dir/ \| wc -l`
`noX`	tool to unset `DISPLAY` (saved in `XDISPLAY`)
`p-wait`	wait until given PID has completed	ie: `p-wait [check-time] <PID>`
`pawk`	print with `awk`	ie: `pawk 1,hello,3` → `awk '{print $1,"hello",$3}'`
`print-proc-memory`	print nicely content of `/proc/memory`
`procinfo`	print properties of local machine (#CPUs, memory, OS)
`procinfo+`	print properties of local machine (#CPUs, memory, OS, etc...)
`tails`	run tail [options] on a set of files	ie: "`tail -3 pl`" fails "`tails -3 pl`" OK
`total`	compute the total of values at given column of a file
`useX`	tool to reset `DISPLAY` (from `XDISPLAY`), revert effect of noX
`xterm-config`	tool to configure `xterm` window properties

Each tool has a man page, accessible after you load the module tools/local+.
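
Several of the Local+ tools are thin conveniences over standard commands. The sketch below spells out, as shell functions, the equivalences given in the table for fixFmt, lswc and pawk. The *_sketch names are hypothetical and the real tools may accept more options or behave differently in corner cases.

```shell
#!/usr/bin/env bash
# Sketches only: the *_sketch names are hypothetical, the real tools may differ.

# fixFmt 3 1 gives 001: zero-pad an integer to a fixed number of digits.
# Assumes a plain decimal integer without leading zeros.
fixfmt_sketch() { printf "%0${1}d\n" "$2"; }

# lswc dir/ is described as: ls dir/ | wc -l
lswc_sketch() { ls "$1" | wc -l; }

# pawk 1,hello,3 is described as: awk '{print $1,"hello",$3}'
# Numbers become field references, anything else becomes a quoted string.
pawk_sketch() {
  local prog="" item
  local IFS=,
  for item in $1; do
    case "$item" in
      ''|*[!0-9]*) item="\"$item\"" ;;
      *)           item="\$$item" ;;
    esac
    prog="${prog:+$prog,}$item"
  done
  awk "{print $prog}"
}

fixfmt_sketch 3 1
echo "a b c" | pawk_sketch 1,hello,3
```

The two calls at the end print 001 and a hello c, matching the examples given in the table.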

Last Updated 20 Nov 2017 SGK.
