- A "better" qstat: qstat+
- A "better" qacct: qacct+
- A "better" quota: quota+
- Checking a Compute Node
- Checking Memory and CPU usage
- List of Local Tools
- List of Local+ Tools
Introduction
- We offer a set of tools, written for Hydra, to help monitor jobs and the cluster and do some simple operations.
- With the Hydra-5 upgrade, the names of these tools were rationalized and the tools were split into two groups.
- To access them you must load the module tools/local and/or the module tools/local+.
- The commands
% module help tools/local
and
% module help tools/local+
list the available tools (see table below).
- Each tool has a man page.
- A few of these tools are described here in more detail.
- The module tools/local-bc offers backward compatibility (hence the -bc) with the local tools on Hydra-4.
1 A "better" qstat: qstat+
qstat+ is a Perl wrapper around qstat: it runs qstat for you and parses its output to display it in a friendlier format.
q+ is a shorthand for qstat+.
- How to use qstat+ is explained in the man page (man qstat+) and is described by:
% qstat+ -help
or
% qstat+ -examples
- It can be used to
- get useful info like the age and/or the cpu usage (in % of the job age) of running jobs,
- get the list of nodes a parallel job is running on (if you haven't saved it),
- get a filtered version of qstat -j <jobid>, and
- get an overview of the cluster queue usage.
Namely:
    usage: qstat+ [mode] [options]

      where modes are (exclusive list):
        -s|-sx        show summary (two diff. format)
        -a            show all jobs
        -q            show queued jobs
        -r            show running jobs
        -X            show extra jobs (dr, Eqw, t)
        -j JID        show info on a specific job
        -nlist JID    show node list for a (set of) job(s)
        -hiload N     show node(s) with high load
        -gc           show global cluster status
        -es[x]        show empty slots, -esx: expanded info
        -down         show node(s) that are down
        -ores         show overreserved jobs, i.e.: resMem/maxVMem > 2.5, & age > 1 hr
        -osub         show oversubscribed jobs, i.e.: cpu% > 133%, & age > 1 hr
        -ineff        show inefficient jobs, i.e.: cpu% < 33%, & age > 1 hr
        -help         show help
        -examples     show examples

      where options are:
        -u USER       limit to the specified user, you can use "*" or "all"
                      to list everybody's jobs or a coma-separated list
        -njobs        show counts in no. of jobs
        -npes         show counts in no. of PEs (slots)
        -nqpe         show jobs/PEs not jobs/tasks for queued jobs
        -nqxx         show jobs/tasks/PEs for queued jobs
        -load         show the nodes' load
        -sua          show the nodes' used/avail no of slots
        -age          show the age of the jobs, i.e., elapsed time vs starting/submit time
        -cpu          show the amount of CPU used by running job(s)
        -cpu%         show the amount of CPU/age/#PE in % (job efficiency)
        -cpur         show the ratio CPU/AGE, not scaled by #PE
        -mem          show the mean memory usage for running jobs,
                      the total requested memory for queued jobs, in GB
        -memx         show more memory info for running jobs:
                      reserved, mean, vmem and maxvmem (slow)
        -memr         show more memory info for running jobs:
                      reserved, mean, maxvmem and res/mxvmem (slow)
        -io           show the I/O usage
        +lic          show license info (default), although not in detailed outputs
        -lic          do not show license info
        -noheader     do not show the header
        -nofooter     do not show the footer
        -raw          print raw values (for easier parsing)
        -queue QSPEC  limit to jobs in queue QSPEC (RE ok)
        -oresF VAL    change the overreserved factor to VAL
        -osubT VAL    change the oversubscribed threshold to VAL, in %
        -osubE VAL    set the oversubscribtion excess threshold to VAL, in hr x slots
        -ineffT VAL   change the inefficient threshold to VAL, in %
        -ageT VAL     change the minimum age threshold to VAL, in hour
        -v            verbose
        -warn         show warnings
        -wide         wide output (132 cols, implies -notty)
        -notty        do not use the width of the terminal, as returned by stty
        -check-load   check the instantanous load (-hiload only)

      shorthands:
        +a   is expanded to -a -u $USER -nofooter                                   show all your jobs
        +a%                 -a -u $USER -nofooter -age -cpu%                        ibidem, with cpu on % of age
        +ax%                -a -u $USER -nofooter -age -cpu% -load -sua -mem -io
        +ar%                -a -u $USER -nofooter -age -cpu% -memr -io
        +r                  -r -u $USER -nofooter                                   show all your running jobs
        +r%                 -r -u $USER -nofooter -age -cpu%                        ibidem, with cpu on % of age
        +rx%                -r -u $USER -nofooter -age -cpu% -load -sua -memx -io
        +rr%                -r -u $USER -nofooter -age -cpu% -memr -io
        +q                  -q -u $USER -nofooter                                   show all your queued jobs
        +q%                 -q -u $USER -nofooter -age                              ibidem but show age
        +qx%                -q -u $USER -nofooter -age -mem                         ibidem plus mem info
        +X                  -X -u $USER -nofooter                                   show all your extra jobs
        +n                  -notty -wide -noheader -nofooter

    qstat+: Ver 3.9/2 - Oct 2019
    examples:
      qstat+ -a -u hpc                      show all of hpc's jobs
      qstat+ -r -cpu% -u hpc                show all of hpc's running jobs, cpu in %
      qstat+ -r -cpu% -load -sua -u hpc     show all of hpc's running jobs, cpu in %,
                                            and the nodes' load and slot usage/availability
      qstat+ -q                             show all the queued jobs
      qstat+ -j 8683280,8683285             show info on specific job IDs
      qstat+ -nlist 8683280,8683285         show nodes list for specific job IDs
      qstat+ -hiload 1.5                    show nodes whose load is 1.5 greater than the number of slots used
      qstat+ -ineffT 50 -ineff              show jobs that are below a 50% efficiency threshold
      qstat+ -osubT 200 -ageT 48 -osub      show jobs that are above a 200% usage threshold
                                            and are older than 48 hours
      qstat+ -gc                            show the global cluster status
      qstat+ -es                            show the empty slots
      qstat+ -down                          show which nodes are down

      +ax% -u all is equiv to -a -u all -cpu% -load -sua -mem
      etc... for the +XXX shorthands

    qstat+: Ver 3.9/2 - Oct 2019
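The help text defines the job efficiency shown by -cpu% as CPU/age/#PE in %, and flags jobs older than 1 hr as oversubscribed above 133% or inefficient below 33%. The arithmetic can be sketched as follows. This is only an illustration of those definitions, not the actual qstat+ code: the function names cpu_pct and classify_job are hypothetical, and the inputs (CPU seconds, age in seconds, number of slots) are assumed to be known already.

```shell
#!/bin/sh
# Hypothetical helpers, not part of tools/local: they only illustrate the
# arithmetic behind the -cpu%, -osub and -ineff descriptions in the help text.

# cpu_pct CPU_SECONDS AGE_SECONDS NPES
# Prints CPU/age/#PE in %, with one decimal.
cpu_pct() {
    awk -v cpu="$1" -v age="$2" -v npe="$3" \
        'BEGIN { printf "%.1f\n", 100 * cpu / age / npe }'
}

# classify_job CPU_SECONDS AGE_SECONDS NPES
# Prints "too-young", "oversubscribed", "inefficient" or "ok", using the
# default thresholds quoted in the help text (133%, 33%, age > 1 hr).
classify_job() {
    awk -v cpu="$1" -v age="$2" -v npe="$3" 'BEGIN {
        if (age <= 3600) { print "too-young"; exit }
        pct = 100 * cpu / age / npe
        if (pct > 133)     print "oversubscribed"
        else if (pct < 33) print "inefficient"
        else               print "ok"
    }'
}

# A 4-slot job that used 2 h of CPU over 2 h of wall-clock time:
cpu_pct 7200 7200 4         # prints 25.0
classify_job 7200 7200 4    # prints inefficient

# A 2-slot job that used 10 h of CPU over 2 h of wall-clock time:
classify_job 36000 7200 2   # prints oversubscribed
```

A serial job that keeps one CPU busy scores close to 100%, so the default thresholds leave a wide band between "inefficient" and "oversubscribed"; the -ineffT, -osubT and -ageT options listed above change them.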
2 A "better" qacct: qacct+
- The accounting information is ingested at regular intervals into a database.
- The ingestion is run every few minutes, so the database is not instantaneously synchronized.
- That database can be queried with qacct+.
- The output of qacct+ can be better controlled than the output of qacct, and qacct+ computes derived values.
- How to use qacct+ is explained in the man page (man qacct+) and is described by:
% qacct+ -help
or
% qacct+ -show help
Namely:
    usage: qacct+ [options]

      where options are
        -j <jobid>       limit to a given job ID (comma separated list OK)
        -t <taskid>      limit to a given task ID, or a range of task ID (if n:m)
        -o <owner>       limit to a given owner
        -q <queue>       limit to given queue (regexp ok, like ".ThC.q")
        -pe <PE>         limit to given PE (like mthread)
        -from <date>     limit to submitted >= date, like "4/27/2016 10:00AM"
        -to <date>       limit to submitted <= date, like "4/27/2016 11:00AM"
        -max <number>    max number of entries to show
        -wdt <number>    width of columns in tabular mode (def: 15)
        -f <dbname>      use the given DB file name,
                         (def: /home/hpc/cron/data/dbs/qacct+.db)
        -all             show all that match
        -v               verbose mode (repeat to increase verbosity)
        -n               dry run, use w/ -v to check the resulting SQL search
        -dateFmt <n>     how to format dates, n=-1,0,1 for
                         YYYY/MM/DD, Mon DD YYY, MM/DD/YYYY, def.0
        -help            shows this help
        -show <string>   specify what to show (def.: -show simple)
                         where <string> can be "simple", "simple+", "tab", "tab+", or "raw",
                         or "help", "fields" or "stats", or a custom specification
                         use -show help to get additional help on this option

    Ver 1.5/1 Oct 22 2019
- and by
    -show <string> specify what to show
      i.e.
        -show simple     for simple format (def.)
        -show simple+    for extension of the simple format
        -show tab        for tabular format
        -show tab+       for extension of the tabular format
        -show raw        for raw format
        -show stats      show only the DB stats
      or list what keyword to show and an optional format
      where string is either
        +field1[=format1][,field2[=format2]]   for keyed format
      or
        @field1=[format1][,field2[=format2]]   for tabular format
      the format is either a C format specification like %d, %s, %.1f,
      or the values &DATE, &MEM, or &AGE to convert to date, memory or age
      use -show fields to list all the available fields
      like in:
        -show @qname,slots,ru_wallclock,cpu=%-15.1f,granted_pe
      or
        -show +qname,slots,ru_wallclock=\&AGE,cpu=%.1f
- Retrieving information from a lot of jobs is currently slow. We plan to upgrade the back-end database server to speed it up.
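The custom -show specification described above mixes characters that are special to the shell (&) with C format specifications, so a short sketch of how such a string passes through the shell may help. The backslash in the help text's example is shell quoting, and single quotes achieve the same thing. The printf lines only illustrate what the C formats %.1f and %-15.1f do to a number. The value 1234.567 and the variable names are made up, and floating-point support in the shell's printf (as in bash or dash) is assumed. qacct+ itself is not run here.

```shell
#!/bin/sh
# Two equivalent ways to pass the custom spec from the help text through the
# shell: escape the '&' with a backslash, or single-quote the whole string.
spec_escaped=+qname,slots,ru_wallclock=\&AGE,cpu=%.1f
spec_quoted='+qname,slots,ru_wallclock=&AGE,cpu=%.1f'
[ "$spec_escaped" = "$spec_quoted" ] && echo "same spec: $spec_quoted"

# What the C-style formats used in the help examples do to a number
# (1234.567 is an arbitrary value chosen for illustration):
printf '[%.1f]\n'    1234.567    # one decimal: [1234.6]
printf '[%-15.1f]\n' 1234.567    # same, left-justified in a 15-character column

# On Hydra, with tools/local loaded, the spec would be used as (not run here):
#   qacct+ -j <jobid> -show "$spec_quoted"
```

Leaving the & unquoted and unescaped would make the shell run everything before it in the background and treat the rest as a separate command, which is why the help text writes \&AGE.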
3 A "better" quota: quota+
- The Linux command quota does not work on the GPFS or the NAS.
- quota+ reports disk quota information on all the disks (NFS, GPFS, or NAS).
    quota+ [options]

      where options are:
        -u|--user user   return quotas for given user (must be root)
        -v|--verbose     display quotas on filesystems where no storage is allocated
        -%               show Use% instead of Used
        +%               show Use% as well as Used
        -f filesys       return quotas for given file system only
        -device          show device name only
        +device          show mount point and device name
        -terse           show only Used/Use% and Quota Limit, disable +%

    Ver 2.2/1 Nov 2019
4 Checking a Compute Node: rtop+
rtop+: a script to run the command top on a compute node (aka remote top).
- The Un*x command top can be used to look at what processes are running on a given machine (it reports the "top" processes running at any time).
- To check what processes are running on a compute node (to check CPU and/or memory usage), you can use:
% rtop+ [-u <username>] [-<number>] NN-MM
like in
% rtop+ -u hpc -50 43-05
and you will see the first <number> lines listing the processes owned by <username> on the compute node compute-NN-MM, or (second example) the first 50 lines when running top, limited to user hpc, on compute-43-05.
If you omit -<number> you will see only the first 10 lines; if you omit -u <username> you will see everybody's processes.
Check man top to better understand the output of top.
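To make the NN-MM shorthand and the defaults concrete, here is a rough sketch of what a "remote top" could look like. It is not the actual rtop+ script: the function names node_name and rtop_cmd are hypothetical, and using ssh together with top in batch mode (-b -n 1) is an assumption about one plausible way to do it. The sketch only prints the command it would run, so it can be tried anywhere.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the actual rtop+ implementation.

# node_name NN-MM : expand the shorthand to the full compute node name
node_name() {
    printf 'compute-%s\n' "$1"
}

# rtop_cmd [-u USER] [-NUMBER] NN-MM
# Prints the command a remote top could run: all users and the first
# 10 lines by default, as described for rtop+ above.
rtop_cmd() {
    user=""
    nlines=10
    while [ $# -gt 1 ]; do
        case "$1" in
            -u)      user="$2"; shift 2 ;;
            -[0-9]*) nlines="${1#-}"; shift ;;
            *)       echo "unknown option: $1" >&2; return 1 ;;
        esac
    done
    if [ $# -ne 1 ]; then
        echo "usage: rtop_cmd [-u USER] [-NUMBER] NN-MM" >&2
        return 1
    fi
    host=$(node_name "$1")
    if [ -n "$user" ]; then
        echo "ssh $host top -b -n 1 -u $user | head -n $nlines"
    else
        echo "ssh $host top -b -n 1 | head -n $nlines"
    fi
}

rtop_cmd -u hpc -50 43-05
# prints: ssh compute-43-05 top -b -n 1 -u hpc | head -n 50
```

Batch mode is assumed here because it makes top write a single snapshot to standard output, which is what allows the output to be truncated to a fixed number of lines.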
5 Checking Memory and CPU Usage of Jobs in the High Memory Queues
plot-qmemuse: a tool to plot the memory and CPU usage of jobs that ran recently or are running in the high-memory queues.
- We monitor the jobs running in the high-memory queues, taking a usage snapshot every five minutes. This tool only applies to jobs in the high-memory queues.
- The resulting statistics can be used to visualize the resource usage of a given job with the command plot-qmemuse, using
% plot-qmemuse <jobid>
or
% plot-qmemuse <jobid>.<taskid>
For that command to run, you must first load the gnuplot module. By default this tool produces a plot in an 850x850 pixel png file.
You can specify the following options:
-l <label> | to add your own label on the plot |
-s <size> | to specify the plot size, in pixels (-s 1200 for a 1200x1200 plot) |
-o <filename> | to specify the name of the png file |
-x | to plot on the screen (using X11, assuming your connection to hydra allows X11) |
You can view the plot in the png file with the command display <filename>, assuming that your connection to hydra allows X11, or you can copy that file to your local machine and view it with your browser or a png-compatible image viewer (like xv).
6 List of Local Tools
The following tools are available when loading the module tools/local:
chage+ | substitute for chage, to query LDAP properties | e.g.: chage+ $USER |
check-disks-usage | check disk usage, and print a warning when usage exceeds a threshold | e.g.: check-disks-usage -w 10 |
| print the cluster GPUs usage, for the hosts in the GPU host group | |
check-hi-memuse | check for memory use versus reservation and CPU usage efficiency, for a specific job or jobs | |
check-memres | check for jobs that use less memory than the amount reserved | |
check-memuse | print the cluster usage (memory and CPUs), for the hosts in a specific host group | |
check-qlogs | return a report on either oversubscribed or inefficient jobs, by parsing existing log files | |
check-qwait | show job(s) waiting in the queue and associated queue quota limits | |
disk-usage | return information on disk usage | |
find-all-zombies | find zombies | |
get-gpu-info | print information about GPU on the local host or a specific compute node | |
monitor-code-usage | helps you monitor your code usage | |
parse-disk-quota-reports | return report by parsing the disk quota report | |
plot-disk-usage | plot the usage of a given disk (volume) | |
plot-qmemuse | plot memory and cpu usage as a function of time for a given jobID | |
plot-qsnapshot | plot a snapshot of cluster usage | |
plot-qssduse | plot SSD usage as a function of time for a given jobID | |
plot-qssduse-summary | plot SSD usage summary as a function of time | |
plot-uptime | plot the load (uptime) of the head and login nodes | |
qacct+ | a "better" qacct | show accounting information for completed jobs |
qchain | chains a set of jobs by adding the -hold_jid <jobID> to qsub for you | |
qhost+ | convenience wrapper around qhost to use the shorthand "NN-MM" instead of typing "-h compute-NN-MM" | |
qquota+ | a "better" qquota | show queue quota wrt queue limits |
qstat+ | a "better" qstat | show queue status |
quota+ | a "better" quota | show disk quota information for all type (NFS, GPFS, NAS) |
q-wait | wait until some jobs are not found in the queue | |
rkill | remote kill | |
rkillall | remote killall | |
rtop+ | remote top | |
show-ge-reporting | parse and print part of the GE reporting file | |
show-qmemuse | show memory use statistics | |
show-qslots | shows how many slots are available in the queue(s) | |
show-qssduse | log statistics for plot-qssduse | |
Each tool has a man page, accessible after you load the module tools/local.
7 List of Local+ Tools
The following tools are available when loading the module tools/local+:
| backup a file | ie: mv or cp file to file.<n> |
centos-version | print the CentOS version | |
check-hosts | print the cluster usage (memory and CPUs), as aggregate by logical rack | |
check-qacct | show statistics of resources usage for completed jobs | |
dus-report | run and parse du to produce a disk usage report | |
elapsed | print elapsed time between each call | e.g.: elapsed; ...do something...; elapsed |
fixFmt | format a number with a fixed number of digits | |
get-jobhr | tool to retrieve a job's hard resources (mem_res, h_data, h_vmem) in a job script | |
lsth | show most recent files | e.g.: lsth [-40] [<spec>] → ls -lt <spec> | head -40 |
lswc | count number of files in directories | e.g.: lswc dir/ → ls dir/ | wc -l |
noX | tool to unset DISPLAY (saved in XDISPLAY) | |
p-wait | wait until given PID has completed | e.g.: p-wait [check-time] <PID> |
pawk | print with awk | e.g.: pawk 1,hello,3 → awk '{print $1,"hello",$3}' |
print-proc-memory | print nicely content of /proc/memory | |
procinfo | print properties of local machine (#CPUs, memory, OS) | |
procinfo+ | print properties of local machine (#CPUs, memory, OS, etc...) | |
tails | run tail [options] on a set of files | e.g.: "tail -3 *pl" fails, "tails -3 *pl" OK |
total | compute the total of values at given column of a file | |
useX | tool to reset DISPLAY (from XDISPLAY ), revert effect of noX | |
xterm-config | tool to configure xterm window properties | |
Each tool has a man page, accessible after you load the module tools/local+.
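Several of the tools in the table above are described as shorthands for standard pipelines. The sketch below runs the equivalent pipelines quoted in the table (for lswc, lsth and pawk) with plain Un*x commands, so the behavior can be checked without loading tools/local+. The scratch directory and file names are made up for illustration, and the tools themselves may do more than these one-liners.

```shell
#!/bin/sh
# Scratch directory with three empty files, for illustration only
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b.txt" "$tmp/c.txt"

# lswc dir/  ->  ls dir/ | wc -l   (counts the 3 files)
ls "$tmp"/ | wc -l

# lsth [-40] [<spec>]  ->  ls -lt <spec> | head -40   (most recent first)
ls -lt "$tmp" | head -40

# pawk 1,hello,3  ->  awk '{print $1,"hello",$3}'
echo "one two three" | awk '{print $1,"hello",$3}'   # prints: one hello three

# clean up the scratch directory
rm -r "$tmp"
```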
Last Updated SGK.