  1. A "better" qstat: qstat+
  2. A "better" qacct: qacct+
  3. A "better" quota: quota+
  4. Checking a Compute Node: rtop+
  5. Checking Memory and CPU Usage of Jobs in the High Memory Queues
  6. List of Local Tools
  7. List of Local+ Tools

Introduction

  • We offer a set of tools, written for Hydra, to help monitor jobs and the cluster and do some simple operations.
    • With the Hydra-5 upgrade, the names of these tools were rationalized and the tools were split into two groups.
  • To access them you must load the module tools/local and/or the module tools/local+.
  • The commands

   % module help tools/local

and

   % module help tools/local+

list the available tools (see table below).

  • Each tool has a man page.
  • A few of these tools are described here in more detail.
  • The module tools/local-bc offers backward compatibility (hence the -bc) with the local tools on Hydra-4.

1 A "better" qstat: qstat+

  • qstat+ is a Perl wrapper around qstat: it runs qstat for you and parses its output to display it in a friendlier format.
  • q+ is a shorthand for qstat+.
  • How to use qstat+ is explained in the man page (man qstat+) and is described by:

   % qstat+ -help

or

   % qstat+ -examples

  • It can be used to
    • get useful info like the age and/or the cpu usage (in % of the job age) of running jobs,
    • get the list of nodes a parallel job is running on (if you haven't saved it),
    • get a filtered version of qstat -j <jobid>, and
    • get an overview of the cluster queues usage.

Namely:

qstat+ -help
usage: qstat+ [mode] [options]
  where modes are (exclusive list):
    -s|-sx       show summary (two diff. format)
    -a           show all jobs
    -q           show queued jobs
    -r           show running jobs
    -X           show extra jobs (dr, Eqw, t)
    -j      JID  show info on a specific job
    -nlist  JID  show node list for a (set of) job(s)
    -hiload N    show node(s) with high load
    -gc          show global cluster status
    -es[x]       show empty slots, -esx: expanded info
    -down        show node(s) that are down
    -ores        show overreserved jobs, i.e.: resMem/maxVMem > 2.5, & age > 1 hr
    -osub        show oversubscribed jobs, i.e.: cpu% > 133%, & age > 1 hr
    -ineff       show inefficient jobs, i.e.: cpu% < 33%, & age > 1 hr
    -help        show help
    -examples    show examples

where options are:
    -u USER       limit to the specified user, you can use "*" or "all" to list everybody's jobs
                  or a coma-separated list
    -njobs        show counts in no. of jobs
    -npes         show counts in no. of PEs (slots)
    -nqpe         show jobs/PEs not jobs/tasks for queued jobs
    -nqxx         show jobs/tasks/PEs for queued jobs
    -load         show the nodes' load
    -sua          show the nodes' used/avail no of slots
    -age          show the age of the jobs, i.e., elapsed time vs starting/submit time
    -cpu          show the amount of CPU used by running job(s)
    -cpu%         show the amount of CPU/age/#PE in % (job efficiency)
    -cpur         show the ratio CPU/AGE, not scaled by #PE
    -mem          show the mean memory usage for running jobs, the total requested memory for queued jobs, in GB
    -memx         show more memory info for running jobs: reserved, mean, vmem and maxvmem (slow)
    -memr         show more memory info for running jobs: reserved, mean, maxvmem and res/mxvmem (slow)
    -io           show the I/O usage
    +lic          show license info (default), although not in detailed outputs
    -lic          do not show license info

    -noheader     do not show the header
    -nofooter     do not show the footer
    -raw          print raw values (for easier parsing)
    -queue QSPEC  limit to jobs in queue QSPEC (RE ok)
    -oresF  VAL   change the overreserved factor to VAL
    -osubT  VAL   change the oversubscribed threshold to VAL, in %
    -osubE  VAL   set the oversubscribtion excess threshold to VAL, in hr x slots
    -ineffT VAL   change the inefficient threshold to VAL, in %
    -ageT   VAL   change the minimum age threshold to VAL, in hour

    -v            verbose
    -warn         show warnings
    -wide         wide output (132 cols, implies -notty)
    -notty        do not use the width of the terminal, as returned by stty
    -check-load   check the instantanous load (-hiload only)

shorthands:
    +a   is expanded to -a -u $USER -nofooter             show all your jobs
    +a%                 -a -u $USER -nofooter -age -cpu%  ibidem, with cpu on % of age
    +ax%                -a -u $USER -nofooter -age -cpu% -load -sua -mem -io
    +ar%                -a -u $USER -nofooter -age -cpu% -memr -io
    +r                  -r -u $USER -nofooter             show all your running jobs
    +r%                 -r -u $USER -nofooter -age -cpu%  ibidem, with cpu on % of age
    +rx%                -r -u $USER -nofooter -age -cpu% -load -sua -memx -io
    +rr%                -r -u $USER -nofooter -age -cpu% -memr -io
    +q                  -q -u $USER -nofooter             show all your queued jobs
    +q%                 -q -u $USER -nofooter -age        ibidem but show age
    +qx%                -q -u $USER -nofooter -age -mem   ibidem plus mem info
    +X                  -X -u $USER -nofooter             show all your extra jobs
    +n                  -notty -wide -noheader -nofooter
qstat+: Ver 3.9/2 - Oct 2019

qstat+ -examples
examples:
 qstat+ -a -u hpc                   show all of hpc's jobs
 qstat+ -r -cpu% -u hpc             show all of hpc's running jobs, cpu in %
 qstat+ -r -cpu% -load -sua -u hpc  show all of hpc's running jobs, cpu in %, and
                                the nodes' load and slot usage/availability
 qstat+ -q                          show all the queued jobs
 qstat+ -j 8683280,8683285          show info on specific job IDs
 qstat+ -nlist 8683280,8683285      show nodes list for specific job IDs
 qstat+ -hiload 1.5                 show nodes whose load is 1.5 greater than the number of slots used
 qstat+ -ineffT 50 -ineff           show jobs that are below a 50% efficiency threshold
 qstat+ -osubT 200 -ageT 48 -osub   show jobs that are above a 200% usage threshold and are older than 48 hours
 qstat+ -gc                         show the global cluster status
 qstat+ -es                         show the empty slots
 qstat+ -down                       show which nodes are down

 +ax% -u all        is equiv to -a -u all -cpu% -load -sua -mem
  etc... for the +XXX shorthands

qstat+: Ver 3.9/2 - Oct 2019
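
The -ineff and -osub criteria listed in the help come down to a little arithmetic: the CPU time divided by the job age and the number of slots, expressed in %. The following is a minimal sketch of that arithmetic, not the code used by qstat+. The function name classify_job, its arguments (CPU time and age in seconds, number of slots) and the output labels are assumptions made for this illustration; only the default thresholds (33%, 133%, age > 1 hr) are taken from the help text above.

```shell
# classify_job CPU_SECONDS AGE_SECONDS SLOTS   (hypothetical helper, not part of qstat+)
# Prints the job efficiency (CPU/age/#PE in %) followed by a label.
classify_job() {
    awk -v cpu="$1" -v age="$2" -v pe="$3" 'BEGIN {
        if (age <= 0 || pe <= 0) { print "invalid"; exit 1 }
        pct = 100 * cpu / age / pe
        # default thresholds quoted in the qstat+ help text
        if (age <= 3600)     label = "too-young"        # age must be > 1 hr
        else if (pct > 133)  label = "oversubscribed"   # cpu% > 133%
        else if (pct < 33)   label = "inefficient"      # cpu% < 33%
        else                 label = "ok"
        printf "%.1f%% %s\n", pct, label
    }'
}

classify_job 1000 7200 4     # prints: 3.5% inefficient
classify_job 20000 7200 1    # prints: 277.8% oversubscribed
```

On the cluster itself, the options -ineffT, -osubT and -ageT shown above change these thresholds.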

2 A "better" qacct: qacct+

  • The accounting information is ingested at regular intervals into a database.
    • The ingestion is run every few minutes, so the database is not instantaneously synchronized.
  • That database can be queried with qacct+.
    • The output of qacct+ can be better controlled than the output of qacct, and
    • qacct+ computes derived values.
  • How to use qacct+ is explained in the man page (man qacct+) and is described by:

   % qacct+ -help
or
   % qacct+ -show help

Namely:

qacct+ -help
usage: qacct+ [options] where options are  
  
  -j     <jobid>    limit to a given job ID (comma separated list OK)  
  -t     <taskid>   limit to a given task ID, or a range of task ID (if n:m)  
  -o     <owner>    limit to a given owner  
  -q     <queue>    limit to given queue        (regexp ok, like ".ThC.q")  
  -pe    <PE>       limit to given PE           (like mthread)  
  -from  <date>     limit to submitted >= date,   like "4/27/2016 10:00AM"  
  -to    <date>     limit to submitted <= date,   like "4/27/2016 11:00AM"  
  
  -max   <number>   max number of entries to show  
  -wdt   <number>   width of columns in tabular mode (def: 15)  
  -f     <dbname>   use the given DB file name, (def: /home/hpc/cron/data/dbs/qacct+.db)  
  
  -all              show all that match  
  -v                verbose mode (repeat to increase verbosity)  
  -n                dry run, use w/ -v to check the resulting SQL search  
  -dateFmt <n>      how to format dates, n=-1,0,1 for YYYY/MM/DD, Mon DD YYY, MM/DD/YYYY, def.0  
  -help             shows this help  
  
  -show  <string>   specify what to show        (def.: -show simple)  
                    where <string> can be "simple", "simple+", "tab", "tab+", or "raw",  
                                       or "help", "fields" or "stats", or a custom specification  
                    use -show help to get additional help on this option  
  
  Ver 1.5/1 Oct 22 2019

  • and by
qacct+ -show help
  -show  <string> specify what to show  
  
  i.e.  
    -show simple  for simple format (def.)  
    -show simple+ for extension of the simple format  
    -show tab     for tabular format  
    -show tab+    for extension of the tabular format  
    -show raw     for raw format  
    -show stats   show only the DB stats  
  
  or list what keyword to show and an optional format where string is either   
    +field1[=format1][,field2[=format2]]  for keyed format  
  or  
    @field1=[format1][,field2[=format2]]  for tabular format  
  
  the format is either a C format specification like %d, %s, %.1f,  
  or the values &DATE, &MEM, or &AGE to convert to date, memory or age  
  use -show fields to list all the available fields  
  
  like in:  
    -show @qname,slots,ru_wallclock,cpu=%-15.1f,granted_pe  
  or   
    -show +qname,slots,ru_wallclock=\&AGE,cpu=%.1f
  • Retrieving information from a lot of jobs is currently slow. We plan to upgrade the back-end database server to speed it up. 
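
The custom -show specification described above is a prefix (+ for keyed, @ for tabular) followed by a comma-separated list of field[=format] items. The sketch below only assembles such a string. The helper name show_spec is hypothetical and is not provided by the tools/local module; the field names are the ones used in the help text.

```shell
# show_spec PREFIX FIELD[=FORMAT]...   (hypothetical helper)
# Joins the items with commas and prepends the + or @ prefix.
show_spec() {
    case "$1" in
        +|@) prefix=$1 ;;
        *) echo "show_spec: first argument must be + or @" >&2; return 1 ;;
    esac
    shift
    spec=""
    for field in "$@"; do
        spec="${spec:+$spec,}$field"   # add a comma only between items
    done
    printf '%s%s\n' "$prefix" "$spec"
}

show_spec @ qname slots ru_wallclock 'cpu=%-15.1f' granted_pe
show_spec + qname slots 'ru_wallclock=&AGE' 'cpu=%.1f'
```

The resulting string could then be passed as the argument of -show, for instance together with -n -v to check the resulting SQL search before running the query. Single quotes keep & and % away from the shell, so the backslash used in the help example is not needed here.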

3 A "better" quota: quota+

  • The Linux command quota does not work on the GPFS or the NAS.
  • quota+ reports disk quota information for all the disks (NFS, GPFS or NAS).
quota+ help
quota+ [options]
 where options are:
  -u|--user user        return quotas for given user (must be root)
  -v|--verbose          display quotas on filesystems where no storage is allocated
  -%                    show Use% instead of Used
  +%                    show Use% as well as Used
  -f filesys            return quotas for given file system only
  -device               show device name only
  +device               show mount point and device name
  -terse                show only Used/Use% and Quota Limit, disable +%

Ver 2.2/1 Nov 2019

4 Checking a Compute Node: rtop+

rtop+: a script to run the command top on a compute node (aka remote top.)

  • The Un*x command top can be used to look at what processes are running on a given machine (it reports the "top" processes running at any time).
  • To check what processes are running on a compute node (to check CPU and/or memory usage), you can use:
       % rtop+ [-u <username>] [-<number>] NN-MM
    like in
       % rtop+ -u hpc -50 43-05

    and you will see the first <number> lines listing the processes owned by <username> on the compute node compute-NN-MM, or
    (second example) the first 50 lines when running top, limited to user hpc, on compute-43-05.
    If you omit -<number>, you will see only the first 10 lines; if you omit -u <username>, you will see everybody's processes.

Check man top to better understand the output of top.
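
To make the NN-MM shorthand and the defaults concrete, here is a rough sketch of what a "remote top" amounts to. It is an assumption, not the rtop+ source: the function names node_name and remote_top_cmd are hypothetical, and the use of ssh with top in batch mode is a guess at one possible implementation. The sketch only prints the command it would run.

```shell
# node_name NN-MM: expand the shorthand to compute-NN-MM (digits and one dash only)
node_name() {
    case "$1" in
        *[!0-9-]* | -* | *- | *-*-* | "") echo "bad node spec: $1" >&2; return 1 ;;
        *-*) echo "compute-$1" ;;
        *) echo "bad node spec: $1" >&2; return 1 ;;
    esac
}

# remote_top_cmd [-u USER] [-NUMBER] NN-MM: print an equivalent ssh/top command
remote_top_cmd() {
    user=""; nlines=10                      # default: first 10 lines, all users
    while [ $# -gt 1 ]; do
        case "$1" in
            -u) user=$2; shift 2 ;;
            -[0-9]*) nlines=${1#-}; shift ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
    done
    host=$(node_name "$1") || return 1
    if [ -n "$user" ]; then
        echo "ssh $host top -b -n 1 -u $user | head -n $nlines"
    else
        echo "ssh $host top -b -n 1 | head -n $nlines"
    fi
}

remote_top_cmd -u hpc -50 43-05
# prints: ssh compute-43-05 top -b -n 1 -u hpc | head -n 50
```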

5 Checking Memory and CPU Usage of Jobs in the High Memory Queues

  • plot-qmemuse: a tool to plot the memory and CPU usage of jobs that ran recently or are running in the high memory queues.
  • We monitor the jobs running in the high-memory queues, taking a usage snapshot every five minutes. This tool only applies to jobs in the high-memory queues.
  • The resulting statistics can be used to visualize the resource usage of a given job with the command plot-qmemuse, using

   % plot-qmemuse <jobid>

 or

   % plot-qmemuse <jobid>.<taskid>

For that command to run, you must load the gnuplot module first. By default, this tool produces a plot in an 850x850 pixel png file.

You can specify the following options:

   -l <label>     to add your own label on the plot
   -s <size>      to specify the plot size, in pixels (-s 1200 for a 1200x1200 plot)
   -o <filename>  to specify the name of the png file
   -x             to plot on the screen (using X11, assuming your connection to hydra allows X11)

You can view the plot in the png file with the command display <filename>, assuming that your connection to hydra allows X11, or you can copy that file to your local machine and view it with your browser or a png-compatible image viewer (like xv).

6 List of Local Tools

The following tools are available when loading the module tools/local:

chage+                    substitute for chage, to query LDAP properties
                          e.g.: chage+ $USER
check-disks-usage         check disk usage, and print a warning when usage exceeds a threshold
                          e.g.: check-disks-usage -w 10
check-gpuse               print the cluster GPU usage, for the hosts in the GPU host group
check-hi-memuse           check memory use versus reservation and CPU usage efficiency, for a specific job or jobs
check-memres              check for jobs that use less memory than the amount reserved
check-memuse              print the cluster usage (memory and CPUs), for the hosts in a specific host group
check-qlogs               return a report on either oversubscribed or inefficient jobs, by parsing existing log files
check-qwait               show job(s) waiting in the queue and the associated queue quota limits
disk-usage                return information on disk usage
find-all-zombies          find zombies
get-gpu-info              print information about the GPUs on the local host or a specific compute node
monitor-code-usage        helps you monitor your code usage
parse-disk-quota-reports  return a report by parsing the disk quota report
plot-disk-usage           plot the usage of a given disk (volume)
plot-qmemuse              plot memory and CPU usage as a function of time for a given jobID
plot-qsnapshot            plot a snapshot of cluster usage
plot-qssduse              plot SSD usage as a function of time for a given jobID
plot-qssduse-summary      plot SSD usage summary as a function of time
plot-uptime               plot the load (uptime) of the head and login nodes
qacct+                    a "better" qacct: show accounting information for completed jobs
qchain                    chains a set of jobs by adding the -hold_jid <jobID> to qsub for you
qhost+                    convenience wrapper around qhost to use the shorthand "NN-MM" instead of typing "-h compute-NN-MM"
qquota+                   a "better" qquota: show queue quota with respect to queue limits
qstat+                    a "better" qstat: show queue status
quota+                    a "better" quota: show disk quota information for all types (NFS, GPFS, NAS)
q-wait                    wait until some jobs are not found in the queue
rkill                     remote kill
rkillall                  remote killall
rtop+                     remote top
show-ge-reporting         parse and print part of the GE reporting file
show-qmemuse              show memory use statistics
show-qslots               shows how many slots are available in the queue(s)
show-qssduse              log statistics for plot-qssduse

Each tool has a man page, accessible after you load the module tools/local.
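
As an illustration of what qchain automates, the sketch below submits a list of job scripts so that each one waits for the previous one. It relies on the Grid Engine qsub options -terse (print only the job ID) and -hold_jid; the function name chain_jobs and the SUBMIT variable are hypothetical and are not the interface of qchain (see man qchain for that).

```shell
# SUBMIT can be overridden (e.g. with a stub) to try the logic without a scheduler.
SUBMIT=${SUBMIT:-qsub}

# chain_jobs SCRIPT...   (hypothetical helper)
# Submits each script with -hold_jid set to the job ID of the previous one.
chain_jobs() {
    prev=""
    for script in "$@"; do
        if [ -n "$prev" ]; then
            jid=$($SUBMIT -terse -hold_jid "$prev" "$script") || return 1
        else
            jid=$($SUBMIT -terse "$script") || return 1
        fi
        echo "$script submitted as job $jid"
        prev=$jid
    done
}

# usage, with job script names chosen for the example:
#   chain_jobs step1.job step2.job step3.job
```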

7 List of Local+ Tools

The following tools are available when loading the module tools/local+:

backup                    back up a file
                          i.e.: mv or cp file to file.<n>
centos-version            print the CentOS version
check-hosts               print the cluster usage (memory and CPUs), aggregated by logical rack
check-qacct               show statistics of resource usage for completed jobs
dus-report                run and parse du to produce a disk usage report
elapsed                   print the elapsed time between each call
                          e.g.: elapsed; ...do something...; elapsed
fixFmt                    format a number with a fixed number of digits
                          csh:    @ n = 1; set N = `fixFmt 3 $n`
                          [ba]sh: n=1; N=`fixFmt 3 $n`
                                  echo n=$n N=$N → n=1 N=001
get-jobhr                 tool to retrieve a job's hard resources (mem_res h_data h_vmem) in a job script
lsth                      show the most recent files
                          i.e.: lsth [-40] [<spec>] → ls -lt <spec> | head -40
lswc                      count the number of files in directories
                          i.e.: lswc dir/ → ls dir/ | wc -l
noX                       tool to unset DISPLAY (saved in XDISPLAY)
p-wait                    wait until the given PID has completed
                          i.e.: p-wait [check-time] <PID>
pawk                      print with awk
                          i.e.: pawk 1,hello,3 → awk '{print $1,"hello",$3}'
print-proc-memory         print nicely the content of /proc/memory
procinfo                  print properties of the local machine (#CPUs, memory, OS)
procinfo+                 print properties of the local machine (#CPUs, memory, OS, etc...)
tails                     run tail [options] on a set of files
                          e.g.: "tail -3 *pl" fails, "tails -3 *pl" is OK
total                     compute the total of the values in a given column of a file
useX                      tool to reset DISPLAY (from XDISPLAY), reverting the effect of noX
xterm-config              tool to configure xterm window properties

Each tool has a man page, accessible after you load the module tools/local+.
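
Several of the tools above are thin conveniences over standard commands. The sketch below shows plain-shell equivalents for three of them, following the descriptions in the list. These are not the local+ implementations: the function names fix_fmt, pawk_sketch and total_col, and the argument order of total_col, are assumptions made for the illustration.

```shell
# fix_fmt WIDTH NUMBER: zero-pad an integer, like "fixFmt 3 1" giving 001
fix_fmt() {
    printf "%0${1}d\n" "$2"
}

# pawk_sketch SPEC [FILE...]: numeric items become awk fields, others literal strings,
# so "1,hello,3" runs awk '{print $1,"hello",$3}'
pawk_sketch() {
    spec=$1; shift
    prog=$(printf '%s\n' "$spec" | awk -F, '{
        out = ""
        for (i = 1; i <= NF; i++) {
            item = ($i ~ /^[0-9]+$/) ? ("$" $i) : ("\"" $i "\"")
            out = out (i > 1 ? "," : "") item
        }
        print "{print " out "}"
    }')
    awk "$prog" "$@"
}

# total_col COLUMN [FILE...]: sum the values found in the given column
total_col() {
    col=$1; shift
    awk -v col="$col" '{ sum += $col } END { print sum + 0 }' "$@"
}

fix_fmt 3 1                                  # prints: 001
printf 'a b c\n' | pawk_sketch 1,hello,3     # prints: a hello c
printf 'x 1\ny 2\nz 3.5\n' | total_col 2     # prints: 6.5
```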



Last Updated  SGK.
