  1. A "better" qstat: qstat+
  2. A "better" qacct: qacct+
  3. A "better" quota: quota+
  4. Checking a Compute Node
  5. Checking Memory and CPU usage
  6. List of Local Tools
  7. List of Local+ Tools

Introduction

  • We offer a set of tools, written for Hydra, to help monitor jobs and the cluster and do some simple operations.
    • With the Hydra-5 upgrade, the names of these tools were rationalized and the tools were split into two groups.
  • To access them you must load the module tools/local and/or the module tools/local+.
  • The commands

   % module help tools/local

and

   % module help tools/local+

list the available tools (see table below).

  • Each tool has a man page.
  • A few of these tools are described here in more detail.
  • The module tools/local-bc offers backward compatibility (hence the -bc) with the local tools on Hydra-4.

1 A "better" qstat: qstat+

  • qstat+ is a Perl wrapper around qstat: it runs qstat for you and parses its output to display it in a friendlier format.
  • q+ is a shorthand for qstat+.
  • How to use qstat+ is explained in the man page (man qstat+) and is described by:

   % qstat+ -help

or

   % qstat+ -examples

  • It can be used to
    • get useful info like the age and/or the cpu usage (in % of the job age) of running jobs,
    • get the list of nodes a parallel job is running on (if you haven't saved it),
    • get a filtered version of qstat -j <jobid>, and
    • get an overview of the cluster queues usage.

Namely:

qstat+ -help
usage: qstat+ [mode] [options]
  where modes are (exclusive list):
    -s|-sx       show summary (two diff. format)
    -a           show all jobs
    -q           show queued jobs
    -r           show running jobs
    -X           show extra jobs (dr, Eqw, t)
    -j      JID  show info on a specific job
    -nlist  JID  show node list for a (set of) job(s)
    -hiload N    show node(s) with high load
    -gc          show global cluster status
    -es[x]       show empty slots, -esx: expanded info
    -down        show node(s) that are down
    -ores        show overreserved jobs, i.e.: resMem/maxVMem > 2.5, & age > 1 hr
    -osub        show oversubscribed jobs, i.e.: cpu% > 133%, & age > 1 hr
    -ineff       show inefficient jobs, i.e.: cpu% < 33%, & age > 1 hr
    -help        show help
    -examples    show examples

where options are:
    -u USER       limit to the specified user, you can use "*" or "all" to list everybody's jobs
                  or a coma-separated list
    -njobs        show counts in no. of jobs
    -npes         show counts in no. of PEs (slots)
    -nqpe         show jobs/PEs not jobs/tasks for queued jobs
    -nqxx         show jobs/tasks/PEs for queued jobs
    -load         show the nodes' load
    -sua          show the nodes' used/avail no of slots
    -age          show the age of the jobs, i.e., elapsed time vs starting/submit time
    -cpu          show the amount of CPU used by running job(s)
    -cpu%         show the amount of CPU/age/#PE in % (job efficiency)
    -cpur         show the ratio CPU/AGE, not scaled by #PE
    -mem          show the mean memory usage for running jobs, the total requested memory for queued jobs, in GB
    -memx         show more memory info for running jobs: reserved, mean, vmem and maxvmem (slow)
    -memr         show more memory info for running jobs: reserved, mean, maxvmem and res/mxvmem (slow)
    -io           show the I/O usage
    +lic          show license info (default), although not in detailed outputs
    -lic          do not show license info

    -noheader     do not show the header
    -nofooter     do not show the footer
    -raw          print raw values (for easier parsing)
    -queue QSPEC  limit to jobs in queue QSPEC (RE ok)
    -oresF  VAL   change the overreserved factor to VAL
    -osubT  VAL   change the oversubscribed threshold to VAL, in %
    -osubE  VAL   set the oversubscribtion excess threshold to VAL, in hr x slots
    -ineffT VAL   change the inefficient threshold to VAL, in %
    -ageT   VAL   change the minimum age threshold to VAL, in hour

    -v            verbose
    -warn         show warnings
    -wide         wide output (132 cols, implies -notty)
    -notty        do not use the width of the terminal, as returned by stty
    -check-load   check the instantanous load (-hiload only)

shorthands:
    +a   is expanded to -a -u $USER -nofooter             show all your jobs
    +a%                 -a -u $USER -nofooter -age -cpu%  ibidem, with cpu on % of age
    +ax%                -a -u $USER -nofooter -age -cpu% -load -sua -mem -io
    +ar%                -a -u $USER -nofooter -age -cpu% -memr -io
    +r                  -r -u $USER -nofooter             show all your running jobs
    +r%                 -r -u $USER -nofooter -age -cpu%  ibidem, with cpu on % of age
    +rx%                -r -u $USER -nofooter -age -cpu% -load -sua -memx -io
    +rr%                -r -u $USER -nofooter -age -cpu% -memr -io
    +q                  -q -u $USER -nofooter             show all your queued jobs
    +q%                 -q -u $USER -nofooter -age        ibidem but show age
    +qx%                -q -u $USER -nofooter -age -mem   ibidem plus mem info
    +X                  -X -u $USER -nofooter             show all your extra jobs
    +n                  -notty -wide -noheader -nofooter
qstat+: Ver 3.9/2 - Oct 2019

qstat+ -examples
examples:
 qstat+ -a -u hpc                   show all of hpc's jobs
 qstat+ -r -cpu% -u hpc             show all of hpc's running jobs, cpu in %
 qstat+ -r -cpu% -load -sua -u hpc  show all of hpc's running jobs, cpu in %, and
                                the nodes' load and slot usage/availability
 qstat+ -q                          show all the queued jobs
 qstat+ -j 8683280,8683285          show info on specific job IDs
 qstat+ -nlist 8683280,8683285      show nodes list for specific job IDs
 qstat+ -hiload 1.5                 show nodes whose load is 1.5 greater than the number of slots used
 qstat+ -ineffT 50 -ineff           show jobs that are below a 50% efficiency threshold
 qstat+ -osubT 200 -ageT 48 -osub   show jobs that are above a 200% usage threshold and are older than 48 hours
 qstat+ -gc                         show the global cluster status
 qstat+ -es                         show the empty slots
 qstat+ -down                       show which nodes are down

 +ax% -u all        is equiv to -a -u all -cpu% -load -sua -mem
  etc... for the +XXX shorthands

qstat+: Ver 3.9/2 - Oct 2019
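
The thresholds quoted in the help text (cpu% = CPU/age/#PE, inefficient below 33%, oversubscribed above 133%, both only for jobs older than 1 hr) can be made concrete with a little arithmetic. The sketch below is not taken from qstat+: the function names cpu_pct and classify_job are made up for illustration, and it assumes that CPU time and age are both given in seconds and that the age is not zero.

```shell
#!/bin/sh
# Illustration only: reproduces the default thresholds quoted in the
# qstat+ help text; this is not the code used by qstat+.

# cpu_pct CPU_SECONDS AGE_SECONDS NSLOTS
# Print CPU/age/#PE as a percentage, with one decimal.
cpu_pct() {
    awk -v cpu="$1" -v age="$2" -v pe="$3" \
        'BEGIN { printf "%.1f\n", 100 * cpu / age / pe }'
}

# classify_job CPU_SECONDS AGE_SECONDS NSLOTS
# Print "inefficient", "oversubscribed" or "ok"; jobs that are not yet
# older than 1 hr are reported as "too-young" (hypothetical label).
classify_job() {
    awk -v cpu="$1" -v age="$2" -v pe="$3" 'BEGIN {
        if (age <= 3600) { print "too-young"; exit }
        pct = 100 * cpu / age / pe
        if (pct < 33)       print "inefficient"
        else if (pct > 133) print "oversubscribed"
        else                print "ok"
    }'
}

cpu_pct 7200 7200 4         # 2 hr of CPU, 2 hr old, 4 slots: prints 25.0
classify_job 7200 7200 4    # prints inefficient
classify_job 21600 7200 1   # 300% on 1 slot: prints oversubscribed
```

The options -ineffT, -osubT and -ageT listed above change the corresponding thresholds in qstat+ itself; in this sketch they are hard-coded to the defaults.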

2 A "better" qacct: qacct+

  • The accounting information is ingested at regular intervals into a database.
    • The ingestion is run every five minutes, so the database is not instantaneously synchronized.
  • That database can be queried with qacct+.
    • The output of qacct+ can be better controlled than the output of qacct, and
    • qacct+ computes derived values.
  • How to use qacct+ is explained in the man page (man qacct+) and is described by:

   % qacct+ -help

or

   % qacct+ -show help

Namely:

qacct+ -help
usage: qacct+ [options] where options are  
  
                           limit to
         -j     <jobid>    given job  ID(s),    single value, comma separated list, or range like in n:m
         -t     <taskid>   given task ID(s),    single value or range like in n:m
         -o     <owner>    given owner,         exact match
         -q     <queue>    given queue,         regexp OK, like ".ThC.q"
         -N     <name>     given job name,      regexp OK, like "crunch*"
         -pe    <pe>       given pe,            exact match, like mthread
         -xSQL  <string>   extra SQL constrain        like 'slots <= 2'

         -nSearch <n>      jobs that ended >= (now - n) days, def. 31
         -since <date>     jobs submitted  >= date,   like "4/27/2016 10:00AM"
         -from  <date>     jobs that ended >= date,   like "4/27/2016 10:00AM"
         -until <date>     jobs submitted  <= date,   like "4/27/2016 11:00AM"
         -to    <date>     jobs that ended <= date,   like "4/27/2016 11:00AM"

         -all              show all that match, not just the latest
         -max   <number>   max number of entries to show

         -v[=value]        verbose mode (repeat to increase verbosity, -v=2 equiv to -v -v)
         -n                dry run, use w/ -v to check the resulting SQL search
         -dateFmt <n>      how to format dates, n=-1,0,1 for YYYY/MM/DD, Mon DD YYY, MM/DD/YYYY, def. 0

         -db    <dbname>   use the given DB [engine:name]
         -init  <file>     mysql initialization file, def. /scratch/admin/mysql/qacct+/.my.cnf, /home/sylvain/.my.cnf

         -show  <string>   specify what to show          def.: -show simple
                     where <string> can be "simple", "simple+", "tab", "tab+", or "raw",
                                        or "help", "fields" or "stats",
                                        or a custom specification
                           use -show help to get additional help on this option
         -wdt   <number>   width of columns in tabular mode def: 15)

         -help             shows this help

and by:

qacct+ -show help
  -show  <string> specify what to show  
  
       The  default  output (-show) is 'simple', it is a keyed output (i.e., field = output) with only a subset of the available
       fields, in a "friendly" output.

       There are three other preset output formats: 'simple+', 'tab', 'tab+', 'raw' and 'stats'.

          'simple+' is an extension of 'simple': more fields,
          'tab'     is a tabular output: one line per accounting entry,
          'tab+'    is an extension of 'tab': more fields,
          'raw'     returns everything available with no formatting,
          'stats'   return statistics about the DB.

       Moreover, you can customize the output, by specifying which fields to print out and how to format them.

       The option '-show fields' lists all the available fields.

       The option '-show help' explains how to use the -show <string> option to produce a custom output,  where  <string>  is  a
       list what to show and optionally how to format it.

       This string is either
           +field1[=format1][,field2[=format2]]  for keyed format
         or
           #field1=[format1][,field2[=format2]]  for tabular format

       the  format  is either a C format specification (see man printf) like %d, %s, %.1f, or the values @DATE, @MEM, or @AGE to
       convert the value to a date, a memory or an age. Examples are:

           -show #qname,slots,elapsed_time,cpu=%-15.1f,granted_pe
         or
           -show +qname,slots,elapsed_time=@AGE,cpu=%.1f
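
The custom -show string described above is a leading + (keyed) or # (tabular) followed by a comma-separated list of field[=format] items. As an illustration of that syntax, here is a minimal sketch of a helper that assembles such a string; show_spec is a hypothetical name, it only builds text and does not call qacct+.

```shell
#!/bin/sh
# show_spec MODE FIELD[=FORMAT]...
#   MODE is "keyed" (leading +) or "tab" (leading #).
# Hypothetical helper: prints the string to pass to -show.
show_spec() {
    case "$1" in
        keyed) prefix='+' ;;
        tab)   prefix='#' ;;
        *)     echo "show_spec: mode must be keyed or tab" >&2; return 1 ;;
    esac
    shift
    spec=$prefix
    sep=
    for field in "$@"; do
        spec="$spec$sep$field"   # join the items with commas
        sep=,
    done
    printf '%s\n' "$spec"
}

# The two examples from the help text:
show_spec tab qname slots elapsed_time 'cpu=%-15.1f' granted_pe
show_spec keyed qname slots 'elapsed_time=@AGE' 'cpu=%.1f'
```

In sh and bash an unquoted word that starts with # begins a comment, so a tabular specification should be quoted when used in a script, for instance -show '#qname,slots' or -show "$(show_spec tab qname slots)".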

  • The back-end database server was changed to MySQL to speed it up.

3 A "better" quota: quota+

  • The Linux command quota does not work on the GPFS or the NAS file systems.
  • quota+ reports disk quota information on all the disks (NFS, GPFS or NAS).

quota+ help
quota+ [options]
 where options are:
  -u|--user user        return quotas for given user (must be root)
  -v|--verbose          display quotas on filesystems where no storage is allocated
  -%                    show Use% instead of Used
  +%                    show Use% as well as Used
  -f filesys            return quotas for given file system only
  -device               show device name only
  +device               show mount point and device name
  -terse                show only Used/Use% and Quota Limit, disable +%

Ver 2.2/1 Nov 2019

4 Checking a Compute Node: rtop+

rtop+: a script to run the command top on a compute node (aka remote top.)

  • The Un*x command top can be used to look at what processes are running on a given machine (it reports the "top" processes running at any time).
  • To check what processes are running on a compute node (to check CPU and/or memory usage), you can use:
       % rtop+ [-u <username>] [-<number>] NN-MM
    as in
       % rtop+ -u hpc -50 43-05

    and you will see the first <number> lines of the output of top, limited to the processes owned by <username>, on the compute node compute-NN-MM;
    in the second example, the first 50 lines when running top, limited to user hpc, on compute-43-05.
    If you omit -<number> you will see only the first 10 lines; if you omit -u <username> you will see everybody's processes.

Check man top to better understand the output of top.
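
The NN-MM shorthand stands for the host name compute-NN-MM. A minimal sketch of that expansion, for use in your own scripts, is shown below; node_name is a hypothetical helper (not part of tools/local) and it assumes that both parts of the shorthand are always two digits.

```shell
#!/bin/sh
# node_name SHORTHAND
# Expand "NN-MM" to "compute-NN-MM"; a name that already starts with
# "compute-" is passed through unchanged.
# Assumption: NN and MM are always two digits each.
node_name() {
    case "$1" in
        compute-*)             printf '%s\n' "$1" ;;
        [0-9][0-9]-[0-9][0-9]) printf 'compute-%s\n' "$1" ;;
        *) echo "node_name: expected NN-MM, got '$1'" >&2; return 1 ;;
    esac
}

node_name 43-05    # prints compute-43-05
```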

5 Checking Memory and CPU Usage of Jobs in the High Memory Queues

  • plot-qmemuse: a tool to plot the memory and CPU usage of jobs that ran recently or are running in the high memory queues.
  • We monitor the jobs running in the high-memory queues, taking a usage snapshot every five minutes. This tool applies only to jobs in the high-memory queues.
  • The resulting statistics can be used to visualize the resource usage of a given job with the command plot-qmemuse, using

   % plot-qmemuse <jobid>

 or

   % plot-qmemuse <jobid>.<taskid>

For that command to run, you must load the gnuplot module first. By default this tool produces a plot in an 850x850 pixel png file.

You can specify the following options:

   -l <label>      to add your own label on the plot
   -s <size>       to specify the plot size, in pixels (-s 1200 for a 1200x1200 plot)
   -o <filename>   to specify the name of the png file
   -x              to plot on the screen (using X11, assuming your connection to hydra allows X11)

You can view the plot in the png file with the command display <filename>, assuming that your connection to hydra allows X11, or you can copy that file to your local machine and view it with your browser or a png-compatible image viewer (like xv).

6 List of Local Tools

The following tools are available when loading the module tools/local:

chage+                    substitute for chage, to query LDAP properties
                          e.g.: chage+ $USER
check-disks-usage         check disk usage, and print a warning when usage exceeds a threshold
                          e.g.: check-disks-usage -w 10
check-gpuse               print the cluster GPU usage, for the hosts in the GPU host group
check-hi-memuse           check memory use versus reservation and CPU usage efficiency, for a specific job or jobs
check-memres              check for jobs that use less memory than the amount reserved
check-memuse              print the cluster usage (memory and CPUs), for the hosts in a specific host group
check-qlogs               return a report on either oversubscribed or inefficient jobs, by parsing existing log files
check-qwait               show job(s) waiting in the queue and the associated queue quota limits
disk-usage                return information on disk usage
find-all-zombies          find zombies
get-gpu-info              print information about the GPUs on the local host or a specific compute node
monitor-code-usage        helps you monitor your code usage
parse-disk-quota-reports  return a report by parsing the disk quota report
plot-disk-usage           plot the usage of a given disk (volume)
plot-qmemuse              plot memory and cpu usage as a function of time for a given jobID
plot-qsnapshot            plot a snapshot of cluster usage
plot-qssduse              plot SSD usage as a function of time for a given jobID
plot-qssduse-summary      plot SSD usage summary as a function of time
plot-uptime               plot the load (uptime) of the head and login nodes
qacct+                    a "better" qacct: show accounting information for completed jobs
qchain                    chains a set of jobs by adding -hold_jid <jobID> to qsub for you
qhost+                    convenience wrapper around qhost to use the shorthand "NN-MM" instead of typing "-h compute-NN-MM"
qquota+                   a "better" qquota: show queue quota wrt queue limits
qstat+                    a "better" qstat: show queue status
quota+                    a "better" quota: show disk quota information for all types (NFS, GPFS, NAS)
q-wait                    wait until some jobs are not found in the queue
rkill                     remote kill
rkillall                  remote killall
rtop+                     remote top
show-ge-reporting         parse and print part of the GE reporting file
show-qmemuse              show memory use statistics
show-qslots               shows how many slots are available in the queue(s)
show-qssduse              log statistics for plot-qssduse

Each tool has a man page, accessible after you load the module tools/local.
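
To illustrate what chaining jobs with -hold_jid means, here is a minimal sketch of the idea. It is not the code of qchain: chain_jobs and the QSUB variable are made-up names, and the sketch relies only on the standard Grid Engine options qsub -terse (print just the job ID) and qsub -hold_jid (hold a job until the listed job has completed).

```shell
#!/bin/sh
# chain_jobs SCRIPT...
# Submit the job scripts in order; each job after the first is held
# until the previous one has completed.
# QSUB is a hypothetical override, handy for a dry run with a stub.
chain_jobs() {
    prev=
    for script in "$@"; do
        if [ -n "$prev" ]; then
            prev=$(${QSUB:-qsub} -terse -hold_jid "$prev" "$script") || return 1
        else
            prev=$(${QSUB:-qsub} -terse "$script") || return 1
        fi
        echo "submitted $script as job $prev"
    done
}

# Example (needs qsub, so it is left commented out):
#   chain_jobs step1.job step2.job step3.job
```

This sketch assumes plain (non-array) jobs; for array jobs qsub -terse also prints the task range after the job ID, which would need to be stripped before passing it to -hold_jid.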

7 List of Local+ Tools

The following tools are available when loading the module tools/local+:

backup                    back up a file
                          i.e.: mv or cp file to file.<n>
centos-version            print the CentOS version
check-hosts               print the cluster usage (memory and CPUs), aggregated by logical rack
check-qacct               show statistics of resource usage for completed jobs
dus-report                run and parse du to produce a disk usage report
elapsed                   print the elapsed time between each call
                          e.g.: elapsed; ...do something...; elapsed
fixFmt                    format a number with a fixed number of digits
                          csh:    @ n = 1; set N = `fixFmt 3 $n`
                          [ba]sh: n=1; N=`fixFmt 3 $n`
                                  echo n=$n N=$N → n=1 N=001
get-jobhr                 tool to retrieve a job's hard resources (mem_res h_data h_vmem) in a job script
lsth                      show the most recent files
                          i.e.: lsth [-40] [<spec>] → ls -lt <spec> | head -40
lswc                      count the number of files in directories
                          i.e.: lswc dir/ → ls dir/ | wc -l
noX                       tool to unset DISPLAY (saved in XDISPLAY)
p-wait                    wait until the given PID has completed
                          usage: p-wait [check-time] <PID>
pawk                      print with awk
                          i.e.: pawk 1,hello,3 → awk '{print $1,"hello",$3}'
print-proc-memory         print nicely the content of /proc/memory
procinfo                  print properties of the local machine (#CPUs, memory, OS)
procinfo+                 print properties of the local machine (#CPUs, memory, OS, etc...)
tails                     run tail [options] on a set of files
                          e.g.: "tail -3 *pl" fails, "tails -3 *pl" is OK
total                     compute the total of the values in a given column of a file
useX                      tool to reset DISPLAY (from XDISPLAY), reverting the effect of noX
xterm-config              tool to configure xterm window properties

Each tool has a man page, accessible after you load the module tools/local+.
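
Several of these tools are thin shortcuts for standard commands, as the "→" expansions in the table suggest. The sketch below gives rough stand-ins built only from those expansions, for illustration or for use on a machine where the module is not available. The names fix_fmt, ls_wc and pawk_prog are made up, and the actual tools may handle options and corner cases differently.

```shell
#!/bin/sh
# Rough stand-ins for fixFmt, lswc and pawk; not the Hydra tools.

# fix_fmt WIDTH N : zero-pad the integer N to WIDTH digits.
# Caveat: printf reads a leading 0 as octal, so pass N without
# leading zeros.
fix_fmt() {
    printf "%0${1}d\n" "$2"
}

# ls_wc DIR : count the entries listed by ls in DIR.
ls_wc() {
    ls "$1" | wc -l | tr -d ' '
}

# pawk_prog SPEC : expand a spec like 1,hello,3 into an awk program;
# numeric items become fields ($1), anything else a quoted string.
pawk_prog() {
    printf '%s\n' "$1" | awk -F, '{
        out = ""
        for (i = 1; i <= NF; i++) {
            item = ($i ~ /^[0-9]+$/) ? "$" $i : "\"" $i "\""
            out = out (i > 1 ? "," : "") item
        }
        print "{print " out "}"
    }'
}

fix_fmt 3 1                                    # prints 001
pawk_prog 1,hello,3                            # prints {print $1,"hello",$3}
echo 'a b c' | awk "$(pawk_prog 1,hello,3)"    # prints a hello c
```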



Last Updated   SGK.
