- Monitoring your Jobs
- How to Check on Jobs
- How to Delete/Kill a Job or Jobs
- Modifying Queued Jobs
- Checking on a Completed Job
- Additional Tool
- List of Local Tools
- Zombies
...
Command | Description |
---|---|
qstat -u $USER | shows only your jobs. |
qstat -u '*' | shows everybody's jobs. |
qstat -s r | shows only running jobs. |
qstat -s p | shows only pending jobs. |
qstat -r | shows also requested resources and the full job name. |
qstat -s r -u $USER -g t | shows master/slave info for parallel jobs |
qstat -s r -u $USER -g d | shows task-ID for array jobs |
qstat -j 4615585 | produces a more detailed output, for a specific job ID. |
qstat -explain E -j 4615585 | produces the explanation for the error state of a specific job, specified by its job ID. |
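As a rough illustration of how the qstat output above can be post-processed, the sketch below counts jobs per state. It assumes the default qstat layout (two header lines, with the state in the fifth column); the function name count_states is made up for this example and is not one of the local tools.

```shell
# Summarize job states from plain qstat output read on stdin.
# Assumption: two header lines, and the state code (r, qw, Eqw, ...) in column 5.
count_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ }
         END { for (s in n) printf "%s %d\n", s, n[s] }' | sort
}

# On the cluster (not run here):
#   qstat -u "$USER" | count_states
```

Each output line holds a state code followed by the number of jobs in that state.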
...
The job status abbreviations returned by qstat correspond to
...
will list the accounting information for a finished job with the given job ID.
It will show the following useful information:
...
Users who run jobs in the high-memory queues can use this to check whether they used close to the amount of memory they reserved.
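The comparison between used and reserved memory can be sketched as a small calculation. The helper names to_gb and mem_used_pct are hypothetical, and the sketch assumes that sizes are written with an upper-case K, M, G or T suffix (1024-based), as in a value like 3.047G; a value without a suffix is treated as bytes.

```shell
# Convert a size such as 512.000M or 3.047G to GB.
# Assumption: upper-case, 1024-based suffixes; no suffix means bytes.
to_gb() {
    awk -v v="$1" 'BEGIN {
        u = substr(v, length(v)); n = v + 0
        if (u == "K") n /= 1024 * 1024
        else if (u == "M") n /= 1024
        else if (u == "T") n *= 1024
        else if (u != "G") n /= 1024 * 1024 * 1024
        printf "%.3f\n", n
    }'
}

# Percentage of the reserved memory that was actually used.
# usage: mem_used_pct <maxvmem> <reserved>
mem_used_pct() {
    awk -v u="$(to_gb "$1")" -v r="$(to_gb "$2")" \
        'BEGIN { printf "%.0f\n", 100 * u / r }'
}

# mem_used_pct 6G 8G      prints 75
# mem_used_pct 512.000M 2G prints 25
```

A low percentage suggests that the reservation could be reduced for similar jobs.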
The qacct command can also be used to check on the past usage of a given user. For example:
...
will return the usage statistics of the user specified in <username>, over the past <ndays> days (use man qacct for more details).
You can get detailed information for each job that ran with the "-j" option, as follows:
...
The command egrep is used to print any line that contains one of the strings in the quoted list, separated by '|' (which means "or" in this context, see man egrep).
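For illustration, this kind of filtering can be wrapped in a small function. The sketch uses grep -E, which is equivalent to egrep, and a selection of field names picked only as an example; the function name acct_summary is hypothetical.

```shell
# Keep only selected fields from qacct -j output read on stdin.
# The '|' inside the quoted pattern means "or".
acct_summary() {
    grep -E '^(jobnumber|taskid|exit_status|ru_wallclock|maxvmem)[[:space:]]'
}

# On the cluster (not run here):
#   qacct -j <jobid> | acct_summary
```

Anchoring the pattern at the start of the line avoids matching the same words elsewhere in the output.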
This has been streamlined with the local tool check-jobs.pl - see below.
5.b A "better" qacct: qacct+
...
There is a set of tools written for Hydra to help monitor jobs and the cluster and do some simple operations.
To access them you must load the module tools/local. The command
% module help tools/local
lists the available tools (see table below). Each tool has a man page.
A few of these tools are described here in more detail.
6.a A "better" qstat: q+
qstat+.pl is a Perl wrapper around qstat: it runs qstat for you and parses its output to display it in a more friendly format.
...
% q+ -help
or
% q+ -examples
It can be used to
- get useful info like the age and/or the cpu usage (in % of the job age) of running jobs,
- get the list of nodes a parallel job is running on (if you haven't saved it),
- get a filtered version of qstat -j <jobid>, and
- get an overview of the cluster queues usage.
You can access more information about q+ with man qstat+ (once the man page has been written).
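The "CPU usage in % of the job age" figure mentioned above is a simple ratio, sketched below. This is not how q+ is implemented; the function name cpu_pct_of_age is made up, and dividing by the number of slots for parallel jobs is an assumption of this sketch.

```shell
# CPU time consumed as a percentage of the job age.
# usage: cpu_pct_of_age <cpu_seconds> <age_seconds> [slots]
# Assumption: for parallel jobs the age is multiplied by the slot count.
cpu_pct_of_age() {
    awk -v c="$1" -v a="$2" -v s="${3:-1}" 'BEGIN {
        if (a <= 0 || s <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * c / (a * s)
    }'
}

# cpu_pct_of_age 1800 3600    prints 50.0
# cpu_pct_of_age 7200 3600 4  prints 50.0
```

A value well below 100 for a running job may indicate that it is waiting on I/O or not using all of its slots.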
6.b Checking a Compute Node
rtop+: a script to run the command top on a compute node (aka remote top).
...
See the additional tools page.
Check man top to better understand the output of top.
6.c Checking Memory and CPU Usage of Jobs in the High Memory Queues
plot-qmemuse.pl: a tool to plot the memory and CPU usage of jobs that ran recently or are running in the high-memory queues.
We monitor the jobs running in the high-memory queues, taking a usage snapshot every five minutes. This tool only applies to jobs in those queues.
The resulting statistics can be used to visualize the resource usage of a given job with the command plot-qmemuse.pl, using
% plot-qmemuse.pl <jobid>
or
% plot-qmemuse.pl <jobid>.<taskid>
For that command to run, you must load the gnuplot module first. By default this tool produces a plot in an 850x850-pixel png file.
You can specify the following options:
Option | Description |
---|---|
-l <label> | to add your own label on the plot |
-s <size> | to specify the plot size, in pixels (-s 1200 for a 1200x1200 plot) |
-o <filename> | to specify the name of the png file |
-x | to plot on the screen (using X11, assuming your connection to hydra allows X11) |
You can view the plot in the png file with the command display <filename>, assuming that your connection to hydra allows X11, or you can copy that file to your local machine and view it with your browser or a png-compatible image viewer (like xv).
...
The following tools are available when loading the module tools/local:
...
backup
...
backup a file
...
fixFmt
...
format a number with fixed number of digits
...
...
lsth
...
show most recent files
...
lswc
...
count number of files in directories
...
pawk
...
print with awk
...
p-wait
...
wait until given PID has completed
...
q-wait
...
wait until N jobs in queue
...
procinfo
...
print properties of local machine (#CPUs, memory, OS)
...
tails
...
run tail [options] on a set of files
...
disk-usage.pl
...
report disk usage
...
dus-report.pl
...
run and parse du to produce a disk usage report
...
show-quotas.pl
...
show quota values
...
parse-quota-report.pl
...
parse the daily quota report
...
e.g.: qchain *.job or qchain '-N start first.job 123' '-N crunch reduce.job 123' '-N finish post-process.job 123'
...
qstat+.pl
...
a "better" qstat
...
q+
...
a shortcut (link) to qstat+.pl
...
qhost+
...
wrapper around qhost to use the shorthand N-M
...
qquota.pl
...
a wrapper around qquota
...
qacct+
...
a "better" qacct
...
rkill
...
remote kill
...
rkillall
...
remote killall
...
rtop+
...
remote top
...
gethr
...
alias to "source /home/hpc/sbin/get-hr.xxx", where xxx is sou or sh depending on your shell
...
get-hr.sou
...
csh tool to retrieve a job's hard resources (mem_res h_data h_vmem) in a job script
...
get-hr.sh
...
[ba]sh version
...
elapsed
...
alias to "source /home/hpc/sbin/elapsed.xxx", where xxx is sou or sh depending on your shell
...
e.g.: elapsed; [do stuff]; elapsed will show how long it took to do the stuff
...
elapsed.sou
...
csh tool to monitor elapsed time
...
elapsed.sh
...
[ba]sh tool to monitor elapsed time
...
noX
...
alias to "source /home/hpc/sbin/noX.xxx", where xxx is sou or sh depending on your shell
...
noX.sou
...
csh tool to unset DISPLAY (saved in XDISPLAY)
...
noX.sh
...
[ba]sh tool to unset DISPLAY (saved in XDISPLAY)
...
useX
...
alias to "source $bin/useX.xxx", where xxx is sou or sh depending on your shell
...
useX.sou
...
to reset DISPLAY (from XDISPLAY)
...
useX.sh
...
to reset DISPLAY (from XDISPLAY)
...
xterm-config.pl
...
tool to configure xterm window properties
...
check-memuse.pl
...
print the cluster usage (memory and CPUs), for hosts in a specific host group (default: @himem-hosts)
...
ckmem.pl
...
print the cluster usage (memory and CPUs), for all or only hosts in given racks
...
check-hosts.pl
...
print the cluster usage (memory and CPUs), aggregated by rack
...
qsnapshot.pl
...
display a snapshot of cluster usage, using characters
...
plot-qsnapshot.pl
...
plot a snapshot of cluster usage, using gnuplot
...
qckmem.pl
...
check the jobs in a (set of) queue(s) for memory use versus reservation and CPU usage efficiency
...
qavail.pl
...
show how many slots are available for a given host group or a type of queue
...
qmemuse.pl
...
log statistics for plot-qmemuse.pl
...
plot-qmemuse.pl
...
plot memory use (and efficiency) for recent jobs in hi-mem queue
...
monitor-mycode.csh
...
monitor your use of compute-N-M for a given program every TT minutes
...
print-proc-memory.pl
...
pretty-print the content of /proc/memory
...
Each tool has a man page, accessible after you load the module tools/local.
Some MPI codes, when killed, leave behind zombie processes.
(more on how to kill them)
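As a starting point, the sketch below lists processes in the zombie state from ps output. It assumes that "zombie" is meant in the strict Unix sense (state code starting with Z); leftover processes that are still running would need a different filter. The function name list_zombies is hypothetical.

```shell
# List zombie (state Z) processes from ps output read on stdin.
# Expected input columns: pid, ppid, stat, user, comm (with one header line).
# usage: ... | list_zombies [user]
list_zombies() {
    awk -v u="${1:-}" \
        'NR > 1 && $3 ~ /^Z/ && (u == "" || $4 == u) { print $1, $2, $5 }'
}

# On a compute node (not run here):
#   ps -eo pid,ppid,stat,user,comm | list_zombies "$USER"
```

Each output line shows the PID, the parent PID and the command name. A process in state Z cannot be signalled itself; it disappears once its parent reaps it or exits, which is why the parent PID is printed.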
...
Last Updated SGK.