- The cluster, known as Hydra, is made of
- two login nodes,
- one front-end node,
- a queue manager/scheduler (the Grid Engine or GE), and
- a slew of compute nodes.
- To access the cluster you must log on to one of the two login nodes (see below), using ssh.
- From either login node you submit and monitor your job(s) via the queue manager/scheduler.
- The queue manager/scheduler is the Grid Engine, or simply GE (aka SGE, for Sun Grid Engine).
- The Grid Engine runs on the front-end node (hydra-4.si.edu), hence the front-end node should not be used as a login node.
- All the nodes (login, front-end and compute nodes) are interconnected
- via Ethernet (at 1Gbps, or 10Gbps), and
- via InfiniBand (at 40Gbps or higher).
- The disks are mounted off a dedicated device (aka appliance, server: a NetApp filer) connected to all the nodes via a 1Gbps network switch with two 10Gbps up-links.
The following figure is a schematic representation of the cluster:
Note that it does not represent the actual physical layout.
2. Access to the Cluster
- To access the Hydra cluster you will need
- an account;
- a secure connection to log in to either login node; and
- an ssh client (i.e., a program compatible with the secure shell protocol, aka ssh).
- SAO users should request an account following the
Accounts on Hydra are separate from
SI's Active Directory, or
- Non-SAO users should request an account by emailing Rebecca Dikow (DikowR@si.edu).
Accounts on Hydra are separate from
SI's Active Directory, or
- You can connect to the login nodes using ssh only from a trusted machine.
- Trusted machines are computers connected via a hardwired network link to SI's or SAO's network and managed by OCIO or by the
- Any other computer, like your laptop, a "self-managed" computer, a computer at another institution, or a computer connected via WiFi, must be authenticated using VPN.
- Information on VPN authentication is available elsewhere:
- SAO users can connect to the cluster using ssh directly from either:
- login.cfa.harvard.edu (CF users)
- pogoN.cfa.harvard.edu (HEA users), where N is 1, 2, ..., 6.
or, of course,
- a trusted machine (like a CF- or HEA-managed machine).
- Linux users, use ssh [<username>@]<host>, where
- <username> is your username on Hydra, needed only if it differs from the one on the computer you are connecting from; and
- <host> is the name of either login node, i.e.,
- MacOS users can use ssh [<username>@]<host>, as explained above, from a terminal:
- Open the Terminal app by going to
- You can get to the Utilities folder by going to the Go menu in the Finder.
- PC/Windows users need to install an ssh client. Public domain options are:
See also the Comparison of SSH clients Wikipedia page.
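As a rough illustration of the ssh syntax described above, the small shell function below assembles the command line for a given host and an optional username. The function name hydra_ssh_cmd, the username jdoe and the host name login-node.example.org are placeholders made up for this sketch (they are not real Hydra names); substitute the name of one of the login nodes. The function only prints the command, so you can check it before running it.

```shell
#!/bin/sh
# Build (and print) the ssh command for a login node.
# Usage: hydra_ssh_cmd <host> [<username>]
# NOTE: hydra_ssh_cmd is a hypothetical helper, not a tool installed on the cluster.
hydra_ssh_cmd() {
    host="$1"
    user="${2:-}"
    if [ -z "$host" ]; then
        echo "usage: hydra_ssh_cmd <host> [<username>]" >&2
        return 1
    fi
    if [ -n "$user" ]; then
        # the username is only needed if it differs from your local one
        echo "ssh ${user}@${host}"
    else
        echo "ssh ${host}"
    fi
}

# Placeholder host and user names: replace them with your own.
hydra_ssh_cmd login-node.example.org jdoe
hydra_ssh_cmd login-node.example.org
```

Remember that the connection will only succeed from a trusted machine, or once you are authenticated via VPN.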
3. Using the Cluster
- To run a job on the cluster you will need to:
- install the required software (unless it is already installed);
- copy the data your job needs to the cluster; and
- write at least one (minimal) script to run your job.
Indeed, the login nodes are for interactive use only, like editing, compiling, testing, checking results, etc., and, of course, submitting jobs.
The login nodes are not compute nodes (neither is the front-end node), and therefore they should not be used for actual computations, except for short interactive debugging sessions or short ancillary computations.
The compute nodes are the machines (aka hosts, nodes, servers) on which you run your computations, by submitting a job, or a slew of jobs, to the queue system (the Grid Engine or GE).
This is accomplished via the qsub command, from a login node, and using a job script.
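A minimal sketch of such a job script, and of its submission, might look like the following. The file name hello.job, the job name hello and the log file hello.log are made up for this example. The embedded "#$" lines use standard Grid Engine qsub options (-S, -cwd, -j y, -N, -o), but the options actually required on Hydra (queue, memory, etc.) are not shown here and are an open assumption of this sketch.

```shell
#!/bin/sh
# Write a minimal job script to the (hypothetical) file hello.job.
cat > hello.job <<'EOF'
#!/bin/sh
# Lines starting with '#$' are options read by qsub:
#   -S   shell used to run the job
#   -cwd run the job in the directory it was submitted from
#   -j y merge the error stream into the output stream
#   -N   name of the job
#   -o   file receiving the job's output
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -N hello
#$ -o hello.log
#
echo "job started on $(uname -n)"
# ... your actual computation goes here ...
echo "job done"
EOF

# Sanity check: the job script is plain shell, so it can be tried briefly by hand.
sh hello.job

# From a login node you would then submit it with:
#   qsub hello.job
# and monitor it with:
#   qstat
```

Since qsub treats the "#$" lines as options while the shell treats them as comments, the same file can be tested by hand and then submitted unchanged.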
Do not run computations on the login or front-end nodes, and do not run jobs out-of-band; this means:
- do not log on to a compute node to manually start a computation, always use qsub;
- do not run scripts/programs that spawn additional tasks, or use multi-threads, unless you have requested the corresponding resources;
- if your script runs something in the background (it shouldn't), use the command wait so your job terminates only when all the associated processes have finished;
- to run multi-threaded jobs, read and follow the relevant instructions;
- to run parallel jobs (MPI), read and follow the relevant instructions;
You don't start MPI jobs on the cluster the way you do on a stand-alone computer (or laptop).
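To illustrate the point about background processes, here is a small sketch of a script that starts two tasks in the background and then calls wait. The function task and the file tasks.log are invented for the example, and sleep stands in for real work; a real job doing this would also have to request the corresponding resources.

```shell
#!/bin/sh
# Two stand-in tasks run in the background; 'sleep' and 'echo' replace real work.
# (A job that runs several tasks at once must request the matching resources.)
task() {
    sleep 1
    echo "task $1 finished" >> tasks.log
}

rm -f tasks.log
task A &
task B &

# Without this 'wait' the script (and so the job) could end while
# the background tasks are still running.
wait
echo "all background tasks have finished"
```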
- you will need to write a script, even if trivial, to submit a job (there is a tool to help you do that);
- you should optimize your codes with the appropriate compilation flags for production runs;
- you will need to specify multiple options when submitting your jobs, via the command qsub;
- things do not always scale up: when you submit a lot of jobs (in the hundreds) that will run concurrently (at the same time), ask yourself:
- is there a name space conflict? The jobs should not all write to the same file, and their output files should not all have the same name;
- what will the resulting I/O load be? do all the jobs read the same file(s), do they write a lot of (useless?) stuff?;
- how much disk space will I use? Will my job fill up my allotted disk space? Is the I/O load high compared to the CPU load?
- how much CPU time and memory does my job need? Jobs run in queues, these queues have limits.
- Checkpointing: computers do crash, networks go down and jobs get killed when they exceed limits, so
- whenever possible, and especially for long jobs, you should save intermediate results so you can resume a computation from where it stopped.
- This is known as checkpointing.
- If you use some third party tool, verify if it uses checkpointing and how to enable it.
- If you run your own code, you should include checkpointing for long computations.
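As an illustration only, the idea of checkpointing can be sketched in a few lines of shell: a loop records the last completed step in a file, and a restarted run picks up from there. The file names checkpoint.txt and results.txt and the functions do_step and run_steps are hypothetical; a real code would save its actual intermediate results, not just a counter.

```shell
#!/bin/sh
# Checkpointing sketch: a loop of 10 steps that records the last completed
# step in a file, so that a restarted job resumes where it stopped.
CKPT=checkpoint.txt
LAST=10

do_step() {
    # stand-in for one unit of real work
    echo "step $1" >> results.txt
}

run_steps() {
    # resume after the saved step if a checkpoint exists, else start at 1
    if [ -f "$CKPT" ]; then
        start=$(( $(cat "$CKPT") + 1 ))
    else
        start=1
    fi
    stop=${1:-$LAST}    # optional argument: stop early (simulates a crash)
    i=$start
    while [ "$i" -le "$stop" ]; do
        do_step "$i"
        # write to a temporary file, then rename it, so that a crash
        # cannot leave a half-written checkpoint behind
        echo "$i" > "$CKPT.tmp" && mv "$CKPT.tmp" "$CKPT"
        i=$(( i + 1 ))
    done
}

rm -f "$CKPT" results.txt
run_steps 4    # first run is interrupted after step 4
run_steps      # second run resumes at step 5 and finishes
```

The second call does not repeat steps 1 to 4, which is the whole point: only the work done since the last checkpoint is lost when a job is killed.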
The cluster is a shared resource: when a resource gets used, it is unavailable to others, hence:
- clobbering disk space prevents others from using it:
- trim down your disk usage;
- clean up your disk usage after your computation;
- the disks on the cluster are not for long term storage;
- move what you don't use on the cluster back to your "home" machine.
- Running un-optimized code wastes CPU cycles and effectively delays/prevents other users from running their analyses.
- Reserving more memory than you need will effectively delay/prevent other users from using it.
- Fair use: we have implemented resource limits; do not bypass these limits. If needed, feel free to contact us to review how these limits impact you.
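One simple way to keep an eye on your disk usage is sketched below, using the standard du command. The helper names usage_kb and check_usage and the threshold value are invented for this example; they are not tools or limits defined on the cluster.

```shell
#!/bin/sh
# Report the disk space used under a directory, in kilobytes.
# usage_kb is a hypothetical helper; 'du -sk' is a standard (POSIX) command.
usage_kb() {
    du -sk "${1:-.}" | awk '{print $1}'
}

# Warn when a directory uses more than a given (made-up) threshold, in kB.
check_usage() {
    dir="$1"
    limit_kb="$2"
    used=$(usage_kb "$dir")
    if [ "$used" -gt "$limit_kb" ]; then
        echo "WARNING: $dir uses ${used} kB (more than ${limit_kb} kB)"
    else
        echo "OK: $dir uses ${used} kB"
    fi
}

# Example: check the current directory against a 1 GB threshold.
check_usage . 1000000
```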
4. Software Overview
The cluster runs a Linux distribution that is specific to clusters: it is called Rocks, and we run version 6.3, with CentOS 6.9.
As for any Unix system, you must properly configure your account to access the system resources.
The ~/.profile and/or ~/.cshrc files need to be adjusted accordingly.
The configuration on the cluster is different from the one on the HEA-managed machines (for SAO users).
We have implemented the command module to simplify the configuration of your Unix environment.
You can look in the directory ~hpc/ for examples of configuration files (with ls -la ~hpc).
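For illustration, typical use of the module command is sketched below. The sub-commands avail, load and list are standard for the module tool, but MODNAME is only a placeholder: this sketch does not assume any particular module name exists on the cluster, so pick one from the output of module avail. The test on the presence of the command is there only so the snippet is harmless on a machine where module is not defined.

```shell
#!/bin/sh
# Typical 'module' usage, guarded so that the snippet does nothing harmful
# on a machine where the module command is not defined.
# MODNAME is a placeholder: set it to a name listed by 'module avail'.
MODNAME="${MODNAME:-}"

if command -v module >/dev/null 2>&1; then
    module avail                  # list the modules that can be loaded
    if [ -n "$MODNAME" ]; then
        module load "$MODNAME"    # configure your environment for that tool
    fi
    module list                   # show what is currently loaded
    have_module=yes
else
    echo "module command not found (are you logged on the cluster?)"
    have_module=no
fi
```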
- GNU compilers (gcc, g++, gfortran, g90)
- Intel compilers and the Cluster Studio (debugger, profiler, etc: ifort, icc, ...)
- Portland Group (PGI) compilers and the Cluster Dev Kit (debugger, profiler, etc: pgf77, pgcc, ...)
- MPI, for GNU, PGI and Intel compilers, w/ IB support;
- the libraries that come with the compilers;
- GSL, BLAS, LAPACK, ...
Available Software Packages
- We have 128 run-time licenses for IDL; GDL is available too.
- Tools like MATLAB, JAVA, PYTHON, R are available; and
- the Bioinformatics and Genomics support group has installed a slew of packages.
Refer to the Software pages for further details. Other software packages have been installed by users, or can be installed upon request.
The cluster is located in Herndon, VA and is managed by ORIS/OCIO (Office of Research Information Services/Office of the Chief Information Officer).
The cluster is supported by the following individuals:
- Jamal Uddin (UddinJ@si.edu), the system administrator (at OCIO, Herndon, VA). As the sys-admin, he is responsible for keeping the cluster operational and secure.
- Rebecca Dikow (DikowR@si.edu) provides Bioinformatics and Genomics support (Data Science Lab/OCIO). She is the primary support person for Bioinformatics and Genomics at SI.
- Matthew Kweskin (KweskinM@si.edu) - NMNH/L.A.B., IT specialist.
- Vanessa Gonzalez (GonzalezV@si.edu) - NMNH/GGI, biologist.
- Sylvain Korzennik (firstname.lastname@example.org), an astronomer at SAO (Cambridge, MA) with 20+ years of HPC experience.
As the SAO's HPC analyst, his primary role is to support SAO scientists.
He is also responsible for configuring and tuning the cluster, its queuing configuration, general Unix support, validation and documentation.
Support is also provided by other OCIO staff members (networking, etc.).
- For sys-admin issues (forgotten password, something is not working any more, etc.):
- All users should contact Jamal (& Sylvain) at SI-HPC-Admin@si.edu
- For application support (how do I do this?, why did that fail?, etc.):
- Please use these email addresses to let the SI/HPC support team address your issues as soon as possible.
A mailing list, called HPCC-L on SI's listserv (i.e., at HPCC-L@si-listserv.si.edu), is available to contact all the users of the cluster.
This mailing list is used by the support people to notify the users of important changes, etc.
- You can also use this mailing list to communicate with other cluster users, share ideas, ask for help with cluster use problems, offer advice and solutions to problems, etc.
- The mailing list is read by the cluster sysadmin and support staff as well as by other cluster users.
To email that list you must log in to the listserv and post your message. Note that:
- replies to these messages are by default broadcast to the entire list; and
- you will need to set up a password on this listserv the first time you use it (look in the upper right, under "Options").
Last updated SGK.