1. Preamble

The cluster, known as Hydra, is made of
1. two login nodes,
2. one front-end node,
3. a queue manager/job scheduler (the UNIVA Grid Engine or UGE), and
4. a slew of compute nodes.

To access the cluster you must log on to one of the two login nodes (see below), using ssh:
- ssh hydra-login01.si.edu
  or
- ssh hydra-login02.si.edu

From either login node you submit and monitor your job(s) via the queue manager/scheduler.
The job scheduler is the Grid Engine, referred to simply as GE or UGE.
The Grid Engine runs on the front-end node (hydra-5.si.edu), hence the front-end node should not be used as a login node.
- There is no reason for users to ever log on to hydra-5.

All the nodes (login, front-end and compute nodes) are interconnected
- via Ethernet (at 10Gbps, aka 10GbE), and
- via InfiniBand (at 40Gbps or higher, aka IB).
The disks are mounted off three types of dedicated devices:
1. a NetApp filer for /home, /data, and /pool (via 10GbE);
2. a GPFS for /scratch (via IB); and
3. a NAS for /store (via 10GbE), a near-line storage only available on some nodes.

The following figure is a schematic representation of the cluster:

Note that it does not represent the actual physical layout.

2. Access to the Cluster

To access the Hydra cluster you will need
1. an account;
2. a secure connection to log in to either login node; and
3. an ssh client (i.e., a program compatible with the secure shell protocol, aka ssh).

Requesting Accounts

SAO users should request an account following the instructions posted on the CF's web page.
Accounts on Hydra are separate from CF, HEA, SI's Active Directory, or VPN accounts.
Non-SAO users should request an account by submitting a request through the SI Service Portal.
Accounts on Hydra are separate from SI's Active Directory or VPN accounts. Please make sure you have completed the Hydra Moodle Training course before you submit the request.

Secure Connection

You can connect to the login nodes using ssh only from a trusted machine.
Trusted machines are computers connected via a hardwired network link to SI's or SAO's network and managed by OCIO or by the CF or HEA (at SAO).
Any other computer, like your laptop, a "self-managed" computer, a computer at another institution, or a computer connected via WiFi, must be authenticated using VPN.
Information on VPN authentication is available elsewhere:
- from OCIO for SI users, and
- from the CF/HEA for SAO users.
SAO users can connect to the cluster using ssh directly from either:
- login.cfa.harvard.edu (CF users)
  or
- pogoN.cfa.harvard.edu (HEA users), where N is 1, 2, ..., 6.
  or, of course,
- a trusted machine (like a CF- or HEA-managed machine).

SSH Clients

Linux users, use ssh [<username>@]<host>, where
- <username> is your username on Hydra, needed only if it differs from your username on the computer you are ssh'ing from;
- <host> is the name of either login node, i.e., hydra-login01.si.edu or hydra-login02.si.edu.
MacOS users can use ssh [<username>@]<host>, as explained above, from a Terminal.
- Open the Terminal app by going to /Applications/Utilities and finding Terminal.
- You can get to the Utilities folder by going to the Go menu in the Finder and choosing Utilities.
PC/Windows users need to install an ssh client. Freely available options are:
- PuTTY
- Cygwin, and use ssh [<username>@]<host>.
- Note that Cygwin includes an X11 server.
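
The two forms of the ssh command described above (with and without a username) can be sketched as a small shell helper. This is only an illustration: the function name hydra_ssh_cmd and the username jdoe are placeholders, not something provided on the cluster, and the helper only prints the command rather than running it.

```shell
#!/bin/sh
# Print the ssh command for one of the two Hydra login nodes.
#   $1: login node number, 1 or 2 (default: 1)
#   $2: optional Hydra username ("jdoe" below is a placeholder)
hydra_ssh_cmd() {
    node="${1:-1}"
    user="${2:-}"
    case "$node" in
        1|2) host="hydra-login0${node}.si.edu" ;;
        *)   echo "login node must be 1 or 2" >&2; return 1 ;;
    esac
    if [ -n "$user" ]; then
        # Username on Hydra differs from the local one.
        echo "ssh ${user}@${host}"
    else
        # Same username on both machines: the user part can be omitted.
        echo "ssh ${host}"
    fi
}

hydra_ssh_cmd 1         # prints: ssh hydra-login01.si.edu
hydra_ssh_cmd 2 jdoe    # prints: ssh jdoe@hydra-login02.si.edu
```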

3. Using the Cluster

To run a job on the cluster you will need to:
1. install the required software (unless it is already installed);
2. copy the data your job needs to the cluster; and
3. write at least one (minimal) script to run your job.

The login nodes are for interactive use only: editing, compiling, testing, checking results, etc., and, of course, submitting jobs.

The login nodes are not compute nodes (neither is the front-end node), and therefore they should not be used for actual computations, except for short interactive debugging sessions or short ancillary computations.

The compute nodes are the machines (aka hosts, nodes, servers) on which you run your computations, by submitting a job, or a slew of jobs, to the queue system (the Grid Engine or GE).

This is accomplished via the qsub command, from a login node, and using a job script.

Do not run computations on the login or front-end nodes, and do not run jobs out-of-band. This means:

- do not log on to a compute node to manually start a computation; always use qsub;
- do not run scripts/programs that spawn additional tasks, or use multiple threads, unless you have requested the corresponding resources;
- if your script runs something in the background (it shouldn't), use the command wait so your job terminates only when all the associated processes have finished;
- to run multi-threaded jobs, read and follow the relevant instructions;
- to run parallel jobs (MPI), read and follow the relevant instructions:
  you don't start MPI jobs on the cluster the way you do on a stand-alone computer (or laptop).
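
As an illustration of the points above, here is a minimal sketch of a job script that starts one task in the background and uses wait before exiting. The lines starting with "#$" are embedded Grid Engine options (-S, -cwd, -j, -N and -o are standard qsub options); the job name wait_demo, the file names, and the work function are placeholders. Resource requests (queue, memory, etc.) are specific to Hydra and are deliberately left out, so treat this as a starting point, not a production script.

```shell
#!/bin/sh
# Embedded Grid Engine options, read by qsub:
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -N wait_demo
#$ -o wait_demo.log

# Placeholder for the real computation.
work() {
    sleep 1
    echo "work finished"
}

# Start the task in the background (normally you should not need to).
work > work.out 2>&1 &
pid=$!
echo "started background task $pid"

# Without this wait the script, and thus the job, could end while the
# background task is still running.
wait "$pid"
status=$?
echo "background task exited with status $status"
```

Assuming the script is saved as wait_demo.job (a hypothetical file name), it would be submitted from a login node with qsub wait_demo.job.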

Remember that

- you will need to write a script, even if trivial, to submit a job (there is a tool to help you do that);
- you should optimize your codes with the appropriate compilation flags for production runs;
- you will need to specify multiple options when submitting your jobs, via the command qsub;
- things do not always scale up: when you submit a lot of jobs (in the hundreds) that will run concurrently (at the same time), ask yourself:
  1. is there a name space conflict? The jobs should not all write to the same file, and they should not all have the same name;
  2. what will the resulting I/O load be? Do all the jobs read the same file(s)? Do they write a lot of (useless?) stuff?
  3. how much disk space will I use? Will my job fill up my allotted disk space? Is the I/O load high compared to the CPU load?
  4. how much CPU time and memory does my job need? Jobs run in queues, and these queues have limits.
- checkpointing: computers do crash, networks go down, and jobs get killed when they exceed limits, so
  - whenever possible, and especially for long jobs, you should save intermediate results so you can resume a computation from where it stopped.
  - This is known as checkpointing.
  - If you use a third-party tool, check whether it supports checkpointing and how to enable it.
  - If you run your own code, you should include checkpointing for long computations.
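
The name space and checkpointing points above can be sketched together in a few lines of shell. This is a toy example under stated assumptions: JOB_NAME and SGE_TASK_ID are environment variables set by the Grid Engine inside a job, while CKPT_DIR, LAST_STEP, and the file naming scheme are hypothetical names introduced here for illustration. The "computation" is a stand-in (a running sum).

```shell
#!/bin/sh
# Sketch: per-task file names plus a simple checkpoint/restart loop.
# The defaults only let the sketch run outside the scheduler.
name="${JOB_NAME:-demo}"
task="${SGE_TASK_ID:-1}"
# SGE_TASK_ID is the string "undefined" for non-array jobs.
[ "$task" = "undefined" ] && task=1

# One checkpoint and one result file per task, so that concurrent
# tasks never write to the same file.
dir="${CKPT_DIR:-.}"
ckpt="$dir/$name.$task.ckpt"
result="$dir/$name.$task.out"
last_step="${LAST_STEP:-5}"

# Resume from the saved state if a checkpoint exists, else start fresh.
step=0
total=0
if [ -f "$ckpt" ]; then
    read -r step total < "$ckpt"
    echo "resuming $name.$task after step $step"
fi

while [ "$step" -lt "$last_step" ]; do
    step=$((step + 1))
    total=$((total + step))    # stand-in for the real computation
    # Write to a temporary file, then rename it, so that a crash in
    # mid-write cannot leave a truncated checkpoint behind.
    echo "$step $total" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
done

echo "$total" > "$result"
echo "done: total=$total"
```

If the script is killed and run again with the same name and task number, it picks up after the last completed step instead of starting over. The checkpoint is keyed on the job name and task number rather than on the job ID, since a resubmitted job gets a new ID.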

The cluster is a shared resource: when a resource gets used, it is unavailable to others, hence:

- clobbering disk space prevents others from using it:
  1. trim down your disk usage;
  2. clean up your disk usage after your computation;
  3. the disks on the cluster are not for long-term storage;
  4. move what you don't use on the cluster back to your "home" machine.
- running un-optimized code wastes CPU cycles and effectively delays/prevents other users from running their analyses;
- reserving more memory than you need will effectively delay/prevent other users from using it;
- fair use: we have implemented resource limits, do not bypass these limits. If needed, feel free to contact us to review how these limits impact you.
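
One way to keep an eye on your disk usage is sketched below, using only the standard du and find commands. The function name disk_report and the 30-day threshold are arbitrary choices made for this example; the cluster may well offer its own, better suited, usage-reporting tools.

```shell
#!/bin/sh
# Report the total size of a directory and list the files in it that
# have not been modified for more than a given number of days.
#   $1: directory to inspect (default: current directory)
#   $2: age threshold in days (default: 30)
disk_report() {
    dir="${1:-.}"
    days="${2:-30}"
    echo "Total usage of $dir:"
    du -sh "$dir"
    echo "Files not modified in the last $days days:"
    find "$dir" -type f -mtime +"$days" -print
}

disk_report . 30
```

Files listed by the second part are candidates for cleanup or for moving back to your "home" machine.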

4. Software Overview

The cluster runs a Linux distribution that is specific to clusters. We use Bright Cluster Manager (8.2) to deploy CentOS 7.6 (Core).

As for any Unix system, you must properly configure your account to access the system resources.
Your ~/.bash_profile, or ~/.profile, and/or ~/.cshrc files need to be adjusted accordingly.
The configuration on the cluster is different from the one on the CF- or HEA-managed machines (for SAO users).
We have implemented the command module to simplify the configuration of your Unix environment.
You can look in the directory ~hpc/ for examples of configuration files (with ls -la ~hpc).
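
A quick way to check what the module command offers is sketched below. The subcommands avail, list, and load are standard for the environment modules tool; the names of the modules installed on Hydra are not listed here, so the load line is left as a commented placeholder. The guard is there because module is typically defined only in login or interactive shells on the cluster, and the variable module_status is just a name used in this example.

```shell
#!/bin/sh
# Check whether the "module" command is defined in this shell.
if command -v module >/dev/null 2>&1; then
    module_status="available"
    # Show the first few available modules; some versions of
    # "module avail" write to stderr, hence the redirection.
    module avail 2>&1 | head -n 20
    # Show what is currently loaded.
    module list
    # module load <name>    # <name> must be one listed by "module avail"
else
    module_status="missing"
    echo "module command not found in this shell" >&2
fi
echo "module command: $module_status"
```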

Available Compilers

- GNU compilers (gcc, g++, gfortran, g90);
- Intel compilers and the Cluster Studio (debugger, profiler, etc.: ifort, icc, ...);
- Portland Group (PGI) compilers and the Cluster Dev Kit (debugger, profiler, etc.: pgf77, pgcc, ...).

Available Libraries

- MPI, for GNU, PGI and Intel compilers, with IB support;
- the libraries that come with the compilers;
- GSL, BLAS, LAPACK, ...

Available Software Packages

- We have 128 run-time licenses for IDL; GDL and FL are available too.
- Tools like MATLAB, Java, Python, R, Julia, etc. are available; and
- the Bioinformatics and Genomics support group has installed a slew of packages.

Refer to the Software pages for further details. Other software packages have been installed by users, or can be installed upon request.

5. Support

The cluster is located in Herndon, VA and is managed by ORCS/OCIO (Office of Research Computing Services/Office of the Chief Information Officer).

The cluster is supported by the following individuals:

DJ Ding (DingDJ@si.edu), the system administrator (at OCIO, Herndon, VA).
- As the sys-admin, he is responsible for keeping the cluster operational and secure.

Rebecca Dikow (DikowR@si.edu) provides Bioinformatics and Genomics support (Data Science Lab/OCIO, Washington, D.C.);
- she is the primary support person for Bioinformatics and Genomics at SI.

Matthew Kweskin (KweskinM@si.edu) - NMNH/L.A.B., IT specialist (Washington, D.C.).

Vanessa Gonzalez (GonzalezV@si.edu) - NMNH/GGI, biologist (Washington, D.C.).

Sylvain Korzennik (hpc@cfa.harvard.edu), an astronomer at SAO (Cambridge, MA);
- he is the primary support person for astronomers at SAO.

Support is also provided by other OCIO staff members (networking, etc...).

Simple Rules

For sys-admin issues: (something is not working any more, etc):
- All users should contact DJ (and Rebecca & Sylvain) at SI-HPC-Admin@si.edu
For application support: (how do I do this?, why did that fail?, etc):
- SAO users should contact Sylvain, not the CF or HEA support sys-admin groups, at hpc@cfa.harvard.edu,
- Non-SAO users should contact Rebecca, Vanessa & Matt at SI-HPC@si.edu
Password problems: go to the self-serve password page.
Please use these email addresses, rather than emailing individuals directly, to let the SI/HPC support team address your issues as soon as possible.

Mailing List

A mailing list, called HPCC-L on SI's listserv (i.e., at si-listserv.si.edu, hence HPCC-L@si-listserv.si.edu), is available to contact all the users of the cluster.

This mailing list is used by the support people to notify the users of important changes, etc.

You can also use this mailing list to communicate with other cluster users, share ideas, ask for help with cluster use problems, offer advice and solutions to problems, etc.
The mailing list is read by the cluster sysadmin and support staff as well as by other cluster users.

To email that list you must log in to the listserv and post your message. Note that:

- replies to these messages are by default broadcast to the entire list; and
- you will need to set up a password on this listserv the first time you use it (look in the upper right, under "Options").

Last updated 13 Nov 2019 SGK.