
  1. Preamble
  2. Access to the Cluster
  3. Using the Cluster
  4. Software Overview
  5. Support

1. Preamble

  • The cluster, known as Hydra-3, is made of
    1. two login nodes,
    2. one front-end node,
    3. a queue manager/scheduler (SGE), and
    4. a slew of compute nodes.
  • You access the cluster by logging on one of the two login nodes, using ssh:
    1. ssh hydra-login01.si.edu
      or
    2. ssh hydra-login02.si.edu
  • From either login node you submit and monitor your job(s) via the queue manager/scheduler.
  • The queue manager/scheduler is the Grid Engine, or simply GE (aka SGE, for Sun Grid Engine).
  • The Grid Engine runs on the front-end node (hydra-3.si.edu); the front-end node should not be used as a login node.
  • All the nodes (login, front-end and compute nodes) are interconnected
    • via Ethernet (at 1Gbps), and
    • via InfiniBand (at 40Gbps).
  • The disks are mounted off a dedicated server (a NetApp filer) connected to all the nodes via a 1Gbps network switch, with two 10Gbps uplinks from that switch to the filer.

A schematic representation of the cluster:

2. Access to the Cluster

  • To access the Hydra-3 cluster you will need
    1. An account;
    2. A secure connection.

Requesting Accounts

Secure Connection

  • You can connect to the login nodes (hydra-login01.si.edu or hydra-login02.si.edu) only via ssh, from a trusted machine.
  • Trusted machines are computers connected via a hardwired connection to SI's or SAO's network and managed by OCIO, or by the CF or HEA for SAO.
  • To use a WiFi connection and/or a self-managed computer, you must authenticate your machine via VPN.
  • VPN authentication information is available elsewhere for SI/OCIO, for SAO/CF, and for SAO/HEA.

3. Using the Cluster

  • To run your analysis/model/etc. on the cluster you will need to:
    1. install the required software (unless it is already installed);
    2. copy the data your job needs to the cluster;
    3. write a script to run your job.

Indeed, the login nodes are for interactive use only: editing, compiling, testing, checking results, etc., and submitting jobs.

The login nodes are not compute nodes (neither is the front-end node), and therefore they should not be used for actual computations, except for short interactive debugging sessions or short ancillary computations.

The compute nodes are the machines (hosts, servers) on which you run your computations, by submitting a job to the queue system. This is accomplished via the qsub command, from a login node, using a job script.
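A minimal job script might look like the sketch below; the script, program, and file names (my_job.sh, my_program, my_input.dat) are placeholders, and you should adjust the embedded options to your needs:

    #!/bin/sh
    # my_job.sh - minimal serial job script for the Grid Engine (hypothetical example)
    #$ -S /bin/sh          # run the script with /bin/sh
    #$ -N my_job           # name of the job
    #$ -cwd                # run in (and write the log to) the submission directory
    #$ -j y                # merge stderr into stdout
    #$ -o my_job.log       # name of the output (log) file
    #
    echo "job started on $(hostname) at $(date)"
    ./my_program my_input.dat
    echo "job finished at $(date)"

You would then submit it from a login node with qsub my_job.sh and monitor it with qstat.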

Do not run on the login or front-end nodes and do not run jobs out-of-band; this means:

  • do not log on to a compute node to manually start a computation; always use qsub,
  • do not run scripts/programs that spawn additional tasks, or use multiple threads, unless you have requested the corresponding resources,
  • if your script runs something in the background (it shouldn't), use wait so your job terminates only when all the associated processes have finished,
  • to run multi-threaded jobs, read and follow the relevant instructions,
  • to run parallel jobs (MPI), read and follow the relevant instructions.
    You do not start MPI jobs on the cluster the way you do on a stand-alone computer or a laptop. (Hedged sketches of these last three points follow this list.)
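The sketches below illustrate these last three points; the parallel environment names (mthread, mpich) and the script names are placeholders, so use the names given in the relevant instructions for Hydra-3:

    # if a script must start something in the background, wait for it,
    # so the job does not end before the background process does
    ./ancillary_task &
    ./main_computation
    wait

    # multi-threaded job: request the matching number of slots via a
    # threaded parallel environment (the PE name "mthread" is a placeholder)
    qsub -pe mthread 8 my_threaded_job.sh

    # parallel (MPI) job: request slots via an MPI parallel environment and
    # let the queue system start the MPI tasks (the PE name "mpich" is a placeholder)
    qsub -pe mpich 16 my_mpi_job.sh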

 Remember that

  • you will need to write a script, even if trivial, to submit a job,
  • you should optimize your codes with the appropriate compilation flags for production runs,
  • you will need to specify multiple options when submitting your jobs, via the command qsub (a sketch follows this list),
  • things do not always scale up: when you submit a lot of jobs (in the hundreds) that will run concurrently, ask yourself:
    1. is there a name space conflict? All the jobs should not write to the same file, and their output files should not share the same name;
    2. what will be the resulting I/O load? Do all the jobs read the same file(s)? Do they write a lot of (useless?) stuff?
    3. how much disk space will I use? Will my job fill up my allotted disk space? Is the I/O load high compared to the CPU load?
    4. how much CPU time and memory does my job need? Jobs run in queues, and these queues have limits.
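For example, a submission with several options might look like this sketch; the resource names (h_rt, h_vmem) are the generic Grid Engine run-time and memory limits, and the values and names are placeholders to adjust against the queue limits:

    # name the job, run it in the current directory, merge and name the log file
    # (the queue system substitutes $JOB_ID, which helps avoid name space
    # conflicts when many jobs run concurrently), and request run-time and
    # memory limits
    qsub -N run_042 -cwd -j y -o 'run_042.$JOB_ID.log' \
         -l h_rt=12:00:00 -l h_vmem=4G my_job.sh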

The cluster is a shared resource; when a resource gets used, it is unavailable to others:

  • clobbering disk space prevents others from using it:
    1. trim down your disk usage;
    2. clean up after your computation;
    3. the disks on the cluster are not for long-term storage;
    4. move what you don't use on the cluster back to your "home" machine (a sketch of such housekeeping follows this list).

  • running un-optimized code wastes CPU cycles and effectively delays/prevents other users from running their analysis.

  • reserving more memory than you need will effectively delay/prevent other users from using it.
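A hedged sketch of this kind of housekeeping, with placeholder directory, user, and host names:

    # check how much disk space you are using
    du -sh ~/my_project
    df -h ~

    # remove intermediate files you no longer need
    rm -r ~/my_project/scratch

    # copy finished results back to your "home" machine, then remove them
    # from the cluster (my.home.machine and the paths are placeholders)
    scp -r ~/my_project/results username@my.home.machine:/path/to/archive/
    rm -r ~/my_project/results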

4. Software Overview

The cluster runs a Linux distribution (Rocks 6.1.1) that is specific to clusters. Rocks 6.1.1 is built around CentOS 6.6.

As for any Unix system, you must properly configure your account to access all the system resources.
Your ~/.bash_profile, or ~/.profile, and/or ~/.cshrc files need to be adjusted accordingly.

The configuration on the cluster is different from the one on the CF- or HEA-managed machines (SAO).
We have implemented the module command to simplify the configuration of your Unix environment.

You can look in ~hpc/ for examples (with ls -la ~hpc).
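For instance, a typical session might look like the sketch below; the module name (intel) is a placeholder, and module avail shows what is actually installed on Hydra-3:

    module avail             # list the available modules
    module load intel        # configure your environment for, e.g., the Intel compilers
    module list              # show which modules are currently loaded
    module unload intel      # undo the above
    module help              # short usage summary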

Available Compilers:

  1. GNU compilers (gcc, g++, gfortran, g90)
  2. Portland Group (PGI) compilers and the Cluster Dev Kit (debugger, profiler, etc...)
  3. Intel compilers and the Cluster Studio (debugger, profiler, etc...)
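As a reminder that production runs should use optimized code, here is a hedged sketch of typical optimization flags for each compiler family (the source and executable names are placeholders; check each compiler's documentation for the flags appropriate to your code):

    gfortran  -O3   -o mycode mycode.f90     # GNU
    pgfortran -fast -o mycode mycode.f90     # PGI
    ifort     -O3   -o mycode mycode.f90     # Intel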

Available Libraries

  1. MPI, for GNU, PGI and Intel compilers, w/ IB support
  2. the math libraries that come with the compilers
  3. GSL, BLAS, LAPACK, ...
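A hedged sketch of building against these libraries, assuming the corresponding modules are loaded (the program names are placeholders, and the exact link lines depend on which compiler and MPI combination you use):

    # MPI programs are built with the wrappers provided by the MPI library
    mpicc  -O2 -o my_mpi_prog my_mpi_prog.c
    mpif90 -O2 -o my_mpi_prog my_mpi_prog.f90

    # link against GSL, BLAS, and LAPACK using the conventional library names
    gcc      -O2 -o my_prog my_prog.c   -lgsl -lgslcblas -lm
    gfortran -O2 -o my_prog my_prog.f90 -llapack -lblas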

Available Software Packages

  1. We have 128 run-time licenses for IDL; GDL is available too.
  2. the Bioinformatics and Genomics support group has installed a slew of packages; they are described here.

Other Software Packages have been installed by users, or can be installed upon request. [we need a better way to list this]

5. Support

The cluster is located in Herndon, VA and is managed by ORIS/OCIO (Office of Research Information Services/Office of the CIO).

The cluster is supported by

  • D.J. Ding (DingDJ@si.edu), the system administrator (at OCIO, Herndon, VA). As the sys-admin, he is responsible for keeping the cluster operational and secure.
  • Paul Frandsen (FrandsenP@si.edu), Bioinformatics and Genomics support (ORIS/OCIO). He is the primary support person for non-SAO users.
  • Sylvain Korzennik (hpc@cfa.harvard.edu), astronomer at SAO (Cambridge, MA): as the SAO's HPC analyst, his primary role is to support SAO scientists. He is also responsible for configuring and tuning the cluster, its queuing configuration, general Unix support, validation and documentation.

Support is also provided by other ORIS/OCIO staff members (networking, etc...).

Simple rules

  • SAO users should contact Sylvain, not the CF or HEA support sys-admins.
  • Non-SAO users should contact Paul for usage problems (how do I do this, why did that fail, etc).
  • For sys-admin problems (forgotten password, something is not working any more, etc) users should contact DJ (and CC either Paul or Sylvain).

Mailing List

A mailing list, called HPCC-L, on SI's listserv (i.e., at si-listserv.si.edu, hence HPCC-L@si-listserv.si.edu) is available to contact all the cluster users.

This mailing list is used by the support people to notify the users of important changes, etc.

You can use this mailing list to communicate with other cluster users, share ideas, ask for help with cluster use problems, offer advice and solutions to problems, etc.

The mailing list is read by the cluster sysadmin and support staff as well as by other cluster users.

To email to that list you must log in to the listserv and post your message. Replies to these messages are by default broadcast to the list.

You will need to set up a password on this listserv the first time you use it (look in the upper right, under "Options").


Last updated SGK

