As of the August 2019 upgrade, Hydra-5 consists of:
one head node;
two login nodes;
90 compute nodes, adding up to ~4,900 CPU cores, 4 GPUs and ~38 TB of memory, as follows:
#nodes  #cores/node  Memory/node  Model   Note
  24        32         256GB      FC430
  16        40         384GB      R640
   2       128        1024GB      R7525
   5       128         756GB      R7525
  10        64         512GB      R815
  18        64         256GB      R815
   1        64         192GB      R815
   2        40        1024GB      R820
   2        24         512GB      R820
   1       112         896GB      R840
   1        64         512GB      R930
   3        72         760GB      R930
   1        96        2048GB      R930
   2        72        1024GB      SMC
   2        20         128GB      R790    2x GV100GL GPUs
a few 10 Gbps network switches (all the nodes are on 10GbE);
an InfiniBand (IB) director switch (144 100Gbps ports, expandable to 256). All the nodes (head, login and compute nodes) are connected to the IB switch, hence on the InfiniBand transport fabric, except for
the two SuperMicro special nodes.
one NetApp "filer" (FAS8040) with 6 shelves (total ~500TB):
a dedicated device that serves (provides) disk space to the cluster, i.e., to all the nodes, using NFS.
One GPFS system with two dedicated NSDs (total ~1.5PB):
a high-performance general parallel file system (aka IBM Spectrum Scale), accessed over InfiniBand.
One NAS system for near-line storage (total ~1PB):
a slower, cheaper storage available only on some nodes.
2. Nodes
The Head Node: hydra-5.si.edu
manages the cluster;
runs the job scheduler (the Grid Engine, aka UGE); and
starts jobs.
It should never be accessed by users, except when directed by support staff for special operations.
The Login Nodes: hydra-login0[12].si.edu
These are the computers available to the users to access the cluster:
They are currently 48-core, 128GB Dell R730 servers.
Do not run your computations on the login nodes.
You can use either node; pick whichever is less loaded.
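To see how busy a login node is, check its load averages once logged in (a sketch; what counts as "high" load on a 48-core node is a judgment call):

```shell
# Show the node's uptime and its load averages over the last 1, 5 and
# 15 minutes; on a 48-core login node, a 1-minute load well below 48
# means there is still headroom.
uptime
```

If the load is high, log in to the other login node instead.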
The Compute Nodes: compute-NN-MM.local
These are the nodes (aka servers, hosts) on which jobs run; you run a job by submitting it to the scheduler with qsub.
A couple of nodes are dedicated to
interactive use (qrsh), and
the I/O queue (read-only access to /store).
Do not ssh to a compute node to start any computation "out of band" (we will find such jobs and kill them).
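A minimal job file submitted via qsub can look like the sketch below; the directive values and file names are illustrative assumptions, so adjust the queue and resources to your needs:

```shell
#!/bin/sh
# myjob.job -- minimal Grid Engine (UGE) job file (illustrative sketch)
#$ -N myjob        # job name
#$ -cwd            # start the job in the directory it was submitted from
#$ -j y            # merge stderr into stdout
#$ -o myjob.log    # write the job's output to this file
echo "started on $(hostname) at $(date)"
# ... your actual computation goes here ...
echo "done"
```

You would submit it with `qsub myjob.job` and monitor it with `qstat`; for interactive work, use `qrsh` instead, which lands you on the dedicated interactive nodes.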
3. Disk Space
The useful disk space available on the cluster is mounted off two dedicated devices (NetApp and GPFS); the third one (NAS) is a near-line storage system, accessible only from the login and I/O nodes, not from the compute nodes.
The available public disk space is divided into several areas (aka partitions):
a small partition for basic configuration files and small storage: the /home partition;
a set of medium-size partitions, one for SAO users, one for non-SAO users: the /data partitions;
a set of large partitions, one for SAO users, one for non-SAO users: the /pool partitions;
a set of large, fast partitions, one for SAO users, one for non-SAO users: the /scratch partitions.
They should be used as follows:
Name               Size        Typical Use
/home              10TB        For your basic configuration files, scripts and job files (NetApp);
                               low quota limit, but you can recover old stuff.
/data/sao          40TB        For important but relatively small files like final results (NetApp);
/data/genomics     30TB        medium quota limit, you can recover old stuff, but disk space is
                               not released right away.
/pool/sao          37TB        For the bulk of your storage (NetApp);
/pool/genomics     55TB        high quota limit, and disk space is released right away.
/pool/biology      200GB
/scratch/genomics  400TB each  For temporary storage (GPFS);
/scratch/sao                   fast storage, high quota limit, and disk space is released right away.
/store/public      270TB       For near-line storage.
Note that:
We impose quotas (limit on how much can be stored on each partition by each user) and we monitor disk usage;
/home should not be used for storage of large files, use /pool or /scratch instead;
/data is best to store things like final results, code, etc. (small but important);
We implement an automatic scrubber: old stuff will be deleted to make space;
stuff older than 180 days on /pool will be scrubbed, while
stuff older than 90 days on /scratch will be scrubbed.
None of the disks on the cluster are for long term storage:
please copy your results back to your "home" computer and
delete what you don't need any longer.
Once you reach your quota you won't be able to write anything on that partition until you delete stuff.
A few compute nodes have local SSDs (solid state disks), see (missing instructions); but since we now have a GPFS, check things using /scratch first.
Contact us if your jobs can benefit from using local SSDs.
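Since the scrubber deletes by age, you can preview which of your files are at risk with find (a sketch; point it at one of your own directories on /pool or /scratch):

```shell
# List files not modified in the last 180 days (the /pool scrub age;
# use -mtime +90 for /scratch). $HOME here is a stand-in for your own
# directory on /pool or /scratch.
find "$HOME" -type f -mtime +180
```

Running `du -sh <directory>` on your larger directories is likewise a quick way to see how close you are to your quota before you hit it.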