As of the August 2019 upgrade, Hydra-5 consists of:
one head node;
two login nodes;
90 compute nodes, adding up to ~4,900 CPU cores, 4 GPUs and ~38 TB of memory, as follows:
#nodes  #cores/node  Memory/node  Model   Note
  24        32         256GB      FC430
  16        40         384GB      R640
   2       128        1024GB      R7525
   5       128         756GB      R7525
  10        64         512GB      R815
  18        64         256GB      R815
   1        64         192GB      R815
   2        40        1024GB      R820
   2        24         512GB      R820
   1       112         896GB      R840
   1        64         512GB      R930
   3        72         760GB      R930
   1        96        2048GB      R930
   2        72        1024GB      SMC
   2        20         128GB      R790    2x GV100GL GPUs
a few 10 Gbps network switches (all the nodes are on 10GbE);
an InfiniBand (IB) director switch (144 100Gbps ports, expandable to 256). All the nodes (head, login and compute nodes) are connected to the IB switch, hence on the InfiniBand transport fabric, except for
the two SuperMicro special nodes.
one NetApp "filer" (FAS8040) with 6 shelves (total ~500TB):
a dedicated device that serves (provides) disk space to the cluster, i.e., to all the nodes, using NFS.
One GPFS system with two dedicated NSDs (total ~1.5PB):
a high-performance general parallel file system (aka IBM Spectrum Scale), accessed over InfiniBand.
One NAS system for near-line storage (total ~1PB):
a slower, cheaper storage available only on some nodes.
2. Nodes
The Head Node: hydra-5.si.edu
manages the cluster;
runs the job scheduler (the Grid Engine, aka UGE); and
starts jobs.
It should never be accessed by users, except when directed by support staff for special operations.
The Login Nodes: hydra-login0[12].si.edu
These are the computers available to the users to access the cluster:
They are currently 48-core, 128GB Dell R730 servers.
Do not run your computations on the login nodes.
You can use either node; pick whichever is less loaded.
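To see how busy a login node is, check its load averages once logged in (a sketch; what counts as "high" load on a 48-core node is a judgment call):

```shell
# Show the node's uptime and its load averages over the last 1, 5 and
# 15 minutes; on a 48-core login node, a 1-minute load well below 48
# means there is still headroom.
uptime
```

If the load is high, log in to the other login node instead.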
The Compute Nodes: compute-NN-MM.local
These are the nodes (aka servers, hosts) on which jobs run; you run a job by submitting it to the scheduler with qsub.
A couple of nodes are dedicated to
interactive use (qrsh), and
the I/O queue (read-only access to /store).
Do not ssh to a compute node to start any computation "out of band" (we will find such jobs and kill them).
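A minimal job file submitted via qsub can look like the sketch below; the directive values and file names are illustrative assumptions, so adjust the queue and resources to your needs:

```shell
#!/bin/sh
# myjob.job -- minimal Grid Engine (UGE) job file (illustrative sketch)
#$ -N myjob        # job name
#$ -cwd            # start the job in the directory it was submitted from
#$ -j y            # merge stderr into stdout
#$ -o myjob.log    # write the job's output to this file
echo "started on $(hostname) at $(date)"
# ... your actual computation goes here ...
echo "done"
```

You would submit it with `qsub myjob.job` and monitor it with `qstat`; for interactive work, use `qrsh` instead, which lands you on the dedicated interactive nodes.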
3. Disk Space
The useful disk space available on the cluster is mounted off two dedicated devices (NetApp and GPFS); the third one (NAS) is a near-line storage system, accessible only from the login and I/O nodes, not from the compute nodes.
The available public disk space is divided into several areas (aka partitions):
a small partition for basic configuration files and small storage: the /home partition;
a set of medium-size partitions, one for SAO users, one for non-SAO users: the /data partitions;
a set of large partitions, one for SAO users, one for non-SAO users: the /pool partitions;
a set of large, fast partitions, one for SAO users, one for non-SAO users: the /scratch partitions.
They should be used as follows:
Name               Size        Typical Use
/home              10TB        For your basic configuration files, scripts and job files (NetApp);
                               low quota limit, but you can recover old stuff.
/data/sao          40TB        For important but relatively small files like final results (NetApp);
/data/genomics     30TB        medium quota limit, you can recover old stuff, but disk space is
                               not released right away.
/pool/sao          37TB        For the bulk of your storage (NetApp);
/pool/genomics     55TB        high quota limit, and disk space is released right away.
/pool/biology      200GB
/scratch/genomics  400TB each  For temporary storage (GPFS);
/scratch/sao                   fast storage, high quota limit, and disk space is released right away.
/store/public      270TB       For near-line storage.
Note that:
We impose quotas (limit on how much can be stored on each partition by each user) and we monitor disk usage;
/home should not be used for storage of large files, use /pool or /scratch instead;
/data is best to store things like final results, code, etc. (small but important);
We implement an automatic scrubber: old stuff will be deleted to make space;
stuff older than 180 days on /pool will be scrubbed, while
stuff older than 90 days on /scratch will be scrubbed.
None of the disks on the cluster are for long term storage:
please copy your results back to your "home" computer and
delete what you don't need any longer.
Once you reach your quota you won't be able to write anything on that partition until you delete stuff.
A few compute nodes have local SSDs (solid state disks), see (missing instructions); but since we now have a GPFS, check things using /scratch first.
Contact us if your jobs can benefit from using local SSDs.
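Since the scrubber deletes by age, you can preview which of your files are at risk with find (a sketch; point it at one of your own directories on /pool or /scratch):

```shell
# List files not modified in the last 180 days (the /pool scrub age;
# use -mtime +90 for /scratch). $HOME here is a stand-in for your own
# directory on /pool or /scratch.
find "$HOME" -type f -mtime +180
```

Running `du -sh <directory>` on your larger directories is likewise a quick way to see how close you are to your quota before you hit it.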