  1. Introduction
  2. Nodes
  3. Disk Space
  4. InfiniBand Fabric

1. Introduction

As of the August 2019 upgrade, Hydra-5 consists of:

  • one head node;
  • two login nodes;
  • 83 compute nodes that add up to ~4,000 CPUs, as follows:
    • two special nodes with 1TB of memory, 72 cores and 2-4TB of SSD each (SuperMicro);
    • one node with 2TB of memory and 96 cores (Dell R930);
    • two nodes with 1TB of memory and 40 cores each (Dell R820);
    • 15 nodes with 512GB of memory and 24 to 72 cores each (Dell R820);
    • 21 nodes with 256GB of memory and 64 cores each (Dell R815);
    • 10 nodes with 512GB of memory and 64 cores each (Dell R815);
    • 24 nodes with 256GB of memory and 32 cores each (FC430);
    • 16 nodes with 384GB of memory and 40 cores each (Dell R640);
    • one node with a dual GPU (2x K80) (Dell R730);
  • a few 10Gbps network switches (newer nodes are on 10GbE);
  • an InfiniBand (IB) director switch (144 100Gbps ports, expandable to 256).
    All the nodes (head, login and compute nodes) are connected to the IB switch, hence on the InfiniBand transport fabric, except for
    • the two SuperMicro special nodes.
  • one NetApp "filer" (FAS8040) with 6 shelves (total ~500TB):
    • a dedicated device that serves (provides) disk space to the cluster, i.e. to all the nodes using NFS.
  • one GPFS system with two dedicated NSDs (total ~1.5PB):
    • a high performance general parallel file system (aka IBM Spectrum Scale), using the InfiniBand.
  • one NAS system for near-line storage:
    • a slower, cheaper storage available only on some nodes.

2. Nodes

The Head Node:

  • manages the cluster;
  • runs the job scheduler (the Grid Engine, aka UGE); and
  • starts jobs.

It should never be accessed by users, except if directed by support staff for special operations.

The Login Nodes: hydra-login0[12]

  • These are the computers available to the users to access the cluster.
  • They are currently Dell R730 servers with 48 cores and 128GB of memory.
  • Do not run your computations on the login nodes.

You can use either node, depending on the node load.

The Compute Nodes: compute-NN-MM.local

  • These are the nodes (aka servers, hosts) on which jobs run; you run jobs by submitting them to the scheduler via qsub.
  • A couple of nodes are dedicated to
    • interactive use (qrsh), and
    • I/O queue (access to /store).
  • Do not ssh to the compute nodes to start any computation "out of band" (we will find such processes and kill them).
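
As a minimal sketch of what "submitting jobs to the scheduler via qsub" can look like when scripted, the helper below builds a qsub command line from a few standard Grid Engine options (-cwd, -j y, -N). The function names, the script name job.sh and the job name are hypothetical; site-specific options (queue, memory, parallel environment) are not described on this page and are left out on purpose.

```python
import shutil
import subprocess


def build_qsub_command(script, job_name=None, use_cwd=True, merge_output=True):
    """Build a qsub command line from a few common Grid Engine options."""
    cmd = ["qsub"]
    if use_cwd:
        cmd.append("-cwd")       # run the job from the current working directory
    if merge_output:
        cmd += ["-j", "y"]       # merge the job's stderr into its stdout file
    if job_name:
        cmd += ["-N", job_name]  # job name shown by the scheduler
    cmd.append(script)
    return cmd


def submit(script, **options):
    """Submit a job script with qsub.

    Returns qsub's message, or None when qsub is not on the PATH
    (for instance when this is run somewhere other than a login node).
    """
    if shutil.which("qsub") is None:
        return None
    result = subprocess.run(
        build_qsub_command(script, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Hypothetical usage, from a login node:
#   print(submit("job.sh", job_name="test"))
```

The command is built as a list rather than a single string so that no shell quoting is involved; the scheduler, not an ssh session, then decides on which compute node the job starts.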

3. Disk Space

The useful disk space available on the cluster is mounted off two dedicated devices (NetApp and GPFS).

The third one (NAS) is not accessible to the compute nodes and is therefore a near-line storage system.

The available disk space is divided into several areas (aka partitions):

  • a small partition for basic configuration files and small storage, the /home partition,
  • a set of medium size partitions, one for SAO users, one for non-SAO users, the /data partitions,
  • a set of large partitions, one for SAO users, one for non-SAO users, the /pool partitions,
  • a set of large partitions, one for SAO users, one for non-SAO users, the /scratch partitions.

They should be used as follows:



  • /home: for your basic configuration files, scripts and job files (NetApp)
    • low limit, but you can recover old stuff.
  • /data: for important but relatively small files like final results, etc. (NetApp)
    • medium limit, you can recover old stuff, but disk space is not released right away.
  • /pool: for the bulk of your storage (NetApp)
    • high limit, and disk space is released right away.
  • /scratch: for temporary storage (GPFS)
    • fast storage, high limit, and disk space is released right away.

Note that:

  • We impose quotas (limit on how much can be stored on each partition by each user) and we monitor disk usage;
  • /home should not be used for storage of large files, use /pool or /scratch instead;
  • /data is best for storing things like final results, code, etc. (small but important);
  • We implement an automatic scrubber: old stuff will be deleted to make space,
    • stuff 180 days old on /pool will be scrubbed, while
    • stuff 90 days old on /scratch will be scrubbed.
  • Warning: none of the disks on the cluster are for long-term storage; please copy your results back to your "home" computer and
    delete what you no longer need.
  • Once you reach your quota you won't be able to write anything on that partition until you delete stuff.
  • A few compute nodes have local SSDs (solid state disks) see (missing instructions), but
    since we now have a GPFS, check things using /scratch first.
  • Contact us if your jobs can benefit from using local SSDs.
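
To see which of your files are approaching the scrubber limits listed above, something along the following lines can help. It is only a sketch: the function name is hypothetical, and it assumes that a file's age is measured from its last modification time, which may not be the criterion the scrubber actually uses.

```python
import os
import time


def files_older_than(root, max_age_days, now=None):
    """Return (path, age_in_days) for files under root older than max_age_days.

    Age is computed from the last modification time (an assumption; the
    cluster's scrubber may use a different criterion). The oldest files
    come first in the returned list.
    """
    now = time.time() if now is None else now
    old_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # the file vanished or cannot be examined; skip it
            age_days = (now - mtime) / 86400.0
            if age_days > max_age_days:
                old_files.append((path, age_days))
    return sorted(old_files, key=lambda item: item[1], reverse=True)


# Hypothetical usage (replace the path with your own directory):
#   for path, age in files_older_than("/path/to/your/directory", 90):
#       print(f"{age:8.1f} days  {path}")
```

Using a threshold somewhat below 180 days on /pool, or below 90 days on /scratch, leaves time to copy results elsewhere before they are scrubbed.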

See a complete description on the Disk Space and Disk Usage page.

4. InfiniBand Fabric

All the nodes (i.e., the compute nodes, the login nodes, and the head node) are interconnected not only by the regular 10GbE network (Ethernet), but also by a high-speed, low-latency communication fabric known as InfiniBand (IB):

  • The IB switch is capable of a 100Gbps transfer rate, although the older nodes have IB cards capable of only 40Gbps.
  • To use the IB for message passing you must
    • build the executable the right way, and
    • specify that you want to use the IB in your job script.
    • We have modules to do precisely that.
  • An MPI program will not use the IB for message passing by default: you need to build it correctly to make it use the IB.
  • I/O to/from the GPFS uses the InfiniBand fabric.
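
To put the link speeds quoted above in perspective, the sketch below computes the ideal time needed to move 1 TB at each nominal rate. These are lower bounds at the nominal line rate only; real transfers are slower because of protocol overhead, file system performance and contention, none of which is modeled here. The function name and the choice of 1 TB are illustrative.

```python
def ideal_transfer_seconds(size_bytes, link_gbps):
    """Lower bound on the time to move size_bytes over a link of link_gbps (10^9 bits/s)."""
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


ONE_TB = 10**12  # 1 TB taken as 10^12 bytes

for label, gbps in [("10GbE", 10), ("IB (40Gbps cards)", 40), ("IB (100Gbps)", 100)]:
    seconds = ideal_transfer_seconds(ONE_TB, gbps)
    print(f"{label:18s} {seconds:6.0f} s")
```

At these nominal rates, 1 TB takes at least 800 s over 10GbE, 200 s over a 40Gbps IB card and 80 s at 100Gbps.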

Last updated   SGK
