  1. Introduction
  2. Nodes
  3. Disk Space
  4. InfiniBand Fabric

1. Introduction

As of the August 2019 upgrade, Hydra-5 consists of:

  • one head node;
  • two login nodes;
  • 88 compute nodes that add up to ~4,000 CPUs & 6 GPUs, as follows:

    #Nodes  #Cores/Node  Memory/Node  Model  Note
    24      32           256GB        FC430
    16      40           384GB        R640
    10      64           512GB        R815
    20      64           256GB        R815
    1       64           192GB        R815
    2       40           1024GB       R820
    2       24           512GB        R820
    1       112          896GB        R840
    1       64           512GB        R930
    3       72           760GB        R930
    1       96           2048GB       R930
    2       72           1024GB       SMC
    1       64           256GB        R730   4x K80 GPUs
    2       20           128GB        R790   2x V100 GPUs
  • a few 10Gbps network switches (all the nodes are on 10GbE);
  • an InfiniBand (IB) director switch (144 100Gbps ports, expandable to 256):
    All the nodes (head, login, and compute nodes) are connected to the IB switch, and hence are on the InfiniBand fabric, except for
    • the two SuperMicro special nodes.
  • one NetApp "filer" (FAS8040) with 6 shelves (total ~500TB):
    • a dedicated device that serves (provides) disk space to the cluster, i.e., to all the nodes, via NFS.
  • One GPFS system with two dedicated NSDs (total ~1.5PB):
    • a high-performance general parallel file system (aka IBM Spectrum Scale) that uses the InfiniBand fabric.
  • One NAS system for near-line storage:
    • a slower, cheaper storage available only on some nodes.

2. Nodes

The Head Node: hydra-5.si.edu

  • manages the cluster;
  • runs the job scheduler (the Grid Engine, aka UGE); and
  • starts jobs.

It should never be accessed by users, except when directed by support staff for special operations.

The Login Nodes: hydra-login0[12].si.edu

  • These are the computers available to the users to access the cluster:
    • They are currently 48-core, 128GB Dell R730 servers.
    • Do not run your computations on the login nodes.

You can use either node, depending on the node load.
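
For example, to connect to one of the login nodes from your own computer (a minimal sketch, assuming an ssh client and substituting your Hydra username for "user"):

    # connect to either login node
    ssh user@hydra-login01.si.edu
    # or
    ssh user@hydra-login02.si.edu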

The Compute Nodes: compute-NN-MM.local

  • These are the nodes (aka servers or hosts) on which jobs run; you run jobs by submitting them to the scheduler via qsub (see the example job script after this list).
  • A couple of nodes are dedicated to
    • interactive use (qrsh), and
    • the I/O queue (read-only access to /store).
  • Do not ssh to the compute nodes to start any computation "out of band" (we will find such jobs and kill them).
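
As an illustration, here is a minimal sketch of a serial job script and how to submit it; the job name, log file, and program below are placeholders (not Hydra-specific values), and the queue/resource options you may need are covered in the scheduler documentation:

    # my_job.sh -- a minimal, hypothetical serial job script
    #$ -N my_job          # job name (placeholder)
    #$ -cwd               # run the job from the submission directory
    #$ -j y               # merge stderr into stdout
    #$ -o my_job.log      # write the job output to this file (placeholder)
    #
    echo "job started on $HOSTNAME at $(date)"
    ./my_program          # replace with your actual computation
    echo "job finished at $(date)"

Submit it to the scheduler from a login node, and use qrsh (not ssh) for interactive work:

    qsub my_job.sh
    qrsh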

3. Disk Space

The useful disk space available on the cluster is mounted off two dedicated devices (NetApp and GPFS). The third one (NAS) is a near-line storage system; it is not accessible from the compute nodes, only from the login and I/O nodes.

The available public disk space is divided into several areas (aka partitions):

  • a small partition for basic configuration files and small storage, the /home partition,
  • a set of medium size partitions, one for SAO users, one for non-SAO users, the /data partitions,
  • a set of large partitions, one for SAO users, one for non-SAO users, the /pool partitions,
  • a set of large partitions, one for SAO users, one for non-SAO users, the /scratch partitions.

They should be used as follows:

Name                Size         Typical Use

/home               10TB         For your basic configuration files, scripts, and job files (NetApp)
                                 • low quota limit, but you can recover old stuff.

/data/sao           40TB         For important but relatively small files like final results, etc. (NetApp)
/data/genomics      30TB
                                 • medium quota limit, you can recover old stuff, but disk space is not released right away.

/pool/sao           37TB         For the bulk of your storage (NetApp)
/pool/genomics      55TB
/pool/biology       200GB
                                 • high quota limit, and disk space is released right away.

/scratch/genomics   400TB each   For temporary storage (GPFS)
/scratch/sao
                                 • fast storage, high quota limit, and disk space is released right away.

/store/public       270TB        For near-line storage.

Note that:

  • We impose quotas (a limit on how much each user can store on each partition) and we monitor disk usage (see the sketch after this list for checking your own usage);
  • /home should not be used to store large files; use /pool or /scratch instead;
  • /data is best for storing things like final results, code, etc. (small but important);
  • We implement an automatic scrubber: old stuff will be deleted to make space,
    • stuff 180 days old on /pool will be scrubbed, while
    • stuff 90 days old on /scratch will be scrubbed.
  • Warning: none of the disks on the cluster are for long-term storage:
    • please copy your results back to your "home" computer and
    • delete what you don't need any longer.
  • Once you reach your quota you won't be able to write anything on that partition until you delete stuff.
  • A few compute nodes have local SSDs (solid state disks); see (missing instructions). However,
    since we now have a GPFS, check things using /scratch first.
    • Contact us if your jobs can benefit from using local SSDs.
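
To keep an eye on your own usage, here is a rough sketch using standard Linux tools; the exact quota-reporting commands supported on Hydra may differ, so check the Disk Space and Disk Usage page:

    # free and used space on a given partition, e.g. /pool/genomics
    df -h /pool/genomics
    # space used by your own directory (can be slow on large directory trees)
    du -sh /pool/genomics/$USER
    # on the NFS (NetApp) partitions the standard quota command may report your limits
    quota -s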

See a complete description at the Disk Space and Disk Usage page.

4. InfiniBand Fabric

All the nodes (i.e., the compute nodes, the login nodes, and the head node) are interconnected not only via the regular 10GbE network (Ethernet), but also via a high-speed, low-latency communication fabric known as InfiniBand (IB):

  • The IB switch is capable of a 100Gbps transfer rate, although the older nodes have IB cards capable of only 40Gbps.
  • To use the IB for message passing (MPI) you must
    • build the executable the right way, and
    • specify that you want to use the IB in your job script.
    • We have modules to do precisely that (a sketch of such a job script follows this list).
  • An MPI program will not use the IB for message passing by default; you need to build it the right way to make it use the IB.
  • I/O to/from the GPFS storage uses the InfiniBand fabric.
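
For illustration only, a sketch of an MPI job script: the module name (openmpi) and parallel environment name (mpich) below are placeholders, not the actual Hydra names, so check "module avail" and the site documentation for the IB-enabled choices:

    # my_mpi_job.sh -- a hypothetical MPI job script
    #$ -N my_mpi_job       # job name (placeholder)
    #$ -cwd                # run from the submission directory
    #$ -pe mpich 40        # request 40 slots; the PE name here is a placeholder
    #
    module load openmpi    # load an IB-enabled MPI (actual module name may differ)
    mpirun -np $NSLOTS ./my_mpi_program   # $NSLOTS is set by the scheduler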



Last updated   SGK
