Existing Smithsonian General Policies and SI Computer and Network Usage Policies

The Smithsonian Institution High Performance Computing resources (i.e., SI/HPC: the Hydra cluster and its associated resources) are official SI assets and, as such, are subject to all of the regulations and policies outlined in SI's official directives, such as SD-931.

Since all SI/HPC users must have a Smithsonian network account before being granted an SI/HPC account, they have already agreed to the policies described in SD-931.

(lightbulb) New SI/HPC account holders should consider reacquainting themselves with SD-931. In addition, it is expected that SI/HPC users' computer security awareness training (CSAT) is up to date.

User accounts

Individuals who have been granted a Smithsonian network account are eligible for a Hydra account.

In addition, it is expected that each SI/HPC user will provide and keep up to date the following, which will be gathered through an online form:

  • Name
  • Unit/Department
  • Supervisor name
  • Supervisor approval via online interface
  • 1-2 sentence description of the work they plan to conduct on Hydra.

Existing SAO users will continue to use their current system administered by Sylvain Korzennik.

Users are required to complete an online training module in order to receive a Hydra account unless they can demonstrate a reasonable level of proficiency with Linux and HPC.

Those with temporary appointments (e.g. contractors and fellows) are required to renew their Hydra account annually.

Note:

  • The sharing of credentials on SI/HPC assets (like Hydra) is strictly prohibited.
  • Borrowing disk space from other users to bypass quotas is also prohibited. While it is, of course, OK to share data with collaborators, users should not try to bypass quota limits by having their data stored by others.
    • Either policy violation may result in suspension or cancellation of the user's account. Users should contact us if they need more resources to complete a specific task.
  • Users who have temporary appointments (students, postdocs, fellows, contractors, etc.) will need to renew their accounts annually (including SAO's postdocs/fellows).
  • Users and their supervisors will be contacted upon account expiration (via the email in their ~/.forward file). If there is no response from the user or their supervisor within 30 days, the data will be subject to deletion.
    (warning) If someone else should be contacted regarding a user's account/data, or the user's supervisor has changed or left, it is the user's responsibility to notify their SI/HPC contact person. 

Hydra Use

As with all assets on the Smithsonian network, Hydra users should have no expectation of privacy concerning their use of Hydra. Users should adhere to "Rule 5" of SD-931 regarding appropriate computer and network use.

Hydra's administrators reserve the right to suspend, cancel, or modify user accounts, quotas on the public disks, queue configuration, and other configuration settings, without warning and as needed to maintain the cluster integrity and optimal use as a shared resource.

Users should keep in mind that Hydra is a shared resource. As such, users should avoid conducting analyses on the login nodes or the front-end node, be mindful that others are using the cluster, and understand that others' work is no less important than their own.

(lightbulb) If you are wondering whether your behavior is in violation of "the spirit" of shared-use, consider whether dozens of people doing the same thing as you would adversely affect the functioning of the cluster.

(tick) Users are responsible for monitoring the status and the progress of their jobs. Users who plan to conduct many similar operations should start with a small (set of) test job(s) before scaling up their use. Users who have jobs running on Hydra should be able to access the cluster while their jobs are running, so they can adjust their use if need be.

(tick) Since the SI/HPC resources are a shared resource, users are expected to make a best effort to estimate their needs (CPU time, memory, disk space) and to understand how those needs scale with the size of their analysis.

(warning) Users are subject to suspension or cancellation of their accounts if they systematically abuse a scarce or depleted resource (memory, i.e., RAM; disk space, especially high-throughput disks like SSDs; etc.) or bypass the resource limits in place.

Disk Use on Hydra

Disk storage on Hydra is not to be used for archival storage. Users should have no expectation of the long-term viability of their files stored on Hydra. While the disk system on Hydra is highly reliable, the Smithsonian is not responsible for any data that are lost on Hydra.

Data on disks on Hydra are not backed up and old data on the public disks will be regularly removed (scrubbed) according to the following model:

  • Files and directories on /scratch/{biology|genomics|sao}* will be scrubbed after 90 days.
  • Files and directories on /pool/{biology|genomics|sao} will be scrubbed after 180 days.
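
To see which files might fall under these scrubbing windows, a user could list files older than a given number of days. The following is a minimal sketch, not the scrubber itself: the function name `files_older_than` is hypothetical, and it assumes that file age is measured by modification time, which this policy does not specify.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def files_older_than(root, max_age_days, now=None):
    """Return files under root whose modification time is older than max_age_days.

    Assumption: age is judged by modification time (mtime); the actual
    scrubber may use a different criterion.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat() does not follow symbolic links
                if path.lstat().st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    return sorted(stale)
```

For example, calling `files_older_than(my_scratch_dir, 90)` on a user's own directory under /scratch (with `my_scratch_dir` set by the user) would list candidates for the 90-day window, and passing 180 would do the same for a directory under /pool.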

Remember:

  • All public disks have quotas and are scrubbed.
  • Analyses on large files and data-sets should be conducted using the /pool or /scratch disks.
  • The /home and /data disks should not be used for large files; their low quotas will prevent users from storing large files.
  • All data associated with an expired account is subject to deletion 30 days after the account expires, unless agreed otherwise.
  • It is the responsibility of the user leaving SI to either save her/his data or pass on the responsibility of her/his data to her/his supervisor.



Oversubscribed and Inefficient Jobs

We monitor the cluster usage for the following conditions:

  • Oversubscribed jobs: scaled CPU usage > 133%;

  • Very inefficient or “hosed” jobs: efficiency (CPU/age) < 10% and age > 36hr;

  • Inefficient jobs: scaled CPU usage < 33%;

  • Jobs that over-reserved memory: >2.5x actual RAM use.

Users with such jobs receive warning emails (sent to the email listed in their ~/.forward file).
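
The thresholds above can be illustrated with a short sketch that flags a single job. This is not the monitoring code used on Hydra: the `JobStats` fields and the `job_warnings` function are hypothetical, and the sketch assumes that efficiency is CPU time divided by job age and that "scaled CPU usage" is that efficiency divided by the number of reserved slots. Each condition is checked independently, so one job can trigger several warnings.

```python
from dataclasses import dataclass


@dataclass
class JobStats:
    cpu_hours: float        # total CPU time consumed so far
    age_hours: float        # wall-clock time since the job started
    slots: int              # number of CPU slots reserved
    mem_reserved_gb: float  # memory requested
    mem_used_gb: float      # memory actually used


def job_warnings(job):
    """Return the list of warning labels that apply to a job."""
    warnings = []
    if job.age_hours <= 0 or job.slots <= 0:
        return warnings
    # Assumed definitions (not confirmed by the policy text):
    efficiency = 100.0 * job.cpu_hours / job.age_hours  # percent
    scaled_cpu = efficiency / job.slots                 # percent per slot
    if scaled_cpu > 133:
        warnings.append("oversubscribed")
    if efficiency < 10 and job.age_hours > 36:
        warnings.append("hosed")
    if scaled_cpu < 33:
        warnings.append("inefficient")
    if job.mem_used_gb > 0 and job.mem_reserved_gb > 2.5 * job.mem_used_gb:
        warnings.append("memory over-reserved")
    return warnings
```

Under these assumptions, a job that used 12 CPU hours in 2 hours on 4 slots has a scaled CPU usage of 150% and would be flagged as oversubscribed.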

User responsibility to respond to warnings:

(tick) For “hosed” and oversubscribed jobs, users should respond to warnings by emailing si-hpc-admin@si.edu within 24 hours of receiving the warnings, whenever possible, to help determine whether jobs should be killed. Oversubscribed jobs that are expected to run for over 24 hours after receiving the first warning should be killed and resubmitted with the correct parameters.

(warning) When Hydra’s usage is high (>70% usage, or lots of jobs waiting in the queue), or when the oversubscription is excessive (>500%, multiple offending jobs, repeat offender), jobs may get killed promptly by the system administrator team at their discretion (we do not plan to implement automatic job killing).

(tick) For inefficient and memory over-reserved jobs, users should monitor their jobs, heed the warnings, and contact the support staff if they do not know or understand why the warnings are occurring and/or how to fix the problem for future jobs.

(warning) If a user has a large number of similarly inefficient jobs, or is repeatedly receiving the same warnings without responding, their jobs may be killed by the system administrator team at their discretion, and eventually their Hydra usage privilege may be suspended until further training is completed.

Our overarching goal is to support all users: to get all jobs through the queue and help users learn how to best make use of a shared resource. Remember that oversubscribed jobs are likely to slow down someone else’s job(s) running on the same compute node(s), while inefficient or memory over-reserved jobs are clobbering the system and preventing the scheduler from starting the jobs waiting in the queue.

(lightbulb) As a reminder, all users are expected to read messages sent to the email address listed in their ~/.forward file.

Communication

All users are expected to check and read their email regularly, including messages from the SI HPCC-L listserv. Users will be added to the HPCC-L listserv when they are granted an account.

The SI/HPC admins use email to:

  1. announce new features (e.g. on Hydra),
  2. warn users when their jobs are improperly using resources,
  3. warn users when their files will be scrubbed,
  4. warn users whose accounts are expiring,
  5. announce configuration changes, and
  6. announce new policies or changes to existing policies.

Users should be sure that the file ~/.forward (in their home directory on Hydra) contains a working email address that is regularly checked. A ~/.forward file is created for each new account, using the user's canonical email. See here about creating a ~/.forward file.
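
As a quick sanity check, a user could confirm that their ~/.forward file exists and contains something that looks like an email address. The sketch below is illustrative only: the function name `forward_addresses` is hypothetical, it assumes the file holds plain addresses separated by commas or newlines, and it cannot confirm that an address actually receives mail.

```python
import re
from pathlib import Path

# Loose pattern: something@domain.tld, with no spaces or extra "@" signs
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def forward_addresses(path=None):
    """Return the email-like entries found in a .forward file.

    Returns an empty list if the file is missing or holds no such entries.
    """
    path = Path.home() / ".forward" if path is None else Path(path)
    if not path.is_file():
        return []
    addresses = []
    for line in path.read_text().splitlines():
        # Assumption: entries are separated by commas or newlines
        for entry in line.split(","):
            entry = entry.strip()
            if EMAIL_RE.match(entry):
                addresses.append(entry)
    return addresses
```

An empty result from `forward_addresses()` suggests that the file is missing or does not contain a usable address, in which case warnings and expiration notices may not reach the user.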

Whom to Contact