install the required software (unless it is already installed);
copy the data your job needs to the cluster; and
write at least one (minimal) script to run your job.
Indeed, the login nodes are for interactive use only: editing, compiling, testing, checking results, etc., and, of course, submitting jobs.
The login nodes are not compute nodes (neither is the front-end node), so they should not be used for actual computations, except for short interactive debugging sessions or short ancillary computations.
The compute nodes are the machines (aka hosts, nodes, servers) on which you run your computations, by submitting a job, or a slew of jobs, to the queue system (the Grid Engine or GE).
This is accomplished via the qsub command, from a login node, and using a job script.
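For example, a minimal job script can be as simple as the sketch below (the job name, log file name, and program are placeholders):

  #!/bin/sh
  # myjob.job -- a minimal Grid Engine job script (names are placeholders)
  #$ -N myjob        # job name
  #$ -cwd            # run the job from the current working directory
  #$ -j y            # merge stderr into stdout
  #$ -o myjob.log    # write the job's output to this file
  echo "job $JOB_ID started on $HOSTNAME"
  ./my_program       # the actual computation
  echo "job $JOB_ID done"

You would then submit it, from a login node, with:

  qsub myjob.job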
Do not run computations on the login or front-end nodes, and do not run jobs out-of-band. This means:
do not log on a compute node to manually start a computation, always use qsub;
do not run scripts/programs that spawn additional tasks, or use multi-threads, unless you have requested the corresponding resources;
if your script runs something in the background (it shouldn't), use the command wait so your job terminates only when all the associated processes have finished (see the sketch after this list);
to run multi-threaded jobs, read and follow the relevant instructions;
to run parallel (MPI) jobs, read and follow the relevant instructions; you do not start MPI jobs on the cluster the way you do on a standalone computer (or laptop).
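To illustrate the wait rule above, a minimal sketch (task_a and task_b are hypothetical programs):

  # fragment of a job script: start two tasks in the background, then block
  # until both have finished; without wait, the job would terminate while
  # the background tasks are still running
  ./task_a input_a > task_a.log &
  ./task_b input_b > task_b.log &
  wait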
you will need to write a script, even if trivial, to submit a job (there is a tool to help you do that);
you should optimize your codes with the appropriate compilation flags for production runs;
you will need to specify multiple options when submitting your jobs via the command qsub (see the example after this list);
things do not always scale up: when you submit a lot of jobs (in the hundreds) that will run concurrently (at the same time), ask yourself:
is there a name space conflict? All the jobs should not write to the same file, and their output files should not have the same name;
what will the resulting I/O load be? Do all the jobs read the same file(s)? Do they write a lot of (useless?) stuff?;
how much disk space will I use? Will my job fill up my allotted disk space? Is the I/O load high compared to the CPU load?
how much CPU time and memory does my job need? Jobs run in queues, these queues have limits.
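For example, a sketch of a qsub invocation with several options (the queue name and memory resource below are site-specific placeholders; check the relevant instructions for the actual names), using $JOB_ID to keep output names unique across concurrent jobs:

  # -N: job name, -q: queue, -l: resource request (placeholders),
  # -o: output file, made unique with the job's ID
  qsub -N analysis -q some.q -l memres=4G -cwd -j y \
       -o 'analysis.$JOB_ID.log' myjob.job

  # inside the job script, use $JOB_ID the same way, so that hundreds
  # of concurrent jobs never write to the same file
  ./my_program > result.$JOB_ID.dat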
Checkpointing: computers do crash, networks go down, and jobs get killed when they exceed limits, so
whenever possible, and especially for long jobs, you should save intermediate results so you can resume a computation from where it stopped.
This is known as checkpointing.
If you use a third-party tool, verify whether it supports checkpointing and how to enable it.
If you run your own code, you should include checkpointing for long computations.
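A minimal sketch of the idea, for a long computation driven by a shell script (the checkpoint file and the compute_step program are hypothetical):

  #!/bin/bash
  # resume after the last completed step if a checkpoint file exists
  if [ -f checkpoint.txt ]; then
      start=$(( $(cat checkpoint.txt) + 1 ))
  else
      start=1
  fi
  for ((i = start; i <= 10000; i++)); do
      ./compute_step $i            # hypothetical unit of work
      echo $i > checkpoint.txt     # record progress after each step
  done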
The cluster is a shared resource: when a resource gets used, it is unavailable to others, hence:
clobbering disk space prevents others from using it:
trim down your disk usage;
clean up your disk usage after your computation;
the disks on the cluster are not for long-term storage;
move what you don't use on the cluster back to your "home" machine (see the sketch after this list).
Running un-optimized code wastes CPU cycles and effectively delays/prevents other users from running their analyses.
Reserving more memory than you need will effectively delay/prevent other users from using it.
Fair use: we have implemented resource limits; do not bypass these limits. If needed, feel free to contact us to review how these limits impact you.
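A sketch of that housekeeping (the paths and the remote host are hypothetical):

  du -sh ~/myproject/*     # see how much space each directory uses
  # copy finished results back to your "home" machine, then remove them here
  rsync -av ~/myproject/results/ user@home.example.org:/archive/myproject/
  rm -r ~/myproject/results ~/myproject/scratch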
4. Software Overview
The cluster runs a Linux distribution that is specific to clusters: we use Bright Cluster Manager (8.2) to deploy CentOS 7.6 (Core).
As on any Unix system, you must properly configure your account to access the system resources. Your ~/.bash_profile, ~/.profile, and/or ~/.cshrc files need to be adjusted accordingly.
The configuration on the cluster is different from the one on the CF- or HEA-managed machines (for SAO users). We have implemented the module command to simplify the configuration of your Unix environment.
You can look in the directory ~hpc/ for examples of configuration files (with ls -la ~hpc).
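For example (the module name below is illustrative; run module avail to see what is actually installed):

  module avail          # list the available modules
  module load intel     # configure your environment for, e.g., the Intel compilers
  module list           # show the modules currently loaded
  module unload intel   # undo it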
The following compilers and libraries are available:
GNU compilers (gcc, g++, gfortran)
Intel compilers and the Cluster Studio (debugger, profiler, etc.: ifort, icc, ...)
Portland Group (PGI) compilers and the Cluster Dev Kit (debugger, profiler, etc.: pgf77, pgcc, ...)
MPI, for the GNU, PGI, and Intel compilers, with InfiniBand (IB) support;
the libraries that come with the compilers;
GSL, BLAS, LAPACK, ...
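A sketch of invoking these compilers with optimization flags for production runs, as recommended above (the flags shown are common choices, not site recommendations):

  # GNU: optimize and target the host's CPU
  gfortran -O2 -march=native -o mycode mycode.f90
  # Intel: the equivalent invocation
  ifort -O2 -xHost -o mycode mycode.f90
  # link against the GSL/BLAS/LAPACK libraries when needed
  gcc -O2 -o analysis analysis.c -lgsl -lgslcblas -llapack -lblas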
Available Software Packages
We have 128 run-time licenses for IDL; GDL and FL are available too.
Tools like MATLAB, Java, Python, R, Julia, etc. are available; and
the Bioinformatics and Genomics support group has installed a slew of packages.
Refer to the Software pages for further details. Other software packages have been installed by users, or can be installed upon request.
The cluster is located in Herndon, VA and is managed by ORCS/OCIO (Office of Research Computing Services/Office of the Chief Information Officer).
The cluster is supported by the following individuals:
DJ Ding (DingDJ@si.edu), the system administrator (at OCIO, Herndon, VA).
As the sys-admin, he is responsible for keeping the cluster operational and secure.
Rebecca Dikow (DikowR@si.edu) provides Bioinformatics and Genomics support (Data Science Lab/OCIO, Washington, D.C.);
she is the primary support person for Bioinformatics and Genomics at SI.
Matthew Kweskin (KweskinM@si.edu) - NMNH/L.A.B., IT specialist (Washington, D.C.).