
1. A Trivial Example

Let us assume that

  • you have a program that you have successfully compiled/installed on the cluster and that you want to run (we will call it crunch);
  • the executable crunch and all the files needed to run it are in one directory, and/or the files produced will go there;
  • for illustration, it is all in /pool/sao/hpc/test

The simplest way to submit this computation is to write a two-line job script, in a file named crunch.job located in /pool/sao/hpc/test:

A trivial crunch.job
cd /pool/sao/hpc/test
./crunch

You start the computation by submitting the job script file with:

% cd /pool/sao/hpc/test
% qsub crunch.job
Your job NNNNNNN ("crunch.job") has been submitted

The qsub command, if successful, will produce the "Your job ... has been submitted" message where NNNNNNN is a unique number, the job ID assigned to that job.

By default, the output and error files associated with that job will be located in your home directory and named crunch.job.oNNNNNNN and crunch.job.eNNNNNNN, respectively.

2. A Better Example

A better approach is to specify more parameters for your job and to save more information about it, as follows:

A better crunch.job
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
./crunch
echo = `date` job $JOB_NAME done

and start the computation with:

% cd /pool/sao/hpc/test
% qsub -N crunch -cwd -j y -o crunch.log crunch.job
Your job NNNNNNN ("crunch") has been submitted

With this approach:

  • you specify the name of the job (-N crunch);
  • you join error and output in a single file (-j y);
  • you give a name to the output file (-o crunch.log); and
  • you tell qsub to start the job in the current working directory, and to write the output file there (-cwd).

This way, the log file also keeps track of

  • the job name and job ID,
  • in which queue the job ran, and on which compute node (host), and,
  • when it started and when it ended.

3. Example with Embedded Directives

You can put it all in one file as follows:

A better crunch.job with embedded directives
#
#$ -N crunch -cwd -j y -o crunch.log
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
./crunch
echo = `date` job $JOB_NAME done

and you submit your job with simply:

% cd /pool/sao/hpc/test
% qsub crunch.job
Your job NNNNNNN ("crunch") has been submitted

The command qsub will look for lines starting with #$ (that are otherwise comments in the script) and parse these embedded directives as if they were options passed to the command itself.
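As a quick sanity check before submitting, you can list the lines that qsub would treat as embedded directives. The sketch below, in Bourne shell syntax, writes a throwaway copy of the job script from this section to a temporary file (created only for the illustration) and uses grep to show its #$ lines. It only mimics the selection of lines, not how qsub parses them.

```shell
# write a throwaway copy of the job script shown above (temporary file, illustration only)
job=$(mktemp)
cat > "$job" <<'EOF'
#
#$ -N crunch -cwd -j y -o crunch.log
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
./crunch
echo = `date` job $JOB_NAME done
EOF
# keep only the lines starting with #$, i.e., the embedded directives
directives=$(grep '^#\$' "$job")
echo "$directives"
rm -f "$job"
```

On your own job file, the grep command alone is enough: grep '^#\$' crunch.job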

4. Example with Embedded Directives and Arguments

Finally, the job file is a script: it has to be written according to a specific syntax (C-shell, or Bourne shell) and can take arguments.

Let's now assume that crunch takes two arguments:

  • you can either write a different job file for each case you want to run, or
  • you can write your job script to use arguments, like this:
A better crunch.job with embedded directives and arguments
# /bin/csh
#
# this script takes two arguments and is written using the C-shell syntax
#
#$ -N crunch -cwd -j y -o crunch.log
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
set OPTIONS = (-from $1 -to $2)
echo starting crunch $OPTIONS
./crunch $OPTIONS
echo = `date` job $JOB_NAME done

and you can start several jobs as follows:

% cd /pool/sao/hpc/test
% qsub -N crunch-20-50 -o crunch-20-50.log crunch.job 20 50
Your job NNNNNNN ("crunch-20-50") has been submitted
% qsub -N crunch-100-150 -o crunch-100-150.log crunch.job 100 150
Your job NNNNNNN ("crunch-100-150") has been submitted
[etc...]

In this example the job name and the log file name are redefined to be specific to each job by specifying them on the qsub command line.
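When there are many cases to run, the same commands can be generated by a loop. The following is a sketch in Bourne shell syntax (not csh), meant to be run on the login node, not as a job script. The variable QSUB is an assumption of this example: it is set to "echo qsub" so that the commands are only printed; set it to qsub to actually submit.

```shell
# dry run: print the qsub commands instead of submitting them
# (QSUB is a variable made up for this example; use QSUB=qsub to submit)
QSUB="echo qsub"
n=0
for range in "20 50" "100 150"; do
  set -- $range          # split the pair into $1 (from) and $2 (to)
  $QSUB -N crunch-$1-$2 -o crunch-$1-$2.log crunch.job $1 $2
  n=$((n + 1))
done
echo "$n job(s) prepared"
```

Each job still gets its own name and its own log file, so concurrent jobs do not write to the same file.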

5. Notes

  • Your job script can be an elaborate script, or can start an elaborate and/or convoluted script - how to write Un*x scripts is beyond the scope of this documentation.

  • The way the GE is set up, the default syntax for job scripts is the C-shell (csh).
  • If you prefer using the Bourne shell syntax (sh or bash), you need to tell qsub to use that shell by passing the option -S /bin/sh, or
    adding it as an embedded directive as follows:
A better crunch.job with embedded directives and arguments, using Bourne shell syntax
# /bin/sh
#
# this script takes two arguments and is written using the Bourne-shell syntax
#
#$ -S /bin/sh
#$ -N crunch -cwd -j y -o crunch.log
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
OPTIONS="-from $1 -to $2"
echo starting crunch $OPTIONS
./crunch $OPTIONS
echo = `date` job $JOB_NAME done
  • (warning) A #!/bin/sh on the first line of a job script is ignored (hence the examples use # /bin/csh or # /bin/sh as reminders, not as specifiers);

  • (warning) Unless you are fully versed in the idiosyncrasies of Linux and how /bin/bash differs from /bin/sh at startup, it is highly recommended to use /bin/sh and not /bin/bash.
    There are no syntax differences; the only differences are which initialization files are read at startup.

  • (lightbulb) The options passed to the qsub command are assembled in the following order:
    1. options specified in the system-wide file $SGE_ROOT/$SGE_CELL/common/sge_request (lines not starting with #);
    2. options specified in a file called .sge_request, located in the current working directory (if there is one);
    3. options specified in a file called .sge_request, located in your home directory (if there is one);
    4. embedded options in the submitted script (lines starting with #$);
    5. options passed to qsub on the command line.

For example, you can write a ~/.sge_request file with the line

-cwd -j y

and all your jobs will include -cwd -j y as options. An option set at one step can be overridden at any later step.

  • (warning) As you queue more than one job beware of "name space":
    • jobs will run concurrently (on different compute nodes, or not), so they should not write to the same file(s) and you can keep track of them better if they do not have the same job names.

  • (warning) Also, especially for Emacs users, the last line of a job file must be properly terminated with a newline character (in Unix the return key inserts a NL, not a CR).
    The csh does not execute a line that is not properly terminated, so the last line of your script may not be executed if it does not end with a newline.

    Add the following lines to your ~/.emacs file:
    ;; always end a file with a newline character
    (setq require-final-newline t)
  • A job is likely to need resources (CPU time, memory, etc.) and to run in a specific queue. Which queues to use and how to request resources is explained in the Available Queues page.
    How to monitor your job(s) and the cluster is explained elsewhere (Monitoring your Jobs and Monitoring the Cluster).

  • (warning) Long jobs should, whenever possible, use checkpointing: save intermediate results so one can resume a computation from where it stopped.
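What checkpointing looks like depends on the program; the following is only a sketch of the idea at the script level, in Bourne shell syntax. It assumes that the computation can be split into numbered steps; the checkpoint file name crunch.ckpt, the number of steps, and the per-step invocation of crunch are all hypothetical.

```shell
# hypothetical names: adjust to your own computation
CKPT=${CKPT:-crunch.ckpt}   # file recording the next step to run
LAST=5                      # total number of steps
# resume from the checkpoint if there is one, else start at step 1
if [ -f "$CKPT" ]; then
  step=`cat "$CKPT"`
else
  step=1
fi
while [ "$step" -le "$LAST" ]; do
  echo "running step $step of $LAST"
  # the real work would go here, e.g.: ./crunch -from $step -to $step
  step=$((step + 1))
  echo "$step" > "$CKPT"    # record progress once the step is complete
done
echo "all $LAST steps done"
```

If the job is killed and resubmitted, the script picks up at the step recorded in the checkpoint file instead of starting over.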

Email Notifications

You can request that the GE notify you by email when a job

  • is started (beginning of the job)
  • ends, or
  • is aborted.

This is accomplished by adding the -m abe option to qsub.

You can specify the list of users to whom the GE will send mail with the -M <email_address> option to qsub. By default, mail is sent to the job owner, and it will be redirected if you created a ~/.forward file (see Introduction).

These options can be set as an embedded directive in the job script file.
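For instance, the directives could be embedded as in the sketch below, where user@example.com is a placeholder address and the comment marks where the actual computation would go.

```shell
#
#$ -N crunch -cwd -j y -o crunch.log
#$ -m abe
#$ -M user@example.com
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
# the computation goes here, e.g.: ./crunch
echo = `date` job $JOB_NAME done
```

The letters after -m select the events: a for aborted, b for beginning, e for end, so a subset such as -m ae can be used to skip the start-up message.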

Environment Variables

When a job is started, the GE defines a slew of environment variables.

(grey lightbulb) The following is a subset of these variables that your job scripts may want to use:

Name      Description                                                 Example of Value
JOB_NAME  The job name                                                test
JOB_ID    A unique job identifier                                     8736123
HOSTNAME  The name of the (master) node the job is running on         compute-8-5.local
QUEUE     The name of the cluster queue in which the job is running   sTHC.q
NSLOTS    The number of queue slots allocated to the job              1
TMPDIR    The absolute path to the job's temporary working directory  /tmp/8736126.1.sThC.q

(info) You can find a complete list at the bottom of the qsub manual - or man page - man qsub.
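As an illustration, a job script in Bourne shell syntax might use NSLOTS and TMPDIR as sketched below. The fallback values (1 and /tmp) and the scratch file name are assumptions added so that the script also runs outside the GE; the -threads option of crunch is purely hypothetical.

```shell
#
#$ -S /bin/sh
#$ -N crunch -cwd -j y -o crunch.log
#
# fall back to default values when not started by the GE (assumption of this sketch)
nslots=${NSLOTS:-1}
scratch=${TMPDIR:-/tmp}/crunch.$$.tmp
echo + `date` job $JOB_NAME started with nslots=$nslots, scratch=$scratch
# write intermediate data to the job's temporary working directory
echo "intermediate data" > "$scratch"
# a multi-threaded program could be told how many slots it was given, e.g.:
#   ./crunch -threads $nslots     (hypothetical option)
rm -f "$scratch"
echo = `date` job $JOB_NAME done
```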

Catching Time Limits

  • All the queues, except a few, have time limits: a job will get killed if it exceeds either a CPU limit or an elapsed time (aka real time, wall clock) limit. What those limits are is explained elsewhere (Available Queues and resource limits).
  • All time-limited queues have a soft limit and a hard limit. When the soft limit is reached, a signal is sent by the GE to the script. That signal can be caught to execute something before reaching the hard limit (job termination).
  • Jobs using the Bourne-shell syntax (sh) can catch the signal; jobs using the C-shell (csh) can't (a shortcoming of the Linux implementation of csh).
  • So, especially for csh jobs, it is recommended to end the job script with a line like 'echo job done' to indicate that the job (script) completed.
  • For sh jobs, the following example illustrates how to catch signals at the script level:
#
#$ -S /bin/sh
#
warn()
{
 echo @ `date` warning, received $1 signal.
}
#
trap "warn xcpu" SIGXCPU
trap "warn usr1" SIGUSR1
# note: SIGKILL (sent at the hard limit) cannot be caught, so no trap is set for it
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo sleeping $1
sleep $1
echo = `date` job $JOB_NAME done
  • You can check
    • the status of a job with the qstat command,
    • the exit status of a job, and the resources it used (CPU, elapsed time, memory, I/Os), once it completed, with the qacct command
      (see Monitoring your Jobs ).
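To see the signal handling described above at work without waiting for a queue's time limit, a script can send a signal to itself. The sketch below, in Bourne shell syntax, is not specific to the GE: it starts a child sh that installs a handler for USR1, sends that signal to itself with kill, and then continues. The variable name result is made up for the example.

```shell
# run the test in a child shell so that $$ refers to that shell only
result=$(sh -c '
  warn() { echo "received $1 signal"; }
  trap "warn usr1" USR1
  # send USR1 to this shell; the handler runs before the next command
  kill -USR1 $$
  echo "still running"
')
echo "$result"
```

Without the trap line, the USR1 signal would terminate the child shell and "still running" would never be printed.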

Last updated SGK

 
