...

(warning) While one can submit a job array that uses a parallel environment (-pe and -t, or parallel job arrays), one must use the following approach to avoid a race condition specific to SGE.

  • By default, SGE makes a local copy of each job script on the compute nodes it runs on.
  • Parallel job arrays should avoid this copy to prevent a race condition: for a small fraction of the tasks, the scheduler starts the script before it has been copied, so those tasks fail to start.

  • The output of qstat -j 9616234 will show something like this:
    error reason  11:          03/24/2016 11:09:11 [10464:63260]: unable to find job file "/opt/gridengine/default/spool/compute-2-2/job_scripts/9439938"
    and the SGE reporting file will list:
    job never ran -> schedule it again
  • The output of qstat -f -explain E | grep QERROR will show something like this:
    queue mThC.q marked QERROR as result of job 9616234's failure at host compute-1-2.local
  • This leaves a queue entry in Error state.

How to Write a Parallel Job Array

  1. Do not use embedded directives (sigh).

  2. Write a script (sh or csh, or...) with the needed steps, as for a job script.
  3. Make that script executable (chmod +x).
  4. Write a file with the qsub command and with all the options that you would otherwise put as embedded directives.
  5. Pass the -b y option to qsub and specify the full path of the script to execute.
  6. Source that file to submit the parallel job array.
  7. (warning) Do not modify the executable script file while the job array is running.
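
The steps above can be sketched as follows. This is a minimal illustration, not the site's reference example: the parallel environment name (mthread), the slot count, the task range, the job name, and the log file name are all assumptions, and the qsub file is assumed to be sourced from an sh/bash-style shell in the directory that holds demo.sh.

```shell
# Hypothetical sketch: PE name, slot count, task range and log name are assumptions.

# Step 2: the script with the steps each task runs (no embedded directives).
cat > demo.sh <<'EOF'
#!/bin/sh
# SGE sets SGE_TASK_ID for each array task and NSLOTS to the number of granted slots.
echo "task $SGE_TASK_ID on $(hostname) with $NSLOTS slots"
EOF

# Step 3: make the script executable.
chmod +x demo.sh

# Steps 4 and 5: a file holding the qsub command, with the options that would
# otherwise be embedded directives, plus -b y and the full path to the script.
# $PWD is expanded when the file is sourced; $TASK_ID is quoted so that SGE,
# not the shell, substitutes it in the output file name.
cat > qsub_demo.sou <<'EOF'
qsub -t 1-10 -pe mthread 4 -cwd -j y -N demo -o 'demo.$TASK_ID.log' -b y "$PWD/demo.sh"
EOF

# Step 6: submit the parallel job array (requires SGE, so left as a comment here):
#   source qsub_demo.sou
```

With -b y the scheduler does not make a local copy of the script, so every task executes the file at the given path. That is why the script must not be modified while the job array is running.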

...

chmod +x demo.sh

To submit the job array, simply source the qsub_XXX.sou file:

...

You can edit the qsub_demo.sou file to submit more tasks, but (warning) do not modify the executable script file while the job array is running.

...