...

  • By default, SGE makes a local copy of each job script on the compute nodes it runs on.
  • Parallel job arrays should avoid this to prevent a race condition: for a small fraction of the tasks, the scheduler starts the script before it has been copied, so those tasks fail to start.

  • The output of qstat -j 9616234 will show something like this:
    error reason  11:          03/24/2016 11:09:11 [10464:63260]: unable to find job file "/opt/gridengine/default/spool/compute-2-2/job_scripts/94399389416234"
    and the SGE reporting file will list:
    job never ran -> schedule it again
  • The output of qstat -f -explain E | grep QERROR will show something like this:
    queue mThC.q marked QERROR as result of job 9616234's failure at host compute-1-2.local
    leaving a queue entry in Error state.

...

  1. Do not use embedded directives (sigh).

  2. Write a script (sh, csh, perl, python, etc.) with the required steps, as for a job script.
  3. Make that script executable (chmod +x); you can use the #! mechanism (aka shebang) to specify the interpreter.
  4. Write a file with the qsub command and all the options that you would otherwise put as embedded directives.
  5. Pass the -b y option to qsub and specify the full path of the script to execute.
  6. Source that file to submit the parallel job array.
  7. (warning) Do not modify the executable script file while the job array is running.
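
The steps above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the file names (my_task.sh, my_task.qsub), the task range (1-100), the parallel environment name and slot count (mthread 4), and the job name are all hypothetical placeholders; only the -b y option and the mThC.q queue name come from the text above. The final step submits the job array only when qsub is available on the machine.

```shell
#!/bin/sh
# Sketch only: file names, task range, PE name, and slot count are assumptions.

# Steps 2-3: write the task script (no embedded "#$" directives)
# and make it executable. The #! line selects the interpreter.
cat > my_task.sh <<'EOF'
#!/bin/sh
# SGE sets SGE_TASK_ID for each task of a job array.
echo "task $SGE_TASK_ID starting"
EOF
chmod +x my_task.sh

# Steps 4-5: write a file holding the qsub command with every option that
# would otherwise be an embedded directive, plus -b y and the full path
# of the executable script.
cat > my_task.qsub <<EOF
qsub -b y -cwd -j y -N my_task \\
     -q mThC.q -pe mthread 4 -t 1-100 \\
     "$PWD/my_task.sh"
EOF

# Step 6: source that file to submit the parallel job array
# (skipped here if qsub is not installed).
if command -v qsub >/dev/null 2>&1; then
    . ./my_task.qsub
fi
```

Because -b y tells qsub to treat the command as an executable rather than a script to copy, each task runs my_task.sh from its original location, which is why that file must not change while the array is running.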

...