...
While one can submit a job array that uses a parallel environment (i.e., combine -pe and -t, a parallel job array), one must use the following approach to avoid a race condition specific to SGE.
- By default, SGE makes a local copy of each job script on the compute nodes it runs on.
- Parallel job arrays should avoid this copy to prevent a race condition: for a small fraction of the tasks, the scheduler starts the script before it has been copied, so those tasks fail to start.
- The output of qstat -j 9616234 will show something like this:
    error reason 11: 03/24/2016 11:09:11 [10464:63260]: unable to find job file "/opt/gridengine/default/spool/compute-2-2/job_scripts/9439938"
  and the SGE reporting file will list:
    job never ran -> schedule it again
- The output of qstat -f -explain E | grep QERROR will show something like this:
    queue mThC.q marked QERROR as result of job 9616234's failure at host compute-1-2.local
- This leaves a queue entry in the Error state.
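When several queue entries end up in the Error state, it can help to reduce each QERROR line to its queue, job ID, and host. The following is a minimal sketch, not part of the original procedure: the helper name parse_qerror is hypothetical, and it assumes the explanation lines have exactly the format of the sample line shown above.

```shell
# Hypothetical helper: print "queue job_id host" for each QERROR line on stdin.
# Assumes the line format matches the sample message shown in the text.
parse_qerror() {
  sed -n "s/.*queue \([^ ]*\) marked QERROR as result of job \([0-9]*\)'s failure at host \([^ ]*\).*/\1 \2 \3/p"
}

# Sample line copied from the text; on the cluster one would instead pipe in
#   qstat -f -explain E | grep QERROR
sample="queue mThC.q marked QERROR as result of job 9616234's failure at host compute-1-2.local"
printf '%s\n' "$sample" | parse_qerror
```

Lines that do not match the assumed format are silently dropped, so an empty result may mean either no errors or a different message layout.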
How to Write a Parallel Job Array
Do not use embedded directives (sigh).
- Write a script (sh or csh, or...) with the needed steps, as for a job script.
- Make that script executable (chmod +x).
- Write a file with the qsub command and with all the options that you would otherwise put as embedded directives.
- Pass the -b y option to qsub and specify the full path of the script to execute.
- Source that file to submit the parallel job array.
- Do not modify the executable script file while the job array is running.
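The steps above can be sketched as follows. This is an illustration under stated assumptions, not the original example: the directory pja_demo, the parallel environment name mthread, the slot count, and the task range are all hypothetical and should be replaced with values valid on your cluster.

```shell
# Hypothetical working directory; qsub needs the full path of the script.
workdir="$(pwd)/pja_demo"
mkdir -p "$workdir"

# 1. The executable script, with no embedded directives.
#    SGE sets SGE_TASK_ID and NSLOTS at run time; the defaults below only
#    let the script run outside the scheduler.
cat > "$workdir/demo.sh" <<'EOF'
#!/bin/sh
echo "task ${SGE_TASK_ID:-1} running on ${NSLOTS:-1} slot(s)"
EOF

# 2. Make the script executable.
chmod +x "$workdir/demo.sh"

# 3. The qsub command file, holding the options that would otherwise be
#    embedded directives. The PE name (mthread), the slot count (4), and
#    the task range (1-10) are assumptions. -b y tells qsub to treat the
#    command as an executable rather than copy it as a job script.
cat > "$workdir/qsub_demo.sou" <<EOF
qsub -t 1-10 -pe mthread 4 -cwd -j y -N demo -b y $workdir/demo.sh
EOF

# 4. On the cluster, submit with: source "$workdir/qsub_demo.sou"
#    (not run here, since it requires qsub).
cat "$workdir/qsub_demo.sou"
```

Because -b y makes each task run the script from its original location, any edit to the script after submission affects tasks that have not yet started.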
...
chmod +x demo.sh
To submit the job array, simply source the qsub_XXX.sou file:
...
You can edit qsub_demo.sou to submit more tasks, but do not modify the executable script file while the job array is running.
...