...
- By default, SGE makes a local copy of each job script on the compute nodes it runs on.
- Parallel job arrays should avoid this copy, because it can cause a race condition: for a small fraction of the tasks, the scheduler starts the script before it has been copied, so those tasks fail to start.
- The output of qstat -j 9616234 will show something like this:
      error reason 11: 03/24/2016 11:09:11 [10464:63260]: unable to find job file "/opt/gridengine/default/spool/compute-2-2/job_scripts/94399389416234"
  and the SGE reporting file will list:
      job never ran -> schedule it again
- The output of qstat -f -explain E | grep QERROR will show something like this:
      queue mThC.q marked QERROR as result of job 9616234's failure at host compute-1-2.local
  leaving a queue entry in Error state.
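To check whether a given job hit this race, the qstat -j output can be filtered for the message quoted above. The following is a minimal sketch: has_job_file_error is a hypothetical helper name (not an SGE command), and it only matches the "unable to find job file" text shown in the example.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if the text read from stdin contains the
# "unable to find job file" message reported by qstat -j for a task
# whose script copy was missing. The helper name is hypothetical.
has_job_file_error() {
    grep -q 'unable to find job file'
}

# Usage on the cluster (job ID taken from the example above):
#   qstat -j 9616234 | has_job_file_error && echo "job script was not copied in time"
```

The helper reads from standard input, so it can also be applied to saved qstat output.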
...
Do not use embedded directives (sigh).
- Write a script (sh or csh, or perl, python, etc.) with the required steps, as for a job script.
- Make that script executable (chmod +x); you can use the #! mechanism (aka shebang) to specify the interpreter.
- Write a file with the qsub command and all the options that you would otherwise put as embedded directives.
- Pass the -b y option to qsub and specify the full path of the script to execute.
- Source that file to submit the parallel job array.
- Do not modify the executable script file while the job array is running.
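The steps above can be sketched as follows. This is an illustration under stated assumptions, not a site-specific recipe: the file names task.sh and submit_array.sh, the job name demo_array, and the task range 1-100 are hypothetical placeholders. The options -b y, -cwd, -j y, -N and -t are standard qsub options, and SGE sets SGE_TASK_ID for each task of an array. The parallel environment and queue options depend on your cluster and are left out.

```shell
#!/bin/sh
# Sketch: prepare a job array that is submitted with -b y, so that SGE
# runs the script in place instead of copying it to the compute nodes.
# All file names and option values here are hypothetical placeholders.

workdir=$(pwd)

# Step 1: the script to run, written as for a job script,
# but without embedded directives.
cat > "$workdir/task.sh" <<'EOF'
#!/bin/sh
# SGE sets SGE_TASK_ID for each task of a job array.
echo "task ${SGE_TASK_ID:-unset} running on $(uname -n)"
EOF

# Step 2: make it executable; the #! line selects the interpreter.
chmod +x "$workdir/task.sh"

# Steps 3 and 4: a file holding the qsub command, with the options that
# would otherwise be embedded directives, -b y, and the full script path.
# Add your parallel environment and queue options to this line.
cat > "$workdir/submit_array.sh" <<EOF
qsub -b y -cwd -j y -N demo_array -t 1-100 "$workdir/task.sh"
EOF

# Step 5: submit by sourcing the file (left commented out in this sketch).
# . "$workdir/submit_array.sh"
```

Keeping the qsub command in its own file records the options used for the submission, which is the role embedded directives would otherwise play.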
...