SGE Queue Tantrums
You can get a "bird's eye" view of your cluster load by running:
$ qstat -g c
CLUSTER QUEUE                  CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
dev.q                            1.01      8      0      8      0      0
express.q                        0.38      0    176    176      0      0
general.q                        0.66     17      1     32     16      0
short.q                          0.37     29      2     88      0     72
If you find processors are mysteriously unavailable to a queue, it might be that the queue is in an error state. For example, the "short.q" above has 88 processors allocated to it, but with 29 in use only 2 more are available. What about the remaining 57?
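The arithmetic above (TOTAL minus USED minus AVAIL) is easy to run across every queue at once. Here is a minimal sketch: unaccounted_slots is a made-up helper name, not part of SGE, and it assumes the summary has a header row naming the USED, AVAIL and TOTAL columns, as in the listing above. It finds the columns by header name, so an extra column in your version's output should not throw it off.

```shell
# Hypothetical helper (not an SGE command): reads `qstat -g c` output on
# stdin and prints "<queue> <slots neither used nor available>" per queue.
unaccounted_slots() {
    awk '
        # Header row: remember where each named column sits.
        # "CLUSTER QUEUE" is two words in the header but a single field
        # in the data rows, so data columns are shifted left by one.
        $1 == "CLUSTER" {
            for (i = 1; i <= NF; i++) col[$i] = i - 1
            next
        }
        /^-+$/ { next }    # skip the dashed separator line
        ("TOTAL" in col) && NF >= col["TOTAL"] {
            printf "%s %d\n", $1, $(col["TOTAL"]) - $(col["USED"]) - $(col["AVAIL"])
        }
    '
}
```

You would pipe the summary into it, e.g. "qstat -g c | unaccounted_slots". Any queue that reports a number well above zero is worth a closer look.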
You can check this by running:
$ qstat -f
An "E" in the "states" column (the last column of each queue line) indicates the queue is in an error state.
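On a big cluster, scanning that listing by eye gets old fast. Here is a rough filter that pulls out only the queue instances flagged with "E". The name error_queues is made up for this sketch. It assumes each queue line looks like "queue@host qtype slots load_avg arch states", with the states field present only when a state is set, so check it against your own qstat -f output before trusting it.

```shell
# Hypothetical helper (not an SGE command): reads `qstat -f` output on
# stdin and prints each queue instance whose states field contains "E".
error_queues() {
    # Queue lines start with queue@host; job lines start with a numeric
    # job id and are skipped. The states field is assumed to be the sixth
    # column and may combine several letters (e.g. "auE").
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}
```

Run it as "qstat -f | error_queues" to get one queue instance per line.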
You can get a (slightly) more detailed report by running:
$ qstat -explain E
Here is the annoying thing: the error state will not be cleared automatically. Even a reboot will not help. What has happened is that a job or job submission has failed in such a catastrophic way that SGE has pulled the node from the queue to prevent any further activity there until a human can step in to investigate and resolve the issue.
The only way to clear the error state is to run:
$ qmod -c '*'