Ensuring That SGE Does Not Oversubscribe Processors or Memory

SGE, by default, will happily oversubscribe processors when multiple queues target the same nodes (nice one, SGE). Furthermore, even if jobs specify a memory limit, that limit applies per job: each job can stay under its own limit while the combined memory usage of the jobs assigned to a machine exceeds that machine’s memory, exhausting the node and sending it off into limbo (cunning, SGE, very cunning). The solution to both of these issues is to specify processors and memory as consumable resources at the host level. This post details the procedure.

Creating the Consumable Resource Attributes

  • Call up the complex attribute modification editor:

    $ qconf -mc
  • Edit the “slots” entry, so it looks like the following:

    slots               s          INT
  • Edit the “virtual_free” entry, so it looks like the following:

    virtual_free        vf         MEMORY
  • Save and exit (“:wq”)
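To double-check the edit, the complex configuration can be printed with “qconf -sc” and the relevant column inspected. The sketch below assumes the usual column order of that listing (name, shortcut, type, relop, requestable, consumable, default, urgency), so the consumable flag is the sixth field. “consumable_flag” is a helper name made up for illustration.

```shell
# Print the "consumable" column (6th field) of one complex attribute.
# Reads `qconf -sc`-style text on stdin; $1 is the attribute name.
# Assumes the column order: name shortcut type relop requestable
# consumable default urgency.
consumable_flag() {
    awk -v name="$1" '$1 == name { print $6 }'
}

# Intended use on a live cluster (not executed here):
#   qconf -sc | consumable_flag slots
#   qconf -sc | consumable_flag virtual_free
```

For SGE to track a resource as a consumable, this column should read YES (or JOB) rather than NO.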

Set the Host-Level Limits for the Resources

This is a pain. The configuration for each host has to be specified individually. (Really, SGE? Really?) So, for each host with the name/address “<hostname>”, type:

$ qconf -me <hostname>

and add a “complex_values” entry, or edit the existing one, so that it looks like:

complex_values        slots=8, virtual_free=16G
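To confirm that a host actually picked up the limits, its configuration can be printed with “qconf -se” and the “complex_values” line pulled out. A minimal sketch, assuming the limits fit on a single line of that output. “host_limits” is a hypothetical helper name.

```shell
# Extract the value of the complex_values line from `qconf -se`-style
# text on stdin. Assumes the entry is not wrapped across several lines.
host_limits() {
    awk '$1 == "complex_values" { sub(/^complex_values[ \t]+/, ""); print }'
}

# Intended use on a live cluster (not executed here):
#   qconf -se <hostname> | host_limits
```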

This page has a script that automates some of the pain. It creates a temporary file for each host that describes the limits, and then sets each host’s configuration from that file. It obviously requires careful customization for each individual set-up. Modified for my case:

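#! /bin/bash

host_prefix='compute-0-'
# Node index range, taken here from two command-line arguments
# (first and last index); adapt to however your nodes are numbered.
first=${1:?usage: $0 FIRST_INDEX LAST_INDEX}
last=${2:?usage: $0 FIRST_INDEX LAST_INDEX}
for i in $(seq "$first" "$last"); do
    n=$(printf "%d" "$i")
    host="$host_prefix$n"
    echo "$host"
    file_name="sge_$host.conf"
    # Write this host's configuration to a temporary file
    cat > "$file_name" << EOF
hostname $host
load_scaling NONE
complex_values slots=8, virtual_free=15G
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
EOF
    # Replace the host's configuration with the contents of the file
    qconf -Me "$file_name"
done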
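An alternative that avoids the temporary files is “qconf -mattr”, which changes a single attribute of an existing object in place. The sketch below is not the approach used by the script above. “set_host_limits” and the DRY_RUN switch are names invented for illustration, and the dry-run mode only prints the command so that it can be reviewed first.

```shell
# set_host_limits HOST SLOTS MEM
# Sets slots and virtual_free in the host's complex_values via qconf -mattr.
# With DRY_RUN=1 the command is printed instead of executed.
set_host_limits() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "qconf -mattr exechost complex_values slots=$2,virtual_free=$3 $1"
    else
        qconf -mattr exechost complex_values "slots=$2,virtual_free=$3" "$1"
    fi
}

# Example (dry run, nothing is changed):
#   DRY_RUN=1 set_host_limits compute-0-0 8 15G
```

Unlike “qconf -Me”, which replaces the whole host configuration from a file, this touches only the named entries of “complex_values”.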

Set/Customize the Default Resources Used by a Job

Edit “$SGE_ROOT/default/common/sge_request” and add:

# default memory limit
-l h_vmem=1.8G
# default memory usage
-l virtual_free=1.8G
# default to general.q
#-q "general.q"
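As a quick sanity check on these defaults: with the host limit of virtual_free=15G from the script above and a default request of 1.8G per job, eight default-sized jobs request 14.4G in total, which fits within the limit and matches slots=8. The sketch below does that arithmetic. “jobs_that_fit” is a hypothetical helper, and reading the 1.8G figure this way is an assumption rather than something the configuration itself states.

```shell
# jobs_that_fit HOST_MEM_G PER_JOB_G
# Prints how many jobs of the given size fit within the host's
# virtual_free limit (both values in gigabytes, decimals allowed).
jobs_that_fit() {
    awk -v total="$1" -v per="$2" 'BEGIN { printf "%d\n", int(total / per) }'
}

jobs_that_fit 15 1.8   # prints 8
```

A job that needs more than the default can request it at submission time, for example “qsub -l virtual_free=4G,h_vmem=4G job.sh” (job.sh being a placeholder script name). Such a job then counts for 4G against the host’s limit.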