Scheduling Jobs in Sun Grid Engine
The information on this page is derived from http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/source/daemons/schedd/schedd.html, where the scheduling mechanism of SGE is explained in more detail. The local settings of the scheduler parameters were taken into account in preparing this page.
Sun Grid Engine assigns tickets to each job according to four high-level policies. The sum of the tickets a job receives under these policies determines its position in the scheduling queue. The scheduler attempts to dispatch the job with the most tickets first. If the resources requested by that job are temporarily unavailable, the next job in line is tried. As a result, a job can be scheduled ahead of jobs submitted earlier, or its scheduling can be postponed.
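The dispatch order described above can be pictured with a small Python sketch. It is an illustration only, not the SGE scheduler: the Job class, the single "slots" resource and the ticket numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    tickets: int       # sum of the tickets from all policies (made-up numbers)
    slots_needed: int  # hypothetical stand-in for the job's resource request


def dispatch(pending, free_slots):
    """Return the names of the jobs started in one scheduling pass."""
    started = []
    # Most tickets first; submission order plays no role in this sketch.
    for job in sorted(pending, key=lambda j: j.tickets, reverse=True):
        if job.slots_needed <= free_slots:
            free_slots -= job.slots_needed
            started.append(job.name)
        # Otherwise the job stays pending and the next one is tried.
    return started


pending = [
    Job("early_big", tickets=900, slots_needed=8),
    Job("late_small", tickets=500, slots_needed=2),
    Job("mid", tickets=700, slots_needed=4),
]
print(dispatch(pending, free_slots=6))
```

With only 6 free slots the job with the most tickets cannot start, so the two smaller jobs are dispatched ahead of it.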
Policies to obtain tickets
As already mentioned, four policies assign tickets to jobs:
- Share tree: Users and projects are given relative entitlements, which are meant to be granted over time. If a group does not use its shares, they are made available to other groups. In the long run, however, SGE tries to achieve the ratio defined by the share tree, provided enough jobs are submitted. The entitlements have a decay time (currently 24 h).
- Functional: This policy assigns entitlements to users and projects statically, i.e. it defines a relative priority among the projects.
The other two policies are used in exceptional cases only:
- Deadline: Jobs can be submitted with a deadline defined. The entitlement of deadline jobs grows automatically as they approach their deadline.
- Override: Administrators can manually override the policies given above to prioritize given jobs, users or projects.
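The effect of the share tree decay time can be sketched as follows. This assumes that the 24 h decay time acts as a half-life on accumulated past usage, which this page does not state explicitly; the function name and the usage units are hypothetical.

```python
def decayed_usage(usage, hours_elapsed, half_life_hours=24.0):
    """Past usage that still counts after hours_elapsed.

    Assumption: the decay time is a half-life, so the usage that is
    still counted halves every half_life_hours.
    """
    return usage * 0.5 ** (hours_elapsed / half_life_hours)


# Usage recorded one and two days ago, in arbitrary units.
print(decayed_usage(100.0, 24.0))
print(decayed_usage(100.0, 48.0))
```

Under this assumption, heavy usage from a few days ago has little influence on how today's jobs are ranked.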
Currently SGE is configured so that the share tree tickets have a fairly large influence on the overall scheduling policy. Furthermore, the number of tickets a job receives depends on the resources requested. The parameters mem and cpu are used in the calculation. Here mem is the memory requested multiplied by the wallclock time requested (h_vmem*h_rt), and cpu is the wallclock time requested (not the CPU time!). Our current configuration gives the cpu parameter a much larger weight than the mem parameter.
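As a rough illustration of how the two parameters combine, the sketch below computes a weighted usage figure from a job's requests. The weights 0.9 and 0.1 are invented for the example and are not the values of the local configuration; the units (GB and hours) and the simple weighted sum are assumptions as well.

```python
def job_usage(h_vmem_gb, h_rt_hours, weight_cpu=0.9, weight_mem=0.1):
    """Weighted usage charged for a job's resource request (sketch only)."""
    mem = h_vmem_gb * h_rt_hours  # memory requested times wallclock time requested
    cpu = h_rt_hours              # wallclock time requested, not CPU time
    return weight_cpu * cpu + weight_mem * mem


# The same job with a generous and with a tight wallclock request.
print(job_usage(h_vmem_gb=2, h_rt_hours=10))
print(job_usage(h_vmem_gb=2, h_rt_hours=5))
```

Because h_rt enters both terms, lowering the requested wallclock time reduces the charged usage more than lowering the requested memory by the same factor.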
Taking these facts into account, one can see how a job can be scheduled quickly:
- The fewer resources a job requests, the faster it will be scheduled.
- The more jobs a user submits, the smaller the chance of fast scheduling.
Queue and host selection
If more than one host and more than one queue are suitable for a given job, the host with the lowest load is selected, and one of the queues available on that host is used to schedule the job. Therefore, as long as more than one slot in the batch system meets the job requirements, the queue chosen depends on the load of the hosts and NOT on queue parameters such as the maximum CPU or wallclock time allowed in a queue.
A user can select a queue manually, but it should be clear from the above that doing so cannot improve the chances of a job being scheduled.
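The selection rule can be sketched in a few lines of Python. The host names, queue names, load values and the single wallclock limit per queue are hypothetical, and taking the first suitable queue on the chosen host is an arbitrary choice made for the example.

```python
def pick_slot(hosts, job_rt_hours):
    """Return (host, queue) for a job, or None if nothing is suitable.

    hosts maps a host name to {"load": float, "queues": {name: max_rt_hours}}.
    """
    candidates = []
    for host, info in hosts.items():
        # Queues on this host whose wallclock limit covers the request.
        suitable = [q for q, max_rt in info["queues"].items()
                    if max_rt >= job_rt_hours]
        if suitable:
            candidates.append((info["load"], host, suitable))
    if not candidates:
        return None
    # The host with the lowest load wins, whatever queues it offers.
    _, host, suitable = min(candidates, key=lambda c: c[0])
    return host, suitable[0]


hosts = {
    "node1": {"load": 0.8, "queues": {"short": 1, "long": 48}},
    "node2": {"load": 0.2, "queues": {"long": 48}},
}
print(pick_slot(hosts, job_rt_hours=0.5))
```

Here a half-hour job ends up in the "long" queue on node2, because node2 has the lower load, even though node1 offers a "short" queue.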
Conclusions for submitting jobs
- Give safe but as low as possible values for the resources h_cpu, h_rt and h_vmem.
- Omitting one of these values may cause your job to crash or to be scheduled late.
- You benefit from tight memory limits, as SGE takes h_vmem*h_rt of past jobs into account.
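A submission that sets all three limits could be assembled as in the following sketch. The script name and the limit values are placeholders; the helper only builds the qsub command line with its -l resource request and does not submit anything.

```python
def qsub_command(script, h_cpu, h_rt, h_vmem):
    """Build a qsub argument list that requests all three hard limits."""
    resources = f"h_cpu={h_cpu},h_rt={h_rt},h_vmem={h_vmem}"
    return ["qsub", "-l", resources, script]


# Placeholder values: choose limits that are safe but tight for your own job.
cmd = qsub_command("myjob.sh", h_cpu="01:00:00", h_rt="01:30:00", h_vmem="2G")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run on a submit host, or the printed line can be typed into a shell.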