/!\ This web page will no longer be updated. Please use this link for current information.



This page contains information about the batch system and its usage. At DESY Zeuthen we use the batch system Univa Grid Engine 8.3.1p6.

1. News

2. Overview

In case of problems, please send email to <uco-zn AT desy DOT de>

2.1. Available farm nodes

Node name | number of systems | CPU | clock frequency | cores | memory | scratch space in $TMPDIR | comment
bladeb* | 16 | Intel Xeon X5660 | 2.8GHz | 12 | 48GB | 480GB |
bladec* / tcx17* | 32 | Intel Xeon X5675 | 3.08GHz | 12 | 48GB | 1.2TB |
bladed* | 8 | Intel Xeon X5650 | 2.67GHz | 12 | 48GB | 480GB | 1x nVidia Tesla M2090 GPGPU per node
blade{e,f}* | 32 | Intel Xeon E5-2660 | 2.2GHz | 16 | 64GB | 1.2TB |
kepler{00..15} | 16 | Intel Xeon E5-2660 | 2.2GHz | 16 | 64GB | 480GB | 2x nVidia Kepler K20 GPGPU per node
kepler{16..26} | 11 | Intel Xeon E5-2640 v3 | 2.6GHz | 16 | 64GB | 480GB | 2x nVidia Kepler K80 GPGPU per node
blade{g,h}* | 32 | Intel Xeon E5-2640 v3 | 2.6GHz | 16 | 64GB | 1.2TB |
bladei* | 16 | Intel Xeon E5-2640 v4 | 2.4GHz | 20 | 64GB | 1.2TB |

For an up-to-date overview, see also the output of the qhost command - details can be found here

2.2. Submission hosts

2.3. Job runtime

The batch farm is configured to optimize job throughput while retaining some degree of interactive availability. Therefore we prefer job runtimes of 1-12 hours. Jobs running for more than 12 hours can only fill up the compute farm to a certain percentage. The maximum job runtime is currently limited to 48 hours: /!\ Jobs requesting a longer runtime will never start!

job runtime | description
0-30 minutes | allows a slight cpu oversubscription - this means a job can start although all available slots are currently filled. Expect slightly worse cpu performance! It should therefore be used for test purposes only.
30 minutes - 12 hours | the preferred job runtime; allows maximum farm usage while keeping an overall good "interactivity", i.e. fast job turnaround
12-24 hours | the number of simultaneously running jobs is limited to 75% of the available slots
24-48 hours | the number of simultaneously running jobs is limited to 66% of the available slots

A job runtime of less than 10 minutes should be avoided to keep a good ratio between overhead at job start/end and the actual job payload.
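For example, a job that fits into the preferred 30-minute to 12-hour window can be submitted with an explicit wallclock request like this (the runtime value and the job script are placeholders):

qsub -l h_rt=06:00:00 <job_script>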

3. Job submission

3.1. Basic requirements

3.2. Batch job submission

3.3. Interactive job submission

3.4. Parallel Jobs

There are two kinds of parallel jobs Grid Engine supports:

  1. Multithreaded / multiprocessor jobs on a single node:
    • You can reserve more than one core on a multicore host by specifying the qsub / qrsh parameter -pe multicore <number of cores>. This should of course only be done if your job is able to use all those cores simultaneously in an efficient way (i.e. 100% cpu usage per core).

    • As it is rather unlikely that you get the requested resources at once, you should always also specify the -R y switch to prevent your job from starving in the waiting queue.

  2. MPI jobs:
    • use the parameter -pe mpi <number of cores>

    • see the Cluster pages for additional documentation about running parallel jobs of this kind.

    • to prevent "starving" jobs in the queue, request a reservation: qsub switch: -R y

/!\ ATTENTION: Resources like h_rss & h_cpu are always requested per job slot, not per job! So you always have to adjust the requirements accordingly. Example: your parallel multithreaded job consumes almost 8 GB of memory altogether and runs in 4 slots (-pe multicore 4). If you request -l h_rss=2G, your job is allowed to consume up to 8 GB resident memory (4*2GB).
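As a minimal sketch (script name and resource values are placeholders only), a multithreaded job requesting 4 cores on one node, 2 GB resident memory per slot (8 GB in total) and a reservation to avoid starving in the queue could be submitted like this:

qsub -pe multicore 4 -R y -l h_rss=2G -l h_cpu=06:00:00 <job_script>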

Here's a short overview of the available parallel environments:

PE name | description
multicore | intended to be used for jobs using more than one cpu core on a single node
multicore-mpi | jobs running on more than one cpu core on a single node, but doing the parallelisation via OpenMPI
mpi | jobs linked against OpenMPI and running in parallel on several nodes

3.5. GPU Jobs

The Zeuthen batch farm provides a limited number of nodes with nVidia Tesla GPGPUs installed. The CUDA-SDK is installed on those nodes as well. To request a GPGPU, use the qsub/qrsh switch -l gpu:

qsub -l gpu <other requirements> <job>

If you intend to use a specific GPGPU model, use the qsub/qrsh switch -l gpu_type=<type>. To see which models are available, run this command:

qhost -l gpu_type='*' -F gpu_type

Inside the gpu job you can access your reserved nVidia device by using the environment variable $SGE_GPU_DEVICE:

[bladed0] ~ % ls -l $SGE_GPU_DEVICE
crw------- 1 ahaupt sysprog 195, 0 Mar  7 21:24 /dev/nvidia0

In case you requested multiple GPGPUs per job, the environment variables $SGE_GPU_DEVICE0, $SGE_GPU_DEVICE1, ... hold the paths to your devices. Also $CUDA_VISIBLE_DEVICES is set accordingly.
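As a minimal job script sketch (my_cuda_application and the resource values are placeholders for your own program and requirements):

#!/bin/zsh
#$ -l gpu
#$ -l h_rt=01:00:00
# the batch system exports the reserved device path(s) into the job environment
echo "reserved GPU device: $SGE_GPU_DEVICE"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
./my_cuda_application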

3.6. Most common qsub / qrsh submission switches

This section describes the most common parameters for qsub and qrsh. For more details, read their man pages. The switches can also be specified directly in the job script, as shown in the example script.

Switch | description
-cwd | execute the job from the current directory and not relative to your home directory; mostly used in conjunction with the -o and -e switches (if you just specify a relative path there)
-e <job's stderr output file> | specifies the path to the job's stderr output file (relative to the home directory, or to the current directory if the -cwd switch is used)
-hold_jid <job ids> | tell Grid Engine not to start the job until the specified jobs have finished successfully
-i <job's stdin file> | specifies the job's stdin
-j <y|n> | merge the job's stderr with its stdout
-js <job share> | specifies the relative job share; can be used to weight the importance of a single user's jobs, higher numbers mean higher importance
-l <job resource> | specifies the resources a job needs; multiple specifications can be stacked, see the next topic "Requesting resources" for details
-m <b|e|a> | let Grid Engine send a mail on the job's begin (b), end (e) and/or abort (a)
-notify | Grid Engine will announce the upcoming abort (SIGKILL) by sending a SIGUSR2 first
-now <y|n> | force / switch off immediate execution of the job; y is the default for qrsh, n for qsub
-N <jobname> | specifies the job name; default is the name of the submitted script
-o <job's stdout output file> | specifies the path to the job's stdout output file; if you specify -j as well, it will also take the stderr output
-P <project> | specifies the project the job should run under; see the projects topic for details
-pe <parallel environment> | specifies the parallel environment the job should run under; see the parallel job topic for details
-R <y|n> | tell Grid Engine (not) to reserve slots for huge jobs (i.e. parallel jobs, high demand of memory); this should prevent "starving" jobs in the waiting queue. If you know your job can run on the farm and it has already been waiting a really long time, try this switch
-S <path to shell> | specifies the shell Grid Engine should start your job with; default is /bin/zsh
-t <from-to:step> | submit an array job, i.e. the same job is submitted several times (one task per number in the range from-to, optionally with a step size between the task numbers); the task number can be accessed in the job via the environment variable $SGE_TASK_ID. For more details see the man page
-tc <max running tasks> | for array jobs, limits the number of simultaneously running tasks; should be used e.g. to avoid overloading limited central services like network file systems
-V | pass the current shell environment on to the job <!> LD_LIBRARY_PATH is not inherited
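As a sketch of how these switches can be embedded directly in a job script (all paths, names and values below are placeholders; adapt them to your own group, account and resource needs):

#!/bin/zsh
#$ -N myjob
#$ -j y
#$ -o /afs/ifh.de/group/<your group>/scratch/<your account>/logs/
#$ -l h_rss=2G
#$ -l h_rt=06:00:00
#$ -t 1-10
#$ -tc 5
# each of the 10 array tasks runs this payload with its own $SGE_TASK_ID
echo "task $SGE_TASK_ID running on $(hostname)"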

3.7. Requesting resources

These are the most common resources you can request:

complex name | description | possible values | example | comment
arch | the required host architecture | i386,x86_64 | -l arch=x86_64 | currently useless since only 64 bit systems are available
os | the required host operating system | sl6,sl7 | -l os=sl6 | currently useless since only SL6 nodes are available
h_rss | job's maximum resident memory usage (RSS) | | -l h_rss=2G | hard limit, default: 1G /!\ jobs requesting less than 256M will be rejected
s_cpu | job's maximum cpu time | | -l s_cpu=03:00:00 | soft limit
h_cpu | job's maximum cpu time | | -l h_cpu=48:00:00 | hard limit
s_rt | job's maximum wallclock time | | -l s_rt=03:00:00 | soft limit
h_rt | job's maximum wallclock time | | -l h_rt=15:00:00 | hard limit, /!\ if your job needs more than 30 minutes runtime, either h_cpu/s_cpu or h_rt/s_rt are mandatory
tmpdir_size | job's maximum scratch space usage in $TMPDIR | | -l tmpdir_size=5G | only needed if you need lots of space (>1G)
gpu | request a GPGPU | | -l gpu |
hostname | name of the execution host | | -l hostname=bladeff | usage not recommended!
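Multiple resource requests can simply be stacked on the command line. For example (all values are placeholders), a job needing 2 GB resident memory, 10 hours of cpu time and 5 GB of local scratch space could be submitted as:

qsub -l h_rss=2G -l h_cpu=10:00:00 -l tmpdir_size=5G <job_script>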

3.8. Projects

3.9. CPU core binding

All jobs requesting a runtime of more than 30 minutes automatically run with cpu core binding enabled. You can of course change the default core-binding strategy - see the qsub man page for details.
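As a sketch (please check the qsub man page for the exact -binding syntax supported by the installed Grid Engine version), a two-core job could for instance request a linear binding to two consecutive cores like this:

qsub -binding linear:2 -pe multicore 2 -R y -l h_rt=02:00:00 <job_script>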

3.10. Some words about the stdout/stderr job output files

Unfortunately Grid Engine stores the job's stdout/stderr files in your home directory by default. However, as soon as you run more than a few (say: 10 and more) jobs simultaneously, this will cause severe performance problems on the AFS file server hosting your home directory. That's why we strongly advise you to store those files in a different place (e.g. somewhere below your scratch volume at /afs/ifh.de/group/<your group>/scratch/<your account>). Furthermore, do not access this directory from within your job (e.g. by making it the job's cwd, doing an "ls" on it, etc.)!

The best solution, however, is not to create those files on a shared file system (AFS, Lustre) at all. To achieve this, do something like this:

[oreade38] ~ % qsub -j y -o /dev/null <other requirements> <your jobscript>

Your job script should then start with a line like this:

exec > "$TMPDIR"/stdout.txt 2>"$TMPDIR"/stderr.txt

If you are interested in those two stdout/stderr files, you will need to copy them to a common place (e.g. your scratch directory) at the end of the job. You can catch the signals sent to the job in your script like this; USR1 is sent if you hit the s_rt limit, 0 stands for the normal exit:

#$ -m e
#$ -l s_rt=00:00:02
echo "starting"
# runs when the script exits normally (pseudo-signal 0)
trap 'echo exiting normally' 0
# runs when Grid Engine sends USR1 because the s_rt limit was hit
trap 'echo exiting after USR1;exit 2' USR1
sleep 60
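Putting the pieces together, a sketch of a job script that keeps its output in $TMPDIR and copies the files to a scratch directory at the end (the target path and payload program are placeholders; use your own scratch volume):

#!/bin/zsh
#$ -j y
#$ -o /dev/null
#$ -l s_rt=06:00:00
exec > "$TMPDIR"/stdout.txt 2> "$TMPDIR"/stderr.txt
# copy the collected output files on normal exit ...
trap 'cp "$TMPDIR"/std*.txt /afs/ifh.de/group/<your group>/scratch/<your account>/logs/' 0
# ... and when the s_rt limit is hit
trap 'cp "$TMPDIR"/std*.txt /afs/ifh.de/group/<your group>/scratch/<your account>/logs/; exit 2' USR1
./my_payload_program    # placeholder for the actual job payload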

3.11. Examples


Intention | Command ( <!> Always without line breaks! )
You want to submit a job which needs 4 GB rss memory and 13 hours of cpu time | qsub -l h_rss=4G -l h_cpu=13:00:00 <job_script>
Your job should run under the project z_nuastr (which you are a member of) although your default project is icecube. It needs only 20 minutes of cpu time. | qsub -P z_nuastr -l h_cpu=00:20:00 <job_script>
You want to submit the same 64 bit job 20 times at once. Every job needs 1.5 GB rss memory and runs for 30 hours. | qsub -t 1-20 -l h_rss=1.5G -l h_cpu=30:00:00 <job_script>
You want an interactive shell on a batch node and intend to use it for three hours. | qrsh -l h_rt=03:00:00
Your job needs 48 hours of cpu time, but you are not sure whether this is enough. The job should receive a USR2 signal before actually being killed by the batch system. You further want to receive an email on the job's start, end and possible abort. | qsub -l h_cpu=48:00:00 -notify -m abe <job_script>
You have a number of jobs running some lower priority "background tasks". You now want to submit more urgent jobs without killing / suspending the already running "background task" jobs. | first: qsub -l <resources> <low priority job> - then: qsub -l <resources> -js 10 <important_jobscript>

3.12. Best Practices

When running mass production on the farm, please keep in mind:

By following these rules you will help everybody (including yourself) to get the most out of the existing, limited resources!

4. Other commands to handle your active jobs

Command | Function
qhold <job id> | put submitted (but not yet started) jobs into 'hold' state, i.e. the job won't be considered for execution until it is 'released' again
qrls <job id> | removes the 'hold' state of a job

5. Monitoring the current farm and job status

6. Troubleshooting

6.1. Common pitfalls

Your job "starves" in the waiting queue
  • The farm is full - check the output of "qstat -g c" for available nodes.
  • You requested resources which cannot be fulfilled (e.g. -l h_cpu > 48:00:00) - you can only request a cpu time of less than 48 hours.
  • Your job is in error state (qstat lists your job in Eqw state) - check the reason for the error and remove the error flag (details about it can be found here).
  • You requested high amounts of consumable resources or a parallel environment (e.g. qsub -pe multicore 8 <jobscript> or qsub -l h_rss=30G <jobscript>) - additionally use job reservation (qsub switch: -R y).

Only some of a set of identical jobs die
  • You did not specify your requirements correctly, e.g. you did not specify h_cpu - if h_cpu is not specified, your job might run in short queues; if it then needs more than 30 minutes of cpu time, it will be killed.
  • Too many jobs access data on the same file server at once - use AFS! Do not submit too many jobs at once; if you really need to, try using the qsub "-hold_jid" option. Read the article about optimal storage usage at DESY.

All your jobs die at once
  • There are problems writing the log files (job's STDOUT/STDERR):
    • The log directory (located in AFS) contains too many files; SGE's error mail (qsub parameter '-m a') contains a line saying something like "/afs/ifh.de/...: File too large" - do not store more than 1000 output files per directory.
    • The output directory is not writable; SGE's error mail contains a line saying something like "/afs/ifh.de/...: permission denied" - check the directory permissions.
    • The log directory does not exist on the execution host - you can only use network enabled filesystems (AFS, NFS) as log directory; local directories (e.g. /usr1/scratch) won't work.

qrsh fails with an error message complaining 'Your "qrsh" request could not be scheduled, try again later'
  • The farm is full and qrsh wants to occupy a slot at once - try "qrsh -now n <other requirements>"; that way your request will be put into the waiting queue and no immediate execution will be forced.

6.2. Retrieving Job Status Information

Shortly after jobs have finished, the job status information is no longer accessible using normal SGE commands. For a given farm, the MACBAT web page contains the menu heading 'Reporting - Finished Jobs'. The related link gives access to the information of your finished jobs: it lists an overview of finished jobs per day, from there listings of the jobs finished on a given date can be retrieved, and by following the links further, every job detail can be displayed, down to the single tasks of array jobs.

The same information can be obtained using command line tools on Linux. The command arcx sgejobs is provided for retrieving this information. A short usage summary is printed with

arcx sgejobs -h

The information is displayed for the default farm ('uge'), but another farm ('pax' or the old 'sge') can be chosen using the -f=<farm> switch. Information is only shown to authenticated users and only for their own jobs. Group admins can be registered (please contact <uco-zn AT desy DOT de>); they are then able to view information on jobs belonging to other users of their group.

If arcx sgejobs is called without further arguments, a list of submission dates and the number of jobs submitted on each day is printed:

arcx sgejobs

If a submission date or a submission interval is given (date format yyyy-mm-dd), the job data are printed in tabular form:

arcx sgejobs 2015-09-01

Finally, if a job number is given, the full information belonging to that job is displayed:

arcx sgejobs 946942

        qname = std.q
        hostname = bladecd.zeuthen.desy.de
        unixgroup = sysprog
        owner = ahaupt
        job_name = farmHEPSPEC.sh
        job_number = 946942
        submission_time = 1441134736
        start_time = 1441134747
        end_time = 1441144320
        failed = 0
        exit_status = 0
        ru_wallclock = 9572
        ru_utime = 9047
        ru_stime = 166
        ru_maxrss = 493216
        ru_minflt = 23156122
        ru_majflt = 483
        ru_inblock = 2699188
        ru_oublock = 7691376
        ru_nvcsw = 114531
        ru_nivcsw = 943807
        project = sysprog
        granted_pe = NONE
        slots = 1
        task_number = 0
        cpu = 9213
        mem = 1023.24
        category = -l h_cpu=32400,h_rss=2G,h_stack=10M,hostname=bladec*,m_mem_free=2.1G,s_rt=32700,tmpdir_size=5G -P sysprog -binding linear_automatic 1 0 0 0 no_explicit_binding
        pe_taskid = NONE
        maxvmem = 720318000

6.3. SGE Failure and Exit Codes

The exit code is the return value of the exiting program. It can be a user-defined value if the job finishes with a call to 'exit(number)'. For abnormally terminated jobs it is the signal number + 128. If an SGE job is terminated because a limit was exceeded, SGE sends a SIGUSR1 signal (10) to the job, which results in an exit code of 138.

The SGE failure code indicates why a job was abnormally terminated. The following incomplete list mentions the most frequent failure codes:

6.4. Known SGE Bugs

7. More documentation

Batch_System_Usage (last edited 2017-06-15 13:13:20 by ManuelaBrehmer)