Grid Engine Array Task Dependency Specification at Open Source at Rising Sun Pictures

Grid Engine Array Task Dependency Specification

Project Overview

Project aim

The aim of this project is to extend the Grid Engine job dependencies to support dependencies between sub-tasks in an array job. The aim is to target the most common use cases for digital content creation and digital visual effects.

The issues targeted with this project are:

Also, see related email threads:

Expected benefit

The expected benefit of these changes is to make the grid engine easier to use for digital visual effects and digital content creation where typically “renders” consist of the processing of sets of images (frames in an image sequence) which is ideally carried out as an array task.

As an example:
In a 3d render, several passes of renders are carried out in a specific order to produce a final rendered image. For instance, ribs need to be generated, followed by shadow maps, followed by the main render.

The time to submit renders of large numbers of frames will be reduced by at least an order of magnitude. The number of simultaneous running jobs on the grid engine will be increased. It will be easier to manage jobs as all frames will be “grouped” at the grid engine level. The speed of the grid in general should increase, including qstat when monitoring all jobs.

Functionality

The suggested changes in this document are designed to be fully backward compatible so with no change in usage there will be no change in behaviour.

The hold array dependency option

Add an option, when submitting an array job, where we specify a list of jobs that we make this job dependent on (in a similar way to -hold_jid) except we treat the dependency at the sub-task level rather than the job level. This is best explained through an example.

For example:

Currently, when an array task (B), consisting of three sub tasks is submitted and is dependent on another array task (A) consisting of three sub tasks as well:

$ qsub -t 1-3 A
$ qsub -hold_jid A -t 1-3 B

| A.1 |     | B.1 |
| A.2 | --> | B.2 |
| A.3 |     | B.3 |

All the sub-tasks in job B will wait for all sub-tasks (1, 2 and 3) in A to finish before starting the tasks in job B. So, on a single machine renderfarm, the tasks will be executed in the following approximate order: A.1, A.2, A.3, B.1, B.2, B.3.

With the option, each sub-task in array job B would be dependent on each corresponding sub-task in job A in a one-to-one mapping as follows:

$ qsub -t 1-3 A
$ qsub -hold_jid_ad A -t 1-3 B

The option “-hold_jid_ad” stands for hold array dependency

| A.1 | --> | B.1 |
| A.2 | --> | B.2 |
| A.3 | --> | B.3 |

Sub-task B.1 will only start when A.1 completes. B.2 will only start once A.2 completes, etc. On a single machine renderfarm, the tasks thus could be executed in the following approximate order: A.1, B.1, A.2, B.2, A.3, B.3

It should only be able to specify the option if we are submitting an array job, it is dependent on another array job, and that array job has the same number of sub-tasks.

Note, that it is perfectly valid to submit a job (C) that is dependent on one job (A) in the normal way and dependent on another job (B) with an array dependency

$ qsub A
$ qsub -t 1-3 B
$ qsub -hold_jid A -hold_jid_ad B -t 1-3 C

Chunking

It is fairly common in 3d rendering applications that is more efficient to render several frames at once on the same cpu than distributing the frames across several machines. An example of this is RIB generation, where the time taken to start the 3d application in batch mode can be longer than the generation of the rib. So, it is more efficient to start the 3d application, generate the RIBs for several frames and then stop the 3d application. The generation of several frames at once we will refer to as “chunking”.

Consider the following example where we submit two jobs where the first job A has a step size of 2 and the second job B is submitted with the “hold array dependency” option:

$ qsub -t 1-6:2 A
$ qsub -hold_jid_ad A -t 1-6 B

| A.1 | --> | B.1 |
|     | --> | B.2 |
| A.3 | --> | B.3 |
|     | --> | B.4 |
| A.5 | --> | B.5 |
|     | --> | B.6 |

In this circumstance it is assumed that job A is “chunking”. So, B.1 and B.2 are dependent on A.1. B.3 and B.4 are dependent on A.3 and so on. By assuming that job A is chunking we are assuming that job A.1 is rendering frame 1 and frame 2.

It is reasonable to always assume that job A is chunking, because in this example if job A.1 didn’t render frame 2, then B.2 would fail.

Consider when we submit job A with a step size of 1 and job B with a step size of 2 (with the “hold array dependency”):

$ qsub -t 1-6 A
$ qsub -hold_jid_ad A -t 1-6:2 B

| A.1 | --> | B.1 |
| A.2 | --> |     |
| A.3 | --> | B.3 |
| A.4 | --> |     |
| A.5 | --> | B.5 |
| A.6 | --> |     |

In this circumstance it is assumed that job B is “chunking”. So, B.1 is dependent on A.1 and A.2. B.3 is dependent on A.3 and A.4 and so on.

It is reasonable to always assume that job B is chunking because otherwise A.2, A.4 and A.6 would be needlessly run and the result would never be used.

Now consider when we submit job A with a step size of 3 and job B with a step size of 2 (with the “hold array dependency”):

$ qsub -t 1-6:3 A
$ qsub -hold_jid_ad A -t 1-6:2 B

| A.1 | --> | B.1 |
|     | --> |     |
|     | --> | B.3 |
| A.4 | --> |     |
|     | --> | B.5 |
|     | --> |     |

In this circumstance it is assumed that both job A and job B are “chunking”. So, B.1 is dependent on A.1, B.3 is dependent on A.1 and A.4, and B.5 is dependent on A.4.

The general case of arbitrary step size in job A and job B is handled in the next section.

Chunking in the general case

When the hold array dependency option “-hold_jid_ad” is specified and the step sizes of the array job and the dependent array job are different, we *always* assume that both are chunking.

Let us submit a job A with task index starting at t0 and ending at t1 with a step size of sa

$ qsub -t t0-t1:sa A

Then, we submit a job B, with array dependency with task indices starting at t0 and ending at t1 with a step size of sb

$ qsub -hold_jid_ad A -t t0-t1:sb B

Let div(a, b) = floor(a / b)
Let nearest_index_in_A(i) = t0 + div(i - t0, sa) * sa

Sub-task B.i will be dependent on all tasks in A between nearest_index_in_A(i) and nearest_index_in_A(i + sb - 1) where i = t0 + n * sb and n is a positive integer

Interfaces

GUI - Graphical user interfaces

qmon

job submission dialogue must be enhanced to allow Hold Array Dependencies be specified at job submission
job control dialogue must be enhanced to allow per task array hold states be monitored

API - Application programming interface

DRMAA

drmaa_run_bulk_job(3) must accept “-hold_jid_ad wc_job_list” when passed through job template attribute drmaa_native_specification
drmaa_job_ps(3) to return either DRMAA_PS_SYSTEM_ON_HOLD or DRMAA_PS_USER_SYSTEM_ON_HOLD for array tasks that are in hold due to “-hold_jid_ad wc_job_list”

CLI - Command line interface

New options for qsub and qalter

“Hold Array Dependency”

-hold_jid_ad wc_job_list
Analogous to “-hold_jid” option. Submitting an array job (using -t) as dependent on or more other array jobs using this option treats the dependency at the sub-task level rather than the job level. For instance, sub-task 1 will be dependent on sub-task 1, sub-task 2 on sub-task 2 and so on. The wc_job_list type is detailed in sge_types(1).

New option for qstat

-s hd
Display all jobs which are on hold because there are sub-tasks that are waiting to start because of an “hold array dependency”.

Modified behaviour of qstat

‘h’ state specifier in reduced qstat format qstat
a ‘h’ as state specifier shown for a job/sub-task by qstat can also indicate -hold_jid_ad

Error conditions

When submitting a job that is not an array job

$ qsub -hold_jid_ad A B
Can only specify "-hold_jid_ad" option with an array job (using "-t" option)

When submitting a job that is dependent on another array job with a different number of sub-tasks

$ qsub -t 1-10 A
$ qsub -hold_jid_ad A -t 1-3 B
This array job must have the same range of sub-tasks as the dependent array job specified with -hold_jid_ad

qstat output

Let us consider a set of jobs that have been submitted in the following way

$ qsub -N A -b y /bin/sleep 300
Your job 101 ("A") has been submitted.
$ qsub -N A -b y /bin/sleep 300
Your job 102 ("A") has been submitted.
$ qsub -N B -t 1-3 -b y /bin/sleep 300
Your job 103.1-3:1 ("B") has been submitted.
$ qsub -N B -t 1-3 -b y /bin/sleep 300
Your job 104.1-3:1 ("B") has been submitted.
$ qsub -hold_jid A -hold_jid_ad B -t 1-3 -N C -b y /bin/sleep 300
Your job 105.1-3:1 ("C") has been submitted.

The output of “qstat -j” has an extra output “ja_ad_predecessor_list” which shows the jobs that are dependent via the array dependency. It is the analogue of “jid_predecessor_list”.

$ qstat -j C
...
jid_predecessor_list (req):  A
jid_predecessor_list:  101,102
ja_ad_predecessor_list (req):  B
ja_ad_predecessor_list: 103,104
...

When xml output is requested the JB_ja_ad_request_list has the same structure as the JB_jid_request_list and JB_ja_ad_predecessor_list has the same structure as the JB_jid_predecessor_list

$ qstat -j C -xml
<...>
      <JB_jid_request_list>
        <element>
          <JRE_job_number>0</JRE_job_number>
          <JRE_job_name>A</JRE_job_name>
        </element>
      </JB_jid_request_list>
      <JB_jid_predecessor_list>
        <element>
          <JRE_job_number>101</JRE_job_number>
        </element>
        <element>
          <JRE_job_number>102</JRE_job_number>
        </element>
      </JB_jid_predecessor_list>
      <JB_ja_ad_request_list>
        <element>
          <JRE_job_number>0</JRE_job_number>
          <JRE_job_name>B</JRE_job_name>
        </element>
      </JB_ja_ad_request_list>
      <JB_ja_ad_predecessor_list>
        <element>
          <JRE_job_number>103</JRE_job_number>
        </element>
        <element>
          <JRE_job_number>104</JRE_job_number>
        </element>
      </JB_ja_ad_predecessor_list>
</...>

Also, ja_ad_sucessor_list is analogous to the jid_sucessor_list

$ qstat -j 102
...
jid_sucessor_list:  105
...
$ qstat -j 103
...
ja_ad_sucessor_list:  105
...

And the JB_ja_ad_sucessor_list element has the same structure as the JB_jid_sucessor_list element.

$ qstat -j 102 -xml
...
      <JB_jid_sucessor_list>
        <element>
          <JRE_job_number>105</JRE_job_number>
        </element>
      </JB_jid_sucessor_list>
...
$ qstat -j 103 -xml
...
      <JB_ja_ad_sucessor_list>
        <element>
          <JRE_job_number>105</JRE_job_number>
        </element>
      </JB_ja_ad_sucessor_list>
...

Internal data structures

Additions to sge_jobL.h

enum { MINUS_H_TGT_JA_AD };
SGE_LIST(JB_ja_ad_request_list, JRE_Type, CULL_DEFAULT | CULL_SPOOL)
SGE_LIST(JB_ja_ad_predecessor_list, JRE_Type, CULL_DEFAULT | CULL_SPOOL)
SGE_LIST(JB_ja_ad_sucessor_list, JRE_Type, CULL_DEFAULT)
SGE_LIST(JB_ja_a_h_ids, RN_Type, CULL_DEFAULT | CULL_SPOOL)

the CULL_SPOOL attribute controls which data fields of a JB_Type are persistent. In addition it indicates when spooling operations are needed:

needed when a request changes JB_ja_ad_request_list and JB_ja_ad_predecessor_list was resolved
not needed when JB_ja_ad_predecessor_list/JB_ja_ad_sucessor_list changes in the process of regular job processing

Functional Definition

Documentation

Reference documentation (man pages) must be enhanced accordingly to cover all interface changes.