workflows.
In the second approach, the dependences
between tasks are described as parent-child pairs.
A child task must wait until all of its parent tasks
finish before starting. The workflow above can be
described in this approach as follows:
PARENT Task1 CHILD Task2, Task3
PARENT Task2, Task3 CHILD Task4
The majority of scientific workflow systems use this
approach for describing dependences. Typical examples
are JDL (Job Description Language) (E. Laure et al.,
2006), used by gLite (gLite, 2011); DAGMan (J. Frey,
2002) in Condor (Condor project, 2011); SCUFL (Kostas
et al., 2004) in Taverna (D. Hull et al., 2006); and
Pegasus (K. Lee et al., 2008). The main advantage of
this approach is that it can describe more complex
workflows than the first approach. The dependences can
be visualized as directed acyclic graphs (DAGs), where
tasks are represented by the nodes of the graph and the
directed edges show the parent-child relationships.
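As a minimal illustration (not taken from any of the cited systems), the parent-child pairs above can be read as directed edges of a DAG: each (parent, child) pair becomes one edge, and a child is ready to start once all of its parents have finished.

```python
from collections import defaultdict

# The two PARENT ... CHILD ... lines from the example, as (parent, child) pairs.
pairs = [
    ("Task1", "Task2"), ("Task1", "Task3"),
    ("Task2", "Task4"), ("Task3", "Task4"),
]

children = defaultdict(list)   # parent -> list of children
parents = defaultdict(list)    # child  -> list of parents
for p, c in pairs:
    children[p].append(c)
    parents[c].append(p)

# Task4 may start only after both Task2 and Task3 finish.
print(parents["Task4"])  # ['Task2', 'Task3']
```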
3 PROGRAMMABLE WORKFLOW DESCRIPTION
In this section, we describe our approach to
describing workflows. We start with the basic ideas
and gradually move to more complex cases.
3.1 Basic Ideas
Therefore, we use a simpler way to describe
workflows:
- A task in a workflow is described by a triple: its
code, a set of input data, and a set of output data.
- A workflow is described as a sequence of tasks.
- Dependence and parallelism among tasks are
implicitly defined by the input/output data of the tasks.
An example of a workflow follows:
My_workflow(input, result)
Task(code1, input, data1)
Task(code2, data1, data2)
Task(code3, data1, data3)
Task(code4, [data2, data3], result)
In the code above, input and result are the lists of
input and output data of the workflow. The items in
the lists are usually the names of files containing the
corresponding data. The first task uses the code in file
code1 to process data from input and produces
data stored in the files of data1. Similarly, the second
and third tasks use data1 as input and generate data2
and data3, respectively. Finally, the last task uses data
from data2 and data3 to create result, which is also
the output of the whole workflow.
As the example above shows, we only
describe the tasks, not the dependences among them.
The dependences are implicitly defined by the
input/output data of the tasks. For example, the second
task uses data produced by the first task, so it must
wait until the first task finishes.
We can prove the equivalence of workflows in
our approach and workflows described by directed
acyclic graphs with the following statements:
- Every workflow represented as a DAG can be
described in our approach.
- Every workflow described in our approach can be
converted to a DAG with linear complexity O(N).
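The O(N) conversion can be sketched as a single pass over the task sequence: remember which task produces each data item, and add an edge from that producer to every later task that consumes the item. The sketch below uses hypothetical names and represents tasks as the (code, inputs, outputs) triples described above; it is an illustration, not the tool's actual implementation.

```python
def workflow_to_dag(tasks):
    """tasks: list of (code, inputs, outputs) triples, in sequence order.
    Returns the set of (parent_index, child_index) DAG edges."""
    producer = {}   # data item -> index of the task that produces it
    edges = set()
    for i, (code, inputs, outputs) in enumerate(tasks):
        for d in inputs:
            if d in producer:          # produced by an earlier task
                edges.add((producer[d], i))
        for d in outputs:
            producer[d] = i
    return edges

# The My_workflow example from Section 3.1:
my_workflow = [
    ("code1", ["input"], ["data1"]),
    ("code2", ["data1"], ["data2"]),
    ("code3", ["data1"], ["data3"]),
    ("code4", ["data2", "data3"], ["result"]),
]
print(sorted(workflow_to_dag(my_workflow)))
# [(0, 1), (0, 2), (1, 3), (2, 3)]
```

Each task and each data item is visited once, which gives the linear complexity claimed above.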
So, in our approach, we can omit the part describing
the dependences between tasks and save the cost of
creating workflows. However, the main advantages of
our approach are presented in the following sections.
3.2 Nested Workflows
In the example above, the workflow has the same
structure as the task: the code (the body of the
workflow), input and output data. Therefore, we can
define sub-workflows and use them in the way like
tasks.
My_sub_workflow(input,output)
Task(code5, input, data1)
....
My_workflow(input, result)
Task(code1, input, data1)
Workflow(My_sub_workflow, data1, data2)
Task(code3, data1, data3)
Task(code4, [data2, data3], result)
In the example above, the second task is replaced by
a sub-workflow. The syntax is similar to calling
subprograms/functions in a high-level programming
language: the sub-workflow command in the main
workflow replaces the formal input/output
parameters of the sub-workflow with the actual data
of the main workflow.
Nested workflows are very useful for defining
workflows with repeated patterns (a group of tasks
performing the same actions on different input/output
data). Like functions/subprograms in classical
languages, they also provide abstractions of task
groups and make workflows more readable.
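The parameter substitution described above can be sketched as follows. This is a hypothetical representation, not the tool's actual code: the sub-workflow body is a list of (code, inputs, outputs) triples, and expansion replaces the formal parameter names with the actual data names used at the call site.

```python
def expand(body, formals, actuals):
    """body: list of (code, inputs, outputs) triples;
    formals/actuals: corresponding lists of data names."""
    mapping = dict(zip(formals, actuals))
    rename = lambda names: [mapping.get(n, n) for n in names]
    return [(code, rename(ins), rename(outs)) for code, ins, outs in body]

# My_sub_workflow(input, output), with its first task as in the example.
sub_body = [("code5", ["input"], ["output"])]

# Workflow(My_sub_workflow, data1, data2) expands to:
print(expand(sub_body, ["input", "output"], ["data1", "data2"]))
# [('code5', ['data1'], ['data2'])]
```

A full implementation would also rename the sub-workflow's internal data items to avoid clashes with data names in the main workflow; that detail is omitted here.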
WORKFLOW COMPOSITION AND DESCRIPTION TOOL