Swertz, 2011), we aimed to combine computational
and data management in a single system. However,
several important functionalities were left out of the
initial solution. These included easy tracing of the
data produced and run-time error handling during
workflow execution. By error handling, we mean scenarios that can be applied if some data is missing or the quality indicators for the results are low.
In the new version of the MCF, our main goal is
to help users understand how the complex analyses were accomplished and which computational processes
were used. In this paper, we describe the design and
implementation challenges of specifying data provenance and handling errors during workflow execution. In particular, we present our solution for the
NGS workflows used in the GoNL project.
This paper is structured as follows. Section 2 re-
views related work in the context of data provenance
and error handling in other workflow management
systems. Section 3 describes the new model in detail. Section 4 gives a high-level overview of the system design. Section 5 details the new functional logic of the system for error handling in generic pipelines and, in particular, in the NGS pipeline, and gives examples of generated user interfaces. Section 6 discusses our
experience with using the system in practice. Section
7 concludes the paper.
2 RELATED WORK
Extensive overviews of data provenance approaches
and techniques are given in (Glavic and Dittrich,
2007) and (Simmhan and Gannon, 2005). We did
not set out to develop a new theoretical model for
data provenance; rather, we are interested in using a
lightweight data provenance approach for the specific
bioinformatics domain. In our scenario, several indi-
vidual researchers involved in the same project would
like to collaborate on analysing data. Here, data
sources, intermediate and final analysis results and
computational processes are often shared between re-
searchers to speed up the analysis. Consequently, data- and process-oriented provenance should be combined
in one solution. Without proper data annotation, the
analysis results can easily be overwritten or dupli-
cated when the analysis is re-run on the same data
with other parameters, in other execution settings or
just at another time. We are interested in methods to
avoid such situations.
Yu and Buyya present a taxonomy of workflow management systems (Yu and Buyya, 2005).
Data provenance is modelled and implemented in var-
ious ways in different data warehouses and workflow
management systems. In the Taverna 2.0 workflow
system (Oinn and Greenwood, 2005), the semantics
of workflows is modelled using so-called traces (Sroka and Goble, 2010), which record sequences of events. These events can be of three types: input events, representing values on input ports; atomic executions; and output events, representing values on output ports. The
model is implemented in the system using a file-based database. Taverna can remember workflow runs and saves the results to the file system only after a workflow has been run with different inputs. Users can switch the data provenance options off, which can improve performance and reduce disk-space usage. By default, Taverna stores the input values, intermediate values and results of workflow runs in memory; when Taverna is closed, these values are lost. In-memory storage can also be switched off for workflows that pass large data.
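As an illustration of such an event-based trace, the following is a minimal sketch of our own in Python; the class and field names are hypothetical and do not reflect Taverna's internal API:

# Sketch (not Taverna's API): an event-based trace recording inputs,
# atomic executions and outputs of a single workflow run.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class TraceEvent:
    kind: str            # "input", "execution" or "output"
    name: str            # port or processor name
    value: Any = None    # value on the port, if any

@dataclass
class WorkflowTrace:
    run_id: str
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, kind: str, name: str, value: Any = None) -> None:
        self.events.append(TraceEvent(kind, name, value))

# Example: one alignment step of a hypothetical run
trace = WorkflowTrace(run_id="run-42")
trace.record("input", "reads", "sample1.fastq")
trace.record("execution", "align_reads")
trace.record("output", "alignment", "sample1.bam")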
In Kepler (Altintas and Berkley, 2004), ordered trees are used to represent the data products of workflows (Anand and McPhillips, 2009). These trees are stored in trace files in XML format. Kepler allows browsing and navigating the history of execution traces by querying these trace files, although such queries can become large and complex before they produce scientifically meaningful results. Kepler also enables outputs of one run to be used as inputs of another.
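For illustration only, a data product tree serialised to an XML trace file might look like the sketch below; the element names are our own and do not correspond to Kepler's actual trace schema:

# Sketch (not Kepler's trace format): data products of a run represented
# as an ordered tree and serialised to an XML trace file.
import xml.etree.ElementTree as ET

run = ET.Element("run", id="run-42")
step = ET.SubElement(run, "actor", name="align_reads")
ET.SubElement(step, "product", name="alignment").text = "sample1.bam"

ET.ElementTree(run).write("trace_run-42.xml")

# A later query over such trace files could, for example, collect all
# products named "alignment" across runs.
alignments = [p.text for p in run.iter("product") if p.get("name") == "alignment"]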
The bioinformatics-specific management system Galaxy (Blankenberg and Taylor, 2007) tracks metadata to ensure reproducibility of analyses; however, this is not sufficient to capture the intent of an analysis. Galaxy is not really integrated with any data management system: all the results produced by all analysis runs are saved in disk storage, which considerably increases the storage requirements for large analyses. Furthermore, Galaxy treats a workflow as a black box, and if errors occur during execution, they are only reported in the end result of the analysis.
Considering the features of the workflow systems we are aware of, data provenance is present to some extent in all of them; however, automatic error handling is missing. Adding error handling is especially valuable for computationally intensive analyses, where re-running an individual analysis operation ad hoc, instead of re-running the whole workflow later, saves considerable time and effort. It can be difficult to find good quality indicators of the successful completion of operations. These indicators should be present in the model to specify recovery scenarios.
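As an illustration only (not the actual MCF implementation; file names, indicators and thresholds here are hypothetical), such a recovery check for a single NGS alignment step could look like this:

# Sketch: check quality indicators of a finished operation and decide
# whether that single step must be re-run, instead of the whole workflow.
import os

def step_needs_rerun(output_file: str, mapped_fraction: float,
                     min_mapped: float = 0.9) -> bool:
    # A missing or empty output file is an obvious failure indicator.
    if not os.path.exists(output_file) or os.path.getsize(output_file) == 0:
        return True
    # A low quality indicator (e.g. fraction of mapped reads) also
    # triggers a recovery scenario for this step only.
    return mapped_fraction < min_mapped

if step_needs_rerun("sample1.bam", mapped_fraction=0.75):
    print("re-submitting alignment step for sample1")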
Comparing our developments to the above workflow systems, we aimed to create a specific solution for a particular bioinformatics analysis
(i.e. NGS workflows). However, we want to introduce
more advanced error handling into the system. In our