INTRODUCING DATA PROVENANCE AND ERROR HANDLING FOR NGS WORKFLOWS WITHIN THE MOLGENIS COMPUTATIONAL FRAMEWORK

H. V. Byelas, M. Dijkstra, M. A. Swertz

Abstract

Running bioinformatics analyses in a distributed computational environment and monitoring their execution has become a major challenge due to the size of the data and the complexity of analysis workflows. Earlier attempts combined computation and data management in a single solution using the MOLGENIS software generator. However, it was not clear how to explicitly specify output data for a particular study, evaluate its quality, or repeat the analysis depending on the results. We present a new version of the MOLGENIS computational framework for bioinformatics, which reflects lessons learnt and new requirements from end users. We have improved our initial solution in two ways. First, we propose a new data model that describes a workflow as a graph in a relational database, where nodes are analysis operations and edges are transactions between them. Inputs and outputs of the workflow nodes are explicitly specified. Second, we have extended the execution logic to trace data, show how final results were created, and handle errors in the distributed environment. We illustrate applications of the system on several analysis workflows for next-generation sequencing.
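
The data model described in the abstract can be illustrated with a minimal sketch. It is not the MOLGENIS schema: the table names (`operation`, `edge`, `operation_io`), the example workflow steps and file names, and the function `trace_provenance` are all hypothetical, chosen only to show how a workflow stored as a graph in a relational database, with explicitly declared inputs and outputs, allows a final result to be traced back to the operations that produced it. SQLite stands in for the relational database.

```python
import sqlite3

# Hypothetical schema: nodes are analysis operations, edges connect them,
# and every operation declares its inputs and outputs explicitly.
SCHEMA = """
CREATE TABLE operation (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE edge (
    src INTEGER NOT NULL REFERENCES operation(id),
    dst INTEGER NOT NULL REFERENCES operation(id)
);
CREATE TABLE operation_io (
    operation_id INTEGER NOT NULL REFERENCES operation(id),
    data_name    TEXT NOT NULL,
    direction    TEXT NOT NULL CHECK (direction IN ('input', 'output'))
);
"""


def build_example_workflow():
    """Create an in-memory database holding a small, made-up NGS workflow."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO operation (id, name) VALUES (?, ?)",
        [(1, "align"), (2, "sort"), (3, "call_variants"), (4, "quality_report")],
    )
    # align -> sort -> call_variants, and align -> quality_report
    conn.executemany("INSERT INTO edge (src, dst) VALUES (?, ?)",
                     [(1, 2), (2, 3), (1, 4)])
    conn.executemany(
        "INSERT INTO operation_io (operation_id, data_name, direction) VALUES (?, ?, ?)",
        [
            (1, "reads.fastq", "input"), (1, "aligned.sam", "output"),
            (2, "aligned.sam", "input"), (2, "sorted.bam", "output"),
            (3, "sorted.bam", "input"), (3, "variants.vcf", "output"),
            (4, "aligned.sam", "input"), (4, "report.txt", "output"),
        ],
    )
    return conn


def trace_provenance(conn, data_name):
    """Return the names of all operations that contributed to `data_name`.

    Starts at the operation that declares `data_name` as an output and walks
    the edges backwards with a recursive query. Results are ordered by
    operation id, which matches execution order only in this example.
    """
    rows = conn.execute(
        """
        WITH RECURSIVE upstream(id) AS (
            SELECT operation_id FROM operation_io
            WHERE data_name = ? AND direction = 'output'
            UNION
            SELECT edge.src FROM edge JOIN upstream ON edge.dst = upstream.id
        )
        SELECT operation.name FROM operation
        JOIN upstream ON operation.id = upstream.id
        ORDER BY operation.id
        """,
        (data_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `trace_provenance(build_example_workflow(), "variants.vcf")` walks back from the variant-calling step through sorting to alignment, while the quality report, which sits on a separate branch, is left out. The same backward walk could in principle identify which operations to repeat after a failure, although the abstract does not specify how the framework does this.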



Paper Citation


in Harvard Style

Byelas, H. V., Dijkstra, M. and Swertz, M. A. (2012). INTRODUCING DATA PROVENANCE AND ERROR HANDLING FOR NGS WORKFLOWS WITHIN THE MOLGENIS COMPUTATIONAL FRAMEWORK. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012) ISBN 978-989-8425-90-4, pages 42-50. DOI: 10.5220/0003738900420050


in Bibtex Style

@conference{bioinformatics12,
author={H. V. Byelas and M. Dijkstra and M. A. Swertz},
title={INTRODUCING DATA PROVENANCE AND ERROR HANDLING FOR NGS WORKFLOWS WITHIN THE MOLGENIS COMPUTATIONAL FRAMEWORK},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)},
year={2012},
pages={42-50},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003738900420050},
isbn={978-989-8425-90-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)
TI - INTRODUCING DATA PROVENANCE AND ERROR HANDLING FOR NGS WORKFLOWS WITHIN THE MOLGENIS COMPUTATIONAL FRAMEWORK
SN - 978-989-8425-90-4
AU - Byelas, H. V.
AU - Dijkstra, M.
AU - Swertz, M. A.
PY - 2012
SP - 42
EP - 50
DO - 10.5220/0003738900420050