Using Cloud-Aware Provenance to Reproduce Scientific Workflow Execution on Cloud

Khawar Hasham, Kamran Munir, Richard McClatchey

Abstract

Provenance has been regarded as a mechanism for verifying a workflow and for providing workflow reproducibility. Provenance of scientific workflows has been captured effectively in Grid-based scientific workflow systems. However, the recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches, or to propose new ones, for collecting provenance information from the Cloud and using it for workflow reproducibility on the Cloud infrastructure. This paper presents a novel approach that can assist in mitigating this challenge. The approach collects Cloud infrastructure information along with workflow provenance and establishes a mapping between them to provide Cloud-aware provenance. Workflow execution is reproduced by: (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them, and (c) comparing the outputs of the workflows. The evaluation of the prototype suggests that the proposed approach is feasible and can be investigated further. Since no reference model for workflow reproducibility on the Cloud exists in the literature, this paper also attempts to present a model that is used in the proposed design to achieve workflow reproducibility in the Cloud environment.
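The three steps (a) to (c) can be illustrated with a minimal Python sketch. It is not the authors' prototype: the names `VMConfig`, `JobRecord`, `build_cloud_aware_provenance`, `reprovision_plan` and `outputs_match` are hypothetical, the virtual machine fields are assumed for illustration, and comparing outputs by SHA-256 digest is one plausible choice rather than the method the paper confirms.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class VMConfig:
    """Assumed subset of a virtual machine configuration."""
    flavor: str
    vcpus: int
    ram_mb: int
    disk_gb: int
    image: str


@dataclass(frozen=True)
class JobRecord:
    """Assumed workflow provenance entry: a job, its host and its output."""
    job_id: str
    host: str
    output: bytes


def build_cloud_aware_provenance(jobs, vms_by_host):
    """Step (a): map each job to the configuration of the VM it ran on."""
    return {job.job_id: vms_by_host[job.host] for job in jobs}


def reprovision_plan(provenance):
    """Step (b): list the distinct VM configurations to provision again."""
    return sorted(set(provenance.values()), key=lambda c: (c.flavor, c.image))


def digest(data: bytes) -> str:
    """Hash an output so that large files can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


def outputs_match(original, rerun):
    """Step (c): compare per-job output digests of the two executions."""
    before = {job.job_id: digest(job.output) for job in original}
    after = {job.job_id: digest(job.output) for job in rerun}
    return before == after
```

In this sketch the mapping is keyed on the host name recorded for each job, which assumes that the workflow system logs the host and that the Cloud layer can report the configuration of the virtual machine behind it.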



Paper Citation


in Harvard Style

Hasham K., Munir K. and McClatchey R. (2015). Using Cloud-Aware Provenance to Reproduce Scientific Workflow Execution on Cloud. In Proceedings of the 5th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-758-104-5, pages 49-59. DOI: 10.5220/0005452800490059


in Bibtex Style

@conference{closer15,
author={Khawar Hasham and Kamran Munir and Richard McClatchey},
title={Using Cloud-Aware Provenance to Reproduce Scientific Workflow Execution on Cloud},
booktitle={Proceedings of the 5th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2015},
pages={49-59},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005452800490059},
isbn={978-989-758-104-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - Using Cloud-Aware Provenance to Reproduce Scientific Workflow Execution on Cloud
SN - 978-989-758-104-5
AU - Hasham K.
AU - Munir K.
AU - McClatchey R.
PY - 2015
SP - 49
EP - 59
DO - 10.5220/0005452800490059