A CLOUD ARCHITECTURE FOR BIOINFORMATICS WORKFLOWS

Hugo Saldanha, Edward Ribeiro, Maristela Holanda, Aleteia Araujo, Genaina Rodrigues, Maria Emilia Walter, João Carlos Setubal, Alberto Dávila

2011

Abstract

Cloud computing has emerged as a promising platform for large-scale, data-intensive scientific research, i.e., processing tasks that use hundreds of hours of CPU time and petabytes of data storage. Although cloud-based processing is an active research topic, existing efforts rely mainly on MapReduce. This article describes the BioNimbus project, which aims to define an architecture and create a framework for the easy and flexible integration and distributed execution of bioinformatics tools in a cloud environment, without being tied to the MapReduce paradigm. As a result, we leverage cloud elasticity and fault tolerance while significantly improving the storage capacity and execution time of bioinformatics tasks, mainly those of large-scale genome sequencing projects.



Paper Citation


in Harvard Style

Saldanha H., Ribeiro E., Holanda M., Araujo A., Rodrigues G., Walter M. E., Setubal J. C. and Dávila A. (2011). A CLOUD ARCHITECTURE FOR BIOINFORMATICS WORKFLOWS. In Proceedings of the 1st International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-8425-52-2, pages 477-483. DOI: 10.5220/0003394004770483


in Bibtex Style

@conference{closer11,
author={Hugo Saldanha and Edward Ribeiro and Maristela Holanda and Aleteia Araujo and Genaina Rodrigues and Maria Emilia Walter and João Carlos Setubal and Alberto Dávila},
title={A CLOUD ARCHITECTURE FOR BIOINFORMATICS WORKFLOWS},
booktitle={Proceedings of the 1st International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2011},
pages={477--483},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003394004770483},
isbn={978-989-8425-52-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - A CLOUD ARCHITECTURE FOR BIOINFORMATICS WORKFLOWS
SN - 978-989-8425-52-2
AU - Saldanha H.
AU - Ribeiro E.
AU - Holanda M.
AU - Araujo A.
AU - Rodrigues G.
AU - Walter M. E.
AU - Setubal J. C.
AU - Dávila A.
PY - 2011
SP - 477
EP - 483
DO - 10.5220/0003394004770483

