A Cloud-based GWAS Analysis Pipeline for Clinical Researchers

Paul Heinzlreiter, James Richard Perkins, Óscar Torreño, Johan Karlsson, Juan Antonio Ranea, Andreas Mitterecker, Miguel Blanca, Oswaldo Trelles

2014

Abstract

The cost of obtaining genome-scale biomedical data continues to drop rapidly, and many hospitals and universities are now able to produce large amounts of data. Managing and analysing these ever-growing datasets is becoming a crucial issue. Cloud computing offers a good solution to this problem because of its flexibility in obtaining computational resources. However, it is essential that end-users with no experience of cloud computing can also take advantage of its model of elastic resource provisioning. This paper presents a workflow that allows the end-user to perform the core steps of a genome-wide association analysis, in which raw gene-expression data is quality assessed. A number of steps in this process are computationally intensive, and their cost varies greatly with the size of the study, from a few samples to a few thousand. Cloud computing is therefore well suited to this problem, since elastic resource provisioning enables scalability. The key contributions of this paper are a real-world application of cloud computing that addresses a critical problem in biomedicine by parallelizing the appropriate parts of the workflow, and an approach that takes care of the computational aspects so that the end-user can concentrate on data analysis and biological interpretation of results.


Paper Citation


in Harvard Style

Heinzlreiter P., Perkins J., Torreño Ó., Karlsson J., Ranea J., Mitterecker A., Blanca M. and Trelles O. (2014). A Cloud-based GWAS Analysis Pipeline for Clinical Researchers. In Proceedings of the 4th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-758-019-2, pages 387-394. DOI: 10.5220/0004802103870394


in Bibtex Style

@conference{closer14,
author={Paul Heinzlreiter and James Richard Perkins and Óscar Torreño and Johan Karlsson and Juan Antonio Ranea and Andreas Mitterecker and Miguel Blanca and Oswaldo Trelles},
title={A Cloud-based GWAS Analysis Pipeline for Clinical Researchers},
booktitle={Proceedings of the 4th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2014},
pages={387-394},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004802103870394},
isbn={978-989-758-019-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 4th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - A Cloud-based GWAS Analysis Pipeline for Clinical Researchers
SN - 978-989-758-019-2
AU - Heinzlreiter P.
AU - Perkins J.
AU - Torreño Ó.
AU - Karlsson J.
AU - Ranea J.
AU - Mitterecker A.
AU - Blanca M.
AU - Trelles O.
PY - 2014
SP - 387
EP - 394
DO - 10.5220/0004802103870394