Genetic Mapping of Diseases through Big Data Techniques

Julio Cesar Santos dos Anjos, Bruno Reckziegel Filho, Junior F. Barros, Raffael B. Schemmer, Claudio Geyer, Ursula Matte


Advances in sequencing machines and DNA techniques have driven progress in medical genetics research. However, sequencers produce large amounts of data, so new methods and programs are needed to analyze it quickly and efficiently. MapReduce is a data-intensive computing model that handles large data volumes and is easy to program through two basic functions (Map and Reduce). This work introduces GMS, a genetic mapping system that can assist doctors in the clinical diagnosis of patients by analyzing the genetic mutations contained in their DNA. The model offers a scalable way to analyze the data generated by sequencers. Using several medical databases at the same time makes it possible to determine susceptibilities to diseases through big data analysis mechanisms. The results show that the system scales and can suggest a possible diagnosis, giving health professionals a tool to support genetic diagnosis.
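
As an illustration of the Map and Reduce functions mentioned in the abstract, the following is a minimal single-process sketch in Python. It is not the GMS implementation: the record layout, the `KNOWN_MUTATIONS` table, and the function names are all hypothetical, and the coordinates and condition labels are invented placeholders rather than real annotations. The map step looks up each patient variant in a small annotation table, and the reduce step collects the distinct matches per patient.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical variant record: (patient_id, chromosome, position, ref, alt).
Variant = Tuple[str, str, int, str, str]

# Hypothetical annotation table standing in for a medical database.
# Coordinates and labels are invented placeholders.
KNOWN_MUTATIONS: Dict[Tuple[str, int, str, str], str] = {
    ("chr1", 1000, "A", "G"): "condition-X",
    ("chr2", 2000, "C", "T"): "condition-Y",
}


def map_variant(record: Variant) -> Iterator[Tuple[str, str]]:
    """Map: emit (patient_id, condition) when the variant is in the table."""
    patient_id, chrom, pos, ref, alt = record
    key = (chrom, pos, ref, alt)
    if key in KNOWN_MUTATIONS:
        yield patient_id, KNOWN_MUTATIONS[key]


def reduce_patient(patient_id: str, conditions: List[str]) -> Tuple[str, List[str]]:
    """Reduce: collapse all matches for one patient into a sorted, distinct list."""
    return patient_id, sorted(set(conditions))


def run_mapreduce(records: Iterable[Variant]) -> Dict[str, List[str]]:
    """Run map, group by key (the shuffle step), then reduce, all in one process."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for key, value in map_variant(record):
            groups[key].append(value)
    return dict(reduce_patient(k, v) for k, v in groups.items())


if __name__ == "__main__":
    sample = [
        ("p1", "chr1", 1000, "A", "G"),
        ("p1", "chr2", 2000, "C", "T"),
        ("p2", "chr1", 1000, "A", "G"),
        ("p2", "chr3", 5, "G", "A"),  # not in the table, so it is ignored
    ]
    print(run_mapreduce(sample))
```

In a framework such as Hadoop, the grouping done here by `run_mapreduce` would be handled by the runtime across many machines, which is where the scalability described in the abstract would come from.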


Paper Citation

in Harvard Style

Santos dos Anjos J., Reckziegel Filho B., Barros J. F., Schemmer R. B., Geyer C. and Matte U. (2015). Genetic Mapping of Diseases through Big Data Techniques. In Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-096-3, pages 279-286. DOI: 10.5220/0005365402790286

in Bibtex Style

author={Julio Cesar Santos dos Anjos and Bruno Reckziegel Filho and Junior F. Barros and Raffael B. Schemmer and Claudio Geyer and Ursula Matte},
title={Genetic Mapping of Diseases through Big Data Techniques},
booktitle={Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS},

in EndNote Style

JO - Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Genetic Mapping of Diseases through Big Data Techniques
SN - 978-989-758-096-3
AU - Santos dos Anjos J.
AU - Reckziegel Filho B.
AU - Barros J. F.
AU - Schemmer R. B.
AU - Geyer C.
AU - Matte U.
PY - 2015
SP - 279
EP - 286
DO - 10.5220/0005365402790286