Memory Efficient de novo Assembly Algorithm using Disk Streaming of K-mers

Yuki Endo, Fubito Toyama, Chikafumi Chiba, Hiroshi Mori, Kenji Shoji

2016

Abstract

Sequencing the whole genome of various species has many applications, not only in understanding biological systems, but also in medicine, pharmacy, and agriculture. In recent years, the emergence of high-throughput next-generation sequencing technologies has dramatically reduced the time and cost of whole genome sequencing. These new technologies provide ultrahigh throughput at a lower per-unit data cost. However, the data are generated from very short fragments of DNA, so it is very important to develop algorithms for merging these fragments. Merging the fragments without using a reference dataset is called de novo assembly. Many algorithms for de novo assembly have been proposed in recent years. Velvet and SOAPdenovo2 are well-known assembly algorithms that perform well in terms of memory and time consumption. However, their memory consumption increases dramatically as the volume of input fragments grows. Therefore, it is necessary to develop an alternative algorithm with low memory usage. In this paper, we propose an algorithm for de novo assembly with lower memory usage. In the proposed method, the memory-efficient DSK (disk streaming of k-mers) algorithm is adopted to count k-mers. Moreover, the memory needed to construct the de Bruijn graph is reduced by not keeping edge information in the graph. In our experiment using human chromosome 14, the average maximum memory consumption of the proposed method was approximately 7.5–8.8% of that of the popular assemblers.
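
The two ideas named in the abstract, counting k-mers partition by partition on disk and storing only nodes in the de Bruijn graph, can be illustrated with a minimal sketch. It is not the authors' implementation and not the DSK tool itself. The function names, the toy hash, the partition count, and the abundance threshold `min_count` are all assumptions made for illustration. Reads are assumed to contain only A, C, G, and T, and reverse complements are ignored.

```python
import os
import tempfile
from collections import Counter


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def partition_of(kmer, n_parts):
    """Toy deterministic hash (an assumption, not the hash used by DSK)."""
    h = 0
    for ch in kmer:
        h = (h * 4 + "ACGT".index(ch)) % 1000003
    return h % n_parts


def count_kmers_on_disk(reads, k, n_parts=4, min_count=2):
    """Return the k-mers seen at least min_count times.

    K-mers are first streamed to one temporary file per partition, then
    each partition is counted on its own, so only one partition's table
    is held in memory at a time.
    """
    solid = set()
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part{p}.txt") for p in range(n_parts)]
        files = [open(path, "w") for path in paths]
        try:
            for read in reads:
                for km in kmers(read, k):
                    files[partition_of(km, n_parts)].write(km + "\n")
        finally:
            for f in files:
                f.close()
        for path in paths:
            with open(path) as f:
                counts = Counter(line.strip() for line in f)
            solid.update(km for km, c in counts.items() if c >= min_count)
    return solid


def successors(kmer, nodes):
    """Infer outgoing edges by membership queries instead of storing them."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in nodes]


if __name__ == "__main__":
    nodes = count_kmers_on_disk(["ACGTAC", "ACGTAC", "CGTACG", "GGCA"], k=3)
    print(sorted(nodes))
    print(successors("ACG", nodes))
```

Because the graph is only a set of k-mers, an edge from one node to another exists exactly when the shifted k-mer is also in the set, so no adjacency lists need to be kept. In this toy input the k-mers from the read `GGCA` occur once and are dropped by the threshold.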

References

  1. Bowe, A., Onodera, T., Sadakane, K., and Shibuya, T. (2012). Succinct de Bruijn graphs. In WABI, volume 7534 of Lecture Notes in Computer Science, pages 225-235. Springer.
  2. Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I. A., Belmonte, M. K., Lander, E. S., Nusbaum, C., and Jaffe, D. B. (2008). ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res., 18(5):810-820.
  3. Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J., Muller, W. E., Wetter, T., and Suhai, S. (2004). Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res., 14(6):1147-1159.
  4. Chikhi, R., Limasset, A., Jackman, S., Simpson, J., and Medvedev, P. (2014). On the representation of de Bruijn graphs. In RECOMB, volume 8394 of Lecture Notes in Computer Science, pages 35-55. Springer.
  5. Chikhi, R. and Rizk, G. (2012). Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In WABI, volume 7534 of Lecture Notes in Computer Science, pages 236-248. Springer.
  6. Conway, T. C. and Bromage, A. J. (2011). Succinct data structures for assembling large genomes. Bioinformatics, 27(4):479-486.
  7. Endo, Y., Toyama, F., Chiba, C., Mori, H., and Shoji, K. (2014). De Novo Short Read Assembly Algorithm with Low Memory Usage. In Proceedings of International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS2014), pages 215-220.
  8. Hernandez, D., Francois, P., Farinelli, L., Osteras, M., and Schrenzel, J. (2008). De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res., 18(5):802-809.
  9. Jeck, W. R., Reinhardt, J. A., Baltrus, D. A., Hickenbotham, M. T., Magrini, V., Mardis, E. R., Dangl, J. L., and Jones, C. D. (2007). Extending assembly of short DNA sequences to handle error. Bioinformatics, 23(21):2942-2944.
  10. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., Li, S., Yang, H., Wang, J., and Wang, J. (2010). De novo assembly of human genomes with massively parallel short read sequencing. Genome Res., 20(2):265-272.
  11. Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C., and Sutton, G. (2008). Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24):2818-2824.
  12. Rizk, G., Lavenier, D., and Chikhi, R. (2013). DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652-653.
  13. Salzberg, S. L., Phillippy, A. M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T. J., Schatz, M. C., Delcher, A. L., Roberts, M., et al. (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res., 22(3):557-567.
  14. Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., and Birol, I. (2009). ABySS: a parallel assembler for short read sequence data. Genome Res., 19(6):1117-1123.
  15. Warren, R. L., Sutton, G. G., Jones, S. J., and Holt, R. A. (2007). Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 23(4):500-501.
  16. Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18(5):821-829.

Paper Citation


in Harvard Style

Endo Y., Toyama F., Chiba C., Mori H. and Shoji K. (2016). Memory Efficient de novo Assembly Algorithm using Disk Streaming of K-mers. In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016) ISBN 978-989-758-170-0, pages 266-271. DOI: 10.5220/0005798302660271


in Bibtex Style

@conference{bioinformatics16,
author={Yuki Endo and Fubito Toyama and Chikafumi Chiba and Hiroshi Mori and Kenji Shoji},
title={Memory Efficient de novo Assembly Algorithm using Disk Streaming of K-mers},
booktitle={Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)},
year={2016},
pages={266-271},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005798302660271},
isbn={978-989-758-170-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)
TI - Memory Efficient de novo Assembly Algorithm using Disk Streaming of K-mers
SN - 978-989-758-170-0
AU - Endo Y.
AU - Toyama F.
AU - Chiba C.
AU - Mori H.
AU - Shoji K.
PY - 2016
SP - 266
EP - 271
DO - 10.5220/0005798302660271