MASon: MILLION ALIGNMENTS IN SECONDS - A Platform Independent Pairwise Sequence Alignment Library for Next Generation Sequencing Data

Philipp Rescheneder, Arndt von Haeseler, Fritz J. Sedlazeck

Abstract

The advent of Next Generation Sequencing (NGS) technologies and the increase in read length and number of reads per run pose a computational challenge to bioinformatics. The demand for sensitive, inexpensive, and fast methods to align reads to a reference genome is constantly increasing. Because of its high sensitivity, the Smith-Waterman (SW) algorithm is best suited to this task. However, its high demand for computational resources makes it impractical. Here we present an optimal SW implementation for NGS data and demonstrate the advantages of using common and inexpensive high-performance architectures to improve the computing time of NGS applications. We implemented a C++ library (MASon) that exploits graphics cards (CUDA, OpenCL) and CPU vector instructions (SSE, OpenCL) to efficiently handle millions of short local pairwise sequence alignments (36 bp to 1,000 bp). All libraries can be easily integrated into existing and upcoming NGS applications and allow programmers to optimally utilize modern hardware, ranging from desktop computers to high-end clusters.
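
For readers unfamiliar with the Smith-Waterman recurrence, the following is a minimal scalar C++ sketch of the local alignment score computation. It is not MASon's API or its implementation: the names `Scoring`, `smithWatermanScore`, and `alignBatch`, the linear gap model, and the scoring values are assumptions chosen for illustration only. It shows the per-pair work that a library of this kind would parallelize with SSE, CUDA, or OpenCL.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scoring parameters (illustrative values, not taken from MASon).
struct Scoring {
    int match = 2;
    int mismatch = -1;
    int gap = -2;  // linear gap penalty, assumed for simplicity
};

// Returns the optimal Smith-Waterman local alignment score of read vs. ref.
// Only two rows of the dynamic programming matrix are kept in memory.
int smithWatermanScore(const std::string& read, const std::string& ref,
                       const Scoring& s = Scoring()) {
    std::vector<int> prev(ref.size() + 1, 0), curr(ref.size() + 1, 0);
    int best = 0;
    for (std::size_t i = 1; i <= read.size(); ++i) {
        curr[0] = 0;
        for (std::size_t j = 1; j <= ref.size(); ++j) {
            const int diag = prev[j - 1] +
                (read[i - 1] == ref[j - 1] ? s.match : s.mismatch);
            const int up = prev[j] + s.gap;        // gap in the reference
            const int left = curr[j - 1] + s.gap;  // gap in the read
            // Local alignment: a cell never drops below zero.
            curr[j] = std::max({0, diag, up, left});
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}

// Scores a batch of (read, reference window) pairs. Each pair is independent,
// which is the property that vector units and GPUs can exploit.
std::vector<int> alignBatch(
        const std::vector<std::pair<std::string, std::string>>& pairs) {
    std::vector<int> scores;
    scores.reserve(pairs.size());
    for (const auto& p : pairs) {
        scores.push_back(smithWatermanScore(p.first, p.second));
    }
    return scores;
}
```

Under these assumed parameters, calling `smithWatermanScore("ACGT", "ACGT")` scores four matches. The cost per pair grows with the product of the two sequence lengths, which is why handling millions of pairs benefits from data-parallel hardware.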

Paper Citation


in Harvard Style

Rescheneder P., von Haeseler A. and Sedlazeck F. J. (2012). MASon: MILLION ALIGNMENTS IN SECONDS - A Platform Independent Pairwise Sequence Alignment Library for Next Generation Sequencing Data. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012) ISBN 978-989-8425-90-4, pages 195-201. DOI: 10.5220/0003775701950201


in Bibtex Style

@conference{bioinformatics12,
author={Philipp Rescheneder and Arndt von Haeseler and Fritz J. Sedlazeck},
title={MASon: MILLION ALIGNMENTS IN SECONDS - A Platform Independent Pairwise Sequence Alignment Library for Next Generation Sequencing Data},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)},
year={2012},
pages={195-201},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003775701950201},
isbn={978-989-8425-90-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)
TI - MASon: MILLION ALIGNMENTS IN SECONDS - A Platform Independent Pairwise Sequence Alignment Library for Next Generation Sequencing Data
SN - 978-989-8425-90-4
AU - Rescheneder P.
AU - von Haeseler A.
AU - Sedlazeck F. J.
PY - 2012
SP - 195
EP - 201
DO - 10.5220/0003775701950201