Hadoop-RINS - A Hadoop Accelerated Pipeline for Rapid Nonhuman Sequence Identification

Li Jiangyu; Liu Yang; Wang Xiaolei; Mao Yiqing; Wang Yumin; Zhao Dongsheng

doi:10.5220/0004239602960299

2013

Abstract

The volume of sequencing data has grown rapidly in recent years with the development of high-throughput sequencing technology, and parallel computing is an important way to speed up the processing of these data. RINS is a pipeline for identifying nonhuman sequences in deep sequencing datasets. It uses user-provided microbial reference genomes to reduce the number of reads to be processed and so improve processing speed. However, all of its steps run serially, so RINS slows down sharply as the sequencing data and reference genomes grow. In this article, we report a pipeline that processes sequencing data in parallel through Hadoop. In a runtime comparison on the same dataset, Hadoop-RINS is shown to be significantly faster than RINS while producing the same results.

Paper Citation

in Harvard Style

Jiangyu L., Yang L., Xiaolei W., Yiqing M., Yumin W. and Dongsheng Z. (2013). Hadoop-RINS - A Hadoop Accelerated Pipeline for Rapid Nonhuman Sequence Identification. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013) ISBN 978-989-8565-35-8, pages 296-299. DOI: 10.5220/0004239602960299

in Bibtex Style

@conference{bioinformatics13,
author={Li Jiangyu and Liu Yang and Wang Xiaolei and Mao Yiqing and Wang Yumin and Zhao Dongsheng},
title={Hadoop-RINS - A Hadoop Accelerated Pipeline for Rapid Nonhuman Sequence Identification},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)},
year={2013},
pages={296-299},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004239602960299},
isbn={978-989-8565-35-8},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)
TI - Hadoop-RINS - A Hadoop Accelerated Pipeline for Rapid Nonhuman Sequence Identification
SN - 978-989-8565-35-8
AU - Jiangyu L.
AU - Yang L.
AU - Xiaolei W.
AU - Yiqing M.
AU - Yumin W.
AU - Dongsheng Z.
PY - 2013
SP - 296
EP - 299
DO - 10.5220/0004239602960299