Table 3: Runtime of four tests (SRR073726).
Test Number   Runtime (sec)
              RINS on single node   2-node cluster   4-node cluster   8-node cluster
test 1        6745                  1889             1338             1152
test 2        6655                  2059             1488             1222
test 3        6871                  1856             1437             1134
test 4        6903                  1797             1308             1215
avg.          6794                  1900             1393             1181
Table 4: Runtime of four tests (SRR073732).
Test Number   Runtime (sec)
              RINS on single node   2-node cluster   4-node cluster   8-node cluster
test 1        8572                  2571             1624             1392
test 2        8762                  2563             1722             1429
test 3        8889                  2409             1638             1317
test 4        8948                  2381             1584             1405
avg.          8793                  2481             1642             1386
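The averages in Tables 3 and 4 can be turned into speedup figures. The following is a minimal sketch, assuming speedup is defined as the average single-node RINS runtime divided by the average cluster runtime; the names `AVG_RUNTIME` and `speedups` are illustrative and do not come from Hadoop-RINS.

```python
# Average runtimes in seconds, copied from the "avg." rows of Tables 3 and 4.
AVG_RUNTIME = {
    "SRR073726": {"single": 6794, 2: 1900, 4: 1393, 8: 1181},
    "SRR073732": {"single": 8793, 2: 2481, 4: 1642, 8: 1386},
}


def speedups(avg):
    """Return {node_count: speedup relative to single-node RINS}.

    Assumption: speedup = single-node runtime / cluster runtime.
    """
    base = avg["single"]
    return {nodes: base / t for nodes, t in avg.items() if nodes != "single"}


if __name__ == "__main__":
    for dataset, avg in AVG_RUNTIME.items():
        for nodes, s in speedups(avg).items():
            print(f"{dataset} {nodes}-node cluster: {s:.2f}x")
```

Note that the baseline here is the original RINS on one node, not Hadoop-RINS on one node, so the ratio also reflects any differences between the two pipelines.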
Table 5 shows the runtime of each step in
Hadoop-RINS. The runtime of step 3 decreases
significantly as the number of computing nodes
increases. Because step 5 is partly parallelized, its
runtime also decreases as the number of nodes
increases. The runtime of file format
transformation and file segmentation changes little,
while the runtime of file distribution increases with
the number of computing nodes. The runtime of step 4
does not vary consistently with the number of nodes.
Table 5: Average runtime of each step in Hadoop-RINS (SRR073726/SRR073732).
Step     Runtime (sec)
         2 nodes      4 nodes      8 nodes
step 1   224/327      222/345      307/354
step 2   59/80        87/151       116/137
step 3   1368/1691    753/932      441/510
step 4   3/108        180/42       178/264
step 5   247/274      151/171      139/121
total    1901/2479    1393/1641    1181/1386
BLAT is the main bottleneck in RINS. Running BLAT
on divided data greatly reduces the runtime of
Hadoop-RINS. However, the runtimes on the
multi-node clusters show that the speedup does not
grow in proportion to the number of nodes. This is
because the runtime of step 1 and step 2 does not
decrease as nodes are added, so their share of the
total runtime grows with the number of nodes. For
the 8-node cluster, it reaches 35.8% (SRR073726).
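The 35.8% figure can be reproduced from Table 5. The sketch below assumes the share is computed as (step 1 + step 2) divided by the sum of all five steps, using the SRR073726 values; `STEP_RUNTIME` and `fixed_share` are illustrative names.

```python
# Average per-step runtimes in seconds for SRR073726, copied from Table 5.
STEP_RUNTIME = {
    2: {"step 1": 224, "step 2": 59, "step 3": 1368, "step 4": 3, "step 5": 247},
    4: {"step 1": 222, "step 2": 87, "step 3": 753, "step 4": 180, "step 5": 151},
    8: {"step 1": 307, "step 2": 116, "step 3": 441, "step 4": 178, "step 5": 139},
}


def fixed_share(steps, fixed=("step 1", "step 2")):
    """Fraction of total runtime spent in the steps that do not scale.

    Assumption: the total is the sum of the five listed steps.
    """
    total = sum(steps.values())
    return sum(steps[name] for name in fixed) / total


if __name__ == "__main__":
    for nodes, steps in STEP_RUNTIME.items():
        print(f"{nodes} nodes: {fixed_share(steps):.1%}")
```

Under these assumptions the share rises from roughly 15% on 2 nodes to roughly 36% on 8 nodes, which is why adding nodes yields diminishing returns.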
The inconsistency of step 4 is caused by
differences among the computing nodes. When one
node processes data more slowly than the others,
Hadoop must wait until all nodes finish their work,
which increases the runtime of step 4. In a
heterogeneous cluster, the runtime may therefore be
determined by the slowest node, so a heterogeneous
environment is not recommended for Hadoop-RINS.
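The effect of a slow node can be illustrated with a toy model. This is a sketch under stated assumptions, not a measurement: the work units and node speeds are hypothetical, and `phase_runtime` is an illustrative helper that models a synchronized phase as ending when the slowest node finishes.

```python
def phase_runtime(work_per_node, speed_per_node):
    """Runtime of a synchronized phase: the maximum of the per-node runtimes."""
    return max(w / s for w, s in zip(work_per_node, speed_per_node))


# Hypothetical example: four nodes, each given 100 units of work.
work = [100.0] * 4

# All nodes run at the same (hypothetical) speed of 1 unit per second.
uniform = phase_runtime(work, [1.0, 1.0, 1.0, 1.0])

# One node runs at half speed; every other node waits for it.
with_straggler = phase_runtime(work, [1.0, 1.0, 1.0, 0.5])

if __name__ == "__main__":
    print(f"uniform cluster:      {uniform:.0f} s")
    print(f"one half-speed node:  {with_straggler:.0f} s")
```

In this model a single half-speed node doubles the phase runtime even though three of the four nodes are unaffected.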
Hadoop-RINS running on the 2-node, 4-node, and
8-node clusters produces the same contigs as RINS.
Hadoop-RINS therefore has the same accuracy as
RINS.
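One way such a contig comparison could be carried out is sketched below. The source does not describe how the contigs were compared, so this is an assumption: contigs are taken to be FASTA records, and two outputs count as the same if they contain the same sequences regardless of header text or record order. `parse_fasta` and `same_contigs` are illustrative names, and reverse complements are not treated as equal.

```python
from collections import Counter


def parse_fasta(lines):
    """Yield one upper-case sequence per FASTA record, ignoring header text."""
    seq = []
    started = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if started:
                yield "".join(seq)
            seq = []
            started = True
        elif started:
            seq.append(line.upper())
    if started:
        yield "".join(seq)


def same_contigs(lines_a, lines_b):
    """True if both inputs hold the same multiset of contig sequences."""
    return Counter(parse_fasta(lines_a)) == Counter(parse_fasta(lines_b))


if __name__ == "__main__":
    # Hypothetical contigs; real use would pass open file handles instead.
    rins = [">c1", "ACGT", "ACGT", ">c2", "GGCC"]
    hadoop_rins = [">n1", "GGCC", ">n2", "ACGTACGT"]
    print(same_contigs(rins, hadoop_rins))
```

Comparing multisets rather than files byte for byte avoids false mismatches caused by different contig naming or output order.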
5 DISCUSSION
Processing speed is an important indicator in
pathogen detection. In this article, we analyzed the
pipeline and runtime of RINS to find the bottleneck,
and then used Hadoop to build a parallel pipeline
for the main steps of RINS. In the future, we
will implement a sub-pipeline to analyze the filtered
data that cannot be mapped to reference genomes.