ERapidNJ did not finish within 48 hours due to cluttering of data. NINJA was able to finish this data set in 16 hours because NINJA computes much tighter bounds on q-values than ERapidNJ. NINJA also uses a technique called q-filtering (Wheeler, 2009) to further reduce the search space when searching for q_min. This is computationally expensive, but on a few large data sets with cluttering the tighter bounds give NINJA an advantage, because ERapidNJ cannot store enough columns of S in internal memory. More memory improves ERapidNJ's performance on these data sets significantly.
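The bound-based pruning that these tools rely on can be sketched as follows. This is an illustrative simplification, not the actual code of either tool: the function name is hypothetical and the unscaled NJ criterion q(i, j) = (n-2)·d(i, j) - r_i - r_j is assumed. Each row of S holds (distance, taxon) pairs sorted by increasing distance, so once a lower bound on all remaining q-values in a row reaches the current minimum, the rest of the row can be skipped:

```python
def search_qmin(S, row_sums, n):
    """Find the minimal q-value over all pairs, pruning sorted rows.

    S        : per-row lists of (distance, taxon) pairs, sorted by distance
    row_sums : r_i, the sum of row i of the distance matrix
    n        : number of active taxa
    """
    r_max = max(row_sums)
    q_min, best = float("inf"), None
    for i, row in enumerate(S):
        for d, j in row:
            # Lower bound on q(i, j') for every remaining j' in this row:
            # distances only grow, and r_j' <= r_max.
            if (n - 2) * d - row_sums[i] - r_max >= q_min:
                break  # no pair further along this row can improve q_min
            q = (n - 2) * d - row_sums[i] - row_sums[j]
            if q < q_min:
                q_min, best = q, (i, j)
    return q_min, best
```

Tighter bounds (as in NINJA) make the break condition fire earlier, which is exactly what pays off on cluttered data sets where distances within a row are nearly uniform.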
The performance of QuickTree was inferior to both NINJA and ERapidNJ on all data sets. When trying to build trees with more than 32,000 taxa using QuickTree, the running time exceeded 48 hours, because more than 2 GB of internal memory is needed to build such trees, which results in memory page swapping. Since QuickTree is not I/O efficient, page swapping causes a huge penalty that prevents QuickTree from finishing within a reasonable amount of time.
3.2.1 Improving Performance by Parallelisation
Parallelisation of the original NJ method can be done by dividing the rows of D into t sets of approximately the same size and then searching each set for q_min in parallel. Similarly, ERapidNJ can be parallelised by searching rows of S in parallel. The performance of the canonical NJ method can easily be improved in this way, as searching for q_min is the most time-consuming step of the canonical NJ method. This is not always the case with ERapidNJ, where the time spent on operations such as reading the distance matrix from the HDD, sorting S and updating data structures is similar to the total time used on searching for q_min when building relatively small trees. As an example, ERapidNJ uses 33% of the total running time on reading the distance matrix and only 24% of the total running time on searching for q_min when building a tree containing 10,403 taxa. Operations such as reading data from external memory and updating data structures in internal memory do not benefit significantly from parallelisation, which limits the potential performance gain from parallelising ERapidNJ on small data sets. On larger data sets the search for q_min takes up a substantial part of the total time consumption, and here parallelisation is more effective.
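The row-partitioned decomposition described above can be sketched as follows. This is an illustrative Python sketch, not the tools' actual implementation (the real tools are compiled code; in CPython the GIL limits thread-level CPU parallelism, so the point here is the partition-and-reduce structure): each worker finds the local minimum q-value over its set of rows, and the global q_min is the minimum of the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def qmin_in_rows(D, row_sums, n, rows):
    """Local minimum of q(i, j) = (n-2)*d(i,j) - r_i - r_j over the given rows."""
    best = (float("inf"), None)
    for i in rows:
        for j in range(n):
            if i == j:
                continue
            q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
            if q < best[0]:
                best = (q, (i, j))
    return best

def parallel_qmin(D, t=4):
    """Split the rows of D into t roughly equal sets and search them in parallel."""
    n = len(D)
    row_sums = [sum(row) for row in D]
    chunks = [range(k, n, t) for k in range(t)]  # t interleaved row sets
    with ThreadPoolExecutor(max_workers=t) as pool:
        partials = pool.map(lambda rows: qmin_in_rows(D, row_sums, n, rows), chunks)
    return min(partials, key=lambda p: p[0])  # reduce: global minimum
```

The reduction step is cheap (t candidates), so the achievable speedup is governed by how evenly the per-row search work divides among the t sets.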
Experiments with a parallelised version of ERapidNJ showed a reduction of the total running time by a factor of 2.2 on a quad core Intel Core 2 Duo processor, compared to the unparallelised ERapidNJ on the same processor, when building a tree with 25,803 taxa. When building a tree with 10,403 taxa, the total running time was only reduced by a factor of 1.12. Both these data sets were computed in internal memory; parallelisation of RapidDiskNJ will not increase performance significantly, as RapidDiskNJ is I/O bound, i.e. most of the running time is spent waiting for the external memory.
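These measurements are consistent with an Amdahl's-law estimate: if only the q_min search parallelises, and that search is about 24% of the runtime in the 10,403-taxon case reported above, then four cores can give at most roughly a 1.2x speedup. The 90% figure below is a hypothetical parallel fraction for a search-dominated large data set, not a number from the paper.

```python
def amdahl(parallel_fraction, cores):
    """Amdahl's law: overall speedup when only a fraction of the work
    parallelises perfectly across the given number of cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# 24% parallelisable (measured small-tree case), 4 cores:
# upper bound ~1.22, consistent with the observed 1.12.
speedup_small = amdahl(0.24, 4)

# Hypothetical 90% parallelisable search-dominated case, 4 cores: ~3.1.
speedup_large = amdahl(0.90, 4)
```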
4 CONCLUSIONS
We have presented two extensions and an improved search heuristic for the RapidNJ method, which both increase the performance of RapidNJ and decrease its internal memory requirements significantly. Using the methods described in this paper, we were able to overcome RapidNJ's limitations regarding memory consumption and performance on data sets containing redundant and cluttered taxa. We have presented experiments with the extended RapidNJ tool showing that canonical NJ trees containing more than 50,000 taxa can be built in a few hours on a desktop computer with only 2 GB of RAM. Our experiments also showed that, in comparison with the fastest tools available for building canonical NJ trees, the ERapidNJ tool is significantly faster on any size of input.
We are aware that statistical phylogenetic inference methods with better precision than distance-based methods are available. However, the time complexity of these methods is high compared to the NJ method, and currently they do not scale well to large data sets (Stamatakis, 2006; Ott et al., 2007), which justifies the existence of distance-based methods as presented in this paper.
REFERENCES
Aggarwal, A. and Vitter, J. S. (1988). The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127.
Alm, E. J., Huang, K. H., Price, M. N., Koche, R. P., Keller, K., Dubchak, I. L., and Arkin, A. P. (2005). The MicrobesOnline web site for comparative genomics. Genome Research, 15(7):1015–1022.
Elias, I. and Lagergren, J. (2005). Fast neighbour joining. In Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP), volume 3580 of Lecture Notes in Computer Science, pages 1263–1274. Springer.
Finn, R. D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S. R., Sonnhammer, E. L. L., and Bateman, A. (2006). Pfam: