by the scientific community, cited over 3000 times in
the literature.
Phylogenetic inference (PI), the construction of evolutionary histories, and Phylogenetic Comparative Methods (PCM), a complementary statistical framework for analysing data in a phylogenetic context, are fundamental and universal methods for studying biological systems (Pagel and Meade, 2008). With the large volumes of data produced by next-generation sequencing, large phylogenies have been created: these include a near-complete phylogeny of the birds (9,000 taxa) and a large fish phylogeny (8,000 taxa). These join a near-complete mammal phylogeny (5,000 taxa) and a mega-phylogeny of plants (55,000+ taxa). PCM analysis has also been applied to discover many evolutionary processes.
Despite these advances, the phylogenetic problem remains far from 'solved' and has in fact become acute at big-data scale. It was hoped that access to large data sets would make phylogenetic inference universally available, robust and accurate; surprisingly, the opposite appears to be the case (Salichos and Rokas, 2013). Analysing big data is therefore much more than redesigning the models used for traditional analysis to work with data sets orders of magnitude larger. New generations of models, methods and software are needed to fully exploit the power of these data sets. Scalability is a defining feature of big data analysis. Overly simple statistical models do not scale; software applied to data sets orders of magnitude larger than those it was designed for is not scalable; and computational methods to convert the results into biologically relevant information also have scalability limitations.
These can be defined as model scalability, software
scalability, and data visualisation scalability. Big
data phylogenetic inference and comparative meth-
ods offer much more than the challenge of analysing
larger data sets: they offer new opportunities to exam-
ine evolutionary processes and the prospect of gaining
new insights at both micro and macro levels.
In this paper we present an evaluation of the scalability of BayesPhylogenies across two petascale-class architectures on big fish data sets (3k and 5k taxa). Analysis of strong and weak scaling at large processor counts, from 4k to 12k, gives insights into hybrid parallel program efficiencies, scalability limitations across large taxa sets, and the algorithmic characteristics of MCMCMC. From these insights we draw several conclusions on scaling phylogenetic analysis to the next generation, from hierarchical parallelisation to emerging algorithmic adaptations based on machine learning techniques.
2 BIG DATA BAYESIAN
PHYLOGENETICS
2.1 Bayesian Phylogenetics
2.1.1 Search Procedure
Algorithmic approaches to Bayesian inference of
phylogenies have two components: a scoring metric
and a search procedure.
Let D denote the DNA sequence data from n taxa. An analysis is typically performed over 10,000 nucleotides drawn from multiple genes. The phylogenetic analysis seeks to infer the tree τ_i that is most consistent with the observed data D. As the space of possible trees is exponential in the number of taxa, the problem is framed as a search over plausible trees conditional on the observations in a statistical framework: P(τ_i | D). A computational procedure is then formulated using Bayes' rule as follows:
P(τ_i | D) = [P(τ_i) · P(D | τ_i)] / [Σ_{j=1}^{B(n)} P(τ_j) · P(D | τ_j)]    (1)
where P(τ_i | D) gives a score for the i-th tree, P(D | τ_i) the likelihood, P(τ_i) the prior probability, and B(n) the number of possible trees on n taxa.
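As a toy illustration of Equation 1, the following sketch scores a handful of candidate trees given their priors and likelihoods. The log-likelihood values and the use of three trees are invented for illustration; a real analysis cannot enumerate the tree space, which is why an MCMC search is used instead.

```python
import math

# Hypothetical log-likelihoods log P(D | tau_j) for three candidate trees
log_liks = [-1204.7, -1198.2, -1201.5]
priors = [1/3, 1/3, 1/3]  # uninformative prior over the candidates

# Work in log space to avoid underflow, then normalise (the sum in
# the denominator of Equation 1, restricted to these candidates).
log_scores = [math.log(p) + ll for p, ll in zip(priors, log_liks)]
m = max(log_scores)
weights = [math.exp(s - m) for s in log_scores]
total = sum(weights)
posteriors = [w / total for w in weights]
```

With an uninformative prior the posterior ranking follows the likelihoods, which is why the likelihood dominates the computation in practice.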
Instead of computing a single tree, as in maximum-likelihood methods, the formulation in Equation 1 computes a distribution of trees by applying an MCMC search procedure (Meade, 2011). Under the assumption of uninformative priors, the main computation is then the likelihood P(D | τ_i). The procedure thus searches for a set of plausible trees that are weighted by their probabilities. In practice a specific tree is modelled by M = {τ, υ, Q, γ}: topology, branch lengths, DNA substitution parameters and gamma shape parameter respectively.
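The MCMC search over M can be sketched with a minimal Metropolis random walk on a single continuous parameter, here standing in for the gamma shape parameter γ. This is not the BayesPhylogenies implementation: the target density below is an invented stand-in for the phylogenetic log-likelihood, and real proposals also include topology and branch-length moves.

```python
import math
import random

random.seed(1)

def log_posterior(gamma_shape):
    # Stand-in for log P(D | M) + log prior; in a real analysis this
    # would be the phylogenetic likelihood of the alignment.
    return -0.5 * (gamma_shape - 1.0) ** 2

state = 0.2
samples = []
for _ in range(5000):
    proposal = state + random.gauss(0.0, 0.3)   # symmetric random-walk move
    log_ratio = log_posterior(proposal) - log_posterior(state)
    if math.log(random.random()) < log_ratio:   # Metropolis acceptance rule
        state = proposal
    samples.append(state)

# Discard burn-in before summarising the posterior sample.
mean = sum(samples[1000:]) / len(samples[1000:])
```

The chain's retained states approximate the posterior distribution, which is how a distribution of trees (rather than a single best tree) is obtained.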
2.1.2 Nucleotide Substitution Model
For a particular tree the transition probabilities from
root to all the leaf nodes needs to be defined. The
problem specification therefore includes a concrete
model of evolution. An evolutionary mechanism re-
sponsible for sequence change can be quantified in
terms of (rate of) nucleotide substitution as a sub-
stitution matrix Q. A number of models have been
proposed and it has been shown that different mod-
els are subsumed in the General Time Reversal (GTR)
model(Felsenstein, 2004). The GTR model states that
each character may be substituted for any other, and
this can be specified with a 4 × 4 instantaneous rate
matrix Q.
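A minimal sketch of such a rate matrix follows, assuming the standard GTR parameterisation in terms of six exchangeability rates and equilibrium base frequencies π (the numeric values are hypothetical). It also checks the time-reversibility property π_i q_ij = π_j q_ji that gives the model its name.

```python
bases = "ACGT"
pi = [0.30, 0.20, 0.25, 0.25]   # equilibrium base frequencies (hypothetical)
# Exchangeability rates for each unordered base pair (hypothetical values;
# transitions A<->G and C<->T set higher, as is typical).
ex = {("A", "C"): 1.0, ("A", "G"): 4.0, ("A", "T"): 1.0,
      ("C", "G"): 1.0, ("C", "T"): 4.0, ("G", "T"): 1.0}

def rate(i, j):
    # Off-diagonal GTR rate: exchangeability times target base frequency.
    key = tuple(sorted((bases[i], bases[j])))
    return ex[key] * pi[j]

Q = [[rate(i, j) if i != j else 0.0 for j in range(4)] for i in range(4)]
for i in range(4):
    Q[i][i] = -sum(Q[i])        # rows of an instantaneous rate matrix sum to zero

# Detailed balance (time reversibility): pi_i * q_ij == pi_j * q_ji
for i in range(4):
    for j in range(4):
        if i != j:
            assert abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12
```

The transition probabilities along a branch of length t are then obtained from the matrix exponential P(t) = exp(Qt), which is the quantity evaluated repeatedly in the likelihood computation.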
BIOINFORMATICS 2019 - 10th International Conference on Bioinformatics Models, Methods and Algorithms