Application of Ant Colony Optimization for Mapping the Combinatorial

Phylogenetic Search Space

Alexander Safatli

and Christian Blouin

1,2

Faculty of Computer Science, Dalhousie University, Halifax, Canada

Department of Biochemistry & Molecular Biology, Dalhousie University, Halifax, Canada

Keywords:

Phylogenetic Trees, Combinatorial Space, Ant Colony Optimization, Swarm Intelligence.

Abstract:

In bioinformatics, landscapes of phylogenetic trees for an alignment of sequence data are deﬁned by a discrete

state combinatorial space. The optimal solution in such a space is the best-ﬁtting tree which provides insight

on the evolutionary relationship between taxonomic groups. The underlying structure of this space is poorly

understood. The Ant Colony Optimization (ACO) algorithm is applied in a novel manner to sample phyloge-

netic tree landscapes in order to understand more about this structure. The proposed implementation provides

a probabilistic model for exploring this combinatorial space. This probabilistic model allows us to circumvent

the complexity that arises due to increasing the number of sequences. In order to evaluate its performance,

quantities of resultant solutions were judged in order to determine how much of the space can be sampled.

The results show that the algorithm is robust to the starting location and consistently samples a majority of the

search space.

1 INTRODUCTION

Trees have been used to a great extent in biology to

graphically represent evolutionary relationships be-

tween species, subpopulations, and genes. These

trees are generally semi-labelled, binary, and rooted,

but they can also be multifurcating or unrooted. They

descend from a root node, bifurcate at other nodes

and end at leaves labelled by the names of operational

taxonomic units (OTUs). Internal nodes are thus in-

ferred ancestors of these OTUs. Traditionally, these

trees were inferred from morphological similarities,

where two species that shared the most characteristics

were considered siblings or homologous and shared

a common ancestor that was not the ancestor of any

other species. More recently, the inference of these

so-called phylogenetic trees is done from sequence

alignment data.

This inference or tree reconstruction is a central

problem in computational biology and is known to be

NP-complete (Foulds and Graham, 1982). Biologists

have to therefore use approximate optimization algo-

rithms with certain starting points and certain random

moves between trees where resulting trees can vary

from run to run. Many such methods exist for the

construction of phylogenetic trees. All share a ﬁtness

function f that scores the ﬁt of a particular tree with

a sequence alignment. Examples of ﬁtness functions

include likelihood and parsimony (Fitch, 1971).

A phylogenetic landscape or tree space refers to

the combinatorial space of all possible phylogenetic

tree topologies for a set of n leaves or taxa. This

forms a discrete solution search space and ﬁnite graph

G = (T,E) viewed as a function of its vertices. The

ﬁnite set E refers to the neighborhood relation on

T , a set of tree topologies or conﬁgurations (Bastert

et al., 2002; Charleston, 1995; Stadler, 1996). Tran-

sitions between states in this space are bidirectional

and represent transformation from one tree topology

to another by a tree rearrangement operator. We use

the the Subtree Prune and Regraft (SPR) operator

which describes a transfer of a node in a phylogenetic

tree from one parent node to a new parent (Felsen-

stein, 2004). The phylogenetic tree space is central to

an example of a combinatorial optimization problem

P = (T, f ). The set of all possible feasible assign-

ments T is known as the search space. Each element

of the set is a candidate solution. The optimal solution

in tree space is the solution t

∗

∈ T with maximum ﬁt-

ness function value f (t

∗

) ≥ f (t)∀t ∈ T .

Because the structure of this space is still poorly

understood, and distinguishing elements of this space

are still being discovered (Sanderson et al., 2011),

we sought to ﬁnd a method to sample large areas of

195

Safatli A. and Blouin C..

Application of Ant Colony Optimization for Mapping the Combinatorial Phylogenetic Search Space.

DOI: 10.5220/0005255701950200

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2015), pages 195-200

ISBN: 978-989-758-070-3

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

this space in an effort to discover large numbers of

trees that comprise them, the structure of the graph

that connects these trees, and how this conﬁguration

of trees and edges can be utilized to improve exist-

ing approximate methods. To this end, we chose the

Ant Colony Optimization (ACO) algorithm and im-

plemented for use in exploring the space so as to sam-

ple it for trees and structure.

The ACO algorithm is a metaheuristic related to

swarm intelligence. It models the foraging behaviour

of ants with a collective behaviour to ﬁnd paths be-

tween food sources and their nests. The biologi-

cal mechanism for this involves the deposition of a

certain concentration of pheromones by ants as they

search for food based on whether or not a path they

are on leads to food. In the ACO algorithm, edges in

a graph are paths, and paths of higher pheromone con-

centrations are chosen with higher probability by ant

agents as they traverse the space. Therefore, cooper-

ative interaction emerges to solve for shortest paths

through a parametrized probabilistic model (Blum

and Roli, 2003). A solution to the optimization prob-

lem can be expressed in terms of a feasible path, or

ant trail, on this graph (Blum and Roli, 2003; Dorigo

et al., 1999; Luke, 2013).

The ﬁrst phase of the algorithm involves ant gen-

eration and activity. This includes the stochastic

movement of ants based on edge weights. The sec-

ond phase is when pheromone evaporation takes place

(Dorigo et al., 1999). Pheromone evaporation de-

scribes the decay of pheromone intensity over time.

This avoids too rapid convergence towards a sub-

optimal region (Dorigo et al., 1999).

The characteristic of a candidate solution is left

problem-speciﬁc (Luke, 2013). While moving, ants

keep in memory the path it follows. This forms a par-

tial solution to the problem. When an ant has built a

candidate solution to the problem, it will retrace its

path back to the source node and die (Dorigo et al.,

1999).

2 METHODS

2.1 PLACO

Our primary objective is to sample the phylogenetic

tree space. Therefore, sampling of the space should

avoid poorly scoring regions unless such regions are

necessary for ants to visit in order to move to a re-

gion of near globally optimal ﬁtness. In order to do

so in an effective and timely manner, it would also be

beneﬁcial to avoid consideration of all possible trans-

formations for a topology and only sample a subset of

them.

We construct an algorithm based on the simple

ant colony optimization metaheuristic introduced in

(Dorigo et al., 1999). In this algorithm, labelled as the

Phylogenetic Landscape Ant Colony Optimization

(PLACO) algorithm, a single population of ant agents

explore a space G by visiting its trees. Pheromone

trails are expressed as edge weights proportional to

the change in f between one tree and the next. Each

ant in the system performs a random walk to adjacent

vertices in the graph until they run out of a quantity

we deﬁne as energy. An ant that runs out of energy

returns to the colony by retracing its path and subse-

quently dies.

In both directions that an ant travels, away from

and back to the colony, pheromone trails are deposited

as an increase to edge weights. This corresponds with

topology ﬁtness along the forward path, but along the

path in the return trip this is a function of the ﬁttest

topology visited. This can be computed to be equal to

an exponentiation of the change in f (t) from an origin

topology to the next. We present the weight on the

edge between t

and t

as ∆w

i j

= 2

( f (t

)− f (t

))

. Neg-

ative changes in the ﬁtness function will result in a

smaller concentration of pheromones to be deposited

by ants.

Avoiding the computation of all neighboring re-

conﬁgurations to a topology can allow this heuristic

to effectively scale to large n, as the degree of a ver-

tex t

∈ T in such a case increases proportionally to the

square of n. When feasible solution components are

discovered along the vertex set of G, ants located at t

will tend to consider a set of topologies that does not

encompass all topologies in its neighborhood. To do

this, we propose that traversal in the space is reduced

to a binary decision to move to an existing neighbor

that has already been visited or to jump to a random

unexplored vertex. This is a probabilistic decision in-

ﬂuenced by the weights of existing edges and an ini-

tial weight for all of the unvisited edges that can be

possibly formed.

Movement expenditure of energy Λ is calculated

by Equation 1 as an ant travels along a path. More of

an expenditure of energy is made if the path has not

yet been explored. Otherwise, a fractional amount is

expended depending on the edge weight. Notice that

highly travelled paths are virtually free to traverse.

i j

(

1/w

i j

, if existing path

1, if unvisited path

(1)

This sequence of items should result in an algo-

rithm (Algorithm 1) that is capable of scaling to large

input and performing an adaptive search of the graph

analogous to a breadth ﬁrst search in unexplored re-

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

196

gions of the graph, and depth ﬁrst search where there

is a strong pheromone concentration.

Algorithm 1: Phylogenetic Ant Colony Exploration.

Require: Phylogenetic Landscape G = (T,E) for n

taxa

Require: Starting Colony Location in T

Require: e ← Evaporation Constant, 0 < e ≤ 1

Require: max ← Maximum Number of Ant Agents

Require: init ← Starting Pheromone Concentra-

tion, init > 0

ants ← Population of Ant Agents

while still exploring do

createAnt(ants,n)  Generate new ant.

for ant in ants do  Perform all ant

movements.

if ant.getEnergy() > 0 then

moveAntForward(G,ant)

else

moveAntBackward(ant)

if ant.isDead() then

Remove ant from ants

end if

end for

for edge in E do  Evaporate all pheromones.

edge.reduceWeight(e)

end for

end while

2.2 Performance Evaluation

In order to test the effect of a number of free pa-

rameters in the proposed algorithm on its to sample

the space, the proposed algorithm was run on exist-

ing sequence data. The algorithm was applied to both

empirical biological sequence data (n = 23) as well

as synthetic sequence data (n = 9), of which the full

search space can be explored, in order to compare

both the difference arising from different taxon set

size and simulation. Stamatakis and Albright et al.

both claim that simulated phylogenetic tree data tends

to be less complex than real biological data. As a re-

sult, the landscapes that result from simulated data are

not entirely representative in terms of the complex-

ity that may result from actual data (Albright et al.,

2014; Stamatakis, 2014). This hypothesis will be in-

vestigated when considering the results of the experi-

ments.

The data used for evaluation was involved in

a phylogenetic examination done by Meehan et al.

which was used to investigate complex phylogenetic

properties of the Lachnospiraceae family (Meehan

and Beiko, 2013). 16S RNA sequence information

regarding a selection of species in this family was

used as input for phylogenetic tree reconstruction in

the proposed algorithm.

Deﬁnition Let the free parameters for the proposed

method be denoted e for an evaporation constant, m

for the maximum number of possible ant agents, i for

the exponent of the initial pheromone concentration

of edges such that 2

is the starting weight, and t

cor-

responds to the starting topology.

After running a series of experiments, 45 land-

scapes were created for both the empirical and syn-

thetic data, each with unique sets of parameter val-

ues, testing the parameters e, m and i, to thoroughly

span the domain of possible values for each parame-

ter. These were all given a starting topology that cor-

responds to one that scores well as found by a heuris-

tic known as FastTree (Price et al., 2009). For a selec-

tion of the landscapes with parameters that searched

the widest portion of the space, replicate runs will be

performed with identical parameters and with a vari-

ation in starting location t

. For all experiments, the

number of iterations was kept ﬁxed at 10000.

Various quantities on resultant explored graphs

were recorded. These quantities include the range

of ﬁtness function values, for which we use log-

likelihood, range of node degree, number of trees,

and their graph diameter. Log-likelihood was cal-

culated using the C phylogenetic likelihood library

(http://www.libpll.org). Node degree refers to the

number of edges connected to a given node. Another

evaluation metric involves considering the number of

different splits among explored trees as a measure

of diversity in tree topologies. A greater number of

splits, or bipartitions of trees, implies a greater di-

versity of trees found. Consideration of these prop-

erties should determine the breadth of sampling that

was done across the space.

A ﬁnal test that was carried out on the explored

landscapes involved ranking all found trees by their

log-likelihood. Then, we deﬁned a conﬁdence set

of trees amongst the top 10% of these trees from re-

sults of the Approximately Unbiased (AU) test. The

AU test is a procedure which provides a selection of

trees which is most likely to to include the true tree

amongst a selection of trees (Shimodaira, 2002). The

AU test was applied to these selections of trees using

the CONSEL application (Shimodaira and Hasegawa,

2001).

ApplicationofAntColonyOptimizationforMappingtheCombinatorialPhylogeneticSearchSpace

197

Table 1: Replicate Run Landscape Quantities. The σ and ¯x noted by each quantity indicates these are standard deviations

and averages of each measure respectively. Let A represent the conﬁdence set of trees computed by the AU Test.

∗

Parameters

are triplets where (a) e = 0.25,i = −8,m = 5, (b) e = 0.50,i = 0,m = 10, (c) e = 0.75,i = 0,m = 10, and (d) e = 0.00, i =

8,m = 10.

†

The number of bipartitions refers to the number of unique clades or topological splits in the trees.

Data Type Set

∗

σ Num. σ Max Degree σ Avg. ¯x Avg. σ Max ¯x |A|

Bipartitions

†

Log-Likelihood Log-Likelihood Log-Likelihood

Empirical

n = 23

(a) 4726 23.7 95.7 -12384.7 138 20

(b) 10579 104 149 -12406.4 74.5 31

Synthetic

n = 9

(d) 0.376 9.00 17.0 -11014.7 14.6 2

(b) 0.855 8.90 13.3 -11016.0 9.68 3

Table 2: Varied Starting Topologies Run Landscape Quantities. The σ and ¯x noted by each quantity indicates these are

standard deviations and averages of each measure respectively. Let A represent the conﬁdence set of trees computed by the

AU Test.

∗

Parameters are triplets where (a) e = 0.25, i = −8,m = 5, (b) e = 0.50,i = 0, m = 10, (c) e = 0.75,i = 0,m = 10,

and (d) e = 0.00,i = 8,m = 10.

†

The number of bipartitions refers to the number of unique clades or topological splits in the

trees.

Data Type Set

∗

σ Num. σ Max Degree σ Avg. ¯x Avg. σ Max ¯x |A|

Bipartitions

†

Log-Likelihood Log-Likelihood Log-Likelihood

Empirical

n = 23

(a) 10883 63.2 616 -12955.5 739 16

(b) 25297 182 257 -12512.3 182 37

Synthetic

n = 9

(d) 0.870 13.3 14.4 -11005.9 22.2 2

(b) 2.140 17.3 18.2 -11014.3 14.7 3

3 RESULTS

When investigating different sets of e, m and i param-

eters, it was found that the diameter for every land-

scape appears to remain constant regardless of how

parameters were varied. The diameter of the phylo-

genetic space is meant to be θ(n), and the diameters

found were slightly less than n. The diameter for em-

pirical data landscapes were found to be equal to 15-

16 nodes, and for the synthetic data landscapes it was

5. The fact that these values are smaller than expected

is hypothesized to be due to the existence of subopti-

mal regions that are not reached by the PACO algo-

rithm.

What does appear to differ regards the quantity

and range of scores for trees visited between these

different topologies. The number of trees and biparti-

tions found differed mostly when e and m were varied.

There appears to be a signiﬁcant proportionality be-

tween the number of ants and how many bipartitions

are found. A greater quantity of ants implies more

work being done at every iteration of the algorithm.

Furthermore, these ants also interact with each other

through the deposited pheromone applied on edges.

The evaporation constant appears to signiﬁcantly

affect the quality of ﬁtness of the topologies the ant

agents visit. The best collection of trees is found

around a constant of e = 0.5, but extreme e values

leads to a drop in the ability of the algorithm to ex-

plore the most relevant regions of the space. This dra-

matically reduces the relative ﬁtness of found trees.

Pheromone concentrations across edges effectively

encode a long-term memory of ants upon the surface.

It is surmised that evaporation provides the algorithm

the ability to forget poor regions and reinforce the ex-

ploration of higher likelihood regions.

We selected three triplets of parameters where

search properties were satisfactory and kept them

ﬁxed to test for robustness of the search. Across

replicate runs with these selections (Table 1), we ﬁnd

similar properties of broad exploration through the

space. All of the replicate runs consistently gener-

ated landscapes with a large number of trees. How-

ever, when we investigate results from the empirical

data, a large deviation exists amongst the replicate

runs for the number of bipartitions and the maximum

degree found in the search space. The former devi-

ation signiﬁes variation in the algorithm’s ability to

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

198

ﬁnd a great diversity of trees across the runs. The lat-

ter suggests inconsistent behavior when ants are caus-

ing edges to be created, possibly due to differently

scored trees being visited. Despite this, the difference

in log-likelihood is small and the nodes where the de-

gree is largest are those that score higher. Therefore,

while different breadths of tree diversity is being ac-

quired, and while different trees are being visited be-

tween replicate runs, the ability of the algorithm to

sample similarly scoring trees and regions does not

appear to change. Notice, also, that this inference is

less relevant if we discuss the synthetic dataset.

Being a smaller search space, the algorithm ap-

pears to sample the trees found in the smaller space

thoroughly. Respectively, the algorithm explored a

number of the possible trees in the empirical land-

scape on the order of 10

−15

% and of the synthetic

dataset on the order of 1%. As the number of trees in

the space is equal to O(n!), this shows that although

the algorithm is not searching a very large proportion

of all possible trees, it is sampling a number of them

that is sufﬁcient to acquire a shape of the landscape.

When choosing different starting topologies (Ta-

ble 2), the deviations in number of bipartitions and

average log-likelihood are magniﬁed between both

empirical and synthetic data. It appears that when a

different, possibly worse starting topology, is chosen,

more iterations need to be done in order to acquire

more of a diversity in splits and to bring resultant

landscapes consistently into regions of better scoring

topologies. However, when a good path is found, the

energy expenditure function should mitigate this ef-

fect.

When the AU test conﬁdence sets of trees were

computed for, the number of trees found to be present

in these sets were similar for respective sets of param-

eters and for both emperical and synthetic data. Even

when starting topologies were varied, the number of

trees found to be part of these sets did not seem to be

reduced from those found by starting at a well-scoring

topology. This suggests a tendency for the search to

ﬁnd well-scoring regions of trees.

4 CONCLUSION

This metaheuristic was designed to sample a large

number of regions of interest of the search space with

a reasonable number of iterations and amount of time.

In order to acquire an understanding of its perfor-

mance, a number of parameters possessed by the al-

gorithm can be tuned. We found that evaporation was

effective in steering the search to well-scoring regions

of the space, the number of ant agents extended the

number of trees found, and that the highest scoring

trees in the search were visited more often as indi-

cated by their increased degree.

Two rounds of experimentation were carried out

including a ﬁrst round testing for different triplets of

parameter values. The second round of experimen-

tation saw the investigation of replicate runs and the

starting topology being varied. All results show that,

when exploring both empirical and synthetic data, we

can make three claims about the performance of the

proposed algorithm. Firstly, the PLACO algorithm is

capable of broadly exploring the combinatorial space

in spite of the number of taxa. Secondly, across repli-

cate runs we ﬁnd consistent behaviour but variation in

quality of trees. This implies a sparse but broad search

where different topologies are being found. Thirdly,

it does not matter where the algorithm starts in order

to acquire a wide ranging set of trees and to sample

properties of and the shape of the space.

Future work investigate the maintenance of multi-

ple populations in the space. For example, we could

build into the algorithm an ability for it to iteratively

create colonies. This can accomplish to more densely

move across the space and focus on regions of partic-

ular interest.

REFERENCES

Albright, E., Hessel, J., Hiranuma, N., Wang, C., and Go-

ings, S. (2014). A comparative analysis of popu-

lar phylogenetic reconstruction algorithms. Midwest

Instruction and Computing Symposium (MICS) 2014

Proceedings.

Bastert, O., Rockmore, D., Stadler, P. F., and Tinhofer, G.

(2002). Landscapes on spaces of trees. Applied math-

ematics and computation, 131(2):439–459.

Blum, C. and Roli, A. (2003). Metaheuristics in combinato-

rial optimization: Overview and conceptual compari-

son. ACM Comput. Surv., 35(3):268–308.

Charleston, M. A. (1995). Toward a characterization of

landscapes of combinatorial optimization problems,

with special attention to the phylogeny problem. Jour-

nal of Computational Biology, 2(3):439–450.

Dorigo, M., Di Caro, G., and Gambardella, L. M. (1999).

Ant algorithms for discrete optimization. Artiﬁcial

life, 5(2):137–172.

Felsenstein, J. (2004). Inferring phylogenies, volume 2.

Sinauer Associates Sunderland.

Fitch, W. M. (1971). Toward deﬁning the course of evo-

lution: minimum change for a speciﬁc tree topology.

Systematic Biology, 20(4):406–416.

Foulds, L. R. and Graham, R. L. (1982). The steiner prob-

lem in phylogeny is np-complete. Advances in Applied

Mathematics, 3(1):43–49.

Luke, S. (2013). Essentials of Metaheuristics. Lulu, second

edition.

ApplicationofAntColonyOptimizationforMappingtheCombinatorialPhylogeneticSearchSpace

199

Meehan, C. J. and Beiko, R. G. (2013). A phyloge-

nomic view of ecological specialization in the lach-

nospiraceae, a family of digestive tract-associated

bacteria. Technical report, PeerJ PrePrints.

Price, M. N., Dehal, P. S., and Arkin, A. P. (2009). Fasttree:

computing large minimum evolution trees with pro-

ﬁles instead of a distance matrix. Molecular biology

and evolution, 26(7):1641–1650.

Sanderson, M. J., McMahon, M. M., and Steel, M.

(2011). Terraces in phylogenetic tree space. Science,

333(6041):448–450.

Shimodaira, H. (2002). An approximately unbiased test

of phylogenetic tree selection. Systematic Biology,

51(3):492–508.

Shimodaira, H. and Hasegawa, M. (2001). Consel: for as-

sessing the conﬁdence of phylogenetic tree selection.

Bioinformatics, 17(12):1246–1247.

Stadler, P. F. (1996). Landscapes and their correlation func-

tions. Journal of Mathematical chemistry, 20(1):1–45.

Stamatakis, A. (2014). Raxml version 8: A tool for phy-

logenetic analysis and post-analysis of large phyloge-

nies. Bioinformatics.

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

200