ANT COLONY SYSTEM ALGORITHM FOR EXTRACTING MATHEMATICAL RELATIONS FROM DATABASE

R. F. Marques (1) and L. H. A. Monteiro (1,2)

(1) Escola de Engenharia, Universidade Presbiteriana Mackenzie, Rua da Consolação 896, 01302-907, São Paulo, SP, Brazil

(2) Escola Politécnica, Universidade de São Paulo, Av. Prof. Luciano Gualberto (travessa 3) 380, 05508-900, São Paulo, SP, Brazil

Keywords:

Ant colony system, Breast cancer, Data mining, Swarm intelligence.

Abstract:

An algorithm inspired by the strategy that ant colonies employ when seeking food was developed in order to extract mathematical formulas from data. This algorithm, called Formula Miner, is applied to a breast cancer diagnosis database, and its performance is compared with that of some well-known data mining algorithms.

1 INTRODUCTION

Ant colonies show no evident centralized management. A social organization that works without explicit command may seem incomprehensible, because no individual is aware of what must be done to accomplish the activities vital to the society, such as walking to find food. According to Deneubourg et al. (1983), ant walking is essentially a probabilistic behavior, which can be verified by studying the strategy that ants adopt when seeking food. Usually, some ants randomly stray from the established path and discover new sources of food. Exploiting these new sources requires the recruitment of further ants. Such behavior gives the colony an adaptive advantage, allowing fast adaptation when the environment changes.

When an ant moves, it deposits a chemical substance, called pheromone, on the ground, forming a trail. Another ant that finds this pheromone trail can decide to follow it, leaving its own pheromone along the trail. In this way, ants communicate with one another, sharing information, and can find the right way to sources of food (Beckers et al., 1992).

This collective behavior, known as swarm intelligence, emerges from an autocatalytic process: the more ants follow a pheromone trail, the more attractive this trail becomes. Such a process is characterized by positive feedback, which reinforces itself (Dorigo et al., 1996). Hence, the probability that an ant chooses a path increases with the number of ants that have previously chosen the same path.

Such an ant colony framework has inspired the development of evolutionary algorithms for numerically solving complex computational tasks. Usually, artificial ants are employed collaboratively to find optimum solutions in large search spaces. The original idea is due to Dorigo et al. (1996) and Dorigo and Gambardella (1997), who showed how the simple behavior of following pheromone trails can be used to solve the traveling salesman problem. Their algorithms were based on the observation that ants are able to establish the shortest path from their nest to a food source. Similar approaches have been developed by several authors: Gambardella et al. (1997) to solve the quadratic assignment problem; Rajesh et al. (2001) to optimize chemical process design; Parpinelli et al. (2002) to extract classification rules from data; Gómez et al. (2004) to plan the topology of power systems; Samrout et al. (2005) to minimize the preventive maintenance cost of series-parallel systems; Rahhal and Abu-Al-Nadi (2007) to determine the optimum configuration of an antenna array; and Mirabedini et al. (2008) to find effective routing in communications networks.

Here, we present an ant colony system algorithm for obtaining mathematical formulas from data. This algorithm, called Formula Miner, is applied to a breast cancer diagnosis database, and its performance is compared with that of other data mining algorithms.


Marques R. and Monteiro L. (2009). ANT COLONY SYSTEM ALGORITHM FOR EXTRACTING MATHEMATICAL RELATIONS FROM DATABASE. In Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2009), pages 204-207. DOI: 10.5220/0001656402040207. SciTePress.

2 THE ALGORITHM FORMULA MINER

Bonabeau et al. (1999) gave guidelines on how to develop an algorithm based on ant colony behavior:

1. choose an appropriate representation of the problem, in which artificial ants can incrementally build or modify solutions through a probabilistic rule based on the amount of pheromone deposited on a trail (a valid solution) and on other local heuristics;

2. define a heuristic function $\eta$ that measures the quality of a pheromone trail (an ant path) in the search space, and a procedure for reinforcing this trail.

Formula Miner aims to construct analytical formulas, extracting knowledge from a database. For example, in the Cartesian plane $x_1 \times x_2$, the algorithm can find the (non-linear) function $x_2 = F(x_1)$ that divides the plane into two regions, thus classifying the data.

The algorithm uses a symbolic representation to build formulas by appending terms to an initial formula. Here, we impose that all formulas begin with the number zero, and that zero occurs only as this first term. Each term consists of an operator followed by a variable or a constant. We use only the four basic arithmetic operators ($+$, $-$, $\times$, $\div$) plus parentheses. The constants are integers from 1 to 9. The variables in an $n$-dimensional space are $x_1, x_2, \ldots, x_n$. Three examples of terms are "$-x_1$", "$\div 3$" and "$\div 3)$": here "$\div 3$" means that the last term in the expression is divided by 3, whereas "$\div 3)$" means that the whole preceding expression is divided by 3. The formula is represented by $x_n = F(x_1, \ldots, x_{n-1})$.
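The term-based representation can be illustrated with a minimal Python sketch. The `Term` type, the variable name `x1`, and the reading of "÷3)" as wrapping everything built so far in parentheses are assumptions made for illustration; the paper does not prescribe a data structure.

```python
from typing import NamedTuple

class Term(NamedTuple):
    op: str              # "+", "-", "*" or "/" (standing for +, −, ×, ÷)
    operand: str         # variable name ("x1") or integer constant ("3")
    close: bool = False  # True models the "÷3)" form

def build_expression(terms):
    """Append terms to the initial formula "0"."""
    expr = "0"
    for term in terms:
        if term.close:
            # Assumed reading: the whole previous expression is the left operand.
            expr = f"({expr}){term.op}{term.operand}"
        else:
            expr = f"{expr}{term.op}{term.operand}"
    return expr

def evaluate(expr, values):
    """Evaluate the expression for given variable values (no builtins exposed)."""
    return eval(expr, {"__builtins__": {}}, dict(values))

plain = build_expression([Term("+", "x1"), Term("+", "3"), Term("/", "3")])
closed = build_expression([Term("+", "x1"), Term("+", "3"), Term("/", "3", True)])
print(plain, evaluate(plain, {"x1": 6.0}))    # 0+x1+3/3 7.0
print(closed, evaluate(closed, {"x1": 6.0}))  # (0+x1+3)/3 3.0
```

In the first case only the last term is divided by 3; in the second the division applies to the whole expression built so far.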

The trail (a partial formula) generated by the first ant released in the search space is stored. Then, the pheromone value corresponding to each term composing the trail is updated according to the rule described below. This creates the conditions for subsequent ants to follow a previous trail, allowing the algorithm to converge. The higher the pheromone on a trail, the more attractive this trail is to the next ants. New ants can also originate new formulas; that is, they do not necessarily follow trails that already exist.

Let $Q_k$ be the quality criterion of the formula $F_k$, where $k$ is the label of the ant. The value of $Q_k$ is used to update the amount of pheromone $\tau$ on the corresponding trail. We adopt the same expression for $Q_k$ suggested by Parpinelli et al. (2002). Thus, $Q_k$ is given by:

$$Q_k = \frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP} \qquad (1)$$

where $TP$ is the number of true positives, $TN$ the number of true negatives, $FN$ the number of false negatives, and $FP$ the number of false positives. Notice that $0 \le Q_k \le 1$.
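As a small illustration, expression (1) can be computed directly from the four confusion counts. The function name and the handling of an empty class (returning zero) are assumptions, not part of the original description.

```python
def quality(tp, fn, tn, fp):
    """Q = TP/(TP+FN) * TN/(TN+FP), i.e. sensitivity times specificity."""
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        return 0.0  # assumption: an undefined ratio counts as zero quality
    return (tp / positives) * (tn / negatives)

# Hypothetical counts: sensitivity 40/50 = 0.8, specificity 90/100 = 0.9.
print(quality(tp=40, fn=10, tn=90, fp=10))
```

A formula that assigns every record to a single class has either no true positives or no true negatives, so its quality is zero.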

The pheromone update at iteration $t$ is performed for each term $i$ composing the formula $F_k$ according to:

$$\tau_i(t + 1) = \tau_i(t)\,(1 + \gamma Q_k) \qquad (2)$$

where $\gamma$ is a positive number. For simplicity, a factor for pheromone evaporation is not taken into account.
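A minimal sketch of update rule (2) follows. Storing the pheromone in a dictionary keyed by the term text is an assumption made for brevity; the values of `q` and `gamma` in the example are arbitrary.

```python
def update_pheromone(tau, formula_terms, q, gamma):
    """Apply tau_i(t+1) = tau_i(t) * (1 + gamma * Q_k) to each term of the formula."""
    for term in formula_terms:
        tau[term] = tau[term] * (1.0 + gamma * q)
    return tau

tau = {"+x1": 1.0, "/3": 1.0, "-x2": 1.0}
update_pheromone(tau, ["+x1", "/3"], q=0.5, gamma=3.0)
print(tau)  # {'+x1': 2.5, '/3': 2.5, '-x2': 1.0}
```

Because there is no evaporation, the pheromone on a term never decreases; terms that are not used by the ant simply keep their previous value.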

We choose the heuristic function $\eta_i$ defined as the ratio between the number of correct classifications performed by the formula $F_k$ after including the term $i$ and the number of database records. Thus, this heuristic function is written as:

$$\eta_i = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$
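The following sketch shows one way the counts in expression (3) could be obtained from a candidate formula. The decision rule (a record is labelled positive when its last coordinate lies above the surface $x_n = F(x_1, \ldots, x_{n-1})$) is an assumption, since the text does not state which side of the surface corresponds to which class; the records are made up for illustration.

```python
def confusion_counts(formula, records, labels):
    """Count TP, TN, FP, FN for a separating formula x_n = F(x_1, ..., x_{n-1})."""
    tp = tn = fp = fn = 0
    for record, is_positive in zip(records, labels):
        *inputs, x_n = record
        predicted = x_n > formula(*inputs)  # assumed decision rule
        if predicted and is_positive:
            tp += 1
        elif predicted:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def heuristic(tp, tn, fp, fn):
    """Expression (3): fraction of correctly classified records."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

records = [(1.0, 3.0), (2.0, 1.0), (3.0, 5.0), (4.0, 2.0)]
labels = [True, False, False, False]
counts = confusion_counts(lambda x1: x1 + 1.0, records, labels)
print(counts, heuristic(*counts))  # (1, 2, 1, 0) 0.75
```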

3 FORMULA CONSTRUCTION

Artificial ants are used to generate a formula, which is modified by attaching pairs consisting of an operator and a variable or constant. The ant $k$ starts with an empty formula (the number zero) and produces the formula $F_k$ by appending terms to the current formula, where $i$ is the label of the new term. Each appended term corresponds to a partial trail walked by the ant, and this fragment receives pheromone according to the quality criterion achieved by the final formula.

The probability $P_i$ of drawing the term $i$ depends on the heuristic function and on the amount of pheromone deposited on the trails. Thus, the probability that the ant $k$ will pick the term $i$ to be attached to the formula can be given by (Dorigo et al., 1996):

$$P_i(t) = \frac{(\tau_i(t))^{\alpha}\,(\eta_i)^{\beta}}{\sum_{j=1}^{N} (\tau_j(t))^{\alpha}\,(\eta_j)^{\beta}} \qquad (4)$$

where $N$ is the number of trail options offered to the ant, and $\alpha$ and $\beta$ are parameters that control the relative weight of the pheromone amount on the trail and of the heuristic function.
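Expression (4) and the subsequent random draw can be sketched as follows. The function names, the uniform fallback when all weights vanish, and the example values are assumptions for illustration only.

```python
import random

def selection_probabilities(tau, eta, alpha, beta):
    """Normalize tau_i^alpha * eta_i^beta over the N offered options."""
    weights = [(t ** alpha) * (h ** beta) for t, h in zip(tau, eta)]
    total = sum(weights)
    if total == 0:
        # Assumption: fall back to a uniform draw if every weight is zero.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def draw_term(probabilities, rng=random):
    """Roulette-wheel draw: return the index of the selected option."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

probs = selection_probabilities(tau=[1.0, 2.0, 1.0], eta=[0.5, 0.5, 1.0], alpha=1, beta=2)
print(probs)  # proportional to 1 : 2 : 4
print(draw_term(probs, random.Random(0)))
```

Larger values of `alpha` make the draw follow the pheromone more closely, while larger values of `beta` favor the options with the best immediate classification rate.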

In order to find the term $i+1$ to be included in $F_k$, the heuristic function $\eta_{i+1}$ is calculated for all trail options (all combinations of an operator and a variable or constant, with or without parentheses). These options form the trail matrix. Then, the probabilities $P_{i+1}$ of the offered options are calculated. The probability $P_{i+1}$ of each candidate term $i+1$ is taken into account in the draw that determines the term to be added to the formula. With this new term, the quality criterion $Q_{i+1}$ is calculated and compared with $Q_i$, which corresponds to the formula without it. If $Q_{i+1} > Q_i$, the term is accepted and added to the formula $F_k$. If not, the term is discarded, a tentative counter is incremented and a new draw is performed. When a term is added to $F_k$, the tentative counter is reset and the process of choosing terms begins again. The parameter $M$ controls how many times the algorithm insists on trying to improve the quality criterion by attaching terms. When the counter reaches $M$, the algorithm records the formula $F_k$ obtained up to this moment and begins a new cycle with the next ant, $k+1$. Another stopping criterion for the construction is the maximum number of terms $L$ that the formula can have.
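To give a sense of the size of the trail matrix, the sketch below enumerates every option as an (operator, operand, closing parenthesis) triple. The tuple layout, the variable naming and the inclusive constant range 1 to 9 are assumptions.

```python
from itertools import product

OPERATORS = ["+", "-", "*", "/"]

def trail_options(n_variables, constants=range(1, 10)):
    """All combinations of operator and variable-or-constant, with or without ')'."""
    operands = [f"x{j}" for j in range(1, n_variables + 1)]
    operands += [str(c) for c in constants]
    return list(product(OPERATORS, operands, [False, True]))

options = trail_options(n_variables=2)
print(len(options))  # 4 operators * (2 variables + 9 constants) * 2 = 88
```

With two input variables the ant therefore chooses among 88 options at each step, and the heuristic function has to be evaluated for each of them.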

After the ant $k$ has completed its formula $F_k$, the pheromone amount $\tau_i$ for each position $i$ of the formula $F_k$ is updated following expression (2).

Then, when the next ant k+1 starts to construct its

own formula, it considers the deposited pheromone

on the trails formed by the previous ants. This pro-

cess is repeated until a maximum number of ants is

reached, speciﬁed by the parameter R.

A simple description of the algorithm is the following:

    Set the initial trail matrix (i = 0)
    Calculate η_i for each term in the trail matrix
    Calculate P_i for each term in the trail matrix
    For each ant k = 1 until R
        Do while tentative counter < M and i < L
            Choose at random the term i + 1 considering P
            Calculate Q_{i+1} for F_k with the randomly picked term
            If Q_{i+1} > Q_i
                Attach the term to F_k
                Increment i
                Reset the tentative counter
                Calculate η for each term in the trail matrix
                Calculate P for each term in the trail matrix
            Else
                Increment the tentative counter
        End Do
        Update the pheromone in each segment forming F_k
        Reset the tentative counter
        Reset the term position (i = 0)
        Reset η
        Calculate P for each term in the trail matrix
    End Loop
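The description above can be turned into a compact, runnable Python sketch. This is one possible reading under several assumptions, not the authors' implementation: a record is classified as positive when $x_n > F(x_1, \ldots, x_{n-1})$, pheromone is stored per term option, a division by zero counts as a misclassification, and the default `L=10` and the toy data are arbitrary.

```python
import random
from itertools import product

OPS = ["+", "-", "*", "/"]

def counts(expr, records, labels):
    """Return (tp, tn, fp, fn); assumed rule: positive when x_n > F(x_1, ...)."""
    code = compile(expr, "<formula>", "eval")
    tp = tn = fp = fn = 0
    for record, positive in zip(records, labels):
        env = {f"x{j + 1}": value for j, value in enumerate(record[:-1])}
        try:
            predicted = record[-1] > eval(code, {"__builtins__": {}}, env)
        except ZeroDivisionError:
            predicted = not positive  # assumption: undefined value = misclassified
        if predicted and positive:
            tp += 1
        elif predicted:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def quality(tp, tn, fp, fn):
    """Expression (1); zero when a class is absent (assumption)."""
    if tp + fn == 0 or tn + fp == 0:
        return 0.0
    return (tp / (tp + fn)) * (tn / (tn + fp))

def accuracy(tp, tn, fp, fn):
    """Expression (3)."""
    return (tp + tn) / (tp + tn + fp + fn)

def extend(expr, option):
    op, operand, close = option
    return f"({expr}){op}{operand}" if close else f"{expr}{op}{operand}"

def formula_miner(records, labels, R=15, M=5, L=10, alpha=4, beta=5, gamma=3, seed=0):
    rng = random.Random(seed)
    n_inputs = len(records[0]) - 1
    operands = [f"x{j}" for j in range(1, n_inputs + 1)] + [str(c) for c in range(1, 10)]
    options = list(product(OPS, operands, [False, True]))  # the trail matrix
    tau = {option: 1.0 for option in options}

    def weights_for(expr):
        top = max(tau.values())  # rescaling avoids overflow, probabilities unchanged
        return [(tau[o] / top) ** alpha
                * accuracy(*counts(extend(expr, o), records, labels)) ** beta
                for o in options]

    best_expr, best_q = "0", quality(*counts("0", records, labels))
    for _ in range(R):  # one ant per pass
        expr, terms, attempts = "0", [], 0
        q = quality(*counts(expr, records, labels))
        weights = weights_for(expr)
        while attempts < M and len(terms) < L and any(weights):
            option = rng.choices(options, weights=weights, k=1)[0]
            candidate = extend(expr, option)
            q_new = quality(*counts(candidate, records, labels))
            if q_new > q:
                expr, q, attempts = candidate, q_new, 0
                terms.append(option)
                weights = weights_for(expr)
            else:
                attempts += 1
        for option in terms:  # expression (2), no evaporation
            tau[option] *= 1.0 + gamma * q
        if q > best_q:
            best_expr, best_q = expr, q
    return best_expr, best_q

# Toy data (made up for illustration): positive when x2 > x1 + 1.
records = [(1.0, 3.0), (2.0, 4.5), (3.0, 5.0), (1.0, 1.5), (2.0, 2.0), (3.0, 3.5)]
labels = [True, True, True, False, False, False]
print(formula_miner(records, labels, R=5, seed=1))
```

The function returns the best formula found by any ant together with its quality. Since the search is stochastic, different seeds may give different formulas.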

4 NUMERICAL EXPERIMENT

Formula Miner was tested on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which contains 569 cases, 2 classes and 30 numerical attributes. According to Mangasarian et al. (1995), 3 of these 30 attributes are the most relevant for diagnosis. Thus, we selected them to search for a mathematical relationship that separates the malignant cases of breast cancer from the benign ones. These attributes are: mean texture ($x$), worst smoothness ($y$), and worst area ($z$) of the analyzed cells (Mangasarian et al., 1995). Because these attributes differ in scale, it was necessary to include scale factors. Thus: $x_1 \equiv x$, $x_2 \equiv 10y$, $x_3 \equiv z/10$.

Then, Formula Miner was applied to search for a mathematical relationship among these three attributes, in order to correctly classify malignant and benign cases. The goal is to find a formula that expresses one attribute in terms of the other two. The attributes are related by the function $x_3 = F(x_1, x_2)$.

The accuracy of the algorithm was evaluated using ten-fold cross-validation (Stone, 1974). The database is divided into ten parts of approximately equal size. One part is set aside and the algorithm is applied to the other nine parts. The resulting formulas are tested on the part that was removed from the database. In this procedure, every case is used exactly once for testing and nine times to run Formula Miner.
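A ten-fold split of this kind can be sketched with the standard library alone. The shuffling, the seed and the function name are assumptions; the total of 569 records corresponds to the 512 + 57 records reported for each round.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train, test) index lists; each record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

sizes = [(len(train), len(test)) for train, test in k_fold_indices(569)]
print(sizes)  # nine rounds of (512, 57) and one of (513, 56)
```

Since 569 is not a multiple of ten, one of the ten parts is one record smaller than the others.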

The accuracy rate of each evaluation is defined as the ratio of the number of correctly classified cases to the total number of tested cases, as in the heuristic function given by expression (3). The final accuracy rate is the arithmetic mean of the accuracy rates of the ten rounds, reported together with the corresponding standard deviation.

We chose the following parameter values: maximum number of ants $R = 15$; maximum number of attempts $M = 5$; $\alpha = 4$; $\beta = 5$; $\gamma = 3$. The number of records used in each round to run the algorithm was 512; the number of records used for validation was 57.

Our results were compared with those obtained by applying well-known algorithms to the same database (the same 3 attributes), all using a ten-fold cross-validation procedure. The algorithms and their respective accuracies are: Ant Miner (Parpinelli et al., 2002): (96.0 ± 1.0)%; C4.5 (Quinlan, 1993): (95.0 ± 0.3)%; Formula Miner: (81.4 ± 7.6)%; MSM-T (Mangasarian et al., 1995): 97.5%.

To comment on these results properly, we must note the differences between Formula Miner and the other algorithms in approach and purpose. MSM-T (Multisurface Method Tree) looks for multiple planes (which cannot form a function), while Formula Miner looks for a single mathematical relationship that defines just one continuous surface to classify the diagnosis. In C4.5 and Ant Miner, the attributes were discretized into ranges of values, and their results depend on the adopted ranges. Formula Miner deals with continuous attributes.

The algorithms C4.5 and MSM-T are based on decision trees; Ant Miner is based on rule construction. In fact, these three algorithms accept discontinuities in a possible classification rule.

The goal of Formula Miner is to obtain a continuous function that best classifies the cases. Our best formula achieved an accuracy of 92.98%. It is given by:

= x

+ x

+ 1)/x

+(x

+ 17)/x

+ 1/x

+ 5x

(5)

Observe that the performance of this formula is comparable to that of the other data mining algorithms.

5 CONCLUSIONS

This paper presented a way of extracting analytical formulas from a database, inspired by ant colony behavior. The aim is to separate data in a multidimensional search space by a continuous function. These formulas, written in symbolic form, can be useful for classification and for developing analytical mathematical models from a database.

ACKNOWLEDGEMENTS

LHAM is partially supported by CNPq.

REFERENCES

Beckers, R., Deneubourg, J. L., & Goss, S. (1992). Trails and U-turns in the selection of the shortest path by the ant Lasius niger. Journal of Theoretical Biology, 159, 397-415.

Bonabeau, E., Dorigo, M., & Theraulaz, G. (1999). Swarm intelligence: From natural to artificial systems. New York: Oxford University Press.

Deneubourg, J. L., Pasteels, J. M., & Verhaeghe, J. C. (1983). Probabilistic behaviour in ants: a strategy of errors? Journal of Theoretical Biology, 105, 259-271.

Dorigo, M., Maniezzo, V., & Colorni, A. (1996). The ant system: optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics B, 26, 29-41.

Dorigo, M., & Gambardella, L. M. (1997). Ant colonies for the travelling salesman problem. BioSystems, 43, 73-81.

Gómez, J. F., Khodr, H. M., de Oliveira, P. M., Ocque, L., Yusta, J. M., Villasana, R., & Urdaneta, A. J. (2004). Ant colony system algorithm for the planning of primary distribution circuits. IEEE Transactions on Power Systems, 19, 996-1004.

Gambardella, L. M., Taillard, E., & Dorigo, M. (1997). Ant colonies for QAP. Technical Report IDSIA97-4, Lugano, Switzerland.

Mangasarian, O. L., Street, W. N., & Wolberg, W. H. (1995). Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43, 570-577.

Mirabedini, S. J., Teshnehlab, M., Shenasa, M. H., & Rahmani, A. M. (2008). Flar: An adaptive fuzzy routing algorithm for communications networks using mobile ants. Cybernetics and Systems, 39, 684-702.

Parpinelli, R. S., Lopes, H. S., & Freitas, A. A. (2002). Data mining with an ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation, 6, 321-332.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.

Rahhal, J. S., & Abu-Al-Nadi, D. I. (2007). A general configuration antenna array for multi-user systems with genetic and ant colony optimization. Electromagnetics, 27, 413-426.

Rajesh, J., Gupta, K., Kusumakar, H. S., Jayaraman, V. K., & Kulkarni, B. D. (2001). Dynamic optimization of chemical processes using ant colony framework. Computers & Chemistry, 25, 583-595.

Samrout, M., Yalaoui, F., Chatelet, E., & Chebbo, N. (2005). New methods to minimize the preventive maintenance cost of series-parallel systems using ant colony optimization. Reliability Engineering and System Safety, 89, 346-354.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36, 111-147.
