Exploring the Impact of Different Classiﬁcation Quality Functions in an

ACO Algorithm for Learning Neural Network Structures

Khalid M. Salama

and Ashraf M. Abdelbar

School of Computing, University of Kent, Canterbury, U.K.

Mathematics and Computer Science Department, Brandon University, Brandon, Canada

Keywords:

Ant Colony Optimization (ACO), Machine Learning, Pattern Classiﬁcation, Neural Networks.

Abstract:

Although artiﬁcial neural networks can be a very effective classiﬁcation method, one of the drawbacks of

their use is the need to manually prescribe the neural network topology. Recent work has introduced the

ANN-Miner algorithm, an Ant Colony Optimization (ACO) technique for optimizing the topology of arbitrary

FFNN’s, i.e. FFNN’s with multiple hidden layers, layer-skipping connections, and without the requirement

of full-connectivity between successive layers. In this paper, we explore the use of several classiﬁcation

quality evaluation functions in ANN-Miner. Our experimental results, using 30 popular benchmark datasets,

identify several quality functions that signiﬁcantly improve on the simple Accuracy quality function that was

previously used in ANN-Miner.

1 INTRODUCTION

Feed-ForwardNeural Networks (FFNN) are a popular

and effective pattern classiﬁcation technique. How-

ever, one drawback of FFNN’s is the need to manually

prescribe the neural network topology. Even if one’s

attention is restricted to three-layer FFNN’s with full-

connectivity between successive layers and no layer-

skipping connections, then one only needs to select

the number of neurons in the single hidden layer. If

one allows for arbitrary feed-forward topologies, then

optimizing the network topology becomes more chal-

lenging.

Recent work has introduced the ANN-Miner algo-

rithm (Salama and Abdelbar, 2014), an Ant Colony

Optimization (ACO) technique for optimizing the

topology of arbitrary FFNN’s, i.e. FFNN’s with mul-

tiple hidden layers, layer-skipping connections, and

without the requirement of full-connectivity between

successive layers. In this paper, we explore the use

of several classiﬁcation quality evaluation functions

in ANN-Miner. Our experimental results, using 30

popular benchmark datasets, identify several quality

functions that signiﬁcantly improve on the simple Ac-

curacy quality function that was used in the original

work on ANN-Miner (Salama and Abdelbar, 2014).

2 FEED-FORWARD NEURAL

NETWORKS

Feed-forward neural networks (FFNN) are widely ac-

knowledged as being one of the most popular methods

for pattern classiﬁcation. The most common FFNN

topology is a three-layer topology in which neurons

are arranged in an input layer, a hidden layer, and an

output layer. Commonly, there are connections be-

tween every neuron in a layer to all the neurons in the

succeeding layer.

Most commonly, each neuron i is a simple com-

putational unit which accepts r inputs o

,...,o

, and

produces a single output o

net

∑

i=1

+ θ

, (1)

= σ(net

) = a· tanh(b· net

) , (2)

where each input o

is the output of a neuron in the

previous layer, the weight w

represents a real-valued

weight between neuron j and neuron i, θ

represents

a weight associated with neuron i itself called the

neuron’s “self-bias,” and σ is an activation function

that is most commonly selected to be the sigmoidally-

shaped logistic function shown in Eq. (2).

A FFNN with n input neurons and m output neu-

rons computes a mapping R

7→ R

. In a classiﬁca-

tion problem, the training set T consists of a number

137

Salama K. and Abdelbar A..

Exploring the Impact of Different Classiﬁcation Quality Functions in an ACO Algorithm for Learning Neural Network Structures.

DOI: 10.5220/0005031301370144

In Proceedings of the International Conference on Evolutionary Computation Theory and Applications (ECTA-2014), pages 137-144

ISBN: 978-989-758-052-9

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

of labelled patterns. Each training pattern x is an n-

dimensional input vector of real values, and the label

is an m-dimensional output vector y. The k-th ele-

ment in y that represents the class of pattern x is set to

1, while the other (m− 1) elements in y are set to 0.

The aim is to train a FFNN, given the training set T ,

to be able to correctly classify (predict the label of)

a new unlabelled pattern. For that, each training pat-

tern x is, in turn, applied to the input layer of the net-

work, the signal is allowed to propagate through the

network, and the output of the network, denoted y

′

is compared to the desired output y to determine the

error of the network for that pattern, denoted E

. A

common error function is the simple Sum of Squared

Error (SSE) function, deﬁned as:

∑

i=1

(y− y

′

)

, (3)

where the total error is simply: E =

∑

Perhaps the most popular neural network training

algorithm is the gradient descent based Backward Er-

ror Propagation (BP) algorithm which is based on re-

peatedly applying the training set to the network (each

full pass through the training set is called an epoch),

computing the error E, and then modifying each ele-

ment of the weight vector according to: ∆w

= −η

∂E

∂w

where η is the learning rate parameter. Commonly,

FFNN applications use a simple three-layer network

topology, with full connectivity between layers.

3 ANT COLONY OPTIMIZATION

Swarm intelligence is a branch of soft computing

in which a wide variety of biological collective be-

haviours are applied to solve optimization problems.

Ant Colony Optimization (ACO) was deﬁned as a

meta-heuristic for combinatorial optimization prob-

lems (Dorigo and St¨utzle, 2004), inspired by the be-

haviour of natural ant colonies. The basic principle of

ACO is that a population of artiﬁcial ants cooperate to

ﬁnd the best path in a graph, analogously to the way

that natural ants cooperate to ﬁnd the shortest path be-

tween two points like their nest and a food source.

In ACO, each artiﬁcial ant constructs a candidate

solution to the target problem, represented by a com-

bination of solution components in the search space.

Ants cooperate via indirect communication, by de-

positing pheromone on the selected solution com-

ponents for a candidate solution. The amount of

pheromone deposited is proportional to the quality of

that solution, which inﬂuences the probability with

which other ants will use that solution’s components

when constructing their solution. This contributes to

the global search aspect of ACO algorithms. The pop-

ulation of ants searches for the best solution in par-

allel, thus exploring possibly different regions of the

search space at each iteration of the algorithm. This

increases the chances of ﬁnding a near-optimal solu-

tion in the search space.

ACO has been successful in tackling the clas-

siﬁcation problem of data mining. A number of

ACO-based algorithms have been introduced in the

literature with different classiﬁcation learning ap-

proaches. Ant-Miner (Parpinelli et al., 2002) is the

ﬁrst ant-based classiﬁcation algorithm, which dis-

covers a list of classiﬁcation rules in the form of

IF-Conditions-Then-Class.

The algorithm has

been followed by several extensions in (Parpinelli

et al., 2002; Salama et al., 2011; Salama et al., 2013;

Otero et al., 2009; Otero et al., 2013).

ACDT (Boryczka and Kozak, 2010; Boryczka and

Kozak, 2011) and Ant-Tree-Miner (Otero et al., 2012)

are two different ACO-based algorithms for inducing

decision trees for classiﬁcation. Salama and Freitas

(2013a; 2013b) have recently employed ACO to learn

various types of Bayesian network classiﬁers.

As for learning neural networks, the ACO meta-

heuristic was utilized in two works. Liu et al. (2006)

proposed ACO-PB, a hybrid of the ant colony and

back-propagation algorithms to optimize the network

weights. It adopts ACO to search the optimal combi-

nation of weights in the solution space, and then uses

the BP algorithm to further ﬁne-tune the ACO solu-

tion. Blum and Socha applied ACO

, an ant colony

optimization algorithm for continuous optimization

(Socha and Dorigo, 2008; Liao et al., 2014), to

train feed-forward neural networks (Socha and Blum,

2007).

4 THE ANN-Miner ALGORITHM

As discussed previously, the three-layer fully-

connected FFNN topology is the most commonly

used FFNN topology. ANN-Miner (Salama and Ab-

delbar, 2014), a recently proposed ACO algorithm for

learning FFNN topologies, allows connections to be

generated between hidden neurons and other hidden

neurons — under the restriction that the topology re-

mains acyclic — as well as direct connections be-

tween input neurons and output neurons. This allows

producing networks with a variable number of layers,

as well as arbitrary connections that skip over layers.

As for the problem at hand, a candidate solution

is a network topology, and the solution components

are the possible connections (between input and hid-

den neurons, between hidden and output neurons, be-

ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications

138

tween input and output neurons, and connections be-

tween different hidden neurons). Each potential con-

nection c = i → j, connecting between neurons i and

j, has two solution components in the construction

graph: D

true

, representing the decision to include con-

nection i → j in the current candidate topology be-

ing constructed by the ant, and D

false

, representing

the decision not to include the connection. There-

fore, the construction graph can be represented as a

two-dimensional 2× |C| array, where 2 refers to the

Boolean solution components, and C is the set of the

available connections.

The number of input neurons and output neurons

depends on the dataset and the representation that is

used for the attributes of the dataset, while the total

number of hidden neurons is an external user-supplied

parameter. Suppose the total number of neurons is N,

with N

input neurons, N

output neurons, and N

po-

tential hidden neurons. N

× N

, N

× N

and N

× N

are the number of available connections between in-

put and hidden neurons, hidden and output neurons,

and input and output neurons, respectively, in the total

available number of connections |C|. This means that,

for instance, an ant can select (or unselect) a connec-

tion between any input neuron and any hidden neuron.

The same applies for the two other connection types.

However, the available connections between the

hidden neurons N

are deﬁned as follows. In order to

ensure that the topology is acyclic, we impose the re-

striction that i → j is not available if i ≥ j. In other

words, each hidden neuron has a numeric index, and

we only allow connections from a given hidden neu-

ron n

to a higher-numbered neuron n

. It is well-

known that any directed acyclic graph is isomorphic

to a graph where the nodes are lexicographically or-

dered and for all arcs (u, v) in the graph u precedes v

in the lexicographic order. Hence, the number of the

available connection between the N

hidden neurons

is: (N

− 1) + (N

− 2) + ... + 1+ 0 = N

− 1)/2.

The overall process of ANN-Miner is illustrated in

Algorithm 1. In the initialization step of ANN-Miner

(line 3), the amount of pheromone assigned to each

solution component D

– where a can be true or false

– in the construction graph is initialized with the value

0.5. Hence, for each connection c, the probability of

including i → j (i.e. selecting D

true

) in the topology

equals the probability of not including i → j (i.e. se-

lecting D

false

In the inner for-loop (lines 6-12), each ant

in the

colony creates a candidate solution NN

, i.e. a com-

plete neural network (line 7). Then the quality of the

constructed solution is evaluated (line 8). The best

solution NN

tbest

produced in the colony is selected to

update the pheromone trail according to the quality

Algorithm 1: Pseudo-code of ANN-Miner.

1: Begin

2: NN

bsf

= φ; t = 1;

3: InitializePheromone();

4: repeat

5: NN

tbest

= φ; Q

tbest

= 0;

6: for i = 1 → colony

size do

7: NN

= ant

.CreateSolution();

8: Q(NN

) = EvalQuality(NN

);

9: if Q(NN

) > Q(NN

tbest

) then

10: NN

tbest

= NN

;

11: end if

12: end for

13: U pdatePheromone();

14: if Q(NN

tbest

) > Q(NN

bsf

) then

15: NN

bsf

= NN

tbest

;

16: end if

17: t = t + 1;

18: until t = max

iterations or Convergence();

19: NN

final

= PostProcessing(NN

bsf

);

20: return NN

final

;

21: End

of its solution Q(NN

tbest

). After that, the algorithm

compares the iteration-best solution NN

tbest

with the

best-so-far solution NN

bsf

(the if statement in lines

14-16) to keep track of the best solution found so far

during the algorithm execution.

After the outer repeat-until loop terminates (lines

4-18), the best-so-far neural network undergoes a

post-processing step to produce the ﬁnal neural net-

work NN

final

to be returned by the algorithm. Ba-

sically, the algorithm learns the ﬁnal weights of the

connections in the neural network NN

bsf

— which

represents the best topology found during the ACO

search process. The standard BP procedure, described

in Section 2, is used to train NN

bsf

and learn its ﬁ-

nal weights. The only difference is that, instead of

having BP work on the conventional three-layer fully-

connected network topology, it works on the arbitrary

topologies constructed by ANN-Miner.

The process of an ant creating a new candidate so-

lution (neural network) is described in Algorithm 2.

The procedure starts with an empty (edge-less) neu-

ral network (line 2) to be constructed throughout the

procedure. In addition, an empty array SLN, which

represents the ant trail in the construction graph and

its selected solution components, is initialized. This

data structure is necessary for the pheromone update

procedure, as described later. For each connection c

in the available set of connections C, the ant selects

to decide whether to include this connection in the

candidate network NN or not (line 4) — by either se-

ExploringtheImpactofDifferentClassificationQualityFunctionsinanACOAlgorithmforLearningNeuralNetwork

Structures

139

lecting solution component D

true

or D

false

. The selec-

tion of the solution component at each step is based on

the following probabilistic state transition formula

p(D

) =

τ(D

)

τ(D

true

) + τ



false



, (4)

where p(D

) is the probability of selecting decision

for connection c, and τ(D

) is the current amount

of pheromone associated with D

. Every selected so-

lution component D

(where a = true or a = false)

is added to the data structure SLN (line 5). How-

ever, only if D

= D

true

, that is, the ant selected the

decision to include connection c in the topology, the

corresponding connection (i → j)

is appended to the

candidate network NN (the if statement in lines 6-

8). After the ant visits all the available connections

in the construction graph and performs the include-

or-not decision, the network topology of NN is now

complete.

Algorithm 2: Pseudo-code of solution creation.

1: Begin CreateSolution()

2: NN ← φ ; SLN ← φ;

3: for c = 1 → |C| do

4: D

= SelectDecisionComponent();

5: SLN = SLN ∪ D

;

6: if D

== D

true

then

7: NN = NN ∪ (i → j)

;

8: end if

9: end for

10: InitNetwork(NN,NN

bsf

);

11: TrainNeuralNetwork(NN,T

);

12: return NN;

13: End

The weights of the neural network NN are then

initialized (line 10) as follows. For each connection

i → j in the topology of the NN network, if the con-

nection i → j is also present in the topology of NN

bsf

— the best trained neural network encountered so far

— then the weight w

associated with the connection

i → j is initialized with the value that it has in the

trained network NN

bsf

. On the other hand, if the con-

nection i → j is not present in N

bsf

, then the weight

is randomly initialized.

The idea here is to start the training of each

newly-created topology by building on the weights

of the best network we have generated so far NN

bsf

If the new network, after training, performs better

than NN

bsf

, it will replace NN

bsf

and its connection

weights will be used as initial values for performing

BP on subsequent networks. Of course, some connec-

tions in NN may not exist in NN

bsf

, in which case,

they are randomly initialized, as mentioned above.

Further, some connections in NN

bsf

may not exist

in NN. Such a diversion in the topologies maintains

the exploration aspect of the algorithm, while the ex-

ploitation aspect is realized by building on the best

weights learned in previous iterations.

After NN is initialized, it is trained (line 11) using

the BP procedure, with a relatively large learning rate

and a small number of epochs (see Table 1 for param-

eter settings). These are intended as “quick and dirty”

parameter settings meant to allow us to obtain a com-

plete neural network and evaluate its pattern classiﬁ-

cation quality. The classiﬁcation quality evaluation is

discussed in the next section.

5 EXPLORING DIFFERENT

QUALITY FUNCTIONS

A key objectiveof a classiﬁcation algorithm is to learn

models with good generalization, that is, models that

are able to accurately predict the class labels of new

unknown patterns. Overﬁtting occurs when the in-

duced model reﬂects good classiﬁcation performance

(ﬁt) on the training (in-sample) data used in the learn-

ing process, yet shows bad predictive performance

(generalization) involving new/testing data.

Therefore, we split the training set T at the be-

ginning of the algorithm into two mutually exclusive

parts: 1) the learning set T

, which contains 80% of

the training set and is used to learn the neural network

topology and weights (line 11, Algorithm 2); and 2)

the validation set T

, which contains 20% of the train-

ing set and is used to evaluate the quality of the model

(line 8, Algorithm 1).

In this paper, we investigate several quality eval-

uation functions to be used in line 8 of Algorithm 1.

First, let us establish some notation. Let m denote the

number of classes, ˆc denote the true (correct) class

for a given pattern x (the element in the label vector y

where the value=1), and c denote the class that is pre-

dicted by the neural network NN. We will assume that

the output of the network is an m-dimensional vector

′

= o

,...,o

of real values. The output vector of

NN can be transformed into a vector p of class prob-

ability scores through a simple normalization:

= o

∑

j=1

. (5)

Hence, the predicted class c will be the label k of the

output of the NN with the highest probability score

. We will use the notation f(NN|x) to refer to the

value of the classiﬁcation measure f on NN givenpat-

tern x, and Q

(NN|T

) as the quality of NN, using

ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications

140

classiﬁcation measure f, on the full validation set T

Note that Q

(NN|T

) is the actual pheromone amount

to be deposited in the pheromone update step (line 13,

Algorithm 1); the higher the value of Q

(NN|T

), the

better the quality of NN. Recall that the amount of

pheromone deposited is proportional to Q(NN

tbest

as follows:

τ(D

) = τ(D

) + [τ(D

) × Q(NN

tbest

)] ∀D

∈ |SLN

tbest

(6)

Note that, for quality function f, if Q

is computed to

be less than zero, it is adjusted to zero, in order not to

allow negative pheromone values.

Example 1 — suppose we have a pattern x where

m = 3 and ˆc = 1. Consider three candidate NNs with

the following probability score vectors given pattern

x: NN

(x) = (0.9,0.1,0), NN

(x) = (0.6,0.4,0) and

(x) = (0.6,0.2,0.2). This example will be used

in the following subsections.

5.1 Accuracy

A simple, and perhaps the most widely used, classiﬁ-

cation measure is accuracy. This is the quality mea-

sure that was used in previous work on ANN-Miner

(Salama and Abdelbar, 2014). For a given pattern x,

Acc(NN|x) =



1 if ˆc = c,

0 if ˆc 6= c.

(7)

For an entire validation set T

Acc

(NN|T

) =

∑

x∈T

Acc(NN|x). (8)

The deﬁciency of the accuracy measure is that, in the

three NNs of Example 1, Acc(x) will be equal to 1 for

all three probability vectors. However, it is obvious

that NN

should receive a better quality preference

than NN

and NN

, since it produces a higher prob-

ability for the true class.

5.2 Class Probability Error

An alternative to the accuracy measure is the true

Class Probability Error (CPE) measure. For a given

pattern x,

CPE(NN|x) = 1 − p

ˆc

. (9)

Let us again consider Example 1. Here,

CPE(NN

|x) = 0.1, while CPE(NN

|x) =

CPE(NN

|x) = 0.4. In this case, CPE prefers

over NN

and NN

, but does not differentiate

between NN

and NN

For an entire validation set T

CPE

(NN|T

) = 1−

∑

x∈T

CPE(NN|x). (10)

5.3 Variations of the CPE Measure

We consider two variations of CPE that differ primar-

ily in their handling of outlier patterns. First, CPE

deﬁned as follows. For a single pattern x,

CPE

(NN|x) = (1− p

ˆc

)

. (11)

In this paper, we focus on the case where k = 2. Sup-

pose we have a dataset T

consisting of three patterns

with individual pattern error values of: 0.1, 0.2, 0.3,

and another dataset T

with pattern error values of:

0.1, 0.15, 0.35. Under CPE, these two datasets would

have equal error values. But, under CPE

, the second

dataset would have a higher error value (0.155 ver-

sus 0.14). Thus, CPE

is less tolerant of outliers than

CPE; a small number of outlier patterns can dominate

the CPE

function much more easily than the CPE

function.

Another variation of CPE is CPE

which is even

more tolerant of outliers than CPE. For a single pat-

tern x, CPE

(NN|x) = CPE(NN|x). However, for an

entire validation set:

CPE

(NN|T

) = 1− median{CPE(x) : x ∈ T

(12)

For the two dataset scenarios described above, T

(with pattern errors 0.1, 0.2, 0.3) would have a smaller

error value under Q

CPE

than T

(with pattern errors

0.1, 0.15, 0.35), because Q

CPE

would simply com-

pare 0.2 to 0.15.

5.4 Quadratic Loss Function

The Quadratic Loss Function is another widely used

error measure. For a given pattern x:

QLF(NN|x) = (1− p

ˆc

)

∑

k:k6= ˆc

)

. (13)

In Example 1, the three probability vectors would

have QLF values of: 0.02, 0.32, and 0.24. Thus, the

QLF error measure would prefer NN

, followed by

, followed by NN

. Thus, not only does QLF

favour the models that produce a higher probability

for the true class, but it also favours the models that

produce the lowest probabilities for the other classes.

Note that, for a two-class problem (i.e. m = 2),

QLF(NN|x) ≡ 2·CPE

(NN|x) . (14)

For an entire validation set T

QLF

(NN|T

) = 1 −

∑

x∈T

QLF(NN|x). (15)

ExploringtheImpactofDifferentClassificationQualityFunctionsinanACOAlgorithmforLearningNeuralNetwork

Structures

141

5.5 Cross Entropy

For a given pattern x, the cross entropyCE measure is

deﬁned as:

CE(NN|x) = −

∑

k=1

ln p

, (16)

where y is the target vector. Since y

ˆc

= 1 and y

= 0

for all k 6= ˆc, Eq. (16) reduces to:

CE(NN|x) = −ln p

ˆc

. (17)

Thus, CE is somewhat similar to CPE: both are

monotonically decreasing functions of p

ˆc

. They differ

only in their respective response curves as p

ˆc

varies

from 0 to 1.

For an entire validation set T

(NN|T

) = 1 −

∑

x∈T

CE(NN|x). (18)

5.6 Bayesian Information Reward

For a given pattern x, we use a variation of Bayesian

Information Reward BIR, deﬁned as:

BIR(NN|x) =

∑

c=1

(NN|x) (19)

(NN|x) =

(

1−

log(p

)

log(p

′

)

if c = ˆc (reward),

log(1−p

)

log(1−p

′

)

if c 6= ˆc (penalty).

(20)

where p

′

represents the prior probability of class c,

which is the ratio of the number of patterns in the

learning set with class label c to the total number of

patterns in the learning set, and m is the total number

of the classes. Note that the ﬁrst branch in the con-

ditional Equation (24) is the reward value, where the

predicted class c is the same as the true (correct) class

ˆc, while the second branch is the penalty value, where

the predicted c is not the same as the true class ˆc.

Not only does the Bayesian Information Reward

measure respect the probability of the true class p

ˆc

, it

also respects the prior probability p

′

of the predicted

class in the dataset. This makes BIR more robust to

class imbalance situations, where one (or more) class

values have high occurrence in a given dataset com-

pared to the other values. That is, the lower the fre-

quency of the correctly predicted class in the dataset,

the higher the reward. Similarly, the higher the fre-

quency of the misclassiﬁed class in the dataset, the

higher the penalty.

For an entire validation set T

BIR

(NN|T

) =

∑

x∈T

[φ

+ BIR(NN|x)/φ

(21)

Table 1: Parameter settings used in experiments.

No. Parameter Value

max iterations

1000

colony size

conv iterations

Ant BP Learning Rate

0.05

Ant BP Epochs

3L-BP Learning Rate

0.01

3L-BP Epochs

1000

The parameters φ

and φ

are used to adjust the actual

amount of pheromone to be deposited. The ﬁrst pa-

rameter makes sure that the Q

BIR

value is greater than

0, while the second parameter scales the Q

BIR

value.

In our experiments, we set φ

= φ

= 50/m.

6 COMPUTATIONAL RESULTS

The experiments were carried out using the well-

known stratiﬁed 10-times 10-fold cross-validation

procedure, which works as follows. First, the target

dataset is divided into 10 mutually exclusive parti-

tions (or folds), with approximately the same number

of patterns in each partition. Then, for each of the 7

different quality measures used for candidate-solution

evaluation and pheromone updating in this work, a

version of ANN-Miner using that measure is run 10

times, where each time a different partition is used as

the test set and the other 9 partitions are merged and

used as the training set T for the algorithm. Note that

T is itself further divided during algorithm execution

into T

and T

as discussed in Section 5.

The predictive performance associated with each

quality measure is computed as the average value

of the accuracy on the test set across the 10 runs.

In addition, we ran ANN-Miner with each of the 7

quality measures 10 times – using a different ran-

dom seed to initialize the search each time – for each

cross-validation fold. Thus, the accuracy associated

with each quality measure is actually averaged over

100 values (10 cross-validation folds times 10 runs

per fold). We also report the results for the stan-

dard three-layer fully-connectedNeural Network with

Back-Propagation (3L-BP) as a baseline.

The performance of classiﬁcation quality mea-

sures was evaluated using 30 public-domain datasets

from the well-known University of California at

Irvine (UCI) dataset repository. The parameter set-

tings of ANN-Miner used are shown in Table 1. The

number of the hidden neurons is set equal to the num-

ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications

142

Table 2: Predictive Accuracy Results.

Dataset 3L-BP ANN-Acc ANN-CPE

ANN-CPE ANN-CPE

ANN-QLF ANN-CE ANN-BIR

balance

96.50 91.33 96.83 98.67 94.67 97.33 96.67 99.10

breast-l

72.29 68.80 69.09 68.29 70.22 69.84 71.64 70.84

breast-p

68.26 75.21 76.79 75.76 76.32 77.29 77.29 77.61

breast-t

32.64 57.55 67.03 60.73 58.36 63.46 57.73 64.38

breast-w

93.86 95.43 94.74 96.14 95.79 96.32 94.91 96.49

car

90.29 98.19 98.66 99.01 97.78 98.31 97.84 96.83

credit-a

84.35 82.75 85.76 83.04 84.49 85.80 85.19 85.76

credit-g

74.00 71.90 74.33 71.00 74.40 74.67 74.33 74.40

ecoli

79.53 84.86 83.67 86.08 86.33 86.94 85.46 85.18

glass

46.30 48.46 58.99 57.56 60.39 52.38 54.68 54.70

hay

60.01 75.45 74.26 77.67 76.43 77.60 71.26 77.87

heart-c

57.46 55.43 58.43 55.47 56.46 58.41 58.08 57.19

hepatitis

83.79 81.92 79.92 81.88 80.00 82.58 83.25 81.34

horse

76.04 78.94 81.67 79.77 82.01 79.14 81.42 77.32

ionosphere

89.67 93.38 92.53 93.40 91.10 91.68 91.08 93.54

iris

87.28 90.67 93.95 92.57 94.62 95.28 93.95 92.54

liver

57.40 64.08 67.49 64.07 64.38 64.86 64.98 67.31

monks

54.25 42.20 77.36 77.70 70.80 69.63 67.33 71.36

parkinsons

75.78 75.13 85.62 85.68 83.03 85.71 82.03 82.53

pima

55.71 42.86 76.20 75.91 76.56 76.17 75.51 74.60

pop

81.11 81.85 52.68 50.54 68.21 68.39 65.71 56.15

s-heart

94.16 92.74 83.71 81.85 83.70 83.33 84.07 81.90

segmentation

70.56 93.16 93.84 93.21 93.16 92.66 93.68 94.35

thyroid

86.10 89.78 90.37 88.85 89.80 89.78 87.99 92.30

transfusion

70.56 72.60 74.76 73.66 72.75 74.45 72.58 75.67

ttt

76.63 98.00 98.31 98.00 98.21 98.84 95.26 98.36

vehicle

64.90 60.90 75.00 72.69 71.63 71.98 72.11 71.27

voting

93.89 94.90 94.12 94.86 92.91 94.59 94.52 94.75

wine

94.41 94.38 95.52 93.86 94.38 96.04 94.97 94.44

zoo

81.25 97.50 96.50 96.07 96.07 94.64 96.07 96.75

Average Rank 6.5 5.7 3.5 4.6 4.4 3.3 4.6 3.5

ber of input attributes plus the number of class values

(output units) for each dataset. Note that the learning

rate and the number of epochs used in BP in each ant

solution creation step (line 11, Algorithm 2) are in-

dicated in parameters 4 and 5, respectively, while the

learning rate and the number of epochs used in the BP

post-processing step (line 19, Algorithm 1) are indi-

cated in parameters 6 and 7, respectively.

Table 2 shows the average predictive accuracy re-

sults of the ANN-Miner algorithm with the 7 different

classiﬁcation quality functions, as well as the conven-

tional BP algorithm, where the best result obtained

for each algorithm is shown in boldface. The results

reported are the average of the 100 runs of the strati-

ﬁed 10-times 10-fold cross-validation procedure. The

last row of the table shows the average ranking of

each algorithm. The average rank for a given algo-

Table 3: Results of Friedman statistical signiﬁcance test

with Holm post hoc test.

Comparison p Holm

3L-BP vs. ANN-QLF 5.5E-07 0.00179

3L-BP vs. ANN-BIR 1.6E-06 0.00185

3L-BP vs. ANN-CPE

1.8E-06 0.00192

ANN-Acc vs. ANN-QLF 2.3E-04 0.00200

ANN-Acc vs. ANN-BIR 5.0E-04 0.00208

ANN-Acc vs. ANN-CPE

5.6E-04 0.00217

3L-BP vs. ANN-CPE

9.9E-04 0.00227

3L-BP vs. ANN-CPE 0.0020 0.00238

3L-BP vs. ANN-CE 0.0024 0.00250

rithm g is obtained by ﬁrst computing the rank of g on

each dataset individually, with a rank of 1 represent-

ing the best performance and a rank of 8 representing

the worst performance. The individual ranks are then

ExploringtheImpactofDifferentClassificationQualityFunctionsinanACOAlgorithmforLearningNeuralNetwork

Structures

143

averaged across all datasets to obtain the overall aver-

age rank.

As shown in Table 2, the Quadratic Loss Func-

tion (ANN-QLF) obtained the best overall ranking of

3.3, followed closely by Bayesian Information Re-

ward (ANN-BIR) and Squared Class Probability Er-

ror (ANN-CPE

), both of which had an overall rank-

ing of 3.5.

The non-parametric Friedman statistical test with

the Holm’s post-hoc test was applied, at the conven-

tional 0.05 signiﬁcance level, to the predictive accu-

racy results reported in Table 2—an all-pairs multiple

comparison on the 8 algorithms under evaluation. The

results of the statistical signiﬁcance test are reported

in Table 3; for space limitations, results are reported

only for the cases where there is a statistically sig-

niﬁcant difference. The results indicate that all seven

measures are signiﬁcantly better than the baseline 3L-

BP, and four of the measures are also signiﬁcantly

better than ANN-Acc (the measure that was used in

the original work on ANN-Miner): ANN-BIR, ANN-

QLF, ANN-CPE

, and ANN-CPE

ACKNOWLEDGEMENT

This partial support of a Brandon University Research

Grant is gratefully acknowledged.

REFERENCES

Boryczka, U. and Kozak, J. (2010). Ant colony decision

trees. In 4th International Conference on Computa-

tional Collective Intelligence, pages 4373–382.

Boryczka, U. and Kozak, J. (2011). An adaptive discretiza-

tion in the ACDT algorithm for continuous attributes.

In 3rd International Conference on Computational

Collective Intelligence, pages 475–484.

Dorigo, M. and St¨utzle, T. (2004). Ant Colony Optimiza-

tion. MIT Press, Cambridge, MA, USA.

Liao, T., Socha, K., de Oca, M. M., Stuetzle, T., and

Dorigo, M. (2014). Ant colony optimization for

mixed-variable optimization problems. To appear in

IEEE Transactions on Evolutionary Computation.

Otero, F., Freitas, A., and Johnson, C. (2009). Handling

continuous attributes in ant colony classiﬁcation algo-

rithms. In IEEE Symposium on Computational Intel-

ligence in Data Mining (CIDM-09), pages 225–231.

Otero, F., Freitas, A., and Johnson, C. (2013). A new se-

quential covering strategy for inducing classiﬁcation

rules with ant colony algorithms. IEEE Transactions

on Evolutionary Computation, 17(1):64–74.

Otero, F. E. B., Freitas, A. A., and Johnson, C. G.

(2012). Inducing decision trees with an ant colony

optimization algorithm. Applied Soft Computing,

12(11):3615–3626.

Parpinelli, R. S., Lopes, H. S., and Freitas, A. A. (2002).

Data mining with an ant colony optimization algo-

rithm. IEEE Transactions on Evolutionary Compu-

tation, 6(4):321–332.

Salama, K., Abdelbar, A., and Freitas, A. (2011). Multi-

ple pheromone types and other extensions to the Ant-

Miner classiﬁcation rule discovery algorithm. Swarm

Intelligence, 5(3-4):149–182.

Salama, K., Abdelbar, A., Otero, F., and Freitas, A. (2013).

Utilizing multiple pheromones in an ant-based algo-

rithm for continuous-attribute classiﬁcation rule dis-

covery. Applied Soft Computing, 13(1):667–675.

Salama, K. M. and Abdelbar, A. M. (2014). A novel ant

colony algorithm for building neural network topolo-

gies. In Proceedings ANTS-14. Accepted.

Socha, K. and Blum, C. (2007). An ant colony optimization

algorithm for continuous optimization: Application to

feed-forward neural network training. Neural Com-

puting & Applications, 16:235–247.

Socha, K. and Dorigo, M. (2008). Ant colony optimization

for continuous domains. European Journal of Opera-

tional Research, 185:1155–1173.

ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications

144