Automatic Modularization of Artificial Neural Networks
Eva Volna
University of Ostrava, 30. dubna st. 22, 701 03 Ostrava, Czech Republic
Abstract. The core of this paper concerns forms of automatic decomposition of tasks into modules. Both described methods perform automatic neural network modularization. Modules in the networks emerge; we do not build them by hand, but obtain them by penalizing interference between modules. The concept of emergence plays an important role in the study of neural network design. In this paper, we study the emergence of modular connectionist architectures in which the networks composing the architecture compete to learn the training patterns directly through the interaction of reproduction with the task environment. Network architectures emerge from an initial set of randomly connected networks. In this way, connections can be eliminated so as to dedicate different portions of the system to learning different tasks. Both methods are demonstrated on an experimental task.
1 Reasons for a Modular Approach
The primary reason for adopting an ensemble approach to combining nets into a
modular architecture is that of improving performance. There are a number of possi-
ble justifications for taking a modular approach to combining artificial neural nets.
First, a modular approach might be used to solve a problem that could not have been solved through the use of a unitary net. A modular system of nets can exploit the specialist capabilities of the modules and consequently achieve results that would not be possible in a single net. Another reason for adopting a modular approach is
that of reducing model complexity, and making the overall system easier to under-
stand. This justification is often common to engineering design in general. Other
possible reasons include the incorporation of prior knowledge, which usually takes
the form of suggesting an appropriate decomposition of the global task. A modular
approach can also reduce training times and make subsequent modification and ex-
tension easier. Finally, a modular approach is likely to be adopted when there is con-
cern to achieve some degree of neurobiological or psychological plausibility, since it
is reasonable to suppose that most aspects of information processing involve mod-
ularity.
A modular neural network can be characterized by a series of independent neural
networks moderated by some intermediary. Each independent neural network serves
as a module and operates on separate inputs to accomplish some subtask of the task
the network hopes to perform [1]. The intermediary takes the outputs of each module
and processes them to produce the output of the network as a whole.
Although spatial crosstalk is often seen in terms of the backpropagation algorithm, it is not limited to networks trained using this algorithm. Therefore, spatial crosstalk may be considered as resulting from the connectivity of the network and not from the learning algorithm used to train the network. By maintaining short connections and eliminating long connections, spatial crosstalk can be reduced and tasks can be decomposed into subtasks. The three systems shown in Fig. 1 [5] can be trained to perform the same mapping. The system in Panel A has its hidden units fully interconnected with its output units and is most susceptible to spatial crosstalk. The system in Panel B has its top hidden units fully interconnected with its top output units and its bottom hidden units fully interconnected with its bottom output units; thus, it consists of two separate networks (two 4-4-2 networks). If the mapping that this system is trained to perform can be decomposed so that the mapping from the input units to the top set of output units may be thought of as one task and the mapping from the input units to the bottom set of output units as a second task, then this system has dedicated different networks to learning the different tasks. Because there is no spatial crosstalk between the two tasks, such a system may show rapid learning. In the system in Panel C, each hidden unit projects to only a single output unit. It therefore consists of a separate network for each output unit (four 4-2-1 networks) and is immune to spatial crosstalk.
Fig. 1. A: One 4-8-4 network. B: Two 4-4-2 networks. C: Four 4-2-1 networks [5].
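To make the three connectivity schemes concrete, the following minimal numpy sketch (illustrative, not from the paper) builds the hidden-to-output masks for the three panels; a zero entry marks an eliminated connection.

```python
import numpy as np

def block_diagonal_mask(n_hidden, n_output, n_modules):
    """Hidden-to-output mask with n_modules disjoint blocks; a 0 entry
    marks an eliminated connection between a hidden and an output unit."""
    mask = np.zeros((n_hidden, n_output))
    h, o = n_hidden // n_modules, n_output // n_modules
    for m in range(n_modules):
        mask[m * h:(m + 1) * h, m * o:(m + 1) * o] = 1.0
    return mask

mask_a = block_diagonal_mask(8, 4, 1)  # Panel A: one 4-8-4 network
mask_b = block_diagonal_mask(8, 4, 2)  # Panel B: two 4-4-2 networks
mask_c = block_diagonal_mask(8, 4, 4)  # Panel C: four 4-2-1 networks
```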
An artificial neural network with many adjustable weights may learn the training data quickly and accurately but generalize poorly to novel data. One method of improving the generalization abilities of a network with too many “degrees of freedom” is to decay or eliminate weights during training. A second method is to match the structure of the network to the structure of the task. For example, networks whose units have local receptive fields can learn to reliably detect the local structure that is often present in pattern recognition tasks. A system that maintains short connections and eliminates long connections should generalize well, because its degrees of freedom are reduced and because its units develop local receptive fields.
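The paper does not give a concrete decay rule; a common choice, shown in this hedged sketch, is an L2 penalty during the gradient step followed by pruning of weights whose magnitude has fallen below a threshold (both the rule and the threshold value are illustrative assumptions):

```python
import numpy as np

def sgd_step_with_decay(w, grad, lr=0.1, decay=1e-3):
    """One gradient step with L2 weight decay: weights that the task does
    not use shrink toward zero, reducing the degrees of freedom."""
    return w - lr * grad - lr * decay * w

def prune_small_weights(w, threshold=1e-2):
    """Eliminate (zero out) weights whose magnitude fell below threshold."""
    return np.where(np.abs(w) < threshold, 0.0, w)
```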
Artificial neural networks often develop relatively uninterpretable representations, for at least two reasons. Networks whose units are densely connected tend to develop representations that are distributed over many units and are thus difficult to interpret. In addition, uninterpretable representations often develop in networks that are trained to perform multiple tasks simultaneously. In contrast, networks whose units have local receptive fields and short connections may develop relatively local representations. Furthermore, such a system may be capable of eliminating connections so that different networks learn different tasks.
3 Evolutionary Module Acquisition
A simple model of the evolutionary emergence of modular neural network topology was introduced in [10]. We describe a method of optimizing the modular neural network architecture via evolutionary algorithms that keeps a fixed part of the network architecture in the genome. Every individual is a multilayer neural network with one hidden layer of units. We have to fix its maximal architecture (i.e. the number of input, hidden, and output units) before the main calculation. The population P consists of P = {α_1, α_2, ..., α_p}, where p is the number of chromosomes in P. Every chromosome consists of binary digits that are generated randomly with probability 0.5. A chromosome for a network with m hidden units and n output units is shown in Fig. 2, where e_ij = 0 if the connection between the i-th hidden unit and the j-th output unit of the individual does not exist, and e_ij = 1 if the connection exists (i = 1,…,m; j = 1,…,n). Connections between input and hidden units are not included in the chromosomes, because they are not necessary for creating the modular network architecture. Each individual (i.e. network architecture) is partially adapted by backpropagation; its fitness is then calculated as follows (1):
Fitness_k = 1 / E_k    (1)

where k = 1, …, p (p is the number of individuals in the population) and E_k is the error after the backpropagation adaptation of the k-th individual.
Population P: individual α_1, ..., individual α_k, ..., individual α_p.
Individual α_k: e_11, …, e_1n, ..., e_m1, …, e_mn.

Fig. 2. A population of individuals.
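A minimal sketch of the encoding and of fitness (1), assuming the 9-13-8 architecture used later in the paper; train_error stands for the partial backpropagation adaptation and is only a stub here:

```python
import numpy as np

rng = np.random.default_rng()
M, N, P = 13, 8, 100  # hidden units, output units, population size

def random_chromosome():
    """Binary string e_11, ..., e_mn; each bit generated with probability 0.5."""
    return rng.integers(0, 2, size=M * N)

def train_error(mask):
    """Stub for the partial backpropagation adaptation of the network whose
    hidden-to-output connectivity is given by `mask`; returns the error E_k."""
    return 1.0  # placeholder value for the sketch

def fitness(chromosome):
    """Fitness_k = 1 / E_k (Eq. 1)."""
    return 1.0 / train_error(chromosome.reshape(M, N))

population = [random_chromosome() for _ in range(P)]
scores = [fitness(c) for c in population]
```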
Only two mutation operators are used; there are no crossover operators. The first mutation operator is defined as follows: in every generation, one individual is randomly chosen, and each bit in its chromosome is changed with probability 0.01 (i.e. if the connection exists, after mutation it does not exist, and vice versa). The second mutation operator is defined as follows (see Fig. 3). First, we define a pattern of t consecutive zeroes that remains fixed during the whole calculation. The pattern is determined by the numbers of neurons in the output layer that represent the individual modules. The output neurons are organized into d modules, and t = min(t_i, i = 1,…,d), where t is the number of neurons in the pattern and t_i is the number of neurons in the i-th module. The defined pattern is represented as a continuous chain of t zeroes that is not changed during applications of the second mutation operator. The fixation of the t-zeroes chain can be defended by a biological motivation, where protection against mutation is usually related to a continuous section. The defined pattern in the chromosome allows temporarily fixing an existing module against the application of the second mutation operator. We then find the defined pattern in each chromosome. If we find exactly one continuous pattern, we fix it. If we find a run of more than t consecutive zeroes, we randomly choose t consecutive zeroes from it and fix them. The fixed pattern represents a single atomic unit, and the second mutation operator is not applied to it; only the remaining bits of the chromosome are changed, each with probability 0.01. Thus, each individual has a unique collection of fixed patterns. The second mutation operator is applied to every individual r times, where r is a parameter whose value is defined before the calculation. Only the best individual or its best mutation is included in the next generation. Next, all individuals in the new generation release the patterns that were fixed, so that they can once again be manipulated by the reproduction operators. The process of the evolutionary algorithm ends when the population reaches the maximal generation or when there is no improvement in the objective function for a defined number of consecutive generations.
Fig. 3. The second mutation operator. The fixed pattern is t consecutive zeroes; k is the number of consecutive zeroes in the chromosome (the cases k < t, k = t, and k > t are shown). A: An individual before mutation. B: Possible chromosomal representations of the individual after mutation.
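The following sketch implements the two operators as described above; the handling of runs longer than t zeroes follows our reading of the text and is an assumption:

```python
import numpy as np

rng = np.random.default_rng()

def mutate_bits(chrom, p=0.01):
    """First operator: flip every bit with probability p."""
    flips = rng.random(chrom.size) < p
    return np.where(flips, 1 - chrom, chrom)

def find_zero_runs(chrom, t):
    """(start, end) indices of runs of at least t consecutive zeroes."""
    runs, start = [], None
    for i, bit in enumerate(chrom):
        if bit == 0 and start is None:
            start = i
        elif bit == 1 and start is not None:
            if i - start >= t:
                runs.append((start, i))
            start = None
    if start is not None and chrom.size - start >= t:
        runs.append((start, chrom.size))
    return runs

def mutate_with_fixed_pattern(chrom, t=4, p=0.01):
    """Second operator: fix one chain of t consecutive zeroes (a temporarily
    protected module) and flip only the remaining bits with probability p."""
    protected = np.zeros(chrom.size, dtype=bool)
    runs = find_zero_runs(chrom, t)
    if runs:
        start, end = runs[rng.integers(len(runs))]
        offset = rng.integers(start, end - t + 1)  # t zeroes inside the run
        protected[offset:offset + t] = True
    flips = (rng.random(chrom.size) < p) & ~protected
    return np.where(flips, 1 - chrom, chrom)
```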
4 Modularization via an Evolutionary Hill-Climbing Algorithm
The second presented method is based on a hill-climbing algorithm with learning [8]. The evolution of a probability vector is modeled by a genetic algorithm on the basis of the best-evaluated individuals, which are selected on the basis of the speed and quality of learning of the given tasks [11]. The population P is presented in Fig. 2 and is defined in the same way as in the previous section. Individuals in the next generation are generated from the updated probability vector. Every individual (i.e. its neural network architecture) is partially adapted by backpropagation [2] and evaluated by the quality of its adaptation. The number of epochs is a very important criterion in the described method, because modular architectures start to learn faster than fully connected multilayer connectionist networks [9]. Our goal is to produce a neural network architecture that is able to learn a given problem with the smallest error. The backpropagation error is a parameter of the fitness function. The fitness value F_i of the i-th individual is calculated as follows (2):
F_i = (1/con) · Σ_{k=1}^{con} f_ik    (2)

where i = 1, …, p (p is the number of individuals in the population);
The best individual in the population is automatically included in the next population. The chromosome values of the remaining individuals α_i (i = 2, …, p) are calculated for the next generation as follows: if w_k = 0 (or 1), then (e_k)_i = 0 (or 1); if 0 < w_k < 1, the corresponding (e_k)_i is determined randomly by (6):

(e_k)_i = 1 if random(0,1) < w_k, and (e_k)_i = 0 otherwise,    (6)

where k = 1, …, m·n indexes the positions of the chromosome and random(0,1) is a uniformly distributed random number.
The process of the evolutionary algorithm ends when the saturation parameter τ(w) is greater than a predefined value, where τ(w) is the number of entries w_i of the probability vector w that are less than w_eff or greater than 1 − w_eff, and w_eff is a small positive number.
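Since formulas (3)-(5) are not reproduced in the available text, the update rule below is only an assumption modeled on hill climbing with learning [8] (a PBIL-style shift of the probability vector toward the best individual); the sampling follows Eq. (6) and the saturation test follows the definition of τ(w):

```python
import numpy as np

rng = np.random.default_rng()

def sample_individual(w):
    """Eq. (6): (e_k)_i = 1 if a uniform random number is below w_k, else 0."""
    return (rng.random(w.size) < w).astype(int)

def update_w(w, best, lam=0.2):
    """Assumed learning rule (formula (4) is not available in this text):
    shift the probability vector toward the best individual's chromosome."""
    return (1.0 - lam) * w + lam * best

def saturation(w, w_eff=0.01):
    """tau(w): number of entries closer than w_eff to 0 or to 1."""
    return int(np.sum((w < w_eff) | (w > 1.0 - w_eff)))

w = np.full(13 * 8, 0.5)  # probability vector for a 13x8 connectivity mask
```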
5 Experiments
In the experimental task, a system (neural network) recognizes a binary pattern and its rotation. The system is a neural network with one hidden layer of units and a 9-13-8 topology, adapted by backpropagation. Our target was the creation of a modular system that would solve the partial tasks correctly. The basic set of training patterns is organized in a 3×3 matrix (grid) represented by a binary vector. The direction of rotation is defined with respect to the basic pattern, with four possibilities: (a) 0°, the state without rotation, (b) a 90° turn, (c) a 180° turn, and (d) a 270° turn. The training set includes four patterns, each defined in the four different states (see Fig. 4); thus, we get 16 different combinations of shapes and their rotations. The eight output units are divided into two subsets of four units. Units in the “shape” subset are responsible for indicating the identity of the input: each input is associated with one of the four “shape” units and with one of the four rotations. The system is considered successful when it correctly recognizes and locates an input.
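A sketch of how such a training set can be built; the four base shapes below are illustrative stand-ins (the paper's actual shapes are shown in Fig. 4), chosen so that their four rotations are distinct:

```python
import numpy as np

def rotations(grid):
    """A 3x3 binary pattern in its four states: 0, 90, 180, and 270 degrees."""
    return [np.rot90(grid, k).flatten() for k in range(4)]

# Illustrative base shapes; the paper's actual four shapes are in Fig. 4.
shapes = [
    np.array([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
    np.array([[0, 0, 1], [0, 0, 1], [1, 1, 1]]),
    np.array([[1, 1, 1], [0, 1, 0], [0, 1, 0]]),
    np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1]]),
]

X, Y = [], []
for s, shape in enumerate(shapes):
    for r, pattern in enumerate(rotations(shape)):
        X.append(pattern)          # 9 binary inputs (the 3x3 grid)
        target = np.zeros(8)
        target[s] = 1.0            # "shape" subset: output units 0-3
        target[4 + r] = 1.0        # "rotation" subset: output units 4-7
        Y.append(target)
X, Y = np.array(X), np.array(Y)    # 16 patterns for the 9-13-8 task
```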
Parameters of the experimental part.
Population (both methods):
  Number of individuals: 100.
  Neural network architecture: 9-13-8.
  Training algorithm: backpropagation (learning rate: 1; momentum: 0; training time: 150 epochs in the partial training).
Parameters of the method from Section 3:
  Probability of mutations: 0.01.
  Fixed pattern in the second mutation: “0000”.
  r: 5.
  Ending condition: maximal number of generations: 500.
Parameters of the method from Section 4:
  con: 100; see formula (2).
  λ: 0.2; see formula (4).
  Ending condition: the saturation parameter τ(w): 0.99·m·n (m = 13, the number of hidden units; n = 8, the number of output units); w_eff = 0.01.
Fig. 4. The defined patterns in the training set.
Table 1 shows the results: the evolution of the best individual in the population. It is evident that connections among modules are eliminated faster than connections inside modules. These results also support the fact that the systems were created dynamically during the learning process. The method from Section 3 gives the following results: in the last generation, six hidden units of the best individual realize the “shape” task and four of its units realize the “rotation” task. The method from Section 4 gives the following results: in the last generation, seven hidden units of the best individual realize the “shape” task and four of its units realize the “rotation” task. The calculation was terminated when the ending conditions were fulfilled, i.e. in the 498th generation for the method from Section 3 and in the 353rd generation for the method from Section 4. Other numerical simulations give very similar results.
[Figure: the 9-13-8 network for the experimental task, with a 3×3 input grid, a hidden layer, and output units divided into a “shape” subset and a “rotation” subset (0°, 90°, 180°, 270°).]

Table 1. Evolution of the best individual: the number of hidden units realizing the “shape” task, the number realizing the “rotation” task, and the number of interferences.

                 Method from Section 3         Method from Section 4
generation     shape  rotation  interf.      shape  rotation  interf.
1                1       1        11           0       0        13
100              2       3         8           1       3         9
200              3       3         7           4       3         6
300              4       3         6           6       3         4
353              -       -         -           7       4         2  (final)
400              5       4         4           -       -         -
498 (final)      6       4         3           -       -         -
We performed the following experiment. A neural network with the modular architecture (the best individual) and a network with the same arrangement of neurons but with all connections between layers present were adapted via backpropagation to solve the task defined above. Ten adaptation runs were performed for each model; the weight vector was generated randomly at the beginning of each simulation. Fig. 5 shows the average error function values of (a) the modular neural network and (b) the fully connected neural network during the whole calculation. The adaptation of each neural network was terminated after 1500 iterations. The figure shows that the network with the modular architecture, which includes only a limited number of connections, learns the considered problem as efficiently as a monolithic network designed with an appropriate architecture.
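A hedged sketch of this comparison, using the parameters listed above (learning rate 1, momentum 0, 1500 iterations); best_mask stands for the winning modular architecture and is an assumption, while np.ones((13, 8)) is the fully connected individual:

```python
import numpy as np

rng = np.random.default_rng()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(mask, X, Y, iters=1500, lr=1.0):
    """Backpropagation on a 9-13-8 network; `mask` zeroes the eliminated
    hidden-to-output connections of the modular individual."""
    W1 = rng.normal(0.0, 0.5, (9, 13))
    W2 = rng.normal(0.0, 0.5, (13, 8)) * mask
    errors = []
    for _ in range(iters):
        H = sigmoid(X @ W1)                     # hidden activations
        O = sigmoid(H @ W2)                     # output activations
        err = Y - O
        errors.append(0.5 * np.sum(err ** 2))   # error function E
        dO = err * O * (1.0 - O)
        dH = (dO @ W2.T) * H * (1.0 - H)
        W2 += lr * (H.T @ dO) * mask            # keep eliminated weights at zero
        W1 += lr * (X.T @ dH)
    return np.array(errors)

def average_error(mask, X, Y, runs=10):
    """Average error history over 10 random initializations, as in the text."""
    return np.mean([train(mask, X, Y) for _ in range(runs)], axis=0)

# average_error(best_mask, X, Y) vs. average_error(np.ones((13, 8)), X, Y)
```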
Fig. 5. The history of the average error function value E over iterations during the whole calculation. A: method from Section 3; B: method from Section 4. Each panel compares the modular architecture with the fully connected individual.
6 Conclusions
Both described methods perform automatic neural network modularization. Problem-specific modularizations of the representation emerge through the interaction of the evolutionary algorithm directly with the problem.

When interpreting the solutions, we have to be careful, because the algorithms' parameters are not the object of the optimization process; the solutions we obtain depend on these parameters. Both numerical simulations reflect the significance of modular structure as a tool for rejecting the negative influence of interference on neural network adaptation. In an unsplit network, the hidden units process input information for all output units, even when pattern classification must be performed on the basis of diametrically distinct criteria (e.g. when the neural network has to classify patterns according to their form, location, colors, ...). At the beginning of the adaptation process, interference can therefore cause output units to receive information about object classifications other than the one desired of them. This negative influence of interference on the adaptive process is removed precisely by the modular neural network architecture, as the results of the performed experiment also confirm. The winning modular network architecture was the product of emergence using evolutionary algorithms. The neural network serves here as a natural substrate for the evolutionary algorithm: because of its structure and properties, it can be directly transformed into an individual of the evolutionary algorithm.
References
1. Di Ferdinando, A., Calabretta, R., and Parisi, D. (2001) Evolving modular architectures for neural networks. In French, R., and Sougné, J. (eds.) Proceedings of the Sixth Neural Computation and Psychology Workshop: Evolution, Learning and Development. Springer-Verlag, London.
2. Fausett, L. V. (1994) Fundamentals of Neural Networks. Prentice-Hall, Inc., Englewood Cliffs, New Jersey.
3. Hampshire, J. and Waibel, A. (1989) The Meta-Pi network: Building distributed knowledge representations for robust pattern recognition. Technical Report CMU-CS-89-166. Pittsburgh, PA: Carnegie Mellon University.
4. Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991) Adaptive mixtures of local experts. Neural Computation, 3, pp. 79-97.
5. Jacobs, R. A. and Jordan, M. I. (1992) Computational consequences of a bias toward short connections. Journal of Cognitive Neuroscience, 4, pp. 323-336.
6. Jordan, M. I. and Jacobs, R. A. (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, pp. 181-214.
7. Jordan, M. I. and Jacobs, R. A. (1995) Modular and hierarchical learning systems. In M. A. Arbib (ed.) The Handbook of Brain Theory and Neural Networks. pp. 579-581.
8. Kvasnička, V., Pelikán, M., and Pospíchal, J. (1996) Hill climbing with learning (an abstraction of genetic algorithm). Neural Network World, 5, pp. 773-796.
9. Rueckl, J. G., Cave, K. R., and Kosslyn, S. M. (1989) Why are “What” and “Where” processed by separate cortical visual systems? A computational investigation. Journal of Cognitive Neuroscience, 1, pp. 171-186.
10. Volna, E. (2002) Neural structure as a modular developmental system. In P. Sinčák, J. Vaščák, V. Kvasnička, and J. Pospíchal (eds.) Intelligent Technologies – Theory and Applications. IOS Press, Amsterdam, pp. 55-60.
11. Volna, E. (2007) Designing modular artificial neural network through evolution. In J. Marques de Sá, L. A. Alexandre, W. Duch, and D. P. Mandic (eds.) Artificial Neural Networks – ICANN 2007, Lecture Notes in Computer Science, vol. 4668, Springer-Verlag, pp. 299-308.