layer is set to the same value as in the real trained neural
network. Then, unimportant connections are pruned
according to q. Although a network with 5 layers is used
for this calculation, to keep the conditions the same as in
the real trained neural network, nodes in the dense layers
remain attached and links that cannot reach the output
layer are deleted. The number of motifs is calculated
against 10 weight-randomized neural networks, and the
average of those counts is used as a reference value. The
values shown in Fig. 5 are the measured counts minus
this reference value. Note that, since the networks
obtained by FG and CFVG share the same structure, the
same weight-randomized networks are used for both.
MFINDER 1.21 is used for detecting network
motifs (Milo et al., 2004).
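As a rough illustration of this counting procedure, the sketch below prunes one weight matrix by keeping the q% of connections with the largest absolute weight, counts bifan motifs in the resulting layer-to-layer adjacency, and subtracts the average count over weight-randomized copies. The function names (prune_by_q, count_bifans, motif_excess) and the per-layer bifan counter are our own simplification for illustration; in the actual experiments the motif detection over the full pruned network is performed with MFINDER 1.21.

```python
import numpy as np

def prune_by_q(weights, q):
    """Keep the q% of connections with the largest absolute weight;
    the remaining links are treated as deleted (set to zero)."""
    w = weights.copy()
    threshold = np.percentile(np.abs(w), 100 - q)
    w[np.abs(w) < threshold] = 0.0
    return w

def count_bifans(weights):
    """Count bifan motifs (two source nodes both connected to the same two
    target nodes) in a single layer-to-layer connection matrix."""
    a = (weights != 0).astype(int)
    shared = a @ a.T                      # targets shared by each source pair
    np.fill_diagonal(shared, 0)
    pairs = shared * (shared - 1) // 2    # choose 2 of the shared targets
    return int(pairs.sum()) // 2          # each pair (i, j) is counted twice

def motif_excess(weights, q, n_random=10, seed=0):
    """Observed motif count minus the average count over weight-randomized
    networks, mirroring the reference-value subtraction used for Fig. 5."""
    rng = np.random.default_rng(seed)
    observed = count_bifans(prune_by_q(weights, q))
    reference = []
    for _ in range(n_random):
        shuffled = rng.permutation(weights.flatten()).reshape(weights.shape)
        reference.append(count_bifans(prune_by_q(shuffled, q)))
    return observed - float(np.mean(reference))
```

In practice, the pruned network would be exported as an edge list and passed to MFINDER for detection of all motif types rather than counted in-process as above.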
From the results in Fig. 5, we can see a tendency for
the neural networks obtained with CFVG to have a rela-
tively large number of bifan and diamond motifs, while
having fewer of the motifs obtained by removing one
link from the bifan and diamond motifs, especially
when q is 20. Although FG shows the same tendency,
it is even stronger in the networks obtained with CFVG.
This suggests that the neural networks obtained by
CFVG are more modular than those obtained by FG,
and this structure could be the reason why CFVG has
the advantage in mitigating catastrophic forgetting.
6 CONCLUSION AND FUTURE
WORK
In this paper, we explored the application of MVG to
mitigating catastrophic forgetting in a somewhat more
practical setting, namely classification of small real
images and an increasing number of goals. From the
results, we find that varying goals can mitigate
catastrophic forgetting under SGD in a CIFAR-10
based classification problem. We find that, when
learning a large set of goals, a relatively small
switching interval is required to retain the advantage
in mitigating catastrophic forgetting. On the other
hand, when learning a small set of goals, a suitably
large switching interval is preferable, since it degrades
this advantage less and can also improve accuracy.
In addition, by examining the structure of the obtained
neural networks, we find that, after pruning unim-
portant connections, they show strong bifan and
diamond motifs, which suggests that the obtained
networks are modular; this could be the reason why
catastrophic forgetting is mitigated.
For future work, the proposed approach should be
examined on other tasks and other layer structures, and
a theoretical analysis is needed.
ACKNOWLEDGMENT
Thanks to Nadav Kashtan for providing his source
code. This work was supported by JSPS KAKENHI
Grant Number JP19K20415. Computational re-
sources were partially provided by the large-scale computer
systems at the Cybermedia Center, Osaka University.
REFERENCES
Alon, U. (2003). Biological networks: The tinkerer as an
engineer. Science, 301(5641):1866–1867.
Amer, M. and Maul, T. (2019). A review of modulariza-
tion techniques in artificial neural networks. Artificial
Intelligence Review, 52(1):527–561.
Ellefsen, K. O., Mouret, J.-B., and Clune, J. (2015). Neural
modularity helps organisms evolve to learn new skills
without forgetting old skills. PLoS Computational Bi-
ology, 11(4):e1004128.
French, R. M. (1999). Catastrophic forgetting in con-
nectionist networks. Trends in Cognitive Sciences,
3(4):128–135.
Hou, L. and Kwok, J. T. (2018). Power law in spar-
sified deep neural networks. arXiv e-prints, page
arXiv:1805.01891.
Kashtan, N. and Alon, U. (2005). Spontaneous evolution
of modularity and network motifs. Proceedings of
the National Academy of Sciences, 102(39):13773–
13778.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J.,
Desjardins, G., Rusu, A. A., Milan, K., Quan, J.,
Ramalho, T., Grabska-Barwinska, A., Hassabis, D.,
Clopath, C., Kumaran, D., and Hadsell, R. (2017).
Overcoming catastrophic forgetting in neural net-
works. Proceedings of the National Academy of Sci-
ences, 114(13):3521–3526.
Li, Y., Wang, S., Tian, Q., and Ding, X. (2015). A survey
of recent advances in visual feature detection. Neuro-
computing, 149:736–751.
Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr,
S., Ayzenshtat, I., Sheffer, M., and Alon, U. (2004).
Superfamilies of evolved and designed networks. Sci-
ence, 303(5663):1538–1542.
Newman, M. E. (2006). Modularity and community
structure in networks. Proceedings of the National
Academy of Sciences, 103(23):8577–8582.
Parter, M., Kashtan, N., and Alon, U. (2008). Facilitated
variation: How evolution learns from past environ-
ments to generalize to new environments. PLoS Com-
putational Biology, 4(11):e1000206.
Ramesh, B., Yang, H., Orchard, G. M., Le Thi, N. A.,
Zhang, S., and Xiang, C. (2019). Dart: Distribu-
tion aware retinal transform for event-based cameras.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, pages 1–1.
Sporns, O. and Kötter, R. (2004). Motifs in brain networks.
PLoS Biology, 2(11):e369.