THE ROLE OF SEQUENCES FOR INCREMENTAL LEARNING
Susanne Wenzel
Institute of Geodesy and Geoinformation, Department of Photogrammetry, University of Bonn, Germany
Lothar Hotz
HITeC e.V. c/o Department of Computer Science, University of Hamburg, Germany
Keywords:
Incremental learning, Ordering effects, Version space learning, Scene interpretation.
Abstract:
In this paper, we point out the role of the sequence of samples used for training an incremental learning method. We
define characteristics of incremental learning methods that describe the influence of the sample ordering on the
performance of a learned model. We show the influence of the training sequence for two different types of incremental
learning: one aims at learning structural models, the other at learning models to discriminate object
classes. In both cases, we show that good sequences can be found before training starts.
1 INTRODUCTION
For the purpose of scene interpretation, e.g., the interpretation of facade images, it is impossible to capture
the huge variability of facades with a single dataset. Instead, we need a continuously learning system that is
able to improve already learned models using new examples; in principle, this process never stops. For such
open-ended learning, a number of incremental learning methods exist. However, one can easily show that the
performance of these methods depends on the order of the training examples, although most other authors
do not mention this.
Consider young children: they are sophisticated incremental learners. They permanently increase their
knowledge just by observing their surroundings, but obviously they learn much better when examples are
presented in a meaningful order.
For this reason, teachers design a curriculum for teaching their pupils at the beginning of a school year.
This way they ensure that the topics are presented to the pupils in a well-structured order, depending on the
complexity and challenge of each single topic.
In this paper, we want to point out the relevance of the training sequence for incremental learning
methods. Furthermore, we want to show that it is possible to define curricula for training incremental learning
methods, i.e., to define good sequences of examples such that these methods always perform at their best.
We show this by means of two different learning methods. One is aimed at learning models to discriminate
object classes. The other is based on Version Space Learning and aims at structural learning.
Existing work on ordering effects in incremental learning is rare. A good overview of the work and
results in this field is given by (Langley, 1995).
With COBWEB, (Fisher, 1987) proposed an unsupervised incremental learning method based on hierarchical
conceptual clustering and already noted the role of the training sequence.
Subsequently, some authors tried to overcome this problem. (Talavera and Roure, 1998) introduced a
buffering strategy in which difficult examples are stored for later evaluation, thus postponing their
evaluation to a more qualified state of the learned model. However, this can lead to an oversized sample
store, which in the extreme amounts to collecting all examples and evaluating them in a batch procedure.
(McKusick and Langley, 1991) performed empirical experiments with ARACHNE, a modification of COBWEB.
For this algorithm they obtained good results by alternating examples from all classes instead of presenting
samples from single classes in sequence.
Recently, the term curriculum learning was used
by (Bengio et al., 2009) for finding sequences for
training neural networks.
The rest of the paper is structured as follows. We
first characterize incremental learning methods re-
garding the effects of sample ordering in Section 2.
Thereafter, we describe ordering effects on learning models to discriminate object classes in
Section 3. There, we propose experiments to explore these effects and, in Section 3.2, a method to define
suitable class sequences for training based on estimated bounds of the Bayes error. Section 4
discusses the role of sequences for learning structural models.
2 CHARACTERIZING INCREMENTAL LEARNING METHODS
To characterize a certain learning method, we need a measure of success. In this work, we characterize
the performance of an incremental learning method by its error rate as a function of the number of samples
used for learning so far. This error rate depends on various conditions, e.g., the classes, the samples, the
classifier, and possibly the sequence of samples.
A further measure used in this work is the number
of examples k that are used by a learning method for
correctly interpreting a fixed number n of given exam-
ples. The goal is to minimize the number of needed examples k by selecting the best sequence of examples.
As main characteristic of an incremental learning
method we define its dependence on the training se-
quence as follows:
Definition 2.1: Ordering sensitivity. We call an incremental learning method ordering sensitive if its
performance is affected by the sequence of training examples.
In other words, if there exists a training sequence that yields a different curve of error rates or a
different number of needed examples, then the method is ordering sensitive.
Now, one might be tempted to use this to decide whether a learning method is good or not. One could say
that a good incremental learning method is not affected by the training sequence and thus is not ordering
sensitive (Giraud-Carrier, 2000).
We argue the other way round: if a method is ordering sensitive and we can assume that we are able to
identify the best training sequence, thus always getting the best performance for this method, then we can
possibly outperform any method that is not ordering sensitive.
3 THE ROLE OF SEQUENCES WHEN LEARNING OBJECT CLASSES
In this section, the goal of learning is to identify and discriminate instances of classes.
In the following, we concentrate on the classification performance of an incremental learning method,
i.e., its classification power measured in terms of error rates.
To measure ordering sensitivity, we simply train a method with several sequences of training samples
and record the error rate after each used sample. The variance of these error rates indicates the ordering
sensitivity: the higher the variance of the error rates, the higher the ordering sensitivity.
We need to distinguish two different training scenarios.
1. Over the whole training sequence there are sam-
ples from all classes simultaneously available.
2. Samples from different classes become available
sequentially, one class after the other.
In the first case, we ask whether there is an influence of the training sequence at all, i.e., over all
samples from all classes. One could either start with some very typical examples from each class and
increase their complexity afterwards or, vice versa, start with complex examples to rapidly fill the feature
space. Obviously, the best strategy strongly depends on the method used.
For the second case, we assume that it is possible to learn one class after the other, i.e., everything
about windows followed by everything about doors and so on. The question is which class sequence is best
suited to learn the most about facade elements. We call the ordering sensitivity in the second scenario
class ordering sensitivity.
3.1 Experiments to Explore Ordering Effects
Figure 1: Mean images per class of samples from the used digits dataset.
To investigate possible ordering effects, we have used the dataset of handwritten digits from (Seewald,
2005). These are handwritten digits from 0 to 9, thus 10 classes. The samples are intensity images,
normalized to 16 × 16 pixels. Per class there are 190 samples for training and 180 for testing, thus 3700
images in total. Figure 1 shows the mean image of each class
of this dataset. As features, we use HOG features (Dalal and Triggs, 2005).
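The paper does not state the HOG parameters used; the following is a minimal sketch of how such features could be extracted from the 16 × 16 digit images, assuming scikit-image's hog() with illustrative cell and block sizes, and a hypothetical list train_images holding the intensity images:

import numpy as np
from skimage.feature import hog

def hog_features(image_16x16):
    """Compute a HOG descriptor for a single 16 x 16 intensity image
    (cell and block sizes are illustrative, not those of the paper)."""
    return hog(image_16x16,
               orientations=9,
               pixels_per_cell=(4, 4),
               cells_per_block=(2, 2),
               feature_vector=True)

# stack the descriptors of all training images into a feature matrix
# X_train = np.vstack([hog_features(img) for img in train_images])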
As learning method, we use an incremental formulation of Linear Discriminant Analysis (LDA), called
iLDAaPCA, as described by (Uray et al., 2007). It combines two linear subspace methods, an incremental
Principal Component Analysis (PCA) and LDA, and thus exploits the generative power of PCA and the
discriminative power of LDA, whereby the use of PCA enables the incremental update of the LDA space.
Figure 2: Ordering sensitivity. Results of experiments with 100 different training sequences with samples
randomly chosen from all classes. The graph shows error rates vs. the number of used samples. Gray: error
rates of each individual sequence, red: mean error rate of the experiment, blue: standard deviation of the
experiment.
Experiment 1: Sequences Including Samples from all Classes. First, we need to know whether an
incremental learning method is ordering sensitive at all. Therefore, we simply try different sequences of
training samples. For each new sample, we update the learned model, evaluate it on the test set and record
the error rate. This way we get a curve of error rates depending on the number of used samples. The variance
of these error rates over the whole sequence indicates the sensitivity to the training sequence. Results are
shown in Figure 2.
The spread of the gray curves, which are the individual error-rate curves, shows the strong sensitivity
with respect to the ordering of the training samples.
We observe a decrease of the variance down to an equal error rate of 23% for all trials after processing
all samples.
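A minimal sketch of this protocol is given below. It uses scikit-learn's SGDClassifier with partial_fit as a generic stand-in for the incremental learner (not the iLDAaPCA method actually used) and assumes feature matrices X_train, X_test with labels y_train, y_test:

import numpy as np
from sklearn.linear_model import SGDClassifier

def error_curve(X_train, y_train, X_test, y_test, order, classes):
    """Train sample by sample in the given order and record the test
    error rate after every single update."""
    clf = SGDClassifier()               # stand-in for the incremental learner
    errors = []
    for i in order:
        clf.partial_fit(X_train[i:i + 1], y_train[i:i + 1], classes=classes)
        errors.append(np.mean(clf.predict(X_test) != y_test))
    return np.array(errors)

rng = np.random.default_rng(0)
classes = np.unique(y_train)
curves = np.array([error_curve(X_train, y_train, X_test, y_test,
                               rng.permutation(len(y_train)), classes)
                   for _ in range(100)])   # 100 random training sequences
mean_curve = curves.mean(axis=0)           # mean curve, as in Figure 2
std_curve = curves.std(axis=0)             # spread = ordering sensitivity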
Experiment 2: Special Ordering of Classes. The second type of experiment evaluates the dependency on the
order of the classes during training. Assume training samples of different classes become available over
time: we start training with some initial classes and add samples of new classes afterwards. Accordingly,
the experiment is defined by first choosing the class order randomly and then choosing the order of samples
within each class randomly as well. Results are shown in Figure 3.
Figure 3: Class ordering sensitivity for 100 different order-
ings of learned classes. Same meaning of axes and lines
as in Figure 2. Here we started with samples from 5 ran-
domly chosen classes and continued with samples from new
classes afterwards.
Here, we started sequential training with samples from 5 randomly chosen classes out of 10. Then, we added
samples from single, again randomly chosen, new classes. At these points, the error rates increase
drastically. This is because a single sample from a new class is not sufficient to train the model on that
class, so the evaluation on the test set fails for this class. As more samples from the new class are added,
the model improves and the error rate decreases again. The overall performance degrades slightly with every
new class, as the complexity of the classification problem increases with the number of classes.
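To reproduce this protocol with the sketch from Experiment 1, a class-ordered index sequence can be built as follows. This is a sketch of our own (reusing error_curve(), rng, and classes from the previous sketch); only the 5-class initialisation follows the experiment described above:

def class_ordered_sequence(y_train, class_order, n_init=5, rng=None):
    """Sample indices: first a shuffled mix of the n_init initialisation
    classes, then the remaining classes strictly one after the other."""
    rng = rng if rng is not None else np.random.default_rng()
    init, rest = class_order[:n_init], class_order[n_init:]
    idx = list(rng.permutation(np.flatnonzero(np.isin(y_train, init))))
    for c in rest:
        idx.extend(rng.permutation(np.flatnonzero(y_train == c)))
    return np.array(idx)

# one trial of Experiment 2: random class order, then the usual error curve
order = class_ordered_sequence(y_train, rng.permutation(classes), rng=rng)
errors = error_curve(X_train, y_train, X_test, y_test, order, classes)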
Again, we observe a high variability of error rates between different trials with different class sequences.
During the initialisation, it goes up to 30%. Again, we reach the same result for each trial after processing
all samples.
Hence, the result of this experiment shows a strong relation between the class sequence and the performance
of the learned model. Furthermore, it might be possible to use fewer samples when using the best sequence of
classes, as the best possible error rate is already reached after using only a part of the available samples.
Of course, at the end we get the same result independent of the training sequence used. This tells us that
the method behaves well: we have learned the same model after processing all available training samples.
On the other hand, in a life-long learning scenario we will never reach that end, the point where we have
seen all possible examples. Hence, the results of this experiment show the significance of having a
curriculum, i.e., we need to know the best order of examples before starting the training. This way we would
always obtain an error rate at the bottom of the graph in Figure 3 and hence always reach the best possible
performance.
We are now able to characterize an incremental
learning method in the sense of ordering sensitivity
regarding samples from all classes or regarding the
order of classes. However, we still need to know how to define the best sequence for training. The following
section proposes one criterion to choose a good sequence of classes based on the data alone. So far, we have
not found a similar criterion for obtaining the best suited within-class sequences.
3.2 How to Define a Curriculum?
The task now is to find, from all permutations of examples, the one that produces the best performance.
Related to the property of class ordering sensitivity introduced above, we propose here a way to obtain good
training sequences with respect to this property.
For this purpose, we considered different measures for separating classes, e.g., distance measures such as
the Kullback-Leibler divergence or the Bhattacharyya distance. But both assume knowledge of the underlying
class-generating distributions, and we do not want to make any assumptions about those. Hence, we looked for
criteria that are independent of any assumption about the data distribution.
We found such a criterion in the Bayes error, a general measure for the separability of classes: it defines
the best achievable error rate of any classification procedure, and bounds on it can be estimated without any
knowledge of the data distribution using the k-nearest-neighbour classifier, as shown by (Fukunaga, 1972,
chap. 6).
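A simple way to obtain such bounds is the leave-one-out nearest-neighbour error E_NN, which by the classical asymptotic nearest-neighbour result satisfies E_NN/2 ≤ E* ≤ E_NN for the Bayes error E*; Fukunaga derives tighter k-NN based bounds, which the following sketch does not reproduce:

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def bayes_error_bounds(X, y, k=1):
    """Crude (lower, upper) bounds on the Bayes error of the classes in y,
    derived from the leave-one-out k-NN error rate (O(n^2) in the number
    of samples, so subsample large class sets)."""
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                          cv=LeaveOneOut())
    e_nn = 1.0 - acc.mean()
    return e_nn / 2.0, e_nn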
We performed some experiments, which we do not report in detail, in which we compared the error rates of
different class sequences with the corresponding bounds of the Bayes error. Our empirical results indicate a
strong relation between the Bayes error and the performance of an incremental learning method given a
particular training sequence. We obtained good results in terms of error rates using the following two rules:
1. Choose the class combination for initialisation with the lowest bounds of the Bayes error.
2. Choose each remaining class such that it has the lowest bounds of the Bayes error with respect to the
complete set of classes learned so far.
In order to estimate good class sequences according to these rules, which are based on the data alone, we
have used a greedy search procedure. We estimate bounds of the Bayes error for all possible combinations of
classes and search for the sequences with the overall best performance. For a description of the whole
procedure see (Wenzel and Förstner, 2009).
The resulting list L consists of pairs of sets {initialisation classes, single classes to add}. The number of
initialisation classes increases from 2 up to any number M of classes that is still suitable for initialising
the learning method.
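The following greedy sketch illustrates the two rules, reusing bayes_error_bounds() from the previous sketch and the assumed arrays X_train, y_train. It selects by the upper bound and, for brevity, considers only pairs for initialisation; it is not the exact procedure of (Wenzel and Förstner, 2009):

from itertools import combinations

def subset_bound(X, y, class_subset):
    """Upper bound of the Bayes error for the samples of a class subset."""
    mask = np.isin(y, list(class_subset))
    return bayes_error_bounds(X[mask], y[mask])[1]

def greedy_class_sequence(X, y, n_init=2):
    classes = list(np.unique(y))
    # rule 1: initialisation set with the lowest estimated Bayes error bound
    init = list(min(combinations(classes, n_init),
                    key=lambda c: subset_bound(X, y, c)))
    seq, remaining = list(init), [c for c in classes if c not in init]
    # rule 2: repeatedly add the class with the lowest bound with respect
    # to all classes learned so far
    while remaining:
        nxt = min(remaining, key=lambda c: subset_bound(X, y, seq + [c]))
        seq.append(nxt)
        remaining.remove(nxt)
    return init, seq

init_classes, class_sequence = greedy_class_sequence(X_train, y_train)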
Experiment 3: Confirming the Best Class Sequence Determined from Bounds of the Bayes Error.
We used the greedy search to find good sequences for our test dataset of handwritten digits. This results in
lists of good class sequences, which are shown in detail in (Wenzel and Förstner, 2009).
To verify this way of obtaining good class sequences, we show the results of one selected experiment in
Figure 4. The experiment is essentially defined as described in Section 3.1. Again, we trained the learning
method with different randomly chosen sequences of classes, but started training with a fixed set of classes
based on the results of the search for good sequences; here we used classes 1, 3, 5, 6 for initialisation.
In addition to the randomly chosen sequences, we ran an additional trial in which we continued with classes
4, 2, 9, 7, 8, 10, which was one of the 11 suitable sequences found. The error rates of this trial are shown
as the red line in Figure 4. More results, for other sequences found by the greedy search, can be found in
(Wenzel and Förstner, 2009).
In fact, we almost always get the best performance for class sequences derived from the bounds of the Bayes
error.
First, we observe that the error rates of all trials during the initialisation lie in the bottom area of
those from Experiment 2, shown in Figure 3.
Second, after initialisation, i.e., when adding samples from new classes, the error rates of the predefined
class sequences are almost always below the error rates of the random trials, shown as gray lines.
Hence, the estimated bounds of the Bayes error are indeed an appropriate indicator for finding good class
sequences for training an incremental learning method.
Figure 4: Results of testing predefined class ordering. Ex-
periment with a fixed set of classes for initialisation.
4 COMPUTING THE BEST IMAGE SEQUENCE
For the next experiment, we use a Version Space Learning (VSL) module for exploring good sequences of
examples. These experiments are carried out in the domain of interpreting facade scenes (Foerstner, 2007).
Given annotated primitive objects like doors, railings, and windows, the interpretation system combines them
into aggregates like balconies or entrances. Models describing the spatial relations, number, type, and size
of the parts of aggregates are learned with the VSL method.
Learning Method. The VSL module generates a set of possible concept hypotheses for positive examples of a
given aggregate (see (Mitchell, 1977)). A concept hypothesis represents a possible conceptual description of
the real-world aggregates presented as learning examples. As the example set grows, the concept hypotheses
may change. In VSL, the space of possible concept hypotheses VS is implicitly represented by an upper and a
lower bound on their generality: the General Boundary GB contains all maximally general members of VS, the
Specific Boundary SB all maximally specific members of VS.
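As a toy illustration of the version space idea (not the structural VSL module used here), the following sketch maintains a specific hypothesis over simple, hypothetical attribute-value descriptions of an aggregate, generalising an attribute to the wildcard "?" as soon as two positive examples disagree on it. The real module additionally generalises numeric ranges and taxonomically organised spatial relations:

def update_specific_hypothesis(s, example):
    """Minimally generalise the specific hypothesis s so that it also
    covers the new positive example (attribute -> value or '?')."""
    if s is None:                        # first positive example is kept as-is
        return dict(example)
    return {a: (v if example.get(a) == v else "?") for a, v in s.items()}

# two hypothetical positive balcony examples differing in the railing count
s = None
s = update_specific_hypothesis(s, {"n_windows": 1, "n_railings": 1, "has_door": True})
s = update_specific_hypothesis(s, {"n_windows": 1, "n_railings": 2, "has_door": True})
# s is now {"n_windows": 1, "n_railings": "?", "has_door": True}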
Experiment. For analysing the influence of the order of the presented examples on the number of needed
examples, we performed the following experiment. Given is a set S of 8 images, each containing a number of
examples of the aggregate balcony (see Figure 5). Each sequence of images induces a certain number k of
examples that are needed for recognizing all balconies in the images. In general, one could create every
sequence, count k for each, and select the one with the smallest k. This, of course, leads to a large number
of computations. By introducing the following improvements to this brute-force approach (a sketch follows the
list), we reduced the needed effort:
For a subsequence like [F074 F072 F058 F057 F041], the resulting conceptual models are stored.
If two subsequences s_1 and s_2 of S have a common starting subsequence s_s (e.g.
s_1 = [F074 F072 F058 F057 F041] and s_2 = [F074 F072 F058 F057 F036] have
s_s = [F074 F072 F058 F057] as a common starting subsequence), this starting subsequence has to be
interpreted only once. We say s_1 and s_2 are follow-ups of s_s. This reduces the effort for determining k
for s_1 and s_2.
Figure 5: Images F074, F073, F072, F064, F058, F036, F041, F057 with annotated primitives (framed).
If a best order s_l is found for subsequences of a certain length l, a next image i is taken and positioned
after each image of s_l, thus leading to subsequences of length l + 1. For each such subsequence of length
l + 1, k is computed by taking follow-ups into account; in other words, the new image i is moved from the
start to the end of s_l. For all subsequences of a certain length, a minimum k_m is stored. Thus, if the k of
a subsequence of that length exceeds k_m, one can interrupt further computation of that subsequence. This
reduces the number of sequences needed for interpretation.
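The sketch below illustrates the prefix sharing and pruning ideas, though not the exact extension scheme described above (it enumerates permutations rather than extending a best subsequence). learn_from_image() is a hypothetical stand-in for one VSL step that consumes one image and returns the updated concept model together with the number of examples it needed:

from itertools import permutations

def best_sequence(images, learn_from_image, initial_model):
    """Find the image order needing the fewest examples k.  Work is shared
    between orderings with a common starting subsequence (follow-ups) and
    orderings are pruned once they exceed the best k found so far.  Images
    are assumed to be hashable identifiers and learn_from_image is assumed
    to return a fresh model instead of mutating its argument."""
    prefix_cache = {}                    # prefix tuple -> (model, examples used)
    best_k, best_order = float("inf"), None
    for order in permutations(images):
        model, k = initial_model, 0
        for depth, img in enumerate(order):
            prefix = order[:depth + 1]
            if prefix in prefix_cache:   # follow-up of an already seen prefix
                model, k = prefix_cache[prefix]
            else:
                model, used = learn_from_image(model, img)
                k += used
                prefix_cache[prefix] = (model, k)
            if k >= best_k:              # cannot beat the current best order
                break
        else:
            best_k, best_order = k, order
    return best_order, best_k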
In Figure 6, two subsequences are presented, with k = 12 and k = 11. The difference is the order of the last
two images, which makes the right ordering better than the left one.
For the images in Figure 5, we compute the final
ordering shown in Figure 7.
Figure 6: Example of a reduced number of examples: when F036 precedes F041, only one example from F041 is
used for learning; if F036 does not precede F041 (left diagram), two examples are taken from F041.
Figure 7: Sequence for the given images which uses the fewest examples for learning.
Discussion. The experiment shows that the order of the presented examples is crucial for the number of
iterations needed for learning a certain aggregate. The most important criterion seems to be an early set of
examples covering a preferably large variety of aggregate instances. Furthermore, if this early set contains
examples which cover extreme properties, like
small and large objects or the smallest and largest number of a certain object type, the number of iterations
is reduced. This is strongly related to the generalization method of the learning module, which creates
concept descriptions covering all presented examples. If extremes are selected, regular examples are covered
by this generalization. This holds not only for size and number restrictions but also for spatial relations.
Because spatial relations are organized in a taxonomical hierarchy, generalizing relations between objects of
distinct examples leads to the computation of the least common subsumer (lcs) of these relations. If the lcs
has a low taxonomical depth, it will cover more examples which may come up in future iterations.
5 CONCLUSIONS
In this work, we have shown the relevance of the sample ordering used for training for the performance of an
incremental learning method. We called this effect the ordering sensitivity of an incremental learning method.
We have presented experiments to evaluate ordering sensitivity for the tasks of learning models for object
classification and learning structural models describing spatial relations. For the task of finding good
class sequences for training a classification model, we have shown that such sequences can be obtained
beforehand based on the data alone. To this end, we identified the Bayes error as an appropriate measure to
find class sequences that yield the best possible error rates.
For learning structural models, we found rules that empirically describe good image sequences. This way, we
can always achieve the best performance for a given learning method. Furthermore, we can reduce the number of
samples used for learning if the error rate while training a particular class does not drop anymore.
Thus, we have shown the relevance of curriculum learning, which should be investigated for any type of
incremental learning method. Further research should concentrate on finding more measures that define
curricula before training starts, based only on the data but related to the particular learning method.
REFERENCES
Bengio, Y., Louradour, J., Collobert, R., and Weston, J.
(2009). Curriculum learning. In Bottou, L. and
Littman, M., editors, ICML’09, pages 41–48, Mon-
treal. Omnipress.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In CVPR’05, volume 1,
pages 886–893. IEEE Computer Society.
Fisher, D. H. (1987). Knowledge acquisition via incre-
mental conceptual clustering. Machine Learning,
2(2):139–172.
Foerstner, W. (2007). Annotated image database. eTRIMS
EU-Project, Deliverable D1.1.
Fukunaga, K. (1972). Introduction to Statistical Pattern
Recognition. Academic Press, first edition.
Giraud-Carrier, C. (2000). A note on the utility of incre-
mental learning. AI Communications, 13(4):215–223.
Langley, P. (1995). Order effects in incremental learning.
Learning in humans and machines: Towards an inter-
disciplinary learning science.
McKusick, K. B. and Langley, P. (1991). Constraints on tree
structure in concept formation. In IJCAI’91, pages
810–816.
Mitchell, T. (1977). Version spaces: A candidate elimina-
tion approach for rule learning. In IJCAI’77, pages
305–310.
Seewald, A. K. (2005). Digits - a dataset for handwritten
digit recognition. Technical Report TR-2005-27, Aus-
trian Research Institute for Artificial Intelligence.
Talavera, L. and Roure, J. (1998). A buffering strategy
to avoid ordering effects in clustering. In ECML’98,
pages 316–321.
Uray, M., Skocaj, D., Roth, P. M., Bischof, H., and
Leonardis, A. (2007). Incremental LDA Learning
by Combining Reconstructive and Discriminative Ap-
proaches. In BMVC’07, University of Warwick, UK.
Wenzel, S. and Förstner, W. (2009). The role of sequences
for incremental learning. Technical Report TR-IGG-
P-2009-04, Department of Photogrammetry, Univer-
sity of Bonn.