A META-LEARNING METHOD FOR CONCEPT DRIFT
Runxin Wang, Lei Shi, Mícheál Ó Foghlú and Eric Robson
Telecommunications Software & Systems Group, Waterford Institute of Technology, Waterford, Ireland
Keywords: Data Mining, Supervised Learning, Concept Drift, Meta-Learning, Evolving Data.
Abstract: The knowledge hidden in evolving data may change over time; this issue is known as concept drift. It often causes a learning system's prediction accuracy to decrease. Most existing techniques apply ensemble methods to improve learning performance under concept drift. In this paper, we propose a novel meta-learning approach to this issue and develop a method: Multi-Step Learning (MSL). In our method, an MSL learner is structured in a recursive manner, containing all the base learners in a hierarchy, which ensures the learned concepts are traceable. We evaluated MSL and two ensemble techniques on three synthetic datasets that contain a number of drastic concept drifts. The experimental results show that the proposed method generally performs better than the ensemble techniques in terms of prediction accuracy.
1 INTRODUCTION
The Machine Learning (ML) and Data Mining (DM)
communities have for many years been concerned
with the problem of concept drift. Essentially if a
learning model is trained with one set of training data,
the risk is that over time this training set becomes
less relevant to the new data being analysed, and thus
the new concepts drift away from those previously
learned. One practical example is weather prediction,
where the prediction rules vary depending on the sea-
son (Widmer and Kubat, 1996). A survey of concept
drift is provided by Tsymbal (Tsymbal, 2004). Most traditional ML algorithms suffer under concept drift, with prediction accuracy decreasing over time. These algorithms are often called batch algorithms, as they are good at learning from data stored in a batch but become inefficient if the data grow dynamically and are exposed to concept drift.
Figure 1 gives an illustration of the concept drift problem in classification, where the concept has drifted from the first circle (old dataset) to the second (new dataset). Classifier C_1 correctly classifies the data points in the old dataset; however, if C_1 is applied to the new dataset instead of C_2, it can easily be seen that some of the new data points would be misclassified. As shown, C_2 is the appropriate classifier for the new dataset.
One traditional method for dealing with evolving
data is to relearn the knowledge based upon a com-
bined dataset that consists of both the old data and the
new data. However, this method is subject to “catas-
trophic forgetting” (Polikar et al., 2001) which refers
to the issue of learning the new knowledge but forget-
ting the older knowledge. Therefore when presented
with data relating to the original concepts, accurate
prediction may no longer occur. In fact, concept drift requires an ML solution that does not have just one initial training phase, but multiple phases of training, or some form of continuous training (e.g. incremental learning).
Figure 1: Two datasets generated in different time blocks. The square examples and circle examples belong to different classes.
Some ensemble techniques have been proposed to address concept drift. In these techniques, the learning system trains multiple learners over different time windows. Each period (window) observes a number of instances, and the
overall system combines these separate learners to-
gether into an integrated model using majority voting
or weighted majority voting. These approaches are
mainly inspired by Boosting (Freund and Schapire,
1997) and Bagging (Breiman, 1996).
In this paper we apply a meta-learning approach, instead of either a combined dataset or ensemble techniques, to deal with concept drift. Our research focuses on Supervised Learning (SL), where we implement Multi-Step Learning (MSL) to enhance SL performance on data with concept drift. Nearly all of the methods dealing with concept drift can be seen as attempts to improve the performance of a Naïve Bayesian learning algorithm. These algorithms are a good baseline for experimental analysis of a new method, as their accuracies are easy to adjust and therefore leave room for improvement. In our experiments, we compare MSL to a Bayesian combined dataset method and two Bayesian ensemble techniques.
The remainder of this paper is organized as fol-
lows. In Section 2, we discuss the properties of en-
semble techniques, then briefly review some existing
methods proposed to cope with concept drift. Sec-
tion 3 presents our proposed method, MSL, for en-
hancing supervised learning on evolving data. Sec-
tion 4 describes the experimental setup and presents
the performance evaluation. Finally in Section 5 the
key findings are emphasized, and some proposed fu-
ture work is outlined.
2 RELATED WORK
Boosting and Bagging are the two most famous ensemble methods in ML and DM; they have inspired solutions to many learning problems that require ensemble models, such as on-line learning, distributed learning, and incremental learning. They mainly provide two components: a mechanism for utilising instances in the available training sets, and a mechanism for combining the base learners. There are many existing combining rules (Kittler et al., 1998) in the field, of which majority voting (used by Bagging) and weighted majority voting (used by Boosting) are the two most widely used.
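To make the two combining rules concrete, here is a minimal Python sketch (ours, not taken from any of the cited systems); it assumes base learners that expose a predict(x) method returning a class label:

from collections import Counter

def majority_vote(learners, x):
    # Bagging-style rule: every learner casts one equal vote.
    votes = Counter(learner.predict(x) for learner in learners)
    return votes.most_common(1)[0][0]

def weighted_majority_vote(learners, weights, x):
    # Boosting-style rule: each vote counts in proportion to the
    # learner's weight (its estimated reliability).
    scores = {}
    for learner, w in zip(learners, weights):
        label = learner.predict(x)
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)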
The Streaming Ensemble Algorithm (SEA) (Street and Kim, 2001) is a pioneering method for dealing with concept drift in streaming data. It maintains a constant number of classifiers in its ensemble pool and, when a new dataset becomes available, performs majority classification on the new instances. It then re-evaluates the constituent classifiers according to their classification accuracies and replaces those evaluated as underperforming with new classifiers. The overall accuracy is improved by using the updated classifiers.
Beyond SEA, Bifet designed a new streaming ensemble method (Bifet et al., 2009), which enhances supervised learning by using two adapted versions of Bagging: adaptive windowing (ADWIN) and the Adaptive-Size Hoeffding Tree (ASHT). ADWIN is a change detector and ASHT is an incremental anytime decision tree induction algorithm.
Dynamic Weighted Majority (DWM) (Kolter and Maloof, 2007) is a representative method using weighted majority voting. It maintains a set of learners trained on different datasets from different time periods, where each learner has a weight specifying how reliable it is. All the weights are updated over time according to evaluation on the new datasets, and learners with low weights are removed or replaced with new learners. The overall system makes predictions using weighted majority voting among the base learners. One advantage of DWM is that the number of base learners is specified initially and the set of base learners is continuously updated by the training process.
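As a rough sketch of the weight-update idea (simplified from the published DWM, which additionally normalises the weights, adds a new expert when the overall ensemble errs, and trains every expert on each example), a single step might look like:

def dwm_update(learners, weights, x, y_true, beta=0.5, theta=0.01):
    # Multiply the weight of each expert that errs by beta < 1,
    # then drop experts whose weight falls below the threshold theta.
    for i, learner in enumerate(learners):
        if learner.predict(x) != y_true:
            weights[i] *= beta
    keep = [i for i, w in enumerate(weights) if w >= theta]
    return [learners[i] for i in keep], [weights[i] for i in keep]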
There are also other methods for concept drift,
which do not use ensemble techniques. Widmer (Widmer, 1997) was the first to propose tracking context changes through meta-learning. Bach (Bach and Maloof, 2008) proposed using only two learners (paired learners) to cope with all the concepts presented in the datasets.
3 MSL LEARNER
The implementation of our learning system, Multi-Step Learning (MSL), is presented in Algorithm 1. It consists of three types of learners: "old learners" that hold previously learned knowledge, "new learners" that learn the current knowledge, and "meta learners" that learn how to select between them. Generally, once concept drift occurs, an MSL learner is composed of three learners: a new learner, a meta learner and an old learner. An important point to note is that the old learner can itself be an MSL learner. In other words, although the new learner and the meta learner must be single learners, the old learner can be a composite learner. Figure 2 presents the structure of an MSL learner that includes another MSL learner as its component (the old learner). By using this structure, MSL encodes all the learned concepts in a traceable hierarchy, as we will see in the following section.
Zero or Singular Concept Drift in datasets. Initially, the MSL system builds a learner on the first dataset. If
no (zero) concept drift occurs, the MSL system keeps using that learner. However, if the MSL system does discover concept drift in a new dataset, it saves the existing learner as the "old learner" and subsequently trains a "new learner" on the "hard set" (lines 15 and 13), which groups the examples that the current learner fails to predict (line 10). Meanwhile, MSL generates the experience by labeling each example "old" if the MSL learner predicts it correctly and "new" otherwise (lines 11 and 12), and builds a meta learner on this experience (line 14). When provided with instances to analyse, the meta learner receives the instances first and selects the appropriate learner, between the new learner and the old learner, to perform the final prediction. Equation 1 shows how an MSL learner produces its hypothesis on a given example X. A meta learner can be viewed as the public interface of an MSL learner, as it is always the first component to receive examples.
MSL(X) = \begin{cases} C_{old}(X) & \text{if } C_{meta}(X) = \text{old} \\ C_{new}(X) & \text{if } C_{meta}(X) = \text{new} \end{cases}   (1)
Figure 2: The structure of a nested MSL learner. The old learner of the current MSL learner points to a previous MSL learner.
Multiple Concept Drifts may appear in practice when more than one concept occurs within the evolving data, and some of the existing concepts may recur (Widmer and Kubat, 1996) during the lifetime of the data. To cope with multiple concepts, MSL recursively replaces the old learner with the current MSL learner (line 15) and accordingly trains a new learner (line 13) and a meta learner (line 14). Therefore every new concept can be learned while all the previous concepts are preserved. Theoretically, each concept drift produces two learners, a "new learner" and a "meta learner", so the number of base learners generated in MSL is a function of the number of concepts in the data:

|learners| = 2 × (|concepts| − 1) + 1.   (2)

For example, data containing 5 concepts should yield 2 × 4 + 1 = 9 learners.
Algorithm 1: MSL Algorithm.
Input: a sequence of examples.
Output: an MSL learner.
1: while examples are continuously arriving do
2:   for every t time do
3:     Create a dataset d_t collecting the examples within t.
4:   end for
5:   if MSL learner = null then
6:     Build MSL learner on d_t.
7:   else
8:     Evaluate MSL learner on d_t.
9:     if evaluation not pass then
10:      Add all mispredicted examples into d_hard.
11:      Label the mispredicted examples with n, otherwise with o.
12:      Add the labeled examples into d_meta.
13:      Build learner C_new on d_hard.
14:      Build learner C_meta on d_meta.
15:      Set C_old equal to the current MSL learner.
16:      Create a new MSL learner:
         MSL(X) = \begin{cases} C_{old}(X) & \text{if } C_{meta}(X) = o \\ C_{new}(X) & \text{if } C_{meta}(X) = n \end{cases}
17:    end if
18:  end if
19: end while
Because the base learners can only learn from batch data, the MSL system creates a dataset for every t time units so that the learners work on bounded datasets (line 3).
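For illustration, the following minimal Python sketch of Algorithm 1 is ours, not the paper's implementation: it assumes scikit-learn-style base learners (GaussianNB standing in for the Naïve Bayes base learner) and a simple accuracy threshold as the "evaluation pass" test, which Algorithm 1 leaves unspecified; predict_one implements the dispatch of Equation 1.

import copy

import numpy as np
from sklearn.naive_bayes import GaussianNB

class MSLLearner:
    def __init__(self, X, y, base=GaussianNB, threshold=0.8):
        # `threshold` is our stand-in for the unspecified evaluation test.
        self.base, self.threshold = base, threshold
        self.current = base().fit(X, y)   # learner for the first concept
        self.meta = None                  # built only once drift is seen
        self.old = None                   # previous MSL learner, after drift

    def predict_one(self, x):             # the dispatch of Equation 1
        x = np.asarray(x, dtype=float).reshape(1, -1)
        if self.meta is None:
            return self.current.predict(x)[0]
        if self.meta.predict(x)[0] == 0:  # 0 = route to the old learner
            return self.old.predict_one(x[0])
        return self.current.predict(x)[0] # 1 = route to the new learner

    def update(self, X, y):
        # Evaluate on the new dataset d_t; on drift, nest the current state
        # as the old learner and train new and meta learners (lines 8-16).
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        correct = np.array([self.predict_one(xi) for xi in X]) == y
        if correct.mean() >= self.threshold:
            return                        # evaluation passed: keep the learner
        self.old = copy.copy(self)        # snapshot becomes C_old (line 15)
        self.current = self.base().fit(X[~correct], y[~correct])  # C_new on d_hard
        self.meta = self.base().fit(X, (~correct).astype(int))    # C_meta on d_meta

In use, one would build MSLLearner(X0, y0) on the first window, call update(X_t, y_t) for each later window, and call predict_one(x) to classify; each detected drift adds one level to the old-learner hierarchy, matching Equation 2.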
3.1 Complexity Analysis
The running time of MSL depends on the capabilities of the base learners and the input distribution. Let f(n) be the time a base learner requires to train on n examples, and g(n) the time it requires to predict n examples. If one concept changes, in which case two learners are produced, the running time for that dataset is O(2f(n) + g(n)). The running time of MSL on m datasets is O(f(n) + k(2f(n) + g(n))), where k ∈ [0, m] is the number of datasets exhibiting drift and f(n) is the training time on the first dataset. Suppose the number of datasets with changing concepts among the m datasets follows a binomial distribution \binom{m}{i} p^i (1-p)^{m-i} with parameter p; we can then compute the average running time of MSL on m datasets as
E[O] = O\left( f(n) + \sum_{i=0}^{m} \binom{m}{i} p^{i} (1-p)^{m-i} \cdot i \cdot (2f(n) + g(n)) \right)
     = O( f(n) + mp(2f(n) + g(n)) ),

where the simplification uses the binomial mean \sum_{i=0}^{m} i \binom{m}{i} p^{i} (1-p)^{m-i} = mp.
4 EXPERIMENTS AND RESULTS
In our experiments we re-factored the Naïve Bayes implementation in Weka (Hall et al., 2009) and used it as the base learner. For the datasets, we used three synthetic datasets that are widely used in the literature; a detailed description of each is given in the following sections. We mainly evaluated our MSL method against two ensemble techniques, the Streaming Ensemble Algorithm (SEA) and Dynamic Weighted Majority (DWM). To confirm that the datasets were subject to concept drift, we also evaluated the single learner (a learner built on a single dataset) and the combined dataset learner (a learner built on the dataset combining all the available training sets) to see whether their prediction accuracies actually decrease as concepts change.
The general setup for the three types of datasets is as follows: for each type of dataset, we generated 5,500 instances with 10% noise, of which 500 instances were reserved as a test set. The remaining 5,000 instances were collected into 10 training sets, in which 5 concepts appear in total. In other words, after the first dataset is generated, the remaining datasets experience concept drift 4 times. At the beginning of each series of experiments, we made a new benchmark of the concepts to check how dramatically the concepts change in each dataset.
4.1 SEA Data
SEA Data (from the Streaming Ensemble Algorithm) is an artificial dataset that simulates a concept drift environment; it was originally introduced by Street and Kim (Street and Kim, 2001). This data consists of 3 attributes, one of which is irrelevant; the other 2 attributes together define a concept in the following
way: x_1 + x_2 ≤ v, where v is a user-defined value. If
v changes, the concept changes. As shown in Figure 3, the target concept changes suddenly once 1,000 instances are available. The single learner exhibits a jump in error, while the MSL error does not increase as dramatically. Figures 3 and 4 also show that the error rates of all the techniques keep increasing until the fifth concept arrives, at which point every technique reaches its worst performance. Afterwards, nearly all the techniques reduce their error, with MSL achieving the greatest reduction. Overall, MSL and SEA perform similarly on this problem in terms of accuracy, while DWM appears slightly inferior to the others. One problem we discovered in MSL, not visible in the diagrams, is that during the experiments MSL created more learners than expected. Recalling Equation 2, data with 5 concepts should produce 9 learners, yet in fact we obtained 13 ± 2 learners in our experiments. We attribute this to the strict evaluation standard we used, under which MSL tends to conclude that a new concept has arrived and accordingly trains a new learner.
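For reference, one block of SEA data can be generated with a sketch like the following (ours; the [0, 10] attribute range and the thresholds 8, 9, 7 and 9.5 follow the original SEA setup, while sea_block and its parameters are our naming):

import numpy as np

def sea_block(n, v, noise=0.1, seed=0):
    # Three attributes in [0, 10]; the third is irrelevant.
    # The label is 1 iff x1 + x2 <= v; `noise` flips that share of labels.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 10.0, size=(n, 3))
    y = (X[:, 0] + X[:, 1] <= v).astype(int)
    flip = rng.random(n) < noise
    y[flip] = 1 - y[flip]
    return X, y

# Concept drift: vary v between successive blocks.
blocks = [sea_block(500, v, seed=i) for i, v in enumerate([8, 9, 7, 9.5])]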
Figure 3: Experiments on SEA data.
Figure 4: Experiments on SEA data.
4.2 Stagger Data
Stagger Data is another commonly used dataset for
evaluating a learner’s performance on concept drift. It
was first created by Schlimmer and Granger (Schlim-
mer and Granger, 1986). Stagger data consists of three nominal attributes: color, shape and size. A concept is defined by a combination of attribute values; for example, an instance with color = red ∧ shape = circle ∧ size = small belongs to one target concept. The benchmark in Figure 5 confirms that the concept drifts do affect the performance of the techniques involved. On this problem, MSL achieved its best performance in terms of accuracy. As shown in Figure 6, comparing MSL to DWM and SEA, one can
see that MSL outperforms the others in nearly every test block. A new concept is introduced at the test block where 1,500 instances are available; from that point, both DWM and SEA continuously increase their error rates until they reach their highest errors. MSL also increases its error rate from the same test block, but it almost manages to keep the error rate below 0.2 in every test block. With this dataset, the lower error rates suggest that MSL is capable of differentiating between the learned concepts and accordingly selecting the most appropriate learner to perform the prediction. When only two concepts exist (1,500 instances), the MSL learner performs much better than the others. However, as more concepts show up, the differences between the concepts become more difficult to identify, so MSL's error rates increase as well. In this experiment, there were a couple of trials in which we obtained the expected number of base learners in MSL, yet there were still some trials in which we did not.
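For illustration, the three Stagger target concepts and a block generator can be sketched as follows (our sketch; the concept definitions are the ones conventionally attributed to Schlimmer and Granger, and stagger_block is our naming):

import random

COLORS = ['red', 'green', 'blue']
SHAPES = ['circle', 'square', 'triangle']
SIZES = ['small', 'medium', 'large']

# The three classic Stagger target concepts:
def concept_1(x): return x['size'] == 'small' and x['color'] == 'red'
def concept_2(x): return x['color'] == 'green' or x['shape'] == 'circle'
def concept_3(x): return x['size'] in ('medium', 'large')

def stagger_block(n, concept, seed=0):
    # Draw n random instances and label them against one concept;
    # drift is simulated by switching the concept between blocks.
    rng = random.Random(seed)
    block = []
    for _ in range(n):
        x = {'color': rng.choice(COLORS),
             'shape': rng.choice(SHAPES),
             'size': rng.choice(SIZES)}
        block.append((x, int(concept(x))))
    return block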
Figure 5: Experiments on Stagger data.
Figure 6: Experiments on Stagger data.
4.3 Moving Hyperplane Data
Moving Hyperplane Data was originally introduced
by Hulten to test his proposed method VFDT (Hul-
ten et al., 2001), an incremental decision tree learning
algorithm. The concept is defined by the following
equation:
\sum_i w_i x_i ≤ v, where w_i is the weight of the i-th attribute and x_i is the
corresponding value. A concept is changed by moving the hyperplane (weights) of the equation.
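One block of hyperplane data can be sketched as follows (ours; the unit attribute range, the dimensionality and the particular weight perturbation are illustrative choices, not Hulten's exact generator):

import numpy as np

def hyperplane_block(n, w, v, noise=0.1, seed=0):
    # Label is 1 iff sum_i w_i * x_i <= v; moving the weight vector w
    # between blocks moves the hyperplane, i.e. drifts the concept.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, len(w)))
    y = (X @ w <= v).astype(int)
    flip = rng.random(n) < noise
    y[flip] = 1 - y[flip]
    return X, y

w1 = np.full(10, 0.1)
X1, y1 = hyperplane_block(500, w1, v=0.5, seed=1)   # first concept
w2 = w1 + 0.05 * np.linspace(-1, 1, 10)             # drifted weights
X2, y2 = hyperplane_block(500, w2, v=0.5, seed=2)   # drifted concept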
Figure 7 shows that although the error rates of the single learner and the combined dataset learner vary considerably, MSL works stably. In fact, as seen in Figure 8, the ensemble techniques also show stable, good performance on this problem in terms of accuracy. With this dataset, DWM presented the best performance of all the algorithms, as shown in Figure 8. One interesting point in Figure 8 is the large gap at the point where 3,000 instances are available: MSL has an error rate of 0.103, while DWM and SEA have only 0.056 and 0.052 respectively. We examined the related trials and found that the fifth concept appeared at this point. We hypothesise that so many concepts, each with insufficient instances, made MSL unable to differentiate the concepts effectively. Indeed, MSL immediately recovers its accuracy at the following point, where more training data are available, which we believe supports our hypothesis. Again, we counted the learners that MSL produced for this problem and found 13 ± 2 learners generated by MSL, which does not match what we expected. We initially thought this issue would affect the accuracy of MSL; however, with this particular dataset, the error rates of MSL are not bad, ranging from 0.02 to 0.12.
Figure 7: Experiments on Moving Hyperplane data.
Figure 8: Experiments on Moving Hyperplane data.
5 CONCLUSIONS
In this paper, we introduced the technique MSL (Multi-Step Learning) to facilitate effective supervised learning on evolving data with concept drift. Our main contribution is exploring a meta-learning approach to cope with concept drift, which differs from most previous techniques that use
ensemble approaches. MSL constructs its base learners in a recursive fashion, so one MSL learner can hierarchically consist of several trained MSL learners. The number of learners generated is a function of the number of concepts hidden in the evolving data, which allows a user to trace how many concepts have appeared by a certain time period. MSL selects the best learner to perform the final prediction, which also differs from ensemble approaches that make majority decisions. Whether to select the best learner or use a majority decision is a problem studied by Dzeroski and Zenko (Dzeroski and Zenko, 2004).
The proposed method addresses concept drift in data; it would be interesting to investigate how this method works on real datasets that contain concept drift.
ACKNOWLEDGEMENTS
This work received support from the CEARTAS and
SaaSiPA (Software as a Service implementation of
Predictive Analytics) projects funded by Enterprise
Ireland, references IP-2009-0320, CFTD-2007-0225.
REFERENCES
Bach, S. H. and Maloof, M. A. (2008). Paired learners for
concept drift. In 2008 Eighth IEEE International Con-
ference on Data Mining, pages 23–32.
Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., and Gavaldà, R. (2009). New ensemble methods for evolving data streams. In KDD '09, Paris, France. ACM Press.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
Dzeroski, S. and Zenko, B. (2004). Is combining classifiers
better than selecting the best one? Machine Learning,
54:255–273.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10–18.
Hulten, G., Spencer, L., and Domingos, P. (2001). Mining
time-changing data streams. In KDD’01, San Fran-
cisco, CA. ACM Press.
Kittler, J., Hatef, M., Duin, R. P. W., and Matas, J. (1998).
On Combining Classifiers. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 20:226–239.
Kolter, J. and Maloof, M. A. (2007). Dynamic weighted
majority: An ensemble method for drifting concepts.
Journal of Machine Learning Research, 8:2755–2790.
Polikar, R., Udpa, L., Udpa, S. S., and Honavar, V. (2001).
Learn++: An Incremental Learning Algorithm for Su-
pervised Neural Networks. IEEE Trans. on Systems,
Man and Cybernetics. Part C, 31:497–508.
Schlimmer, J. C. and Granger, R. H. (1986). Incremental learning from noisy data. Machine Learning, 1:317–354.
Street, W. N. and Kim, Y. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 377–382. ACM Press.
Tsymbal, A. (2004). The problem of concept drift: Defi-
nitions and related work. Computer Science Depart-
ment, Trinity College Dublin, Technical Report.
Widmer, G. (1997). Tracking context changes through
meta-learning. Machine Learning, 27:259–286.
Widmer, G. and Kubat, M. (1996). Learning in the pres-
ence of concept drift and hidden contexts. Machine
Learning, 23:69–101.