(oracle) is presented with small sets of examples for
labelling. The proposed algorithm is tested on
streams of instances, which is suitable for scenarios
where new instances need to be classified one at a
time, i.e. an incremental and online learning setting.
In this scenario, the goal is to achieve high
performance (in terms of accuracy) while utilizing as
few labelled examples as possible.
This paper is organized as follows. The next
section presents related work. We detail our active
online ensemble method in Section 3. Section 4
describes the experimental evaluation. Finally,
Section 5 concludes the paper.
2 RELATED WORK
Classifiers construct models that describe the
relationship between the observed variables of an
instance and the target label. However, as stated
above, in a data stream setting, the labels may often
be missing, incorrect or late arriving. Further,
labelling requires domain expertise and may be
costly.
Predictive models can be generated using
classification methods. However, the produced
model’s accuracy is highly related to the labelled
instances in the training set. Incorrectly classified
instances can result in inaccurate, or biased models.
Further, a data set may be imbalanced, where one
class dominates another. One suggested solution is
to use active learning to guide the learning process
(Stefanowski and Pachocki, 2009; Muhivumundo
and Viktor, 2011). This type of learning tends to use
the most informative instances in the training set.
Active learning studies how to select the most
informative instances by using multiple classifiers.
Generally, informative examples are identified as the
ones that cause high disagreement among classifiers
(Stefanowski and Pachocki, 2009). Thus, the main
idea is to use the diversity of ensemble learning to
focus the labelling effort. This usually works by
querying users, also known as oracles, for
information about the data. In other words, the
algorithm is initiated with a limited amount of
labelled data. Subsequently, these data are passed to
the learning algorithm as a training set to produce the
first classifier. In each of the following iterations,
the algorithm analyses the remaining unlabelled
instances and presents the most informative ones to
the oracle (a human expert) for labelling. These
labelled examples are added to the training set and
used in the following iteration. This process is
repeated until the user is satisfied or until a specific
stopping criterion is met.
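The iterative process described above can be sketched as follows. This is a minimal pool-based sketch, not the algorithm of any cited work: it assumes a scikit-learn-style classifier, uses simple uncertainty (probability closest to 0.5) as the informativeness measure, and the `oracle` callback standing in for the human expert is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_lab, y_lab, X_pool, oracle,
                         n_iterations=10, batch_size=5):
    """Iteratively query the oracle for the labels of the
    instances the current model is least certain about."""
    model = LogisticRegression()
    for _ in range(n_iterations):
        if len(X_pool) == 0:
            break
        model.fit(X_lab, y_lab)
        # Uncertainty: predicted probability closest to 0.5
        proba = model.predict_proba(X_pool)[:, 1]
        query = np.argsort(np.abs(proba - 0.5))[:batch_size]
        # The oracle (human expert) supplies the true labels,
        # which are added to the training set for the next iteration
        y_new = oracle(X_pool[query])
        X_lab = np.vstack([X_lab, X_pool[query]])
        y_lab = np.concatenate([y_lab, y_new])
        X_pool = np.delete(X_pool, query, axis=0)
    return model.fit(X_lab, y_lab)
```

The loop stops after a fixed number of iterations or when the pool is empty; in practice the stopping criterion could equally be a target accuracy or an exhausted labelling budget.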
Past research in active learning mainly focused
on the pool-based scenario. In this scenario, a large
number of unlabelled instances need to be labelled.
The main objective is to identify the best subset to
be labelled and used as a training set (Sculley,
2007a; Chu et al., 2011). A prominent example of
this idea is the Query-by-Committee method, an
effective active learning approach that is widely
applied to instance labelling. Initially, a pool of
unlabelled data is available, from which instances
are selected and presented to the oracle for
labelling. A committee of classifiers is trained
and models are generated based on the current
training data. The samples selected for labelling are
those with the highest level of disagreement among
the individual classifiers. In pool-based scenarios, the
unlabelled data are collected in the candidate pool.
However, in a data stream setting, maintaining the
candidate pool may prove challenging, as a large
amount of data may arrive at high speed.
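The Query-by-Committee selection step described above can be illustrated with vote entropy as the disagreement measure. This is only a sketch under our own assumptions, not an implementation from the cited works: the committee members are decision trees trained on bootstrap samples, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_committee(X, y, n_members=5, seed=0):
    """Train committee members on bootstrap samples of the labelled data."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))
        committee.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return committee

def qbc_select(committee, X_pool, n_queries=5):
    """Return indices of the pool instances with the highest
    vote entropy, i.e. the most disagreement among members."""
    votes = np.array([clf.predict(X_pool) for clf in committee])
    entropies = np.empty(X_pool.shape[0])
    for i in range(X_pool.shape[0]):
        _, counts = np.unique(votes[:, i], return_counts=True)
        p = counts / len(committee)
        entropies[i] = -np.sum(p * np.log(p))
    # Query the highest-entropy (most contested) instances first
    return np.argsort(entropies)[::-1][:n_queries]
```

Unanimous predictions yield zero entropy, so instances on which the committee splits evenly are queried first.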
One of the main challenges in data stream active
learning is to reflect the underlying data distribution.
Such a problem may be solved by using active
learning to balance the distribution of the incoming
data in order to increase the model accuracy
(Zliobaite et al., 2014). The distribution is adapted
over time by redistributing the labelling weight as
opposed to actively labelling new instances.
Learn++ (Polikar et al., 2001) is another algorithm
that employs incremental ensemble learning in order
to learn from data streams.
Also, traditional active learning methods require
many passes over the unlabelled data in order to
select the informative ones (Sculley, 2007a). This can
create storage and computational bottlenecks in
data stream and big data settings. Thus, the active
learning process needs to be modified for the online
setting.
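A common online adaptation is single-pass selective sampling under a labelling budget: each instance is seen once, and a label is requested only when the current model is uncertain and the budget allows. The following is a minimal sketch of this general idea, using incremental training via scikit-learn's `partial_fit` and a margin-based uncertainty test; the budget handling is deliberately simplified and is not the strategy of any cited work.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def stream_active_learner(stream, oracle, budget=0.2,
                          threshold=1.0, classes=(0, 1)):
    """Single-pass selective sampling over a stream of instances."""
    model = SGDClassifier(random_state=0)
    seen, queried = 0, 0
    for x in stream:
        x = x.reshape(1, -1)
        seen += 1
        if queried < 2:
            # Bootstrap: label the first instances unconditionally
            model.partial_fit(x, oracle(x), classes=list(classes))
            queried += 1
            continue
        # Uncertain if the instance lies close to the decision boundary
        margin = abs(model.decision_function(x)[0])
        if margin < threshold and queried / seen < budget:
            model.partial_fit(x, oracle(x))
            queried += 1
    return model, queried, seen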
Another scenario was proposed by Zhu et al.
(2007) to address the data distribution associated
with a data stream. Recall that a data stream has a
dynamic data distribution because of the continuous
arrival of data. In data stream mining, it
is unrealistic to build a single model based on all
examples. To address this problem, Zhu et al.
(2007) proposed an ensemble active learning
classifier with the goal of minimizing the ensemble
variance in order to guide the labelling process. One
of the main objectives of active learning is to decide
the labels of newly arrived instances. According to the
proposed framework in (Zhu et al., 2007), the
KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval