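Table 2: Final accuracy and cost according to different strategies.

| Dataset | Measure | Relabel | Discard | Discard and relabel | Weight | Weight and relabel |
|---|---|---|---|---|---|---|
| Optdigits | Accuracy | 97.49% | 96.21% | 96.43% | 96.66% | 97.21% |
| Optdigits | N_1 | 290 | 432 | 359 | 412 | 364 |
| Optdigits | N_2 | 92 | 0 | 20 | 0 | 20 |
| Pendigits | Accuracy | 98.11% | 94.02% | 97.31% | 96.36% | 97.88% |
| Pendigits | N_1 | 307 | 464 | 420 | 420 | 363 |
| Pendigits | N_2 | 105 | 0 | 20 | 0 | 20 |
| Letters-recognition | Accuracy | 85.98% | 53.95% | 66.79% | 80.26% | 85.52% |
| Letters-recognition | N_1 | 914 | 1242 | 956 | 1537 | 1344 |
| Letters-recognition | N_2 | 211 | 0 | 20 | 0 | 20 |
| Documents | Accuracy | 96.15% | 81.84% | 90.61% | 94.46% | 96.0% |
| Documents | N_1 | 406 | 413 | 424 | 701 | 646 |
| Documents | N_2 | 104 | 0 | 20 | 0 | 20 |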
belling” strategy. However, once the cost of relabelling the
mislabelled instances is taken into account, the “full relabelling”
strategy has a higher overall cost than the other strategies, since
every instance identified as mislabelled is relabelled.
Finally, we can conclude that although the “hybrid weighting and
relabelling” strategy has a low relabelling cost, it achieves a final
classification accuracy close to that of the “full relabelling”
strategy. Therefore, if only a limited relabelling budget is available,
it should be devoted to relabelling instances with a high
informativeness I.
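The budget recommendation above can be illustrated with a minimal Python sketch. It assumes that an informativeness score I has already been computed for each suspected mislabelled instance and that the candidates are held in a buffer, so they can be ranked offline; this is a simplification of the streaming setting. The function name, the scores, and the budget value are hypothetical and serve only as an illustration.

```python
from typing import List


def select_for_relabelling(informativeness: List[float], budget: int) -> List[int]:
    """Return the indices of the `budget` candidates with the highest score I.

    `informativeness[i]` is assumed to be the precomputed informativeness I
    of the i-th suspected mislabelled instance (hypothetical input).
    """
    if budget <= 0:
        return []
    # Rank candidate indices from most to least informative.
    ranked = sorted(
        range(len(informativeness)),
        key=lambda i: informativeness[i],
        reverse=True,
    )
    # Spend the relabelling budget on the top-ranked candidates only.
    return ranked[:budget]


# Illustrative scores for five suspected instances (made-up values).
scores = [0.12, 0.87, 0.45, 0.91, 0.30]
print(select_for_relabelling(scores, budget=2))  # [3, 1]
```

Instances that are not selected would then be handled by the cheaper option of the hybrid strategy (weighting or discarding) rather than being sent back to the labeller.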
6 CONCLUSION AND FUTURE WORK
In this paper we addressed label noise detection and mitigation in
stream-based active learning for classification. To identify
potentially mislabelled instances, we proposed a mislabelling
likelihood based on the disagreement among the probabilities and the
quantity of information that the instance carries for the predicted and
the queried class labels. We then derived an informativeness measure
that reflects how useful a queried label would be if it were corrected.
Our experiments on real datasets show that the proposed mislabelling
likelihood characterizes label noise more effectively than the commonly
used entropy measure. The experimental evaluation also shows that
potentially mislabelled instances with highly conflicting information
are worth relabelling.
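For reference, the entropy baseline mentioned above is the Shannon entropy of a classifier's predicted class probabilities. The following Python sketch assumes the model outputs a normalized probability vector for each instance; the function name and the example vectors are illustrative and are not taken from the experiments.

```python
import math
from typing import Sequence


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a predicted class distribution.

    Zero-probability classes contribute nothing, so they are skipped
    to avoid evaluating log2(0).
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


# A confident prediction has low entropy; a uniform one has the maximum.
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(prediction_entropy(confident) < prediction_entropy(uniform))  # True
```

Entropy depends only on the predicted distribution and ignores the label returned by the labeller, whereas the mislabelling likelihood summarized above also uses the queried class label.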
Nonetheless, one limitation of the current hybrid label noise
mitigation strategy is that it requires a threshold on the
informativeness measure I. This threshold depends on the data, and
adapting it automatically is one of our perspectives. As future work,
we want to minimize the correction cost by defining and optimizing a
multi-objective function that combines (i) the mislabelling likelihood,
(ii) the informativeness, and (iii) the cost of relabelling instances.
In addition, we observed in the current work that manually relabelling
a few instances chosen according to their informativeness I can improve
results. Determining the number of labelled instances required to
approach the accuracy obtained when all instances are relabelled
remains future work.