On Selecting Helpful Unlabeled Data for Improving Semi-Supervised Support Vector Machines

Thanh-Binh Le, Sang-Woon Kim


Recent studies have demonstrated that Semi-Supervised Learning (SSL) approaches, which use both labeled and unlabeled data, are more effective and robust than approaches that use only labeled data. However, it is also well known that unlabeled data does not always help SSL algorithms. To select a small number of helpful unlabeled samples, various selection criteria have therefore been proposed in the literature. One criterion is based on the prediction of an ensemble classifier and the pairwise similarity between training samples. However, because this criterion considers only the distance information among the samples, it sometimes does not work appropriately, particularly when the unlabeled samples are near the boundary. To address this concern, a method of training semi-supervised support vector machines (S3VMs) using a selection criterion is investigated; this method is a modified version of that used in SemiBoost. In addition to the quantities of the original criterion, the estimated conditional class probability is used to first compute the confidence values of the unlabeled data. Then, the unlabeled samples with higher confidence values are selected and, together with the labeled data, used to retrain the ensemble classifier. The experimental results, obtained using artificial and real-life benchmark datasets, demonstrate that the proposed mechanism can compensate for the shortcomings of traditional S3VMs and, compared with previous approaches, can achieve improved classification accuracy.
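
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the Gaussian similarity, the way the similarity mass of each class is weighted by the estimated class probability, and the helper names `rbf_similarity`, `confidence_scores`, and `select_confident` are all hypothetical and are not taken from the paper or from SemiBoost.

```python
import math


def rbf_similarity(a, b, sigma=1.0):
    """Gaussian similarity between two feature vectors (an assumed choice)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def confidence_scores(labeled, labels, unlabeled, class_prob, sigma=1.0):
    """Return a (confidence, pseudo_label) pair for each unlabeled sample.

    labeled    : list of feature vectors with known labels
    labels     : list of +1 / -1 labels, aligned with `labeled`
    unlabeled  : list of feature vectors without labels
    class_prob : callable x -> estimated P(y = +1 | x), supplied by the caller
    """
    results = []
    for x in unlabeled:
        # Similarity mass towards each class (distance information only).
        p = sum(rbf_similarity(x, z, sigma)
                for z, y in zip(labeled, labels) if y == 1)
        q = sum(rbf_similarity(x, z, sigma)
                for z, y in zip(labeled, labels) if y == -1)
        # Hypothetical combination: weight each mass by the estimated
        # conditional class probability, so that samples near the boundary
        # (probability close to 0.5, p close to q) receive low confidence.
        prob_pos = class_prob(x)
        score = p * prob_pos - q * (1.0 - prob_pos)
        results.append((abs(score), 1 if score >= 0 else -1))
    return results


def select_confident(unlabeled, scored, k):
    """Pick the k most confident unlabeled samples with their pseudo-labels."""
    ranked = sorted(zip(unlabeled, scored),
                    key=lambda item: item[1][0], reverse=True)
    return [(x, label) for x, (_, label) in ranked[:k]]
```

In this sketch the selected samples and their pseudo-labels would be appended to the labeled set before the classifier is retrained; the retraining itself is omitted because it depends on the chosen S3VM implementation.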



Paper Citation

in Harvard Style

Le T. and Kim S. (2014). On Selecting Helpful Unlabeled Data for Improving Semi-Supervised Support Vector Machines. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 48-59. DOI: 10.5220/0004810500480059

in Bibtex Style

author={Thanh-Binh Le and Sang-Woon Kim},
title={On Selecting Helpful Unlabeled Data for Improving Semi-Supervised Support Vector Machines},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},

in EndNote Style

JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - On Selecting Helpful Unlabeled Data for Improving Semi-Supervised Support Vector Machines
SN - 978-989-758-018-5
AU - Le T.
AU - Kim S.
PY - 2014
SP - 48
EP - 59
DO - 10.5220/0004810500480059