Speeding up Support Vector Machines - Probabilistic versus Nearest Neighbour Methods for Condensing Training Data

Moïri Gamboni, Abhijai Garg, Oleg Grishin, Seung Man Oh, Francis Sowani, Anthony Spalvieri-Kruse, Godfried T. Toussaint, Lingliang Zhang

Abstract

Several methods for reducing the running time of support vector machines (SVMs) are compared in terms of speed-up factor and classification accuracy using seven large real-world datasets obtained from the UCI Machine Learning Repository. All the methods tested reduce the size of the training data before it is fed to the SVM. Two probabilistic methods that run in linear time with respect to the size of the training data are investigated: blind random sampling and a new method for guided random sampling (Gaussian Condensing). These methods are compared with k-Nearest Neighbour methods for reducing the size of the training set and for smoothing the decision boundary. For all the datasets tested, blind random sampling gave the best results for speeding up SVMs without significantly sacrificing classification accuracy.
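
The abstract does not give an implementation, so the following is only a minimal sketch of what blind random sampling could look like: a uniformly random subset of the training set is drawn without inspecting features or labels, and the reduced set would then be passed to whatever SVM trainer is in use. The function name `blind_random_sample`, the `fraction` and `seed` parameters, and the toy data are assumptions made for illustration and do not come from the paper.

```python
import random


def blind_random_sample(X, y, fraction, seed=0):
    """Return a uniformly random subset of (X, y), drawn without replacement.

    Hypothetical helper: the selection ignores both features and labels,
    which is what makes the sampling "blind".
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    n_keep = max(1, round(fraction * len(X)))
    rng = random.Random(seed)
    # Pick n_keep distinct indices; each training point is equally likely.
    keep = rng.sample(range(len(X)), n_keep)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 1000 two-dimensional points with binary labels.
data_rng = random.Random(42)
X = [(data_rng.random(), data_rng.random()) for _ in range(1000)]
y = [int(a + b > 1.0) for a, b in X]

# Keep 10% of the training set; X_small and y_small would then be
# handed to the SVM in place of the full X and y.
X_small, y_small = blind_random_sample(X, y, fraction=0.1, seed=1)
print(len(X_small), len(y_small))
```

Because the subset is chosen without looking at the data, the cost of the reduction step depends only on the number of points, not on their dimensionality or class structure.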



Paper Citation


in Harvard Style

Gamboni M., Garg A., Grishin O., Oh S. M., Sowani F., Spalvieri-Kruse A., Toussaint G. T. and Zhang L. (2014). Speeding up Support Vector Machines - Probabilistic versus Nearest Neighbour Methods for Condensing Training Data. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 364-371. DOI: 10.5220/0004927003640371


in Bibtex Style

@conference{icpram14,
author={Moïri Gamboni and Abhijai Garg and Oleg Grishin and Seung Man Oh and Francis Sowani and Anthony Spalvieri-Kruse and Godfried T. Toussaint and Lingliang Zhang},
title={Speeding up Support Vector Machines - Probabilistic versus Nearest Neighbour Methods for Condensing Training Data},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={364-371},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004927003640371},
isbn={978-989-758-018-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Speeding up Support Vector Machines - Probabilistic versus Nearest Neighbour Methods for Condensing Training Data
SN - 978-989-758-018-5
AU - Gamboni M.
AU - Garg A.
AU - Grishin O.
AU - Oh S. M.
AU - Sowani F.
AU - Spalvieri-Kruse A.
AU - Toussaint G. T.
AU - Zhang L.
PY - 2014
SP - 364
EP - 371
DO - 10.5220/0004927003640371