MINING VERY LARGE DATASETS WITH SVM AND VISUALIZATION

Thanh-Nghi Do, François Poulet

Abstract

We present a new support vector machine (SVM) algorithm and graphical methods for mining very large datasets. We develop an active selection of training data points that can significantly reduce the size of the training set for SVM classification. We summarize massive datasets into interval data and adapt the RBF kernel used by the SVM algorithm to handle this interval data. We keep only the data points corresponding to support vectors and the representative data points of the non-support vectors; the SVM algorithm then uses this subset to construct the non-linear model. We also use interactive graphical methods to help explain the SVM results. The graphical representation of IF-THEN rules extracted from the SVM models can be easily interpreted by humans, giving the user a deeper understanding of how the SVM models behave on the data. Numerical test results are obtained on real and artificial datasets.
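The abstract does not give the formula of the adapted kernel. As a rough illustration only, the sketch below assumes one plausible adaptation: each group of points is summarized as a per-dimension (low, high) interval, and the Euclidean distance inside the RBF kernel is replaced by a Hausdorff-style distance between intervals. The function names, the choice of distance, and the `gamma` parameter are all assumptions made for this example, not details taken from the paper.

```python
import math

def summarize_to_interval(points):
    """Collapse a group of points into one interval vector.

    Returns a list of (low, high) pairs, one per dimension.
    This summarization scheme is an assumption for illustration.
    """
    return [(min(dim), max(dim)) for dim in zip(*points)]

def interval_distance_sq(a, b):
    """Squared distance between two interval vectors.

    Per dimension, uses the Hausdorff distance between two closed
    intervals, max(|a_lo - b_lo|, |a_hi - b_hi|), then sums the squares.
    Assumed here; the abstract does not name the distance.
    """
    total = 0.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b):
        d = max(abs(a_lo - b_lo), abs(a_hi - b_hi))
        total += d * d
    return total

def interval_rbf_kernel(a, b, gamma=0.5):
    """RBF kernel exp(-gamma * d^2) evaluated on interval vectors."""
    return math.exp(-gamma * interval_distance_sq(a, b))

# Two hypothetical groups of 2-D points, each summarized as intervals.
group_a = summarize_to_interval([(0.0, 1.0), (1.0, 3.0)])
group_b = summarize_to_interval([(2.0, 2.0), (4.0, 5.0)])
similarity = interval_rbf_kernel(group_a, group_b)
```

When every interval is degenerate (low equals high), this distance reduces to the ordinary Euclidean distance, so the sketch falls back to the standard RBF kernel on plain points.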



Paper Citation


in Harvard Style

Do T. and Poulet F. (2005). MINING VERY LARGE DATASETS WITH SVM AND VISUALIZATION. In Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 972-8865-19-8, pages 127-134. DOI: 10.5220/0002548601270134


in Bibtex Style

@conference{iceis05,
author={Thanh-Nghi Do and François Poulet},
title={MINING VERY LARGE DATASETS WITH SVM AND VISUALIZATION},
booktitle={Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2005},
pages={127-134},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002548601270134},
isbn={972-8865-19-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - MINING VERY LARGE DATASETS WITH SVM AND VISUALIZATION
SN - 972-8865-19-8
AU - Do T.
AU - Poulet F.
PY - 2005
SP - 127
EP - 134
DO - 10.5220/0002548601270134