Weighting and Sampling Data for Individual Classifiers and Bagging with Genetic Algorithms

Sašo Karakatič, Marjan Heričko, Vili Podgorelec

Abstract

An imbalanced or otherwise inappropriate dataset can have a negative influence on classification model training. In this paper we present an evolutionary method that weights or samples the tuples from the training dataset in order to minimize the negative effects of inappropriate datasets. A genetic algorithm with a real-valued genotype is used to evolve the weight or occurrence count for each learning tuple in the dataset. The technique is applied both with individual classifiers and in combination with the ensemble technique of bagging, where multiple classification models work together in the classification process. We present two variations: weighting the tuples and sampling the tuples. Both variations are experimentally tested with individual classifiers (the C4.5 and Naive Bayes methods) and with the bagging ensemble. The results show that both variations are promising, as they produced better classification models than the corresponding methods without weighting or sampling, a finding that is also supported by statistical analysis.
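The abstract describes the method only at a high level. A minimal sketch of the general idea follows, with assumed details throughout: the toy data, the GA parameters, the one-point crossover and per-gene mutation operators, and the weighted nearest-centroid classifier (standing in for C4.5 or Naive Bayes) are all illustrative, not the authors' exact setup.

```python
# Sketch: a GA whose real-valued genotype holds one weight per training
# tuple; fitness is the validation accuracy of a classifier trained with
# those instance weights. All details here are assumptions for illustration.
import random

random.seed(0)

# Toy 1-D training data: (feature, class). The outlier (9.0, 0) should
# receive a low weight so it does not drag class 0's centroid upward.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (9.0, 0),
         (5.0, 1), (6.0, 1), (7.0, 1)]
valid = [(0.5, 0), (1.5, 0), (4.0, 1), (5.5, 1), (6.5, 1)]

def centroids(weights):
    """Weighted per-class means over the training tuples."""
    cents = {}
    for c in {y for _, y in train}:
        num = sum(w * x for w, (x, y) in zip(weights, train) if y == c)
        den = sum(w for w, (_, y) in zip(weights, train) if y == c)
        cents[c] = num / den if den > 0 else 0.0
    return cents

def fitness(weights):
    """Validation accuracy of the weighted nearest-centroid model."""
    cents = centroids(weights)
    hits = sum(1 for x, y in valid
               if min(cents, key=lambda c: abs(x - cents[c])) == y)
    return hits / len(valid)

def evolve(pop_size=20, gens=30, mut_rate=0.2):
    """Evolve one weight per training tuple; keep the top half as elite."""
    n = len(train)
    pop = [[random.random() for _ in range(n)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, n)        # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n):                  # per-gene mutation
                if random.random() < mut_rate:
                    child[i] = random.random()
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

best = evolve()
print("best validation accuracy:", fitness(best))
```

The sampling variation in the paper can be read as the same loop with integer occurrence counts in place of real-valued weights; for bagging, each base model of the ensemble would be trained on data weighted or resampled this way.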

References

  1. Angiulli, F. (2005). Fast condensed nearest neighbor rule. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 25-32, New York, NY, USA. ACM.
  2. Bezdek, J. C. and Kuncheva, L. I. (2001). Nearest prototype classifier designs: An experimental study. International Journal of Intelligent Systems, 16(12):1445-1473.
  3. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123-140.
  4. Cano, A., Zafra, A., and Ventura, S. (2013). Weighted data gravitation classification for standard and imbalanced data. Cybernetics, IEEE Transactions on, 43(6):1672-1687.
  5. Cano, J. R., Herrera, F., and Lozano, M. (2003). Using evolutionary algorithms as instance selection for data reduction in kdd: an experimental study. Evolutionary Computation, IEEE Transactions on, 7(6):561-575.
  6. Cateni, S., Colla, V., and Vannucci, M. (2014). A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing, 135:32-41.
  7. Chou, C.-H., Kuo, B.-H., and Chang, F. (2006). The generalized condensed nearest neighbor rule as a data reduction method. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 2, pages 556-559. IEEE.
  8. Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1-15. Springer.
  9. Freund, Y., Schapire, R. E., et al. (1996). Experiments with a new boosting algorithm. In ICML, volume 96, pages 148-156.
  10. García-Pedrajas, N. and Pérez-Rodríguez, J. (2012). Multiselection of instances: A straightforward way to improve evolutionary instance selection. Applied Soft Computing, 12(11):3590-3602.
  11. Holland, J. H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press.
  12. Japkowicz, N. (2000). The class imbalance problem: Significance and strategies. In Proc. of the Int'l Conf. on Artificial Intelligence. Citeseer.
  13. Japkowicz, N. and Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis.
  14. John, G. H. and Langley, P. (1995). Estimating continuous distributions in bayesian classifiers. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 338-345. Morgan Kaufmann Publishers Inc.
  15. Kim, K.-j. (2006). Artificial neural networks with evolutionary instance selection for financial forecasting. Expert Systems with Applications, 30(3):519-526.
  16. Kotsiantis, S. and Pintelas, P. (2003). Mixture of expert agents for handling imbalanced data sets. Annals of Mathematics, Computing & Teleinformatics, 1(1):46-55.
  17. Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced data sets: One sided sampling. In Proc. of the Int'l Conf. on Machine Learning.
  18. Kuncheva, L. I. and Bezdek, J. C. (1998). Nearest prototype classification: clustering, genetic algorithms, or random search? Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 28(1):160-164.
  19. Lichman, M. (2013). UCI machine learning repository.
  20. Lindenbaum, M., Markovitch, S., and Rusakov, D. (2004). Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2):125-152.
  21. Liu, H. (2010). Instance selection and construction for data mining. Springer-Verlag.
  22. Liu, J.-F. and Yu, D.-R. (2007). A weighted rough set method to address the class imbalance problem. In Machine Learning and Cybernetics, 2007 International Conference on, volume 7, pages 3693-3698.
  23. Liu, X.-Y., Li, Q.-Q., and Zhou, Z.-H. (2013). Learning imbalanced multi-class data with optimal dichotomy weights. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 478-487.
  24. Olvera-López, J. A., Carrasco-Ochoa, J. A., Martínez-Trinidad, J. F., and Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review, 34(2):133-143.
  25. Quinlan, J. R. (1993). C4.5: programs for machine learning. Elsevier.
  26. Stefanowski, J. and Wilk, S. (2008). Selective preprocessing of imbalanced data for improving classification performance. In Song, I.-Y., Eder, J., and Nguyen, T., editors, Data Warehousing and Knowledge Discovery, volume 5182 of Lecture Notes in Computer Science, pages 283-292. Springer Berlin Heidelberg.
  27. Ting, K. M. (2002). An instance-weighting method to induce cost-sensitive trees. Knowledge and Data Engineering, IEEE Transactions on, 14(3):659-665.
  28. Tsai, C.-F., Eberle, W., and Chu, C.-Y. (2013). Genetic algorithms in feature and instance selection. Knowledge-Based Systems, 39:240-247.
  29. Wilson, D. R. and Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine learning, 38(3):257-286.
  30. Zhao, H. (2008). Instance weighting versus threshold adjusting for cost-sensitive classification. Knowledge and Information Systems, 15(3):321-334.


Paper Citation


in Harvard Style

Karakatič S., Heričko M. and Podgorelec V. (2015). Weighting and Sampling Data for Individual Classifiers and Bagging with Genetic Algorithms. In Proceedings of the 7th International Joint Conference on Computational Intelligence - Volume 1: ECTA, ISBN 978-989-758-157-1, pages 180-187. DOI: 10.5220/0005592201800187


in Bibtex Style

@conference{ecta15,
author={Sašo Karakatič and Marjan Heričko and Vili Podgorelec},
title={Weighting and Sampling Data for Individual Classifiers and Bagging with Genetic Algorithms},
booktitle={Proceedings of the 7th International Joint Conference on Computational Intelligence - Volume 1: ECTA},
year={2015},
pages={180-187},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005592201800187},
isbn={978-989-758-157-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Computational Intelligence - Volume 1: ECTA
TI - Weighting and Sampling Data for Individual Classifiers and Bagging with Genetic Algorithms
SN - 978-989-758-157-1
AU - Karakatič S.
AU - Heričko M.
AU - Podgorelec V.
PY - 2015
SP - 180
EP - 187
DO - 10.5220/0005592201800187