RESAMPLING BASED ON STATISTICAL PROPERTIES OF DATA SETS

Julia Bondarenko

Abstract

In imbalanced data sets, the majority (negative) and minority (positive) classes are far from equally represented, which impedes accurate classification. Well-balanced data sets assume a uniform distribution. The approach presented in this paper balances non-uniform data sets by directed oversampling of minority class objects with simultaneous undersampling of majority class objects, and relies on certain statistical criteria. The resampling procedure is applied to daily traffic injury data sets. The results show improved identification of rare cases (positive class objects) according to several performance measures.
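The general idea of combining oversampling with undersampling can be illustrated with a minimal sketch. It is not the procedure of the paper: the abstract does not state the statistical criteria that direct the resampling, so the sketch falls back on plain random resampling. The function name `rebalance`, the target size (the mean of the two class sizes), and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def rebalance(samples, labels, minority_label, seed=0):
    """Randomly oversample the minority class and undersample the majority
    class so that both end up with the same number of objects.

    Assumption: a binary problem in which every label other than
    `minority_label` belongs to the majority class, and the minority class
    is no larger than the majority class.
    """
    rng = random.Random(seed)
    minority = [(s, y) for s, y in zip(samples, labels) if y == minority_label]
    majority = [(s, y) for s, y in zip(samples, labels) if y != minority_label]
    if not minority or not majority:
        raise ValueError("both classes must be present")
    if len(minority) > len(majority):
        raise ValueError("minority class is larger than majority class")

    # Assumed target: the mean of the two class sizes (rounded down).
    target = (len(minority) + len(majority)) // 2

    # Oversampling: keep every minority object, then draw duplicates
    # with replacement until the target size is reached.
    extra = rng.choices(minority, k=target - len(minority))
    new_minority = minority + extra

    # Undersampling: draw a subset of the majority without replacement.
    new_majority = rng.sample(majority, target)

    combined = new_minority + new_majority
    rng.shuffle(combined)
    new_samples = [s for s, _ in combined]
    new_labels = [y for _, y in combined]
    return new_samples, new_labels


if __name__ == "__main__":
    # Toy data: 10 positive (minority) and 90 negative (majority) objects.
    xs = list(range(100))
    ys = [1] * 10 + [0] * 90
    new_xs, new_ys = rebalance(xs, ys, minority_label=1)
    print(Counter(new_ys))
```

A directed variant would replace the two random draws with a selection rule driven by a statistical criterion; which criterion the paper uses cannot be determined from the abstract.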

Paper Citation


in Harvard Style

Bondarenko J. (2009). RESAMPLING BASED ON STATISTICAL PROPERTIES OF DATA SETS. In Proceedings of the 6th International Conference on Informatics in Control, Automation and Robotics - Volume 3: ICINCO, ISBN 978-989-8111-99-9, pages 143-148. DOI: 10.5220/0002171701430148


in BibTeX Style

@conference{icinco09,
author={Julia Bondarenko},
title={RESAMPLING BASED ON STATISTICAL PROPERTIES OF DATA SETS},
booktitle={Proceedings of the 6th International Conference on Informatics in Control, Automation and Robotics - Volume 3: ICINCO},
year={2009},
pages={143-148},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002171701430148},
isbn={978-989-8111-99-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Informatics in Control, Automation and Robotics - Volume 3: ICINCO
TI - RESAMPLING BASED ON STATISTICAL PROPERTIES OF DATA SETS
SN - 978-989-8111-99-9
AU - Bondarenko J.
PY - 2009
SP - 143
EP - 148
DO - 10.5220/0002171701430148