POP: A Parallel Optimized Preparation of Data for Data Mining

Christian Ernst, Youssef Hmamouche, Alain Casali

2015

Abstract

Because data preparation has a substantial impact on data mining results, we propose an original framework for automatically preparing the data of any given database. For each attribute of the database, our research focuses on two points: (i) specifying an optimized outlier detection method, and (ii) identifying the most appropriate discretization method. Concerning the former, we show that the detection of an outlier depends on whether the data distribution is normal. Concerning the latter, what matters is the shape of the density function of the attribute's distribution law. We therefore propose an automatic selection of the optimized discretization method based on a multi-criteria (Entropy, Variance, Stability) evaluation. Processing is performed in parallel using multicore capabilities. The experiments validate our approach and show that the best discretization method is not always the same.
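
The per-attribute pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the Jarque-Bera statistic as the normality check, a three-sigma rule for normal data and Tukey fences otherwise, two candidate discretizations (equal width and equal frequency), and a score that combines only entropy and within-bin variance (the stability criterion is omitted). All function names are hypothetical. Each attribute needs at least two values.

```python
import math
import statistics
from concurrent.futures import ThreadPoolExecutor

JB_CRITICAL = 5.991  # chi-square(2) critical value at the 5% level (assumed threshold)


def jarque_bera(values):
    """Jarque-Bera statistic: large values suggest a non-normal distribution."""
    n = len(values)
    mean = statistics.fmean(values)
    m2, m3, m4 = (sum((v - mean) ** p for v in values) / n for p in (2, 3, 4))
    if m2 == 0:  # constant attribute: nothing to test
        return 0.0
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


def remove_outliers(values):
    """Pick the outlier rule from the normality check (assumed rules)."""
    if jarque_bera(values) < JB_CRITICAL:
        # Looks normal: keep values within three standard deviations.
        mean, sd = statistics.fmean(values), statistics.stdev(values)
        low, high = mean - 3 * sd, mean + 3 * sd
    else:
        # Not normal: Tukey fences built on the quartiles.
        q1, _, q3 = statistics.quantiles(values, n=4)
        low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in values if low <= v <= high]


def equal_width(values, k):
    """Assign each value to one of k bins of identical width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * k // len(values)
    return bins


def score(values, bins, k):
    """Normalized entropy (balance) minus the within-bin share of variance."""
    n = len(values)
    groups = {}
    for v, b in zip(values, bins):
        groups.setdefault(b, []).append(v)
    entropy = -sum(len(g) / n * math.log(len(g) / n) for g in groups.values())
    within = sum(statistics.pvariance(g) * len(g) for g in groups.values())
    total = statistics.pvariance(values) * n
    balance = entropy / math.log(k) if k > 1 else 0.0
    return balance - (within / total if total else 0.0)


METHODS = {"equal_width": equal_width, "equal_frequency": equal_frequency}


def prepare_attribute(values, k=3):
    """Clean one attribute, then return the best-scoring method and its bins."""
    cleaned = remove_outliers(values)
    scored = {name: score(cleaned, m(cleaned, k), k) for name, m in METHODS.items()}
    best = max(scored, key=scored.get)
    return best, METHODS[best](cleaned, k)


def prepare(table, k=3):
    """Process every attribute of a {name: values} table as a separate task."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda col: prepare_attribute(col, k), table.values())
        return dict(zip(table.keys(), results))
```

Calling `prepare({"age": [...], "income": [...]})` returns, for each attribute, the name of the selected method and the bin index of each retained value. A thread pool keeps the sketch short; in CPython, spreading the work over several cores would require a process pool such as `concurrent.futures.ProcessPoolExecutor` with a picklable task function.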



Paper Citation


in Harvard Style

Ernst C., Hmamouche Y. and Casali A. (2015). POP: A Parallel Optimized Preparation of Data for Data Mining. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015) ISBN 978-989-758-158-8, pages 36-45. DOI: 10.5220/0005594700360045


in Bibtex Style

@conference{kdir15,
author={Christian Ernst and Youssef Hmamouche and Alain Casali},
title={POP: A Parallel Optimized Preparation of Data for Data Mining},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)},
year={2015},
pages={36-45},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005594700360045},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)
TI - POP: A Parallel Optimized Preparation of Data for Data Mining
SN - 978-989-758-158-8
AU - Ernst C.
AU - Hmamouche Y.
AU - Casali A.
PY - 2015
SP - 36
EP - 45
DO - 10.5220/0005594700360045