TOWARDS A UNIFIED STRATEGY FOR THE PREPROCESSING STEP IN DATA MINING
Camelia Vidrighin Bratu, Rodica Potolea
2009
Abstract
Data-related issues are the main obstacle to a high-quality data mining process. Existing preprocessing strategies usually focus on a single aspect, such as incompleteness, dimensionality, or filtering out “harmful” attributes. In this paper we propose a unified methodology for data preprocessing that considers several aspects at the same time. The novelty of the approach lies in enhancing the data imputation step with information from the feature selection step and performing both operations jointly, as two phases of the same activity. The methodology performs data imputation only on the attributes that are optimal for the class (from the feature selection point of view). Imputation is performed using machine learning methods: when imputing values for a given attribute, the optimal feature subset for that attribute is considered. The methodology is not restricted to a particular technique and can be applied with any existing data imputation and feature selection methods.
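The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract leaves the concrete techniques open, so the sketch assumes a simple correlation-ranking filter for feature selection and a k-nearest-neighbour mean for imputation. All names (`top_k`, `knn_impute`, `preprocess`, `k_class`, `k_attr`) and the toy data are hypothetical. It also assumes that the predictors for an attribute may be drawn from all other attributes, which the abstract does not specify.

```python
from math import sqrt

def pearson(xs, ys):
    """Absolute Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov / sqrt(vx * vy))

def column(rows, j):
    return [r[j] for r in rows]

def top_k(rows, target, candidates, k):
    """Assumed filter: rank candidate columns by correlation with target, keep k."""
    ranked = sorted(candidates,
                    key=lambda j: pearson(column(rows, j), target),
                    reverse=True)
    return ranked[:k]

def knn_impute(rows, j, predictors, n_neighbors=2):
    """Assumed imputer: fill column j with the mean of the nearest donor rows."""
    donors = [r for r in rows
              if r[j] is not None and all(r[p] is not None for p in predictors)]
    for r in rows:
        if r[j] is None and donors and all(r[p] is not None for p in predictors):
            nearest = sorted(
                donors,
                key=lambda d: sum((d[p] - r[p]) ** 2 for p in predictors),
            )[:n_neighbors]
            r[j] = sum(d[j] for d in nearest) / len(nearest)

def preprocess(rows, labels, k_class=2, k_attr=1):
    rows = [list(r) for r in rows]          # work on a copy
    n_cols = len(rows[0])
    # Phase 1: attributes judged relevant for the class.
    selected = top_k(rows, labels, range(n_cols), k_class)
    # Phase 2: impute only those attributes, each from its own feature subset.
    for j in selected:
        if any(r[j] is None for r in rows):
            others = [c for c in range(n_cols) if c != j]
            predictors = top_k(rows, column(rows, j), others, k_attr)
            knn_impute(rows, j, predictors)
    return sorted(selected), rows

# Toy data (made up): column 1 tracks column 0; column 2 is weakly class-related.
data = [
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 3.0],
    [3.0, None, 8.0],
    [4.0, 40.0, 1.0],
    [5.0, 50.0, None],
    [6.0, 60.0, 2.0],
]
labels = [0, 0, 0, 1, 1, 1]
selected, filled = preprocess(data, labels)
```

In this toy run the missing value in column 1 is filled because column 1 is selected for the class, while the gap in column 2 is left alone because that attribute is filtered out. Any other selector or imputer could be substituted for `top_k` and `knn_impute` without changing the overall flow.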
Paper Citation
in Harvard Style
Vidrighin Bratu C. and Potolea R. (2009). TOWARDS A UNIFIED STRATEGY FOR THE PREPROCESSING STEP IN DATA MINING. In Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 978-989-8111-85-2, pages 230-235. DOI: 10.5220/0002008902300235
in Bibtex Style
@conference{iceis09,
author={Camelia Vidrighin Bratu and Rodica Potolea},
title={TOWARDS A UNIFIED STRATEGY FOR THE PREPROCESSING STEP IN DATA MINING},
booktitle={Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2009},
pages={230-235},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002008902300235},
isbn={978-989-8111-85-2},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - TOWARDS A UNIFIED STRATEGY FOR THE PREPROCESSING STEP IN DATA MINING
SN - 978-989-8111-85-2
AU - Vidrighin Bratu C.
AU - Potolea R.
PY - 2009
SP - 230
EP - 235
DO - 10.5220/0002008902300235