Figure 2: Classification accuracy using different versions
of the training set – attribute MaxHeartRate.
5 CONCLUSIONS
Among the best-known data preprocessing strategies
are feature selection and the handling of incomplete
data, for each of which a variety of techniques exists.
Previous results on the data imputation step alone
show that predicting the attributes that are strongly
correlated with the class can improve learning
accuracy. Wrapper feature selection has also been
shown to boost the performance of an inducer.
In this paper we propose a new methodology for
preprocessing the training set. Its novelty resides in
combining the feature selection step with data
imputation in order to obtain an improved version of
the training set. The main goal is to boost
classification accuracy (i.e., to improve the learning
step). The methodology is simple and generic, which
makes it suitable for a wide range of application
domains in which particular feature selection
schemes or data imputation procedures may be
preferred.
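As a concrete illustration of the combined strategy, the sketch below shows one possible particularization (not the exact procedure evaluated in the paper): mean imputation of the missing values, followed by greedy wrapper forward selection with a leave-one-out 1-NN inducer. The function names and the toy data are ours, chosen only to make the idea executable; any imputation procedure, search strategy, and inducer could be substituted.

```python
# Sketch of the combined preprocessing idea: impute missing values first,
# then run wrapper feature selection on the imputed training set
# (the "FS after imputation" ordering). Illustrative only.
import math

def mean_impute(rows):
    """Replace None entries with the mean of the observed column values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[r[j] if r[j] is not None else means[j] for j in range(n_cols)]
            for r in rows]

def loo_accuracy(rows, labels, feats):
    """Leave-one-out accuracy of a 1-NN classifier on the chosen features."""
    correct = 0
    for i in range(len(rows)):
        best_d, pred = math.inf, None
        for k in range(len(rows)):
            if k == i:
                continue
            d = sum((rows[i][j] - rows[k][j]) ** 2 for j in feats)
            if d < best_d:
                best_d, pred = d, labels[k]
        correct += pred == labels[i]
    return correct / len(rows)

def wrapper_forward_selection(rows, labels):
    """Greedily add the feature that most improves the inducer's accuracy."""
    remaining = set(range(len(rows[0])))
    chosen, best_acc = [], 0.0
    while remaining:
        acc, f = max((loo_accuracy(rows, labels, chosen + [f]), f)
                     for f in remaining)
        if acc <= best_acc:          # stop when no feature helps
            break
        chosen.append(f)
        remaining.discard(f)
        best_acc = acc
    return chosen, best_acc

# Toy data: feature 0 separates the classes, feature 1 is noise,
# and a few entries are missing (None).
X = [[1.0, 7.0], [1.2, 0.5], [None, 6.0],
     [5.0, 6.5], [5.2, None], [4.8, 0.2]]
y = [0, 0, 0, 1, 1, 1]

X_imp = mean_impute(X)                            # step 1: imputation
feats, acc = wrapper_forward_selection(X_imp, y)  # step 2: feature selection
```

On this toy set the wrapper keeps only the informative feature, which is exactly the intended effect: the imputed, reduced training set is what the final inducer would then be trained on.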
We have performed a number of evaluations of the
combined methodology on benchmark datasets. The
results indicate that preprocessing the training set
enhances the accuracy of the final model. The new
methodology we have introduced is more successful
than the individual steps it combines, producing
results similar or even superior to those obtained
with complete data.
However, just as in the case of classifiers
(Moldovan et al., 2007), there is no single best
preprocessing particularization for a given dataset.
There is therefore a need to assess a baseline
performance using several approaches, and to
develop a semi-automated procedure for tuning the
preprocessing method to a given problem. This is
one of our current objectives. Another future
development of the methodology aims at handling
more complex patterns of incompleteness, closer to
those encountered in real-life data sets.
ACKNOWLEDGEMENTS
The work presented in this paper has been supported
by the Romanian Ministry for Education and Research,
through grant no. 12080/01.10.2008 – SEArCH.
REFERENCES
Cheeseman, P., Stutz, J., 1995. “Bayesian classification
(AutoClass): Theory and results”, Advances in
Knowledge Discovery and Data Mining. Menlo Park,
CA: AAAI Press, pp. 153–180.
Freund, Y., Schapire, R., 1997. “A decision-theoretic
generalization of on-line learning and an application to
boosting”, Journal of Computer and System Sciences,
55(1):119–139.
Georgieva, P., 2008. “MLP and RBF algorithms”, Summer
School on Neural Networks and Support Vector
Machines, Porto, 7-11 July.
Hall, M.A., 2000. Correlation-based Feature Selection for
Machine Learning. Doctoral dissertation, Department
of Computer Science, The University of Waikato,
Hamilton, New Zealand.
Kohavi, R., John, G. H., 1997. “Wrappers for feature subset
selection”, Artificial Intelligence, 97(1-2), pp. 273-324.
Little, R.J.A., Rubin, D.B., 1987. Statistical Analysis with
Missing Data, J. Wiley & Sons, New York.
Moldovan, T., Vidrighin, B.C., Giurgiu, I. and Potolea, R.,
2007. "Evidence Combination for Baseline Accuracy
Determination". Proceedings of the 3rd ICCP 2007,
Cluj-Napoca, Romania, pp. 41-48.
Nilsson, R., 2007. Statistical Feature Selection, with
Applications in Life Science, PhD Thesis, Linköping
University.
UCI Machine Learning Data Repository,
http://archive.ics.uci.edu/ml/ , last accessed Dec. 2008
Vidrighin, B.C., Potolea, R., Petrut, B., 2007. "New
Complex Approaches for Mining Medical Data",
Proc. of the WCMD, ICCP 2007, pp 1-10.
Vidrighin, B. C., Muresan, T., Potolea, R., 2008a.
“Improving Classification Performance on Real Data
through Imputation”, Proc. of the 2008 IEEE AQTR,
Romania, Vol. 3, pp. 464-469.
Vidrighin, B. C., Potolea, R., 2008b. “Towards a
Combined Approach to Feature Selection”, In Proc. of
the 3rd ICSOFT 2008, Porto, Portugal.
Vidrighin, B. C., Muresan, T., Potolea, R., 2008c.
“Improving Classification Accuracy through Feature
Selection”, Proc. of the 4th IEEE ICCP 2008, pp. 25-32.
Witten, I., Frank, E., 2005. Data Mining: Practical
machine learning tools and techniques, 2nd edition,
Morgan Kaufmann.
[Figure 2 plot omitted: Cleveland dataset, attribute MaxHeartRate – accuracy (%) vs. % incomplete data (5%–30%), for FSAfterI, FSBeforeI and the original data.]
TOWARDS A UNIFIED STRATEGY FOR THE PREPROCESSING STEP IN DATA MINING