Rubin, 2002) give a good overview. Imputing missing values affects the quantity of data required, the risk of over-fitting, the data distribution and the performance of the learning algorithm. Errors in the data are harder to spot, but some of the methods used for imputation can also be used for error detection.
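To make the connection concrete, the following is a minimal sketch of how the same simple column model can both fill gaps and flag candidate errors. The function names and the threshold are ours for illustration; none of the cited works prescribe this exact scheme, and a real project would consider multiple imputation (Little and Rubin, 2002) to better preserve the data distribution.

```python
from statistics import mean, pstdev

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values.

    Single mean imputation shrinks the column's variance, which is one
    way in which imputation can distort the data distribution.
    """
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in column]

def flag_suspect_values(column, n_stds=2.0):
    """Flag observed values far from the column mean as candidate errors.

    The same summary used to fill gaps (the mean and spread of the
    observed values) doubles as a crude plausibility score for the
    values that are present.
    """
    observed = [x for x in column if x is not None]
    mu, sigma = mean(observed), pstdev(observed)
    return [x is not None and abs(x - mu) > n_stds * sigma
            for x in column]
```

A distance of two standard deviations is an arbitrary cut-off; a distribution-aware model would serve both tasks better, which is precisely the paper's point that imputation and error detection share machinery.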
4.1.3 Feature Selection
Feature selection is another well-studied field with many proposed techniques; see (Gheyas and Smith, 2010) for a recent example. We suggest that these methods would benefit from being viewed in the light of the other data suitability issues listed here, including considerations such as feature independence.
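As a toy illustration of a filter that weighs both relevance and feature independence, consider a greedy correlation filter. This is a sketch under our own assumptions (the thresholds and function names are invented for illustration and this is not the algorithm of Gheyas and Smith, 2010):

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def select_features(features, target, relevance=0.5, redundancy=0.9):
    """Keep features correlated with the target, dropping any that are
    near-duplicates of a feature already selected.

    The redundancy check is a crude stand-in for the feature
    independence consideration discussed in the text.
    """
    selected = {}
    for name, column in features.items():
        if abs(pearson(column, target)) < relevance:
            continue  # too weakly related to the target
        if any(abs(pearson(column, kept)) > redundancy
               for kept in selected.values()):
            continue  # near-duplicate of a feature already kept
        selected[name] = column
    return list(selected)
```

Pairwise correlation only catches linear redundancy; the broader data suitability view argued for here would also account for joint dependencies among features.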
4.1.4 Data Quantity
The issues listed above all affect the quantity of data required for a successful machine learning project. Although solving the problems of data quality would mean that data quantity is not an issue in itself, quantity remains a useful measure of suitability while the other aspects of data quality are only partially understood.
5 CONCLUSIONS
The majority of the time and resources on most professional data mining projects are consumed by data preparation: dealing with outliers, missing values, abnormal distributions, data errors, insufficient data quantities, ill-posed data, co-dependent inputs and a range of other issues.
This paper does not argue that such data preparation, cleaning and verification do not take place, nor that these issues are ignored by the research community. It argues that algorithms for dealing with them are as important as algorithms for machine learning and inference, and should therefore constitute much more of the research in the field and a larger proportion of the content of teaching, textbooks and software.
We would like to see the data mining community
make more use of neural computing based methods
and we believe that an improved approach to data
suitability will encourage that to happen.
ACKNOWLEDGEMENTS
Thanks to Prof. Leslie Smith for his help in preparing this paper.
REFERENCES
Bishop, C. M., 2006. Pattern recognition and machine
learning. Springer.
Bouckaert, R. R., Frank, E., Hall, M. A., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H., 2010. WEKA—experiences with a Java open-source project. Journal of Machine Learning Research, 11:2533–2541.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C.
J., 1984. Classification and regression trees.
Wadsworth.
Dasu, T., Johnson, T., 2003. Exploratory data mining and
data cleaning. Wiley-Interscience.
Dreyfus, G., 2005. Neural networks: methodology and
applications. Springer.
Du, H., 2010. Data Mining Techniques and Applications:
An Introduction. Cengage Learning.
Gheyas, I. A. and Smith, L. S., 2010. Feature subset
selection in large dimensionality domains. Pattern
Recognition. 43. Elsevier.
Hand, D. J., Yu, K., 2001. Idiot’s Bayes—not so stupid after all? Int. Stat. Rev., 69:385–398. International Statistical Institute.
Haykin, S. S., 1994. Neural networks: a comprehensive
foundation. Macmillan.
Hertz, J., Krogh, A. and Palmer, R. G., 1991. Introduction
to the theory of neural computation. Santa Fe institute
studies in the sciences of complexity: Lecture notes.
Westview Press.
Japkowicz, N. and Stephen, S., 2002. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6:429–449.
Little, R. J. A. and Rubin, D. B., 2002. Statistical Analysis
with Missing Data. Wiley.
Madnick, S. E., Wang, R. Y., Yang, W. L. and Hongwei, Z., 2009. Overview and Framework for Data and Information Quality Research. ACM Journal of Data and Information Quality, 1(1). ACM.
Pyle, D., 1999. Data preparation for data mining. Morgan
Kaufmann.
Quinlan, J. R., 1993. C4.5: Programs for machine learning. Morgan Kaufmann.
Swingler, K., 1996. Applying neural networks: a practical
guide. Academic Press.
Tang, H., Tan, K. C. and Zhang, Y., 2007. Neural
networks: computational models and applications.
Springer.
Vapnik, V., 1995. The nature of statistical learning
theory. Springer.
Witten, I. H. and Frank, E., 2005. Data mining: practical machine learning tools and techniques. Morgan Kaufmann.
NCTA 2011 - International Conference on Neural Computation Theory and Applications