on full training instances and trees built on cleaned training instances.
The average observed reduction in tree size is substantial: 41.40%. The accuracy column displays the absolute difference in accuracy between the trees built on each data set; the average difference in favour of the cleaned data set is 19.13%. Comparing the accuracies, we can argue that evolutionary trees built on cleaned data sets are more accurate: the proposed method achieves an average classification accuracy of 93.73%, compared to an average of 74.63% on data containing outliers. Thus, we can argue that robust, efficient evolutionary trees improve classifier performance.
6 CONCLUSION
Based on a statistical approach, this paper proposed a technique for handling outliers in data. When the outliers are removed, the induced patterns become more accurate and considerably simpler. The results we obtained validate the use of the proposed techniques for this task. Furthermore, when compared to previous approaches on the same data, the results clearly outperform them, even at the same level of erroneous data. The proposed algorithm employs an evolutionary decision tree as a filter classifier for the training data, pursuing a global search in the problem space with classification accuracy as the fitness function while avoiding local optima; the final classifier is then trained on the cleaned data set. This combination of techniques yields a robust and efficient classifier.
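The evolutionary filter itself is not reproduced here, but the two-stage filter-then-train pattern described above can be sketched. As a minimal illustration, the snippet below substitutes a simple z-score rule for the statistical outlier filter; the function name and the threshold of 3 standard deviations are assumptions for this sketch, not details taken from the paper:

```python
import statistics

def remove_outliers(instances, threshold=3.0):
    """Drop any instance with a feature value more than `threshold`
    population standard deviations from that feature's mean.
    A simple z-score stand-in for the paper's statistical filter."""
    n_features = len(instances[0])
    # Per-feature mean and population standard deviation.
    means = [statistics.mean(col) for col in zip(*instances)]
    stdevs = [statistics.pstdev(col) for col in zip(*instances)]
    cleaned = []
    for row in instances:
        # Keep the instance only if every feature is within the threshold
        # (a zero-variance feature cannot flag an outlier).
        if all(stdevs[j] == 0
               or abs(row[j] - means[j]) / stdevs[j] <= threshold
               for j in range(n_features)):
            cleaned.append(row)
    return cleaned
```

The final classifier would then be trained on the cleaned instances rather than the raw ones, mirroring the two-stage design described above.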
Improving Classification Accuracy in Using Evolutionary Decision Tree Filtering in Big Datasets