
on full training instances and trees built on cleaned training instances. Testing was performed on the full training instances, while the trees were built on the cleaned training instances.
The average observed percentage reduction in tree size is substantial: 41.40%. The accuracy column reports the absolute difference in accuracy between trees built on the full and on the cleaned training instances; the average difference for a cleaned data set is 19.13%. Comparing the accuracies, we can argue that evolutionary trees built on cleaned data sets are more accurate: the proposed work achieves an average classification accuracy of 93.73%, compared to an average of 74.63% on data containing outliers. Thus, we can argue that robust, efficient evolutionary trees improve classifier performance.
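To make the comparison concrete, the following minimal Python sketch computes the two reported quantities, tree-size reduction and accuracy difference. It is illustrative only: scikit-learn's DecisionTreeClassifier stands in for the evolutionary tree, a shallow first-pass tree stands in for the proposed outlier filter, and the data set and parameters are assumptions, not the paper's experimental setup.

    # Sketch: compare trees built on full vs. cleaned training instances.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    def size_and_accuracy(X_train, y_train):
        # Build a tree and report its node count and test accuracy.
        tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
        return tree.tree_.node_count, tree.score(X_te, y_te)

    size_full, acc_full = size_and_accuracy(X_tr, y_tr)

    # Placeholder cleaning step: keep only instances that a shallow
    # first-pass tree classifies correctly, treating the rest as outliers.
    first_pass = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    keep = first_pass.predict(X_tr) == y_tr
    size_clean, acc_clean = size_and_accuracy(X_tr[keep], y_tr[keep])

    print(f"tree-size reduction: {100 * (size_full - size_clean) / size_full:.2f}%")
    print(f"accuracy difference: {100 * (acc_clean - acc_full):.2f}%")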
6  CONCLUSION 
Based on a statistical approach, this paper proposed a technique for dealing with outliers in data. When the outliers are removed, the induced patterns become more accurate and considerably simpler. The results we obtained validate the use of the proposed technique for this task. Furthermore, when compared to previous approaches on the same data, the results clearly outperform them, even at the same level of erroneous data. The proposed algorithm employs an evolutionary decision tree as a filter classifier for the training data: it pursues a global search of the problem space with classification accuracy as the fitness function while avoiding local optima, and the final classifier is then trained on the cleaned data set. This combination of techniques yields a robust and efficient classifier.
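As an illustration of this filter-then-train pipeline, the sketch below implements a deliberately tiny, mutation-only evolutionary search over axis-parallel trees, using training accuracy as the fitness function, and then trains the final classifier on the cleaned set. It is not the paper's implementation: the tree representation, operators, parameters, and data set are all simplifying assumptions.

    import random
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    def random_tree(X, y, depth=3):
        # Grow a random axis-parallel tree: ('leaf', c) or ('split', f, t, L, R).
        if depth == 0 or random.random() < 0.3:
            return ('leaf', int(random.choice(np.unique(y))))
        f = random.randrange(X.shape[1])
        t = float(random.choice(X[:, f]))
        return ('split', f, t,
                random_tree(X, y, depth - 1), random_tree(X, y, depth - 1))

    def predict_one(tree, x):
        while tree[0] == 'split':
            _, f, t, left, right = tree
            tree = left if x[f] <= t else right
        return tree[1]

    def fitness(tree, X, y):
        # Classification accuracy on the training set is the fitness function.
        preds = np.array([predict_one(tree, x) for x in X])
        return float(np.mean(preds == y))

    def mutate(tree, X, y):
        # Replace a randomly chosen subtree with a freshly grown one.
        if tree[0] == 'leaf' or random.random() < 0.3:
            return random_tree(X, y, depth=2)
        _, f, t, left, right = tree
        if random.random() < 0.5:
            return ('split', f, t, mutate(left, X, y), right)
        return ('split', f, t, left, mutate(right, X, y))

    def evolve_filter_tree(X, y, pop_size=30, generations=40):
        pop = [random_tree(X, y) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda t: fitness(t, X, y), reverse=True)
            survivors = pop[:pop_size // 2]  # truncation selection
            pop = survivors + [mutate(random.choice(survivors), X, y)
                               for _ in range(pop_size - len(survivors))]
        return max(pop, key=lambda t: fitness(t, X, y))

    X, y = load_breast_cancer(return_X_y=True)
    filter_tree = evolve_filter_tree(X, y)

    # Filter step: instances the evolved tree misclassifies are treated
    # as likely outliers and removed from the training set.
    keep = np.array([predict_one(filter_tree, x) for x in X]) == y
    print(f"kept {int(keep.sum())} of {len(y)} training instances")

    # The final classifier is trained on the cleaned data set.
    final_classifier = DecisionTreeClassifier(random_state=0).fit(X[keep], y[keep])

A full evolutionary tree learner would add crossover and larger populations, but the structure of the pipeline, evolve a filter tree by accuracy, drop the instances it misclassifies, then train the final classifier on what remains, is the same.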