The reason for choosing different datasets is to
investigate whether classification performance is hampered
by the characteristics of the dataset. For instance,
missing values in a dataset are a common form of attribute
noise, so we deliberately choose some datasets with
missing values and examine how the classifiers behave in
that scenario. In addition, we focus on imbalanced
datasets. In such datasets, the class distribution can vary
from a slight bias to a severe imbalance, where there may
be one example in the minority class for hundreds, thousands,
or millions of examples in the majority class or
classes. For instance, the class distribution of the wdbc
dataset is 62.74% for the positive class and 37.26% for the
negative class. In imbalanced data, the minority class is
more sensitive to noise than the majority class. Therefore,
we include both balanced and imbalanced datasets in our
experiments. Fig. 1 illustrates
the class distribution for each UCI dataset included in
our experiments.
We also include large training datasets: CIFAR-10,
MNIST, and Fashion-MNIST. For example, CIFAR-10 contains
60,000 images in total, 50,000 of which form the training
set (Table 2; a loading sketch follows the table).
Table 2: Characteristics of Image Datasets.

Dataset          Training set   Test set   Classes   Balanced?
MNIST            60,000         10,000     10        No
Fashion-MNIST    60,000         10,000     10        Yes
CIFAR-10         50,000         10,000     10        Yes
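As an illustration, the split sizes in Table 2 can be checked with the Keras dataset loaders; the sketch below assumes TensorFlow/Keras is available and is not necessarily the data pipeline used in our experiments.

```python
# Illustrative loading of the three image datasets via the Keras dataset API
# (assumes TensorFlow is installed; any equivalent loader would do).
from tensorflow.keras.datasets import mnist, fashion_mnist, cifar10

for name, loader in [("MNIST", mnist), ("Fashion-MNIST", fashion_mnist),
                     ("CIFAR-10", cifar10)]:
    (x_train, y_train), (x_test, y_test) = loader.load_data()
    # Prints 60000/10000 for MNIST and Fashion-MNIST, 50000/10000 for CIFAR-10.
    print(name, x_train.shape[0], x_test.shape[0])
```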
4.2 Experimental Setup
In our experimental setup, we divide the UCI datasets
into three categories:
• Balanced dataset: Each class has an equal distribution
(iris and segmentation datasets)
• Slightly imbalanced dataset: The distribution of
classes is uneven by a small amount. In our setting,
if the majority-to-minority class ratio is between 1:1
and 69:1, we define the dataset as slightly imbalanced
(credit card, spect, glass, wdbc, wine, and dermatology
datasets)
• Highly imbalanced dataset: The distribution of classes
is uneven by a large amount. In our setting, if the
majority-to-minority class ratio is greater than 70:1,
we define the dataset as highly imbalanced (ecoli and
yeast datasets)
To evaluate the effect of noise on classification
performance, we use two evaluation metrics (a short
computation sketch follows this list):
• AUROC: The AUROC computes the area under the ROC curve,
which plots the true positive rate against the false
positive rate at various threshold settings. In our
experiments, we use AUROC for balanced and slightly
imbalanced datasets because it is insensitive to the
class distribution.
• AUPRC: The AUPRC is defined as the average of the
precision scores calculated at each recall threshold. We
use AUPRC for highly imbalanced datasets because it
focuses mainly on the positive (minority) class.
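A minimal sketch of how the two metrics can be computed, assuming scikit-learn (the library choice is an assumption, and the labels and scores below are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Illustrative ground-truth labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

auroc = roc_auc_score(y_true, y_score)            # balanced / slightly imbalanced sets
auprc = average_precision_score(y_true, y_score)  # highly imbalanced sets
print(auroc, auprc)
```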
In the case of the image datasets, we evaluate the
performance of the deep neural network with its loss
function; the loss function used in our experiments is
categorical cross-entropy.
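For reference, a minimal NumPy sketch of categorical cross-entropy for one-hot targets (the labels and probabilities below are illustrative, not taken from our experiments):

```python
import numpy as np

# Categorical cross-entropy: the negative log-probability of the true class,
# averaged over the batch.
y_true = np.array([[1, 0, 0], [0, 1, 0]])               # one-hot labels
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # predicted probabilities

loss = -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))
print(loss)  # approximately 0.29
```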
We split each dataset into training and test sets. To
preserve the percentage of samples of each class in every
fold, we use a variation of K-fold cross-validation named
stratified K-fold. We set K to 10: lower values of K
result in a noisy estimate of model performance, while
much larger values increase the computational cost, so
K = 10 offers a reasonable trade-off.
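A minimal sketch of this stratified 10-fold split, assuming scikit-learn (the features and labels below are placeholders):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder imbalanced data: 80 negative and 20 positive instances.
X = np.random.rand(100, 4)
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Each fold keeps roughly the 80/20 class proportions of the full dataset.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```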
When simulating class noise, the training dataset is
corrupted with varying degrees of noise while the test
dataset is kept clean, which allows us to evaluate the
true performance of the classifier. For class noise,
labels are corrupted at random at rates of 10%, 20%, 30%,
40%, and 50%. We restrict the noise level to at most 50%
of the training instances because, in realistic
situations, only certain classes are likely to be
mislabeled. For attribute noise, we use Gaussian noise
with zero mean and two different variance values, 0.5 and
0.7. For the image datasets, we vary the variance from
0.1 to 0.9: with a small variance, a noisy image can still
yield good performance and the distortion level is
minimal, so we want to observe the performance under a
range of noise variances. For the image datasets, we
corrupt the training data with Gaussian noise and evaluate
on the clean test data.
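The two noise models can be sketched as follows; this is an illustrative implementation, with function names and placeholder data chosen here for exposition only.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_class_noise(y, rate, n_classes):
    """Flip the labels of a randomly chosen `rate` fraction of instances."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    for i in idx:
        # Replace the original label with a different, randomly chosen class.
        y_noisy[i] = rng.choice([c for c in range(n_classes) if c != y[i]])
    return y_noisy

def inject_attribute_noise(X, variance):
    """Add zero-mean Gaussian noise with the given variance to all attributes."""
    return X + rng.normal(0.0, np.sqrt(variance), size=X.shape)

# Placeholder training data; only the training set is corrupted.
X_train = np.random.rand(200, 4)
y_train = np.array([0] * 150 + [1] * 50)

y_train_30 = inject_class_noise(y_train, rate=0.30, n_classes=2)   # 30% class noise
X_train_05 = inject_attribute_noise(X_train, variance=0.5)         # variance-0.5 attribute noise
```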
For class noise evaluation, we use the Weka tool
(Eibe et al., 2016). It is a free software tool for data
mining tasks. In the case of attribute noise evalua-
tion, we implemented the noise injection model and
the classifiers in Python 3.5. The parameter settings
for the five classifiers are as follows: J48 (confidence
factor C=0.25), NB (batchSize=100, useKernelEstimator=False),
SVM (kernel: polynomial kernel with degree 1, tolerance
parameter=0.001), k-NN (k=1, distance: Euclidean distance),
and RF (bagSizePercent=100, maxDepth=0, numIterations=100).
All the experiments were run on macOS Big Sur with a 3.1
GHz CPU and 8 GB of memory.
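For reference, an approximate scikit-learn analogue of the five classifiers with the settings above; this mapping is an assumption, since the Weka options listed do not translate one-to-one to scikit-learn parameters.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    # C4.5/J48 has no exact sklearn counterpart; an entropy-based tree is the closest stand-in.
    "J48": DecisionTreeClassifier(criterion="entropy"),
    "NB": GaussianNB(),                                    # no kernel density estimation
    "SVM": SVC(kernel="poly", degree=1, tol=0.001),
    "k-NN": KNeighborsClassifier(n_neighbors=1, metric="euclidean"),
    "RF": RandomForestClassifier(n_estimators=100, max_depth=None),
}
```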