have different performance even under the same level of noise. To reach the desired accuracy at low cost, we have also shown that it is more important to have a large dataset, even with a high level of noise, than a small, clean one, because most ML algorithms need large amounts of training data to perform well. We have further shown that desirable ML performance can be achieved at a low labeling cost by using ensemble learning, since ensembles are more resilient and robust to label noise.
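As an illustration of this last point, the following is a minimal sketch in Python with scikit-learn; the digits dataset, the 30% symmetric noise rate, and the bagging-of-trees ensemble are illustrative assumptions rather than our exact experimental setup. It flips a fraction of the training labels uniformly at random and compares a single decision tree against a bagging ensemble, both evaluated on clean test labels:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Illustrative dataset; any labeled classification set would do.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Inject symmetric label noise: replace 30% of the training labels
# with classes drawn uniformly at random (assumed noise model).
noise_rate = 0.3
flip = rng.random(len(y_tr)) < noise_rate
y_noisy = y_tr.copy()
y_noisy[flip] = rng.integers(0, 10, size=flip.sum())

# Single base learner trained on the noisy labels.
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy)

# Bagging ensemble of the same base learner, also trained on the
# noisy labels; averaging over bootstrap resamples tends to cushion
# the effect of the flipped labels.
ensemble = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    random_state=0,
).fit(X_tr, y_noisy)

# Accuracy is measured against the clean test labels.
print("single tree:", single.score(X_te, y_te))
print("bagging    :", ensemble.score(X_te, y_te))

In runs of this kind, the ensemble typically degrades less than the single learner as the noise rate grows, which is the resilience our conclusion refers to.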