Evaluating the Accuracy of Machine Learning Algorithms on Detecting

Code Smells for Different Developers

Mário Hozano

, Nuno Antunes

, Baldoino Fonseca

and Evandro Costa

COPIN, Federal University of Campina Grande, Paraíba, Brazil

CISUC, Department of Informatics Engineering, University of Coimbra, Portugal

IC, Federal University of Alagoas, Maceió-Alagoas, Brazil

Keywords:

Code Smell, Machine Learning, Experiment.

Abstract:

Code smells indicate poor implementation choices that may hinder the system maintenance. Their detection is

important for the software quality improvement, but studies suggest that it should be tailored to the perception

of each developer. Therefore, detection techniques must adapt their strategies to the developer’s perception.

Machine Learning (ML) algorithms is a promising way to customize the smell detection, but there is a lack

of studies on their accuracy in detecting smells for different developers. This paper evaluates the use of ML-

algorithms on detecting code smells for different developers, considering their individual perception about

code smells. We experimentally compared the accuracy of 6 algorithms in detecting 4 code smell types for 40

different developers. For this, we used a detailed dataset containing instances of 4 code smell types manually

validated by 40 developers. The results show that ML-algorithms achieved low accuracies for the developers

that participated of our study, showing that are very sensitive to the smell type and the developer. These

algorithms are not able to learn with limited training set, an important limitation when dealing with diverse

perceptions about code smells.

1 INTRODUCTION

Code smells are poor implementation choices that of-

ten worsen software maintainability (Fowler, 1999).

Studies (Khomh et al., 2009a; Fontana et al., 2013)

have found that the presence of code smells may lead

to design degradation (Oizumi et al., 2016), increa-

sing effort for comprehension (Abbes et al., 2011),

and contributing to introduction of faults (Khomh

et al., 2011a). Therefore, these smells must be care-

fully detected and refactored in order to improve the

software longevity (Rasool and Arshad, 2015).

Several studies indicate that Machine Learning al-

gorithms (ML-algorithms) (Witten and Frank, 2005)

are a promising way to automate part of the code

smell detection process, without asking the develo-

pers to deﬁne their own strategies of code smell de-

tection (Khomh et al., 2011b; Maiga et al., 2012;

Amorim et al., 2015). In a nutshell, the MLalgorithms

require a set of code examples classiﬁed as smell

or non-smell, i.e. the training set or the oracle.

From them, the learning algorithms generate smell

detection models based on the software metrics, ai-

ming at detecting smells.

In this context, studies evaluated the accuracy

of Bayesian Belief Networks (BBNs) (Khomh et al.,

2011b), Support Vector Machine (SVM) (Maiga et al.,

2012), and Decision Trees (Amorim et al., 2015),

in the detection of code smells in existing software

projects. Moreover, Fontana et al. (Fontana et al.,

2015) performed a broader study on comparing the

efﬁciency of these learning algorithms on detecting

code smells. The efﬁciency was evaluated in terms of

the detection accuracy and the training effort (i. e.,

the number of examples) required to the algorithm to

reach a high accuracy.

However, the abstract deﬁnitions of smell types

in existing catalogs (e.g. (Fowler, 1999)) lead deve-

lopers to have different perceptions about the occur-

rence of smells (Mäntylä and Lassenius, 2006; Schu-

macher et al., 2010; Santos et al., 2013). For instance,

while a developer may consider that a method con-

taining more than 100 lines of code as being a Long

Method smell (Fowler, 1999), other developers may

not necessarily agree. As a consequence, the develo-

pers have to customize such deﬁnitions to their speci-

ﬁc context.

In this case, to use the ML-algorithms, it is ne-

474

Hozano, M., Antunes, N., Fonseca, B. and Costa, E.

Evaluating the Accuracy of Machine Learning Algorithms on Detecting Code Smells for Different Developers.

DOI: 10.5220/0006338804740482

In Proceedings of the 19th International Conference on Enterprise Information Systems (ICEIS 2017) - Volume 2, pages 474-482

ISBN: 978-989-758-248-6

cessary that developers provide examples that repre-

sent their understanding about smells. However, there

is still no knowledge about the accuracy of ML-

algorithms on detecting smells for developers that

may perceive such anomalies differently. Indeed,

the existing studies performed evaluations on datasets

containing code examples annotated by a single per-

son or a small number of developers that shared the

same perception about the analyzed code smells.

In this paper we present a study aimed at eva-

luating the accuracy of ML-algorithms on detecting

code smells for different developers, and that may

have different perceptions about code smells. We ana-

lyzed the 6 widely used algorithms on detecting 4 ty-

pes of code smells from three different perspectives:

the overall accuracy, the accuracy for different develo-

pers, and the learning efﬁciency (the number of exam-

ples necessary to reach a speciﬁc accuracy). For this

study we built a dataset based on the input of 40

diverse developers which classiﬁed 15 code snippets

according to their individual perspective. This resul-

ted in dataset containing 600 (non-)smells examples

representing the individual perception of all 40 deve-

lopers involved in our study.

The results of our study indicate that, in average,

the analyzed ML-algorithms were not able to reach a

high accuracy on detecting code smells for developers

with different perceptions. The results also indicate

that the detection process is very sensitive not only

the smell type but also the individual perception of

each developer. These two factors must be conside-

red in the design of approaches that aim to accurately

detect code smells.

2 RELATED WORK

The use of intelligent techniques has been widely in-

vestigated in order to deal with the different percepti-

ons concerning code smells (Mäntylä and Lassenius,

2006; Schumacher et al., 2010; Santos et al., 2013).

In this context, several machine learning algorithms

have been adapted in order to enable an automatic de-

tecting customization, based on a set of examples ma-

nually validated (Khomh et al., 2011b; Maiga et al.,

2012; Amorim et al., 2015; Fontana et al., 2015).

Although these studies report important results con-

cerning the efﬁciency of techniques based on Ma-

chine Learning (ML) algorithms, they do not discuss

about how such techniques deal on customizing the

detection for different developers. After all, the stu-

dies did not evaluate the efﬁciency of these techniques

when customized from different training sets valida-

ted individually by single developers.

Bayesian Belief Network (BBN) algorithm

has been proposed to detect instances of God

Class (Khomh et al., 2009b). The authors used 4 gra-

duate students to validate, manually, a set of classes,

reporting if each class contains a God Class instance

or not. From such procedure was created a single

oracle containing 15 consensual smell instances.

After performing a 3-fold cross-validation procedure

in order to calibrate the BBN, the authors assessed an

accuracy (precision) of 0.68 on detecting God Class

instances. In (Khomh et al., 2011b) the same authors

extended their previous work, by applying the BBNs

to detect three types of code smells. By following the

same procedure of their previous work, the authors

submitted 7 students to create a single oracle. After

calibrating, the produced BBN reached an accuracy

of almost 0.33.

The work presented in (Maiga et al., 2012) eva-

luated the efﬁciency of a technique based on a Sup-

port Vector Machine (SVM) to detect code smells.

In order to evaluate the proposed technique, the aut-

hors consider the some oracles deﬁned in a previous

work (Moha et al., 2010). Although deﬁned by se-

veral developers, such oracles are not indexed by their

evaluators. After considering these oracles, the SVM

approach reached, in average, accuracies up to 0.74.

In (Amorim et al., 2015) the authors proposed the

use of Decision Tree algorithm to detect code smells.

The authors used a single training set containing a

huge number of examples and validated by different

developers. The work used a third party dataset as an

oracle for the training the modules. At end, the paper

reported the algorithm was able to reach accuracies

up to 0.78.

More recently, Fontana et al. (Fontana et al., 2015)

presented a large study that compares and experi-

ments different conﬁgurations of 6 ML-algorithms on

detecting 4 smell types. For training, the authors con-

sidered a set of oracles composed of several exam-

ples of code smells manually validated by different

developers. However, these oracles did not identify

these developers. As results, the authors reported

that all evaluated techniques present a high accuracy.

The highest one was obtained by two algorithms ba-

sed on Decision Trees (J48 and Random Forest (Mit-

chell, 1997)). In addition, the authors also afﬁrmed

that were necessary a hundred training examples to

the techniques reach an accuracy of, at least, 0.95.

3 STUDY DESIGN

The main goal of this study was to evaluate the ability

of Machine Learning (ML) algorithms to detect code

Evaluating the Accuracy of Machine Learning Algorithms on Detecting Code Smells for Different Developers

475

smells for different developers, considering their in-

dividual perception about code smells. Although pre-

vious work (Fontana et al., 2015) studied the accu-

racy and efﬁciency of ML-algorithms, the different

perspectives of the developers was not considered. In

summary, the study tries to answer the following re-

search questions:

• RQ1: How accurate are the ML-algorithms in de-

tecting smells?

• RQ2: How do the ML-approaches deal with dif-

ferent perceptions about code smells?

• RQ3: How efﬁcient are the ML-algorithms on de-

tecting smells?

To answer these questions, an experimental appro-

ach was deﬁned based on the usage of machine lear-

ning algorithms in code with and without smells. An

overview of the approach is presented in Figure 1.

Figure 1: Training and testing the ML-algorithms.

The approach is based on two main parts: trai-

ning phase and testing phase. The training dataset

(1) is composed of a set of evaluations provided by

developers, reporting the presence (or absence) of a

smell in code snippets, and the software metrics ex-

tracted from those snippets. Next, the ML-algorithm

(2) is trained to learn which metrics may determine

the smell evaluations given as input. Based on this

learning, the algorithm produces a model (3), able to

classify any snippet as a smell or non-smell.

In a test phase, the classiﬁer model are tested to

verify if it is able to classify novel instances as the

developer does. For that, test dataset (4) is produced

with evaluations and software metrics related to other

code snippets. Next, the classiﬁer model (5) tries to

predict the developer’s evaluations. Finally, classiﬁ-

cations are compared with the developer’s answers to

obatin the results (6).

3.1 Smell Types and ML-algorithms

In our study, we analyzed four types of code smells,

described in Table 1. We chose these smell types for

two main reasons. First, they affect different scopes

of a program. While God Class and Data Class affect

mainly classes, the Long Method and Feature Envy

are related to methods. Second, these four smell types

were also evaluated by previous studies on code smell

detection (Khomh et al., 2011b; Maiga et al., 2012;

Amorim et al., 2015; Fontana et al., 2015).

Table 1: Types of Code Smells investigated in this study.

Name Description

God

Class

Classes that tend to centralize the intelligence of the system

Data

Class

Classes that have ﬁelds, getting and setting methods for the

ﬁelds, and nothing else

Long

Met-

hod

A method that is too long and tries to do too much

Feature

Envy

Methods that use more attributes from other classes than

from its own class, and use many attributes from few classes

We evaluated and compared 6 ML-algorithms,

which reached a high accuracy in detecting code

smells during the experiments presented in (Fontana

et al., 2015). Initially, we applied these algorithms

by following the same conﬁguration adopted (Fontana

et al., 2015) and then we tried other conﬁgurations

in order to ﬁnd one in which the analyzed algorithm

could reach its highest efﬁciency. We present below

a short description of the algorithms analyzed in our

study:

• J48: an implementation of the C4.5 decision tree.

• JRip: an implementation of a propositional rule

learner.

• Random Forest: a classiﬁer that builds many

classiﬁcation trees as a forest of random decision

trees.

• Naive Bayes: a simple probabilistic classiﬁer ba-

sed on applying Bayes’ theorem.

• SMO: an implementation of John Platt’s sequen-

tial minimal optimization algorithm to train a sup-

port vector classiﬁer.

• SVM: an integrated software for support vector

classiﬁcation.

We used the Weka (Hall et al., 2009) implemen-

tation of these ML-algorithms. In order to provide a

fair comparison, we created an environment in which

all algorithms were executed by considering a same

set of data in each execution.

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

476

3.2 Building the Oracle Datasets

We built a set of data containing examples of

(non-)smells according to the developers’ input. For

this, we sent an invitation message to several contacts

from academy and industry. We were able to recruit

40 developers with at least 3 years experience in soft-

ware development and code smell detection, which

had emphasis on software quality.

The selected developers evaluated a set of code

snippets by looking for a speciﬁc smell type. The

code snippets comprehend methods or classes sus-

pects of containing the smell type under analysis. For

example, to recognize the occurrence of God Class,

the developers analyzed classes that could contain

this smell type. The developer concluded each clas-

siﬁcation by answering: YES, if he agrees that the

code snippet contained the speciﬁed smell type; and

NO, otherwise. The code snippets used in our study

were extracted from GanttProject

(v2.0.10), an open

source Java project. We selected this project because

they had been used in other studies related to code

smells (Moha et al., 2010; Khomh et al., 2011b; Fon-

tana et al., 2015). Moreover, such studies reported a

variety of suspicious code smells in this project. In

this context, the code snippets that were evaluated in

our study counted with several of these reported in-

stances.

In order to create the datasets that were conside-

red in our study, we divided the 40 developers into

4 groups. Each group, composed of 10 participants,

was responsible for analyzing one of the smell types

described in Table 1. To this aim, each group analy-

zed a set of 15 code snippets, and each participant of

the group classiﬁed each snippet indicating the occur-

rence or not of the smell type assigned to the group.

Each set of 15 evaluations performed by a single de-

veloper detecting a given code smell formed an oracle

that was used to train the ML-algorithms in our study.

This way, the 600 classiﬁcations formed, in total, 40

oracle datasets.

To verify if the participants had different percep-

tions about smells, all the developers of a group ana-

lyzed the same set of 15 code snippets. From that,

we could assess the agreement degree among the 10

developers that analyzed a single code smell type.

Such agreement was calculated by using the Fleiss’

Kappa, which is a measure to evaluate the concor-

dance or agreement among multiple raters (Fleiss,

1971). This measure reports a number lower or equal

to 1. A Kappa value equal to 1, means that the raters

are in complete agreement. The categories presen-

ted in (Landis and Koch, 1977), propose that a low

http://www.ganttproject.biz/

Table 2: Inter-rater agreement for each code smell type.

Code Smell

Agreement

Strength deﬁned by

(Landis and Koch, 1977)

God Class 0.308 Fair

Data Class 0.421 Moderate

Long Method 0.322 Fair

Feature Envy 0.222 Fair

agreement is veriﬁed in values close or below 0, and

that values above 0.6 represent substantial agreement.

After collecting all developers’ evaluations and

assessed the Kappa measure, we veriﬁed the inter-

rater agreement for each smell type as described in

Table 2. The developers that evaluated the Data

Class smell, presented the higher, albeit still weak,

agreement. They reached an Kappa value equals to

0.421, that is considered a Moderate agreement. All

the remaining smells were evaluated by developers

that presented a Fair agreement. The Kappa results

for them varied from 0.222 (Feature Envy) to 0.308

(God Class). From such results, we conﬁrmed that

the developers that built the oracle datasets present

different perceptions about the smells analyzed in our

study. Such fact enable us to analyze the accuracy

of the ML-algorithms on detecting code smells from

these oracles aiming at answering our research ques-

tions.

3.3 Experimentation Phase

Using the datasets containing the developers’ evalua-

tions and the software metrics for each analyzed snip-

pet, we performed two different experiments trying to

answer the research questions, as follows:

1) Accuracy – Analyzes the accuracy of the ML-

algorithms globally and by developer. The algorithms

are used to produce a classiﬁer model for each deve-

loper that evaluated a single smell type. In this sce-

nario we used each one of the 40 oracle datasets (10

for each smell) as input through a 5-fold cross vali-

dation procedure. The creation of the experimented

folds was guided to balance the number of code smell

instances for each fold. As each participant classi-

ﬁed manually each oracle used in our experiment, we

could not control that such oracle was composed of

a closest number of smell and non-smell instances.

Moreover, in order to limit over-ﬁtting problems, we

repeated this procedure 10 times. In the end, we mea-

sured the mean of the accuracy reached by the produ-

ced classiﬁer models on detecting the smell instances

over the remaining fold.

2) Efﬁciency – Analyzes the accuracy of the ML-

approaches according to the number of examples used

as training. For each oracle dataset, we randomly

ordered their evaluations and ran the ML-algorithms

Evaluating the Accuracy of Machine Learning Algorithms on Detecting Code Smells for Different Developers

477

with a varied number of these evaluations. Next, we

executed the ML-algorithms 15 times, where the n-

execution considers as learning the n ﬁrst evaluations

in the ordered dataset. The produced classiﬁer mo-

dels were tested on the entire dataset, containing the

15 examples. At end of each execution, we assessed

the accuracy of the classiﬁer model deﬁned by the al-

gorithms over the dataset. Each procedure with the 15

executions were repeated 10 times, in order to limit

overﬁtting problems. Again, we measure the mean of

accuracy reached by each algorithm on considering

all executions.

4 RESULTS AND DISCUSSION

This section presents and discusses the main results

of the study.

4.1 Overall Accuracy

Figure 2 presents the results concerning the mean va-

lues of accuracy obtained by the ML-algorithms on

detecting the smell types God Class, Long method,

Data Class and Feature Envy. We attach the exact

mean value to the bar associated with each algorithm.

The legend on top of the ﬁgure describes the colors

used to identify each algorithm.

Figure 2: Mean of accuracy reached by the ML-algorithms

on detecting smells.

As we can observe, the results of each algorithm

in each smell type are quite diverse. In fact, the algo-

rithms that present better results in a smell present in

many cases much lower results for other smells. Furt-

hermore, none of the ML-algorithms analyzed in our

study was able to reach a mean above 0.635.

Random Forest presented the best results on de-

tecting God Class and Data Class, but SMO perfor-

med better in Long Method and Feature Envy smells.

These results suggest that the accuracy of the algo-

rithms may be inﬂuenced by the scope of the type of

code smell analyzed. While the God Class and Data

Class are related to classes, the Long Method and Fe-

ature Envy are more related to methods.

Regarding the God Class, the SMO reached only

≈0.44, which is the lowest mean value of accuracy

observed in all analysis. On the other hand, the

Random Forest reached a mean 45% higher than SMO

by reaching ≈0.64, which also correspond to highest

observed mean value. JRip, Naive Bayes and SVM

were still able to reach values greater than 0.6. Fi-

nally, J48 reached ≈0.54, which is an intermediate

value between the mean obtained by the SMO and the

other algorithms.

Similarly, Random Forest also reached the highest

mean value of accuracy on detecting Data Class. In-

deed, Random Forest was the only algorithm to re-

ach a mean greater than 0.6. The algorithms J48,

JRip, SMO and SVM reached values between 0.57

and 0.58. The lowest mean 0.49 was obtained by the

Naive Bayes.

For Long Method, Random Forest was not able to

reach the highest mean. Regarding this smell type, the

highest mean values were obtained by the algorithms

SMO and SVM, both reached ≈0.60. Next, the al-

gorithm Naive Bayes reached ≈0.58. The remaining

algorithms reached values around ≈0.52.

Regarding the Feature Envy, we observe that

even though none of the algorithms was able to re-

ach a mean above 0.6 on detecting this smell type,

they did not reach a mean below 0.5. While the hig-

hest mean 0.578 was obtained by the SMO, the JRip

reached the lowest mean 0.511.

The analysis of these results provides us with the

answer to the research question RQ1. The results

indicate that, in average, the algorithms were not

able to detect smells with high accuracy for develo-

pers with different perceptions. In fact, we observe

that the accuracy of each ML-algorithm depends on

the type of smells being studied.

4.2 Accuracy by Developer

After analyzing the overall accuracy obtained by the

ML-algorithms, we decided to perform a deeper in-

vestigation on the accuracy reached by each algorithm

in order to better understanding how these algorithms

deal with the individual perception of each developer.

Table 3 presents the results obtained. The ﬁrst co-

lumn describes the smell type in analysis. The se-

cond column identiﬁes the developer that evaluated

the code snippets related to each smell type, as descri-

bed in Section 3.2. The following six columns report

the accuracy reached by each ML-algorithm. Besi-

des the accuracy values, we also report the mean of

the accuracy values obtained by each algorithm on

detecting the smell type in analysis. From the re-

sults described in Table 3, we can identify the accu-

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

478

racy obtained by each algorithm on detecting smells

for each developer. In particular, we can recognize

the algorithms that reached higher accuracy to a spe-

ciﬁc developer. In order to improve our discussion,

we highlight (in gray) the cells containing the higher

accuracies for each developer in the table. In addi-

tion, the lowest accuracy reached by the algorithms

on detecting a given smell are marked in red.

God Class. The SVM reached an accuracy higher

than the other algorithms on detecting God Class to

the developers 1, 3 and 7-10. However, it also obtai-

ned the lowest accuracy, among all the God Class ana-

lysis, on detecting smells to the Developer 2. On the

other hand, the Random Forest reached higher accu-

racy only to the developers 2 and 4-6, but it obtai-

ned the highest value of accuracy, among all the God

Class analysis, on detecting smells to the Developer

4. The SMO, JRip, J48 and Naive Bayes did not reach

an accuracy higher than the other algorithms for any

developer.

Data Class. Regarding the Data Class, even the

Naive Bayes that reached the lowest mean, it could

reach an accuracy equal or higher than the other algo-

rithms on detecting this smell type to the Developers

18 and 20. We also observe that the SVM reached

the same accuracy of the Naive Bayes to the develo-

pers 18 and 20. In addition, the SVM reached higher

accuracy than the other algorithms to the Developer

16. However, we note that the SVM reached the lo-

west accuracy, among all the Data Class analysis, on

detecting smells to the developer 17.

The Random Forest, which obtained the highest

mean, could reach an accuracy equal or higher than

the other algorithms only to the developers 13, 14

and 17. Note the JRip obtained the same accuracy

of the Random Forest to the Developer 14. Moreo-

ver, the JRip reached a higher accuracy to the deve-

lopers 15 and 19. In the case of the Developer 19,

the JRip obtained the highest accuracy, among all the

Data Class analysis. Finally, the J48 obtained higher

accuracy than the other algorithms to the developers

11-12.

Long Method. The SVM and SMO reached the hig-

hest means on detecting Long Method to the develo-

pers. While the SVM could reach higher accuracy

than the other algorithms only to the Developer 30,

the SMO reached higher accuracy to ﬁve developers:

21, 22, 26, 28 and 29. However, the SMO obtained

the lowest accuracy among all the algorithms on de-

tecting Long Method to the Developer 27.

The J48 and Random Forest reached higher accu-

racy than the other algorithms on detecting Long Met-

hod to the developers 25. In addition, the J48 obtai-

ned higher accuracy to the Developer 23. Finally, the

Table 3: Accuracy of the ML-algorithms on detecting

smells for each developer.

Dev

Accuracy

J48 JRip RF SMO NB SVM

1 0.407 0.633 0.493 0.400 0.640 0.667

2 0.547 0.647 0.707 0.587 0.493 0.360

3 0.473 0.573 0.593 0.387 0.593 0.667

4 0.640 0.667 0.760 0.433 0.747 0.667

5 0.587 0.667 0.753 0.420 0.720 0.667

6 0.533 0.580 0.747 0.467 0.500 0.387

7 0.500 0.580 0.467 0.427 0.547 0.600

8 0.567 0.540 0.607 0.440 0.647 0.667

9 0.480 0.593 0.627 0.373 0.613 0.667

10 0.660 0.640 0.600 0.440 0.580 0.667

Mean 0.539 0.612 0.635 0.437 0.608 0.601

Dev J48 JRip RF SMO NB SVM

11 0.693 0.673 0.513 0.453 0.380 0.600

12 0.693 0.400 0.640 0.533 0.533 0.533

13 0.420 0.480 0.613 0.413 0.433 0.600

14 0.627 0.813 0.813 0.627 0.420 0.600

15 0.627 0.780 0.733 0.640 0.400 0.600

16 0.480 0.433 0.453 0.420 0.540 0.600

17 0.473 0.520 0.640 0.500 0.453 0.333

18 0.460 0.340 0.600 0.427 0.667 0.667

19 0.660 0.840 0.753 0.633 0.407 0.600

20 0.607 0.447 0.540 0.667 0.667 0.667

Mean 0.574 0.573 0.630 0.531 0.490 0.580

Dev J48 JRip RF SMO NB SVM

21 0.487 0.500 0.500 0.627 0.600 0.600

22 0.560 0.600 0.600 0.780 0.700 0.600

23 0.627 0.600 0.480 0.487 0.520 0.600

24 0.407 0.433 0.560 0.667 0.700 0.667

25 0.593 0.527 0.593 0.547 0.480 0.533

26 0.427 0.480 0.520 0.547 0.487 0.533

27 0.427 0.473 0.400 0.373 0.747 0.667

28 0.480 0.467 0.467 0.827 0.540 0.533

29 0.527 0.493 0.507 0.747 0.600 0.600

30 0.633 0.640 0.600 0.413 0.387 0.667

Mean 0.517 0.521 0.523 0.601 0.576 0.600

Dev J48 JRip RF SMO NB SVM

31 0.593 0.520 0.540 0.567 0.467 0.580

32 0.453 0.547 0.447 0.473 0.273 0.600

33 0.693 0.593 0.607 0.620 0.467 0.587

34 0.547 0.467 0.547 0.527 0.613 0.420

35 0.613 0.533 0.513 0.540 0.533 0.600

36 0.600 0.600 0.587 0.587 0.687 0.600

37 0.533 0.413 0.580 0.740 0.760 0.453

38 0.527 0.533 0.660 0.653 0.760 0.593

39 0.453 0.360 0.407 0.420 0.453 0.327

40 0.533 0.547 0.547 0.653 0.533 0.600

Mean 0.555 0.511 0.543 0.578 0.555 0.536

GC-God Class, DC-Data Class, LM-Long Method, FE-Feature Envy

Naive Bayes obtained higher accuracy to the Develo-

pers 24 and 27.

Feature Envy. Regarding the Feature Envy, the SMO

reached higher accuracy than the other algorithms

only to the Developer 40. Similarly to the SMO, the

SVM also reached higher accuracy only to a speciﬁc

developer (32).

We also note that the Naive Bayes could reach hig-

Evaluating the Accuracy of Machine Learning Algorithms on Detecting Code Smells for Different Developers

479

her accuracy than the other algorithms to ﬁve develo-

pers: 34 and 36-39. However, it also obtained the lo-

west accuracy among all the algorithms on detecting

Feature Envy to the Developer 32. Finally, the J48

obtained a higher accuracy to four developers: 31, 33,

35 and 39.

These results help us to answer the RQ2 presen-

ted on Section 3. Such results indicate that each algo-

rithm reached a wide range of accuracy values on de-

tecting a same smell type for different developers. In

fact, the results show that the accuracy obtained by

each algorithm is very sensitive to the smell type

and the developer. To make matters worse, some

algorithms that reached higher accuracy than the ot-

her algorithms on detecting a speciﬁc smell type to a

developer, also obtained the lowest accuracy on de-

tecting the same smell type to other developers.

4.3 Efﬁciency

In this section, we analyze the efﬁciency of the ML-

algorithms, i.e., the number of examples required by

each algorithm to reach its accuracy. Figure 3 presents

the results of our evaluations that support the research

question RQ3. The charts describe the learning cur-

ves that represent the efﬁciency reached by each al-

gorithm on detecting the smell types God Class, Data

Class, Long Method and Feature Envy, respectively.

The y-axis represents the mean of the accuracy va-

lues obtained by each algorithm on detecting smells

for different developers. The x-axis represents the

number of examples used in the learning phase by the

algorithm to reach such mean. For instance, in the

case of the God Class, the algorithm SMO reached

a mean accuracy of 0.450 by using only 7 examples.

We ranged the number of examples from 1 to 15 be-

cause each developer analyzed 15 code snippets, as

described in Section 3.2.

Concerning the God Class, all the analyzed algo-

rithms reached a mean between 0.55 and 0.6 by using

only one example. The only exception is the Naive

Bayes (NB) that obtained a mean lower than 0.5. Ho-

wever, the efﬁciency of these algorithms suffered dif-

ferent variations as we increase the number of exam-

ples. We observe that the Random Forest (RF) was the

only algorithm to reach mean values above 0.65. In-

deed, none algorithm was able to overcome the mean

obtained by the Random Forest when we considered

more than three examples in the learning phase. On

the other hand, the algorithm SMO presented the lo-

west mean by using more than three examples. In ad-

dition, we note that this algorithm tended to decrease

its mean in almost all cases analyzed from 1 to 15 ex-

amples, reaching a mean lower than 0.4 by using

15 examples. Such result was a surprising for us be-

cause we expected that the algorithms could improve

its accuracy as we increase the number of examples.

The remaining algorithms reached an mean between

0.45 and 0.65.

Similarly, the Random Forest also reached the hig-

hest mean in the majority of the cases related to the

Data Class. The only exceptions occurred when we

considered 2 or more than 13 examples. In the ﬁrst

case, the SVM, SMO and Naive Bayes reached a

mean equal or greater than the Random Forest. In

the second case, the J48 overcame the Random Fo-

rest. In addition, we observe that the J48 presented

a consistence increase in its accuracy after conside-

ring more than 12 examples. As a consequence, only

the J48 could reach a mean above 0.7 among all the

cases investigated in the four smell types analyzed in

our study. Regarding the SVM and Naive Bayes, alt-

hough they had reached the highest mean values by

using only two examples, their mean values decrea-

sed on considering three examples. From this point,

they could not reach a mean above 0.6 as we increase

the number of examples.

For Long Method, the algorithm SMO reached

mean values equal or greater than the other algorithms

in almost all cases. The only exception occurred when

we considered 10 examples, where the Naive Bayes

overcame the SMO. We also note that the SVM pre-

sented a consistent increase in its accuracy from three

examples, reaching a mean of 0.6. However, it was

not able to overcome the SMO in any analyzed case.

The algorithms J48 and JRip always reach values be-

tween 0.5 and 0.55 in the majority of the cases. Alt-

hough the Random Forest had obtained the highest

efﬁciency in the vast majority of the cases related to

God Class and Data Class, it was not able to maintain

its efﬁciency on detecting Long Method.

Regarding the Feature Envy, the algorithm Naive

Bayes reached the highest mean in the most of the

cases. The exceptions occurred when we conside-

red from 2 to 5 examples, where the SMO reached

mean values equal or greater than Naive Bayes. Even

though the SMO has been overcome by the Naive

Bayes in most of the analyzed cases, the SMO presen-

ted an increase in its accuracy as we considered more

examples. Indeed, the SMO reached a mean grea-

ter than the remaining algorithms (J48, JRip, Random

Forest and SVM) in the vast majority of the analyzed

cases. The only exception occurred when conside-

red 8 examples, where the SMO was overcome by the

Random Forest and Naive Bayes.

The analysis of these results provides us with the

answer to the research question RQ3. The results

show that in most cases, the algorithms are not

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

480

(a) God Class (b) Data Class (c) Long Method (d) Feature Envy

Figure 3: Learning curves of the ML-algorithms on detecting smells.

accurate when trained with a small training set. In

fact, the tendency in most cases is for increasing mean

accuracy. The results also shows that the efﬁciency

of the algorithm varies substantially depending on the

type of code smells being evaluated.

5 THREATS TO VALIDITY

In this section we discuss the threats to validity by

following the criteria deﬁned in (Wohlin et al., 2000).

Construct Validity. The oracle datasets that suppor-

ted our study were built with from a huge quantity of

code snippets manually evaluated by developers. In

this case, the developers evaluated each snippet by re-

porting the option “YES” or “NO”, referring the pre-

sence or absence of a given code smell into the snip-

pet. Providing only these two options may be a threat,

since the developers could not inform the degree of

conﬁdence in their answers. However, we adopted

such procedure aiming at ensuring that the develo-

pers were able to decide about the existence of a code

smell and we could obtain a set of examples that ena-

bles to evaluate the efﬁciency of the ML-algorithms.

Internal Validity. The use of the Weka tool to imple-

ment the algorithms analyzed in our study enabled to

experiment a variety of conﬁgurations, which affect

the learning process. Thus, the conﬁgurations consi-

dered in our experiments may impact in the accuracy

and efﬁciency of the algorithms. In order to mitigate

this threat, we conﬁgured all algorithms according to

the better settings deﬁned in (Fontana et al., 2015).

Indeed, that study performed a variety of experiments

in order to ﬁnd the best adjust for each algorithm.

External Validity. The code snippets evaluated

by the developers were extracted from only 1 Java

project, named GanttProject. Such project has

been widely used in other works related to code

smell (Khomh et al., 2011b; Moha et al., 2010; Maiga

et al., 2012). However, although the implementation

of this project presents classes and methods with dif-

ferent characteristics (i.e. size and complexity), our

results might not hold to other projects. In the same

way, even though we have performed our experiments

with 40 different developers, our results might not

also hold for other developers since they may have

different perceptions about the code smells analyzed

in our study (Mäntylä and Lassenius, 2006; Schuma-

cher et al., 2010; Santos et al., 2013).

6 CONCLUSION

This paper presented the results of a study that evalu-

ated the accuracy of ML-algorithms on the detection

of code smells for different developers. This inves-

tigation is important because machine learning has

been considering a promising way to customize the

code smell detection. Nevertheless, to the best of

our knowledge, there are no study that evaluate the

performance of ML-algorithms on customizing the

detection for developers with different perceptions

about code smells.

We used 6 ML-algorithms to detect smells for 40

software developers that differently detected instan-

ces of 4 code smell types. Altogether, we counted

with 600 evaluations, that supported our study. With

such data we ran two different experiments aiming at

analyzing the accuracy and efﬁciency of the analyzed

algorithms.

According to our results, the ML-algorithms were

not able to detect smells with high accuracy for the

developers that participated of our study. In this con-

text, the quantity of examples considered on training

may difﬁcult the learning. However, given that diffe-

rent developers have different perceptions about code

smells, and the context may inﬂuence in their deci-

sion, the detection approaches must be able to custo-

mize with a limited portion of data to customize the

detection. After all, in a real scenario it is not reaso-

nable that a developer evaluates manually hundreds of

code snippets before in order to customize a detection

approach. This way, other strategies to speed up the

customization performed by the ML-algorithms must

be investigated.

Independently of the accuracy levels, we observed

Evaluating the Accuracy of Machine Learning Algorithms on Detecting Code Smells for Different Developers

481

that the ML-algorithms are sensitive to the smell type

and the developer. For instance, while the SMO pre-

sented the higher accuracies on detecting Long Met-

hod and Feature Envy smells, such algorithm pre-

sented lowest accuracies on detecting God Class and

Data Class instances. An, almost, inverted behaviour

was veriﬁed by the Random Forest algorithm. Simi-

larly, even when an algorithm presented a mean high

accuracy on detecting a given smell, his performance

was not consistent on on detecting such anomalies for

the different developers.

Finally, we will make the dataset used in our ex-

periments available in order to help other studies in

smell detection. We had no knowledge of other data-

sets with a large portion of evaluations manually vali-

dated by different developers over a same set of code

snippets. Thus, the availability of this dataset repre-

sents another contribution of this paper.

REFERENCES

Abbes, M., Khomh, F., Gueheneuc, Y.-G., and Antoniol, G.

(2011). An empirical study of the impact of two anti-

patterns, blob and spaghetti code, on program compre-

hension. In Software maintenance and reengineering

(CSMR), 2011 15th European conference on, pages

181–190. IEEE.

Amorim, L., Costa, E., Antunes, N., Fonseca, B., and

Ribeiro, M. (2015). Experience report: Evaluating

the effectiveness of decision trees for detecting code

smells. In Proceedings of the 2015 IEEE 26th Interna-

tional Symposium on Software Reliability Engineer-

ing (ISSRE), ISSRE ’15, pages 261–269, Washington,

DC, USA. IEEE Computer Society.

Fleiss, J. L. (1971). Measuring nominal scale agreement

among many raters. Psychological bulletin,

76(5):378.

Fontana, F. A., Ferme, V., Marino, A., Walter, B., and Mar-

tenka, P. (2013). Investigating the Impact of Code

Smells on System’s Quality: An Empirical Study on

Systems of Different Application Domains. 2013

IEEE International Conference on Software Mainte-

nance, pages 260–269.

Fontana, F. A., Mäntylä, M. V., Zanoni, M., and Marino, A.

(2015). Comparing and experimenting machine lear-

ning techniques for code smell detection. Empirical

Software Engineering.

Fowler, M. (1999). Refactoring: Improving the Design of

Existing Code. Addison-Wesley, Boston, MA, USA.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann,

P., and Witten, I. H. (2009). The weka data mining

software: an update. ACM SIGKDD explorations new-

sletter, 11(1):10–18.

Khomh, F., Di Penta, M., and Gueheneuc, Y.-G. (2009a).

An Exploratory Study of the Impact of Code Smells

on Software Change-proneness. 2009 16th Working

Conference on Reverse Engineering, pages 75–84.

Khomh, F., Penta, M. D., Guéhéneuc, Y.-G., and Antoniol,

G. (2011a). An exploratory study of the impact of an-

tipatterns on class change- and fault-proneness. Em-

pirical Software Engineering, 17(3):243–275.

Khomh, F., Vaucher, S., Guéhéneuc, Y. G., and Sahra-

oui, H. (2009b). A bayesian approach for the de-

tection of code and design smells. In Quality Software,

2009. QSIC’09. 9th International Conference on, pa-

ges 305–314. IEEE.

Khomh, F., Vaucher, S., Guéhéneuc, Y.-G., and Sahraoui,

H. (2011b). Bdtex: A gqm-based bayesian appro-

ach for the detection of antipatterns. J. Syst. Softw.,

84(4):559–572.

Landis, J. R. and Koch, G. G. (1977). The measurement of

observer agreement for categorical data. biometrics,

pages 159–174.

Maiga, A., Ali, N., Bhattacharya, N., Sabane, A., Gue-

heneuc, Y.-G., and Aimeur, E. (2012). SMURF: A

SVM-based Incremental Anti-pattern Detection Ap-

proach. 2012 19th Working Conference on Reverse

Engineering, pages 466–475.

Mäntylä, M. V. and Lassenius, C. (2006). Subjective eva-

luation of software evolvability using code smells: An

empirical study, volume 11. Springer.

Mitchell, T. M. (1997). Machine learning. McGraw-

Hill series in computer science. McGraw-Hill, Boston

(Mass.), Burr Ridge (Ill.), Dubuque (Iowa).

Moha, N., Gueheneuc, Y.-G., Duchien, L., and a. F. Le

Meur (2010). DECOR: A Method for the Speciﬁca-

tion and Detection of Code and Design Smells. IEEE

Transactions on Software Engineering, 36(1):20–36.

Oizumi, W., Garcia, A., da Silva Sousa, L., Cafeo, B., and

Zhao, Y. (2016). Code anomalies ﬂock together: Ex-

ploring code anomaly agglomerations for locating de-

sign problems. In Proceedings of the 38th Internati-

onal Conference on Software Engineering, ICSE ’16,

pages 440–451, New York, NY, USA. ACM.

Rasool, G. and Arshad, Z. (2015). A review of code smell

mining techniques. Journal of Software: Evolution

and Process, pages n/a–n/a.

Santos, J. A. M., de Mendonça, M. G., and Silva, C. V. A.

(2013). An exploratory study to investigate the im-

pact of conceptualization in god class detection. In

Proceedings of the 17th International Conference on

Evaluation and Assessment in Software Engineering,

EASE ’13, pages 48–59, New York, NY, USA. ACM.

Schumacher, J., Zazworka, N., Shull, F., Seaman, C., and

Shaw, M. (2010). Building empirical support for au-

tomated code smell detection. Proceedings of the

2010 ACM-IEEE International Symposium on Empi-

rical Software Engineering and Measurement - ESEM

’10, page 1.

Witten, I. H. and Frank, E. (2005). Data Mining: Practi-

cal machine learning tools and techniques. Morgan

Kaufmann.

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Reg-

nell, B., and Wesslén, A. (2000). Experimentation in

Software Engineering: An Introduction. Kluwer Aca-

demic Publishers, Norwell, MA, USA.

ICEIS 2017 - 19th International Conference on Enterprise Information Systems

482