monochrome images.
• If the CNN model can infer the fault proneness of a program, we will be able to review and fix program faults regardless of whether the program is complete.
Thus, we utilize source code images to train a CNN
model in order to infer the fault proneness of pro-
grams.
The research questions of this paper are as fol-
lows:
• RQ1: Can a CNN extract features of fault prone-
ness from images of programs?
• RQ2: If the CNN cannot infer fault proneness, what kinds of defects does it miss?
• RQ3: Is the accuracy of the inference acceptable?
How can accuracy be improved?
To answer these research questions, we need to categorize the elements of program source code and transform programs into images that the CNN-BI system reads, which was carried out as follows. We transformed program source code into images with colored elements to train the CNN-BI system; a minimal sketch of this step is given below.
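The following sketch illustrates how such a transformation might be implemented; it is not the paper's actual procedure. It assumes Python's built-in tokenize module for categorizing source code elements and Pillow for rendering, and the element categories and color palette are illustrative choices of our own.

```python
import io
import tokenize
from PIL import Image, ImageDraw

# Illustrative palette: one color per element category (the paper's
# actual categories and colors are not specified here).
PALETTE = {
    tokenize.NAME: (0, 0, 255),         # identifiers and keywords
    tokenize.NUMBER: (0, 255, 0),       # numeric literals
    tokenize.STRING: (255, 0, 0),       # string literals
    tokenize.OP: (255, 165, 0),         # operators and punctuation
    tokenize.COMMENT: (128, 128, 128),  # comments
}
DEFAULT_COLOR = (0, 0, 0)

def source_to_image(source: str, cell: int = 4) -> Image.Image:
    """Render each single-line token as a colored rectangle."""
    lines = source.splitlines()
    width = max((len(line) for line in lines), default=1)
    img = Image.new("RGB", (width * cell, len(lines) * cell), "white")
    draw = ImageDraw.Draw(img)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        (row0, col0), (row1, col1) = tok.start, tok.end
        # Skip multi-line tokens and the end-of-file marker.
        if row0 != row1 or row0 > len(lines):
            continue
        draw.rectangle([col0 * cell, (row0 - 1) * cell,
                        col1 * cell, row0 * cell],
                       fill=PALETTE.get(tok.type, DEFAULT_COLOR))
    return img

source_to_image("def add(a, b):\n    return a + b\n").save("program.png")
```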
Since we did not have a sufficient amount of training data, transfer learning (Weiss et al., 2016) was applied. Before evaluating the CNN, we trained it with images of fault-prone and correct programs. Supervised learning was selected as the CNN's learning style, since the goal of the inference was to conclude whether a program is fault prone or not. After the training, the CNN-BI system was evaluated on whether it could predict the fault proneness of a program.
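As a concrete illustration of such a training setup, the following is a minimal transfer-learning sketch, assuming TensorFlow/Keras and an ImageNet-pretrained VGG16 base; the paper does not name its base network or hyperparameters, so these choices, and the variable names train_images and train_labels, are assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Reuse convolutional features learned on ImageNet and freeze them.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False

# Add a small supervised head for the binary decision:
# fault prone (1) vs. correct (0).
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# train_images: rendered source-code images, shape (N, 224, 224, 3);
# train_labels: 0/1 fault-proneness labels (hypothetical names).
# model.fit(train_images, train_labels, epochs=10, validation_split=0.2)
```

Freezing the pretrained convolutional layers lets a small labeled data set train only the classification head, which is the usual motivation for transfer learning when data are scarce.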
The remainder of this article is structured as fol-
lows. Section 2 presents related work. Section 3 de-
scribes a case study. In Section 4, we discuss the find-
ings and their implications. Section 5 concludes.
2 RELATED WORK
There are a lot of methods that improve the quality of
software. Techniques utilizing statistical methods are
popularly applied, since the statistical approach can
visualize the quality of software. If qualitative infor-
mation can be presented quantitatively, we can com-
pare the qualitative characteristics of programs with
the quantitative data. We will be able to observe and
control the quality of software quantitatively to im-
prove the quality of software.
Cyclomatic complexity was introduced by McCabe in 1976 (McCabe, 1976). There is a strong correlation between the cyclomatic number and the number of defects. Therefore, when a program is measured with a large cyclomatic number, it implies that the program is more complex than programs with smaller cyclomatic numbers. The quantitatively identified "relatively complex" programs need to be reviewed carefully.
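As a worked illustration of our own (not McCabe's original presentation): for a single-entry, single-exit routine, the cyclomatic number equals the number of decision points plus one, equivalently V(G) = E - N + 2 for a control-flow graph with E edges and N nodes. The sketch below approximates this count for Python code.

```python
# Approximate V(G) by counting branching constructs in the AST.
import ast

DECISION_NODES = (ast.If, ast.For, ast.While, ast.BoolOp, ast.ExceptHandler)

def cyclomatic_complexity(source: str) -> int:
    """Return the number of decision points plus one."""
    tree = ast.parse(source)
    decisions = sum(isinstance(node, DECISION_NODES)
                    for node in ast.walk(tree))
    return decisions + 1

sample = """
def classify(x):
    if x < 0:           # decision 1
        return "neg"
    for _ in range(x):  # decision 2
        if x % 2:       # decision 3
            return "odd"
    return "even"
"""
print(cyclomatic_complexity(sample))  # -> 4
```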
A decade later, multiple paradigms began to be applied. For object-oriented software, Chidamber and Kemerer introduced metrics to measure complexity based on coupling and cohesion, readability, understandability, modifiability, etc. (Chidamber and Kemerer, 1994). Their set of metrics is known as the CK metrics. Zimmermann et al. also introduced a defect prediction model (Zimmermann et al., 2007). These methods are based on a quantitative approach to evaluating the quality of software systems.
The quantitative approach needs evaluation crite-
ria. Lorenz and Kidd presented empirical data for typ-
ical metrics as criteria of the quality of object-oriented
programs (Lorenz and Kidd, 1994). Their intention was that if a program's measurements fall outside the "normal" range, the program must be reviewed (a hypothetical illustration is given below). This means that the metrics themselves do not find defects or errors in programs. Furthermore, even though testing is effective in improving the quality of programs, it cannot discover all errors. One of the difficulties in improving program quality is that not all defects occur in complex programs. We therefore employed another approach to decide which parts to review intensively. The CNN approach may be able to find defects that arise not only from the complexity of a program but also from its structural features.
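To make the "normal range" idea concrete, here is a hypothetical sketch; the metric names and ranges below are placeholders, not Lorenz and Kidd's published values.

```python
# Placeholder empirical ranges (hypothetical, for illustration only).
NORMAL_RANGES = {
    "methods_per_class": (3, 20),
    "lines_per_method": (1, 24),
}

def needs_review(measurements: dict) -> list:
    """Return the metrics whose values fall outside their normal range."""
    flagged = []
    for metric, value in measurements.items():
        low, high = NORMAL_RANGES[metric]
        if not (low <= value <= high):
            flagged.append(metric)
    return flagged

print(needs_review({"methods_per_class": 42, "lines_per_method": 10}))
# -> ['methods_per_class']
```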
After Google's "cat" experiment (Le et al., 2012), research
and application examples of deep learning have in-
creased. There are studies that have applied deep
learning to software quality evaluation. For example,
Morisaki applied deep learning to evaluate the read-
ability of software in order to control the quality of
software (Morisaki, 2018). Kondo et al. utilized deep
learning to cope with the delay of feedback from static
code analysis (Kondo et al., 2018). Yang et al. (Yang et al., 2015) also applied deep learning to observe software changes and proposed W-CNN rather than a plain CNN. In contrast, we evaluate whether the CNN-BI system can predict the fault proneness of programs.
3 CASE STUDY
In order to evaluate the inference of the fault proneness of programs with deep learning, we analyzed the effectiveness of the inference qualitatively and quantitatively. In this section, we look at a case in which a CNN has been applied in order to infer the fault proneness