to compare the performance of OCC models to that of binary classification models and to analyse whether patterns and trends can be uncovered in the behaviour of the SVM-based models when solving SDP.
Extensive experiments performed on the Apache Calcite software yielded several interesting research findings.
The main conclusion of our study is that, in order to have effective means of finding bugs in source code, we may need either to ensure that the labels are appropriate and the bug descriptions are more informative, or to focus more on the defective instances during training. We believe the latter option may be the more general solution, since defects are more concise and do not change their characteristics during the development stages of the software, whereas non-defects are more volatile, subjective, and open to interpretation, which leads to conflicts for later software releases.
We further aim to verify the findings of the current study in a cross-version SDP scenario on other Apache software systems (Ant, Archive, Commons, etc.) by training the OCC model on the software defects from all versions of one software system and, subsequently, testing the model on the releases of the other software systems. The AUC-based evaluation of the results may also be extended by considering a recent work (Carrington et al., 2023) that describes a deep ROC analysis for measuring performance within groups of true-positive or false-positive rates. Using data augmentation to increase the number of defective instances may also provide better results, an issue we did not address in the current experiments; a simple oversampling sketch is given below.
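As an illustration of this last direction, the following is a minimal sketch of randomly oversampling the defective instances before training, using scikit-learn (Pedregosa et al., 2011). The feature matrix, labels, and sizes are hypothetical placeholders, not the data used in our experiments.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical training data: X_train holds the software metrics,
# y_train holds the labels (1 = defective, 0 = non-defective).
rng = np.random.default_rng(0)
X_train = rng.random((100, 5))
y_train = np.array([1] * 10 + [0] * 90)

# Split the defective (minority) and non-defective (majority) instances.
X_def = X_train[y_train == 1]
X_clean = X_train[y_train == 0]

# Randomly oversample the defective instances with replacement
# until the two classes have the same size.
X_def_up = resample(X_def, replace=True,
                    n_samples=len(X_clean), random_state=42)

X_balanced = np.vstack([X_clean, X_def_up])
y_balanced = np.array([0] * len(X_clean) + [1] * len(X_def_up))
```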
As another direction for future work, we will focus on ML models trained on specific types of defects. There may be a multitude of software bug types that we could not properly distinguish, since the data set annotations do not include the nature of the problem, only its presence. We believe it may be useful to include such annotations, since defects could then be clustered by category and better understood. Code smells may be a possible starting point for automatically classifying defects into categories, as there is a clear link between code smells and the quality of the code. The experimental results obtained also suggest further investigating the use of both OCSVM and SVM at the same time and checking where the two models contradict each other, so that we may eventually benefit from the strengths of both; a possible sketch of this idea is given below.
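A minimal sketch of this combination, assuming hypothetical feature matrices and labels and using scikit-learn (Pedregosa et al., 2011), would train the OCSVM on the non-defective instances only and the binary SVM on all labelled instances, and then flag the test instances on which the two predictions disagree:

```python
import numpy as np
from sklearn.svm import OneClassSVM, SVC

# Hypothetical data: X_train / y_train (1 = defective), X_test.
rng = np.random.default_rng(0)
X_train = rng.random((200, 5))
y_train = (rng.random(200) < 0.1).astype(int)
X_test = rng.random((50, 5))

# OCSVM trained only on the non-defective instances; it predicts -1
# for outliers, which we interpret here as "defective".
occ = OneClassSVM(kernel="rbf", nu=0.1).fit(X_train[y_train == 0])
occ_pred = (occ.predict(X_test) == -1).astype(int)

# Binary SVM trained on all labelled instances.
svm = SVC(kernel="rbf", class_weight="balanced").fit(X_train, y_train)
svm_pred = svm.predict(X_test)

# Test instances on which the two models contradict each other are
# candidates for manual inspection or for a combined decision rule.
disagreement = np.where(occ_pred != svm_pred)[0]
print(f"The models disagree on {len(disagreement)} of {len(X_test)} test instances")
```

The flagged instances could then be inspected manually or resolved by a combined decision rule, which is the kind of hybrid scheme we intend to study.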
ACKNOWLEDGEMENTS
This work was supported by a grant of the Min-
istry of Research, Innovation and Digitization,
CNCS/CCCDI – UEFISCDI, project number PN-III-
P4-ID-PCE-2020-0800, within PNCDI III.
REFERENCES
Batool, I. and Khan, T. A. (2022). Software fault prediction
using data mining, machine learning and deep learn-
ing techniques: A systematic literature review. Com-
puters and Electrical Engineering, 100:107886.
Begoli, E., Camacho-Rodríguez, J., Hyde, J., Mior, M. J., and Lemire, D. (2018). Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources. In Proceedings of SIGMOD '18, pages 221–230, New York, NY, USA. ACM.
Carrington, A. M., Manuel, D. G., et al. (2023). Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation. IEEE Trans. Pattern Anal. Mach. Intell., 45(1):329–341.
Chen, L., Fang, B., and Shang, Z. (2016). Software fault
prediction based on one-class SVM. In ICMLC 2016,
volume 2, pages 1003–1008.
Ciubotariu, G. (2022). OCC-SDP GitHub repository. https://github.com/george200150/CalciteData/.
D’Ambros, M., Lanza, M., and Robbes, R. (2012). Eval-
uating defect prediction approaches: A benchmark
and an extensive comparison. Empirical Softw. Engg.,
17(4–5):531–577.
Fawcett, T. (2006). An introduction to ROC analysis. Pat-
tern Recognition Letters, 27(8):861–874.
GitHub (2023). PMD - An extensible cross-language static
code analyzer. https://pmd.github.io/.
Hassan, A. E. (2009). Predicting faults using the complex-
ity of code changes. In 2009 IEEE 31st International
Conference on Software Engineering, pages 78–88.
Herbold, S., Trautsch, A., Trautsch, F., and Ledel, B.
(2022). Problems with SZZ and features: An empir-
ical study of the state of practice of defect prediction
data collection. Empir. Softw. Eng., 27(2).
Malhotra, R. (2014). Comparative analysis of statistical and
machine learning methods for predicting faulty mod-
ules. Applied Soft Computing, 21:286–297.
Marian, Z., Mircea, I., Czibula, I., and Czibula, G. (2016). A novel approach for software defect prediction using fuzzy decision trees. In SYNASC 2016, pages 240–247.
Miholca, D.-L., Tomescu, V.-I., and Czibula, G. (2022). An
in-Depth Analysis of the Software Features’ Impact
on the Performance of Deep Learning-Based Software
Defect Predictors. IEEE Access, 10:64801–64818.
Moser, R., Pedrycz, W., and Succi, G. (2008). A compar-
ative analysis of the efficiency of change metrics and
static code attributes for defect prediction. In ICSE
'08, pages 181–190, New York, NY, USA. ACM.
Moussa, R., Azar, D., and Sarro, F. (2022). Investigating the
Use of One-Class Support Vector Machine for Soft-
ware Defect Prediction. CoRR, abs/2202.12074.
Pedregosa, F., Varoquaux, G., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.