gorized software libraries in the Java Maven reposi-
tories is currently just over 8%, we have used a tag-
based approach to label most of the libraries. This
applies to an additional 67% of the libraries found in
the largest mentioned repositories.
First, we determined the composition of the categories
and tags available online. Moreover, we introduced a
more general labeling of the libraries, adjusted for our
further research work. In addition, labeled and unlabeled
libraries appear to have a similar distribution of tags.
Finally, an imbalance in the data was found, which
we assume to be due to the domain under investiga-
tion. Based on these findings, we applied different ap-
proaches for automatic classification. For a tag-based
approach on our presented relabeled dataset, a neural
network with an achieved accuracy of 97.46% seems to
be the most promising. We also found a good result
with the applied naïve Bayes approach. In contrast,
logistic regression and random forest decision trees
did not yield sufficient results.
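To make the tag-based idea concrete, the following is a minimal sketch of a multinomial naïve Bayes classifier over tag sets, written in plain Python. It is not the implementation used in this work. The function names `train_naive_bayes` and `predict`, the smoothing parameter `alpha`, and the toy tags and categories are all assumptions introduced purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, alpha=1.0):
    """Fit a multinomial naive Bayes model on (tags, category) pairs.

    alpha is an assumed Laplace smoothing constant, not a value from the paper.
    """
    class_counts = Counter()          # number of libraries per category
    tag_counts = defaultdict(Counter)  # tag frequencies per category
    vocabulary = set()                # all tags seen during training
    for tags, category in samples:
        class_counts[category] += 1
        for tag in tags:
            tag_counts[category][tag] += 1
            vocabulary.add(tag)
    return {
        "class_counts": class_counts,
        "tag_counts": tag_counts,
        "vocabulary": vocabulary,
        "alpha": alpha,
    }


def predict(model, tags):
    """Return the category with the highest log posterior for the given tags."""
    total = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    alpha = model["alpha"]
    best_category, best_score = None, -math.inf
    for category, count in model["class_counts"].items():
        score = math.log(count / total)  # log prior of the category
        tag_total = sum(model["tag_counts"][category].values())
        for tag in tags:
            if tag not in model["vocabulary"]:
                continue  # ignore tags never seen during training
            numerator = model["tag_counts"][category][tag] + alpha
            denominator = tag_total + alpha * vocab_size
            score += math.log(numerator / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy data: tags and categories are invented for illustration.
TRAINING = [
    (["json", "parser", "serialization"], "json-libraries"),
    (["json", "binding"], "json-libraries"),
    (["logging", "log", "slf4j"], "logging-frameworks"),
    (["logging", "appender"], "logging-frameworks"),
    (["testing", "junit", "mock"], "testing-frameworks"),
]

model = train_naive_bayes(TRAINING)
predicted = predict(model, ["json", "parser"])
print(predicted)
```

The sketch also shows why such a model depends on tags being present: a library without any known tag is scored by the class prior alone, which on an imbalanced dataset favors the largest category.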
8 FUTURE WORK
With such promising results in the automated classification,
we see only limited need for further optimization work.
However, hyperparameter optimization of the neural network
may yield even better results.
Furthermore, we still see a need to evaluate more generic
approaches, because 25% of the libraries in our dataset,
as well as libraries from other platforms, might not be
tagged. This is where our approach has limitations for the
management of repository items. For our trained models,
tags must exist and need to be of similar quality. Since
this is probably not always the case, alternative features
for classification should be considered. We see future work
in applying more generic approaches that use NLP to analyze
the group-ids and artefact-ids, and that analyze the binary
code, both of which are always available. This procedure
could also be beneficial for relabeling libraries from the
currently excluded class "android packages", since this
class is too generic in our view. In addition, other
features could be taken from metadata and considered for
classification in combination with the features already
listed. If available, we would consider the number of tags
and downloads, code metrics, licences, connections between
the contributors behind those libraries, and
keywords/entities from websites.
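As an illustration of how group-ids and artefact-ids could be turned into classification features, the following sketch splits Maven coordinates into lowercase word tokens and counts them as a bag of words. It is only one possible preprocessing step under stated assumptions, not a method evaluated in this work. The function names `tokenize_coordinates` and `coordinate_features` and the stop list `STOP_TOKENS` are hypothetical.

```python
import re
from collections import Counter

# Assumed stop list of common namespace prefixes with little topical signal.
STOP_TOKENS = {"org", "com", "net", "io"}


def tokenize_coordinates(group_id, artifact_id):
    """Split a group-id and artefact-id into lowercase word tokens."""
    tokens = []
    # Split on the separators that commonly occur in Maven coordinates.
    for part in re.split(r"[.\-_]+", f"{group_id}.{artifact_id}"):
        # Break camelCase words, acronyms, and digit runs apart.
        for word in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part):
            word = word.lower()
            # Drop stop tokens and pure numbers such as version suffixes.
            if word not in STOP_TOKENS and not word.isdigit():
                tokens.append(word)
    return tokens


def coordinate_features(group_id, artifact_id):
    """Return bag-of-words counts for one library's coordinates."""
    return Counter(tokenize_coordinates(group_id, artifact_id))


features = coordinate_features("com.fasterxml.jackson.core", "jackson-databind")
print(features)
```

Token counts of this kind could replace or complement tags as model input for untagged libraries. Whether they carry enough signal for the categories used here remains an open question.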
Besides further approaches to classifying the libraries,
our further research work, as already described in the
introduction, aims to identify similar software on a
domain-related and technical basis. For the technical
basis, we plan to use the classified libraries and
determine migration paths by analyzing how open source
projects develop over time. By analyzing the commit history
of software projects, we aim to provide decision support
through automatically generated design decision
recommendations.
DATA 2021 - 10th International Conference on Data Science, Technology and Applications