Authors:
Maximilian Auch 1; Maximilian Balluff 1; Peter Mandl 1 and Christian Wolff 2
Affiliations:
1 University of Applied Sciences Munich, Lothstraße 34, 80335 Munich, Germany
2 University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany
Keyword(s):
Software Libraries, Classification, Tags, Similarity, Naïve Bayes, Logistic Regression, Random Forest, Neural Network.
Abstract:
The number of software libraries has grown steadily, and grouping them into classes by functionality simplifies repository management and analysis. Given how many libraries exist, this categorization has to be automated. Using a crawled dataset of Java software libraries from Apache Maven repositories, together with tags and categories from the indexing platform MvnRepository.com, we show how the data is structured and point out a class imbalance. We introduce a class mapping that maps libraries from very specific, technical classes to more generic ones. Using this mapping, we investigate supervised machine learning techniques that classify the libraries in the dataset based on their available tags. We show that a tag-based approach using neural networks classifies libraries with an accuracy of 97.46%. Overall, we found neural networks and naïve Bayes more suitable for this use case than logistic regression or random forest.
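
To make the tag-based idea concrete, the following is a minimal sketch of a naïve Bayes classifier over library tags, combined with a mapping from specific categories to generic classes. It is an illustration only and not the authors' implementation: the category names, the mapping, the training samples, and the helper functions `train` and `predict` are all hypothetical, and Laplace smoothing is an assumption made here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mapping from specific categories to generic classes
# (illustrative only; not the mapping used in the paper).
CLASS_MAPPING = {
    "JSON Libraries": "Data Formats",
    "XML Processing": "Data Formats",
    "Logging Frameworks": "Logging",
    "Logging Bridges": "Logging",
}

# Hypothetical training samples: (tags of a library, its specific category).
TRAINING = [
    (["json", "parser"], "JSON Libraries"),
    (["json", "serialization"], "JSON Libraries"),
    (["xml", "parser"], "XML Processing"),
    (["logging", "slf4j"], "Logging Bridges"),
    (["logging", "appender"], "Logging Frameworks"),
]


def train(samples, mapping):
    """Count classes and tag occurrences per generic class."""
    class_counts = Counter()
    tag_counts = defaultdict(Counter)
    vocabulary = set()
    for tags, category in samples:
        label = mapping[category]  # specific category -> generic class
        class_counts[label] += 1
        tag_counts[label].update(tags)
        vocabulary.update(tags)
    return class_counts, tag_counts, vocabulary


def predict(tags, model):
    """Return the generic class with the highest log posterior."""
    class_counts, tag_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        # Laplace smoothing: add one pseudo-count per vocabulary tag.
        denom = sum(tag_counts[label].values()) + len(vocabulary)
        for tag in tags:
            if tag in vocabulary:  # tags never seen in training are ignored
                score += math.log((tag_counts[label][tag] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING, CLASS_MAPPING)
print(predict(["json", "parser"], model))
print(predict(["logging"], model))
```

Applying the mapping before training merges the small, specific categories into larger generic classes, which is one way to soften the class imbalance the abstract mentions.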