Similarity of Software Libraries: A Tag-based Classiﬁcation Approach

Maximilian Auch (1, a), Maximilian Balluff (1, b), Peter Mandl (1, c) and Christian Wolff (2, d)

(1) University of Applied Sciences Munich, Lothstraße 34, 80335 Munich, Germany

(2) University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany

Keywords:

Software Libraries, Classification, Tags, Similarity, Naïve Bayes, Logistic Regression, Random Forest, Neural Network.

Abstract:

The number of software libraries has increased over time, so grouping them into classes according to their

functionality simpliﬁes repository management and analyses. With the large number of software libraries,

the task of categorization requires automation. Using a crawled dataset based on Java software libraries from

Apache Maven repositories as well as tags and categories from the indexing platform MvnRepository.com,

we show how the data in this set is structured and point out an imbalance of classes. We introduce a class

mapping relevant for the procedure, which maps the libraries from very speciﬁc, technical classes into more

generic classes. Using this mapping, we investigate supervised machine learning techniques that classify

software libraries from the dataset based on their available tags. We show that a tag-based classification of libraries can reach an accuracy of 97.46% when using neural networks. Overall, we found techniques such as neural networks and naïve Bayes more suitable for this use case than logistic regression or a random forest.

1 INTRODUCTION

Nowadays, more and more software libraries are used in software projects to query other services, to build APIs according to specifications, to simplify code and file handling, to integrate finished components, to simplify testing, to analyze code, or to integrate complex processing, for example in the case of distributed calculations. By using libraries, typically less code needs to be written and therefore less code needs to be tested and maintained. Thung et al. (Thung et al., 2013) found in 2013 that 93.3% of the software projects examined in their study use third-party libraries. On average, these projects included 28 third-party libraries each.

Within the context of our research project, we plan to identify similar software projects automatically in order to derive design decisions and business-relevant information. Similarity is to be determined by a hybrid approach consisting of technical and subject-related similarity. For technical similarity, libraries could help classify software by providing references to the types of processing and storage of the data, boundary systems, provided APIs and the types of usage of the software. By analyzing technology stacks in similar software that evolve over time, decisions on selected technologies, design and architecture should be derived. For this purpose, however, the libraries must be classified in order to detect migrations from library A to library B with similar functions. As too many libraries are available, manual classification is not feasible. Therefore, this research project aims to evaluate whether libraries crawled from Apache Maven repositories can be classified automatically into generic classes, derived from classes available online, by using machine learning on tags.

a https://orcid.org/0000-0002-4860-7464

b https://orcid.org/0000-0002-6837-0628

c https://orcid.org/0000-0003-4508-7667

d https://orcid.org/0000-0001-7278-8595

Our motivation behind the choice of a purely tag-based approach is to investigate whether a classification of software libraries into mapped classes can already be accomplished exclusively by the use of tags. We want to investigate whether this approach is suitable for classification or whether other features, as described in related work, should be used in addition. Furthermore, the introduction of new classes for classes that are too fine-grained and the identification of classes that are too coarse should provide an opportunity for future research work in this field.

Auch, M., Balluff, M., Mandl, P. and Wolff, C.

Similarity of Software Libraries: A Tag-based Classiﬁcation Approach.

DOI: 10.5220/0010521600170028

In Proceedings of the 10th International Conference on Data Science, Technology and Applications (DATA 2021), pages 17-28

ISBN: 978-989-758-521-0

2 RELATED WORK

The ﬁeld of identifying similar software projects for

categorization is large. A recently conducted and

published systematic literature review (Auch et al.,

2020) shows the variety of approaches and motiva-

tions behind the work. In comparison, the catego-

rization of software libraries is rarely found in pub-

lished research projects. The review paper only points

out the work of Escobar-Avila (Escobar-Avila, 2015),

who published an approach to automatically catego-

rize software libraries. He manually collected the

bytecode (.jar ﬁles), proﬁles and categories of the li-

braries. In total, 158 software libraries written in Java,

which are published and maintained by the Apache

Software Foundation (ASF) were examined. The

bytecode documents were then transformed into a

vector space model representation and clustered af-

terwards.

This approach has some beneﬁts: The major one is

probably that the bytecode of the libraries is one type

of artefact that is always present, since it is needed

to run the application. While some information is

lost in the compilation process (e.g., comments), the

bytecode still contains most of the textual information

found in the source code. With this approach, good

results in binary classiﬁcation could be achieved with

an accuracy of 86%. However, the author also points

out that the same approach achieved an accuracy of

only 40% in multi-class categorization. Since we are

aiming for a multi-class categorisation in this work,

we decided on using meta-data and have therefore se-

lected a different, tag-based approach.

Related to our approach, a paper with a hybrid approach was also recently published (Velázquez-Rodríguez and De Roover, 2020). Their work aims for an automated multi-label classification, recommending tags for untagged software libraries. To do so, they use a combination of the existing tags and word vectors from binary files. They point out that the data from the indexing platform MVNRepository, which we also use, has limited usability. This is because the necessary tags are often missing or only a single tag is available. This observation was made based on a crawled dataset with about 3000 tagged libraries. We were also able to identify this limitation to some extent in our larger dataset and see it as a challenge for our approach to classify tagged, but not yet categorized libraries based on their tags. To get a better understanding of the distribution of tags, we also describe this distribution for our crawled data in the following section. Apart from this finding, however, their study cannot be benchmarked against ours, as they pursue a different goal on the basis of different data and methods.

Aside from the work on improving repository management, similar approaches using tags for categorization have been applied to other data and purposes. For instance, in social media, pictures and content are categorized by using user-generated tags (Moëllic et al., 2008). Such approaches can be used to organize and manage large amounts of data, such as images shared by platform users. In general, the term social recommendation tasks is used to describe various goals, such as guided search, people profiling, tag recommendation and finding domain experts, while taking tags into account (Bogers, 2018). Another possible application is the categorization of music and artists using tags (Hong et al., 2008). For example, a tag-based experiment to find similarities between artists was conducted. In addition to good results in the mentioned studies, these approaches describe some challenges that can also arise when categorizing software repositories. These challenges include the lexical variability of terms as well as highly specific, personal and noisy tags (Moëllic et al., 2008). Another challenge is caused by applying and evaluating imbalanced data, which was also identified in the earlier study (Velázquez-Rodríguez and De Roover, 2020) on the MVNRepository tags. Since we were confronted with a similar problem in our experiment, we describe below our approach and all the steps taken in order to obtain a result that can be evaluated.

3 APPROACH

3.1 Dataset

Software libraries for the Java Virtual Machine (JVM) were crawled between May and July 2020 to create a machine learning dataset. The libraries were collected by their group-id and artefact-id from the largest public repositories. For this purpose, a custom crawler was implemented, which parses the DOM tree of the repository websites.

By far the largest repository was found to be Maven Central (https://repo1.maven.org/maven2/), which contained slightly over 300,000 libraries. The other five repositories selected for crawling were Sonatype (https://oss.sonatype.org/content/repositories/), Spring IO (https://repo.spring.io/plugins-release/), Atlassian (https://maven.atlassian.com/content/repositories/atlassian-public/), Hortonworks (https://repo.hortonworks.com/content/repositories/releases/) and Wso2 (https://maven.wso2.org/nexus/content/repositories/releases/). After merging and removing duplicates, the dataset has a size of 328,000 software libraries.

DATA 2021 - 10th International Conference on Data Science, Technology and Applications

Figure 1: Crawling of maven repositories.

Subsequently, we searched for tags and labels already assigned to the crawled libraries. We found the online service MvnRepository.com, which provides a searchable index with additional information based on our crawled repositories. Among other things, categories and tags are provided for some libraries. While categories bundle similar libraries of a domain, tags show more coarse-grained and eventually unique properties of a library (Velázquez-Rodríguez and De Roover, 2020). For our dataset, we found around 26,550 libraries that were already labeled and tagged. Consequently, about 8% of the dataset was initially available for training and evaluation. A further 246,400 libraries were provided by the service only in tagged form. Therefore, about 75% of the crawled dataset was still unlabeled, but already tagged. These libraries are the target of this automated classification approach in order to reliably categorize the most common libraries we encountered. This process is shown in Figure 1. A quarter of the entire dataset, consisting of about 80,000 libraries, is neither labeled nor tagged. Whether these are as relevant in terms of frequency of use as the tagged and labeled libraries, and how they could be categorized, remains to be clarified. Therefore, we have excluded them from this study and will include them in future work.

3.2 Labeling and Balancing

As described above, some of the crawled libraries are already sufficiently tagged and labeled. However, some classes contain more libraries than others. There are two reasons for this. First, it is due to the domain being crawled. Some subject areas are supported more strongly and diversely by existing libraries than others. For example, there are many more libraries that simplify testing code with helpful functions than there are libraries that support the generation of hashes. Second, the labels of MvnRepository.com are chosen in different, subject-specific granularities. For example, the Android Packages class includes all libraries that provide any functionality, whether database persistence, process-specific or UI-related. This may be useful for a search platform, but it may bias the results when analyzing technically similar software projects. Persistence-related libraries, in contrast, have been divided into classes by technology. This means that classes exist that are too fine-grained, e.g. the data persistence classes "Cassandra Clients", "DynamoDB Clients", "Embedded SQL Databases", "MySQL Drivers", "PostgreSQL Drivers", "Object/Relational Mapping" and "JPA Implementations". These should be combined into more generic classes like Database or Persistence for future research goals, such as detecting the migration from, e.g., MySQL to PostgreSQL.

As a follow-up step, several measures were taken to better balance the dataset and to use a more appropriate classification for the subsequent research work. The main measure is a relabeling. For this purpose, we introduce a mapping which reduces the 162 crawled categories to more general or merged categories where needed. The mapping to the resulting 69 classes is presented in Tables 1 to 3.
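The relabeling can be pictured as a simple lookup from crawled labels to generic classes. The following is a minimal sketch, not the authors' implementation: the names `LABEL_MAPPING`, `map_label` and `relabel` are hypothetical, and only a handful of entries taken from the mapping tables are shown.

```python
# Illustrative excerpt of a crawled-label -> generic-class mapping.
# The dictionary and function names are hypothetical; the full mapping
# covers 162 crawled categories.
LABEL_MAPPING = {
    "MySQL Drivers": "Database",
    "PostgreSQL Drivers": "Database",
    "Cassandra Clients": "Database",
    "Cache Clients": "Caching",
    "Cache Implementations": "Caching",
    "Build Models": "Build",
    "Build Tools": "Build",
}


def map_label(crawled_label):
    """Return the generic class of a crawled label; unmapped labels stay unchanged."""
    return LABEL_MAPPING.get(crawled_label, crawled_label)


def relabel(libraries):
    """Relabel (library_id, crawled_label) pairs with their generic classes."""
    return [(lib_id, map_label(label)) for lib_id, label in libraries]
```

With such a mapping, a migration from a MySQL driver to a PostgreSQL driver stays within the same generic class, which is the property the later similarity analysis relies on.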

As an additional step in order to better balance

the distribution of the classes in the dataset, classes

with few libraries were speciﬁcally enriched. For this

purpose, unlabeled but tagged libraries, which can be

clearly assigned to one of the classes, were manually

labeled afterwards. This procedure was also used to

validate the new class assignment from tables 1 to 3.

As a result of the validation it was recognized that

some libraries only provide examples of functionali-

ties and their utilization. These example libraries of

the crawled dataset are not categorized separately, but

are divided into the respective categories presented

in the tables. However, for planned future research

on the detection of technical similarity of software

projects, these sample libraries should be classiﬁed

in a new category. Therefore, an additional cate-

gory ”Example” has been introduced to bundle these

types of libraries. Partly the relabeling was straight-

forward, as the corresponding libraries were already

tagged with tags like ”example”. In some cases we

found that these example libraries did not have the

appropriate tags. However, the group-id and artefact-

id allowed us to draw reliable conclusions about ex-

ample libraries. These were tagged and labeled by

us manually afterwards. In addition, we did more

manual work by applying a qualitative content anal-

ysis (Mayring, 2004) on the dataset and especially

aimed to enrich the underrepresented classes. Con-

sequently, this resulted in a slightly larger dataset for


Table 1: Mapping of crawled labels to newly deﬁned labels.

Crawled labels Mapped labels

Actor Frameworks Actor Frameworks

Android Packages Android Packages

Android Platform Android Platform

Annotation Processing Tools Annotation Processing Tools

Application Servers

Application-/ Web-Server

Network App Frameworks

Tomcat Session Managers

Web Servers

FTP Clients and Servers

Application Layer Protocol In-

tegration

HTTP Clients

REST Framework

SSH Libraries

SSL Component Libraries

WebServices Metadata

WebSocket Clients

Simple Network Management Protocol

Aspect Oriented Aspect Oriented

Barcode Libraries Barcode Handler

Benchmarks

Microbenchmarks

Bitcoin Bitcoin

Build Models

Build

Build Tools

Maven Plugins

Build Automation Tool Plugins

Maven Repositories Api

Gradle Plugins

Bytecode Libraries Bytecode Libraries

Cache Clients

Caching

Cache Implementations

Chart Libraries Chart Libraries

Classpath Tools Classpath Tools

Cloud Computing Cloud Computing

Cluster Management Cluster Management

Code Analyzers

Code analyses

Defect Detection Metadata

Docker Clients Containerization

Command Line Parsers Command Line Parsers

Compression Libraries Compression Libraries

Concurrency Libraries Concurrency Libraries

Conﬁguration Libraries Conﬁguration Libraries

XMPP Integration Libraries

Communication Protocol Inte-

gration

Sms Library

Crawler

Web Crawlers

Encryption Libraries Cryptography

ArangoDB Clients

Database

Android DB

Cassandra Clients

Column Database Clients

Database

DB Migration Tools

DynamoDB Clients

ElasticSearch Clients

Embedded SQL Databases

Graph Databases

Hadoop Databases

Hadoop Query Engines

HBase Clients

Table 2: Mapping of crawled labels to newly deﬁned labels.

Crawled labels Mapped labels

JDBC Extensions

JDBC Pools

JPA Implementations

LevelDB Integration

MongoDB Clients

MySQL Drivers

Object/Relational Mapping

PostgreSQL Drivers

RDF Libraries

Redis Clients

SQL Drivers

Dependency Injection Dependency Injection

Distributed Communication

Transaction APIs/Managers

Distributed Computing Distributed Computing

Distributed Coordination Distributed Coordination

Distributed Tracing Distributed Tracing

DNS Libraries DNS Libraries

Exception Handling Exception Handling

External Process Execution External Process Execution

CSV Libraries

File Handler

Excel Libraries

HTML Parsers

JSON Libraries

Markdown Processors

PDF Libraries

SVG Libraries

XML Processing

XPath Libraries

YAML Parsers

Distributed File Systems

File System

S3 Clients

Virtual File Systems

Geospatial Libraries Geospatial Libraries

Graph Algorithms

Graph Algorithms and Tools

GraphViz Libraries

Hashing Libraries Hashing

Eclipse Plugins

IDE Modules

Eclipse Runtime

NetBeans Modules

Java Speciﬁcations Java Speciﬁcations

Job Scheduling Job Scheduling

JWT Libraries JWT Libraries

Expression Languages

Languages / Compiler / Inter-

pretation

Functional Programming

Java Compilers/Parsers

JavaScript Processors

JVM Languages

Language Runtime

Parser Generators

Python Integration

Scala Compilers

Application Metrics

Logging / Monitoring

Logging Bridges

Logging Frameworks

Monitoring Tools

Machine Learning Machine Learning

Mail Clients

Mailing

Mail Server


Table 3: Mapping of crawled labels to newly deﬁned labels.

Crawled labels Mapped labels

JMX Libraries Managing / Monitoring

Object Serialization Marshalling / Unmarshalling

Enterprise Integration

Messaging / Integration

Enterprise Service Bus

Message Brokers

Message Queue Clients

Money Libraries Money Libraries

Native Access Tools Native Access Tools

Natural Language Processing Natural Language Processing

Off-Heap Libraries Off-Heap Libraries

BPM Engines Operations management

OSGI Containers

OSGi

OSGI Frameworks

OSGI Utilities

Reﬂection Libraries Reﬂection Libraries

Rule Engines Rule Engines

Full-Text Indexing Libraries

Search Engines

OAuth Libraries

Security / Authentiﬁcation

Security Frameworks

Social Network Clients Social Network Clients

Stream Processing Stream Processing

Template Engines Template Engines

Assertion Libraries

Testing

Code Coverage Tools

Mocking

Testing Frameworks

Web Browser Automation

Web Testing

CSS, LESS, SASS

JSF Libraries

JSP Tag Libraries

Swing Layouts

Swing Libraries

Web Assets

Web Frameworks

Base64 Libraries

Utilities

Collections

Console Utilities

Core Utilities

Date and Time Utilities

Diff and Patch Libraries

I/O Utilities

I18N Libraries

Math Libraries

MIME Types Libraries

Object Pools

Regular Expression Libraries

String Utilities

Units of Measurement

UUID Generators

Validation Frameworks

Vector/Matrix Libraries

Git Tools

Version-control system tools

Subversion Tools

Web Applications

Web Upload Managers

[Figure 2: chart of class shares. Segment labels: Testing Frameworks, Web Applications, Example, Maven Plugins, Logging Frameworks, Other 64 classes. Percentage labels shown: 20%, 21%, 23%, 24%.]

Figure 2: Imbalanced distribution of libraries.

training and evaluation. After the manual labeling, about 28,600 labeled and tagged libraries were available, which is around 9% of the total dataset. We host the modified dataset in our repository for traceability of the classification approaches and for future work.

Not all crawled classes are too specific. We could also identify a class that is too generic. The Android Packages class is by far the largest class and is not useful for our overall research project, which aims to identify technically similar software using library analysis. For this project, it would be relevant which functionality these Android packages provide. The class is therefore treated as a collection of libraries to be dissolved, and the libraries belonging to it are removed from the collected library corpus. Label mapping, filtering, and relabeling result in a new distribution of libraries across classes. Figure 2 gives an overview of the resulting five largest classes. The remaining 64 classes cover only 24% of the labeled data. According to Chawla (Chawla, 2010), a dataset is imbalanced if the classes are not represented approximately equally.

This means that there is still an imbalance in the data, which is partly due to the domain. Methods such as the commonly used oversampling or undersampling can help with imbalanced data (Chawla, 2010). With random oversampling, replicated data is appended to the original dataset. With random undersampling, on the other hand, data is deleted from the original dataset. However, these methods do not only bring advantages: undersampling can cause information loss, and oversampling can lead to overfitting of the trained model. To overcome these disadvantages, there are various other methods such as informed undersampling, synthetic sampling with data generation or cluster-based sampling methods. Furthermore, cost-sensitive learning methods can be used, which do not rely on balanced costs for training, but on weighted costs for misclassification through cost matrices (He and Garcia, 2009). For future work, we still see possibilities for improved balancing of the dataset or the application of a training method that is adapted to imbalance. However, since these approaches can also have drawbacks, we have not applied any further balancing approaches in this first study, but consider them as future work if they help to improve the classification results.

https://github.com/CCWI/corpus-libsim.git
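For illustration only, random oversampling and random undersampling as described above could look like the following sketch. It was not applied in this study; the function names and the `(features, label)` sample layout are assumptions made for the example.

```python
import random
from collections import defaultdict


def group_by_label(samples):
    """Group (features, label) samples by their label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append((features, label))
    return groups


def random_oversample(samples, seed=0):
    """Append randomly replicated samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def random_undersample(samples, seed=0):
    """Randomly delete samples until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced
```

The sketch makes the trade-off visible: oversampling only repeats existing minority samples, while undersampling discards majority samples entirely.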

3.3 Distribution and Exclusion of Tags

The 437 different tags identified are partly assigned across classes and must be considered in combination. For example, while the tags "json" and "yaml" refer to a YAML parser, so that such libraries are classified in this class, the tag combination "json" and "mapping" points to a JSON library which supports deserialization to objects by means of mapping. In order to be able to follow a tag-based approach, it must first be ensured that the categorized training and evaluation data, as well as the uncategorized data, are sufficiently and similarly tagged. For this purpose, we first analyzed the distribution of the tags on the basis of the newly created dataset. The result is shown in Figure 3. It is noticeable that a small part of the categorized libraries is not tagged. This means that they cannot be considered for the training of a machine learning model. Furthermore, a higher percentage of tagged-only libraries have a single tag and fewer have more than three tags. Apart from these observations, a similar distribution could be found. Additionally, we checked whether the uncategorized data contains tags other than those of the categorized data. This was not the case, which is why we considered this to be a good basis for this research goal. Furthermore, the dataset has a few outliers, which have up to 13 tags. These are not included in Figure 3 for better illustration, but should be mentioned.

A ﬁrst analysis of the crawled dataset has shown

that not all tags have the same relevance. Some tags

were found to be irrelevant when viewing the crawled

dataset and could be excluded for training. These are

tags that occurred across classes and do not contribute

to the description of the functionality of a class. The

excluded tags were ”github”, ”codehaus”, ”apache”,

”experimental”, ”starter”, ”runner”, ”api” and ”bom”.

The ﬁrst three indicate in most cases where the project

was hosted. The remaining tags to be excluded indi-

cate an irrelevant status, function or structure. By ex-

cluding these tags, we assume that an improved train-

ing result can be achieved.
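The exclusion step amounts to a simple filter over each library's tag list. A minimal sketch follows; the function name `filter_tags` is hypothetical, and the case-insensitive comparison is an assumption of this example.

```python
# Tags excluded from training, as listed in the text.
EXCLUDED_TAGS = {
    "github", "codehaus", "apache",
    "experimental", "starter", "runner", "api", "bom",
}


def filter_tags(tags):
    """Drop excluded tags while keeping the order of the remaining ones."""
    return [tag for tag in tags if tag.lower() not in EXCLUDED_TAGS]
```

A library whose tags are all excluded ends up with an empty list and would have to be treated like an untagged library.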

[Figure 3: bar chart. x-axis: Amount of Tags (1 to 6); y-axis: Rate in %; series: categorized and tagged libraries, tagged-only libraries.]

Figure 3: Distribution of tags in the dataset under separate consideration of categorized and uncategorized, but tagged, libraries.

4 METHOD

In this study, different algorithms for multi-class classification (Manning et al., 2008) were applied. For this purpose, models were trained and evaluated using multinomial logistic regression (Böhning, 1992), multinomial naïve Bayes (Manning et al., 2008), a random forest of decision trees (Breiman, 2001) as well as a feed forward neural network (Goodfellow et al., 2016). In the following, the different algorithms are briefly explained to provide an overview.

4.1 Multinomial Logistic Regression

As a first approach, we applied logistic regression. Since it usually relies on binary labels, we used multinomial logistic regression (Böhning, 1992), which can be used for the multi-class classification presented in this work. Unlike the naïve Bayes described below, logistic regression does not treat the analyzed tags as statistically independent of each other. For this purpose, the implementation of the Apache Spark mllib was used (Apache Spark, 2020a). Equation 1 shows the approach. It calculates the probability of the categorical outcome Y, which can be one of the K possible classes, for k = 1, 2, ..., K (Apache Spark, 2020a). Here, X is a vector representation of the tags, while the calculated regression coefficients, representing a vector of weights corresponding to outcome k, are denoted as $\beta_k$ and $\beta_{0k}$.

$$P(Y = k \mid X, \beta_k, \beta_{0k}) = \frac{e^{\beta_k \cdot X + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\beta_{k'} \cdot X + \beta_{0k'}}} \qquad (1)$$


In addition, the maximum number of iterations

and elastic net regularization can be optimized by

parametrization for the training of the logistic regres-

sion model. For this purpose, we tried different values

but could not ﬁnd a signiﬁcantly better result for the

dataset. Therefore, the standard parameters from the

documentation (Apache Spark, 2020a) were used and

no optimization was performed.
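To make the softmax form of equation 1 concrete, the following sketch evaluates the class probabilities for given coefficients in plain Python. It is not the Spark implementation; the function name and the argument layout (one weight vector and one intercept per class) are assumptions of this example.

```python
import math


def class_probabilities(x, weights, intercepts):
    """Softmax over the linear scores beta_k . x + beta_0k of all classes."""
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, intercepts)
    ]
    # Subtracting the maximum score avoids overflow and leaves the result unchanged.
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]
```

The predicted class is the index with the highest probability; because the same denominator is shared by all classes, the probabilities always sum to one.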

4.2 Multinomial Naïve Bayes

Naïve Bayes (Duda et al., 1973) is even simpler than logistic regression. It assumes that all of the tags assigned to a library of a given class are independent of each other, which is called the "naïve Bayes assumption". While this assumption is false in many real-world tasks, the naïve Bayes classifier often performs well. In addition, this assumption of independence simplifies training by allowing the tags to be learned separately. Especially with larger datasets, the training is easier and more efficient (McCallum et al., 1998). A single iteration over the training data is sufficient to calculate the conditional probability distribution of each tag for each class. The multinomial naïve Bayes classifier also supports multi-class classification and is therefore a possible method for a reliable classification of libraries in this study.

Here, the probability is calculated following the implementation in Spark (Apache Spark, 2020a), as in equation 2. For each feature vector x, containing the tags of a library, this yields a prediction for each available class $C_k$. Let $x_i$ stand for the number of appearances of tag i in a specific instance and $p_{ki}$ for the probability that tag i occurs in class $C_k$.

$$p(x \mid C_k) = \frac{\left(\sum_i x_i\right)!}{\prod_i x_i!} \prod_i p_{ki}^{x_i} \qquad (2)$$
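As a worked illustration of the multinomial likelihood, the sketch below evaluates it in log space for one class. The function name is hypothetical, and the sketch assumes that every tag present in the instance has a non-zero probability, which in practice is ensured by smoothing.

```python
import math


def log_multinomial_likelihood(x, p):
    """Return log p(x | C_k) for tag counts x and per-tag probabilities p of one class."""
    n = sum(x)
    # log of (sum_i x_i)! / prod_i x_i!, computed via the log-gamma function
    log_coefficient = math.lgamma(n + 1) - sum(math.lgamma(x_i + 1) for x_i in x)
    # log of prod_i p_i^x_i; tags that do not occur contribute nothing
    log_product = sum(x_i * math.log(p_i) for x_i, p_i in zip(x, p) if x_i > 0)
    return log_coefficient + log_product
```

For a library with two distinct tags whose class probabilities are 0.5 and 0.3, the coefficient is 2!/(1!·1!) = 2 and the likelihood is 2 · 0.5 · 0.3 = 0.3. Working in log space avoids numerical underflow when many tags are multiplied.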

4.3 Random Forest

The stochastic modeling approach random forest described by Ho (Tin Kam Ho, 1995) is an ensemble of decision trees. It is used for classification by combining many decision trees to reduce the risk of overfitting. Similar to decision trees, the random forest offers multi-class classification based on categorical features. The accuracy should increase by combining many decision trees, which is why we decided to include the random forest algorithm in the evaluation on our library corpus. The implementation follows the documentation of Apache Spark mllib (Apache Spark, 2020b). The parameterization allows setting the number of trees in the forest and the maximum depth of each tree. In addition, a subsampling rate and a feature subset strategy can be tuned. After several optimization attempts without significant improvements, we used the default configuration.

4.4 Neural Network (NN)

Neural networks have become increasingly important

in recent years. A recently published SLR (Auch

et al., 2020) also shows that neural networks have

been used more and more in approaches for the recog-

nition of similar software projects. For the implemen-

tation of our feed forward neural network (Goodfel-

low et al., 2016) we used Keras on top of TensorFlow

2.4 as a high-level API (Keras SIG, 2020).

The applied neural network uses a simple input

layer with a ﬁxed size of 13 x 437. This size is cal-

culated based on the maximum amount of tags per

library and the total amount of tags. As mentioned

in the description of the dataset, the number of tags

can vary for each library. For the network to properly

handle a varying number of tags, we decided to create

13 vectors for each library. Since the libraries of our

dataset have a maximum of 13 tags, each vector cor-

responds to a possible tag. Each of these vectors has

a size of 437, as we were able to identify as many dif-

ferent tags in our dataset. For instance, if the library

has only a single tag, the ﬁrst vector contains a one

at the index of the tag, while the other 12 vectors are

just containing zero values.
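The input encoding described above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name `encode_tags` is hypothetical, and a small vocabulary stands in for the 437 tags of the dataset.

```python
MAX_TAGS = 13  # maximum number of tags per library in the dataset


def encode_tags(tags, vocabulary):
    """Encode a library's tags as MAX_TAGS one-hot vectors of length len(vocabulary)."""
    if len(tags) > MAX_TAGS:
        raise ValueError("a library may have at most %d tags" % MAX_TAGS)
    index = {tag: position for position, tag in enumerate(vocabulary)}
    # Start with all-zero vectors; unused tag slots stay zero.
    matrix = [[0] * len(vocabulary) for _ in range(MAX_TAGS)]
    for row, tag in enumerate(tags):
        matrix[row][index[tag]] = 1
    return matrix
```

With the full vocabulary of 437 tags, the returned matrix has the 13 x 437 shape of the input layer.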

We kept the structure of our feed forward neural network relatively simple. As shown in Figure 4, the network starts with an input layer followed by fully connected (dense) hidden layers and a flatten layer. Finally, the output layer consists of 69 neurons, one for each class. The hidden layers use a rectified linear unit (ReLU) activation function (Hara et al., 2015), while the output layer uses the SoftMax function. We use the callbacks ReduceLROnPlateau and EarlyStopping in Keras to improve the training speed of our model (Zaheer et al., 2018). In addition, we used the ADAM optimizer, which is popular in the deep learning community (Zaheer et al., 2018).

During the optimization, we experimented with different hyperparameters for the introduced network. To this end, we applied a 5-fold nested cross validation (Varma and Simon, 2006) to optimize the number of neurons per hidden layer and the depth of the network. For each outer cross validation split, the algorithm determines the best model based on the mean validation accuracy of the inner cross validation. The best model selected by the inner cross validation was then trained on the complete outer training set and evaluated on the outer test set. In the end, we found three different models, each with a single hidden layer of between 150 and 250 neurons. Models with more than one hidden layer performed slightly worse, by about 0.5% average accuracy. Within the scope of this study, we did not perform any further hyperparameter optimization, since it is computationally expensive and the results of this first study, described below, were considered sufficient. Nevertheless, we have included this potential for improvement in future work.
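The nested procedure can be sketched as follows. This is a simplified illustration, not the authors' tuning code: `fit_and_score` is a hypothetical callback that trains a model for one candidate configuration on the given training indices and returns its accuracy on the given evaluation indices, and the toy scorer at the end merely stands in for real training.

```python
import random
from statistics import mean


def make_folds(indices, k):
    """Split a list of indices into k disjoint folds of near-equal size."""
    return [indices[i::k] for i in range(k)]


def all_but(folds, held_out):
    """Concatenate every fold except the held-out one."""
    return [i for j, fold in enumerate(folds) if j != held_out for i in fold]


def nested_cv(n_samples, candidates, fit_and_score, k_outer=5, k_inner=5, seed=0):
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    outer_folds = make_folds(indices, k_outer)
    chosen, outer_scores = [], []
    for o, test_idx in enumerate(outer_folds):
        train_idx = all_but(outer_folds, o)
        inner_folds = make_folds(train_idx, k_inner)

        def inner_mean(candidate):
            # Mean validation accuracy of one candidate over the inner folds.
            return mean(
                fit_and_score(candidate, all_but(inner_folds, v), val_idx)
                for v, val_idx in enumerate(inner_folds)
            )

        best = max(candidates, key=inner_mean)
        chosen.append(best)
        # Retrain the selected configuration on the full outer training set.
        outer_scores.append(fit_and_score(best, train_idx, test_idx))
    return chosen, outer_scores


def toy_score(hidden_units, train_idx, eval_idx):
    """Stand-in for real training: pretends that 200 hidden units is optimal."""
    return 1.0 - abs(hidden_units - 200) / 1000


chosen, outer_scores = nested_cv(100, (150, 200, 250), toy_score)
print(chosen, mean(outer_scores))
```

Because model selection only ever sees the inner folds, the outer test fold stays untouched until the final evaluation of each split.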

Figure 4: Layers of the applied dense neural network. Input layer (shape 13 x 437) → dense layer (150-250 neurons, ReLU) → flatten layer → dense output layer (69 neurons, SoftMax).

5 EVALUATION

The evaluation of models in a multi-class problem requires a different approach than binary classification. This applies all the more to an imbalanced dataset, in which the classes are not evenly distributed (Gu et al., 2009). In a multi-class problem, the usual precision and recall measurements cannot be taken over all classes directly, since they are calculated for each label separately. However, it is possible to calculate a weighted average for measurements like precision, recall and F1 (Sokolova and Lapalme, 2009), (Scikit-learn Developers, 2020). Therefore, for the evaluation of the results, the weighted precision or weighted "positive predictive value" (Tharwat, 2020) ($PPV_w$) was calculated using the formula in equation 3 (Apache Spark, 2020c). For each class, the correctly labeled libraries, the true positives (TP), are divided by the sum of TP and false positives (FP), which are the libraries that are incorrectly assigned to the corresponding class. Each class result is then weighted by the number of true instances of that class, and the weighted results are summed to obtain the overall result.

$$PPV_w = \frac{1}{N} \sum_{\ell \in L} \frac{TP}{TP + FP}(\ell) \cdot \sum_{i=0}^{N-1} \delta(y_i - \ell) \qquad (3)$$

Table 4: The evaluation results of the given models (mean and standard deviation of the 5-fold cross validation).

| Model | Measure | 5-fold cross validation |
|---|---|---|
| Logistic Regression | Accuracy | 0.4476 (±0.0074) |
| Logistic Regression | Weighted Precision | 0.3245 (±0.0063) |
| Logistic Regression | Weighted Recall | 0.4476 (±0.0074) |
| Logistic Regression | Weighted F1 | 0.3465 (±0.0066) |
| Multinomial Naïve Bayes | Accuracy | 0.9257 (±0.0025) |
| Multinomial Naïve Bayes | Weighted Precision | 0.9254 (±0.0021) |
| Multinomial Naïve Bayes | Weighted Recall | 0.9257 (±0.0025) |
| Multinomial Naïve Bayes | Weighted F1 | 0.9213 (±0.0024) |
| Random Forest | Accuracy | 0.6081 (±0.0082) |
| Random Forest | Weighted Precision | 0.5866 (±0.0165) |
| Random Forest | Weighted Recall | 0.6081 (±0.0082) |
| Random Forest | Weighted F1 | 0.5365 (±0.0090) |
| Neural Network | Accuracy | **0.9746 (±0.0017)** |
| Neural Network | Weighted Precision | 0.9738 (±0.0014) |
| Neural Network | Weighted Recall | 0.9736 (±0.0013) |
| Neural Network | Weighted F1 | 0.9734 (±0.0014) |

Similar to $PPV_w$, $TPR_w$ was used, which is the weighted "true positive rate" (Tharwat, 2020) and therefore represents the weighted recall. Here, unlike for the weighted precision, TP is divided by the sum of TP and the libraries that are wrongly assigned to other classes, also called false negatives (FN). The calculation is shown in equation 4.

$$TPR_w = \frac{1}{N} \sum_{\ell \in L} \frac{TP}{TP + FN}(\ell) \cdot \sum_{i=0}^{N-1} \delta(y_i - \ell) \qquad (4)$$

Since the F-score is the harmonic mean of precision and recall, it is also considered in this study. First, the F(β)-score is calculated with β = 1 to weight precision and recall equally (Zhang and Zhou, 2014). This is shown in equation 5.

$$F(\beta) = (1 + \beta^2) \cdot \frac{PPV \cdot TPR}{\beta^2 \cdot PPV + TPR} \qquad (5)$$

This calculated F(β) was then used to calculate the weighted F value ($F_w$), which is shown in equation 6.

$$F_w(\beta) = \frac{1}{N} \sum_{\ell \in L} F(\beta, \ell) \cdot \sum_{i=0}^{N-1} \delta(y_i - \ell) \qquad (6)$$

Finally, the accuracy was calculated using equation 7. In contrast to binary classification, multi-class classification problems require metrics that provide a result across all classes. According to (Apache Spark, 2020c), accuracy measures the precision across all labels: the number of times any class was correctly predicted (TP), normalized by the number of data points.

$$ACC = \frac{TP}{TP + FP} = \frac{1}{N} \sum_{i=0}^{N-1} \delta(\hat{y}_i - y_i) \qquad (7)$$
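The weighted measures and the accuracy defined in this section can be checked on a small example. The sketch below is an illustration under assumptions, not the evaluation code used in the study: it takes $y_i$ as the true label and $\hat{y}_i$ as the predicted label of library $i$, $N$ as the number of evaluated libraries, $L$ as the set of labels, and $\delta(x)$ as 1 if $x = 0$ and 0 otherwise, following the cited Spark documentation. The function name `weighted_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred, beta=1.0):
    """Accuracy and support-weighted precision, recall and F(beta)."""
    n = len(y_true)
    support = Counter(y_true)                      # true instances per label
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)                    # TP + FP per label
    w_precision = w_recall = w_f = 0.0
    for label, count in support.items():
        precision = tp[label] / predicted[label] if predicted[label] else 0.0
        recall = tp[label] / count                 # TP / (TP + FN)
        denom = beta ** 2 * precision + recall
        f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
        weight = count / n                         # share of true instances
        w_precision += weight * precision
        w_recall += weight * recall
        w_f += weight * f_beta
    accuracy = sum(tp.values()) / n
    return accuracy, w_precision, w_recall, w_f


# Hypothetical toy labels, not taken from the dataset.
y_true = ["web", "web", "web", "test", "test", "log"]
y_pred = ["web", "web", "test", "test", "test", "web"]
accuracy, w_p, w_r, w_f1 = weighted_metrics(y_true, y_pred)
print(round(accuracy, 4), round(w_p, 4), round(w_r, 4), round(w_f1, 4))
```

With these definitions, the weighted recall always equals the accuracy, because the class supports cancel out in the weighted sum.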

For the evaluation, a k-fold cross validation was applied. This involves splitting the shuffled dataset into k random, mutually exclusive subsets. These folds $S_1, \ldots, S_k$ should be roughly the same size and can

DATA 2021 - 10th International Conference on Data Science, Technology and Applications

be rotated as datasets for training and evaluation. The fold used for evaluating the trained model changes in every run, so that each of the k folds is used for evaluation exactly once. All remaining folds of a run are used for training the machine learning model. The result is calculated across all runs. A detailed description of the procedure is given by Kohavi (Kohavi, 1995). It is important to choose a suitable k for the procedure. Kohavi concludes that choosing k between 10 and 20 reduces the variance of the results, while the bias increases. A smaller k between 2 and 5 will show a lower bias, but possibly a greater variance in results. We therefore decided to use a 5-fold cross validation and additionally a 10-fold cross validation. The results of the evaluation of the models are shown in Table 4. The values given for each metric are the rounded means of the evaluation results, and the standard deviation in parentheses after each mean shows the spread of the results. The highest accuracy achieved in the 5-fold cross validation is highlighted in bold. Since the results of the 10-fold cross validation deviated from those of the 5-fold cross validation only in the decimal places, we focus on the results of the lower-biased 5-fold cross validation.
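The fold rotation and the reporting format of Table 4 can be sketched as follows. This is an illustrative sketch under assumptions: the helper names `k_fold_splits` and `summarize` are hypothetical, the per-fold accuracies are made up, and the use of the population standard deviation is a guess, since the text does not state which estimator was used.

```python
import random
from statistics import mean, pstdev


def k_fold_splits(n_samples, k, seed=42):
    """Yield (train_indices, test_indices) for each of the k rotations."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # shuffle before splitting
    folds = [indices[i::k] for i in range(k)]      # k disjoint, near-equal folds
    for held_out in range(k):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, folds[held_out]


def summarize(scores):
    """Format fold scores as 'mean (±std)', as in Table 4."""
    return f"{mean(scores):.4f} (±{pstdev(scores):.4f})"


splits = list(k_fold_splits(n_samples=23, k=5))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)

# Hypothetical per-fold accuracies, only to show the reporting format.
report = summarize([0.91, 0.93, 0.92, 0.94, 0.90])
print(report)
```

Every sample appears in exactly one evaluation fold, so each library contributes to the reported mean once.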

As described in the previous section, we applied nested cross validation with a 5-fold outer and a 5-fold inner cross validation for the neural network while performing hyperparameter tuning. Nested cross validation reduces the bias when evaluating different neural networks and gives an improved estimate of the achievable accuracy on the test dataset as well as of other measurements, like the error (Varma and Simon, 2006). As a result, we obtained several trained models that performed differently, though with little deviation, on the respective test datasets of the outer cross validation. The result of the neural network in Table 4 therefore does not represent the performance of a single trained model, but the average across all 3 best models.

Additionally, it should be noted that the weighted F1 value does not have to lie between the calculated values of weighted precision and weighted recall, even though F1 normally represents the harmonic mean of precision and recall. This is because the weighted F1 metric averages the per-class F1 values by class support and thereby takes the imbalance of the data into account (Scikit-learn Developers, 2020).
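A small worked example shows how this can happen. The per-class values below are hypothetical and chosen only for illustration; they correspond to two classes with four true instances each, where class_a has 2 correct predictions and 2 instances predicted as class_b, and class_b has 3 correct predictions and 1 instance predicted as class_a.

```python
from fractions import Fraction as F

# Hypothetical per-class results: (precision, recall, number of true instances).
per_class = {
    "class_a": (F(2, 3), F(1, 2), 4),
    "class_b": (F(3, 5), F(3, 4), 4),
}

total = sum(support for _, _, support in per_class.values())


def weighted(values):
    """Support-weighted average over (value, support) pairs."""
    return sum(v * s for v, s in values) / total


w_precision = weighted((p, s) for p, _, s in per_class.values())
w_recall = weighted((r, s) for _, r, s in per_class.values())
# Per-class F1 is the harmonic mean of that class's precision and recall.
w_f1 = weighted((2 * p * r / (p + r), s) for p, r, s in per_class.values())

print(float(w_precision), float(w_recall), float(w_f1))
```

Here the weighted precision is 19/30, the weighted recall is 5/8, and the weighted F1 is 13/21, which is lower than both, because each per-class harmonic mean is pulled toward the smaller of its two inputs before the averaging takes place.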

6 DISCUSSION

As described in the introduction, the goal of this study was to determine whether the libraries of Apache Maven repositories can be classified automatically by their available tags using machine learning techniques. For this purpose, we applied different approaches and can answer this leading question with a clear "yes", achieving a high accuracy of 97.46% with a standard deviation of ±0.17 percentage points. This result also includes the mapping of the crawled online classes to more generic classes that are important for our further research. The question was whether the machine learning methods would still be able to assign the libraries to classes based on their tags even after restructuring the classes. Considering the accuracy achieved, this question can also be answered in the affirmative.

We found that the neural network models in our training scenario give the best result on average compared to the other machine learning techniques. The overall accuracy of 97.46% is almost 5 percentage points higher than that of naïve Bayes. The trained networks also achieved a high weighted precision and a high weighted recall. These findings result in a weighted F1 score of 97.34% with a standard deviation of ±0.14 percentage points. The appended confusion matrix in Table 5, generated from a randomly picked evaluation run, demonstrates that the assignment is also mostly correct and reliable within the individual classes. The naïve Bayes approach achieved a good result as well, with an accuracy of 92.57%. Its weighted precision, recall and F1 values are on a similarly high level, which is why this approach can also be considered reliable. For the random forest, the default implementation parameters of Apache Spark were used in the final evaluation phase (Apache Spark, 2020b). If necessary, the accuracy and the general result of the random forest approach could be further improved by optimizing the number of trees and the maximum depth in the forest.

As a clear limitation of the approach, it must be noted that these good results for the automated classification of libraries probably do not generalize, but hold only for the crawled libraries and their selected classes. The approach depends strongly on the quality of the assigned tags and the choice of the classes. Classes that are too coarse-grained or too fine-grained should be avoided when using these tags. We still see a need for further research and therefore return to these points in the section "Future Work".

7 CONCLUSIONS

With our approach, we want to bring more order to the large collections of libraries and enable further research in this field for us and potentially other research teams. Since the percentage of crawled, categorized software libraries in the Java Maven repositories is currently just over 8%, we have used a tag-based approach to label most of the libraries. This applies to an additional 67% of the libraries found in the largest mentioned repositories.

First, we were able to determine the composition of the categories and tags available online. Moreover, we were able to introduce a more general labeling of the libraries, adjusted for our further research work. In addition, labeled and unlabeled libraries seem to show a similar distribution of tags. Finally, an imbalance in the data was found, which we assume to be due to the domain under investigation. Based on these findings, we applied different approaches for automatic classification. For a tag-based approach on our relabeled dataset, a neural network with an achieved accuracy of 97.46% seems to be the most promising. We also obtained a good result with the applied naïve Bayes approach. In contrast, logistic regression and random forests did not yield sufficient results.

8 FUTURE WORK

With such promising results in the automated classification, we see only limited need for further optimization work. However, hyperparameter optimization of the neural network might yield even better results.

Furthermore, we still see a need for the evaluation of more generic approaches, because 25% of the libraries from our dataset as well as libraries from other platforms might not be tagged. This is where our approach has limitations for the management of repository items. For our trained models, tags must exist and need to be of similar quality. Since this is probably not always the case, alternative features for classification should be considered. We see future work in applying more generic approaches, using NLP to analyze the always available group-ids and artifact-ids as well as analyzing the always available binary code. This procedure could also be beneficial to the relabeling of libraries from the currently excluded class "android packages", since this class is too generic in our view. In addition, other features could be taken from metadata and considered for classification in combination with the features already listed. If available, we would consider the number of tags and downloads, code metrics, licenses, connections between contributors behind those libraries and keywords/entities from online websites.

Besides further classification approaches for the libraries, our further research, as already described in the introduction, aims to identify similar software on a domain-related and technical basis. For the technical basis, we plan to use the classified libraries and determine migration paths by analyzing the development of open source projects over time. By analyzing the commit history in software projects, we aim to provide decision support through automatically generated design decision recommendations.

REFERENCES

Apache Spark (2020a). Classification and regression. https://spark.apache.org/docs/3.0.0/ml-classification-regression.html, visited 2021-01-17.

Apache Spark (2020b). Ensembles - rdd-based api. https://spark.apache.org/docs/3.0.0/mllib-ensembles.html, visited 2021-01-17.

Apache Spark (2020c). Evaluation metrics - rdd-based api. https://spark.apache.org/docs/3.0.0/mllib-evaluation-metrics.html, visited 2021-01-17.

Auch, M., Weber, M., Mandl, P., and Wolff, C. (2020).

Similarity-based analyses on software applications: A

systematic literature review. Journal of Systems and

Software, page 110669.

Bogers, T. (2018). Tag-Based Recommendation, pages 441–

479. Springer International Publishing, Cham.

Böhning, D. (1992). Multinomial logistic regression algorithm. Annals of the institute of Statistical Mathematics, 44(1):197–200.

Breiman, L. (2001). Random forests. Machine learning,

45(1):5–32.

Chawla, N. V. (2010). Data Mining for Imbalanced

Datasets: An Overview, pages 875–886. Springer US,

Boston, MA.

Duda, R. O., Hart, P. E., et al. (1973). Pattern classiﬁcation

and scene analysis, volume 3. Wiley New York.

Escobar-Avila, J. (2015). Automatic categorization of soft-

ware libraries using bytecode. In 2015 IEEE/ACM

37th IEEE International Conference on Software En-

gineering, volume 2, pages 784–786.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep

Learning. MIT Press.

Gu, Q., Zhu, L., and Cai, Z. (2009). Evaluation measures

of the classiﬁcation performance of imbalanced data

sets. In Cai, Z., Li, Z., Kang, Z., and Liu, Y., editors,

Computational Intelligence and Intelligent Systems,

pages 461–471, Berlin, Heidelberg. Springer Berlin

Heidelberg.

Hara, K., Saito, D., and Shouno, H. (2015). Analysis of

function of rectiﬁed linear unit used in deep learning.

In 2015 International Joint Conference on Neural Net-

works (IJCNN), pages 1–8.

He, H. and Garcia, E. A. (2009). Learning from imbalanced

data. IEEE Transactions on Knowledge and Data En-

gineering, 21(9):1263–1284.


Hong, J., Deng, H., and Yan, Q. (2008). Tag-based artist

similarity and genre classiﬁcation. In 2008 IEEE

International Symposium on Knowledge Acquisition

and Modeling Workshop, pages 628–631.

Keras SIG (2020). About keras. https://keras.io/about/, vis-

ited 2021-01-31.

Kohavi, R. (1995). A study of cross-validation and boot-

strap for accuracy estimation and model selection. In

Proceedings of the 14th International Joint Confer-

ence on Artiﬁcial Intelligence - Volume 2, IJCAI’95,

page 1137–1143, San Francisco, CA, USA. Morgan

Kaufmann Publishers Inc.

Manning, C. D., Schütze, H., and Raghavan, P. (2008). Introduction to information retrieval. Cambridge university press.

Mayring, P. (2004). Qualitative content analysis. A com-

panion to qualitative research, 1(2004):159–176.

McCallum, A., Nigam, K., et al. (1998). A comparison of

event models for naive bayes text classiﬁcation. In

AAAI-98 workshop on learning for text categoriza-

tion, volume 752, pages 41–48. Citeseer.

Moëllic, P.-A., Haugeard, J.-E., and Pitel, G. (2008). Image clustering based on a shared nearest neighbors approach for tagged collections. In Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, CIVR '08, page 269–278, New York, NY, USA. Association for Computing Machinery.

Scikit-learn Developers (2020). sklearn metrics f1-score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html, visited 2021-01-17.

Sokolova, M. and Lapalme, G. (2009). A systematic analy-

sis of performance measures for classiﬁcation tasks.

Information processing & management, 45(4):427–

437.

Tharwat, A. (2020). Classiﬁcation assessment methods. Ap-

plied Computing and Informatics.

Thung, F., Lo, D., and Lawall, J. (2013). Automated library

recommendation. In 2013 20th Working Conference

on Reverse Engineering (WCRE), pages 182–191.

Tin Kam Ho (1995). Random decision forests. In Pro-

ceedings of 3rd International Conference on Docu-

ment Analysis and Recognition, volume 1, pages 278–

282 vol.1.

Varma, S. and Simon, R. (2006). Bias in error estimation

when using cross-validation for model selection. BMC

bioinformatics, 7(1):1–8.

Velázquez-Rodríguez, C. and De Roover, C. (2020). Mutama: An automated multi-label tagging approach for software libraries on maven. In 2020 IEEE 20th International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 254–258.

Zaheer, M., Reddi, S. J., Sachan, D., Kale, S., and Kumar,

S. (2018). Adaptive methods for nonconvex optimiza-

tion. In Proceedings of the 32nd International Con-

ference on Neural Information Processing Systems,

NIPS’18, page 9815–9825, Red Hook, NY, USA. Cur-

ran Associates Inc.

Zhang, M. and Zhou, Z. (2014). A review on multi-label

learning algorithms. IEEE Transactions on Knowl-

edge and Data Engineering, 26(8):1819–1837.


APPENDIX

Table 5: Confusion matrix of a neural network 5-fold cross validation result showing classes with more than a single evaluated library. Rows show the predicted classes and are arranged in the same order as the columns. The columns show the actual crawled and mapped classes. The 10 randomly excluded classes were removed for a better presentation of the matrix. However, they do not show any particular deviations in the prediction.

Testing

Graph Algorithms and Tools

Build Automation Tool Plugins

Machine Learning

Example

Security / Authentiﬁcation

Utilities

Languages / Compiler / Interpretation

File Handler

Annotation Processing

Web Applications

IDE Modules

Stream Processing

Benchmarks

Rule Engines

Database

Native Access Tools

Contexts and Dependency Injection

Bitcoin

Actor Frameworks

Logging / Monitoring

Cryptography

Application Layer Protocol Integration

Barcode Handler

Code analyses

Marshalling / Unmarshalling

Distributed Computing

Build

Operations management

File System

Crawler

Natural Language Processing

Classpath Tools

Chart Libraries

Reﬂection Libraries

Social Network Clients

DNS Libraries

Aspect Oriented

Messaging / Integration

Bytecode Libraries

Exception Handling

Managing / Monitoring

Containerization

Money Libraries

Distributed Coordination

OSGi

Conﬁguration Libraries

Distributed Communication

Search Engines

Caching

Job Scheduling

Mailing

Distributed Tracing

Application-/ Web-Server

390 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 943 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 881 0 0 0 0 0 21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 1 0 0 0 126 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

0 0 0 0 0 0 0 134 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0

3 0 1 0 0 0 0 1 252 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 0 0 0 49 0 0 0 0 0 801 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 1 0 0 0 0 0 1 0 0 148 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 137 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 2 0 0 0 0 0 0 0 0 0 0 223 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 49 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 174 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 203 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 63 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 0 0 0 0 0 0

2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 32 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39 0 0 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15
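Per-class precision and recall can be read off a confusion matrix laid out as in Table 5, with predicted classes in rows and actual classes in columns. The sketch below uses only the upper-left 3 x 3 corner of the matrix (the first three rows and columns), so its results ignore all other classes and are not per-class results of the study. The function name `per_class_scores` is hypothetical.

```python
def per_class_scores(matrix):
    """Return (precision, recall) per class for rows = predicted, columns = actual."""
    size = len(matrix)
    scores = []
    for c in range(size):
        true_positives = matrix[c][c]
        predicted_total = sum(matrix[c])                       # row sum: TP + FP
        actual_total = sum(matrix[r][c] for r in range(size))  # column sum: TP + FN
        precision = true_positives / predicted_total if predicted_total else 0.0
        recall = true_positives / actual_total if actual_total else 0.0
        scores.append((precision, recall))
    return scores


# Upper-left 3 x 3 corner of the matrix in Table 5 (all other classes ignored).
corner = [
    [390, 0, 1],
    [0, 51, 0],
    [0, 0, 943],
]
scores = per_class_scores(corner)
for precision, recall in scores:
    print(round(precision, 4), round(recall, 4))
```

Off-diagonal entries in a row lower the precision of the predicted class, while off-diagonal entries in a column lower the recall of the actual class.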
