Integrated Data Repository System: Fusion, Learning and Sharing

Jeferson Lopes

2 a

, Giancarlo Lucca

1 b

, Rafael Huszcza

2 c

, Amanda Mendes

2 d

Eduardo N. Borges

2 e

, Pablo D. B. Guilherme

3 f

and Leandro A. Pereira

4 g

Mestrado em Engenharia Eletr

onica e Computac¸

ao, Universidade Cat

olica de Pelotas, Pelotas, Brazil

Centro de Ci

encias Computacionais (C3), Universidade Federal do Rio Grande (FURG), Rio Grande, Brazil

Centro de Educac¸

ao, Humanidades e Ci

encias Biol

ogicas, Universidade Estadual do Paran

a (UNESPAR), Paranagu

Brazil

Eixo Tecnol

ogico de Meio Ambiente, Instituto Federal do Paran

a (IFPR), Paranagu

a, Brazil

Keywords:

Data Repository, Machine Learning, Deep Learning, Geographic Information System.

Abstract:

Currently, an enormous volume of data is being generated from diverse sources, including sensors and social

media. Effectively managing this unprecedented scale of data and deriving meaningful insights from these

extensive datasets present a signiﬁcant challenge for computer scientists. In this context, this paper outlines

the development and documentation of a project dedicated to actively contributing to these critical data-driven

initiatives. The described system integrates the features of a scientiﬁc data repository with a suite of data

science methods, machine learning tools, and resources for geographic data visualization. By consolidating

these functionalities on a single platform, users can streamline their workﬂow and extract insights from data

more efﬁciently. This integrated approach facilitates seamless transitions from data storage to model training

and analysis, fostering collaboration and facilitating knowledge sharing among researchers and practitioners.

In this work, we highlight the system’s key features, focusing on the datasets repository and the machine

learning module as central components of our platform.

1 INTRODUCTION

It is widely recognized that data has been generated

at increasingly high rates, especially due to advances

in mobile devices and digital sensors, which have

greatly facilitated data collection. Hence, large-scale

datasets have been curated and are widely employed

across various domains (Yaqoob et al., 2016).

Managing this unprecedented volume of data in

Big Data era poses a signiﬁcant challenge for com-

puter scientists (Al Aghbari., 2015). Hence, machine

learning algorithms (Tan et al., 2005) have risen as

indispensable tools for uncovering intricate patterns

in this vast and complex datasets, providing valuable

assistance to professionals in diverse data-intensive

https://orcid.org/0009-0004-9113-3921

https://orcid.org/0000-0002-3776-0260

https://orcid.org/0009-0000-1949-9284

https://orcid.org/0000-0002-5335-166X

https://orcid.org/0000-0003-1595-7676

https://orcid.org/0000-0001-7471-6907

https://orcid.org/0000-0001-6055-8063

ﬁelds, including medicine, biology and beyond.

However, to effectively harness the wealth of data

through machine learning is a challenge for non-

computer science end-users. Hence, platforms must

transcend mere data repositories and seamlessly inte-

grate machine learning tools for extracting meaning-

ful insights. Recognizing the expense of organizing

databases, a focus on cost-effective solutions involv-

ing database management and artiﬁcial intelligence is

crucial for technological advancement.

The paramount importance of services in this do-

main is evident in the growth of platforms like Kag-

gle (Kaggle, 2023) and Hugging Face (Hugging Face,

2023). These platforms succeed in addressing chal-

lenges with a versatile approach that spans multiple

domains, assisting in the integration of information

among stakeholders.

Nevertheless, these feature-rich tools can be in-

timidating for users who are not familiar with the ﬁeld

of Computer Science, including scientists from other

areas, since they tend to require in-depth technical

knowledge for operation. Moreover, another chal-

Lopes, J., Lucca, G., Huszcza, R., Mendes, A., Borges, E., Guilherme, P. and Pereira, L.

Integrated Data Repository System: Fusion, Learning and Sharing.

DOI: 10.5220/0012733700003690

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 26th International Conference on Enterprise Information Systems (ICEIS 2024) - Volume 1, pages 409-416

ISBN: 978-989-758-692-7; ISSN: 2184-4992

409

lenge faced by many is the need for local installa-

tion of these systems, which may involve complicated

conﬁgurations and speciﬁc hardware requirements.

Therefore, an efﬁcient dataset management plat-

form that integrates data science tools, including ar-

tiﬁcial intelligence algorithms, yet remains simple

and intuitive for all users, is crucial. The developed

system, namely Data Symbion Environmental Intelli-

gence

, aims to encompass all of these aspects. It is

a centralized repository where users, both individuals

and legal entities, can share their data on the platform.

This enables the integration and combination of infor-

mation from different sources, increasing the chances

of extracting useful and valuable insights.

Besides being a dataset repository with integrated

machine learning tools that anyone can use, regardless

of technical knowledge, our platform is web based,

which enables users to access it from anywhere, at any

time, with internet connection. The machine learning

tools are tailored for creating and utilizing predictive

models for both tabular and visual data.

Moreover, our system can be accessed directly

through the API, which facilitates seamless integra-

tion with data collection systems, streamlining au-

tomation processes. The platform also includes a tool

for geographic visualization tailored to datasets con-

taining latitude and longitude information, facilitating

visual comprehension through interactive maps.

The general objective of this work is to present the

development and the structure of the proposed sys-

tem. The following list outlines the set of speciﬁc ob-

jectives of the project:

• Offer an efﬁcient and accessible data repository;

• Integrate data from multiple sources;

• Allow users to train and share supervised ma-

chine learning models, including, but not limited

to, deep learning models for image classiﬁcation;

• Provide tools for operations on geographic data;

• Allow software and hardware developers to im-

plement graphical interfaces and/or automate pro-

cesses through the consumption of REST API;

• Provide a user-friendly and welcoming interface

accessible to users from diverse domains.

The study is structured as follows: Section 2 pro-

vides a review of existing tools for data repositories

and machine learning. In Section 3, we begin by com-

paring these existing tools to our proposed system.

We then outline our system’s modules and describe

its main features. Finally, Section 4 summarizes the

contributions of our system and possible limitations.

https://datasymbion.com/

2 RELATED WORK

Effectively organizing and utilizing the extensive col-

lected data in a systematic manner has a core impor-

tance in various ﬁelds. Tools focused on database

management and artiﬁcial intelligence (AI) have

shown tremendous potential in recent years (Russell

and Norvig, 2009). This section discusses existing

software that serve as data repositories, and platforms

that provide machine learning tools.

Available solutions for data management such as

DSpace

, Dataverse

and ArcGIS

, are commited to

the FAIR principles (Findable, Accessible, Interoper-

able, and Reusable). This is crucial to ensure relia-

bility, quality and standardization in the organization

and sharing of data (Rocha et al., 2021).

Although repositories play an important role by

providing a centralized location for storing and shar-

ing data, effective data management requires a more

comprehensive approach (Say

ao and Sales, 2022).

In the presented context, several platforms offer AI

tools, which allow the extraction of valuable in-

sights from collected data, thus supporting decision-

making (Hachicha Belghith et al., 2020).

Artiﬁcial Intelligence (AI) is a ﬁeld that focuses

on algorithms that bring computers closer to human

perception, performing automated tasks. Machine

learning (ML) algorithms are AI methods capable of

learning and improving from experience (Tan et al.,

2005). By training ML models with sufﬁcient amount

of information, we can leverage them to recognize

patterns in the training data.

Two well recognized and open-source solutions

are Waikato Environment for Knowledge Analysis

(WEKA) (Hall et al., 2009)

and RapidMiner

(Hof-

mann and Klinkenberg, 2016), which provide ma-

chine learning tools for data mining used to extract

valuable insights from datasets. Also, many tools

designed for machine learning excel in addressing

challenges with a more versatile approach that spans

multiple domains. Some emerging tools in this con-

text include Kaggle

(Kaggle, 2023) and Hugging

Face

(Hugging Face, 2023). In what follows we dis-

cuss some of them:

• DSpace – DSpace is a web application, that al-

lows users to publish documents and data, serving

as a long-term digital archives system. DSpace

https://dspace.lyrasis.org

https://dataverse.org

https://www.arcgis.com

https://www.cs.waikato.ac.nz/ml/weka/

https://altair.com/altair-rapidminer

https://www.kaggle.com/

https://huggingface.co/

ICEIS 2024 - 26th International Conference on Enterprise Information Systems

410

supports access to all types of digital content in-

cluding text, images, moving images, mpegs and

data sets. Its features make this software a good

choice for academic, non-proﬁt, and commercial

organizations building open digital repositories.

However, despite being free and open source,

DSpace requires installation.

• Dataverse – Another data repository platform is

the Dataverse Project, which is an open source

web application to share, preserve, cite, explore,

and analyze research data. This tool facilitates

data sharing, empowering users to replicate oth-

ers’ work more easily.

• ArcGIS – One important application of data

repositories are Geographic Information System

(GIS), which is a software designed to collect,

manage, analyze, and visualize geographic data.

A GIS is particularly valuable for integrating

and organizing spatial information, covering ar-

eas such as marine ecosystems, coastal land use,

and economic activities (Randazzo et al., 2021).

One important example of GIS is the ArcGIS plat-

form. It is a software that allows secure map-

ping and spatial analysis, empowering users to

unlock geospatial insights. This software is built

on scalable and resilient technology, allowing ef-

ﬁcient data collection, management, and analysis,

thus, facilitating decision-making by easily shar-

ing maps and apps.

• WEKA – An open source software that provides a

collection of machine learning algorithms for data

mining tasks. It encompasses tools for data prepa-

ration, classiﬁcation, regression, clustering, asso-

ciation rules mining, and visualization. Despite

its wide range of applications, WEKA not only

requires installation of the software itself but also

needs a compatible version o Java.

• RapidMiner – RapidMiner data science plat-

form is designed for diverse teams to collabora-

tively generate and share data-driven insights. It

caters to various skill sets, from data scientists

to business analysts, offering a uniﬁed environ-

ment. Users can build data and machine learning

pipelines with code-free to code-friendly experi-

ences, increasing trust with interactive decision

trees and model simulators.

• Kaggle – Kaggle empowers users to access hun-

dreds of pre-trained machine learning models

for deployment and facilitates exploration, anal-

ysis, and sharing of high-quality datasets (Kag-

gle, 2023). This platform offers a wealth of re-

sources for machine learning enthusiasts, includ-

ing 296K high-quality public datasets, 2,200 pre-

trained ML models, and a large repository of

community-published models, data, and code.

• Hugging Face – Hugging Face extends its ser-

vices beyond providing a platform for model and

dataset sharing; it equips users with tools for

building, training, and deploying machine learn-

ing models based on open-source (OS) code and

technologies (Hugging Face, 2023). The plat-

form offers a wide range of libraries and frame-

works, including Transformers, Tokenizers, and

Datasets, which are designed to simplify the de-

velopment process and make it more accessible.

3 THE PROPOSED SYSTEM

In real-world applications, it is essential to offer plat-

forms equipped with a comprehensive set of tools fea-

turing convenient and accessible functions. Some key

features include:

• Data repository: enable users to efﬁciently,

safely, and securely store and manage data.

• Machine learning: seamlessly integrate data

repositories with AI capabilities to extract knowl-

edge from data.

• Interactive maps: offering features like inter-

active maps for data visualization is essential in

geographic systems,.

• Open Source: being open source is particularly

valuable as it promotes collaboration, innovation,

and accessibility in the tech industry.

• Online access: online systems that require no

installation or complex settings make technology

more accessible.

Table 1 presents a comparison of the developed

system with the other tools discussed in the previous

section considering the aforementioned criteria.

All the presented tools offer valuable services to

users, whether for storing and managing data or em-

ploying artiﬁcial intelligence for data mining. Nev-

ertheless, these systems, whether tailored speciﬁcally

for the environmental ﬁeld or designed for more gen-

eral applications, may be considered overly complex

and less intuitive, particularly for users lacking famil-

iarity with the computer science domain.

Our proposed system integrates the functionali-

ties of a scientiﬁc data repository with those of a ma-

chine learning tool, enabling users to store and share

datasets and to employ them for training and using

machine learning models. Data Symbion EI incorpo-

rates a GIS module, providing access to interactive

maps.

Integrated Data Repository System: Fusion, Learning and Sharing

411

Table 1: Comparison of the discussed tools with the proposal.

Tool Data repository Machine Learning Interative maps Open-Source Online acess

DSpace X X X

Dataverse X X X

ArcGIS X X X

WEKA X X

Rapid Miner X X

Kaggle X X X X

Hugging Face X X X X

Proposed System X X X X X

To have a more complete understanding of the

project, Figure 1 provides an overview of all screens

and features of the proposed platform. It is mapped

among screens (green) and system features (yellow).

To register in the system, users are required to

provide personal information, including email, name,

personal identiﬁcation number, and password. Once

registration is complete, users can access the system

and will be redirected to the Dashboard page.

This Dashboard page presents crucial information

pertaining to the system, including details such as the

number of institutions with which the user is afﬁliated

and the number of registered Data Sources. Also, the

system incorporates a guide button to help users easily

navigate and access the platform’s features.

3.1 Data Repository

The system introduces a comprehensive data repos-

itory service, empowering users to securely upload

their datasets and customize sharing permissions as

desired. This platform offers a structured approach

to data storage, ensuring organization and accessibil-

ity. These datasets can then be seamlessly utilized

across various tools within the system, including data

integration, geographic information visualization, and

machine learning model training, enhancing the over-

all functionality and utility of the platform.

Users can securely upload and manage diverse

datasets, including structured data such as CSV ﬁles,

as well as unstructured data like images for deep

learning training. Additionally, CRUD (Create, Read,

Update, Delete) operations can be adeptly executed

on these repositories, thereby ensuring the efﬁcacy of

their management and maintenance.

The system offers two types of data storage struc-

tures, namely Data Source and Image Source, allow-

ing users to categorize and structure their information

systematically. The Data Source is designed for tab-

ular data, such as that found in a spreadsheet. Users

can create a Data Source instance, deﬁne its parame-

ters such as name, description, access type and insti-

tution, and upload a CSV ﬁle.

The Image Source is a structure analogous to the

Data Source, but instead of being associated with a

CSV ﬁle, it encapsulates a set of images. For the clas-

siﬁcation task, the images are organized into folders,

each associated with one of the data classes.

To enhance collaboration and data sharing, our

platform incorporates customizable sharing permis-

sions. Users can effortlessly adjust access levels, de-

scribed in Section 3.3, allowing them to share data

with others institutions.

3.2 Data Integration

In our system, it is possible to integrate two or more

Data Sources to produce a new instance, resultant

from their combination. This is the so-called Data

Integration Module. This module allows users to per-

form two types of integration: by rows or by columns.

Integrating two or more Data Sources by row in-

volves concatenating them vertically, stacking one on

top of the other. Users can choose which columns

to include in the integrated dataset and automatically

add a new column to indicate the origin of each sam-

ple. Conversely, integrating by column enables users

to merge two or more Data Sources horizontally. This

is especially valuable when the Data Sources describe

different attributes of the same samples.

After the Data Sources are selected, it is necessary

to deﬁne the type of integration and if it’s necessary

to remove/maintain duplicated samples. The meta-

data of the new Data Source instance such as name,

description and access is also required.

3.3 Data Protection

The system includes several visibility proﬁles, which

enable users to control the accessibility of their Data

Sources, Image Sources and trained models. The

three available privacy modes are:

1. Private: Only its responsible can access the

data/model. So, even when it is linked to an in-

stitution, the data/model will not be shared with

anyone until it is published.

ICEIS 2024 - 26th International Conference on Enterprise Information Systems

412

Figure 1: General view of the Project system.

2. Public: The data/model will be shared with the

users linked to the institution with which it has an

active connection. All the other users that are not

linked to this institution will not have access.

3. Open: All the users, including the ones that are

not registered on the system will have access to

the data/model.

It is important to note that if a user does not explic-

itly choose a visibility setting, it is set to the default

mode, which is private.

3.4 Geographic Information

Visualization

Within our platform, users are empowered to ex-

plore and analyze geographical data through an in-

teractive maps section, enhancing their capabilities

in biological research and decision-making processes.

This module allows users to visualize speciﬁc Data

Sources of interest by selecting them. It can contain a

wide range of geographical data.

To ensure accurate representation, users are re-

quired to specify the columns that correspond to lati-

tude and longitude coordinates and select a label col-

umn that identiﬁes the primary attribute associated

with each data point. This functionality allows for

the differentiation of individual points, facilitating ef-

fective data interpretation and analysis.

In Figure 2, the Geographic Visualization Module

is divided into two components. The left screen dis-

plays the settings for map creation, showing the selec-

tion of the data source and columns related to latitude,

longitude, and the attribute to be visualized. Users

can also specify the initial number of samples to be

loaded onto the map. The second screen showcases

the visualization of data points and their geographical

locations, which can be grouped for improved visual-

ization. Furthermore, a button is available for loading

additional data onto the map.

3.5 Machine Learning Tools

A model is a computational representation created by

a machine learning or deep learning algorithm during

the training process. This representation is learned

from data and, at the end of training, can be used to

make predictions on new input data.

Our system provides a powerful module to al-

low the persistence and sharing of models trained by

users. Thus, shortly after training, in both the Ma-

chine Learning and Deep Learning modules, it is pos-

sible to save the generated model in the system itself,

eliminating the need for the user to save it locally.

These persistence and sharing capabilities offer

users greater ﬂexibility, as model training occurs in

the background and results—including the model,

its parameters, and training status—are automatically

saved in this module. This eliminates the need for

users to remain on the training page until completion.

In addition to facilitating model sharing among

users, this module enhances security, organization,

and convenience. It provides a list of all trained mod-

els and includes a ﬁlter system to simplify searches.

Integrated Data Repository System: Fusion, Learning and Sharing

413

Figure 2: The Geographic Visualization Module.

This tool offers the user the possibility to evaluate

the trained models and use them for inference on new

data. In order to use the model, the user can manually

input the new sample to obtain the model’s predic-

tion. This feature allows users to employ their trained

models in real-world scenarios.

The Machine Learning Module empowers users

to extract valuable insights and knowledge from their

data. By utilizing advanced analytical techniques, it is

possible to uncover patterns, trends, and relationships

within datasets. The module allows for experimen-

tation with different approaches, using various algo-

rithms and testing different parameter settings. This

functionality enables students and professionals from

diverse domains to gain a deeper understanding of

their data, empowering them to make informed de-

cisions and drive innovative research.

The system offers a range of machine learning ca-

pabilities, including classical algorithms for regres-

sion and classiﬁcation and advanced deep learning al-

gorithms tailored for image classiﬁcation.

3.6 Classical Machine Learning

Algorithms

Classical Machine Learning algorithms are tailored

to learn from structured data, such as data in spread-

sheet form. Therefore, these methods can be utilized

to learn from data stored in the Data Source structure

provided by our system.

We initially consider explainable methods, to bet-

ter describe the generated models. In this sense,

for the classiﬁcation problem (Duda et al., 2000),

CART (Breiman et al., 1984) and Random Forrest

(Ho, 1995) algorithms were implemented. For the re-

gression problem (Hastie et al., 2009), we consider

the Support Vector Machines (SVM) (Cortes and Vap-

nik, 1995) and Linear Regression (Weisberg, 2005).

The classiﬁcation task involves categorizing input

data into predeﬁned classes or categories (Sen et al.,

2020). The Decision Tree (Breiman et al., 1984) al-

gorithm addresses this task by constructing a series of

rules organized in a tree-like structure. Each leaf node

in the tree corresponds to a class label. The algorithm

learns these rules from the training data, determining

the best splits to classify the data accurately.

Random Forest (Ho, 1995) is a widely used tech-

nique for solving classiﬁcation tasks. It belongs to

the ensemble learning methods and works by creating

multiple decision trees. The ﬁnal class prediction is

determined through a voting mechanism, where each

tree’s prediction contributes to the ﬁnal outcome.

Regression tasks aim to predict a continuous value

based on a set of attributes (Sen et al., 2020). For ex-

ample, predicting the weight of a person based on a

set of body measurements is a regression task. One

of the algorithms available to address this challenge

is the Support Vector Machine (SVM) (Cortes and

Vapnik, 1995). SVM aims to ﬁnd a hyperplane that

best ﬁts the data points. The hyperplane is chosen

such that it passes through as many points as possi-

ble, while still staying within the margin of tolerance.

Another approach to ﬁnding the line that best rep-

resents the data is by using the linear regression algo-

rithm (Weisberg, 2005), also available in our system.

This is a straightforward technique that aims to ﬁnd

the line that, on average, is closest to the data points.

To train a machine learning model, users need to

follow these steps:

1. Choose a Data Source;

2. Select the type of problem to be solved (classiﬁ-

cation or regression);

3. Deﬁne the target variable to be predicted;

4. Set a random seed for reproducibility;

5. Select the attributes (i.e., columns) from the Data

Source to be considered;

6. Specify the proportion of the Data Source to be

employed for model training and evaluation.

3.7 Deep Learning Algorithms

Another important feature in our system is the Deep

Learning module. Deep learning, a subset of AI,

ICEIS 2024 - 26th International Conference on Enterprise Information Systems

414

Figure 3: Training a Deep Learning model.

focuses on developing models capable of complex

learning and pattern recognition tasks. This approach

uses deep neural network architectures, composed of

multiple layers of interconnected units, to automat-

ically learn hierarchical and abstract representations

of data (Goodfellow et al., 2016).

In addition to being widely applied in areas such

as natural language processing, recommendation sys-

tems, and speech recognition, deep learning models

also excel in computer vision. This ﬁeld of study fo-

cuses on developing algorithms capable of extracting

knowledge from images and videos to perform tasks

such as image classiﬁcation, object detection, and se-

mantic segmentation (Goodfellow et al., 2016).

The ﬁrst functionality of this new module is im-

age classiﬁcation, which consists in categorizing im-

ages into predeﬁned groups. Two deep neural net-

work architectures designed for this task have been

made available: MobileNetV3 (Howard et al., 2019)

and ResNet50 (He et al., 2016). Users can employ

these algorithms to train deep learning models on im-

ages stored within a Image Source structure.

MobileNetV3, initially developed for mobile and

edge devices such as smartphones and IoT, strikes a

balance between speed and accuracy. This makes it

ideal for applications where computational resources

are limited (Howard et al., 2019).

In contrast, ResNet50 is a more robust network

architecture. Its name, ”ResNet,” comes from the

concept of residual learning, which is central to its

design. This approach involves creating shortcuts,

known as skip connections, throughout the network.

These connections make it easier for information to

ﬂow through the network without being lost or dis-

torted, ultimately improving the network’s ability to

learn and perform complex tasks (He et al., 2016).

In addition to choosing one of the two networks,

the user can also add layers to customize their archi-

tecture. After deﬁning the training parameters, as il-

lustrated in Figure 3, the model is trained and made

available to the user for inference on new images.

4 CONCLUSION

Effectively managing the vast volume of data gen-

erated today is a critical and evolving ﬁeld in com-

puter science. It is essential to provide end-users from

various ﬁelds with data repositories integrated with

machine learning tools to extract meaningful insights

from their data. However, existing tools are often too

complex and lack accessibility and intuitiveness for

users unfamiliar with computer science.

Our system provides a scientiﬁc data repository,

enabling users to securely store, share, and utilize

datasets for seamlessly training and deploying ma-

chine learning models. Two storage structures, Data

Source and Image Source, enable organization and

sharing of tabular and image data.

The Machine Learning Module offers various al-

gorithms and parameters for users to experiment and

better understand their data for decision-making. The

Data Integration Module merges datasets, and the GIS

module provides an interactive maps section that fa-

cilitates geographical data analysis.

While our proposed system offers a comprehen-

sive array of features to tackle various challenges in

data management and analysis, it is imperative to rec-

ognize its potential limitations. One notable con-

straint may lie in the scalability of the system, par-

ticularly when confronted with exceptionally large

datasets. Limited hardware capabilities and software

constraints could restrict the size and complexity of

Integrated Data Repository System: Fusion, Learning and Sharing

415

models that can be effectively trained, potentially im-

peding the exploration of highly dense model archi-

tectures or the utilization of extensive datasets.

Future works could focus on enhancing the scala-

bility and efﬁciency of ML algorithms for processing

large-scale datasets stored in scientiﬁc repositories.

Moreover, leveraging emerging technologies such as

federated learning and edge computing holds promise

for enabling distributed and real-time analysis could

be an interesting feature for the system.

ACKNOWLEDGMENTS

The authors would like to thank Fundac¸

ao de Am-

paro

a Pesquisa do Estado do Rio Grande do Sul -

FAPERGS, Conselho Nacional de Desenvolvimento

Cient

ıﬁco e Tecnol

ogico - CNPq (3305805/2021-

5, 23/2551-0000126-8), Fundac¸

ao Grupo Botic

ario

(camp 001 2021) and Fundac¸

ao Arauc

aria de Apoio

ao Desenvolvimento Cient

ıﬁco e Tecnol

ogico do Es-

tado do Paran

a (FA).

REFERENCES

Al Aghbari., Z. (2015). Mining big data - challenges

and opportunities. In Proceedings of the 17th Inter-

national Conference on Enterprise Information Sys-

tems - Volume 2: ICEIS, pages 379–384. INSTICC,

SciTePress.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984).

Classiﬁcation and Regression Trees. Chapman and

Hall (Wadsworth and Inc.).

Cortes, C. and Vapnik, V. (1995). Support-vector networks.

Machine learning, 20(3):273–297.

Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern

Classiﬁcation (2Nd Edition). Wiley-Interscience.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep

learning. MIT press.

Hachicha Belghith, E., Rioult, F., and Bouzidi, M. (2020).

Acoustic diversity classiﬁcation using machine learn-

ing techniques: Towards automated marine big data

analysis. International journal on artiﬁcial intelli-

gence tools, 29(3n04):2060011.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann,

P., and Witten, I. H. (2009). The weka data min-

ing software: An update. SIGKDD Explor. Newsl.,

11(1):10–18.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Ele-

ments of Statistical Learning: Data Mining, Inference,

and Prediction. Springer.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-

ual learning for image recognition. In Proceedings of

the IEEE conference on computer vision and pattern

recognition, pages 770–778.

Ho, T. K. (1995). Random decision forests. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence,

17(8):832–844.

Hofmann, M. and Klinkenberg, R. (2016). RapidMiner:

Data mining use cases and business analytics appli-

cations. CRC Press.

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B.,

Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V.,

et al. (2019). Searching for mobilenetv3. In Pro-

ceedings of the IEEE/CVF international conference

on computer vision, pages 1314–1324.

Hugging Face (2023). Hugging face - the ai community

building the future.

Kaggle (2023). Kaggle: Your machine learning and data

science community.

Randazzo, G., Italiano, F., Micallef, A., Tomasello, A., Cas-

setti, F. P., Zammit, A., D’Amico, S., Saliba, O., Cas-

cio, M., Cavallaro, F., Crupi, A., Fontana, M., Gre-

gorio, F., Lanza, S., Colica, E., and Muzirafuti, A.

(2021). Webgis implementation for dynamic mapping

and visualization of coastal geospatial data: A case

study of bess project. Applied sciences, 11(17):8233.

Rocha, R. P. d., Gabriel Junior, R. F., Vanz, S. A. d. S.,

Borges, E. N., Azambuja, L. A. B., Caregnato, S. E.,

Pav

ao, C. M. G., Passos, P. C. S. J., and Felicis-

simo, C. H. (2021). An

alise dos sistemas dspace e

dataverse para reposit

orios de dados de pesquisa com

acesso aberto. Revista Brasileira de Biblioteconomia

e Documentac¸

ao. S

ao Paulo. Vol. 17 (2021), p. 1-25.

Russell, S. J. and Norvig, P. (2009). Artiﬁcial Intelligence:

a modern approach. Pearson, 3 edition.

Say

ao, L. F. and Sales, L. F. (2022). Plataformas de

gest

ao de dados de pesquisa: expandindo o conceito

de reposit

orios de dados. Palabra clave (La Plata),

12(1):171–171.

Sen, P. C., Hajra, M., and Ghosh, M. (2020). Super-

vised classiﬁcation algorithms in machine learning:

A survey and review. In Emerging Technology in

Modelling and Graphics: Proceedings of IEM Graph

2018, pages 99–111. Springer.

Tan, P.-N., Steinbach, M., and Kumar, V. (2005). Introduc-

tion to Data Mining, (First Edition). Addison-Wesley

Longman Publishing Co., Inc., Boston, MA, USA.

Weisberg, S. (2005). Applied linear regression, volume 528.

John Wiley & Sons.

Yaqoob, I., Hashem, I. A. T., Gani, A., Mokhtar, S., Ahmed,

E., Anuar, N. B., and Vasilakos, A. V. (2016). Big

data: From beginning to future. International Journal

of Information Management, 36(6):1231–1247.

ICEIS 2024 - 26th International Conference on Enterprise Information Systems

416