A Well-founded Ontology to Support the Preparation of Training and Test Datasets

Lucimar de A. Lial Moura (1, a), Marcus Albert A. da Silva (2, b), Kelli de Faria Cordeiro (1, 3, c) and Maria Cláudia Cavalcanti (1, 2, d)

(1) Departamento de Sistemas e Computação, Instituto Militar de Engenharia (IME), Rio de Janeiro, RJ, Brazil

(2) Departamento de Engenharia de Defesa, Instituto Militar de Engenharia (IME), Rio de Janeiro, RJ, Brazil

(3) Centro de Análise de Sistemas Navais (CASNAV), Rio de Janeiro, RJ, Brazil

Keywords: Data Preprocessing, Training and Test Datasets, Ontology, UFO, Provenance.

Abstract: In the knowledge discovery process, a set of activities guides the data preprocessing phase; one of them is the transformation of raw data into training and test data. This complex and multidisciplinary phase involves concepts and knowledge that are structured in distinct and particular ways in the literature and in specialized tools, demanding data scientists with suitable expertise. In this work, we present PPO-O, a reference ontology of data preprocessing operators, to identify and represent the semantics of the concepts related to the data preprocessing phase. Moreover, the ontology highlights the data preprocessing operators used in the preparation of training and test datasets. Based on PPO-O, the Assistant-PP tool was developed, which is capable of capturing retrospective data provenance during the execution of data preprocessing operators, facilitating the reproducibility and explainability of the datasets created. This approach might be helpful to non-expert users in data preprocessing.

1 INTRODUCTION

The research and application of technologies related to Artificial Intelligence (AI) have been driving the modern world. According to Gartner Group research, AI is one of the five emerging technologies required in 2020. With large amounts of data available, organizations in almost all sectors of society are focused on exploiting it for the purpose of discovering and gaining knowledge. One of the reasons for the growth of AI is the development of powerful algorithms capable of connecting and processing datasets, allowing much broader and deeper analyses.

However, AI came with a diversity of technical terms such as Data Mining (DM), Big Data (BD), Data Science (DS) and Machine Learning (ML). This variety of terms might cause difficulties in understanding and sharing knowledge. In this context, (Fayyad et al., 1996) proposed the Knowledge Discovery and Data Mining Process (KDD), inspired by Knowledge Discovery in Databases, and (Chapman et al., 2000) proposed the Cross-Industry Standard Process for Data Mining (CRISP-DM). Both KDD and CRISP-DM were conceived with the goal of structuring and guiding the discovery of knowledge based on data.

(a) https://orcid.org/0000-0003-1575-860X
(b) https://orcid.org/0000-0002-1259-1763
(c) https://orcid.org/0000-0001-5161-8810
(d) https://orcid.org/0000-0003-4965-9941
Gartner source: https://www.gartner.com/smarterwithgartner/5-trends-drive-the-gartner-hype-cycle-for-emerging-technologies-2020/

The KDD process describes the data preprocessing phase as a data-centric step, which aims to improve data quality for later consumption by ML algorithms. However, as highlighted in the report (CrowdFlower, 2016), data scientists spend over 80 percent of their time preparing the data. Just as the area of AI presents many concepts from different perspectives, the data preprocessing phase also deals with a diversity of terms discussed in distinct and particular ways in the literature (Han et al., 2011) (Faceli et al., 2015) (Goldschmidt et al., 2015) (García et al., 2015). For example, there are different terms to describe the correction operation that aims to balance the distribution of records for a classification task: correction of prevalence, data balancing or data sampling.

In this sense, when different terms have the same meaning, it is naturally difficult to understand and choose the data preprocessing operator that executes the transformation of a raw dataset into training and test datasets. Thus, the use of ontologies might be helpful to deal with this terminological problem in the data preprocessing phase.

Moura, L., da Silva, M., Cordeiro, K. and Cavalcanti, M.
A Well-founded Ontology to Support the Preparation of Training and Test Datasets.
DOI: 10.5220/0010460000990110
In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021) - Volume 2, pages 99-110
ISBN: 978-989-758-509-8

First of all, ontologies might structure, represent and store knowledge according to a conceptual model, allowing a consensual and uniform view of the data preprocessing scenario. An ontology is rich in semantic expressiveness, and removes or reduces ambiguities, facilitating human and machine understanding. Additionally, such a model might guide the execution of data preprocessing operators and support the learning of non-expert users. Furthermore, it may assist critical requirements in the KDD process, such as reproducibility and explainability (Souza et al., 2020), by capturing the transformation of raw data into training and test data.

Currently, ontologies have been adopted to support the KDD process in order to structure and represent the concepts related to the various entities in this domain (Vanschoren and Soldatova, 2010) (Panov et al., 2013) (Keet et al., 2015) (Esteves et al., 2015) (Publio et al., 2018) (Celebi et al., 2020) (Souza et al., 2020). However, these ontology approaches do not cover most of the details within the data preprocessing phase. Moreover, most of them do not use well-founded conceptual modeling to improve semantic expressiveness and minimize ambiguities.

This article presents the PreProcessing Operators Ontology (PPO-O), which was built based on the Unified Foundational Ontology (UFO) (Guizzardi, 2005). PPO-O is an ontology applied to the data preprocessing phase of the KDD process; it identifies and represents the semantics of the concepts related to this phase. Moreover, PPO-O highlights the preprocessing operators that transform the raw dataset into training and test datasets. It also enables the capture and retrieval of the executed operators through provenance queries. The ontology was developed following the Systematic Approach for Building Ontologies (SABiO) methodology (Falbo, 2014) and modeled using the OntoUML ontology language (Guizzardi, 2005).

This paper is organized as follows. Section 2 provides an overview of some ontologies in the KDD area that are related to this work. Section 3 discusses the main concepts used to develop PPO-O. Section 4 presents PPO-O in detail and, finally, Section 5 concludes and points out future work.

2 ONTOLOGIES TO SUPPORT THE KNOWLEDGE DISCOVERY PROCESS

The study of ontologies, as a way of expressing knowledge about a domain, has been widely adopted. As defined by (Gruber, 1995), ontologies are formal and explicit specifications of the concepts and relationships that can exist in a given domain. (Falbo et al., 2002), in turn, points out that ontologies are used to describe a uniform and unambiguous domain model of entities and their relationships. (Nigro, 2007) highlights that ontologies can be used in the DM process to represent the description of the process and also to describe its execution, i.e., provenance metadata on the transformation of a given dataset.

The Ontology for Data Mining Experiments (Exposé) (Vanschoren and Soldatova, 2010) was developed with the aim of sharing ML experiment metadata. Exposé highlights the various entities related to the specification of a Dataset for the Supervised Classification Task. The Ontology for Representing the Knowledge Discovery Process (OntoDM-KDD), prepared by (Panov et al., 2013), uses the CRISP-DM model to represent the main entities in the area of DM, in the context of KDD. OntoDM-KDD represents the taxonomy of entities that are essential for data preparation.

The Data Mining Optimization Ontology (DMOP), developed by (Keet et al., 2015), supports decision making and the meta-learning of the complete DM process. It is a unified conceptual structure and, among the represented concepts, it shows that the execution of the DM-Process occurs by means of a DM-Workflow. The DM-Process specializes DM-Operation, which is executed by a DM-Operator, an algorithm that executes transformations on DM-Data.

The MEX Vocabulary, presented by (Esteves et al., 2015), has as its main purpose to describe the terms used in ML experiments and to share provenance information, captured with the PROV-O provenance ontology (Lebo et al., 2013). However, the MEX Vocabulary does not capture information regarding the data preparation process.

ML-Schema, proposed by (Publio et al., 2018), is a simple shared schema that provides a set of classes, properties and restrictions that can be used to represent and interchange ML information. Regarding the preprocessing context, it includes the representation of Data (ML Schema::Data) and also its specializations, Dataset (ML Schema::Dataset) and Feature (ML Schema::Feature).

On the other hand, the Ontological Representation of Relational Databases (RDBS-O) (de Aguiar et al., 2018) is a well-founded reference ontology that represents the structure of relational database systems. Although it is not an ontology in the context of DM, it includes the representation of entities that are important to describe the preprocessing domain, such as Table, Line, Line Type and Column.

OpenPREDICT (Celebi et al., 2020) is a unified semantic model of several existing ontologies, among them W3C PROV (Groth and Moreau, 2013), for the prospective, retrospective and workflow evolution provenance of ML scientific workflows. The conceptual modeling was supported by the UFO foundational ontology and its ontological language OntoUML. It captures operations performed by workflow plan steps.

PROV-ML, developed by (Souza et al., 2020), is a data representation, also compatible with W3C PROV, for the retrospective provenance of workflows that supports the lifecycle of scientific ML. It represents a workflow as a composition of data transformations executed by an ML task.

Table 1 shows a comparative model that classifies these related works according to the following criteria:

• (C1). Operational Ontology: denotes whether the ontology was developed with the Web Ontology Language (OWL) (Horrocks et al., 2004), a World Wide Web Consortium (W3C) standard;

• (C2). Provenance Ontology: informs whether the ontology was developed using any of the W3C PROV document models;

• (C3). Foundational Ontology: indicates whether the ontology was developed based on the concepts proposed by UFO; and

• (C4). Details of the Preprocessing Phase: shows whether the ontology considers entities and relationships present in the preprocessing phase, according to the level of detail: Partial (P) or Total (T).

Table 1: Summary of Related Works.

Ontology         C1   C2   C3   C4
Exposé           X    -    -    P
OntoDM-KDD       X    -    -    P
DMOP             X    -    -    P
MEX Vocabulary   -    X    -    P
ML-Schema        X    -    -    P
RDBS-O           -    -    X    P
OpenPREDICT      -    -    X    P
PROV-ML          -    X    -    P

Given the above, it was identified that, among the related works, and as far as it was possible to investigate, RDBS-O and OpenPREDICT use the UFO approach but, unlike OpenPREDICT, RDBS-O was not built for the KDD context. On the other hand, Exposé, OntoDM-KDD, DMOP, MEX Vocabulary, ML-Schema and PROV-ML, developed in the context of KDD, aim to support all or most of the process, with an emphasis on the DM phase, while the preprocessing phase is given partial or no attention. Additionally, they were built without taking into account the precepts of a foundational ontology. In order to fill this gap, this article proposes PPO-O. To understand this proposal, in the next section we present a brief discussion of the preprocessing phase, highlighting its specificities, and of the peculiarities of UFO.

3 BACKGROUND

3.1 Data Preprocessing Phase

The preprocessing phase covers all the activities necessary to build, based on the raw data, the training and test datasets, that is, the data that will be fed into the modeling tool (Chapman et al., 2000). The data preparation techniques used in this phase aim to improve the quality of raw data by eliminating or minimizing various problems in the data. For example, the values of the attributes can be numeric or categorical; they may be clean or may contain outliers and incorrect, inconsistent, duplicate or missing values; attributes can be independent or related; and datasets can have few or many objects, which in turn can have a small or large number of attributes (Faceli et al., 2015).

There is no consensus in the literature (Han et al., 2011) (Faceli et al., 2015) (Goldschmidt et al., 2015) (García et al., 2015) about the classification and conceptualization of preprocessing activities.

On the other hand, they all agree on the meaning of an operation in the KDD domain: it is the execution of an operator, a program that implements an algorithm, which specifies a procedure addressed to a KDD activity or task (Keet et al., 2015). Operators are executed on data items made up of data examples and, in terms of granularity levels, a data item can be a dataset (derived from one or more tables), just one attribute (a column within a table), or just one instance (a row in a table).

An important concept that must be discussed here is the table concept, because most DM works use a single fixed-format table (Provost and Fawcett, 2016). In the RDBMS domain, a table is represented as a logical structure, that is, an abstraction of the way the data is physically stored, with explicit values in column positions, organized in table lines (Date, 2004). In the DM domain, in contrast, a table corresponds to the data itself, i.e., data about certain entities in a given domain, such as customer data, product data, purchase data, etc. Therefore, a dataset can be observed under the following aspects: i) intensional, when it refers to its schema, which in this context comprises columns, features or attributes; and ii) extensional, when it refers to facts, examples, instances or records (Elmasri and Navathe, 2011). As pointed out by (Goldschmidt et al., 2015), the KDD process assumes that the data is organized in a single two-dimensional tabular structure containing facts (organized in rows) and attributes (organized in columns) of the problem to be analyzed.

ML-Schema documentation: http://ml-schema.github.io/documentation/

Another important concept is the supervised learning task. It refers to learning from the labeled examples in the training dataset (Han et al., 2011), with the goal of finding a function, model or hypothesis so that the label of a new example can be predicted. If the label data type is categorical (Cotton, 1999), i.e., its domain is a finite set of unordered values (Faceli et al., 2015), this task is usually known as classification (Faceli et al., 2015).

ML tasks, such as the classification task, may need some transformations to be applied to some column values in the dataset. These transformations, also named operations, correspond to the application of an operator on data items. It can thus be understood that the set of chained executions of preprocessing operators results in a workflow execution.

On the other hand, as defined in (Celebi et al., 2020), a workflow is a collective of instructions, since its parts have the same functional role in the whole. The execution of a workflow is a description of the process, that is, a set of step-by-step instructions, where each instruction describes an action taken.

In fact, those workflow execution data correspond to provenance data for the generated training and test datasets. They contain structured and linked records of the data derivation paths, referring to the transformation activities those data went through (Souza et al., 2020). This kind of data provenance is known as retrospective provenance. It facilitates the reproducibility and explainability of the training and test datasets.
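To make the idea of linked derivation records concrete, the following minimal Python sketch shows how a derivation path could be recovered from retrospective provenance. The record structure, the operator names and the data identifiers are all hypothetical illustrations, not part of PPO-O or of any tool described here; the sketch also assumes that each data item is produced by at most one record and that the records contain no cycles.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """One executed transformation (hypothetical structure)."""
    operator: str      # name of the executed operator
    input_data: str    # identifier of the data item consumed
    output_data: str   # identifier of the data item produced


def derivation_path(records: list[ProvenanceRecord], target: str) -> list[str]:
    """Walk the linked records backwards from `target` to the raw data.

    Returns the operator names in execution order (oldest first).
    """
    produced_by = {r.output_data: r for r in records}
    path = []
    current = target
    while current in produced_by:
        record = produced_by[current]
        path.append(record.operator)
        current = record.input_data
    return list(reversed(path))


# Illustrative records: raw -> clean -> scaled -> train / test
records = [
    ProvenanceRecord("impute_missing", "raw", "clean"),
    ProvenanceRecord("normalize", "clean", "scaled"),
    ProvenanceRecord("partition_train", "scaled", "train"),
    ProvenanceRecord("partition_test", "scaled", "test"),
]
print(derivation_path(records, "train"))
```

Replaying the returned operators in order on the raw data is what makes a generated dataset reproducible, and reading them is what makes it explainable.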

3.2 Unified Foundational Ontology

In the last decades, ontological analysis has brought significant advances, providing a sound foundation for conceptual model development in order to reach better representations of computational artifacts, especially conceptual schemas (Guizzardi, 2012).

Ontological analysis is based on the use of foundational ontologies (also called top-level ontologies), which provide a set of principles and basic categories (Guarino, 1998). Foundational ontologies apply formal theories to represent aspects of reality and describe, as accurately as possible, real-world knowledge regardless of the domain, language, or state of affairs.

UFO, initially presented by (Guizzardi, 2005), is a descriptive ontology that represents universals (types) and particulars (substantial or individual), endurants and perdurants. Through continual updates, it has been incorporating ideas from other ontologies such as GFO (Herre et al., 2006) and DOLCE (Gangemi et al., 2002), as well as from the OntoClean methodology (Guarino and Welty, 2004). UFO has three main fragments: UFO-A (Ontology of Endurants), UFO-B (Ontology of Perdurants), and UFO-C (Ontology of Social and Intentional Entities).

Over the years, UFO has been applied to the development of core and domain ontologies in different areas (Guizzardi et al., 2015). For instance, it has been successfully used to provide conceptual clarification in complex domains such as Legal (Ghosh et al., 2017), Brazilian Higher Education (Silva and Belo, 2018), Information Security Incidents (Faria et al., 2019) and Critical Communications (Tesolin et al., 2020).

Figure 1 represents some UFO-A constructs and their relations used in the PPO-O conceptual model. This fragment of UFO deals with the structural aspects of conceptual modeling and refers to objects and entities from the real world. It also represents types (Universals) and individuals of these types (Individuals).

Figure 1: UFO-A fragment, based on (Guizzardi et al., 2018).

Furthermore, UFO-A categorizes types into Sortals, which carry an identity principle, and NonSortals, which aggregate properties common to different Sortals. Thus, Sortals describe real-world objects using concepts with a strong or rigid identity principle, such as Kinds and Subkinds, or anti-rigid concepts such as Roles, which classify rigid elements under transitory conditions. On the other hand, NonSortal constructs generalize different identity principles, based on common characteristics, such as Category, which abstracts two or more rigid elements.

Besides, Endurant constructs, such as Sortals, are existentially independent; on the other hand, Moment Types, also known as Tropes, are existentially dependent. Thus, a Quality Type is an intrinsic property of an Individual, and a Relator Type plays the role of connecting, relating, or mediating at least two individuals that share the same foundation (Guizzardi and Wagner, 2008). Hence, a Quality (Moment) such as “color” cannot exist without a Sortal like a car (Kind), and a Relator (Moment) such as a marriage cannot exist without two objects (Sortals) like a man (Subkind) and a woman (Subkind).

UFO-B deals with objects that persist based on temporal features, representing Events acting on Situations, Dispositions, Time Points, as well as the connections between Endurants and Perdurants (Guizzardi et al., 2013) (Almeida et al., 2019).

Figure 2: UFO-B fragment, based on (Guizzardi et al., 2013).

As shown in Figure 2, a Situation is a snapshot of reality obtained at a particular point in time, modified or created by an Event, which has mereological features and can be classified as atomic or complex.

An Atomic Event has no proper parts and depends on a unique Object. On the other hand, Complex Events are aggregations of at least two disjoint subEvents. Participation is an example of a subEvent that materializes the participation of an object in an Event. Furthermore, Events can be caused by other Events, directly or indirectly. All of these event types are described using axioms in (Guizzardi et al., 2013).

A Disposition is a trope existentially dependent on an object, representing a specific propensity, capacity or feature of an object that may or may not be manifested through an Atomic Event. For instance, the event of a heart pumping is the manifestation of the heart's capacity to pump (disposition); the event of a metal being attracted by a magnet is the manifestation of the magnet's disposition to attract metallic material (Guizzardi et al., 2016).

3.3 Methodologies for Building Ontologies

A large number of ontologies have been developed by different groups, under different approaches, and using different methods and techniques (Fernández-López et al., 1997). Accordingly, some relevant methodological approaches in ontology engineering, such as Methontology (Fernández-López et al., 1997), NeOn (Suárez-Figueroa et al., 2015) and SABiO (Falbo, 2014), have been proposed as best practices for building domain ontologies grounded in meta-ontologies or foundational ontologies. These practices enable ontological analysis to explain concepts and relations in the light of such top-level ontologies, which provides a sound basis for better representing reality through conceptual modeling.

Moreover, SABiO recommends UFO as the foundational ontology and distinguishes between reference and operational ontologies, providing activities that apply to the development of both types of domain ontology. A reference ontology is a special type of conceptual model because it provides a clear and precise description of the domain entities and might improve communication, learning, and problem-solving. On the other hand, an operational ontology is a machine-readable implementation of the reference ontology.

In this section, we presented some approaches that have been adopted as guidelines for the elaboration of an ontology. SABiO was adopted as the PPO-O ontology building approach for its clear distinction between reference and operational ontologies. Moreover, UFO was chosen as the foundational ontology for its conceptual coverage and the large number of case studies found in the literature.

4 PPO-O ONTOLOGY

This research presents the PPO-O ontology, a reference ontology for the domain of the preprocessing phase of the KDD process. The purpose of this ontology is to identify and represent the concepts related to the preparation of “curated” (Souza et al., 2020) raw data, that is, data significantly selected, more organized, and easier to analyze and understand. The idea is to facilitate the generation of training and test data to be consumed by ML algorithms.

The preparation of PPO-O was supported by the initial phases of the development process proposed by the SABiO approach, namely: Purpose Identification and Requirements Elicitation, leading to the definition of functional and non-functional requirements (FRs and NFRs, respectively); Ontology Capture and Formalization, giving rise to the conceptual modeling of the captured concepts; and Design, with the establishment of the technological architecture and NFRs for the implementation of the reference ontology.

The following Competency Questions (CQs) are related to the FRs:

• CQ1. What are the types of data preprocessing operator?

• CQ2. Which data structure granularities are needed in the context of KDD?

• CQ3. How can a dataset be described?

• CQ4. What are the data types of the dataset columns?

• CQ5. How can we characterize a labeled dataset?

• CQ6. What is the data type of the target column of the dataset specified for the classification task?

• CQ7. How can a data preprocessing assistant be characterized?

• CQ8. How can a data preprocessing operator execution be registered?

• CQ9. What is the chain of operators that were executed to generate the training and test datasets?

4.1 PPO-O Modeling

Ontologies are built to be reused or shared (Fernández-López et al., 1997). In this sense, the semantic models of PPO-O reuse concepts already formalized by the DMOP, ML-Schema and RDBS-O ontologies.

Figure 3 presents the taxonomy of the types of data preprocessing operators, categorized according to their role in the data preparation process. The idea is to establish a hierarchy in order to resolve ambiguities. Among the subtypes of a Data Preprocessing Operator, we distinguish an operator used to improve data quality as a Data Cleaning Preprocessing Operator. On the other hand, to obtain more accurate data, subtypes of Data Transformation Preprocessing Operator represent operators that are used for feature engineering. For example, the Data Reduction Preprocessing Operator subtype represents those operators used to obtain the most appropriate dimensionality for the dataset. In addition, for a dataset with an imbalance in the number of samples of the target attribute, a typical situation in the context of a supervised classification task, a Data Sampling Correction Preprocessing Operator is used. Finally, a Data Partition Preprocessing Operator is used for partitioning the dataset into training and test datasets.

Figure 3: UFO-A based model that represents the categories of data preprocessing operators.
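As a rough illustration of how such a hierarchy can resolve terminological ambiguity, the Python sketch below encodes the operator categories as classes and maps the three synonymous terms mentioned in the Introduction onto a single category. It is only a sketch under stated assumptions: the class layout follows the five subtypes listed in the answer to CQ1, and the `SYNONYMS` table and the `resolve` helper are hypothetical names introduced here, not part of PPO-O.

```python
class DataPreprocessingOperator:
    """Root of the operator taxonomy."""


class DataCleaningPreprocessingOperator(DataPreprocessingOperator):
    """Operators used to improve data quality."""


class DataTransformationPreprocessingOperator(DataPreprocessingOperator):
    """Operators used for feature engineering."""


class DataReductionPreprocessingOperator(DataPreprocessingOperator):
    """Operators used to obtain an appropriate dimensionality."""


class DataSamplingCorrectionPreprocessingOperator(DataPreprocessingOperator):
    """Operators used to correct imbalance in the target attribute."""


class DataPartitionPreprocessingOperator(DataPreprocessingOperator):
    """Operators used to split a dataset into training and test datasets."""


# Hypothetical lookup table: literature terms -> one category of the taxonomy.
SYNONYMS = {
    "correction of prevalence": DataSamplingCorrectionPreprocessingOperator,
    "data balancing": DataSamplingCorrectionPreprocessingOperator,
    "data sampling": DataSamplingCorrectionPreprocessingOperator,
}


def resolve(term):
    """Return the taxonomy category that a literature term refers to."""
    key = " ".join(term.lower().split())  # normalize case and spacing
    if key not in SYNONYMS:
        raise ValueError(f"unknown preprocessing term: {term!r}")
    return SYNONYMS[key]


print(resolve("Data Balancing").__name__)
```

With this layout, different terms from the literature end up pointing at the same class, so a tool can present one operator category regardless of the vocabulary the user knows.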

In order to explain the elements related to data used in KDD processes, Figure 4 presents the conceptual model that characterizes a dataset. In relation to the extensional perspective, a Dataset is a specialization of ML Schema::Data, which may be specialized into different types of representation according to its granularity and identity principles, such as RDBS-O::Line and Column Value. A Dataset is a two-dimensional tabular structure for representing data, which is composed of RDBS-O::Line instances. Each RDBS-O::Line instance is a true proposition (fact) of the problem to be analyzed, and it is instantiated according to an RDBS-O::Line Type. Thus, a Dataset is described by an RDBS-O::Line Type, which is said to be its schema. The RDBS-O::Line Type aggregates a set of RDBS-O::Columns, where each column represents an attribute that describes some Column Values. A set of Column Values, described by the different RDBS-O::Columns that compose an RDBS-O::Line Type, constitutes an RDBS-O::Line, or a fact. Finally, each RDBS-O::Column is defined by an RDBS-O::Data Type, which is specialized into Qualitative Data Type, when the domain of values is categorical, or Quantitative Data Type, when it represents numerical values.
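The relationships just described (a dataset described by a line type, a line type aggregating columns, a column defined by a data type) can be sketched in Python as follows. This is an illustrative data structure only, assuming plain tuples for lines; the class and method names are chosen here for readability and are not an implementation of PPO-O or RDBS-O.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class DataType(Enum):
    QUALITATIVE = "qualitative"    # categorical domain of values
    QUANTITATIVE = "quantitative"  # numerical values


@dataclass(frozen=True)
class Column:
    name: str
    data_type: DataType


@dataclass(frozen=True)
class LineType:
    """Schema of a dataset: an ordered aggregation of columns."""
    columns: tuple


@dataclass
class Dataset:
    line_type: LineType
    lines: list = field(default_factory=list)  # each line is a tuple of column values

    def add_line(self, *values):
        """Add one fact; it must supply one value per column of the schema."""
        if len(values) != len(self.line_type.columns):
            raise ValueError("line does not match the line type")
        self.lines.append(tuple(values))

    def column_values(self, name):
        """Return the values of one column across all lines."""
        names = [c.name for c in self.line_type.columns]
        index = names.index(name)
        return [line[index] for line in self.lines]


schema = LineType((Column("age", DataType.QUANTITATIVE),
                   Column("risk", DataType.QUALITATIVE)))
dataset = Dataset(schema)
dataset.add_line(34, "low")
dataset.add_line(51, "high")
print(dataset.column_values("age"))
```

The same structure exposes the three granularity levels mentioned earlier: the whole `Dataset`, one column through `column_values`, and one line as an element of `lines`.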

While preprocessing data in the context of a KDD process, a set of measured values can help in characterizing a dataset and its columns. A Dataset Characteristic, for example, may be its dimensionality (number of lines and columns), or the proportion of missing values. Also, descriptive statistics, such as mean and standard deviation, may characterize a column (Column Characteristic). These measured values can be obtained from the RDBS-O::Lines or Column Values that constitute a dataset or a column, respectively. This characterization facilitates the understanding of the data under analysis, and the identification of the need to apply other preprocessing operators.

Figure 4: UFO-A based model for the dataset concept.
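The characteristics mentioned above are simple to compute. The sketch below, using only the Python standard library, derives the dimensionality and the proportion of missing values of a dataset and the mean and standard deviation of a column. It assumes that lines are tuples, that missing values are encoded as `None`, and that the sample standard deviation is the intended statistic; the function names are hypothetical.

```python
import statistics


def dataset_characteristics(lines):
    """Dimensionality and proportion of missing values of a dataset."""
    n_lines = len(lines)
    n_columns = len(lines[0]) if lines else 0
    total = n_lines * n_columns
    missing = sum(value is None for line in lines for value in line)
    return {
        "lines": n_lines,
        "columns": n_columns,
        "missing_ratio": missing / total if total else 0.0,
    }


def column_characteristics(values):
    """Mean and sample standard deviation of a quantitative column.

    Missing values (None) are ignored; at least two values must be present.
    """
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
    }


lines = [(1.0, "a"), (2.0, None), (3.0, "b"), (None, "a")]
print(dataset_characteristics(lines))
print(column_characteristics([line[0] for line in lines]))
```

A high `missing_ratio`, for instance, would suggest applying a data cleaning operator before any other transformation.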

A Labeled Dataset is a type of Dataset specifically created to be processed by a Supervised Learning Task. When the goal is to generate a model to predict the value of a Target Column, defined by a Nominal Qualitative Data Type, the dataset is to be processed by a Supervised Learning Classification Task. Other specializations of Supervised Learning Tasks were not within the scope of the present work.

As Figure 5 shows, the Data Preprocessing Assistant tool is a computational Software resource whose purpose is to support the execution of a Data Preprocessing Workflow Plan. This tool uses the metadata of the dataset and its columns (Dataset Characteristic and Column Characteristic) to systematically facilitate the choice of Data Transformations. In parallel, the tool captures the retrospective provenance of each Operator Execution, registering the transformation implementation (Data Preprocessing Executable Operator) that each RDBS-O::Column of a raw dataset is submitted to in order to generate the corresponding training and test datasets. In this way, the execution of each data transformation is encapsulated by a data capture task, which occurs through a function call, that is, a program execution. Note that the executions belong to a Data Preprocessing Executable Workflow, which implements a Data Preprocessing Workflow Plan, initially defined by the Assistant tool.
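One way to picture a transformation that is encapsulated by a data capture task is a Python decorator that wraps the function call and records what was executed. The sketch below is an assumption-laden illustration, not the actual Assistant-PP implementation: `capture_provenance`, `PROVENANCE_LOG`, `fill_missing` and the recorded fields are hypothetical names chosen to mirror the concepts in the model (Data Transformation, Data Preprocessing Executable Operator, RDBS-O::Column).

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory store for the captured retrospective provenance.
PROVENANCE_LOG = []


def capture_provenance(transformation):
    """Wrap an executable operator so that each call is recorded."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(column_name, values, **params):
            started = datetime.now(timezone.utc)
            result = func(column_name, values, **params)
            PROVENANCE_LOG.append({
                "transformation": transformation,      # planned Data Transformation
                "executable_operator": func.__name__,  # code that implements it
                "column": column_name,                 # column it was executed on
                "parameters": params,
                "started_at": started.isoformat(),
            })
            return result
        return wrapper
    return decorator


@capture_provenance("Data Cleaning")
def fill_missing(column_name, values, fill_value=0):
    """Replace missing values (None) in a column with a fixed value."""
    return [fill_value if v is None else v for v in values]


cleaned = fill_missing("age", [34, None, 51], fill_value=40)
print(cleaned)
print(PROVENANCE_LOG[-1]["executable_operator"], PROVENANCE_LOG[-1]["column"])
```

Because the capture happens inside the wrapper, the user executes operators as usual and the provenance is registered as a side effect of each call.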

Figures 3, 4 and 5 present models grounded on UFO-A. Kinds and subkinds categorize the operator hierarchy and the data and data type hierarchies. Additionally, note that a relator is used to bring to light the execution of an operator over a dataset column. This representation is especially important for provenance capture, as it distinguishes concepts such as the executable code, the transformation it implements, and the relationship of the code to the dataset column it runs on.

Figure 5: UFO-A based model to capture provenance information of the operator's execution.

The representation of the dynamic aspects of PPO-O is grounded on the UFO-B fragment, which is summarized in the metamodel of Figure 2. The model in Figure 6 represents the events of the preprocessing phase that are necessary for the construction of the training and test datasets. It identifies the situations that trigger each event, and the participants involved. It shows that a Data Preprocessing is a complex event composed of two sub-events. One of them is the Exploratory Data Analysis, which might identify, among other problems, columns with outliers, null or blank data, denormalized data or unbalanced data. As a consequence of this identification, a Situation named Column Characteristic Identified represents these anomalies, which might activate Dispositions that are inherent capabilities or abilities (inheresIn) of Data Preprocessing Operators, such as outlier removers, data imputation operators, data normalization operators, or data balancing operators.

In other words, once the Exploratory Data Analysis event identifies a suspicious column, it activates one of the dispositions of the Data Preprocessing Operators, which is manifested through another sub-event, named Data Preprocessing Operator Execution. As a final step, the Situation Processed Column represents the identified anomaly duly resolved.

Figure 6: UFO-A and UFO-B based model of the data preprocessing event.
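The event flow described above can be sketched in a few lines of Python: an exploratory analysis step produces "characteristic identified" situations, each situation activates the disposition of a suitable operator, and the operator execution yields the processed column. The sketch handles only missing values and uses mean imputation as a stand-in operator; all function and variable names are hypothetical, and the mapping from problems to operators is an assumption made for illustration.

```python
import statistics


def exploratory_data_analysis(columns):
    """Sub-event 1: return identified situations as (column, problem) pairs."""
    situations = []
    for name, values in columns.items():
        if any(v is None for v in values):
            situations.append((name, "missing_values"))
    return situations


def impute_mean(values):
    """Stand-in operator: replace missing numeric values with the column mean."""
    mean = statistics.mean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]


# Dispositions: which operator is capable of resolving which identified problem.
DISPOSITIONS = {"missing_values": impute_mean}


def data_preprocessing(columns):
    """Complex event: analysis followed by the operator executions it triggers."""
    processed = dict(columns)
    for name, problem in exploratory_data_analysis(columns):
        operator = DISPOSITIONS[problem]             # disposition activated
        processed[name] = operator(processed[name])  # sub-event 2: execution
    return processed


raw = {"age": [30, None, 50], "city": ["Rio", "Rio", "Niteroi"]}
print(data_preprocessing(raw))
```

Columns for which no situation is identified, such as `city` here, pass through unchanged, which mirrors the idea that a disposition is only manifested when a triggering situation exists.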

4.2 PPO-O Evaluation and Application

The PPO-O evaluation was carried out through the verification activity, according to the evaluation support process of the SABiO methodology. This activity involves the identification of the answers to the competency questions presented previously in this Section, using the concepts and relationships that constitute the ontology, as detailed below:

• CQ1. Data Preprocessing Operator is specialized into Data Cleaning Preprocessing Operator, Data Reduction Preprocessing Operator, Data Transformation Preprocessing Operator, Data Sampling Correction Preprocessing Operator and Data Partition Preprocessing Operator;

• CQ2. ML Schema::Data is specialized into Dataset, RDBS-O::Line and Column Value;

• CQ3. Dataset has RDBS-O::Line, which is an instance of RDBS-O::Line Type, which is defined by Dataset; the RDBS-O::Line has Column Value(s), which are instance(s) of RDBS-O::Column; the RDBS-O::Column is part of RDBS-O::Line Type and is defined by RDBS-O::Data Type;

• CQ4. RDBS-O::Data Type is specialized into Qualitative Data Type, which is in turn specialized into Nominal Qualitative Data Type and Ordinal Qualitative Data Type, and into Quantitative Data Type, which is in turn specialized into Discrete Quantitative Data Type and Continuous Quantitative Data Type;

• CQ5. The Labeled Dataset is a specialization

of Dataset, which is processed by a Supervised

Learning Task and has a Target Column, which is

a specialization of RDBS-O::Column;

• CQ6. The Supervised Learning Classiﬁcation

Task predicts the Target Column deﬁned by Nom-

inal Qualitative Data Type;

• CQ7. The Data Preprocessing Assistant is a

Kind of Software that supports a Data Prepro-

cessing Workﬂow Plan, which is a collection

of Data Transformations; a Data Transforma-

tion is implemented by a Data Preprocessing Ex-

ecutable Operator, whose execution over each

RDBS-O::Column is captured by the Relator Op-

erator Execution, which is executed by a Person

playing the User Relation Role;

• CQ8. The Data Preprocessing Operator Execution is a Complex Event manifesting a Data Processing Ability, which inheresIn a Data Preprocessing Executable Operator; the Data Processing Ability is activated by a Column Characteristic Identified, which had been broughtAbout by the Exploratory Data Analysis Complex Event; the Data Preprocessing Operator Execution is captured by the Relator Operator Execution, with the participationOf a Person playing the User Processual Role; and

• CQ9. The Data Preprocessing Executable Workflow is a collection of Data Preprocessing Executable Operators, whose execution over each RDBS-O::Column is captured by the Relator Operator Execution during the Data Preprocessing Operator Execution Complex Event, which occurs with the participationOf a Person playing the User Processual Role when using the Data Preprocessing Assistant.
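The taxonomic answers (CQ1 and CQ4) can be checked mechanically once the specialization links are written down. The sketch below is one possible encoding, assuming that "specializes" in those answers is read as "is specialized into" (supertype first) and that "Discreet" stands for "Discrete"; the dictionary SPECIALIZATIONS and the helpers subtypes and is_a are hypothetical names, not part of PPO-O or of any ontology tool.

```python
# Direct specializations taken from the answers to CQ1 and CQ4 (supertype -> subtypes).
SPECIALIZATIONS = {
    "Data Preprocessing Operator": [
        "Data Cleaning Preprocessing Operator",
        "Data Reduction Preprocessing Operator",
        "Data Transformation Preprocessing Operator",
        "Data Sampling Correction Preprocessing Operator",
        "Data Partition Preprocessing Operator",
    ],
    "RDBS-O::Data Type": ["Qualitative Data Type", "Quantitative Data Type"],
    "Qualitative Data Type": [
        "Nominal Qualitative Data Type",
        "Ordinal Qualitative Data Type",
    ],
    "Quantitative Data Type": [
        "Discrete Quantitative Data Type",
        "Continuous Quantitative Data Type",
    ],
}


def subtypes(concept, transitive=True):
    """Return the concepts that specialize `concept`, in depth-first order."""
    found = []
    for child in SPECIALIZATIONS.get(concept, []):
        found.append(child)
        if transitive:
            found.extend(subtypes(child))
    return found


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or specializes it, directly or not."""
    return concept == ancestor or concept in subtypes(ancestor)


# CQ1: which kinds of data preprocessing operators exist?
print(subtypes("Data Preprocessing Operator"))
# CQ4 and CQ6: a nominal qualitative type is a data type, but not a quantitative one.
print(is_a("Nominal Qualitative Data Type", "RDBS-O::Data Type"))
print(is_a("Nominal Qualitative Data Type", "Quantitative Data Type"))
```

A plain dictionary keeps the example self-contained; an operational version of the ontology would normally answer such questions with a reasoner or a query language instead.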

The conceptualization of the data preprocessing phase was made explicit through conceptual modeling grounded on UFO-A and UFO-B. Based on these models (PPO-O), it was possible to conceive the architecture shown in Figure 7, which was implemented as the PreProcessing Assistant (Assistant-PP) tool (https://github.com/LucimarLial/AssistantPP), using the Python programming language with the Streamlit framework and the PostgreSQL RDBMS. The main purpose of this tool is to guide a non-expert user in the selection of data preprocessing operators and, in parallel, to capture structured and rich provenance data for each data preprocessing operator execution, which are stored in the ProvOp layer.

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

Figure 7: Assistant-PP architecture components.
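As a rough illustration of capturing retrospective provenance per operator execution, the sketch below wraps an operator call and records who ran which operator on which column. It does not reproduce the ProvOp schema: the table operator_execution, its columns, the function names and the user name are assumptions, and an in-memory SQLite database stands in for PostgreSQL so that the example runs without a server.

```python
import sqlite3
from datetime import datetime, timezone


def open_provenance_store():
    """Create an in-memory store with a hypothetical provenance table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE operator_execution (
               id INTEGER PRIMARY KEY,
               dataset TEXT NOT NULL,
               column_name TEXT NOT NULL,
               operator TEXT NOT NULL,
               category TEXT NOT NULL,
               user_name TEXT NOT NULL,
               executed_at TEXT NOT NULL
           )"""
    )
    return conn


def run_with_provenance(conn, dataset, column_name, operator, category,
                        user_name, func, values):
    """Execute `func` over the column values and record the execution."""
    result = func(values)
    conn.execute(
        "INSERT INTO operator_execution "
        "(dataset, column_name, operator, category, user_name, executed_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (dataset, column_name, operator, category, user_name,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return result


def fill_unknown(values):
    """Illustrative cleaning operator: replace missing values with 'unknown'."""
    return ["unknown" if v is None else v for v in values]


conn = open_provenance_store()
cleaned = run_with_provenance(
    conn, "adult", "Workclass", "unknown imputation", "DC",
    "alice", fill_unknown, ["Private", None],
)
rows = conn.execute(
    "SELECT column_name, operator, category FROM operator_execution"
).fetchall()
print(cleaned, rows)
```

Recording the execution in the same function that runs the operator means that no execution can take place without leaving a trace, which is the property that column-level provenance relies on.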

The PPO-O validation activity consisted of instantiating competence question CQ9 during the execution of Assistant-PP and then running provenance queries against the ProvOp layer to verify whether the tool was able to capture the workflow generated during the creation of training and test datasets.

As a test case, we used the Adult dataset (Dua and Graff, 2017), specified for the Supervised Learning Classification Task, whose goal is to predict whether an adult’s income exceeds $50K per year, based on census data. Table 2 summarizes the number of RDBS-O::Lines and RDBS-O::Columns processed through the execution of a Data Preprocessing Executable Workflow. Note that the output datasets have fewer columns and lines than the input dataset. This is due to the impact of operators such as Data Reduction - Column Selection and Data Sampling Correction - Undersampling. In addition, the test dataset does not include the target column.

Table 2: Input and output Datasets processed by the Data Preprocessing Executable Workflow (CQ9).

Dataset            Lines   Columns
Input (Raw)        48842   15
Output (Training)  16250   14
Output (Test)      14653   13
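The reduction in lines can be illustrated with a small sketch of a holdout partition followed by random undersampling of the training part. The test dataset in Table 2 holds about 30% of the input lines (14653 of 48842), which would be consistent with a 70/30 holdout applied before undersampling, although the text states neither the ratio nor the order; both are assumptions here, as are the function names, the seed and the toy data.

```python
import random


def holdout_split(rows, test_fraction, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def undersample(rows, target, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def drop_column(rows, name):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# Toy data: 100 lines, one in four belonging to the minority class.
rows = [{"age": 20 + i, "target": ">50K" if i % 4 == 0 else "<=50K"}
        for i in range(100)]

train, test = holdout_split(rows, test_fraction=0.3)
train = undersample(train, "target")   # fewer lines, balanced classes
test = drop_column(test, "target")     # the test dataset has no target column
print(len(rows), len(train), len(test))
```

Undersampling is applied only to the training part in this sketch, so the test part keeps the original class distribution.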

Turning to the details of the preprocessing workflow execution, Table 3 shows all the Operator Executions that took place. It relates each Adult dataset RDBS-O::Column to the Data Preprocessing Executable Operator type, according to the categorization shown in Figure 3: Data Cleaning (DC), Data Reduction (DR), Data Transformation (DT), Data Partition (DP) and Data Sampling Correction (DS). Table 3 shows that the workflow involved 42 operator executions, which illustrates how hard it is to keep track of the preprocessing operations and why a record of each one of them is needed.

Analyzing Table 3, we can see that most of the Adult dataset columns were processed by DP and DS operators, which are the most general, i.e., independent of the data type of the column, unlike DC and DT operators, which take the data type of the column into account. Moreover, the analysis of the particular properties of the column values shows that only a few columns need to be cleaned and transformed to obtain more accurate data for ML algorithms. Examples are standardization for columns with a quantitative data type and coding for columns with a qualitative data type, such as the Sex column, which was coded by a Data Coding Operator, deriving two new columns, Sex 1 (1=male) and Sex 2 (0=female). In this case, DP and DS operators are applied to Sex 1 and Sex 2 and no longer to Sex. Finally, the Assistant-PP may suggest the elimination (DR operator) of columns with low relevance after the analysis of the correlation between the predictive columns and the target column. This was the case of Capital-loss and Fnlwgt.
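The two data transformations mentioned above can be sketched in a few lines. The exact encoding produced by the Data Coding Operator is not detailed in the text, so the sketch assumes a one-indicator-per-category (dummy) coding and z-score standardization; the function names standardize and dummy_code and the sample values are illustrative.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def dummy_code(column, values, categories):
    """Derive one 0/1 indicator column per category, named '<column> <index>'."""
    return {
        f"{column} {index}": [1 if v == category else 0 for v in values]
        for index, category in enumerate(categories, start=1)
    }


hours = standardize([2, 4, 4, 4, 5, 5, 7, 9])
sex = dummy_code("Sex", ["Male", "Female", "Male"], ["Male", "Female"])
print(hours)
print(sex)
```

After the coding step the original Sex column is replaced by the derived columns, which is why later operators are applied to Sex 1 and Sex 2 only.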

Table 3: Operator Execution by the Data Preprocessing Executable Workflow (CQ9).

Columns DC DR DT DP DS

Age X X X X

Capital-gain X X X X

Capital-loss X

Country X X X

Education X X

Education-num X X X X

Fnlwgt X

Hours-per-week X X X X

Marital-status X X

Occupation X X X X

Race X X

Relationship X X

Sex X X X

Workclass X X X

Target X X X

All these operator executions are captured by the Assistant-PP tool and registered in our provenance database. Figure 8 shows an example of a query on this database, which lists the operators, and their corresponding categories, that were applied to columns Age, Occupation, Fnlwgt, Sex, Sex 1, Sex 2 and Target. Note that Assistant-PP is able to provide a fine-grained provenance record, in which it is possible to keep track of the actions on each column of a dataset.

Figure 8: ProvOp - provenance query of part of the Data

Preprocessing Executable Workﬂow captured.
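A query of this kind could look like the following sketch. The actual ProvOp schema and the SQL behind Figure 8 are not given in the text, so the table operator_execution, its columns, the step numbering and the sample rows are hypothetical; only the pairing of Fnlwgt with a DR operator, of Sex with Data Coding, and of Sex 1 and Sex 2 with DP and DS operators follows the description above. SQLite again stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE operator_execution "
    "(column_name TEXT, operator TEXT, category TEXT, step INTEGER)"
)
# Illustrative rows only; the step order is an assumption.
conn.executemany(
    "INSERT INTO operator_execution VALUES (?, ?, ?, ?)",
    [
        ("Fnlwgt", "Column Selection", "DR", 1),
        ("Sex", "Data Coding", "DT", 2),
        ("Sex 1", "Data Partition", "DP", 3),
        ("Sex 2", "Data Partition", "DP", 3),
        ("Sex 1", "Undersampling", "DS", 4),
        ("Sex 2", "Undersampling", "DS", 4),
    ],
)


def operators_for(conn, columns):
    """List (column, operator, category) for the given columns, in execution order."""
    placeholders = ", ".join("?" for _ in columns)
    query = (
        "SELECT column_name, operator, category FROM operator_execution "
        f"WHERE column_name IN ({placeholders}) ORDER BY step, column_name"
    )
    return conn.execute(query, list(columns)).fetchall()


history = operators_for(conn, ["Sex", "Sex 1"])
for column_name, operator, category in history:
    print(column_name, operator, category)
```

Only the placeholders are interpolated into the SQL string; the column names themselves are passed as bound parameters.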

Figure 9: Assistant-PP - Data Preprocessing Executable

Operator by RDBS-O::Data Type.

It is worth highlighting that the Assistant-PP tool incorporates the knowledge captured by the models presented in Section 4.1 and guides a non-expert user to select the appropriate Data Preprocessing Operator, according to the RDBS-O::Data Type of the RDBS-O::Column. For example, the “unknown” imputation option is indicated only for columns of a Qualitative Data Type. Other imputation options, such as mode, median, mean, or the values 0 or -1, are indicated for columns of a Quantitative Data Type. Both examples are illustrated in Figure 9.
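The rule just described amounts to a lookup from data type to admissible imputation options. A minimal sketch, assuming the option lists exactly as stated in the text and using hypothetical names (IMPUTATION_OPTIONS, recommended_options, impute), could be:

```python
from statistics import mean, median, mode

# Options per data type, as described in the text; the keys are assumed labels.
IMPUTATION_OPTIONS = {
    "qualitative": ["unknown"],
    "quantitative": ["mode", "median", "mean", "0", "-1"],
}


def recommended_options(data_type):
    """Return the imputation options indicated for a column of this data type."""
    return IMPUTATION_OPTIONS[data_type]


def impute(values, data_type, option):
    """Fill missing values (None) using an option allowed for the data type."""
    if option not in recommended_options(data_type):
        raise ValueError(f"{option!r} is not indicated for {data_type} columns")
    known = [v for v in values if v is not None]
    if option == "unknown":
        fill = "unknown"
    elif option in ("0", "-1"):
        fill = int(option)
    else:
        fill = {"mode": mode, "median": median, "mean": mean}[option](known)
    return [fill if v is None else v for v in values]


print(recommended_options("qualitative"))
print(impute(["Private", None], "qualitative", "unknown"))
print(impute([1, None, 3], "quantitative", "median"))
```

Rejecting options that do not match the data type is what keeps a non-expert user from, for instance, asking for the mean of a qualitative column.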

More examples of such expertise are shown in Figure 10, where the Data Discretization operator is chosen for Continuous Quantitative Data Type columns (e.g. Age, Education-num, etc.). Note that the Capital-loss and Fnlwgt columns were previously excluded by the DR operator (Table 3), and therefore they do not appear in the list of available columns for Data Discretization. On the other hand, the Data Standardization operator is chosen for the Discrete and Continuous Quantitative Data Type columns. In this case, all available columns of this type are processed. Finally, the Data Coding operator, in Figure 11, is recommended only for Qualitative Data Type columns (e.g. Workclass, Sex, etc.).
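As an illustration of what a discretization operator does to a continuous column such as Age, the sketch below uses equal-width binning. The binning strategy used by Assistant-PP is not stated, so equal-width intervals, the function name equal_width_bins and the sample ages are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        return [0 for _ in values]
    width = (highest - lowest) / n_bins
    # The maximum would fall just past the last bin, so clamp it into the last one.
    return [min(int((v - lowest) / width), n_bins - 1) for v in values]


age_bins = equal_width_bins([17, 30, 45, 60, 90], n_bins=3)
print(age_bins)
```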

Figure 10: Assistant-PP - Data Preprocessing Executable

Operator by RDBS-O::Data Type.

Figure 11: Assistant-PP - Data Preprocessing Executable

Operator by RDBS-O::Data Type.


5 CONCLUSION AND FUTURE WORK

In this paper we present PPO-O, a domain reference ontology for the preprocessing phase of the KDD process, built using UFO ontological foundations. The idea is to support the non-expert user in data preprocessing by indicating the appropriate operators for the transformation of a curated raw dataset into training and test datasets. It was developed following the guidelines of the SABiO ontology engineering approach. Its focus is on the supervised learning classification task, and it reuses concepts from KDD and RDBMS ontologies, which incorporate already grounded concepts that are essential to clarify the semantics of the preprocessing phase.

The PPO-O evaluation was carried out by answering the previously defined competence questions, and showed the completeness of the represented concepts and relationships. In addition, a tool named Assistant-PP was built based on the PPO-O ontology, which made it capable of capturing the retrospective data provenance during the execution of preprocessing operators. It was thereby shown that the tool meets the reproducibility and explainability requirements for an executed preprocessing workflow.

As future work, we intend to extend PPO-O to incorporate other data preprocessing operators, as well as other ML tasks, such as operators applied to the Supervised Regression Task. We also plan to develop a new version of the assistant tool, using an operational version of the PPO-O ontology.

REFERENCES

Almeida, J. P. A., de Almeida Falbo, R., and Guizzardi,

G. (2019). Events as entities in ontology-driven con-

ceptual modeling. In Laender, A. H. F., Pernici, B.,

Lim, E., and de Oliveira, J. P. M., editors, Conceptual

Modeling - 38th International Conference, ER 2019,

Salvador, Brazil, November 4-7, 2019, Proceedings,

volume 11788 of Lecture Notes in Computer Science,

pages 469–483. Springer.

Celebi, R., Moreira, J. R., Hassan, A. A., Ayyar, S., Ridder,

L., Kuhn, T., and Dumontier, M. (2020). Towards fair

protocols and workﬂows: the openpredict use case.

PeerJ Computer Science, 6:e281.

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz,

T., Shearer, C., Wirth, R., et al. (2000). Crisp-dm 1.0:

Step-by-step data mining guide. SPSS inc, 9:13.

Cotton, P. (1999). Iso/iec fcd 13249-6: 1999 sql/mm saf-

005: Information technology-database languages-sql

multimedia and application packages-part 6: Data

mining.

CrowdFlower (2016). Cfds16.pdf. http://www2.cs.uh.edu/

∼ceick/UDM/CFDS16.pdf. (Accessed on

11/21/2020).

Date, C. J. (2004). Introdução a sistemas de bancos de dados. Elsevier Brasil.

de Aguiar, C. Z., de Almeida Falbo, R., and Souza, V.

E. S. (2018). Ontological representation of relational

databases. In ONTOBRAS, pages 140–151.

Dua, D. and Graff, C. (2017). UCI machine learning repos-

itory.

Elmasri, R. and Navathe, S. B. (2011). Database systems,

volume 9. Pearson Education Boston, MA.

Esteves, D., Moussallem, D., Neto, C. B., Soru, T., Us-

beck, R., Ackermann, M., and Lehmann, J. (2015).

Mex vocabulary: a lightweight interchange format for

machine learning experiments. In Proceedings of the

11th International Conference on Semantic Systems,

pages 169–176. ACM.

Faceli, K., Lorena, A., Gama, J., and Carvalho, A. (2015). Inteligência Artificial - Uma Abordagem de Aprendizado de Máquina. Edição 1. LTC Editora, 2015. 378 f.

Falbo, R. d. A. (2014). Sabio: Systematic approach for

building ontologies. In ONTO. COM/ODISE@ FOIS.

Falbo, R. d. A., Guizzardi, G., and Duarte, K. C. (2002). An

ontological approach to domain engineering. In Pro-

ceedings of the 14th international conference on Soft-

ware engineering and knowledge engineering, pages

351–358. ACM.

Faria, M. R., de Figueiredo, G. B., de Faria Cordeiro, K.,

Cavalcanti, M. C., and Campos, M. L. M. (2019). Ap-

plying multi-level theory to an information security

incident domain ontology. In ONTOBRAS.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthu-

rusamy, R., et al. (1996). Advances in knowledge

discovery and data mining, volume 21. AAAI press

Menlo Park.

Fernández-López, M., Gómez-Pérez, A., and Juristo, N. (1997). Methontology: from ontological art towards ontological engineering. AAAI-97 Spring Symposium Series.

Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., and

Schneider, L. (2002). Sweetening ontologies with

dolce. In International Conference on Knowledge En-

gineering and Knowledge Management, pages 166–

181. Springer.

García, S., Luengo, J., and Herrera, F. (2015). Data preprocessing in data mining. Springer.

Ghosh, M. E., Abdulrab, H., Naja, H., and Khalil, M.

(2017). Using the uniﬁed foundational ontology (ufo)

for grounding legal domain ontologies. In Proceed-

ings of the 9th International Joint Conference on

Knowledge Discovery, Knowledge Engineering and

Knowledge Management - Volume 2: KEOD, (IC3K

2017), pages 219–225. INSTICC, SciTePress.

Goldschmidt, R., Passos, E., and Bezerra, E. (2015). Data Mining, Conceitos, Técnicas, algoritmos, orientações e aplicações. Edição 2. Elsevier, 2015. 296 f.

Groth, P. and Moreau, L. (2013). W3c prov: An overview

of the prov family of documents.

Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies, 43(5):907–928.

Guarino, N. (1998). Formal ontology in information sys-

tems: Proceedings of the ﬁrst international conference

(FOIS’98), June 6-8, Trento, Italy, volume 46. IOS

press.

Guarino, N. and Welty, C. A. (2004). An overview of on-

toclean. In Handbook on ontologies, pages 151–171.

Springer.

Guizzardi, G. (2005). Ontological foundations for struc-

tural conceptual models. PhD thesis, University of

Twente, The Netherlands.

Guizzardi, G. (2012). Ontological meta-properties of de-

rived object types. In International Conference on Ad-

vanced Information Systems Engineering, pages 318–

333. Springer.

Guizzardi, G., Fonseca, C. M., Benevides, A. B., Almeida,

J. P. A., Porello, D., and Sales, T. P. (2018). En-

durant types in ontology-driven conceptual modeling:

Towards ontouml 2.0. In International Conference on

Conceptual Modeling, pages 136–150. Springer.

Guizzardi, G., Guarino, N., and Almeida, J. P. A. (2016).

Ontological considerations about the representation of

events and endurants in business models. In Interna-

tional Conference on Business Process Management,

pages 20–36. Springer.

Guizzardi, G. and Wagner, G. (2008). What’s in a relation-

ship: an ontological analysis. In International Confer-

ence on Conceptual Modeling, pages 83–97. Springer.

Guizzardi, G., Wagner, G., Almeida, J. P. A., and Guizzardi,

R. S. (2015). Towards ontological foundations for

conceptual modeling: The uniﬁed foundational ontol-

ogy (ufo) story. Applied ontology, 10(3-4):259–271.

Guizzardi, G., Wagner, G., de Almeida Falbo, R., Guiz-

zardi, R. S., and Almeida, J. P. A. (2013). Towards

ontological foundations for the conceptual modeling

of events. In International Conference on Conceptual

Modeling, pages 327–341. Springer.

Han, J., Pei, J., and Kamber, M. (2011). Data mining: con-

cepts and techniques. Elsevier.

Herre, H., Heller, B., Burek, P., Hoehndorf, R., Loebe, F.,

and Michalek, H. (2006). General formal ontology

(gfo)-a foundational ontology integrating objects and

processes [version 1.0].

Horrocks, I., Patel-Schneider, P. F., Boley, H., Tabet, S.,

Grosof, B., Dean, M., et al. (2004). Swrl: A semantic

web rule language combining owl and ruleml. W3C

Member submission, 21(79):1–31.

Keet, C. M., Ławrynowicz, A., d’Amato, C., Kalousis, A.,

Nguyen, P., Palma, R., Stevens, R., and Hilario, M.

(2015). The data mining optimization ontology. Jour-

nal of web semantics, 32:43–53.

Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Ch-

eney, J., Corsar, D., Garijo, D., Soiland-Reyes, S.,

Zednik, S., and Zhao, J. (2013). Prov-o: The prov

ontology. W3C recommendation, 30.

Nigro, H. O. (2007). Data Mining with Ontologies: Implementations, Findings, and Frameworks. IGI Global.

Panov, P., Soldatova, L., and Džeroski, S. (2013). Ontodm-kdd: ontology for representing the knowledge discovery process. In International Conference on Discovery Science, pages 126–140. Springer.

Provost, F. and Fawcett, T. (2016). Data science para negócios. Tradução de Marina Boscatto.

Publio, G. C., Esteves, D., Ławrynowicz, A., Panov, P.,

Soldatova, L., Soru, T., Vanschoren, J., and Zafar, H.

(2018). Ml-schema: Exposing the semantics of ma-

chine learning with schemas and ontologies. arXiv

preprint arXiv:1807.05351.

Silva, C. and Belo, O. (2018). A core ontology for brazilian higher education institutions. In Proceedings of the 10th International Conference on Computer Supported Education - Volume 2: CSEDU, pages 377–383. INSTICC, SciTePress.

Souza, R., Azevedo, L. G., Lourenço, V., Soares, E., Thiago, R., Brandão, R., Civitarese, D., Brazil, E. V., Moreno, M., Valduriez, P., et al. (2020). Workflow provenance in the lifecycle of scientific machine learning. arXiv preprint arXiv:2010.00330.

Suárez-Figueroa, M. C., Gómez-Pérez, A., and Fernández-López, M. (2015). The neon methodology framework: A scenario-based methodology for ontology development. Applied Ontology, 10:107–145.

Tesolin, J., Silva, M., Campos, M., Moura, D., and Cav-

alcanti, M. C. (2020). Critical communications sce-

narios description based on ontological analysis. In

ONTOBRAS.

Vanschoren, J. and Soldatova, L. (2010). Exposé: An ontology for data mining experiments. In International workshop on third generation data mining: Towards service-oriented knowledge discovery (SoKD-2010), pages 31–46.
