Authors:
Lucimar de A. Lial Moura (1); Marcus Albert A. da Silva (2); Kelli de Faria Cordeiro (3, 1) and Maria Cláudia Cavalcanti (2, 1)
Affiliations:
(1) Departamento de Sistemas e Computação, Instituto Militar de Engenharia (IME), Rio de Janeiro, RJ, Brazil
(2) Departamento de Engenharia de Defesa, Instituto Militar de Engenharia (IME), Rio de Janeiro, RJ, Brazil
(3) Centro de Análise de Sistemas Navais (CASNAV), Rio de Janeiro, RJ, Brazil
Keyword(s):
Data Preprocessing, Training and Test Datasets, Ontology, UFO, Provenance.
Abstract:
In the knowledge discovery process, a set of activities guides the data preprocessing phase; one of them is the transformation of raw data into training and test data. This complex and multidisciplinary phase involves concepts and knowledge that are structured in distinct and particular ways across the literature and specialized tools, demanding data scientists with suitable expertise. In this work, we present PPO-O, a reference ontology of data preprocessing operators, which identifies and represents the semantics of the concepts related to the data preprocessing phase. Moreover, the ontology highlights the data preprocessing operators used to prepare training and test datasets. Based on PPO-O, the Assistant-PP tool was developed; it captures retrospective data provenance during the execution of data preprocessing operators, facilitating the reproducibility and explainability of the datasets created. This approach might be helpful to non-expert users in data preprocessing.