In this paper we present PPO-O, a domain reference
ontology for the preprocessing phase of the KDD
process, built on UFO ontological foundations. It
aims to support non-expert users in data
preprocessing by indicating the appropriate
operators for transforming a curated raw dataset
into training and test datasets. PPO-O was developed
following the guidelines of the SABiO ontology
engineering approach. It focuses on the supervised
learning classification task and reuses concepts
from KDD and RDBMS ontologies, which provide already
grounded concepts that are essential to clarify the
semantics of the preprocessing phase.
PPO-O was evaluated by answering the previously
defined competency questions, which showed the
completeness of the represented concepts and
relationships. In addition, a tool named
Assistant-PP was built on the PPO-O ontology, which
enables it to capture retrospective data provenance
during the execution of preprocessing operators.
This shows that Assistant-PP meets the
reproducibility and explainability requirements for
an executed preprocessing workflow.
As future work, we intend to extend PPO-O to
incorporate other data preprocessing operators, as
well as operators for other ML tasks, such as the
supervised regression task. We also plan to develop
a new version of the assistant tool that uses an
operational version of the PPO-O ontology.
A Well-founded Ontology to Support the Preparation of Training and Test Datasets