2 RELATED WORK
Data engineers frequently face data quality problems when developing analytical systems. The use of multiple sources that are structurally and semantically heterogeneous, and that lack documentation and consistency, results in data quality problems that compromise trust in analytical systems.
Data quality can be related to very different problems that produce noisy data and lead to wrong or inadequate analyses. These problems include missing values, data duplication, misspellings, and contradictory or inconsistent values. Rahm (Rahm & Do, 2000) classified data quality problems into single-source and multi-source problems. In both scenarios, problems can occur at the schema and at the instance level. Rahm also discusses cleaning approaches to deal with such problems, presenting the phases required by data cleaning processes: data analysis, involving the identification of metadata; the definition of transformation and mapping rules, applied by an ETL process that ensures a common data schema to represent multi-source data; verification of ETL correctness and effectiveness; execution of the transformation steps; and the backflow of cleaned data, which corrects data directly in the sources to reduce the need for further cleaning. Rahm also addresses conflict resolution, describing preparation steps that involve extracting data from free-form attributes and validating and correcting data, which can be done using existing attributes or data dictionaries to correct or even standardize data values. Another common problem is the identification of matching instances without common attributes, which involves computing similarity measures to evaluate the matching confidence between data. Most of these problems can be identified using specific strategies, typically embodied in data profiling tools, which provide several metrics to measure data adequacy. However, despite being useful, these metrics are not easy to understand and use. Perfect data is almost impossible in real scenarios, and it is difficult to integrate the metrics of each data quality dimension into a conclusion about the overall state of the data.
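As a minimal illustration of the similarity-based matching mentioned above (our own sketch, not the procedure described by Rahm & Do; the record fields, the string-similarity function, and the thresholds are assumptions), two sources without a shared key can be paired by scoring the similarity of a descriptive attribute and keeping the pairs whose score reaches a threshold:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_candidates(source_a, source_b, threshold=0.8):
    """Pair records from two sources that lack a common key, keeping pairs
    whose similarity on a descriptive attribute reaches the threshold
    (the matching confidence)."""
    matches = []
    for rec_a in source_a:
        for rec_b in source_b:
            score = similarity(rec_a["name"], rec_b["name"])
            if score >= threshold:
                matches.append((rec_a["name"], rec_b["name"], round(score, 2)))
    return matches

# Hypothetical customer records described differently in two sources.
crm = [{"name": "ACME Corporation"}, {"name": "Globex Ltd."}]
erp = [{"name": "Acme Corp."}, {"name": "Initech"}]
print(match_candidates(crm, erp, threshold=0.6))
```

Profiling and deduplication tools typically replace this naive pairwise loop with blocking strategies and more robust distance functions, but the underlying idea of a matching-confidence score is the same.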
DQ is also frequently classified and measured using dimensions, each one representing a class of errors that can occur. Several proposals follow theoretical (Wand & Wang, 1996), empirical (Wang & Strong, 1996), or intuitive (Redman & Godfrey, 1997) approaches. Comparing them and selecting a definitive approach is not an easy task, since most proposals in the area rest on different assumptions about the granularity level considered and the data model addressed (most research works focus only on the relational model).
Batini (Batini & Scannapieco, 2016) provided a classification framework based on a set of quality dimensions: Accuracy, Completeness, Redundancy, Readability, Accessibility, Consistency, Usefulness, and Trust. For each dimension, several metrics are presented in the form of measurable values. Several other authors addressed similar dimension classifications (Loshin, 2010) (Kumar & Thareja, 2013), providing slight variations of the dimension definitions and using specific taxonomies and ontologies to relate the dimensions and their potential metrics (Geisler, Quix, Weber, & Jarke, 2016). For example, in (Loshin, 2010), the author classified the dimensions as intrinsic (related to the data model, such as structure and accuracy) or contextual (related to the bounded context, such as completeness and consistency).
Batini (Batini & Scannapieco, 2016) divides DQ into several dimensions. The Accuracy dimension defines how accurately a specific value represents reality. (Structural) accuracy can be classified as syntactic or semantic. Syntactic accuracy measures the distance between a stored value and its correct representation (e.g., when mistyped input is stored) and is evaluated with comparison functions that compute the distance between two values. For semantic accuracy, the correct value must be known or deduced, and accuracy is measured on a “correct”/“not correct” basis. For sets of values, the duplication problem is also addressed, which arises mainly when less structured sources are used. Accuracy can be discussed at several scopes: single value, attribute, relation (or entity), and database. When a set of values is considered, the ratio between correct values and total values can be used. The relative importance of value accuracy is also considered, since the errors found can have different weights in the context of a tuple or a table. For example, an accuracy error in attributes used for matching/identifying data is more important than one in descriptive attributes that do not compromise data integration.
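To make these notions concrete, the following sketch (our own illustration; the admissible domain, the similarity function, and the threshold are assumptions, not Batini's formulations) approximates the syntactic accuracy of a value by its closest similarity to an admissible domain value and computes set-level accuracy as the ratio of correct values to total values:

```python
from difflib import SequenceMatcher

# Hypothetical admissible domain for a "city" attribute.
VALID_CITIES = {"Lisbon", "Porto", "Braga"}

def syntactic_accuracy(value: str) -> float:
    """Similarity of a value to its closest admissible representation
    (1.0 means the value is an exact member of the domain)."""
    return max(SequenceMatcher(None, value, v).ratio() for v in VALID_CITIES)

def set_accuracy(values, threshold=1.0):
    """Set-level accuracy: ratio of correct values to total values."""
    correct = sum(1 for v in values if syntactic_accuracy(v) >= threshold)
    return correct / len(values)

column = ["Lisbon", "Lisbpn", "Porto", "Bragga"]          # two misspellings
print([round(syntactic_accuracy(v), 2) for v in column])  # [1.0, 0.83, 1.0, 0.91]
print(set_accuracy(column))                               # 0.5 (2 of 4 values exact)
```

In practice, dedicated edit-distance or phonetic comparison functions are commonly used instead of a generic similarity ratio, and weighting schemes can encode the relative importance of attributes discussed above.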
The completeness dimension also covers several problems that typically occur in real-world scenarios. Non-null values assigned to data elements are analyzed considering the context in which they are applied (Loshin, 2009). Null values can be assigned to data elements that should have a valid value (and are therefore considered invalid cases), or to optional values (in case the