4 UNOBTRUSIVE INTEGRATION OF DATA QUALITY IN INTERACTIVE DATA ANALYSIS
In this section, we present our novel approach to overcome the aforementioned limitations of existing approaches and, thus, to enable domain experts to easily maintain an overview of data quality without additional effort while conducting an exploratory analysis. To achieve this, (a) different user roles need to collaborate and contribute their respective strengths, (b) generic data quality metrics need to be defined, (c) these data quality metrics have to be adapted to the analysis context, and (d) data quality issues need to be identified and solutions recommended.
Figure 3 shows an overview of our approach. As
described above, a popular approach to involve do-
main experts in exploratory analysis is the use of
graphical data analysis tools. These provide a range
of data sources and allow domain experts without
deeper technical knowledge to combine these data
sources and specify transformations in a graphical
manner. To keep the necessary effort for domain ex-
perts at a minimum, these data sources should be pro-
vided in advance by an IT expert. In our approach,
this is done in the preparation phase (Figure 3, 1).
Here, an IT expert first adds a new data source to the
repository (Figure 3, 1a), e.g., by specifying a connec-
tion to the respective database. In a subsequent step,
the IT expert has to create a domain-agnostic ground truth for this data source (Figure 3, 1b). This ground truth comprises various data quality metrics that must hold for the data to be considered of sufficient quality. For instance, if we consider the data quality dimension completeness under the closed-world assumption for one data feature, the IT expert has to specify which values the data must contain, e.g., the 50 states of the USA; each deviating or missing value would then decrease the quality. Once this ground truth is specified for each data feature, it is stored as a data quality artifact in the repository, and the IT expert's task ends for the time being.
This phase is decoupled from the exploratory analysis of the domain experts to provide flexibility without the need to reach out to an IT expert. To support domain experts in their exploratory analysis with regard to data quality, we describe our approach based on graphical data analysis tools. For a domain expert, the first phase is the specification phase (Figure 3, 2), in which the analysis workflow is created.
First, the required data sources are selected (Fig-
ure 3, 2a). Since a data quality artifact is provided
in the repository for all available data sources in this
phase, the data quality can be determined by means
of this artifact (Figure 3, 2b). The calculated data quality metrics are then displayed to the domain expert and allow for a direct assessment (Figure 3, 3a) of whether this data source is of sufficient quality or, if it is not, where possible problems may be located, e.g., whether there are data completeness concerns. Subsequently, the domain expert can decide to either change the data source(s) (Figure 3, 2a) or to proceed with the specification of the analysis workflow (Figure 3, 2c), e.g., by adding preprocessing or data mining transformations.
If the latter is chosen, the workflow is executed (Figure 3, 2d), which is necessary because these
transformations affect the data and, thus, influence the
data quality. In both cases, the data quality is calcu-
lated again based on the data quality artifacts (Fig-
ure 3, 2b) and displayed for review (Figure 3, 3a). Up
to this point, all data quality metrics are calculated
based on the data quality artifact defined by the IT
expert in the preparation phase and are, therefore, in-
dependent of the context of the domain expert’s anal-
ysis. Although this is a significant advance over state-
of-the-art approaches without automatic data quality
monitoring, it is still insufficient in many cases. For
instance, it is possible that the data timeliness dimen-
sion has been defined in advance in such a way that
the data must be up-to-date on a daily basis, but for
the current analysis, historical data is required.
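To make this concrete, a timeliness metric of this kind could be sketched as follows. The function name, thresholds, and dates are our own hypothetical choices, not a prescribed implementation:

```python
from datetime import datetime, timedelta

def timeliness(timestamps, max_age, now):
    """Share of records whose age does not exceed max_age."""
    if not timestamps:
        return 1.0
    fresh = sum(1 for t in timestamps if now - t <= max_age)
    return fresh / len(timestamps)

now = datetime(2023, 1, 10)
stamps = [datetime(2023, 1, 10), datetime(2023, 1, 9),
          datetime(2020, 6, 1)]

# Predefined artifact: data must be up-to-date on a daily basis,
# so the historical record from 2020 is penalized (score 2/3).
daily = timeliness(stamps, timedelta(days=1), now)

# For an analysis of historical data this threshold is too strict;
# relaxing it yields full quality (score 1.0).
historical = timeliness(stamps, timedelta(days=5 * 365), now)
```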
In this scenario, the definition of data quality by
means of the data quality artifact is no longer suitable
and has to be adapted to the intended analysis. This
is supported by our approach in the adaptation phase
(Figure 3, 4), where the domain-specific quality metrics are defined (Figure 3, 4a). These metrics can comprise a wide variety of measures, which is why our architecture is generic and extensible in this respect. Possible
adjustments include, for instance, adding or remov-
ing quality dimensions, adjusting thresholds, or even
weighting the different dimensions according to their
importance. With each adjustment, the now domain-
specific data quality is immediately calculated (Fig-
ure 3, 4b) and again visualized for assessment by the
domain expert (Figure 3, 3b).
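For example, weighting the dimensions according to their importance could be realized as a simple weighted average of the per-dimension scores. This is a sketch with illustrative names; the concrete aggregation is deliberately left open by our extensible architecture:

```python
def weighted_quality(scores, weights):
    """Aggregate per-dimension quality scores into a single value,
    weighting each dimension by its importance."""
    total = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total

# Hypothetical adaptation by the domain expert: completeness matters
# twice as much as timeliness for the current analysis context.
scores = {"completeness": 0.9, "timeliness": 0.5}
weights = {"completeness": 2.0, "timeliness": 1.0}

overall = weighted_quality(scores, weights)  # (0.9*2 + 0.5*1) / 3
```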
By this stage of the process, data quality metrics
have been predefined by an IT expert and/or adapted
to the context by a domain expert. However, if the
calculated data quality is still insufficient, or if a more
reliable analysis should be performed, the domain ex-
pert is required to focus on the data itself and use the
improvement phase (Figure 3, 5) to enrich or clean the
data until a sufficient data quality is achieved.
ICEIS 2023 - 25th International Conference on Enterprise Information Systems