Various frameworks have been proposed that address data quality definition and measurement for both numerical and non-numerical data, with emphasis on the data types typically found in (No)SQL databases (Batini et al., 2009; Li, 2012). For quality measurement purposes, these frameworks have analysed the concept of data quality along a number of different dimensions, proposing a specific metric for each such dimension. More recently, a framework has been proposed that explicitly addresses numerical data (Marev et al., 2018), focusing on eight data quality dimensions relevant to the numerical subdomain.
However, some of the numerical data quality dimensions proposed in (Marev et al., 2018) – namely, accessibility, currency, timeliness, and uniqueness – only address extrinsic data quality aspects. More specifically, ease and speed of access, newness, real-time loading and processing, and lack of duplicates (i.e., the exemplifying instantiations of these dimensions) are not intrinsic properties of numerical data as such, but depend on external conditions. They can in fact be addressed by modifying 'the machinery' around the data rather than the data themselves. For instance, extrinsic data quality issues may delay workflows (because of the extra time needed to acquire and filter all the data needed for computations) but have no impact on the quality of workflow results.
Conversely, the other four dimensions proposed in (Marev et al., 2018) (namely, accuracy, consistency, completeness, and precision) represent properties of numerical datasets that directly affect the quality of workflow results. In other words, they address intrinsic numerical data quality aspects: improving the quality of workflow results explicitly depends on improving the workflow-consumed datasets along one or more of the accuracy, consistency, completeness, and precision dimensions, which are discussed in detail below.
We now introduce and describe the following features that set numerical data apart from other data types:
Intrinsic Approximation. Numerical data are often the result of either physical measurements or model-based calculations. Hence, in theory at least, such results can take any value in a given subset of the real numbers. In a very few cases, complex numbers (i.e., real and imaginary value pairs represented as z = x + iy, where i = √−1) are used. However, they are not discussed in this paper, as the real and the imaginary parts would be treated separately using techniques developed for real numbers. Similarly, we do not discuss integer numbers, as they either represent extremely approximated values (in which case they can be treated as very rough real numbers subject to our framework) or counters/identifiers of no interest in our context.
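For instance, the split of a complex value into two real components is immediate in practice; a minimal Python illustration (ours, not part of the cited framework):

    # A complex datum z = x + iy splits into two real components,
    # each of which can then be handled with real-number techniques.
    z = complex(3.0, 4.0)
    x, y = z.real, z.imag
    print(x, y)   # 3.0 4.0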
Having restricted our focus to real numbers, we note that there are two compelling reasons why numerical data values are never actually represented as real values but rather as rational values. The latter are defined as ratios between two integer values (with a non-zero denominator) and are characterised either by a finite number of digits or by an endless repetition of the same finite sequence of digits.
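To make this concrete, the following Python snippet (an illustration of ours, not part of the framework in (Marev et al., 2018)) shows that a machine-stored floating-point value is in fact an exact ratio of two integers, and that a real number with an endlessly repeating expansion such as 1/3 is replaced by a nearby finite-expansion rational when stored:

    from fractions import Fraction

    # Any IEEE 754 double is exactly a rational number: the ratio
    # of two integers with a non-zero denominator.
    num, den = (0.1).as_integer_ratio()
    print(num, den)        # 3602879701896396 36028797018963968

    # 1/3 repeats the same digit endlessly; stored as a double, it
    # becomes a nearby rational with finitely many binary digits.
    print(Fraction(1, 3))    # 1/3 (exact rational)
    print(Fraction(1 / 3))   # 6004799503160661/18014398509481984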
The first reason why rational numbers are used to represent real numeric entities in any practical situation is that both measurements and model-based calculations are approximations of the measured or computed reality. This leads to a truncation in the number of digits used to represent a real number, which depends on the accuracy of the measurement or calculation in each specific context.
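A minimal sketch of such accuracy-driven truncation (the helper name is hypothetical and merely illustrative):

    import math

    def truncate_to_accuracy(value: float, abs_uncertainty: float) -> float:
        # Keep only the decimal digits that the stated absolute
        # measurement uncertainty justifies (hypothetical helper).
        ndigits = -int(math.floor(math.log10(abs_uncertainty)))
        return round(value, ndigits)

    # A value measured with a +/-0.01 instrument keeps two decimals.
    print(truncate_to_accuracy(3.14159265, 0.01))   # 3.14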
The second reason why rational numbers are used in place of real ones is that current (and, likely, future) digital technologies have a limited capacity to store and process real numbers. Pragmatically, although the precision with which numbers can be represented is constantly improving, no storage medium is likely to offer in the foreseeable future the unbounded capacity that would be required to represent arbitrary real values with full mathematical precision.
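This limitation is easy to observe; for instance, under the IEEE 754 double-precision representation used by default in Python:

    import sys

    # A double devotes a fixed 53 bits to its significand, so only
    # a finite subset of the reals is representable.
    print(sys.float_info.epsilon)   # ~2.22e-16: gap between doubles near 1.0
    print(0.1 + 0.2 == 0.3)         # False: each side is a rounded rational
    print(f"{0.1:.20f}")            # 0.10000000000000000555...: stored value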
Intrinsic Uncertainty. A fundamental characteristic of numerical data, which sets them apart from other data types, is that numbers generally have an intrinsic uncertainty associated with them. This is because numerical data typically represent the result of either approximate physical measurements or calculations based on truncations and finite-method approximations. Both such measurements and calculations attach an inherently unavoidable degree of uncertainty to their results. Uncertainty is thus an intrinsic property of all numerical datasets that are not just collections of integer counter values or identifiers. One of the contributions of this paper is the modelling of intrinsic uncertainty and of how it can be used to measure data quality.
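As an indication of what such a model can look like, the sketch below (our own minimal illustration, under the assumption that each datum carries a standard uncertainty estimate; it is not the exact model developed later in the paper) pairs every value with its uncertainty and propagates uncertainties through a computation:

    import math
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class UncertainValue:
        value: float   # the measured or computed number
        sigma: float   # estimated standard uncertainty

        def __add__(self, other: "UncertainValue") -> "UncertainValue":
            # First-order propagation for independent quantities.
            return UncertainValue(self.value + other.value,
                                  math.hypot(self.sigma, other.sigma))

        def relative(self) -> float:
            # Relative uncertainty: a simple per-datum quality indicator.
            return abs(self.sigma / self.value)

    g = UncertainValue(9.81, 0.02)   # e.g., a physical measurement
    print((g + g).sigma)             # ~0.0283: uncertainty propagates
    print(round(g.relative(), 5))    # 0.00204: relative uncertainty of g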
Numerical data uncertainty and its implications are often overlooked in numerical workflows. This may be because uncertainty is not perceived to have a major impact on numerical information processing and its results: workflows tend to treat datasets as if they were uncertainty-free. However, this is a dangerous misconception, as uncertainty (which typically represents an estimate of the average indeterminacy associated with dataset values) is actually the basis for measuring numerical data quality and thus for evaluating the effect of different kinds of data quality improvements. Uncertainty is not only unavoidable because