Intrinsic Indicators for Numerical Data Quality

Milen S. Marev, Ernesto Compatangelo and Wamberto W. Vasconcelos

Department of Computing Science, University of Aberdeen, Aberdeen, AB24 3UE, U.K.

Keywords:

Data Quality, Intrinsic Data Quality, Data Quality Indicators, Pre-processing, Numerical Data Quality,

Numerical Data Quality.

Abstract:

This paper focuses on data quality indicators conceived to measure the quality of numerical datasets. We

have devised a set of three different indicators, namely Intrinsic Quality, Distance-based Quality Factor and

Information Entropy. The results of quality measures based on these indicators can be used in further data

processing, helping to support actual data quality improvements. We argue that the proposed indicators can

adequately capture in a quantitative way the impact of different numerical data quality issues including (but

not limited to) gaps, noise or outliers.

1 INTRODUCTION

The generation and processing of very large amounts

of digitally recorded information from a variety of

heterogeneous sources (sensorial or otherwise) is at

the core of the ‘big data society’ emerging at the on-

set of the 21

century. A relevant portion of this

information, which is seamlessly produced and con-

sumed to keep society going, consists of numerical

datasets. These are generated and processed as part

of the wider digitalised management of goods and ser-

vices, performed using computational workﬂows that

run from inception to delivery. Such workﬂows in-

creasingly use a combination of artiﬁcial intelligence

and other advanced software technologies to derive

new results in a variety of monitoring and processing

scenarios. For the results of workﬂows that produce

and use numerical datasets to be meaningful, accurate

and reliable (i.e., to be of quality), data used in each

workﬂow step input must comply with some context-

dependent quality metrics. A number of frameworks

and dimensions have been proposed to measure data

quality (Batini et al., 2009; Li, 2012; Redman, 2008;

Deming, 1991; Dobyns and Crawford-Mason, 1991;

Juran, 1989; Group, 2013; Batini and Scannapieco,

2006; Cai and Zhu, 2015; Pipino et al., 2002; Swan-

son, 1987; Olson and Lucas Jr, 1982; De Amicis et al.,

2006; Todoran et al., 2015; Hufford, 1996; Loshin,

2001; Marev et al., 2018); however, most of these

frameworks focus on non-numerical types such as al-

phanumerical strings, free text or timestamps. Once

focus is restricted to numerical types, uncertainty is

explicitly taken into account, and the existing qual-

ity dimensions are fully analysed in depth, the res-

ulting landscape only remains populated by very few

useful notions. Hence, new concepts for the effect-

ive measurement (i.e., for the quantitative evaluation)

of numerical data quality must be introduced. Hence,

in this paper we deﬁne a set of novel numerical data

quality indicators speciﬁcally designed to address ef-

fective quality measurements.

Paper Structure and Content. Section 2 identiﬁes

the relevant aspects for the deﬁnition of a quantitative

framework to measure numerical data quality and its

changes. Section 3 discusses the normalised decimal

format used to represent numerical data with uncer-

tainty. Section 4 introduces the core concept of in-

formation entropy and describes entropy variation as

the basis for measuring numerical data quality im-

provements. Section 5 uses the formulas as deﬁned

in the previous section and evaluates their efﬁciency

with the use of simple sine wave dataset. Finally,

Section 6 draws conclusions on the entropy-centred

framework presented in this paper and outlines fur-

ther research work in this area.

2 DATA QUALITY ASPECTS

The question arises as to how to deﬁne and meas-

ure quality in a numerical dataset characterised by a

given degree of associated uncertainty; this issue is

made more complicated by the fact that any numerical

data quality ﬁgure is inherently context-dependent.

Marev, M., Compatangelo, E. and Vasconcelos, W.

Intrinsic Indicators for Numerical Data Quality.

DOI: 10.5220/0009411403410348

In Proceedings of the 5th International Conference on Internet of Things, Big Data and Security (IoTBDS 2020), pages 341-348

ISBN: 978-989-758-426-8

341

Various frameworks have been proposed which ad-

dress data quality deﬁnition and measurement for

both numerical and non-numerical data, with em-

phasis on data types typically found in (No)SQL data-

bases (Batini et al., 2009; Li, 2012). For quality meas-

urement purposes, these frameworks have analysed

the concept of data quality along a number of differ-

ent dimensions, proposing a speciﬁc metric for each

such dimension. Moreover, a framework has been re-

cently proposed which explicitly addresses numerical

data (Marev et al., 2018), focusing on eight data qual-

ity dimensions that are relevant to the numerical sub-

domain.

However, some of the numerical data quality di-

mensions proposed in (Marev et al., 2018) – namely,

accessibility, currency, timeliness, and uniqueness –

only address extrinsic data quality aspects. More spe-

ciﬁcally, access easiness and speed, newness, real-

time loading and processing, and lack of duplicates

(i.e., the exemplifying instantiations of these dimen-

sions) are not intrinsic properties of numerical data

as such, but depend on some external conditions.

These can be actually addressed by modifying ‘the

machinery’ around data rather than data themselves.

For instance, extrinsic data quality issues may delay

workﬂows (because of the extra time needed to ac-

quire and ﬁlter all data needed for computations) but

have no impact on the quality of workﬂow results.

Conversely, the other four dimensions proposed

in (Marev et al., 2018) (namely, accuracy, consist-

ency, completeness, and precision) represent proper-

ties of numeric datasets that explicitly affect the qual-

ity of workﬂow results. In other words, they ad-

dress intrinsic numerical data quality aspects. This

is because the quality improvement of workﬂow res-

ults explicitly depends on the improvement of the

workﬂow-consumed datasets along one or more of

the accuracy, consistency, completeness, and preci-

sion dimensions, which are discussed in detail below.

We now introduce and describe the following fea-

tures that set numerical data apart from other types:

Intrinsic Approximation. Numerical data are often

the result of either physical measurements or model-

based calculations. Hence, in theory at least, such res-

ults can take any value in a given subset of real num-

bers. In a very few cases, complex numbers (i.e., real

and imaginary value pairs represented as z = x + iy,

where i =

√

−1) are used. However, they are not

discussed in this paper as the real and the imagin-

ary part would be separately treated using techniques

developed for real numbers. Similarly, we do not

discuss integer numbers, as they either represent ex-

tremely approximated values (in which case they can

be treated as very rough real numbers subject to our

framework) or they represent counters/identiﬁers of

no interest in our context.

Having restricted our focus to real numbers, we

note that there are two compelling reasons why nu-

merical data values are never actually represented as

real values but rather as rational values. The latter

are deﬁned as ratios between two integer values (with

a non-zero denominator) and are either characterised

by a ﬁnite number of digits or by an endless repetition

of the very same ﬁnite sequence of digits.

The ﬁrst reason why rational numbers are used

to represent real numeric entities in any practical

situation is that both measurements and model-

based calculations are approximations of the meas-

ured/computed reality. This leads to a truncation in

the number of digits used to represent a real number,

which depends on the accuracy of a measurement or

of a calculation in each speciﬁc context.

The second reason why rational numbers are used

in place of real ones is that current (and likely, future)

digital technologies have limited capacity to store and

process real numbers. Pragmatically, although the ac-

curacy of a number is constantly improving, we are

unlikely to reach a short-term situation of endless ca-

pacity whatever the medium, which is what would be

required to fully represent real values accurately with

a mathematical precision.

Intrinsic Uncertainty. A fundamental characteristic

of numerical data, which sets them apart from other

data types, is that numbers generally have an intrinsic

uncertainty associated to them. This is because nu-

merical data typically represent the result of either

approximate physical measurements or calculations

based on truncations and ﬁnite-method approxima-

tions. Both such measurements and calculations asso-

ciate an inherently unavoidable degree of uncertainty

to their results. Uncertainty is an intrinsic property of

all numerical datasets that are not just collections of

integer counter values or identiﬁers. One of the con-

tributions of this paper is the modelling of intrinsic

uncertainty and how this can be used to measure data

quality.

Numerical data uncertainty and its implications

are often overlooked in numerical workﬂows. This

may be due to uncertainty not being perceived to have

a major impact on numerical information processing

and on their results, which tend to focus on datasets as

if they were uncertainty-free. However, this is a dan-

gerous misconception, as uncertainty (which typically

represents an estimate of the average indeterminacy

associated with dataset values) is actually the basis to

measure numerical data quality and thus to evaluate

the effect of different kinds of data quality improve-

ments. Uncertainty is not only unavoidable because

IoTBDS 2020 - 5th International Conference on Internet of Things, Big Data and Security

342

of the way most numerical datasets are generated; it is

also needed as a basis for any quality considerations.

Characteristic Series Structure. A feature shared

by many numerical datasets is their structure as series

of value pairs, triples or quadruples. In this paper, we

only focus on pairs for simplicity, as n-ples are treated

similarly in the context we focus on.

A numerical series is a ﬁnite discretisation of

some theoretical function y = f (x), i.e., y

= f (x

where k = 1,2,... n. Such discretisation is often ar-

ranged as a list of pairs (y

) ordered by the value

of x

, where both numerical values in a pair have as-

sociated uncertainties that are generally independent

from one another. If x

represents time, then the list

of pairs (y

) is called a time series.

The abscissa x

represents the physical ‘independ-

ent’ variable, while the ordinate y

represents the ‘de-

pendent’ variable. Numerical data series are custom-

arily ordered by abscissa values, which are generally

spaced evenly. This may make it easier to detect

whether a dataset is lacking any elements within the

interval of independent variable values the dataset is

a record of. Ordinate values generally follow some

underlying physical pattern that is representative of a

physical phenomena or a discretised model.

In time series, the uncertainty associated with

each pair of values representing the element is often

the same for all x

on the one side and for all y

on the

other, although these two uncertainty values are gen-

erally different from each other. In case uncertainty

values are the same for all x

(and, separately, for all

) they do not need to be recorded beside each pair

of values, but can be speciﬁed separately elsewhere

for the whole dataset. This approach, which avoids

the overloading of a dataset with the unnecessary re-

petition of identical information, may however result

in the role and in the impact of the associated uncer-

tainty values being overlooked.

3 DATA UNCERTAINTY

The representation of numerical data with uncertainty

has a long history that encompasses over four hundred

years of experimental physics. In this leading branch

of science, the result of an observable variable meas-

urement is represented as a pair that speciﬁes both the

measured value and its associated measurement error,

where the latter characterises the uncertainty associ-

ated with the measurement process. Although differ-

ent numeric representations are currently used in sci-

ence, technology, engineering, and mathematics, the

decimal representation (see below) is by far the most

widely used to record and display numerical data.

3.1 The Normalised Decimal Format

In order to make our discussion precise, we intro-

duce a general and ﬂexible format for our numeric

data. The decimal representation is a class of differ-

ent variants conveying the same information in dif-

ferent formats. For instance, let us consider a ra-

tional number R: this could be either expressed in nat-

ural decimal format as R = 0.00000234567 or, more

compactly, in exponential format as R = 0.0234567×

−4

. The latter leaves some degree of freedom as to

how to ‘distribute’ non-zero numerical digits between

base and exponent. In our example,

R = 0.0234567 ×10

−4

= 0.234567 ×10

−5

= 2.34567 ×10

−6

One useful, standardised way to represent ‘real’

numerical data in exponential format is the normal-

ised decimal format (NDF), where the base is always

a number smaller than one but its ﬁrst fractional digit

is always non-zero. The exponent is set accordingly.

Using the EBNF (Extended Backus-Naur Form) nota-

tion, a (rational) number representing the result of

some measurement or digital calculation can be thus

represented in a ‘normalised’ form as

hNormalisedNumberi ::= hNDFi×hPoweri

where

hNDFi::= [ hSigni ] 0.hNonZeroDigitihDigiti

hPoweri ::= 10

[ hSigni ] hDigiti { hDigiti }

hSigni ::= + | −

hNonZeroDigiti ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

hDigiti ::= 0 | hNonZeroDigiti

Each numeric value resulting from a measurement

is characterised by an uncertainty, even if this is some-

times omitted or speciﬁed elsewhere. This depends

on a number of factors (e.g., limited accuracy of the

measurement process and/or limited precision of the

measuring instrument, environmental noise). Hence,

the hNDFi of numeric values that represent measure-

ments is often expressed in the form

hNDFi ::= ( hBaselinei±hSemiUncertaintyi )

where

hBaselinei ::= [ hSigni ] 0.hNonZeroDigiti { hdigiti }

hSemiU ncertaintyi ::= 0.{0}hNonZeroDigiti

3.2 Baseline, Decimal Signiﬁcance, and

Uncertainty Amplitude

A numeric data item with its associated uncertainty,

once represented using a normalised decimal format

(e.g., N = 0.12379 ×10

−4

±0.0004 ×10

−4

) is char-

acterised by different elements, such as

Intrinsic Indicators for Numerical Data Quality

343

• the baseline (0.12379 in this case),

• the number of decimal digits in its decimal part (5

for baseline 0.12379, as in this case),

• the uncertainty half amplitude expressed as a frac-

tional decimal part (0.0004 in this case),

• the order of magnitude (10

−4

in this case).

Uncertainty Set. Following the above grammar, a

numeric data element in NDF-compliant form can be

more concisely represented as

N = {V ±u}×10

hExpi

(1)

where V is the numeric value of the element, u is the

(numeric) uncertainty associated with such value and

hExpi is the exponent of the power of ten used to

represent the ‘normalised’ magnitude of both V and

u. Moreover, {V ±u} is explicitly used to indicate

that the numeric data element N is actually the set

of all possible rational values in the discrete interval

(V −u,V + u), where the distance between any two

elements of this set is given by the unit value of the

least signiﬁcant digit in u. The ﬁnite set of all possible

V values in the interval (V −u,V + u) deﬁnes the un-

certainty set of N. In other words, N can take any

u-dependent value in the above interval. For instance,

if N = {0.12379 ±0.00004}×10

−4

, the unit value of

the least signiﬁcant digit in u is 0.00001 and N = {V ±

u} = {0.12375, 0.12376,0.12377, 0.12378,0.12379,

0.12380,0.12381,0.12382,0.12383}×10

−4

The order of magnitude of both the baseline and

the uncertainty tends not to play a big part in the

numerical data quality metrics introduced and dis-

cussed later in this paper if the numerical elements

considered are all expressed using the normalised

decimal format. In fact, these quality metrics are

either based on ratios between V and u or on the frac-

tional decimal part of u, which means that the order of

magnitude of NDF-compliant numerical data can be

ignored altogether. Here and in the following we will

thus use examples where a numeric element is repres-

ented in the simpler form N = {V ±u}, thus avoiding

having to show the order of magnitude unless strictly

necessary in speciﬁc contexts.

Extension to Datasets. Our discussion so far has

focused on datasets where each element is implicitly

composed of one value with associated uncertainty.

We need to consider what changes are necessary if a

dataset is a series of pairs of numeric values rather

than a sequence of single numbers. Extending for-

mula 1, it is possible to represent each paired element

in the series as

= (C

where

= {V

±u

}×10

hExp

= {V

±u

}×10

hExp

(2)

and where the meaning of symbols in this formula is

clear. Note that while hExp

i is likely to be the same

for all elements (such that ∀k : hExp

i= hExp

i) and,

independently, hExp

i – where for any two values k,

k’, hExp

i = Exp

– is likely to have the same value

for all x

(such that ∀k : hExp

i = hExp

i), in general

hExp

i 6= hExp

Dealing with a numerical series in terms of evalu-

ating and measuring data quality means that the con-

siderations introduced in the following section will

have to be separately applied to the independent and

to the dependent variable in the series.

4 DATA QUALITY INDICATORS

The big challenge of all methods and frameworks

introduced to evaluate numerical data quality is the

identiﬁcation of suitable quality indicators that can

be used along each (intrinsic) numerical data qual-

ity dimension as the basis for a corresponding set of

metrics. Such indicators would make it possible to

compute ‘quality values’ that can be used to position

a given dataset element (as well as the whole data-

set) along a numerical data quality scale. The qual-

ity level associated to any such value would necessar-

ily be entirely context-dependent. This is because the

very same numerical element or dataset could be ﬁt

for purpose (and thus deemed of ‘suitable’ quality) in

a speciﬁc computational scenario, while it could not

be so (and thus be deemed of ‘unsuitable’ quality) in

another, different scenario.

For the purpose of identifying which paramet-

ers should be used as indicators to compute numer-

ical data quality, let us consider a numerical data-

set S = {N

,. .. ,N

} composed of N elements,

where the k-th element N

is such that ∀k = 1, 2,..., N :

= {V

±u

}. We focus on this straightforward

dataset structure for sake of simplicity. In any case,

it is implicit that the set S = {N

} corresponds to

the sampling of some variable V at regular intervals

along a discretised, implicit x axis, so that in reality

= Φ(x

However, if (as in most cases) the distance

between any two values of the sampling abscissa is

a constant for a given dataset, i.e., ∀k : x

−x

k−1

= c,

we can avoid explicitly considering abscissa values x

IoTBDS 2020 - 5th International Conference on Internet of Things, Big Data and Security

344

and we can thus refer to element values in an abridged

way as V

directly rather than as V (x

). Considering a

dataset as a set of elements N

= {V

)±u

} would

add extra complexity due to the handling of two dis-

tinct variables; there is no need to do so explicitly ex-

cept for those datasets where sampling occurs at ir-

regular intervals.

4.1 Intrinsic Quality Q

Once a numerical element and its uncertainty are ex-

pressed in NDF as a set {hBaselinei±hUncertaintyi}, it is

possible to use the baseline-to-uncertainty ratio to

deﬁne the intrinsic quality Q of that element. In other

words, if N

= {V

±u

}, then its intrinsic quality can

be mathematically expressed as

Q (N

) = log

·|V

−

(3)

This formula captures the dependencies on uncer-

tainty and on the distance between an element value

and the corresponding value on the best ﬁt regression

curve for the dataset (whatever this curve may be).

Notably, the intrinsic quality of an element:

• Is inversely proportional to uncertainty. A nu-

merical data element with no uncertainty (i.e., an

inﬁnitely precise element) is characterised by an

inﬁnite intrinsic quality. However, intrinsic data

quality in our context is only deﬁned for numer-

ical dataset elements with a non-null uncertainty.

• Is inversely proportional to the distance between

the actual element value and the value of the cor-

responding element laying on the best ﬁt regres-

sion curve.

• Can be approximated using the simpler formula

Q (N

) = log

| if |V

−

|  1

The intrinsic quality Q (S) associated with a set

S = {N

,. .. ,N

} of M numerical values, each

characterised by its own uncertainty and expressed in

NDF, is deﬁned as the sum of the individual quality

values of each element N

in the set, i.e.

Q (S) =

∑

k=1

Q (N

) (4)

The intrinsic quality of a numerical dataset ele-

ment with associated uncertainty is an indicator that

can be used as the basis of accuracy metrics. One

could be led to conclude that, as intrinsic quality is

not only just based on both dataset element value and

uncertainty but on their ratio, this is by far the best

possible option in the circumstances. However, such

conclusion is not completely true, as other indicators

are needed to perform a more comprehensive dataset

quality assessment from different perspectives.

4.2 Indicators based on Value Only:

Distance-based Quality Factor

So far we have deﬁned numerical data quality indic-

ators based on value-uncertainty pairs. However, it is

possible to introduce a value-only indicator, namely

the distance-based quality factor Q

, which is pro-

portional to the distance between a dataset element

and the corresponding point on the best ﬁt regression

curve for this dataset. For sake of simplicity, let us

consider a dataset S = {N

} of M elements, where

∀k : N

= {V

±u

} and l

= ϕ(S) is the corres-

ponding best ﬁt regression curve. In formulae, the

distance-based quality factor is thus deﬁned as

) = log

·(V

−

(5)

Note that Q

• Is very similar to the intrinsic quality Q . How-

ever, differently from Q , Q

does not include any

reference to uncertainty u.

• Is inversely proportional to the distance between

the actual element value, namely V

and the value

of the corresponding element laying on the best ﬁt

regression curve, namely

• Can be approximated using the simpler formula

) = log

| if |V

−

|  1

The distance-based quality factor associated with

a set S = {N

,. .. ,N

} of M numerical values is

deﬁned as the sum of the individual Q

) values

associated with each element in S, i.e.

(S) =

∑

k=1

) (6)

4.3 Indicators based on Uncertainty

Only: Information Entropy E

To identify a different kind of numerical data qual-

ity indicator, let us consider uncertainty only. We

noted in Section 3.2 that the uncertainty set associ-

ated to a numeric element is expressed using a single

signiﬁcant digit (e.g.., E = {0.9783 ±0.005} rather

than E = {0.9783 ±0.0027}). Once represented in

NDF (e.g., u = {±0.003} in this case), uncertainty u

speciﬁes both the accuracy magnitude of a numeric

element value and the amplitude of its inaccuracy

(see below). For instance, in the above example, the

NDF-normalised order of magnitude for accuracy is

−3

, which represents the uncertainty unit. Once u

is taken into account, the actual value of E can be any

of the numbers in the set deﬁned by its uncertainty

Intrinsic Indicators for Numerical Data Quality

345

amplitude, i.e., E ∈{0.9780,.. ., 0.9783,. .. ,0.9786}.

This is denoted as the uncertainty set associated to the

numerical element E.

For each numerical element with uncertainty ex-

pressed as E = {V ±u}, we can deﬁne the measure

µ of the corresponding uncertainty set as the num-

ber of elements in this set. This measure, which

explicitly depends on the uncertainty amplitude, can

be easily calculated from the single signiﬁcant (i.e.,

non-zero) digit of the corresponding inaccuracy amp-

litude, i.e. µ = hNonZeroDigiti×2 + 1 For example, if

{0.9783 ±0.003}, then µ

= 3 ×2 + 1 = 7.

We can now deﬁne the notion of information en-

tropy E associated with a number N as

E (N) = log

(µ

) (7)

For instance, let us consider a number N

= 0.34 ±

0.05. In this case, µ

= 5 ×2 + 1 = 11 and the asso-

ciated entropy is thus E (N

) = log

(11) = 1.0414.

Although information entropy is a property

deﬁned on individual data with associated uncer-

tainty, it can be extended to entire datasets. The in-

formation entropy associated with a set of M numer-

ical values S = {N

,. .. ,N

} – each characterised

by its own uncertainty and expressed in NDF – is

deﬁned as the sum of the individual information en-

tropy values associated with each Nk in the set, i.e.

E (S) =

∑

k=1

E (N

) (8)

If a number N is inﬁnitely accurate (i.e., if u = 0),

then its uncertainty set is a singleton set, which has

one element only. Hence, the associated information

entropy is E (N) = log

(1) = 0.

Information entropy was ﬁrst introduced in

(Shannon, 1948) in the context of communications

based on the transmission of analog electric signals.

In our context, however, information entropy is con-

sidered as a quantitative measure of numerical data

quality. More speciﬁcally, it is a measure of data

accuracy, consistency, completeness, and precision,

namely the four intrinsic numerical data quality di-

mensions described in Section 2 of this paper.

5 EVALUATION

This section focuses on the evaluation of the metrics

(the data quality indicators) introduced and discussed

in the previous section. To guarantee consistency and

to ensure that the assessment of the proposed metrics

is unbiased, we used a perfect sinusoidal wave as the

input in our computational workﬂow. This type of in-

put was chosen as it models many different physical

phenomena such as the propagation of elastic (seis-

mic) vibrations, sound and light.

The sinusoidal dataset used in the simulation ex-

periments described in this paper has a rate of 44100

and a frequency of 44100 Hz with volume at 100%.

Figure 3 shows a time representation of the generated

wave. Figures 1 and 2 show the experiment runs.

Each experiment is designed to run 100 times, with

each run consisting of 100 iterations, with a grand

total of 10000 runs for each experiment. To test the

results of the data quality indicators introduced in this

paper, we devised 3 different test cases:

• Introduce 100 | 1000 | 10000 gaps

• Introduce 100 | 1000 | 10000 outliers

• Introduce 100 | 1000 | 10000 gaps and outliers

The experiments with synthetic sinusoidal wave

data that are described in this paper are intentionally

designed for simplicity. However, more advanced ex-

periments with real datasets (e.g., using Distributed

Temperature Sensing data in Oil & gas production

wells) are currently being performed. These will be

introduced and discussed in future publications. Each

of our current experiments provides a variety of in-

dicators: Average (AVG), Standard Deviation (STD)

and the three indicators introduced in section 4. Both

AVG and STD are introduced as a tool to verify and

validate the effectiveness and the correctness of our

algorithms that artiﬁcially introduce defects (namely,

gaps and outliers) in the ‘clean’ dataset. Such al-

gorithms are not discussed in this paper. In our ﬁrst

experiment we only introduced gaps in the dataset. As

a result, both AVG and STD are gradually increasing

in parallel with the increase in the number of gaps.

The difference in results is due to the fact that gaps

are randomly generated, replacing any value in the

dataset with one characterised by an almost inﬁnite

uncertainty. The results show that two of the indicat-

ors introduced in Section 4, namely the intrinsic qual-

ity Q and the distance-based quality factor Q

both

decrease in value following the introduction of more

data gaps. This is to be expected, as the effect of gaps

can be either represented by inﬁnite uncertainty or by

an out-of-scale value if uncertainty is not used as a

parameter in the considered numerical data quality in-

dicator. The Information Entropy metric value in all

3 experiments increases if each gap is considered as a

number with an inﬁnite uncertainty, so that it cannot

be precisely localised any more. It should be noted

that the theoretical treatment of a gap as a dataset

point whose uncertainty becomes inﬁnite or, altern-

atively, as a dataset point whose value becomes out of

bound, is a theoretical artiﬁce to keep the number of

dataset elements unchanged during the experiments.

IoTBDS 2020 - 5th International Conference on Internet of Things, Big Data and Security

346

0 2000 4000 6000 8000

25000

30000

35000

40000

45000

Distance Quality Factor

Intrinsic Quality Factor

Information Entropy

Run No.

Indicator

Figure 1: 1000 gaps Metrics.

0 2000 4000 6000 8000

25000

30000

35000

40000

45000

Distance Quality Factor

Intrinsic Quality Factor

Information Entropy

Run No.

Indicator

Figure 2: 1000 Gaps & Outliers Metrics.

2000 4000 6000 8000

100

Time

Amplitude

Figure 3: Sinewave.

This is because any methodologically sound compar-

ison should be done in like-for-like conditions. In our

case, this means that once a single element gap is in-

troduced the corresponding dataset should not lose an

element and thus become smaller. Simply, the ele-

ment is retained for both count and indicator purposes

but its parameters (uncertainty or value) are changed

to reﬂect the practical effect of generating a gap.

In the second experiment we introduced only out-

liers in the dataset. An outlier is deﬁned as an ele-

ment whose value is at least three times bigger than

that of its immediate left and right neighbours in the

ordered series of dataset elements. The outlier is thus

represented as a number that is 3 ×σ where σ is the

base case standard deviation. This deﬁnition of out-

lier is purely arbitrary but it is rooted in the empirical

analysis of production data in a variety of industries,

where time series are assessed for temporal stability

and random sensorial malfunctionings lasting frac-

tions of a second happen frequently in extreme en-

vironmental conditions. For instance, thermal sensors

located deep in hydrocarbon production wells, where

temperature can easily reach 100C, sometimes ‘go

crazy’ and provide a single measured value that is at

least three or four times higher. Here too the value

and the position of outliers are generated on a random

basis. As in our previous experiment, the defect is ac-

curately indicated by all of the numerical data quality

indicators. In this experiment we observed that in all

3 test cases information entropy only changed a little.

On the other hand the intrinsic quality factor and the

distance quality factor were indicating much lower

values. This is consistent with the conceptual treat-

ment of an outlier as an element whose uncertainty

is widely increased and whose value differs substan-

tially from that of the corresponding element on a best

ﬁt regression curve. A big difference between ac-

tual outlier value and corresponding best ﬁt regression

point value lowers both Q and Q

. A substantial un-

certainty increase (which is conceptually necessary so

that the element still remains compatible with the ori-

ginal trend despite its increased distance from the re-

gression curve) increases information entropy on the

basis of an increase in the order of magnitude, which

is addressed by the logarithmic information entropy

formula.

The third experiment combines the ﬁrst two, cre-

ating 100, 1000 or 10000 combined gaps and outliers

respectively in the dataset. In terms of AVG and STD,

the introduction of these defects follows the same pat-

tern described in the previous two paragraphs above.

The results of this experiment are quite conclusive as

both the distance quality factor and the intrinsic qual-

ity factors show a signiﬁcant decrease in indicator val-

ues with the introduction of more defects. The value

of the information entropy indicator, on the other side,

increases with the increasing number of defects in-

troduced. The quantitative difference (and the spread

of values per run) depends on the structure of the in-

dicator. Information entropy is based on the size of

the element E

uncertainty set; this increases substan-

tially in case of gaps or outliers - by an large fol-

lowing the progression 10 – 100 – 1000 if µ(E

) in-

creases more than one order of magnitude. This ex-

plains the higher E values with respect to Q or Q

The distance-based quality factor only has one para-

meter (the distance between an element and the cor-

responding one on a best ﬁt regression curve), so it

is characterised by less noise than Q , which has two

paramenters (distance and uncertainty) and thus two

degrees of freedom that allow a higher noise level in

the various experiment rounds.

6 CONCLUSIONS

Our results indicate that numerical data quality can

successfully be measured with the indicators intro-

Intrinsic Indicators for Numerical Data Quality

347

duced and described in this paper. All indicators

were successfully showing the effect of gaps and out-

liers in reducing data quality. Information entropy

E shows how data quality worsens with each experi-

ment that results in more noise. Intrinsic data quality

Q and distance-based data quality Q

show how dis-

tance from the best ﬁt curve impacts on overall qual-

ity, whether uncertainty is explicitly considered or

not. The proposed data quality indicators can be con-

sidered as the initial data quality assessment step for

a numerical dataset, serving as a precondition to any

subsequent data manipulation. By knowing what is

wrong with the evaluated dataset, appropriate clean-

ing and improvement techniques can be applied, or

in case of low and non-improvable quality indicators,

the dataset can be deemed as unreliable.

ACKNOWLEDGEMENTS

This research is funded by EPSRC Doctoral Training

Partnership 2016-2017 University of Aberdeen with

award number: EP/N509814/1

REFERENCES

Batini, C. et al. (2009). Methodologies for data quality as-

sessment and improvement. ACM Computing Surveys,

41(3):1–52.

Batini, C. and Scannapieco, M. (2006). Data Quality: Con-

cepts, Methodologies and Techniques (Data-Centric

Systems and Applications). Springer.

Cai, L. and Zhu, Y. (2015). The challenges of data quality

and data quality assessment in the big data era. Data

Science Journal, 14(0):2.

De Amicis, F., Barone, D., and Batini, C. (2006). An analyt-

ical framework to analyze dependencies among data

quality dimensions.

Deming, W. E. (1991). Out of the crisis, 1986. Massachu-

setts Institute of Technology Center for Advanced En-

gineering Study iii.

Dobyns, L. and Crawford-Mason, C. (1991). Quality or

else: The revolution in world business. Regional Busi-

ness, 1:157–162.

Group, D. U. W. (2013). The six primary dimensions for

data quality assessment. DAMA UK.

Hufford, D. (1996). Data warehouse quality. DM Review,

January.

Juran, J. (1989). Juran on leadership for quality: An exec-

utive handbook.

Li, L. (2012). Data quality and data cleaning in database

applications. PhD thesis, Napier University, Edin-

burgh.

Loshin, D. (2001). Dimensions of data quality. Enterprise

Knowledge Management, page 101–124.

Marev, M., Compatangelo, E., and Vasconcelos, W. W.

(2018). Towards a context-dependent numerical data

quality evaluation framework. Technical report, Com-

puting Sci. Dept., University of Aberdeen.

Olson, M. H. and Lucas Jr, H. C. (1982). The impact of

ofﬁce automation on the organization: some implic-

ations for research and practice. Communications of

the ACM, 25(11):838–847.

Pipino, L. L., Lee, Y. W., and Wang, R. Y. (2002). Data

quality assessment. Communications of the ACM,

45(4):211.

Redman, T. C. (2008). Data Driven: Proﬁting from Your

Most Important Business Asset. Harvard B. R. P.

Shannon, C. E. (1948). A mathematical theory of commu-

nication. The Bell System Technical Journal, 27(3).

Swanson, E. B. (1987). Information channel disposition and

use. Decision Sciences, 18(1):131–145.

Todoran, I.-G., Lecornu, L., Khenchaf, A., and Caillec, J.-

M. L. (2015). A methodology to evaluate important

dimensions of information quality in systems. JDIQ.

IoTBDS 2020 - 5th International Conference on Internet of Things, Big Data and Security

348