MODELING AND MONITORING THE QUALITY OF DATA
BY INTEGRITY CONSTRAINTS AND INTEGRITY CHECKING
Hendrik Decker
Instituto Tecnológico de Informática, Valencia, Spain
Keywords:
Quality, Integrity constraints, Integrity checking, Inconsistency tolerance.
Abstract:
Characteristic attributes for describing the quality of data, such as trustworthiness, soundness, riskiness,
uncertainty, dependability, reliability and other semantic properties can be modeled and monitored by conventional
database integrity technology. As opposed to traditional consistency constraints, occasional violations of some
of the integrity conditions that describe quality aspects may be tolerable, even for extended periods of time.
Traditional integrity checking methods are intolerant wrt. any constraint violation. They insist that all con-
straints are totally satisfied before updates can be checked for integrity preservation. Inconsistency-tolerant
methods can waive that insistence. Thus, if data quality is modeled by constraints, it can be monitored by any
integrity checking method that is inconsistency-tolerant. We illustrate that by an extended example, by which
inconsistency-tolerant integrity checking is also compared to some alternative approaches.
1 INTRODUCTION
In relational databases, first-order predicate logic sen-
tences called assertions, integrity constraints or sim-
ply constraints are used to express conditions that
are required to be invariantly satisfied across state
changes caused by updates.
In knowledge bases and decision support systems,
the expressive power of logic also can be used to cap-
ture any other semantic information that goes beyond
the simple structures of common database content. In
particular, conditions for characterising the quality of
data can be expressed as assertions.
The basic motivation and idea behind this paper is
that the database should not suffer from semantically
imperfect data, and that stored data can be considered
to have sufficient quality if the database satisfies suit-
ably many quality constraints. Otherwise, if a critical
amount of the required quality assertions is violated,
then the quality of the database is impaired or dam-
aged beyond tolerable proportions.
The evaluation of assertions tends to be pro-
hibitively expensive. Thus, for the interpretation of
assertions as integrity constraints, specific integrity
checking methods for simplifying their evaluation are
used. However, since the semantics of quality asser-
tions is different from that of integrity constraints, the
use of integrity checking methods for simplifying the
evaluation of quality assertions is questionable.
Traditional integrity checking insists on total con-
straint satisfaction. That is not suitable in general
for monitoring quality assertions, since some of them
may be occasionally violated, even for extended pe-
riods of time, without impairing ongoing routine op-
erations. As opposed to that, we are going to see that
integrity checking methods that are able to tolerate ex-
tant violations of constraints also are able to monitor
the dynamics of the quality of data.
More precisely, we show how to gain a better
control over the quality and possible imperfections
of stored data, by expressing quality properties as
constraints, and monitoring them with inconsistency-
tolerant integrity checking methods. Conditions that
model quality properties may qualify data positively,
e.g., as trustworthy, secure or healthy, or negatively,
as imperfect or risky, uncertain or vague, etc. If each
such property is satisfied, there is no quality erosion
that would violate the constraints. Conversely, viola-
tion of quality properties means that the data that are
responsible for violation qualify as corrupt.
Capturing quality properties of data by describing
them in the form of integrity constraints yields a dou-
ble benefit: Firstly, to use the expressive power of the
syntax of semantic integrity constraints also for de-
scribing arbitrarily general quality properties of data.
Secondly, to make use of established integrity check-
Decker H. (2009).
MODELING AND MONITORING THE QUALITY OF DATA BY INTEGRITY CONSTRAINTS AND INTEGRITY CHECKING.
In Proceedings of the 4th International Conference on Software and Data Technologies, pages 207-214
DOI: 10.5220/0002264602070214
Copyright © SciTePress
ing methods in order to efficiently check data also for
quality. Thus, monitoring and controlling stored and
incoming new data and updates with regard to their
quality can be achieved.
In section 2, we first strive to gain a better under-
standing of the similarities and differences between
integrity and quality. Then, we claim that, in spite
of seemingly severe differences, it is possible to cap-
ture conditions for quality by integrity constraints,
and to monitor them by using methods for integrity
checking. This claim is substantiated in the remain-
der. In section 3, we recapitulate the concept of
inconsistency-tolerant integrity checking (Decker and
Martinenghi, 2006; Decker, 2008). (Inconsistency
here is synonymous to integrity violation.) We show
that it is precisely the inconsistency tolerance of methods
that makes them apt to be used for monitoring
quality. In section 4, we elaborate an extended ex-
ample that illustrates how inconsistency-tolerant in-
tegrity checking can be used for risk management.
The latter is a special case of managing the quality
of data. In section 5, we address related work. In
section 6, we conclude.
2 QUALITY AND INTEGRITY
In 2.1, we analyse the similarities, and in 2.2 the dif-
ferences between quality and integrity. Essentially,
integrity and quality are very similar since both can
be described by assertions. They differ since integrity
constraints and their evaluation traditionally are much
more exigent than quality assertions. However, as ex-
plained in 2.3, this difference is reconciled and can be
overcome by inconsistency-tolerant methods for in-
tegrity checking, with which also quality assertions
can be evaluated.
2.1 Similarities
Traditionally, integrity constraints are used to express
correctness conditions with which all stored data must
comply. Upon each issued update, the constraints im-
posed on the database are checked. Updates are com-
mitted only if they do not cause integrity violation.
For example, in a civil registry database containing
information about citizens and their marital status, in-
serting married(john,mary) violates
$\neg\exists x\,\exists y\,\exists z\,(\mathit{married}(x,y) \wedge \mathit{married}(x,z) \wedge y \neq z)$,
i.e., a constraint forbidding bigamy, if the tuple
married(john,susan) is already stored. Then, also the
integrity constraint
$\forall x\,\forall y\,(\mathit{married}(x,y) \rightarrow \mathit{person}(x,m) \wedge \mathit{person}(y,m))$
which requires that each spouse of each married cou-
ple is registered in the person table of the database
and has the marital status attribute set to m(arried),
will signal violation upon an attempt to delete the tu-
ple person(susan,m).
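To make the two constraints concrete, the following minimal Python sketch evaluates them over small in-memory relations. The set-of-tuples encoding and the function names `bigamy_cases` and `unregistered_spouses` are assumptions made for illustration only; they do not correspond to any particular integrity checking method.

```python
def bigamy_cases(married):
    """Return all (x, y, z) with married(x, y), married(x, z) and y != z."""
    spouses = {}
    for x, y in married:
        spouses.setdefault(x, set()).add(y)
    return {(x, y, z)
            for x, ys in spouses.items()
            for y in ys for z in ys if y != z}


def unregistered_spouses(married, person):
    """Return every spouse that lacks a person entry with marital status 'm'."""
    return {s for couple in married for s in couple if (s, "m") not in person}


married = {("john", "susan")}
person = {("john", "m"), ("susan", "m"), ("mary", "s")}

# Inserting married(john, mary) would violate the constraint forbidding bigamy.
after_insert = married | {("john", "mary")}
print(sorted(bigamy_cases(after_insert)))

# Deleting person(susan, m) would violate the constraint on registered spouses.
after_delete = person - {("susan", "m")}
print(sorted(unregistered_spouses(married, after_delete)))
```

An update would be committed only if both functions return the empty set for the updated state.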
Also conditions for characterising database entries
that lack quality (i.e., data that are risky, uncertain,
precarious, dubious, suspicious, etc.), as well as data that
are devoid of such deficiencies (i.e., data that are trust-
worthy, dependable, reliable, credible, etc) can be ex-
pressed in the syntax of integrity constraints. Obvi-
ously, datalog negation can be used to convert posi-
tive into negative qualities and vice-versa.
The constraint $\mathit{uncertain} \leftarrow \mathit{person}(x,\mathit{null})$, for
instance, qualifies each entry in the person table as
uncertain by a namesake 0-ary predicate if the marital
status of that entry is unknown, as represented by a
null value. Another example is the constraint
$\forall x\,(p(x,\mathit{null}) \rightarrow p(x,s) \vee p(x,m) \vee p(x,d) \vee p(x,w))$
where p abbreviates person. It says that each person
with unknown marital status is either single or married
or divorced or widowed. Similarly, entries of persons
with birth date before the 20th century can be char-
acterized as dubious. By amalgamating higher-order
predicates into first-order terms (Bowen and Kowal-
ski, 1982), also sentences such as
$\mathit{confidence}(\mathit{row}(x,y),z) \wedge z < \mathit{th} \rightarrow \neg\,\mathit{trustworthy}(x)$
may serve as constraints for disqualifying the trustworthiness
of rows x in database tables y such that
the confidence value z of x is below a certain thresh-
old value th (which can be thought of as a constant
parameter or the output of some evaluable function).
Thus, it is possible to characterize uncertain data
in the syntax of integrity constraints. Analogously,
this can also be done for any semantic properties that
capture some other quality aspects. Hence, it should
also be possible to use integrity checking methods in
order to check incoming data for violations of quality
constraints, and, symmetrically, check deletions for
having the effect of impairing the quality of the re-
maining data. In the following subsection, we are go-
ing to see that this is not as straightforward as it may
seem at first glance.
2.2 Differences
In 2.1, we have seen that the representation of prop-
erties describing logical consistency or quality is very
similar. Both can be modeled by integrity assertions.
However, there is a significant difference between
quality and integrity. Data that lack integrity are not
just uncertain or dubious, but definitely bad, while
data the quality of which is compromised may or
ICSOFT 2009 - 4th International Conference on Software and Data Technologies
may not have integrity. Essentially, the difference is
that integrity is two-valued (i.e., satisfied or violated),
while quality is not binary, or at least not in the sense
that imperfect data with impaired quality would al-
ways be invalid. In fact, data that violate integrity are
usually considered useless, harmful and unwanted,
while data that lack quality can still be useful.
For example, the integrity constraint
$\mathit{violated} \leftarrow \mathit{emp}(x),\ \mathit{age}(x,y),\ y < 14$
expresses that integrity is violated by underage em-
ployment (because there is a law by which this con-
straint is enforced), while the assertion
$\mathit{dubious} \leftarrow \mathit{emp}(x),\ \mathit{age}(x,y),\ y > \text{retirement age}$
expresses that overage employment qualifies as dubi-
ous (since, though not forbidden, it may contradict an
employer’s general policy that a person beyond retire-
ment age would remain employed). Another example
is the integrity constraint
$\mathit{violated} \leftarrow \mathit{email}(x),\ \mathit{sent}(x,y),\ \mathit{received}(x,z),\ y > z$
which declares that integrity is violated if the sent-
date of an email item is after its received-date (assuming
that both y and z are normalised wrt. the same time
zone). By contrast, the formula
$\mathit{suspect}(x) \leftarrow \mathit{email}(x,\mathit{from}(y)),\ \sim\mathit{authenticated}(y)$
rates an email item x received from y as suspect if the
latter cannot be authenticated, although the message
content of x may well be valid and unproblematic.
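The difference between the two kinds of assertions about employee age can be sketched in Python as follows. The numeric value of the retirement age and the function name `check_employee` are hypothetical; the text leaves both unspecified.

```python
RETIREMENT_AGE = 65  # hypothetical value, chosen only for illustration


def check_employee(age):
    """Classify an employee record by the two assertions about age."""
    if age < 14:
        return "violated"  # integrity constraint: such data are unwanted
    if age > RETIREMENT_AGE:
        return "dubious"   # quality assertion: flagged, but possibly still useful
    return "ok"


for age in (12, 40, 70):
    print(age, check_employee(age))
```

The sketch returns three outcomes rather than two, which mirrors the observation that data of impaired quality need not lack integrity.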
Although the same syntax can be used to represent
conditions for integrity and quality, the use of known
integrity checking methods for checking the quality of
data must be deemed problematic, if not unfeasible,
for the following reason.
All methods for efficient integrity checking insist
that integrity must be satisfied before a given update
is checked for integrity. That way, the evaluation of
constraints can focus on the relevant part of the data
that are actually affected by the update, while the
rest can be ignored, since it is known to satisfy in-
tegrity. It would be unrealistic, however, to generalise
that insistence on total integrity satisfaction by requir-
ing that, before each update, all stored data should
comply perfectly with all quality requirements. After
all, certain defects of quality can never be excluded
with complete certainty. Thus, not all data can be as-
sumed to be of pristine quality whenever an update
needs to be checked for quality preservation. Ex-
amples of databases where quality is not perfect but
most information is useful are given by each large the-
saurus or encyclopedia (think, e.g., of Wiktionary or
Wikipedia).
In principle, a way out of this dilemma could be
to use a method that does not insist on total integrity
before each update. The only method in the litera-
ture that does not require the total satisfaction of all
constraints is the so-called brute-force method. It ex-
haustively evaluates all constraints upon each update,
without any simplification. But brute-force evaluation
may be prohibitively expensive, due to the high
complexity of constraints. Another way out could be
to repair all violated constraints before or after each
update.
A more elegant and less expensive solution of us-
ing integrity checking methods for monitoring quality
assertions is presented in the following section.
2.3 Reconciliation
Inconsistency-tolerant integrity checking has been in-
troduced in (Decker and Martinenghi, 2006; Decker
and Martinenghi, 2008). In particular, it has been
shown that, contrary to common belief, many well-
known integrity checking methods, although not all
of them, can waive the requirement that each con-
sistency assertion be totally satisfied before updates
can be checked efficiently. An important feature of
inconsistency-tolerant methods is that none of their
functionality and efficiency is compromised by arbi-
trarily high amounts of extant constraint violations.
Since quality properties can be expressed by the
same syntax as constraints, it follows that integrity
checking methods can be used to check quality properties.
In particular, the use of inconsistency-tolerant
methods enables an efficient way of evaluating such
properties even if there are data that do not fully com-
ply with all quality constraints.
Nevertheless, inconsistency-tolerant methods are
capable of detecting and rejecting each impairment of
quality assertions upon each update, no matter if the
extant imperfections are minor shortcomings or ma-
jor corruptions of data. Thus, the task of improving
the quality of damaged data can be delegated to sepa-
rate, possibly off-line processes. Such processes may
be run at any convenient point of time. In particular,
they need not be run at update time, as required by
traditional integrity checking approaches.
3 INCONSISTENCY TOLERANCE
In this section, we recap the main definitions of incon-
sistency-tolerant integrity checking (Decker and Mar-
tinenghi, 2006; Decker, 2008). Unless specified oth-
erwise, we use terminology and notations that are
conventional in the databases community (see, e.g.,
(Ramakrishnan and Gehrke, 2003)).
Throughout, let ‘method’ always signify an in-
tegrity checking method. We assume that each con-
straint is represented in prenex form, i.e., an implicit
or explicit quantifier precedes a quantifier-free matrix.
This includes the two most common forms of repre-
senting a constraint, either as a denial (i.e., a clause
without head whose body is a conjunction of liter-
als) or in prenex normal form (i.e., quantifiers out-
ermost, negations innermost). An integrity theory is a
set of constraints. An update is a bipartite finite set of
database clauses to be inserted or deleted.
From now on, let the symbols D, IC, U, I and M
always denote a database, an integrity theory, an update,
a constraint and, resp., a method. We write $D^U$
to denote the updated database, and also refer to D
and $D^U$ as the old and the new state, respectively.
We assume that the semantics of D and IC is
given by a distinguished unique Herbrand model of
D. Thus, I is satisfied (violated) in D if I is true
(resp., false) in that model. As usual, IC is called
satisfied (violated) in D if each $I \in IC$ (resp., at least
one $I \in IC$) is satisfied (resp., violated) in D. For
convenience, we write D(IC) = true and D(I) = true
for denoting that IC or, resp., I is satisfied in D, and
D(IC) = false (D(I) = false) that it is violated. ‘Con-
sistency’ and ‘inconsistency’ are synonymous with
‘satisfied’ and, resp., ‘violated’ integrity.
Each correct method M can be formalized as a
mapping that takes as input a triple (D,IC,U) such that
D(IC) = true, and outputs upon termination either ok
or ko. Here, ok means that M accepts U because U
does not violate any constraint, and ko means that M
does not accept U. For inconsistency-tolerant meth-
ods, the premise D(IC) = true can be waived without
penalty. For simplicity, we only consider input triples
(D,IC,U) such that the computation of M (D,IC,U)
terminates. In practice, that can always be achieved
by a timeout mechanism with output ko.
Each constraint I can be conceived as a set of
particular instances, called 'cases', of I, such that I
is satisfied iff all of its cases are satisfied. Thus, in-
tegrity maintenance can focus on satisfied cases, and
check if their satisfaction is preserved across updates.
Violated cases can thus be tolerated and possibly
dealt with at any moment that is more convenient.
That is captured by the following definition.
Definition. Inconsistency-tolerant Integrity.
a) A variable x is called a global variable in I if x
is $\forall$-quantified in I and $\exists$ does not occur left of the
quantifier of x.
b) For a constraint I and a substitution ζ of its global
variables, let Iζ be obtained by replacing each global
variable in I by the term assigned to it in ζ. Each such
Iζ is called a case of I.
c) Let SC(D,IC) denote the set of all cases C of all
$I \in IC$ such that D(C) = true, i.e., C is satisfied in D.
d) M is called inconsistency-tolerant if, for each
triple (D,IC,U), the output M(D,IC,U) = ok entails
that $D^U(C)$ = true, for each $C \in SC(D,IC)$.
In words, the definition above means: If an
inconsistency-tolerant M accepts an update without
insisting that each constraint be satisfied before the
update, then the output ok guarantees that each case
of IC that was satisfied in D remains satisfied in $D^U$.
Example. For relations p, q, let the second
column of q be subject to the foreign key constraint
$I = \forall x,y\ \exists z\,(q(x,y) \rightarrow p(y,z))$, which references
the primary key column of p, constrained by
$I' = \leftarrow p(x,y),\ p(x,z),\ y \neq z$. The global variables of
I are x and y; all variables of $I'$ are global. For
U = insert q(a,b), a typical method M only evaluates
the simplified basic case $\exists z\, p(b,z)$ of I. If,
for instance, (b,b) and (b,c) are rows in p, M outputs
ok, ignoring all irrelevant violated cases such
as, e.g., $\leftarrow p(b,b),\ p(b,c),\ b \neq c$ and $I'$, i.e., all extant
violations of the primary key constraint. M is
inconsistency-tolerant if it always ignores irrelevant
violations. M outputs ko if there is no tuple matching
(b,z) in p.
It is easy to see that inconsistency-tolerant in-
tegrity checking significantly generalizes the tradi-
tional approach, which does not legitimize the use
of methods in the presence of extant constraint vio-
lations.
As shown in (Decker and Martinenghi, 2006,
2008), many known methods for integrity check-
ing are inconsistency-tolerant. The reasoning of
inconsistency-tolerant methods, and also of meth-
ods that are non inconsistency-tolerant, is featured at
length in subsection 4.2.
4 RISK MANAGEMENT
In this section, we illustrate how to use the evaluation
of assertions by integrity checking methods for mon-
itoring and managing risks.
4.1 Risks and Quality
A risk is a negative quality. Its positive counterpart
may be characterized by properties such as security,
dependability, reliability, safety and the like. Risks
can often not be totally excluded, while it is always
requisite to minimize and control them, so that their
probability does not increase.
Of course, the amount of tolerable risk depends
on the application and often also on its users (think,
e.g., of stock market transactions). The example elab-
orated in 4.2 is open to interpretation. By assign-
ing convenient meanings to predicates, it could be in-
terpreted as a risk model of, e.g., financial services
(think, e.g., of Basel II), or a nuclear power plant.
4.2 An Extended Example
Of course, a single example can always be criticized
as statistically irrelevant. However, for each of the
mentioned alternatives, several typical features that
are independent of the particular example are illus-
trated. In particular, we are going to see that, for
safety-critical applications, the use of a method that
is inconsistency-tolerant is more dependable than the
use of one that is not. Our example will show that
using a non-inconsistency-tolerant method for moni-
toring risks may have fatal consequences.
We are going to compare inconsistency-tolerant
integrity checking with the following alternative ap-
proaches to monitor risk: brute-force evaluation, non-
inconsistency-tolerant integrity checking, repairing,
and consistent query answering (Arenas et al., 1999).
In detail, we address the following points 1) - 6).
1) The cost of the brute-force method.
2) The cost of inconsistency-tolerant methods.
3) The dependability of methods.
4) The cost of repairing the old state.
5) The cost of repairing the new state.
6) The risk of consistent query answering.
Let us consider a database D with the following
definitions of view predicates rl, rm, rh that model
risks of low, medium and, respectively, high degree.
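$\mathit{rl}(x) \leftarrow p(x,x)$
$\mathit{rm}(y) \leftarrow q(x,y),\ \sim p(y,x)$
$\mathit{rm}(y) \leftarrow p(x,y),\ q(y,z),\ \sim p(y,z),\ \sim q(z,x)$
$\mathit{rh}(z) \leftarrow p(0,y),\ q(y,z),\ z > \mathit{th}$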
In the clause defining rh, let th be an evaluable
threshold value that we assume to be always greater
than or equal to 0. Now, let the risks be denied as in the
following integrity theory:
$\mathit{IC} = \{\leftarrow \mathit{rl}(x),\ \leftarrow \mathit{rm}(x),\ \leftarrow \mathit{rh}(x)\}$.
Before populating D with facts about p and q, let
us verify that IC is satisfiable at all by any extension of
D. Indeed, it is, e.g., by each extension of p such that
no fact of the form p(0,y) is in p and any of the fol-
lowing alternatives holds: either p = q, or D contains
{q(2,1), p(1,2), p(2,1)} and arbitrarily many facts of
the form p(n,n+ m), for n > 1, m > 0.
Now, let the extensions of p and q be as follows.
p(0,0), p(0,1), p(0,2), p(0,3), ..., p(0,10000),
p(1,2), p(2,4), p(3,6), p(4,8), ..., p(5000,10000)
q(0,0), q(1,0), q(3,0), q(5,0), q(7,0), ..., q(9999,0)
Clearly, there is a single violated low-risk case in
D, which is caused by p(0,0). Let us make sure that
there is no other violated risk case in D, by trying to
refute each denial about rl, rm and rh.
First of all, there obviously is no other low-risk
cause of the form p(x,x) that would violate $\leftarrow \mathit{rl}(x)$.
Next, let us try to find an instance of the body of
the first clause of rm that would be true in D. Since
the second column of q is always 0, $q(x,0),\ \sim p(0,x)$
would have to be true. That, however, cannot be,
since $p(0,x) \in D$ for each x such that $q(x,0) \in D$.
For trying to find a satisfied instance of the body of
the second clause of rm, let e stand for an even number
greater than or equal to 0, o for an odd number greater than or
equal to 1, and n for any natural number greater than or equal
to 0. Further note that each p-fact in D is either of the
form p(0,e) or p(0,o) or p(n,2n), for $n \geq 1$. So,
depending on which p-fact matches the first literal, the only
possible instances of that clause which could make its body
true are of one of the following three forms:
$p(0,e),\ q(e,z),\ \sim p(e,z),\ \sim q(z,0)$
or
$p(0,o),\ q(o,0),\ \sim p(o,0),\ \sim q(0,0)$
or
$p(n,2n),\ q(2n,0),\ \sim p(2n,0),\ \sim q(0,n)$
Obviously, none of these instances can become
true: q(e,z) holds only for e = z = 0, in which case
$\sim p(0,0)$ is false; q(0,0) is true in D, so $\sim q(0,0)$ is false;
and q(2n,0) is false for each n > 0.
Last, the clause of rh: to make its body true would
require that 0 > th, but we have excluded that. Hence,
we have verified that $\leftarrow \mathit{rl}(0)$ is the only violated risk
case of IC in D, and that p(0,0) is its only cause.
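The manual verification above can be cross-checked by exhaustively evaluating the three views. The following Python sketch does so under stated assumptions: relations are encoded as sets of pairs, th is set to 0, and the negated literals in the two rm clauses are placed where the reasoning in this section requires them (see the comments). The function names `example_database` and `risk_views` are hypothetical.

```python
from collections import defaultdict


def example_database():
    """Extensions of p and q as listed in the text."""
    p = {(0, k) for k in range(10001)} | {(n, 2 * n) for n in range(1, 5001)}
    q = {(0, 0)} | {(o, 0) for o in range(1, 10000, 2)}
    return p, q


def risk_views(p, q, th):
    """Exhaustive evaluation of the views rl, rm and rh over all of p and q."""
    q_by_first = defaultdict(set)
    for a, b in q:
        q_by_first[a].add(b)
    # rl(x) <- p(x, x)
    rl = {x for x, y in p if x == y}
    # rm(y) <- q(x, y), not p(y, x)
    rm = {y for x, y in q if (y, x) not in p}
    # rm(y) <- p(x, y), q(y, z), not p(y, z), not q(z, x)
    rm |= {y for x, y in p for z in q_by_first.get(y, ())
           if (y, z) not in p and (z, x) not in q}
    # rh(z) <- p(0, y), q(y, z), z > th
    rh = {z for x, y in p if x == 0
          for z in q_by_first.get(y, ()) if z > th}
    return rl, rm, rh


p, q = example_database()
th = 0  # assumed threshold; the text only requires th >= 0
print(risk_views(p, q, th))                  # old state D
print(risk_views(p, q | {(0, 9999)}, th))    # state after insert q(0, 9999)
```

In the old state only the low-risk view is non-empty; after inserting q(0,9999), the medium- and high-risk views each contain the value 9999 (the latter provided that 9999 > th).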
Now, consider U = insert q(0,9999), for illustrat-
ing 1) - 6) above.
1) The cost of brute-force checking for any up-
date is high. That is a commonplace, but let us see in
some more detail to what brute-force evaluation of IC
amounts, for later comparison.
Evaluation of $\leftarrow \mathit{rl}(x)$ involves a scan of all of
p. Evaluation of $\leftarrow \mathit{rm}(x)$ involves joins of p and q,
a join of local p with remote q, plus possibly many
lookups in p and q. Evaluation of $\leftarrow \mathit{rh}(x)$ involves
a join of local p with remote q, plus the evaluation of
possibly many ground expressions of the form z > th.
With large extensions of p and q, the evaluation
steps outlined above may last too long, particularly if
safety-critical risks are monitored in real time. In the
following point, we shall see that it is far less expensive
to use an inconsistency-tolerant method that simplifies
the evaluation of integrity constraints by taking
the update into account and by limiting its focus on
the data that are affected by the update.
2) We are going to see that the cost of
inconsistency-tolerant integrity checking of U is
much lower than that of brute-force evaluation. But,
before we go into details, recall that the use of any
traditional method that insists on the satisfaction of
IC in the old state D is prohibited for the database in
our example, since D(IC) = false.
Typical simplification methods compile pre-
simplifications for update patterns at constraint
specification time. Thus, the cost of such pre-
simplifications at update time is nil. U matches the
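update pattern q(a,b), which in turn matches precisely
the following unfoldings of $\leftarrow \mathit{rm}$ by the two
clauses defining rm, and of $\leftarrow \mathit{rh}$, respectively.
$\leftarrow q(x,y),\ \sim p(y,x)$
$\leftarrow p(x,y),\ q(y,z),\ \sim p(y,z),\ \sim q(z,x)$
$\leftarrow p(0,y),\ q(y,z),\ z > \mathit{th}$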
Thus, the pre-simplifications compiled for the pattern
for insertions of facts of the form q(a,b) are as
follows.
$\leftarrow\ \sim p(b,a)$
$\leftarrow p(x,a),\ \sim p(a,b),\ \sim q(b,x)$
$\leftarrow p(0,a),\ b > \mathit{th}$
Substituting (a,b) by the inserted values (0,9999)
at update time yields the following simplifications.
$\leftarrow\ \sim p(9999,0)$
$\leftarrow p(x,0),\ \sim p(0,9999),\ \sim q(9999,x)$
$\leftarrow p(0,0),\ 9999 > \mathit{th}$
By a simple lookup of p(9999,0) for evaluating
the first of the three denials, it is inferred that $\leftarrow \mathit{rm}$ is
violated.
Since a medium risk has been detected, there is in
principle no need to continue checking the remaining
two simplified denials. However, we are going to do
that, in order to build a bridge to point 3).
Evaluating the second denial from left to right
amounts to the cost of answering the query $\leftarrow p(x,0)$.
The single answer is x = 0. Then, a lookup of
q(9999,0) succeeds, so the negative literal $\sim q(9999,0)$
fails. Hence, the second denial is true,
which means that there is no further medium risk.
Since p(0, 0) is true, the third denial turns out to
be violated if 9999 > th holds, thus indicating a high
security risk.
To summarize this point: Inconsistency-tolerant
integrity checking of U essentially costs a simple ac-
cess to the p relation. Moreover, if all constraints are
evaluated even after some violation has been detected,
only an additional simple lookup is needed. And,
perhaps more importantly, inconsistency-tolerant in-
tegrity checking prevents medium- and high-risk vio-
lations that would be caused by the update if it were
not rejected.
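The update-time work described in this point can be sketched in Python as follows. The sketch is an illustration under assumptions, not the implementation of any specific method: relations are sets of pairs, th is set to 0, the negated literals are placed as the reasoning in this section requires, and the function names are hypothetical. The scan over p in the second denial stands in for an index lookup on the second column of p.

```python
def example_database():
    """Extensions of p and q as listed in the text."""
    p = {(0, k) for k in range(10001)} | {(n, 2 * n) for n in range(1, 5001)}
    q = {(0, 0)} | {(o, 0) for o in range(1, 10000, 2)}
    return p, q


def violated_simplifications(p, q, a, b, th):
    """Evaluate the three simplified denials for U = insert q(a, b).

    Returns one boolean per denial; True means the denial is violated.
    The old state is never assumed to satisfy the integrity theory.
    """
    # <- not p(b, a): a single lookup in p
    first = (b, a) not in p
    # <- p(x, a), not p(a, b), not q(b, x), evaluated in the new state
    second = (a, b) not in p and any(
        (b, x) not in q and (b, x) != (a, b)
        for x, y in p if y == a)
    # <- p(0, a), b > th: a single lookup plus a comparison
    third = (0, a) in p and b > th
    return first, second, third


p, q = example_database()
th = 0  # assumed threshold; the text only requires th >= 0
print(violated_simplifications(p, q, 0, 9999, th))
```

For U = insert q(0,9999), the first and third denials are reported as violated and the second as satisfied, so the update would be rejected.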
3) Inconsistency-tolerant checking is dependable,
non-inconsistency-tolerant checking is not. This
claim is confirmed by considering the following kind
of reasoning, as performed by methods that are not
inconsistency-tolerant. Such methods are specified,
e.g., in (Gupta et al., 1994; Lee and Ling, 1996).
Since the p relation is not affected by U, the truth
value of the unfolding $\leftarrow p(x,x)$ of the constraint
$\leftarrow \mathit{rl}(x)$ is the same in the new state as in D. Since each method that is
not inconsistency-tolerant insists on the premise that
all constraints be satisfied in the old state, such methods,
when applied to our example, conclude that the
unfolded denial $\leftarrow p(x,x)$ is true in D and $D^U$, even
though $p(0,0) \in D$. That conclusion is then applied to
the third of the simplified unfoldings from 2), which
is reprinted below, for convenience.
$\leftarrow p(0,0),\ 9999 > \mathit{th}$
The subsumption-based reasoning of methods that
are not inconsistency-tolerant can be summarized as
follows: Applying the premise that $\leftarrow p(x,x)$ is satisfied
to $\leftarrow p(0,0),\ 9999 > \mathit{th}$ infers that the latter also
remains satisfied in $D^U$, because it is subsumed by
$\leftarrow p(x,x)$. Thus, non-inconsistency-tolerant integrity
checking wrongly concludes that the high risk constraint
$\leftarrow \mathit{rh}(z)$ is not violated in $D^U$.
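The contrast drawn in this point can be illustrated by a deliberately simplified Python sketch. It is a caricature of subsumption-based reasoning, not the algorithm of any of the cited methods; the function names and the value th = 0 are assumptions.

```python
def third_denial_tolerant(p, a, b, th):
    """Evaluate <- p(0, a), b > th by actually looking up p(0, a)."""
    return "ko" if (0, a) in p and b > th else "ok"


def third_denial_intolerant(p, a, b, th):
    """Same denial, but trusting the premise that <- p(x, x) holds in D."""
    if a == 0:
        # p(0, 0) is an instance of p(x, x), which the premise declares false,
        # so the denial is considered subsumed and is never evaluated.
        return "ok"
    return third_denial_tolerant(p, a, b, th)


p = {(0, 0), (0, 9999)}  # small excerpt of p containing the violating fact p(0, 0)
th = 0                   # assumed threshold; the text only requires th >= 0

print(third_denial_tolerant(p, 0, 9999, th))    # looks up p(0, 0) and rejects
print(third_denial_intolerant(p, 0, 9999, th))  # skips the lookup and accepts
```

The second function accepts the update only because it relies on a premise that is false in D.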
4) Now, we are going to see that repairing the old
state is costly. Recall that the traditional integrity
checking approach insists on total constraint satis-
faction in the old state. This means that all extant
violations need to be repaired before each update.
In general, the identification of all extant constraint
violations may already be very expensive in large
databases, and indeed unaffordable at update time.
Fortunately, however, there is only a single low-
risk constraint violation in our example, as we have
already seen before: p(0,0) is the only cause of the
only constraint violation $\leftarrow \mathit{rl}(0)$ in D. Thus, to repair
D means to first delete p(0, 0), and then check if that
preserves all assertions.
To check delete p(0,0) for integrity means to
check the simplified denials
$\leftarrow q(0,0)$
and
$\leftarrow p(x,0),\ q(0,0),\ \sim q(0,x)$
obtained from resolving $\sim p(0,0)$ with the bodies of
the two clauses defining rm, since precisely those two
clauses are affected by the deletion of p(0,0). Hence,
no constraint other than $\leftarrow \mathit{rm}(y)$ is potentially violated
by the intended repair.
Of the two simplified denials above, the second
one clearly is satisfied in $D^U$, since no fact of the form
p(x,0) remains in the database after p(0,0) is deleted.
However, the first one is violated, since q(0,0) is true
in $D^U$. Hence, another repair action is needed. The
obvious candidate is delete q(0,0).
To delete q(0,0) affects
$\mathit{rm}(y) \leftarrow p(x,y),\ q(y,z),\ \sim p(y,z),\ \sim q(z,x)$
and yields the simplified check of
$\leftarrow p(0,y),\ q(y,0),\ \sim p(y,0)$.
Obviously, this denial is violated by all facts in D
that are of the form p(0,o) and q(o,0), where o is an
odd number in the interval [1,9999]. Thus, to delete
q(0,0) for repairing the violation caused by deleting
p(0,0) causes the violation of each case of the form
$\leftarrow \mathit{rm}(o)$, for each odd number o in [1,9999].
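The cascade of repair actions can be reproduced with a short Python sketch. It assumes the set-of-pairs encoding of p and q and the placement of negated literals that the reasoning in this section relies on; the function names `example_database` and `rm_cases` are hypothetical.

```python
from collections import defaultdict


def example_database():
    """Extensions of p and q as listed in the text."""
    p = {(0, k) for k in range(10001)} | {(n, 2 * n) for n in range(1, 5001)}
    q = {(0, 0)} | {(o, 0) for o in range(1, 10000, 2)}
    return p, q


def rm_cases(p, q):
    """All y such that rm(y) holds, according to both defining clauses."""
    q_by_first = defaultdict(set)
    for a, b in q:
        q_by_first[a].add(b)
    # rm(y) <- q(x, y), not p(y, x)
    first = {y for x, y in q if (y, x) not in p}
    # rm(y) <- p(x, y), q(y, z), not p(y, z), not q(z, x)
    second = {y for x, y in p for z in q_by_first.get(y, ())
              if (y, z) not in p and (z, x) not in q}
    return first | second


p, q = example_database()

p1 = p - {(0, 0)}              # first repair action: delete p(0, 0)
print(rm_cases(p1, q))         # medium-risk cases caused by the first repair

q2 = q - {(0, 0)}              # second repair action: delete q(0, 0)
print(len(rm_cases(p1, q2)))   # number of medium-risk cases after both repairs
```

The first repair introduces the single case rm(0); the second removes it but introduces one violated case per odd number in [1,9999].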
Clearly, many facts about p or q would have to be
deleted in order to repair each of these violated cases.
For simplicity, we won’t follow them through, since
the point that repairing D is very complex and tends to
be much more expensive than inconsistency-tolerant
integrity checking has become obvious already. We
only recall the big advantage of inconsistency-tolerant
integrity checking that repair actions do not have to
take place at update time. Instead, they can be taken
off-line, at any convenient moment.
5) Also repairing the new state is more costly than
to simply tolerate extant constraint violations until
they can be repaired at some better moment. In our
example, this becomes obvious by recalling from 1)
that, in $D^U$, there are three violated cases: the low-risk
case that is already violated in D and the medium- and
high risk cases as detected by inconsistency-tolerant
integrity checking. To repair them is indeed even
more complicated than to only repair the violated low-
risk case, as attempted in 4).
Moreover, it should be noted for risk management
that it is no good idea in general to simply accept an
update without checking for potential violations of
constraints, and to attempt repairs only after the up-
date is committed, because repairing takes time, dur-
ing which an updated but unchecked state may con-
tain malicious risks of any order.
6) Consistent query answering in inconsistent
databases (CQA) is a popular approach to cope with
extant constraint violations for query answering (Are-
nas et al., 1999). Although query answering is
not the topic of this paper, a connection between
inconsistency-tolerant integrity checking and CQA
can easily be drawn, because the monitoring of qual-
ity constraints involves the evaluation of such con-
straints or their simplifications. Thus, the idea may
arise to use CQA for evaluating constraints as queries,
in order to avoid wrong answers that could be due to
extant constraint violations.
However, to evaluate constraints or simplifications
thereof by CQA is not recommendable, because con-
sistent answers are defined to be those that are true in
each minimally repaired state of the datbase. Thus,
for each queried constraint, CQA will by definition
return the empty answer, which indicates the satis-
faction of the constraint. Thus, answers to queried
constraints that are computed by CQA have in fact no
meaningful interpretation.
For instance, CQA computes the empty answer to
the query rl(x) as well as to the query rh(z),
for any extension of the relations p and q. However,
the only sensibly correct answer to the first query in
D is x = 0. Similarly, the only reasonable answer
to the second query in D^U is x = 9999, assuming
that 9999 > th. These answers are reasonable because
they correctly indicate risks contained in D and D^U,
respectively.
This shows that, despite the many unquestionable
merits of CQA, it should not be used for monitoring
quality if quality is modeled by integrity constraints.
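The effect can be reproduced on a toy instance. The following Python sketch is not taken from the paper. It assumes a hypothetical denial constraint stating that no value may occur in both of two made-up relations s and t, and it simplifies repairs to the subset-maximal consistent subsets obtained by deleting facts. The function names violations, repairs and consistent_answers are illustrative only.

```python
from itertools import combinations

# Hypothetical toy database: facts are (relation, value) pairs.
# The relations "s" and "t" are made up and unrelated to p and q above.
db = frozenset({("s", 0), ("t", 0), ("s", 1)})

def violations(state):
    """Evaluate the constraint as a query: values that occur in both s and t."""
    s = {v for (r, v) in state if r == "s"}
    t = {v for (r, v) in state if r == "t"}
    return s & t

def repairs(state):
    """Assumed repair notion: subset-maximal subsets without violations."""
    facts = sorted(state)
    consistent = [
        frozenset(subset)
        for k in range(len(facts) + 1)
        for subset in combinations(facts, k)
        if not violations(frozenset(subset))
    ]
    # Keep only those consistent subsets that no other consistent subset extends.
    return [c for c in consistent if not any(c < other for other in consistent)]

def consistent_answers(state):
    """Answers to the constraint query that hold in every repair."""
    answer_sets = [violations(r) for r in repairs(state)]
    return set.intersection(*answer_sets)

print("direct evaluation :", violations(db))
print("consistent answers:", consistent_answers(db))
```

Direct evaluation reports the violating value, whereas the answer computed over the repairs is empty, because no repair contains a violation. Under these assumptions, the empty answer says nothing about the actual state of the database.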
5 RELATED WORK
Although database quality and database integrity are
intuitively related, they have never been approached
in a uniform manner, to the best of our knowledge,
either in theory or in practice.
Kinships and semantic differences between data
that have or lack quality, and data that have or violate
integrity, are observed in a collection of work on
modeling and managing uncertain data (Motro and Smets,
1996). In that book, widely diverse approaches to
handling data that lack quality are proposed. In
particular, approaches such as probabilistic and fuzzy set
modeling, exception handling, repairing and
paraconsistent reasoning are discussed. However, no
particular approach to integrity checking is considered.
Integrity checking has also never been addressed in
detail by related work on consistent query answering
in inconsistent databases (Arenas et al., 1999).
Moreover, several paraconsistent logics that toler-
ate inconsistency and quality impairment of data have
been proposed, e.g., in (Decker et al., 2002; Bertossi
et al., 2005). Each of them, however, departs from
classical first-order logic, by adopting some anno-
tated, probabilistic, modal or multivalued logic, or by
eliminating and replacing standard axioms and infer-
ence rules with non-standard axiomatizations. As op-
posed to that, inconsistency-tolerant integrity check-
ing fully conforms with standard two-valued datalog
and does not need any extension of classical logic.
Further work on the management of inconsisten-
cies in databases is going on in the field of measur-
ing inconsistency (Grant and Hunter, 2006; Decker,
2009). Inconsistency measures can be used for a
form of inconsistency-tolerant integrity checking that
is different from the approach outlined in section 3.
It accepts an update only if the measured amount of
inconsistency in the old state does not increase in the
new state. There are several possible ways to measure
inconsistency. Two that are directly related to sec-
tion 3 are to count the number of violated cases, or to
compare the set of violated cases before and after the
update (Decker and Martinenghi, 2008). In (Decker,
2009), some more measures related to inconsistency-
tolerant integrity checking are discussed.
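The two measures mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the method of the cited papers: the denial constraint (no value may occur in both of two made-up relations s and t), the reading of "compare the set of violated cases" as a subset test, and the names violated_cases, accept_by_count and accept_by_set are all hypothetical.

```python
# Hypothetical toy state: facts are (relation, value) pairs.
old = frozenset({("s", 0), ("t", 0), ("s", 1)})

def violated_cases(state):
    """Violated cases of the assumed constraint: values in both s and t."""
    s = {v for (r, v) in state if r == "s"}
    t = {v for (r, v) in state if r == "t"}
    return s & t

def accept_by_count(old_state, new_state):
    """Accept if the number of violated cases does not increase."""
    return len(violated_cases(new_state)) <= len(violated_cases(old_state))

def accept_by_set(old_state, new_state):
    """Accept if no violated case appears that was not violated before."""
    return violated_cases(new_state) <= violated_cases(old_state)

# Three hypothetical updates applied to the old state.
harmless = old | {("t", 2)}                # adds no new violated case
harmful = old | {("t", 1)}                 # adds the violated case 1
swapped = (old - {("t", 0)}) | {("t", 1)}  # removes case 0, adds case 1

for name, new in [("harmless", harmless), ("harmful", harmful), ("swapped", swapped)]:
    print(name, accept_by_count(old, new), accept_by_set(old, new))
```

The third update shows where the two measures differ in this sketch: counting accepts it because one violated case is replaced by another, while the set comparison rejects it because a new case becomes violated.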
6 CONCLUSIONS
We have shown that the quality and the integrity
of stored data can be modeled and monitored in a
straightforward, uniform manner. Conditions that
capture properties of data quality and integrity can
both be modeled by database assertions.
Traditionally, no method for simplified integrity
checking has tolerated as input a database that is in-
consistent with its constraints. However, as shown in
(Decker and Martinenghi, 2006, 2008), it is possible
to waive that restriction. Hence, quality and integrity
assertions, the occasional violation of which is toler-
able, can be monitored efficiently by inconsistency-
tolerant constraint checking methods.
It is important to note that achieving inconsistency
tolerance requires no re-implementation or extension
of any existing method that can be shown to have that
property. As illustrated in the extended
example of this paper, inconsistency tolerance is es-
sential, since wrong, possibly fatal conclusions can
be inferred from deficient data by using a method
that is not inconsistency-tolerant. Many methods
that are inconsistency-tolerant, and also some that are
not, have been identified in (Decker and Martinenghi,
2006, 2008).
Ongoing work is concerned with establishing a
closer relationship of inconsistency-tolerant integrity
checking with the fields of repairing, consistent query
answering and inconsistency measuring.
REFERENCES
Arenas, M., Bertossi, L. E., and Chomicki, J. (1999). Con-
sistent query answers in inconsistent databases. In
Proceedings of PODS, pages 68–79. ACM Press.
Bertossi, L., Hunter, A., and Schaub, T. (2005). Inconsis-
tency Tolerance, volume 3300 of LNCS. Springer.
Bowen, K. and Kowalski, R. A. (1982). Amalgamating language
and metalanguage. In Clark, K. and Tärnlund,
S.-A., editors, Logic Programming, pages 153–172.
Academic Press.
Decker, H. (2008). Inconsistency-tolerant integrity check-
ing for knowledge assimilation. In Filipe, J., Shishkov,
B., Helfert, M., and Maciaszek, L., editors, Software
and Data Technologies, volume 22 of CCIS, pages
320–331. Springer.
Decker, H. (2009). Quantifying the quality of stored data
by measuring their integrity. Submitted.
Decker, H. and Martinenghi, D. (2006). A relaxed approach
to integrity and inconsistency in databases. In Her-
mann, M. and Voronkov, A., editors, Proc. 13th LPAR,
volume 4246 of LNCS, pages 287–301. Springer.
Decker, H. and Martinenghi, D. (2008). Classifying in-
tegrity checking methods with regard to inconsistency
tolerance. In Proceedings of the 10th ACM SIGPLAN
conference on Principles and Practice of Declarative
Programming, pages 195–204. ACM Press.
Decker, H., Villadsen, J., and Waragai, T., editors (2002).
Proceedings of the ICLP 2002 workshop on Paracon-
sistent Computational Logic, volume 95 of Datalo-
giske Skrifter. Roskilde University, Denmark.
Grant, J. and Hunter, A. (2006). Measuring inconsistency
in knowledgebases. Journal of Intelligent Information
Systems, 27(2):159–184.
Gupta, A., Sagiv, Y., Ullman, J. D., and Widom, J. (1994).
Constraint checking with partial information. In Pro-
ceedings of PODS 1994, pages 45–55. ACM Press.
Lee, S. Y. and Ling, T. W. (1996). Further improvements on
integrity constraint checking for stratifiable deductive
databases. In VLDB’96, pages 495–505. Kaufmann.
Motro, A. and Smets, P. (1996). Uncertainty Manage-
ment in Information Systems: From Needs to Solu-
tions. Kluwer.
Ramakrishnan, R. and Gehrke, J. (2003). Database Man-
agement Systems. McGraw-Hill.
ICSOFT 2009 - 4th International Conference on Software and Data Technologies