hippocratic databases [1] restricts the subclass to select-project join queries. Although
a partial solution for multiple XML queries is discussed in [4], this shows
non-inferability of XPath query results only for a special case.
Our problem of inferring database secrets by combining the results of multiple queries
is different from k-anonymity violation checking for relational views, as discussed
e.g. in [9], or from l-diversity [7], for the following reason. k-anonymity and
l-diversity both regard relationships between all values given for certain attributes,
whereas our secret information is an association between individual, concrete values.
Consequently, even if 2-anonymity and 2-diversity are provably violated for a given
pair (A,B) of attributes, our secret may still remain undisclosed. We cannot be sure
that a secret as defined in our paper, i.e., an association between two concrete
values (a1,b1) of the attributes A and B, can be derived, because 2-anonymity could be
violated only for other pairs of values (a2,b2) and not for the pair (a1,b1) of the
secret. In other words, 2-anonymity between attributes can be violated without the
concrete values of the secret being leaked.
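The distinction can be illustrated with a minimal sketch. The released view, the reading
of A as a quasi-identifier and B as a sensitive attribute, and all helper names are
assumptions made for this illustration only. The check for a "determined" association is
a simplified stand-in for inferability, not the inference procedure of this paper.

```python
from collections import defaultdict

# Hypothetical released view over the attributes (A, B); the data is invented.
view = [("a1", "b1"), ("a1", "b3"), ("a2", "b2")]

def groups_by_a(rows):
    """Collect the B-values that occur together with each A-value."""
    groups = defaultdict(list)
    for a, b in rows:
        groups[a].append(b)
    return dict(groups)

def violates_k_anonymity(rows, k):
    """k-anonymity w.r.t. A is violated if some A-value occurs in fewer than k rows."""
    return any(len(bs) < k for bs in groups_by_a(rows).values())

def association_determined(rows, a, b):
    """The pair (a, b) is pinned down only if every row with A = a carries B = b."""
    bs = groups_by_a(rows).get(a, [])
    return bool(bs) and all(x == b for x in bs)

two_anonymity_violated = violates_k_anonymity(view, 2)    # True: a2 occurs once
secret_leaked = association_determined(view, "a1", "b1")  # False: a1 maps to b1 and b3
other_leaked = association_determined(view, "a2", "b2")   # True: only (a2, b2) is exposed
```

In this invented view, 2-anonymity is violated because of the pair (a2,b2), while the
secret association (a1,b1) is still not determined by the view.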
Theorem proving for first-order predicate calculus has been investigated for a long time,
e.g. [8], but it is not directly applicable for the following reason. A database relation
contains only the tuples that are interpreted to be true, but the closed-world
assumption and operations like negation, set-difference, and bag-difference may
require considering also the tuples of a relation schema that are not in the relation.
When it becomes necessary to model all these facts as being false, the number of
formulas will be in the order of the number of database schema tuples, which is too
high for today’s theorem provers.
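A rough back-of-the-envelope sketch shows how quickly the number of such negative facts
grows. The schema, the domain sizes, and the number of stored tuples below are
hypothetical values chosen only for illustration.

```python
from math import prod

# Hypothetical relation schema: attribute name -> size of its finite domain.
domain_sizes = {"employee": 10_000, "project": 500, "role": 20}

def schema_tuple_count(sizes):
    """Number of tuples that can be formed over the schema (product of domain sizes)."""
    return prod(sizes.values())

def negative_fact_count(sizes, stored_tuples):
    """Tuples absent from the relation; under the closed-world assumption each is false."""
    return schema_tuple_count(sizes) - stored_tuples

total = schema_tuple_count(domain_sizes)               # 100_000_000
negatives = negative_fact_count(domain_sizes, 25_000)  # 99_975_000
```

Even for this small assumed schema, modelling every absent tuple explicitly would require
nearly 10^8 negated formulas for a single relation.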
Nesting of views has been used in query optimization. However, the approaches
investigated focus on fast execution plans and avoid looking into all possible
combinations of views, which is required here.
Finally, in contrast to all other approaches to inference on database queries that regard
only a subset of the database queries, e.g. [3], [6], we regard all relational algebra
expressions, including bag-valued relational expressions allowing for duplicates.
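Why duplicates matter can be seen in a small sketch that contrasts bag-difference with
set-difference. Representing a bag as a Python Counter that maps tuples to multiplicities
is an assumption of this illustration, and the relations are invented.

```python
from collections import Counter

# Hypothetical bag-valued relations: tuple -> number of occurrences.
r = Counter({("alice", "sales"): 2, ("bob", "hr"): 1})
s = Counter({("alice", "sales"): 1})

# Bag-difference subtracts multiplicities and keeps only positive counts.
bag_diff = r - s

# Set-difference ignores multiplicities and removes every matching tuple.
set_diff = set(r) - set(s)

alice_left_in_bag = bag_diff[("alice", "sales")]   # 1
alice_left_in_set = ("alice", "sales") in set_diff  # False
```

Under bag semantics one copy of the tuple survives the difference, so a result can reveal
information that the corresponding set-valued expression would hide.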
6 Summary
Whenever secret company information that could be accessed by multiple users has
been illegally leaked to a third party, it is crucial for the company to find all the
possible information leaks. First, we have provided a formalization of secret information as
being the answer Rs to a secret query Qs. Second, we have shown how secret
information can be inferred from a set of user queries Q1,…,Qn and known answers
to these queries. Third, we reduced the problem of finding information leaks to an
inference problem among database queries. Fourth, we have proven that this problem
is NP-hard. Fifth, we have reduced this problem to searching for a composition function f
that when applied to the user queries Q1,…,Qn generates a relational expression that
can be transformed into the secret query Qs by query simplification and by
substitution of query expressions with results. Whenever such a composition function
f can be found, the secret Rs is inferable, i.e. we have found a potential information
leak. Finally, as integrity constraints are only a special case of queries, each solution
to our general problem is also a solution to database inference in the presence of
integrity constraints or so-called “global knowledge” which can be expressed as a