A DISTORTION FREE WATERMARK FRAMEWORK FOR RELATIONAL DATABASES

Sukriti Bhattacharya and Agostino Cortesi

Dipartimento di Informatica, Universita Ca’ Foscari di Venezia, Via Torino 155, 30170 Venezia, Italy

Keywords:

Database watermarking, HMAC, Abstract interpretation.

Abstract:

In this paper we introduce a distortion free invisible watermarking technique for relational databases. The

main idea is to build the watermark after partitioning tuples with actual attribute values. Then, we build hash

functions on top of this grouping and get a watermark as a permutation of tuples in the original table. As the

ordering of tuples does not affect the original database, this technique is distortion free. Our contribution can

be seen as an application to relational databases of software watermarking ideas developed within the Abstract

Interpretation framework.

1 INTRODUCTION

A watermark can be considered to be some kind of

information that is embedded into underlying data

for tamper detection, localization, ownership proof,

and/or traitor tracing purposes (Agrawal et al., 2003).

Watermarking techniques apply to various types of

host content. Here, we concentrate on relational

databases. Rights protection for such data is crucial in scenarios where data are sensitive and valuable and nevertheless need to be outsourced. A good example is a data mining application, where data are sold in pieces to parties specialized in mining them, e.g. sales pattern databases, oil drilling data, financial data. Other scenarios involve, for example, online B2B interactions, e.g., airline reservation and scheduling portals, in which data are made available for direct, interactive use. Given the nature of most of the data,

it is hard to associate rights of the originator over it

(Lafaye, 2007). Watermarking can be used to solve

these issues. Unlike encryption and hash description, typical watermarking techniques modify the original data as a modulation of the watermark information. They inevitably cause permanent distortion to the original data and therefore cannot meet the integrity requirement of the data as required in some applications.

The first well-known database watermarking scheme for relational databases was proposed by Agrawal and Kiernan (Agrawal et al., 2003) for watermarking numerical values. The fundamental assumption is that the watermarked database can tolerate a

small amount of errors. Since any bit change to a

categorical value may render the value meaningless,

Agrawal and Kiernan’s scheme cannot be directly ap-

plied to watermarking categorical data. To solve this

problem, Sion (Sion, 2004) proposed to watermark a

categorical attribute by changing some of its values to

other values of the attribute (e.g., ’red’ is changed to

’green’) if such change is tolerable in certain applica-

tions.

All of the works cited so far assume that minor distortions caused to some attribute data can be tolerated to some specified precision grade. However, some applications in which relational data are involved cannot tolerate any permanent distortion, and the integrity of the data needs to be authenticated. To meet this requirement, we further strengthen this approach and propose a distortion free watermarking algorithm for relational databases based on reordering tuples. The robustness of the proposed watermarking depends on the size of the individual groups, so it is specifically designed for large databases. The resulting watermark is robust against various forms of malicious attacks and updates to the data in the table.

Database watermarking consists of two basic processes: watermark insertion and watermark detection (Agrawal et al., 2003), as illustrated in Figure 1. For watermark insertion, a key is used to embed watermark information into an original database so as to produce the watermarked database for publication or distribution. Given the appropriate key and watermark information, a watermark detection process can be applied to any suspicious database so as to determine whether or not a legitimate watermark can be detected. A suspicious database can be any watermarked database or innocent database, or a mixture of them under various database attacks.

Bhattacharya S. and Cortesi A. (2009). A DISTORTION FREE WATERMARK FRAMEWORK FOR RELATIONAL DATABASES. In Proceedings of the 4th International Conference on Software and Data Technologies, pages 229-234. DOI: 10.5220/0002256402290234. Copyright © SciTePress

Figure 1: Basic watermarking process.

While the basic processes in database watermark-

ing are quite similar to those in watermarking multi-

media data, the approaches developed for multimedia

watermarking cannot be directly applied to databases

because of the difference in data properties. These

differences include (Agrawal et al., 2003):

• A multimedia object consists of a large number

of bits with considerable redundancy. Thus, the

watermark has a large cover in which to hide. A

database relation consists of tuples, each of which

represents a separate object. The watermark needs to be spread over these separate objects;

• The relative spatial/temporal positioning of vari-

ous pieces of a multimedia object typically does

not change. Tuples of a relation, on the other

hand, constitute a set, and there is no implied or-

dering between them.

• Multimedia objects typically remain intact; por-

tions of an object cannot be dropped or replaced

arbitrarily without causing perceptual changes

in the object. On the other hand, tuple inser-

tions, deletions, and updates are the norm in the

database setting.

Once outsourced, i.e., out of the control of the watermarker, data might be subjected to a set of attacks or transformations; these may be malicious (e.g., with the explicit intent of removing the watermark) or simply the result of normal use of the data. An effective watermarking technique must be able to survive such use. The main idea of this paper is to build the watermark after partitioning the tuples according to their actual attribute values and then to build hash functions on top of these groupings. Our contribution can be seen as an application to relational databases of software watermarking ideas developed within the abstract interpretation framework (Cousot and Cousot, 2004) and (Cousot and Cousot, 2007).

The paper is organized as follows. In section 2 we formalize the definition of tables in a relational database and give a formal definition of the watermarking process for such a table. Section 3 illustrates how a table in a relational database can be abstracted. Then a distortion free algorithm for watermark embedding and verification, on top of this abstraction, is provided in sections 4 and 5, respectively. Section 6 illustrates the algorithms by a suitable example. The correctness proof of this watermarking process is given in section 7. A robustness analysis of this watermarking scheme is given in section 8. Finally we draw our conclusions in section 9.

2 PRELIMINARIES

This section contains formal definitions (Collberg and Thomborson, 2002) and (Haan and Koppelaars, 2007) of tables in a relational database and of database watermarking.

Deﬁnition 2.1: Function.

Let $\Pi_i$ be the projection function which selects the i-th coordinate of a pair. F is a function over the set A into set B $\Leftrightarrow F \in \wp(A \times B) \wedge (\forall p_1, p_2 \in F : p_1 \neq p_2 \Rightarrow \Pi_1(p_1) \neq \Pi_1(p_2)) \wedge \{\Pi_1(p) \mid p \in F\} = A$.

Deﬁnition 2.2: Set Function.

A set function is a function in which every range element is a set. Formally, F is a set function $\Leftrightarrow$ F is a function and ($\forall c \in \mathrm{dom}(F)$ : F(c) is a set).

For instance, we can express information about companies and their locations by means of a set function over the domain {Company, Location}, namely:

{(Company, {’Natural Join’, ’Central Boekhuis’, ’Oracle’, ’Remmen & De Brock’}), (Location, {’New York’, ’Venice’, ’Paris’})}

Deﬁnition 2.3: Table.

Given two sets H and K, a table over H and K is a set of functions T over the same set H and into the same set K, i.e. $\forall t \in T$ : t is a function from H to K.

For instance consider a table containing data on em-

ployees:

The table is represented by the set of functions $\{t_1, t_2, t_3\}$ where $\mathrm{dom}(t_i)$ = {emp_no, emp_name, emp_rank} and

ICSOFT 2009 - 4th International Conference on Software and Data Technologies


Table 1: EMPLOYEE.

emp_no   emp_name   emp_rank
100      John       Manager
101      David      programmer
103      Albert     HR

for instance $t_1$(emp_name) = John.

There is a correspondence between tuples and

functions. For instance, $t_1$ corresponds to the following tuple: {(emp_no, 100), (emp_name, John), (emp_rank, Manager)}. The first coordinates of the

ordered pairs in a tuple are referred to as the attributes

of that tuple.
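The correspondence between tuples and functions can be made concrete with a small sketch. It is only an illustration under our own assumptions: the helper names `make_tuple` and `value_of` are hypothetical, and each tuple is stored as a frozen set of (attribute, value) pairs so that the table can be an unordered set, as in Definition 2.3.

```python
# A table is a set of functions from attribute names (H) to values (K).
# Each tuple is modelled as a frozenset of (attribute, value) pairs, so it is
# hashable and the table itself can be an unordered Python set.

def make_tuple(**attrs):
    """Build one tuple t as a function H -> K, stored as a set of pairs."""
    return frozenset(attrs.items())

def value_of(t, attribute):
    """Evaluate the function t at `attribute`, i.e. t(attribute)."""
    return dict(t)[attribute]

# Table 1 (EMPLOYEE) as a set of three functions.
EMPLOYEE = {
    make_tuple(emp_no=100, emp_name="John", emp_rank="Manager"),
    make_tuple(emp_no=101, emp_name="David", emp_rank="programmer"),
    make_tuple(emp_no=103, emp_name="Albert", emp_rank="HR"),
}

t1 = make_tuple(emp_no=100, emp_name="John", emp_rank="Manager")
print(value_of(t1, "emp_name"))   # John
print(t1 in EMPLOYEE)             # True
```

Because the table is a set, no ordering of the tuples is implied, which is the property the watermarking scheme later relies on.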

Deﬁnition 2.4: Watermarking.

A watermark W for a table T over H into K is a predicate such that W(T) is true and the probability of $W(T')$ being true with $T' \in \wp(H \times K) \setminus T$ is negligible.

3 DATABASE ABSTRACTION

Our watermarking technique can be seen as an extension to relational databases of general ideas of software watermarking as introduced by P. Cousot and R. Cousot (Cousot and Cousot, 2004).

Given a relational database (i.e. a set of tables), we assume that the semantics of a query is precisely defined, and its answer is unique. Let A be the set of attributes in T. The domain D of a categorical attribute $x \in A$ in table T is the set of all possible values of x, and the (finite) value set $V \subseteq D$ is the set of values of x that are actually present in T. We can partition the tuples in T by grouping the values of attribute x by $P = \{[v_i] : 1 \le i \le N\}$, where $\forall t \in T : t.x = v_i \Leftrightarrow t \in [v_i]$. The frequency $q_i$ of $v_i$ is the number of tuples in $[v_i]$. The data distribution of x (in T) is the set of pairs $\tau = \{(v_i, q_i) \mid 1 \le i \le N\}$. So the entire database can be partitioned into N fixed mutually exclusive areas based on each categorical value $v_i$. The above concept leads to an abstraction as depicted in Figure 2.
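As a minimal sketch of this partitioning, assuming rows are represented as Python dictionaries, the partition P and the data distribution τ could be computed as follows. The function names `partition_by` and `data_distribution` are ours, and the sample rows are the first five tuples of the PARTS table used in Section 6.

```python
from collections import defaultdict

def partition_by(table, attribute):
    """Group rows by the value of a categorical attribute: v_i -> [v_i]."""
    groups = defaultdict(list)
    for row in table:
        groups[row[attribute]].append(row)
    return dict(groups)

def data_distribution(partition):
    """Return tau = {v_i: q_i}, the frequency of each value."""
    return {value: len(rows) for value, rows in partition.items()}

sample = [
    {"prts_no": "1007P", "color": "BLUE"},
    {"prts_no": "1017P", "color": "RED"},
    {"prts_no": "1027P", "color": "RED"},
    {"prts_no": "1030P", "color": "BLACK"},
    {"prts_no": "1032P", "color": "RED"},
]

P = partition_by(sample, "color")
print(data_distribution(P))   # {'BLUE': 1, 'RED': 3, 'BLACK': 1}
```

The grouping is virtual: it builds lists of references and leaves the input table untouched.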

Given a table T over A into D, a categorical attribute $x \in A$ and $P = \{[v_i] : 1 \le i \le N\}$, for each set $S \subseteq T$ we can define a concretization map $\gamma_x$ as follows:

$$\begin{cases} \gamma_x(v_i, h) = \{S \subseteq T \mid \forall t \in S : t \in [v_i] \wedge |S| = h\} \\ \gamma_x(\top) = T \\ \gamma_x(\bot) = \emptyset \end{cases}$$

The best representation of a set of tuples with attribute x is captured by the corresponding abstraction function $\alpha_x$:

$$\alpha_x(S) = \begin{cases} (v_i, h) & \text{if } \forall t \in S : t.x = v_i \wedge |S| = h \\ \top & \text{if } \exists t_1, t_2 \in S : t_1.x \neq t_2.x \\ \bot & \text{if } S = \emptyset \end{cases}$$

Figure 2: Table Abstraction (Galois Connection).

We may prove that $(\alpha_x, \gamma_x)$ form a Galois insertion (Myrvold and Ruskey, 2001) with $\alpha_x$ monotone and $\gamma_x$ weakly monotone, i.e. $(v, u) \le (v, m) \Rightarrow (\cup \gamma(v, u)) \subseteq (\cup \gamma(v, m))$.
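The abstraction function can be read operationally. The sketch below is an illustration only: the sentinels `TOP` and `BOTTOM` and the dictionary representation of rows are our assumptions, not part of the original formalization.

```python
TOP, BOTTOM = "TOP", "BOTTOM"   # stand-ins for the lattice elements ⊤ and ⊥

def alpha(rows, attribute):
    """Abstract a set of rows S with respect to one categorical attribute x."""
    if not rows:
        return BOTTOM                    # S is empty
    values = {row[attribute] for row in rows}
    if len(values) > 1:
        return TOP                       # two rows disagree on x
    (value,) = values
    return (value, len(rows))            # (v_i, h) with h = |S|

reds = [{"prts_no": "1017P", "color": "RED"}, {"prts_no": "1027P", "color": "RED"}]
mixed = reds + [{"prts_no": "1007P", "color": "BLUE"}]

print(alpha(reds, "color"))    # ('RED', 2)
print(alpha(mixed, "color"))   # TOP
print(alpha([], "color"))      # BOTTOM
```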

4 WATERMARK EMBEDDING

The watermark embedding process is based on a partition $P = \{[v_i] : 1 \le i \le N\}$ corresponding to a given categorical attribute x. This partitioning can be seen as a virtual grouping which does not change the physical position of the tuples. After grouping, all tuples in each group are sorted according to their primary key. Like grouping, the sorting operation does not change the physical position of the tuples either. Each group is then considered (hence, watermarked) independently.

Recall that we denote by $q_k$ the frequency of $[v_k]$, i.e. the number of tuples in $[v_k]$. A (keyed) group hash value $H_{q_k}$ is computed using an HMAC (HMAC, 2002) function based on the tuple hash values in sorted order. Then, a watermark $W = \mathrm{ExtractBits}(H_{q_k}, \ln(q_k))$ of length $\ln(q_k)$ is derived from $H_{q_k}$. The watermark W is embedded into this group by permuting the order of the tuples. The new order $\pi$ can be easily calculated from W using Myrvold and Ruskey’s linear permutation unranking algorithm (Myrvold and Ruskey, 2001):


Algorithm 1: Watermark Embedding in $[v_k]$.

Begin
    for $i = 1$ to $q_k$ do    // number of tuples in group $G_k$
        $h_i = \mathrm{HMAC}(\Re, r_i)$    // tuple hash of tuple $r_i$
    end for
    $H_1 = \mathrm{HMAC}(\Re, h_1)$
    for $i = 2$ to $q_k$ do
        $H_i = \mathrm{HMAC}(\Re, H_{i-1} * p_i^{h_i})$
        // where $p_i$ is the i-th prime number, e.g. $p_2 = 2$, $p_5 = 7$ and so on
    end for
    $W = \mathrm{ExtractBits}(H_{q_k}, \ln(q_k))$
    $\pi_k = (0, \ldots, q_k - 1)$    // identity permutation, initialized once
    Unrank($q_k$, $W$, $\pi_k$)
End

Procedure ExtractBits($H_{q_k}$, $l$)
    $W$ = concatenation of the first $l$ selected bits from $H_{q_k}$    // assuming $H_{q_k}$ is longer than $l$ bits
    return $W$
End

Procedure Unrank($q_k$, $W$, $\pi_k$)
    // $\pi_k$ holds the identity $(0, \ldots, q_k - 1)$ on the first call
    if ($q_k > 0$) then
        swap($\pi_k[q_k - 1]$, $\pi_k[W \bmod q_k]$)
        Unrank($q_k - 1$, $\lfloor W / q_k \rfloor$, $\pi_k$)
    end if
End
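Algorithm 1 can be sketched in Python with the standard `hmac` and `hashlib` modules. This is a hedged illustration, not the authors' implementation, and it makes three assumptions that are not in the text: rows are serialised with `repr()`, the watermark length is taken as floor(log2(q_k)) bits (which is what the bit strings in the worked example of Section 6 suggest), and the group hash chains H_(i-1), the index i and h_i by concatenation instead of computing the product H_(i-1) * p_i^(h_i), which would be impractically large for real digests.

```python
import hashlib
import hmac

def tuple_hash(key: bytes, row: tuple) -> bytes:
    """h_i = HMAC(key, r_i); serialising the row with repr() is an assumption."""
    return hmac.new(key, repr(row).encode("utf-8"), hashlib.sha256).digest()

def group_hash(key: bytes, hashes: list) -> bytes:
    """Chain the tuple hashes (already in sorted order) into one keyed hash.

    Substitution: the message is H_(i-1) || i || h_i rather than the
    product H_(i-1) * p_i^(h_i) written in Algorithm 1.
    """
    acc = hmac.new(key, hashes[0], hashlib.sha256).digest()
    for i, h in enumerate(hashes[1:], start=2):
        acc = hmac.new(key, acc + i.to_bytes(4, "big") + h, hashlib.sha256).digest()
    return acc

def extract_bits(digest: bytes, nbits: int) -> int:
    """Return the first `nbits` bits of the digest as an integer."""
    if nbits == 0:
        return 0
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - nbits)

def unrank(n: int, rank: int) -> list:
    """Myrvold-Ruskey unranking, written iteratively instead of recursively."""
    perm = list(range(n))
    for size in range(n, 0, -1):
        j = rank % size
        perm[size - 1], perm[j] = perm[j], perm[size - 1]
        rank //= size
    return perm

def embed_group(key: bytes, rows: list, pk_index: int = 0) -> list:
    """Return the rows of one group [v_k] in watermarked order."""
    if not rows:
        return []
    ordered = sorted(rows, key=lambda r: r[pk_index])   # sort by primary key
    hashes = [tuple_hash(key, r) for r in ordered]
    nbits = len(ordered).bit_length() - 1               # floor(log2(q_k)), an assumption
    w = extract_bits(group_hash(key, hashes), nbits)
    return [ordered[p] for p in unrank(len(ordered), w)]

blue = [("1007P", "BLUE"), ("1044P", "BLUE"), ("1051P", "BLUE"), ("1077P", "BLUE")]
marked = embed_group(b"secret", blue)
print(sorted(marked) == sorted(blue))   # True: same tuples, only the order changes
```

Since the rows are sorted by primary key before hashing, the result does not depend on the order in which the group is supplied, and no attribute value is ever modified.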

5 WATERMARK VERIFICATION

A very important problem in a watermarking scheme is synchronization, that is, we must ensure that the watermark extracted is in the same order as that embedded. If synchronization is lost, even if no modifications have been made, the embedded watermark cannot be correctly verified. In watermark detection, a group of $q_k$ tuples is selected and their ordering $\pi_k$ is identified, where $\pi_k$ is a permutation of $(0, \ldots, q_k - 1)$. A watermark W’ can be derived from $\pi_k$ using Myrvold and Ruskey’s linear permutation ranking algorithm (Myrvold and Ruskey, 2001). Based on the tuple hashes of the sorted tuples, a group hash value $H_{q_k}$ is computed using an HMAC function with the same key value $\Re$ used during embedding. Then, a watermark W is extracted from the group hash. If W matches W’, the tuples in this group are authentic; otherwise, the data in this group have been modified or tampered with.

Algorithm 2: Watermark verification in $[v_k]$.

Begin
    for $i = 1$ to $q_k$ do    // number of tuples in group $G_k$
        $h_i = \mathrm{HMAC}(\Re, r_i)$    // tuple hash of tuple $r_i$
    end for
    $H_1 = \mathrm{HMAC}(\Re, h_1)$
    for $i = 2$ to $q_k$ do
        $H_i = \mathrm{HMAC}(\Re, H_{i-1} * p_i^{h_i})$
        // where $p_i$ is the i-th prime number, e.g. $p_2 = 2$, $p_5 = 7$ and so on
    end for
    $W = \mathrm{ExtractBits}(H_{q_k}, \ln(q_k))$
    $W' = \mathrm{rank}(q_k, \pi_k, \pi_k^{-1})$
    if ($W = W'$) then
        return (Authentication Preserved)
    else
        return (Authentication Violated)
    end if
End

Procedure rank($q_k$, $\pi_k$, $\pi_k^{-1}$)
    if ($q_k = 1$) then return 0
    $s = \pi_k[q_k - 1]$
    swap($\pi_k[q_k - 1]$, $\pi_k[\pi_k^{-1}[q_k - 1]]$)
    swap($\pi_k^{-1}[s]$, $\pi_k^{-1}[q_k - 1]$)
    return ($s + q_k * \mathrm{rank}(q_k - 1, \pi_k, \pi_k^{-1})$)
End
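The ranking step of Algorithm 2 can be sketched as follows. This is an illustration under assumptions: the recursion is unrolled into a loop, the function name `verify_group` is ours, and the watermark W is assumed to have been recomputed from the keyed group hash by the caller, so it is simply passed in as `expected_w`. The sample rows are the BLUE group as it appears in Table 4, for which the text assumes W = (10)_2.

```python
def rank(perm):
    """Myrvold-Ruskey ranking, iterative form of the recursive procedure."""
    pi = list(perm)
    inv = [0] * len(pi)
    for pos, val in enumerate(pi):
        inv[val] = pos                       # inverse permutation
    result, multiplier = 0, 1
    for n in range(len(pi), 1, -1):
        s = pi[n - 1]
        j = inv[n - 1]
        pi[n - 1], pi[j] = pi[j], pi[n - 1]  # move value n-1 to position n-1
        inv[s], inv[n - 1] = inv[n - 1], inv[s]
        result += s * multiplier             # s + n * rank(n - 1, ...)
        multiplier *= n
    return result

def verify_group(stored_rows, expected_w, pk_index=0):
    """Compare the rank of the stored tuple order (W') with the recomputed W."""
    ordered = sorted(stored_rows, key=lambda r: r[pk_index])
    position = {row[pk_index]: idx for idx, row in enumerate(ordered)}
    perm = [position[row[pk_index]] for row in stored_rows]
    return rank(perm) == expected_w

blue = [("1044P", "BLUE"), ("1077P", "BLUE"), ("1007P", "BLUE"), ("1051P", "BLUE")]
print(verify_group(blue, 0b10))           # True: order matches the watermark
print(verify_group(sorted(blue), 0b10))   # False: the tuples were reordered
```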

6 EXAMPLE

This section illustrates the watermarking technique on the table PARTS = (prts_no, color) (Table 2), which has only 14 tuples with primary key prts_no. Table 2 is the original table before the partitioning operation.

Table 2: PARTS (before partitioning).

prts_no   color
1007P     BLUE
1017P     RED
1027P     RED
1030P     BLACK
1032P     RED
1040P     RED
1044P     BLUE
1049P     RED
1051P     BLUE
1055P     RED
1057P     BLACK
1067P     RED
1077P     BLUE
1087P     RED

Table 3 is the partitioned table based on the color attribute, with the hash value $h_i$ associated with each tuple. The table contains three partitions $[v_{red}]$, $[v_{blue}]$ and $[v_{black}]$. The frequencies of these groups are $q_{red} = 8$, $q_{blue} = 4$ and $q_{black} = 2$, respectively.

Table 3: PARTS (after partitioning).

prts_no   color   h_i
1007P     BLUE    1900
1044P     BLUE    1790
1051P     BLUE    1990
1077P     BLUE    1654
1017P     RED     2000
1027P     RED     1799
1032P     RED     1929
1040P     RED     1897
1049P     RED     1700
1055P     RED     1690
1067P     RED     1770
1087P     RED     2790
1030P     BLACK   1999
1057P     BLACK   1970

For simplicity, we only show hypothetical hash values, not the real ones, for each tuple. Group hash values $H_{blue}$, $H_{red}$ and $H_{black}$ are calculated using the tuple hash values $h_i$, $1 \le i \le 14$, associated with the corresponding tuples and a secret key $\Re$. Let us suppose that $W_{blue} = \mathrm{ExtractBits}(H_{blue}, \ln(q_{blue})) = (10)_2$, $W_{red} = \mathrm{ExtractBits}(H_{red}, \ln(q_{red})) = (011)_2$ and $W_{black} = \mathrm{ExtractBits}(H_{black}, \ln(q_{black})) = (0)_2$, respectively. Table 4 shows the watermark embedding operation on Table 2 using the unrank function as illustrated in Algorithm 1 for each partition, respectively. The final watermark value associated with the initial PARTS table (Table 2) is equal to $W = W_{blue} \circ W_{red} \circ W_{black} = (10)_2 \circ (011)_2 \circ (0)_2 = (100110)_2$.

Observe that any change to the initial table can

clash against the watermark W, which is embedded

into the ﬁnal table. Applying Algorithm 2 to Table

4 we may easily check that the authentication is pre-

served.
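The tuple order of Table 4 can be recomputed from the three watermark values assumed above. The sketch below takes the groups of Table 3 in primary-key order and applies an iterative form of the Unrank procedure of Algorithm 1. It assumes that position i of the watermarked group holds the sorted tuple with index π[i], a reading that is ours but that reproduces Table 4 exactly.

```python
def unrank(n, rank):
    """Myrvold-Ruskey unranking, iterative form of Unrank in Algorithm 1."""
    perm = list(range(n))
    for size in range(n, 0, -1):
        j = rank % size
        perm[size - 1], perm[j] = perm[j], perm[size - 1]
        rank //= size
    return perm

# Groups of Table 3, each sorted by primary key prts_no.
GROUPS = {
    "BLUE": ["1007P", "1044P", "1051P", "1077P"],
    "RED": ["1017P", "1027P", "1032P", "1040P", "1049P", "1055P", "1067P", "1087P"],
    "BLACK": ["1030P", "1057P"],
}
# Watermark values assumed in the text: (10)_2, (011)_2 and (0)_2.
WATERMARKS = {"BLUE": 0b10, "RED": 0b011, "BLACK": 0b0}

def watermarked_order(group, w):
    """Place the sorted tuple with index pi[i] at position i."""
    return [group[p] for p in unrank(len(group), w)]

for color, group in GROUPS.items():
    print(color, watermarked_order(group, WATERMARKS[color]))
# BLUE ['1044P', '1077P', '1007P', '1051P']
# RED ['1027P', '1032P', '1087P', '1049P', '1055P', '1067P', '1017P', '1040P']
# BLACK ['1057P', '1030P']
```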

Table 4: PARTS (Watermarked).

prts_no   color
1044P     BLUE
1077P     BLUE
1007P     BLUE
1051P     BLUE
1027P     RED
1032P     RED
1087P     RED
1049P     RED
1055P     RED
1067P     RED
1017P     RED
1040P     RED
1057P     BLACK
1030P     BLACK

7 CORRECTNESS

In this section we show the correctness proof of the algorithm discussed so far.

Theorem. Let T be a table over H into K, and let $\{[v_i] : 1 \le i \le N\}$ be a partition of T. Let $\pi_k$ be the permutation computed by Algorithm 1 and let $\pi = \pi_1 \circ \pi_2 \circ \ldots \circ \pi_N$ be the composition of those permutations. Then the ordered set $\langle W_1, W_2, \ldots, W_N \rangle$ is a watermark for T, that can be embedded in the table itself as $\pi(T)$.

Proof. Let T be a table containing $\eta$ tuples and let $P = \{[v_i] : 1 \le i \le N\}$ be a partition with frequencies $\{q_i \mid 1 \le i \le N\}$, where each $q_i \le \eta$. The group hash value $H_{q_k}$ is computed using the tuple hash values $h_i$, for $1 \le i \le q_k$, in the set $[v_k]$. All the hash values are computed using an HMAC (HMAC, 2002) function with a secret key $\Re$ known by the database owner. Using $H_{q_k}$, the watermark $W_k$ of length $\ln(q_k)$ is calculated and embedded in $[v_k]$ by Algorithm 1. So, the value of $W_k$ depends only on $\Re$ and on the tuples in $[v_k]$.

Suppose the tuple $t_m \in [v_k]$ is tampered with somehow. Then the hash value $h_m$ associated with $t_m$ changes and, by ideal hash function properties, yields a different group hash value $H'_{q_k}$: since an ideal HMAC is injective, when using the secret key $\Re$ and two different messages $M_1$ and $M_2$, we would get $\mathrm{HMAC}(\Re, M_1) \neq \mathrm{HMAC}(\Re, M_2)$. Therefore, corresponding to the tampered $t_m$ we would get a different watermark $W'_k$. When the frequencies $q_k$ are big enough to guarantee that the probability of guessing the permutation $\pi_k$ is negligible, we get that $\langle W_1, W_2, \ldots, W_N \rangle$ is a watermark according to Definition 2.4.

8 ROBUSTNESS

We now analyze the probability that all alterations are correctly localized in the corresponding groups over a watermarked table. We consider three kinds of alterations:

• Modify an attribute value;

• Insert a tuple;

• Delete a tuple.

We assume that partition $[v_k]$ consists of $q_k$ tuples; thus, the length of the embedded watermark $W_k$ for $[v_k]$ is $\ln(q_k)$.

Suppose a single attribute value is modified in the watermarked relation. Without loss of generality, assume $r_i.A_j$ is modified (i.e., the j-th attribute of the i-th tuple) in $[v_k]$. The modification will affect the tuple hash $h_i$, the group hash $H_{q_k}$ and thus the embedded watermark $W_k$. After the modification, each bit of $W_k$ has probability 1/2 of matching the corresponding bit of $W'_k$, which is the watermark extracted from the group. Therefore, the probability that the modification can be correctly detected (i.e., $W_k \neq W'_k$) is

$$Prob_{alt} = 1 - \frac{1}{2^{\ln(q_k)}} \qquad (1)$$

Let a single tuple be inserted into the watermarked relation. The inserted tuple will be allocated to one of the N partitions $[v_k]$ in the watermark detection. Since the added tuple will affect the group hash as well as the embedded watermark in a random way, the probability that the insertion can be detected (i.e., the watermark verification result is false for the affected group) is

$$Prob_{ins} = 1 - \frac{1}{2^{\ln(q_k + 1)}} \qquad (2)$$

Let a single tuple be deleted from the watermarked relation. Exactly one partition $[v_k]$ will lose the deleted tuple in watermark detection. The absence of the deleted tuple in the group will affect the group hash as well as the embedded watermark in a random way. The probability that the deletion can be detected (i.e., the watermark verification result is false for the affected group) is

$$Prob_{del} = 1 - \frac{1}{2^{\ln(q_k - 1)}} \qquad (3)$$

So in general we conclude that the probability that any alteration on the table is eventually located by the algorithm is $O\left(1 - \frac{1}{2^{\ln(N)}}\right)$, where N is the size (number of tuples) of the table. This ensures that the watermarking technique satisfies the requirement of Definition 2.4.
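To get a feeling for equations (1) to (3), the three probabilities can be evaluated numerically. The sketch below follows the formulas literally, with the natural logarithm as the watermark length; the function names are ours, and the deletion case assumes a group of at least two tuples.

```python
import math

def detection_probability(watermark_bits: float) -> float:
    """1 - 1/2^l for a watermark of length l."""
    return 1.0 - 2.0 ** (-watermark_bits)

def prob_modify(q: int) -> float:
    """Equation (1): a value is modified in a group of q tuples."""
    return detection_probability(math.log(q))

def prob_insert(q: int) -> float:
    """Equation (2): a tuple is inserted, the group grows to q + 1 tuples."""
    return detection_probability(math.log(q + 1))

def prob_delete(q: int) -> float:
    """Equation (3): a tuple is deleted, the group shrinks to q - 1 tuples (q >= 2)."""
    return detection_probability(math.log(q - 1))

for q in (8, 100, 10_000):
    print(q, round(prob_modify(q), 4), round(prob_insert(q), 4), round(prob_delete(q), 4))
```

The values grow slowly with the group size, which is consistent with the remark that the scheme is designed for large databases.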

9 CONCLUSIONS

The proposed watermarking scheme is group based, and a group watermark is associated with a specific order of tuples in that group. The length of the watermark $W_k$ for the k-th group is ln(the number of tuples in the k-th group). The longer a watermark, the more secure the scheme. The final watermark associated with the table is the concatenation of the watermark strings associated with each group. The fact that this algorithm is group based yields the following strengths:

• We are able to detect and locate modifications, as we can trace the group which is possibly affected when a tuple $t_m$ is tampered with;

• It is distortion free and invisible, as it consists of a permutation of the original table tuples. Therefore no space overhead is required.

Observe that this watermarking technique can be tuned according to different security levels by re-tuning the partitioning through the use of multiple attributes instead of a single one.

ACKNOWLEDGEMENTS

Work partially supported by Italian MIUR CONIF ’07

project ”SOFT”.

REFERENCES

Agrawal, Kiernan, and Haas (2003). Watermarking relational data: framework, algorithms and analysis. The VLDB Journal.

Collberg and Thomborson (2002). Watermarking, tamper-proofing, and obfuscation - tools for software protection. IEEE Trans. Software Engrg.

Cousot, P. and Cousot, R. (2004). An abstract interpretation-based framework for software watermarking. 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.

Cousot, P. and Cousot, R. (2007). Abstract interpretation and application to static analysis. 1st IEEE and IFIP International Symposium on Theoretical Aspects of Software Engineering.

Haan and Koppelaars (2007). Applied Mathematics for Database Professionals. Apress.

HMAC (2002). The keyed-hash message authentication code. Federal Information Processing Standards Publication.

Lafaye, J. (2007). An analysis of database watermarking security. 3rd International Symposium on Information Assurance and Security.

Myrvold and Ruskey (2001). Ranking and unranking permutations in linear time. Inf. Process. Lett.

Sion (2004). Proving ownership over categorical data. IEEE International Conference on Data Engineering.
