Distortion-free Watermarking Scheme for Compressed Data in Columnar Database

Waheeb Yaqub¹, Ibrahim Kamel² and Zeyar Aung³

¹Center for Cyber Security, New York University, Abu Dhabi, U.A.E.
²Department of Electrical and Computer Engineering, University of Sharjah, Sharjah, U.A.E.
³Department of Computer Science, Khalifa University of Science and Technology, Masdar Institute, Abu Dhabi, U.A.E.
Keywords: Data Integrity, Digital Watermarking, Columnar Databases, Information Hiding, Database Security.
Abstract: Digital watermarking is an effective technique for protecting databases against copyright infringement, piracy, and data tampering. The growing deployment of diverse database systems and their versatile applications has raised the need for watermarking schemes tailored to the specific architecture of the target database system. Most existing digital watermarking schemes do not take into consideration the side effects that watermarking might have on important database characteristics such as data compression and overall performance. In this research, we propose a distortion-free fragile watermarking scheme for the columnar database architecture that interferes with neither its underlying data compression scheme nor its overall performance. The proposed scheme is flexible and can be adapted to various data distributions. We tested our proposed scheme on both synthetic and real-world data and demonstrated its effectiveness.
1 INTRODUCTION
Databases have served the business intelligence industry well over the last few decades. However, with the rise of Big Data analytics and the vast quantities of data that reside in organizations, fast analysis of data is becoming more difficult. Columnar databases (in contrast to row-wise relational databases) boost the performance of data analytic transactions by reducing the amount of data that needs to be read from disk. They achieve this by storing data column-wise (all values of an attribute are clustered together), which allows retrieving only those attributes required by the query and leaving the rest of the tuple on disk. For instance, in a traditional relational database, answering a range query requires loading the set of records that falls in the range and then executing a projection to discard the unwanted attributes. In a columnar DB, in contrast, answering the same query requires loading only the columns that contain the attributes associated with the query. Therefore, query performance is often increased, especially when the database is very large. Another important advantage of storing data in columns is that it lends itself naturally to compression: the similarity of values within a column allows a much higher compression rate, which also decreases the bottleneck between RAM and disk, and many queries can be answered directly on the compressed data.
The database research community came to realize the importance of protecting the integrity of database content, particularly content that is kept on the cloud or published on the web. Databases usually contain sensitive and critical information such as salaries, expenses, loans, inventory values, etc. Thus, unauthorized changes to databases might result in significant losses to organizations and individuals. Using a secure hash like MD5 or SHA, or a digital signature, to protect the integrity of a whole database would detect unauthorized alteration but cannot localize the attack. Applying digital signatures or hashes at a finer level, e.g., the attribute level, would be very costly and would require significant space overhead.
Digital watermarking is one of the techniques that can be used to protect data integrity; it is attractive because it is lightweight and efficient. Originally, digital watermarking research focused on multimedia objects such as images, audio, and video (Lee and Jung, 2001; Potdar et al., 2005; Bajpai and Kaur, 2016; Asikuzzaman and Pickering, 2017). In recent years, researchers
started focusing on methods to watermark relational
databases (Agrawal and Kiernan, 2002; Khanna and
Zane, 2000; Kamel, 2009; Li et al., 2004; Li and
Deng, 2006; Guo et al., 2006; Kamel et al., 2013;
Camara et al., 2014; Kamel et al., 2016; Rani et al.,
2017). Watermarking schemes exploit redundancy in the data objects to hide the secret message (the digital watermark). Alphanumeric data stored in databases usually contain less redundancy than image and video data and are thus more difficult to watermark. Moreover, the dynamic nature of databases makes the design of a watermarking scheme even more challenging. A digital watermark is a secret pattern of bits inserted in a data object to verify the integrity of the content or to identify ownership. Digital watermarking can be classified into two categories: robust and fragile. Robust watermarking is used for copyright protection and is designed to withstand attacks like compression, cropping, scaling, and other geometric transformations. Fragile watermarks, on the other hand, get corrupted by the slightest change to the content while allowing authorized operations. In other words, fragile watermarking allows legitimate operations on the data and detects any unauthorized changes. One of the main requirements of a successful watermark is that it be inconspicuous. In its simplest form, watermark bits can be stored in the least significant bits of an image, for example.
Recently, there have been several studies on protecting data integrity in relational databases. The detection rate depends on the type and the magnitude of the modification. Most of the existing techniques for protecting relational database integrity store the watermark by distorting bits of the protected attributes (Agrawal and Kiernan, 2002; Pérez Gort et al., 2017). The most serious limitation of these integrity protection techniques is that they introduce distortions to the watermarked attributes; that is, they alter the data values themselves. While such watermarking can be used in applications that tolerate distortion, e.g., meteorological readings, it is not suitable for many other applications that deal with data like employee salaries, account balances, inventory values, medical data, etc.
This paper proposes a new watermarking scheme for detecting unauthorized modification in columnar databases. The proposed fragile watermarking scheme works on both uncompressed and compressed columnar databases without the need to decompress the data. The proposed algorithm can detect, with high probability, unauthorized changes to attribute values. To the best of the authors' knowledge, there is no published study on watermarking columnar databases for protecting data integrity. The basic idea behind the proposed scheme is to organize the data of each column into groups of fixed size. Initially, groups are rearranged according to a predetermined order. The watermarking algorithm then sorts each group into an order that corresponds to the value of a secret digital watermark. Unauthorized users, who do not know the secret watermark, can still change any of the attribute values in the group; however, these blind changes will, with high probability, disturb the secret order. Authorized users who know the secret watermark, on the other hand, re-adjust the group after making any change. This paper makes the following contributions:

- It proposes a watermarking scheme to protect the data integrity of columnar databases.
- The proposed scheme introduces zero distortion to the protected attributes.
- The watermark insertion and detection algorithms work directly on compressed data without the need to decompress.
- The integrity detection algorithm can identify, with high probability, the attacked attribute value.
- It provides comprehensive simulation experiments on the performance of the proposed scheme using synthetic and real data.
2 RELATED WORK
One of the first works in distortion-free database watermarking for tamper detection was by Li et al. (Li et al., 2004). The scheme relied on creating a permutation of tuples in the database. The authors proposed virtual grouping of the tuples in the table with the help of a cryptographic hash function: each tuple's hash value is compared with the group's hash value to decide the tuple's membership in a virtual group. Each group is watermarked separately. The watermarking is based on swapping the order of a tuple with its neighbouring tuple depending on the tuples' hash values. For instance, if the watermark bit value is zero and the first tuple's hash value is greater than the second tuple's hash value, then the pair is switched in location. This scheme can only localise the attack up to the group size; the attacked tuple's exact location cannot be determined. Furthermore, the update process is costly due to the involvement of cryptographic hash functions. Moreover, every group has a watermark that must be stored separately. Only the relationships between consecutive pairs are compared, and hence an attack on indirect neighbour values may not be detected.
Khan and Husain (Khan and Husain, 2013) proposed a fragile scheme based on characteristics of the data, such as the frequency distribution of digit counts and the length and range of numerical data values. They relied on a third party, such as a certification authority, to store the watermark value in order to achieve the zero-watermarking (i.e., distortion-free) property. The scheme proposed by Khan and Husain is similar to distortion-free schemes based on a hash value of the database that is stored with a third party. Such techniques cannot localize an attack, and therefore must treat the whole relation as malicious whenever an attack is detected.
Bhattacharya and Cortesi (Bhattacharya and
Cortesi, 2009) also proposed a watermarking scheme
which detects malicious changes. The main differ-
ence between Li et al. and Bhattacharya and Cortesi
is that the former virtually groups the tuples based on
the hash values of the primary keys and the secret key,
while the latter carries out grouping based on categor-
ical attribute values.
The public watermarking scheme proposed in Li and Deng (Li and Deng, 2006) embeds the watermark without limiting the data type of attributes to numerical or categorical. The scheme first creates a relation W out of the original relation T. The relation W is of size n × η, where η is less than or equal to the number of attributes. The attribute values in W are the most significant bits (MSBs) of those in T. The relation W is made public such that anyone can verify the authenticity of the MSBs of the relation T. The drawbacks are that: (1) if an attacker changes the LSBs of T, it will not be detected; and (2) publishing W can be costly in terms of space for large databases with millions of tuples.
All the fragile watermarking methods for tamper detection in relational databases mentioned above (Li et al., 2004; Khan and Husain, 2013; Bhattacharya and Cortesi, 2009; Li and Deng, 2006) perform either detection only, or detection with localisation of an attack only up to a virtual group. (Kamel et al., 2016) proposed a scheme that can localise the attack to the granularity of one or two candidate tuples.
Unfortunately, all of the above-mentioned techniques fail to cater for the underlying architecture and data compression schemes of DBMSs. While most DBMSs rely on their own storage and query processing strategies for performance optimization, the watermarking techniques proposed in the last few years (Li et al., 2004; Li and Deng, 2006; Zhang et al., 2006; Wang et al., 2008; Bhattacharya and Cortesi, 2009; Hu et al., 2009; Rao and Prasad, 2012; Kamel et al., 2013; Franco-Contreras et al., 2014; Kamel et al., 2016; Rani et al., 2017) do not take into consideration the performance degradation caused by watermark embedding. Rani et al. (Rani et al., 2017) considered only the distributed database architecture on which their watermarking scheme is to be deployed; none of the previously mentioned authors considered the underlying compression scheme used by the database.
3 PROPOSED WATERMARKING SCHEME: COLUMNARWM
This section introduces the proposed watermarking scheme. The aim of this watermarking scheme is to detect and identify tampering of the data stored in the columnar database. The proposed watermarking scheme has the following desirable properties:

- Distortion free: the proposed scheme will not introduce any distortion or modification to the values of the underlying data.
- Compression independent: the proposed scheme can be applied directly to compressed data in a columnar DB.
- Allowing incremental updates: watermarked data can be updated by simply updating a small set of data.
- Blind: verification of the watermark's existence does not require knowledge of the original database.
- Modular: each column in the columnar database can be watermarked separately, without relying on the content of the other columns.
- Detect and localize attacks: the proposed scheme is able to detect attacks with a high tamper detection rate and localize the victim data element in most cases.
- Incur minimal performance degradation.
Each attribute in a columnar database is stored in a separate file (Abadi et al., 2008; Abadi et al., 2009; Abadi et al., 2012). The proposed scheme benefits from the redundant order of the data items of a column by hiding the watermark in the relative order of the data items. Each file (attribute) is processed separately; that is, the algorithm watermarks each column separately. Each column (attribute) is organized into groups of g data elements each, and each group is watermarked separately. Notice here that a group contains a set of values belonging to one attribute. The group is watermarked by reordering its data elements in such a way that the group's new order corresponds to some unique watermark value. The re-arrangement of data items in each group is done relative to a secret order called the "reference order", which can be any order, e.g., ascending order of data item values.
The proposed scheme ("ColumnarWM") reorders each database column with respect to a reference order I_o, according to a watermark value W. The reference order I_o and the watermark value W are kept secret; therefore, any unauthorized change will distort the watermark. ColumnarWM neither introduces any distortion nor affects the usability of the data: it simply stores the watermark value W in the relative order of the data items in a column of the database. Moreover, the scheme adds the watermark to each column separately, thereby guaranteeing modular access to each column, which is the main feature of a columnar database over a row database. ColumnarWM inserts the watermark without affecting the compression ratio of the compression technique used in the columnar database. The desirable properties of the proposed scheme, such as distortion-freeness, modular access, and unchanged compression, ensure that the scheme will not interfere with the columnar database's main operations.
The first step in the watermarking phase is to divide the entire column of a database into a number of groups, each containing g data elements of the column. This step is called grouping. Then each group is watermarked and verified independently. By organizing the column's data elements into groups and watermarking them independently, the proposed scheme ensures that the watermarking process is incremental: new data elements form a new group that is watermarked separately and independently of the rest of the column. The data elements in each group of the column are sorted with respect to the selected reference order I_o. Subsequently, each group is watermarked according to the watermark value W using the proposed watermark embedding algorithm WATERMARKEMBED. The new, re-ordered data elements of the group form the watermarked order I_w.
Since the order of the data elements in the group represents the watermark value W, there is a need for a one-to-one mapping between the watermark value W and the order of the data elements. We use the bijective mapping proposed in (Kamel et al., 2016; Kamel and Kamel, 2011), which uniquely maps each order of a list of entries to a numerical value.
Hence, given an ordered group of data elements, the watermark value W can be recovered. Authorized users can verify the integrity of an attribute simply by knowing the reference order I_o and the watermark value W. An integrity verification algorithm extracts the watermark from a group of data elements; if the watermark is correct, it can be concluded that the data at rest is intact; otherwise, it has been attacked.
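To make the bijection concrete, the following is a minimal sketch of one standard permutation-to-integer mapping over a factorial base (a Lehmer-code construction); the exact mapping used in (Kamel et al., 2016; Kamel and Kamel, 2011) may differ in its details, and the function names here are ours:

```python
from math import factorial

def perm_to_int(perm, reference):
    """Map a permutation of `reference` to a unique integer in [0, g!).

    `reference` plays the role of the secret reference order I_o;
    `perm` contains the same elements in some order.
    """
    remaining = list(reference)
    value = 0
    for i, item in enumerate(perm):
        idx = remaining.index(item)               # rank among remaining items
        value += idx * factorial(len(perm) - 1 - i)
        remaining.pop(idx)
    return value

def int_to_perm(value, reference):
    """Inverse mapping: recover the permutation that encodes `value`."""
    remaining = list(reference)
    perm = []
    for i in range(len(reference)):
        f = factorial(len(reference) - 1 - i)
        idx, value = divmod(value, f)
        perm.append(remaining.pop(idx))
    return perm

ref = ["c1", "c2", "c3", "c4", "c5"]
w = 37                                            # any W in [0, 5!) = [0, 120)
assert perm_to_int(int_to_perm(w, ref), ref) == w
```

Because the mapping is bijective, every distinct ordering of a group corresponds to exactly one integer watermark value and vice versa.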
Recall that the proposed scheme ColumnarWM hides the watermark in the relative order of the data elements of a group. Hence, theoretically, there are g! unique watermark values that can be encoded in the order of the data elements. On one side we have a group that contains g data elements c_1, ..., c_g; on the other side we have all possible values of the watermark W as decimal integers. There is no obvious pattern relating integer numbers to permutations of data elements, so we use the method proposed in (Kamel et al., 2016; Kamel and Kamel, 2011) for mapping unique integer values W of size r digits. The method relies on converting the integer-valued W to the factorial number base system. The main reason for converting the integer W to a factoradic-base number is the close connection between factoradic numbers and permutations of data values. Therefore, before embedding the watermark, we first convert the watermark value W from decimal format to factoradic format W_f. Then the group of data elements is ordered according to the reference order I_o. Finally, the watermark value W_f is embedded using algorithm WATERMARKEMBED. The authenticity of the database is assured by performing integrity verification using the reference order I_o after applying the de-watermarking algorithm, which is the inverse of the watermark embedding algorithm. The de-watermarking algorithm tries to reconstruct the column's reference order I_o from the watermarked order I_w. If the resulting order of each group conforms with the reference order I_o, then the database is authentic.
Some notations that will be used throughout this paper are summarized in Table 1. To describe the proposed scheme for protecting the integrity of a columnar database, we present the following in Section 3.1:

- an algorithm for inserting the watermark in a column;
- an integrity algorithm that will point out whether an unauthorized modification has been carried out on a protected column;
- a victim identification algorithm for limited cases (a sub-operation of the integrity check algorithm).

The following two sections present the two algorithms: WATERMARKEMBED and INTEGRITYVERIFICATION. The attributes to be watermarked are decided by the database owner. To process queries, a columnar database reads attribute A_i from disk pages of secondary storage.
Algorithm 1 (WATERMARKEMBED) is used for grouping, enforcing the reference order, and inserting the watermark in the groups of the columnar database.
Table 1: Frequently-used notations in this study.

Symbol       | Description
G            | A group of the relation R
g            | The number of data elements in a group
W            | Secret watermark value in decimal
I_o          | Reference order by which data elements' integrity is verified
P_d, P̂_d    | Probability and estimated probability of detecting an attack on the watermarked database
P_l1, P̂_l1  | Probability and estimated probability of localisation up to 1 victim (i.e., exactly pinpointing the victim)
P_l2, P̂_l2  | Probability and estimated probability of localisation up to 2 victims (i.e., either one of them is the victim)
3.1 Watermark Embedding
The watermarking procedure WATERMARKEMBED consists of the following sub-operations: (a) the column's data elements are organized into groups of size g; (b) each group's content is ordered according to the reference order I_o; and (c) the group's content is watermarked using the value W, which is known only to the database owner. For simplicity, let us assume that the attributes are stored in the pure column-store model, i.e., each attribute is stored in a separate file. Our proposed scheme can easily be extended to a multi-column columnar database, a simple extension of the pure-column model that stores more than one attribute in one file.
The core operation of the ColumnarWM scheme's embedding algorithm is a left-circular shift on subsets of the column's data elements, starting from the reference order. First, the embedding algorithm sorts the group's data elements according to the reference order. This step is crucial so that, when watermark extraction (de-watermarking) is carried out, the reference order can be used as a validation check for degradation of the watermark. Then the watermark value W_f is used to shuffle the data elements of the group to reflect the watermark. To further clarify the watermark embedding process, consider the group content {c_2, c_1, c_5, c_3, c_4} of size g = 5. In general, the whole column content is divided into groups of size g [line 2]. Let the group content after enforcing the reference order be {c_1, c_2, c_3, c_4, c_5} [line 5]. Let us further assume that the watermark value in the factoradic number system W_f is 421 [line 6]. Both the watermark value and the reference order are secret and known only to the database owner. Since the watermark value has three digits, the LeftCircularShift method will be executed three times in a loop [lines 7-8]. In the first iteration of the loop, all data elements of {c_1, c_2, c_3, c_4, c_5} are shifted left by four positions (the most significant digit of W_f), resulting in {c_5, c_1, c_2, c_3, c_4}. The data element c_5 is now in its final position and will not be considered in the circular shifts of the remaining two iterations of the loop. Hence, only the four elements {c_1, c_2, c_3, c_4} remain. Since the next digit of W_f is 2, these elements are shifted left by two positions, resulting in {c_3, c_4, c_1, c_2} (the whole group now looks like {c_5, c_3, c_4, c_1, c_2}). Similarly, the third iteration freezes c_3 and the remaining data elements are shifted left by 1, resulting in {c_1, c_2, c_4}. The final group output after completing the for-loop iterations is {c_5, c_3, c_1, c_2, c_4}. In general, each group can have a different watermark value W generated from a single secret provided by the database owner; however, this discussion assumes that the watermark value is the same for all groups (G_1, G_2, G_3, ..., G_m).
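The worked example above can be reproduced with a short sketch; `left_circular_shift` and `embed_group` are our own helper names, and the shrinking-window behaviour follows the loop description:

```python
def left_circular_shift(items, k):
    """Left-rotate a list by k positions."""
    k %= len(items)
    return items[k:] + items[:k]

def embed_group(group_sorted, wf_digits):
    """Apply left-circular shifts driven by the factoradic digits of W_f.

    After each shift, the element moved to the front of the current
    window is frozen and the window shrinks by one.
    """
    out = list(group_sorted)
    start = 0
    for d in wf_digits:
        out[start:] = left_circular_shift(out[start:], d)
        start += 1                      # freeze the new front element
    return out

# Worked example from the text: reference order c1..c5, W_f = 421
print(embed_group(["c1", "c2", "c3", "c4", "c5"], [4, 2, 1]))
# -> ['c5', 'c3', 'c1', 'c2', 'c4']
```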
3.1.1 Factorial Number System

Unlike traditional numbering systems, the factorial number system has a mixed base: the base of the i-th digit is different from the base of the j-th digit for i ≠ j (on the contrary, the base of all digits in the binary system is always 2). The value of the i-th digit is at most its base value, and the weight of the i-th digit equals i!. A factorial number with n digits is represented as:

$\sum_{b=1}^{n} a_b \cdot b! \qquad (1)$

where b indicates the digit index and a_b is a coefficient that can take values from 0 to b only. For example, the coefficient a in the least significant digit can take the values 0 or 1, the 2nd digit can take the values 0, 1, or 2, and the 3rd digit can take the values 0, 1, 2, or 3. This is different from traditional numbering systems, where each digit can take values from 0 to (base − 1). The integer (349)_10 in the decimal system can be represented in the factorial system as follows:

$(349)_{10} = (1 \times 1!) + (0 \times 2!) + (2 \times 3!) + (4 \times 4!) + (2 \times 5!) = (24201)_{factorial}$

In short, the factorial numbering system is a number system where the weight of the b-th place is b! and the allowed coefficients range from 0 to b.
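A minimal conversion routine, following the repeated-division construction implied by the example (digit weights 1!, 2!, ..., with the always-zero 0! digit omitted, as in the (24201) representation):

```python
def to_factoradic(n):
    """Convert a non-negative decimal integer to factoradic digits.

    Returns digits most-significant first; the digit with weight b!
    can take values 0..b. Example: 349 -> [2, 4, 2, 0, 1], i.e. (24201).
    """
    digits = []
    base = 2                       # divide by 2, then 3, 4, ...
    while n > 0:
        n, r = divmod(n, base)
        digits.append(r)
        base += 1
    return list(reversed(digits)) or [0]

assert to_factoradic(349) == [2, 4, 2, 0, 1]   # matches (24201)_factorial
```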
Algorithm 1: WATERMARKEMBED algorithm.

1: procedure WATERMARKEMBED(W, I_o)
2:   Divide a column into m groups (G_1, G_2, G_3, ..., G_m)
3:   C is a set of g data elements that belong to the same group in the original physical order
4:   for i ← 0 to m; i ← i + 1 do
5:     Sort(C_i, I_o)
6:     W_f ← factoradic(W)
7:     for i ← 0 to |W_f| − 1 do
8:       LeftCircularShift(W_f[d − i − 1], C[i : d])
9:   return (C)
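Putting the pieces together, the following sketch mirrors Algorithm 1 using the `to_factoradic` and `embed_group` helpers sketched earlier. It assumes an ascending reference order I_o and a single watermark value W (with W < g!) reused for all groups, as in the running example:

```python
from typing import List

def watermark_embed(column: List[int], w: int, g: int) -> List[int]:
    """Sketch of Algorithm 1: group the column, enforce the reference
    order within each group, then permute each group by the factoradic
    digits of the secret watermark W."""
    digits = to_factoradic(w)            # most-significant digit first
    out = []
    for start in range(0, len(column), g):
        chunk = column[start:start + g]
        if len(chunk) == g:              # watermark only full groups
            chunk = embed_group(sorted(chunk), digits)
        out.extend(chunk)
    return out
```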
3.2 Integrity Verification
This section shows how the integrity of the data residing in secondary storage is checked prior to its use by the DBMS. The integrity check can be conducted occasionally or on every read of a column by the DBMS. The idea is that, given a group of data elements of a column and the watermark value W, the integrity verification algorithm can recover the reference order I_o. The algorithm INTEGRITYVERIFICATION consists of the following sub-operations: (a) finding the group; (b) extracting the watermark from the group (the reverse process of WATERMARKEMBED); and (c) the ATTACKDETECT procedure, which detects and localises an attack on the column. To detect and localise an attack, we need to know I_o and W. First, all the data elements associated with a single group are retrieved [line 2]. Then the REVERSEWATERMARKEMBED algorithm is performed on the group using the secret W [line 4]. It is the reverse procedure of algorithm WATERMARKEMBED. For example, if WATERMARKEMBED inserted the watermark by left-circular shifting the values by the factoradic digits W = {W[0], W[1], W[2], ..., W[g−1]}, then the reverse algorithm simply right-circular shifts by the factoradic digits in reverse order, W = {W[g−1], W[g−2], W[g−3], ..., W[1], W[0]}.
Once the watermark is extracted from the group, the ATTACKDETECT procedure is used to detect an attack and localise the victim. ATTACKDETECT is a simple procedure that detects and localises an attack by checking each data element of the group. The group content is considered not attacked if all data elements in the group follow the reference order I_o. If any data element does not follow the reference order, its location index is extracted to pinpoint the victim of the attack. Detailed examples of how localisation is carried out on a group are presented in Scenarios 1 and 2 below. In summary, an attack is detected if the de-watermarked group C_r does not follow the reference order I_o that was originally used in WATERMARKEMBED.
Algorithm 2: INTEGRITYVERIFICATION algorithm.

1: procedure INTEGRITYVERIFICATION(W, I_o)
2:   Divide a column into m groups (G_1, G_2, G_3, ..., G_m)
3:   C is a set of g data elements that belong to the same group in the original physical order
4:   REVERSEWATERMARKEMBED(C_r, W)
5:   if ATTACKDETECT(C_r, I_o) == FALSE then
6:     return (C_r)
7:   else
8:     group C_r has been attacked at location index, index+1
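The reverse procedure and the order check can be sketched as follows, reusing `embed_group` from Section 3.1; an ascending reference order is again assumed:

```python
def right_circular_shift(items, k):
    """Right-rotate a list by k positions (inverse of a left rotation)."""
    k %= len(items)
    return items[-k:] + items[:-k] if k else list(items)

def reverse_watermark_embed(group, wf_digits):
    """Undo embed_group: unfreeze windows in reverse order and
    right-rotate each by the corresponding factoradic digit."""
    out = list(group)
    for i in range(len(wf_digits) - 1, -1, -1):
        out[i:] = right_circular_shift(out[i:], wf_digits[i])
    return out

def attack_detect(group):
    """Return the index of the first element violating the (assumed
    ascending) reference order, or None if the group is intact."""
    for i in range(len(group) - 1):
        if group[i] > group[i + 1]:
            return i + 1
    return None

# Round trip: an untampered group de-watermarks back to I_o
wm = embed_group([300, 500, 550, 620, 700], [4, 2, 1])
assert reverse_watermark_embed(wm, [4, 2, 1]) == [300, 500, 550, 620, 700]
```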
The following simple example clarifies the detection and localisation algorithms. There are two possible scenarios in detecting and localising an attack. After grouping, let G = {700, 900, 300, 550, 500, 620, 1000}, and assume that the reference order is ascending. The group content after enforcing the reference order is therefore G = {300, 500, 550, 620, 700, 900, 1000}. The group content should follow the reference order after the REVERSEWATERMARKEMBED algorithm is performed on the group.
Scenario 1 (Attribute Attacked, Detected and Identified): The attacker increases the 2nd element (indexing from 0), 550, by approximately 29.1 percent, so its value becomes 710. Consequently, the group content after REVERSEWATERMARKEMBED becomes G = {300, 500, 710, 620, 700, 900, 1000}. By inspecting the content, we can see that the 2nd element, 710, and the 3rd element, 620, are out of the reference order I_o; therefore, the attack is detected. To localise the attack, we check whether the 2nd element has been increased or the 3rd element has been decreased. We compare the two victim candidates with their respective immediate and distant neighbours to identify the victim. The immediate neighbours of the 2nd element (710) are the 1st element (500) and the 3rd element (620), and its distant neighbours are the 0th element (300) and the 4th element (700). Similarly, the immediate neighbours of the 3rd element are 710 and 700, while its distant neighbours are 500 and 900. Inspecting the distant neighbours of the 3rd element (500 and 900), we observe that they follow the reference order, i.e., 500 ≤ 620 ≤ 900. Inspecting the 2nd element's distant neighbours, it is clear that they do not follow the reference order, since 710 > 700 breaks the sequence 300, 710, 700. Thus, the 2nd element, 710, is the victim. (See Figure 1.)
Scenario 2 (Attribute Attacked, Detected and Identified Up To Two Victims): Once again, the 2nd element (indexing from 0) is attacked, but this time decreased by approximately 11 percent, from 550 to 490. The group content after REVERSEWATERMARKEMBED is G = {300, 500, 490, 620, 700, 900, 1000}. It is clear that the elements in the group do not follow the reference order I_o; hence, there has been an attack on the group. The relationship between the 1st and the 2nd elements does not conform to the reference order, so these two elements are the victim candidates. Inspecting the distant (next-to-immediate) neighbours of the candidate victims, we see that the 1st element follows the reference order, i.e., 500 ≤ 620, and so does the 2nd element, i.e., 300 ≤ 490 ≤ 700. Thus, in this scenario, we can only localise the attack up to two possible victims instead of pinpointing the exact one.
Figure 1: Scenario 1 explained on the number line. In the first line, the group content follows the reference order. In the second line, the 2nd element has been attacked, i.e., its value has been increased from 550 to 710. The attack can be detected because the change in value of 550 makes it fall outside its immediate neighbours' range of 500 to 620. Furthermore, the victim is identified because the change makes the value fall outside the distant-neighbour range, i.e., outside 300 to 700.
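The neighbour-comparison rule of Scenarios 1 and 2 can be written down as a small routine. This is our formalization of the prose above (an ascending reference order and a single modified element are assumed), not the paper's literal ATTACKDETECT procedure:

```python
def localise(group, i):
    """Localise the victim among candidates at positions i and i+1,
    where group[i] > group[i+1] after de-watermarking (ascending I_o).

    A candidate whose distant neighbours still bracket its value is
    consistent; a candidate falling outside its distant-neighbour range
    is the victim. If both stay consistent, report both candidates.
    """
    def consistent(j):
        lo = group[j - 2] if j >= 2 else float("-inf")
        hi = group[j + 2] if j + 2 < len(group) else float("inf")
        return lo <= group[j] <= hi

    ok_left, ok_right = consistent(i), consistent(i + 1)
    if ok_left and not ok_right:
        return [i + 1]               # right candidate is the victim
    if ok_right and not ok_left:
        return [i]                   # left candidate is the victim
    return [i, i + 1]                # undecided: up to two victims

g1 = [300, 500, 710, 620, 700, 900, 1000]   # Scenario 1: 550 -> 710
print(localise(g1, 2))                       # -> [2], exact victim
g2 = [300, 500, 490, 620, 700, 900, 1000]   # Scenario 2: 550 -> 490
print(localise(g2, 1))                       # -> [1, 2], two candidates
```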
In INTEGRITYVERIFICATION, the groups are first formed by associating the first g data elements with G_1; G_2 is the second batch of data elements in the column, and so on. Once a group is formed, its physical-order content C_r is de-watermarked using REVERSEWATERMARKEMBED. After REVERSEWATERMARKEMBED, the group C_r is checked for attack detection and localisation using ATTACKDETECT. If no attack is detected, the group content is forwarded for further query processing; otherwise it is dropped, based on the policies set in the DBMS.
3.3 Watermarking Compressed Data
One of the most important properties of the proposed watermarking scheme is that it can be applied to a compressed database; attack detection and victim identification can likewise be performed on the compressed version without the need to decompress the database. The proposed algorithms WATERMARKEMBED and INTEGRITYVERIFICATION can be used to watermark compressed as well as uncompressed data. With a simple pre-processing step, the algorithms can be extended to watermark compressed data, since our proposed scheme depends only on the ordering of the column's data elements.
The watermark embedding scheme presented in Algorithm 1 can easily be extended to MonetDB's patched dictionary compression. Each dictionary word can be considered an actual value in a page. Therefore, the 1- to 4-byte dictionary-word representations of column values can be sorted and relocated without affecting the compression scheme itself. Our experiments in Section 4 show that the simulation results achieved on uncompressed data using the original WATERMARKEMBED algorithm (Algorithm 1) are similar to those on MonetDB's compressed data using a slightly modified algorithm with a preprocessing step.
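As an illustration only, the sketch below uses a toy order-preserving dictionary as a stand-in for MonetDB's patched dictionary format (the real on-disk layout differs); it shows why the embedding can operate on the compact codes directly, reusing the `watermark_embed` sketch from Section 3.1:

```python
# Toy dictionary encoding standing in for patched dictionary compression.
values = [700, 900, 300, 550, 500, 620, 1000]
dictionary = {v: code for code, v in enumerate(sorted(set(values)))}
codes = [dictionary[v] for v in values]       # compressed column

# Because this toy dictionary is order-preserving, sorting and
# circular-shifting the codes is equivalent to doing so on the values,
# and the dictionary itself is never modified.
wm_codes = watermark_embed(codes, w=37, g=7)
```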
4 EXPERIMENTAL SETUP
To formalise the experimental setup, let P_d be the probability of successfully detecting an attack on a group, P_l1 be the probability of successfully localising the exact victim, and P_l2 be the probability of successfully localising the two possible victims in case the exact victim cannot be pinpointed. Let P̂_d, P̂_l1, and P̂_l2 be the corresponding estimated probabilities.
Let a successful event be the detection and localisation of an attack, and an unsuccessful event be an undetected (and hence not localised) attack. For each group size g, there are two Bernoulli distributions with parameters P_d and P_l1, while (1 − P_d) and (1 − P_l1) are the probabilities of unsuccessful detection and unsuccessful localisation, respectively.
All experiments in this study involve three general steps. First, the data are watermarked according to the watermarking scheme proposed in Section 3.1. Second, the watermarked data are attacked using one of the attack models presented in Section 4.1. Finally, the attacked data are de-watermarked, and we attempt to detect and localise the attack. Each experiment is repeated 1000 times to estimate the probabilities P̂_d and P̂_l1. The experiments are carried out with varying parameters of the watermarking scheme: the group size, the attack percentage, and the standard deviation of the synthetic data. Furthermore, multiple attack models are performed on the synthetic and the real data to estimate the probabilities of detection and localisation. For instance, when the relationship between the group size g and the estimated probabilities is examined, the experiment is repeated 1000 times for each group size g (with a different victim randomly selected each time) and the resulting averaged probabilities are calculated.
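This three-step loop can be sketched as a Monte Carlo estimator of P̂_d for the modification attack described in Section 4.1 below, reusing the helpers sketched in Section 3; the parameter defaults follow Table 2, and the function name is ours:

```python
import random

def estimate_detection(g=100, trials=1000, attack_pct=5.0,
                       mean=1000.0, std=100.0, w=37):
    """Monte Carlo estimate of P̂_d for a single-victim modification
    attack, mirroring the three-step experimental loop."""
    digits = to_factoradic(w)
    detected = 0
    for _ in range(trials):
        group = sorted(random.gauss(mean, std) for _ in range(g))
        wm = embed_group(group, digits)            # step 1: watermark
        i = random.randrange(g)                    # step 2: attack one value
        wm[i] *= 1 + random.choice([-1, 1]) * attack_pct / 100
        rec = reverse_watermark_embed(wm, digits)  # step 3: de-watermark
        if attack_detect(rec) is not None:
            detected += 1
    return detected / trials

print(estimate_detection())    # P̂_d under the default parameters
```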
4.1 Attack Models
For our proposed fragile watermarking scheme, the
following attack models are taken into consideration:
Modification Attack: Mallory (the attacker) increases or decreases the value of a protected numerical attribute by a specific percentage. A single victim data element is randomly picked from the set of data elements, and its value is attacked (changed).
Superset Attack: Mallory inserts new data elements
into the database such that they might affect the
database watermark. This attack model is simulated
by inserting new data elements at randomly selected
locations. The probabilities of detection and local-
isation are observed for the following numbers of
insertions: 1, 2, 4, 8, 16, 25, and 50.
Deletion Attack: Mallory deletes a subset of data el-
ements from the existing group. The data elements to
be deleted are selected randomly. The probabilities of
detection and localisation are observed for the follow-
ing numbers of deleted data elements: 1, 2, 4, 8, 16,
25, and 50.
4.2 Results on Synthetic Data
To study how watermarking of compressed data affects the probabilities of detection and localisation, we carry out experiments on both uncompressed and compressed synthetic data. The synthetic data are tested in order to find the relationship between the parameters (group size, distribution of the data, and standard deviation) and the probabilities of detection and localisation.
In the following experiments, the relationship between the probabilities of detection P̂_d and exact localisation P̂_l1 and one of the parameters is analyzed while the remaining parameters are held constant. For each experiment, the parameters are kept at their default values according to Table 2 unless mentioned otherwise.
4.2.1 Uncompressed Data
We first examine the proposed scheme's performance on raw uncompressed data. We test group sizes g ranging from 1 to 1000, with the other parameters kept at their default values according to Table 2 and only a single element attacked in the column. From Figure 2(a), it can be observed that the estimated probabilities of detection P̂_d and exact localisation P̂_l1 increase as the number of data elements in the group increases. The main reason behind this increasing trend is that, in a larger group, the attacked victim disturbs the reference order more easily, since there are more possible values drawn from the same distribution. Small group sizes are tested only to examine the relationship between P̂_d and P̂_l1 and g. It can be noticed from Figure 2(a) that P̂_d and P̂_l1 can reach 1.0 if g is chosen to be more than 200.
Figure 2: Experimental results for synthetic uncompressed data. Estimated probabilities of detection P̂_d and localisation up to 1 victim P̂_l1 and up to 2 victims P̂_l2 for (a) different group sizes, (b) different attack percentages, (c) different standard deviations, and (d) increasing group sizes simultaneously with increasing standard deviations.
The attack on the watermarked attribute values is detected if the reference order I_o is violated. Therefore, it is obvious that as the attack percentage increases, so do P̂_d and P̂_l1, as depicted in Figure 2(b). In this experiment, the attack percentage is varied over 0.5, 1, 3, 5, 10, 15, 20, 30, 50, and 100, while the other parameters are kept constant at their default values.
Moreover, detection and localisation of attacks rely on the differences between the victim and its immediate neighbours' values. Thus, as the standard deviation of the data values stored in the database increases, the differences between consecutive elements also increase, which negatively affects P̂_d and P̂_l1. (This effect can be resolved if the reference order is based on some cryptographic hash.) Such a trend is depicted in Figure 2(c). In reality, different database attributes have different standard deviations, so it is infeasible to make a fixed assumption about the data's standard deviation. The experiment in Figure 2(d) shows that, if we increase the group size as the standard deviation increases, the desired P̂_d and P̂_l1 can still be achieved.

Table 2: Default parameter values for synthetic data.

Data                | Distribution | Mean  | Standard Deviation | Attack Percentage | Group Size
Uncompressed        | Normal       | 1000  | 10% of mean        | 5%                | 100
MonetDB Compressed  | Normal       | 32768 | 10% of mean        | 5%                | 100
4.2.2 MonetDB Compression
In this section, we examine the proposed scheme's performance on data compressed with the MonetDB/X100 compression scheme. MonetDB stores the data as a patched dictionary, where the most frequent integer values are represented by 1 to 4 bytes. The vectors that MonetDB/X100 passes to the CPU contain 256 to 16K elements each, which provides optimal performance (Boncz et al., 2005). Therefore, to ensure good performance, the group size g can be kept close to the vector size.
The original synthetic data are not in compressed form. As a preprocessing step, the synthetic data are virtually converted into a 2-byte representation similar to MonetDB's patched dictionary compression scheme. The rest of the experiment is performed in the same manner as for the uncompressed data.
It can be observed in Figure 3(a) that the estimated probabilities of detection P̂_d and localisation P̂_l1 increase with the group size, as expected. Furthermore, these probabilities decrease as the standard deviation increases, as depicted in Figure 3(b). We have observed that the behaviour of our proposed scheme on the MonetDB compressed data is similar to that on the uncompressed data. Therefore, it is confirmed that, with some minor changes, the proposed Algorithm 1 is easily adaptable to MonetDB's patched dictionary compression with a minimal effect on its operations.
Figures 3(a) and 3(b) are for a single-element attack, where the attacker randomly picks a single attribute value and either increases or decreases the original value by an attack percentage.
Similarly, Mallory can insert new attribute values (data elements) into the database. Such an attack can always be detected, because it changes the group's size, and it can be localised if the inserted values disturb the reference order I_o. However, as the number of insertions increases, the estimated probabilities decrease, as shown in Figure 3(c). This is because new values may be inserted alongside their new neighbours, thus maintaining the reference order. It should be noted that if the group size is chosen to be large enough, all these attacks can be handled and the desired detection and localisation can be achieved. The group size used for the experiment in Figure 3(c) is 100, which is minimal compared to a single page of MonetDB that can store up to 1K attribute values.

Figure 3: Experimental results for synthetic compressed data with MonetDB/X100 patched dictionary compression. Estimated probabilities of detection P̂_d and localisation up to 1 victim P̂_l1 and up to 2 victims P̂_l2 for (a) different group sizes, (b) different standard deviations, and (c) different sizes of insertion attack (superset attack).
The deletion attack experiment is not shown here because any deletion of data elements can be detected with an estimated probability of 1.0 but cannot be localised. Detection follows from the fixed group size, which clearly indicates that data element(s) are missing. Since our watermarking scheme relies on immediate neighbour values, when a data element is deleted, the neighbouring attribute values still follow the reference order; consequently, the deleted data elements cannot be localised.
4.3 Results on Real-world Data
We use the publicly available "Forest Cover Type" dataset (UCI Knowledge Discovery in Databases Archive, 1999). This dataset was used for watermarking experiments in (Agrawal and Kiernan, 2002) and many other previous works. The dataset contains 581012 tuples. From the 61 attributes in the dataset, we focus on two numerical attributes: "Elevation" and "H_Dist_To_Rd". Both attributes exhibit high standard deviations, which makes them attractive choices for testing the detection and localisation performance of the proposed scheme.
In the experiments, the relationship between the probabilities of detection P̂_d and exact localisation P̂_l1 and one of the parameters is analyzed. For each experiment, the parameters are kept constant, with the group size g set to 100, the attack percentage to five, and the number of subsets attacked to one, unless mentioned otherwise.
4.3.1 MonetDB Compression
For the MonetDB patched dictionary compressed data, the relationship between the estimated probabilities of detection and localisation and the group size g is analyzed first. It can be observed from Figure 4(a) that P̂_d and P̂_l1 increase with the group size. Furthermore, we tested the attribute H_Dist_To_Rd, which exhibits a very high standard deviation of √2431272 = 1559.25. From Figure 4(b), it can be seen that the estimated probabilities P̂_d and P̂_l1 do not reach 1.0 until the group size reaches 2000. This phenomenon is due to the attribute's high standard deviation. However, as we increase the group size, the probability of unsuccessful detection decreases exponentially, as depicted in Figure 4(c).

Figure 4: Experimental results for real-world data with MonetDB/X100 patched dictionary compression. Estimated probabilities of detection P̂_d and localisation up to 1 victim P̂_l1 and up to 2 victims P̂_l2 for (a) different group sizes in the Elevation attribute and (b) different group sizes in the H_Dist_To_Rd attribute. Sub-figure (c) shows the estimated probability of unsuccessful detection, 1 − P̂_d.
Most previously introduced watermarking schemes did not consider how the underlying data would be affected by compression. In this study, through experiments on both synthetic and real-world data, we have shown that the proposed watermarking scheme is compatible with the underlying architecture of a compressed columnar database such as MonetDB.
5 CONCLUSIONS
In this study, we first examined the architecture of the columnar database MonetDB and its associated compression schemes. Then, we introduced a fragile, distortion-free watermarking scheme for columnar databases that takes the compression schemes, as well as the other watermarking requirements, into account. All algorithms in the proposed watermarking scheme are designed to achieve high accuracy (detection rate) and high localisability (in pinpointing the tampered data).
Experimental results on both the synthetic and the real
data demonstrated that we have achieved those objec-
tives.
REFERENCES
Abadi, D., Boncz, P., Harizopoulos, S., Idreos, S., and Mad-
den, S. (2012). The design and implementation of
modern column-oriented database systems. Found.
Trends Databases, 5:197–280.
Abadi, D. J., Boncz, P. A., and Harizopoulos, S. (2009).
Column-oriented database systems. PVLDB, 2:1664–
1665.
Abadi, D. J., Madden, S. R., and Hachem, N. (2008).
Column-stores vs. row-stores: How different are they
really? In SIGMOD, pages 967–980.
Agrawal, R. and Kiernan, J. (2002). Watermarking rela-
tional databases. In VLDB, pages 155–166.
Asikuzzaman, M. and Pickering, M. R. (2017). An
overview of digital video watermarking. IEEE Trans-
actions on Circuits and Systems for Video Technology,
PP(99):1–1.
Bajpai, J. and Kaur, A. (2016). A literature survey - various
audio watermarking techniques and their challenges.
In 6th International Conference - Cloud System and
Big Data Engineering, pages 451–457.
Bhattacharya, S. and Cortesi, A. (2009). A distortion free watermark framework for relational databases. In CISOFT-EA, pages 229–234.
Boncz, P. A., Zukowski, M., and Nes, N. (2005). Mon-
etDB/X100: Hyper-pipelining query execution. In
CIDR, pages 225–237.
Camara, L., Li, J., Li, R., and Xie, W. (2014). Distortion-free watermarking approach for relational database integrity checking. Mathematical Problems in Engineering.
Franco-Contreras, J., Coatrieux, G., Cuppens, F., Cuppens-
Boulahia, N., and Roux, C. (2014). Robust lossless
watermarking of relational databases based on circu-
lar histogram modulation. IEEE Transactions on In-
formation Forensics and Security, 9(3):397–410.
Guo, H., Li, Y., Liu, A., and Jajodia, S. (2006). A fragile
watermarking scheme for detecting malicious modi-
fications of database relations. Information Sciences,
176(10):1350 – 1378.
Hu, Z., Cao, Z., and Sun, J. (2009). An image based algo-
rithm for watermarking relational databases. In 2009
International Conference on Measuring Technology
and Mechatronics Automation, pages 425–428.
Kamel, I. (2009). A schema for protecting the integrity of
databases. Computers & Security, 28(7):698 – 709.
Kamel, I., AlaaEddin, M., Yaqub, W., and Kamel, K.
(2016). Distortion-free fragile watermark for rela-
tional databases. International Journal of Big Data
Intelligence, 3(3):190–201.
Kamel, I. and Kamel, K. (2011). Toward protecting the
integrity of relational databases. In Internet Security
World Congress, pages 258–261. IEEE.
Kamel, I., Yaqub, W., and Kamel, K. (2013). An empiri-
cal study on the robustness of a fragile watermark for
relational databases. In 9th International Conference
on Innovations in Information Technology (IIT), pages
227–232.
Khan, A. and Husain, S. A. (2013). A fragile zero wa-
termarking scheme to detect and characterize mali-
cious modifications in database relations. The Scien-
tific World Journal. Article ID 796726.
Khanna, S. and Zane, F. (2000). Watermarking maps:
Hiding information in structured data. In 11th An-
nual ACM-SIAM Symposium on Discrete Algorithms,
SODA ’00, pages 596–605. Society for Industrial and
Applied Mathematics.
Lee, S.-J. and Jung, S.-H. (2001). A survey of watermarking
techniques applied to multimedia. In IEEE Interna-
tional Symposium on Industrial Electronics Proceed-
ings (ISIE), pages 272–277.
Li, Y. and Deng, R. H. (2006). Publicly verifiable ownership
protection for relational databases. In ACM Sympo-
sium on Information, Computer and Communications
Security, ASIACCS ’06, pages 78–89. ACM.
Li, Y., Guo, H., and Jajodia, S. (2004). Tamper detection
and localization for categorical data using fragile wa-
termarks. In 4th ACM Workshop on Digital Rights
Management, DRM ’04, pages 73–82. ACM.
Pérez Gort, M. L., Feregrino Uribe, C., and Nummenmaa, J. (2017). A minimum distortion: High capacity watermarking technique for relational data. In 5th ACM Workshop on Information Hiding and Multimedia Security, pages 111–121. ACM.
Potdar, V. M., Han, S., and Chang, E. (2005). A sur-
vey of digital image watermarking techniques. In 3rd
IEEE International Conference on Industrial Infor-
matics INDIN, pages 709–716.
Rani, S., Koshley, D. K., and Halder, R. (2017).
Partitioning-insensitive watermarking approach for
distributed relational databases. In Transactions on
Large-Scale Data-and Knowledge-Centered Systems
XXXVI, pages 172–192. Springer.
Rao, B. V. S. and Prasad, M. V. N. K. (2012). Subset selec-
tion approach for watermarking relational databases.
In Data Engineering and Management, pages 181–
188. Springer Berlin Heidelberg.
UCI Knowledge Discovery in Databases Archive (1999).
Forest CoverType.
Wang, C., Wang, J., Zhou, M., Chen, G., and Li, D. (2008).
Atbam: An arnold transform based method on water-
marking relational data. In 2008 International Con-
ference on Multimedia and Ubiquitous Engineering
(MUE), pages 263–270.
Zhang, Y., Niu, X., Zhao, D., Li, J., and Liu, S. (2006). Re-
lational databases watermark technique based on con-
tent characteristic. In First International Conference
on Innovative Computing, Information and Control -
Volume I (ICICIC’06), pages 677–680.