Hiding Traversal of Tree Structured Data from

Untrusted Data Stores

Ping Lin and K. Selçuk Candan

Department of Computer Sciences and Engineering

Arizona State University

Tempe, AZ. 85287

Abstract. With the increasing use of web services, many new challenges concerning

data security are becoming critical. Especially in mobile services, where clients are

generally thin in terms of computation power and storage space, computation can be

outsourced to a remote server, or the server can act as a data store. Unfortunately,

such a data store may not always be trustworthy, and clients with sensitive data and

queries may want to be protected from malicious attacks. In this paper, we present a

technique to hide tree-structured data from potentially malicious data stores, while

allowing clients to traverse the data to locate an object of interest without leaking

information to the data store. The two motivating applications for this approach are

hiding (1) tree-like XML data as well as XML queries that are in the form of tree paths,

and (2) tree-structured indexes and queries executed on such data structures. We show

that this task is achievable through a one-server protocol which introduces only a

limited and adjustable communication overhead. This is essential in low-bandwidth

(such as wireless) distributed environments. The proposed protocol has desirable

communication and concurrency performance, as demonstrated by the experiments we

have conducted.

Keywords: XML, content privacy, access privacy.

1 Introduction

In web and mobile computing, clients usually do not have sufﬁcient computation power

or memory and they need remote servers to do the computation or store data for them.

Publishing data on remote servers helps improve data availability and system scalability,

reducing clients’ burden of managing data. With their computation power and large

memory, such remote servers are called data stores or oracles. Typically, these data

stores cannot be fully trusted, for they may be malicious and may misuse

the information stored on them for profit. Clients with sensitive data (e.g., personally

identifiable data) may require that their data be protected from such data storage oracles.

This leads to encrypted database research [1, 2], in which sensitive data is encrypted, so

the content is hidden from the database. This property is called content privacy [3].

This work is supported by the AFOSR grant #F49620-00-1-0063 P0003.

Lin P. and Selçuk Candan K. (2004).

Hiding Traversal of Tree Structured Data from Untrusted Data Stores.

In Proceedings of the 2nd International Workshop on Security in Information Systems, pages 314-323

DOI: 10.5220/0002685203140323

 SciTePress

Sometimes not only the data outsourced to a data store but also the queries are of value,

and a malicious data store can exploit such information for its own benefit. Protecting

queries in this way is called access privacy [3]. Typical scenarios demanding access privacy

include:

– A mineral company wants to hide the locations to be explored when retrieving

relevant maps from the IT department’s map database.

– In a stock database, the kind of stock a user is retrieving is sensitive and needs to

be kept private [4].

This leads to private information retrieval [4] research, which studies how to let users

retrieve information from a database without leaking (even to the server) the location of

the retrieved data item.

Trees are an important data structure, and tree-structured data appears

in many application domains. In this paper, we address outsourcing and hiding of

tree-structured data and queries on this data. For this work, we have two motivating

applications: (1) hiding XML data that is stored in the form of trees and XML queries

in the form of tree paths; (2) hiding tree indexed data and queries for the data.

In this paper, we concentrate on hiding tree structured data and traversal of trees

from oracles. Noticing that existing private information retrieval techniques require ei-

ther heavy replication of the database onto multiple non-communicating servers or large

communication costs [4], we give a one-server tree-traversal protocol that provides

a balance between the communication cost and security requirements. To protect the

client from the malicious data store, some tasks (such as traversing the tree structures)

are delegated to the client.

This paper: In Section 2 we present a general overview of the framework and the

outline of the hidden data access. In Section 3, we discuss how redundancy enables

oblivious traversal of a tree structure. In Section 4, we address the underlying technical

challenges and provide a traversal algorithm. In Section 5 we give a quantitative analysis

of the protocol and discuss how to tune the various system and security parameters to

optimize the performance. We implement the protocol and analyze experiment results

in Section 6. Section 7 discusses the amount of security the protocol can achieve and

suggests ways to improve the security of the protocol in the future. Finally, we conclude

in Section 8.

2 Overview of the Hiding Framework

In this section, we first give a general overview of the hiding framework. We then

provide an outline of the proposed hidden data retrieval protocol.

There are three types of entities with different roles in the system: data owners,

licensed users, and a data store (oracle). The data owners and licensed users are thin

clients (as explained before). A data owner has the right to publish its data on the or-

acle, and a licensed user has the permission granted by some data owner to retrieve

information from the data owner’s data storage space in the oracle. The oracle manages

data storage spaces, where data and tree structures are stored in a hidden way.

Clients run data encryption algorithms and hold the initial secret keys for decryption. Encryption

algorithms are used to encrypt data and tree structures before sending them

to the oracle to ensure that the content of data and the data structure are hidden from

the oracle. If clients are accessing an outsourced index tree, they have point- or range-

queries. If they are accessing outsourced XML trees, they have query patterns. Query

patterns are used to traverse a tree structure along paths described by regular-expression-like

path expressions. These tasks are accomplished efficiently by "thin" clients with the help

of specialized embedded hardware, such as smartcards, distributed to licensed users by

data owners. Smartcards are widely used in mobile computing and are relatively

cheap, costing no more than several dollars. Such embedded hardware also helps

solve the secret key distribution problem: by distributing smartcards that contain

secret keys, a data owner distributes keys to licensed users [5].

Every time the data owner wants to insert new data into the tree structure or delete

a data item from it, the owner

1. encrypts the data with a secret key,

2. walks the index structure in an oblivious manner so that the traversal path is hidden from the

data store,

3. locates the node of interest (either for insertion or deletion),

4. updates the tree structure by inserting or deleting encrypted index or data nodes in proper

positions in the tree, in an oblivious way with respect to the data store.

By walking or updating the tree structure in an oblivious way with respect to the

data store, we mean minimizing the leakage of information about the data and the tree

structure as much as possible; the details of how to walk and update tree-structures in

an oblivious way are described in Section 4.

Client traversal of the tree for retrieving information is similar to an update: to

prevent the database server from differentiating between read and write operations,

a read operation is always implemented as a read followed by writing the contents

back.

3 Oblivious Traversal of the Tree Structure

The content of the nodes of a tree structure is easily hidden by encrypting them

before they are passed to the data store, so their content is already hidden

from a malicious store. However, if a client traverses the tree structure in a plain way,

the relationships between nodes in the tree, and therefore the tree structure as well as the

user’s query, are revealed. We propose two adjustable techniques to achieve oblivious

traversal of tree structures: access redundancy and node swapping.

Access Redundancy: Access redundancy requires that each time a client accesses a

node, instead of simply retrieving that particular node, it requests from the server a set

of m − 1 randomly selected nodes in addition to the target node. Consequently, the

probability with which the data store will guess the intended node is $1/m$. m is an adjustable security

parameter. We discuss how to choose the value of m in Section 5. We

call this set the redundancy set of the target node.
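As an illustration, the construction of a redundancy set can be sketched in Python as follows. The function name redundancy_set, the use of integer node addresses, and the uniform choice of decoys are assumptions of this sketch and are not fixed by the protocol description.

```python
import random

def redundancy_set(target, num_nodes, m, rng=random):
    """Return m distinct node addresses: the target plus m - 1 random decoys."""
    if not 1 <= m <= num_nodes:
        raise ValueError("m must be between 1 and num_nodes")
    # Decoys are drawn uniformly from all addresses other than the target.
    decoys = rng.sample([a for a in range(num_nodes) if a != target], m - 1)
    request = decoys + [target]
    rng.shuffle(request)  # do not reveal the target through its position
    return request
```

Seen in isolation, each of the m addresses in such a request is equally likely to be the target, which is where the guessing probability of 1/m comes from.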

The problem with redundancy sets, on the other hand, is that their repeated use can

leak information about the target node. For example, if the root node’s address is ﬁxed,

(a) After the first access (b) After the second access (c) After the third access

Fig. 1. Leakage of the position of root node of index as a result of repeated accesses

then multiple access requests for the root node reveal its position (despite the use of

redundancy) since the root is always in the first redundancy set any client requests.

By intersecting all the redundancy sets, the data store can learn the root node. The

situation is depicted in Figure 1. If the root is revealed, the risk that its children may be

exposed is high, and so is the risk for the whole tree structure.
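The intersection attack can be reproduced with a small simulation. This is a sketch under assumed parameters (integer addresses, uniformly chosen decoys); the function names are hypothetical.

```python
import random

def observe_root_requests(root, num_nodes, m, accesses, rng):
    """Redundancy sets a server would see if the root never changes address."""
    others = [a for a in range(num_nodes) if a != root]
    return [set(rng.sample(others, m - 1)) | {root} for _ in range(accesses)]

def intersect_all(requests):
    """Addresses common to every observed request."""
    candidates = set(requests[0])
    for request in requests[1:]:
        candidates &= request
    return candidates
```

With a storage space that is large relative to m, a handful of observed requests is typically enough for the intersection to shrink to the root address alone, since decoys rarely repeat across requests.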

(a) After the first access (b) After the second access (c) After the third access

Fig. 2. The movement of a node

Node Swapping: Consequently, in order to prevent the server from using an attack

based on intersecting repeated or related requests, we have to move nodes each time

they are accessed. Preferably, the move should have minimal impact on the tree structure

and should not leak information about where a given node is moved to. To achieve this,

each time a client needs to access a node from the server, it requests from the server a

redundancy set consisting of m nodes that includes at least one empty node along

with the target node. The client then

1. decrypts the target node,

2. swaps it with the empty node,

3. re-encrypts the redundancy set and writes it back.

Figure 2 shows how this approach prevents information leakage: Figure 2(a) shows

that after the ﬁrst access, the position of the target node is moved (the arrow shows

the node’s movement). Figure 2(b) and 2(c) show that after the second and the third

accesses, the position of the target node is moved again. As shown in Figure 2, during

the course of an access, the oracle has the chance to know the position of the node

only if the redundancy set for the access has little intersection with the set of the

previous access so that the position where the node moved to after the previous access

is revealed. But since the node moves again once the nodes are written back after the

access, such leakage is of no use to the server. In this way, the possible position of

the target node is randomly distributed in the data storage space and thus the repeated-

access-attack is avoided.
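The node movement described above can be sketched as follows. Encryption is left out to keep the sketch short, and the flat list storage, the EMPTY marker, and the function name access_and_swap are assumptions made for illustration.

```python
import random

EMPTY = None  # marker for an unused slot (an assumption of this sketch)

def access_and_swap(storage, target, m, rng):
    """Read a redundancy set, move the target into an empty slot, write back.

    Returns (payload, new_address, addresses_seen_by_server).
    """
    empties = [a for a, v in enumerate(storage) if v is EMPTY]
    if not empties or m < 2:
        raise ValueError("need at least one empty slot and m >= 2")
    empty = rng.choice(empties)
    others = [a for a in range(len(storage)) if a not in (target, empty)]
    request = rng.sample(others, m - 2) + [target, empty]
    rng.shuffle(request)
    payload = storage[target]
    storage[empty], storage[target] = payload, EMPTY  # the swap
    return payload, empty, request
```

After each call the caller must remember the returned new address, because the old address now holds an empty slot; in the full protocol this bookkeeping is the parent pointer update.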

Node swapping requires re-encryption of nodes before they are re-written to the

server. Re-encryption should employ a new encryption scheme or key, for the following

reason: if the same encryption scheme and key are used, then by comparing the content of nodes in

the redundancy set after rewriting with their original content, the server can easily

identify the new position of the node. This means that a client has to know how each

node is encrypted. We achieve this by adding to each node a new field that contains the secret

key for that particular node. This field is always encrypted using a single, fixed secret

key. This way, the client can decrypt this field to learn how to decrypt the rest of the

node.
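The node layout with a per-node key field can be illustrated with the following sketch. The XOR keystream built from SHA-256 is only a toy stand-in for a real cipher and is not secure; the 16-byte key length and the names seal_node and open_node are likewise assumptions. A real client would use a vetted authenticated cipher instead.

```python
import hashlib
import secrets

KEY_LEN = 16  # assumed length of the per-node key field

def _keystream_xor(key, data):
    """Toy cipher: XOR data with a SHA-256 counter keystream (NOT secure)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)

def seal_node(master_key, payload):
    """Encrypt payload under a fresh node key; prepend the encrypted key field."""
    node_key = secrets.token_bytes(KEY_LEN)           # new key on every write-back
    key_field = _keystream_xor(master_key, node_key)  # always under the fixed key
    return key_field + _keystream_xor(node_key, payload)

def open_node(master_key, blob):
    """Recover the node key from the key field, then decrypt the rest."""
    node_key = _keystream_xor(master_key, blob[:KEY_LEN])
    return _keystream_xor(node_key, blob[KEY_LEN:])
```

Because a fresh node key is drawn on every write-back, sealing the same payload twice yields different bytes, so the server cannot match rewritten nodes to their earlier contents by comparison.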

4 Hidden Tree Traversal Algorithm

To implement oblivious traversal of a tree structure, some critical issues have to be solved:

– After moving one node, in order to maintain the integrity of the tree structure,

the parent’s pointer to this node has to be updated accordingly. How can this be

performed without revealing parent-child relationships on the tree structure?

– How can the consistency of a tree structure be maintained when many clients access it

concurrently?

– How can we choose the values of various system parameters, such as the amount

of redundancy m?

In this section, we provide techniques to address the ﬁrst two of these challenges, and

provide hidden retrieval algorithms based on them and the underlying protocol. In Sec-

tion 5, we will discuss the choice of system parameters in greater detail.

Maintaining Parent/Child Relationships: As to the challenge of maintaining node/parent-

node relationships after node swapping, we propose the following solution: ﬁnd the

empty node to be swapped with the child node and update the parent node correspond-

ingly before actually moving the child node. This way, parents are always updated con-

sidering the future locations of their children.

Concurrency Control without Deadlocks: The proposed protocol will be applied to

web-based mobile computing environments with a large number of clients. In order to

keep the tree structure consistent when many clients access it simultaneously,

proper concurrency control must be used on the server side. Index locking has been

studied intensively, with the aim of maximizing concurrency while preserving the

integrity of the tree structure [6–8]. Since there is no pure read operation in the

scheme (each node, after being read, is written back), only exclusive locks are

needed. To prevent deadlocks, we organize the nodes in a data owner’s data storage space

into d levels. Each level of a data owner’s data storage space requires an empty node

list to maintain the empty nodes at this level. A client always asks for locks on parent-level

nodes before asking for locks on child-level nodes, and it always asks for locks on nodes

belonging to the same level in some predetermined order (e.g., in ascending order of

node ids). In this way, all nodes in a data owner’s data storage area are accessed by all

clients in a fixed predetermined order. This ensures that circular waits cannot occur,

hence deadlocks are prevented.
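The lock ordering can be sketched with Python threads standing in for clients. The class LevelledLocks, the helper run_clients, and all parameter values are assumptions of this sketch; a real data store would enforce the same order on the server side.

```python
import random
import threading

class LevelledLocks:
    """One exclusive lock per (level, node id)."""

    def __init__(self, levels, nodes_per_level):
        self._locks = {(lv, n): threading.Lock()
                       for lv in range(levels) for n in range(nodes_per_level)}

    def acquire_level(self, level, node_ids):
        ordered = sorted(set(node_ids))   # ascending node ids within a level
        for n in ordered:
            self._locks[(level, n)].acquire()
        return ordered

    def release_level(self, level, node_ids):
        for n in node_ids:
            self._locks[(level, n)].release()

def run_clients(num_clients, levels=3, nodes_per_level=20, m=5):
    """Run concurrent top-down walks and return how many finished."""
    locks = LevelledLocks(levels, nodes_per_level)
    finished = []

    def client(seed):
        rng = random.Random(seed)
        held = None
        for level in range(levels):
            ids = locks.acquire_level(level, rng.sample(range(nodes_per_level), m))
            if held is not None:
                # The parent level is released only after the child level is locked.
                locks.release_level(level - 1, held)
            held = ids
        locks.release_level(levels - 1, held)
        finished.append(seed)

    threads = [threading.Thread(target=client, args=(s,), daemon=True)
               for s in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return len(finished)
```

Every client acquires locks in increasing (level, node id) order and never requests a lower-ordered lock while holding a higher one, so no cycle of waiting clients can form.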

In Figure 3, we provide the pseudo code of the oblivious traversal algorithm. The

time complexity for this algorithm is O(d×m), with d denoting the depth of tree storage

space and m denoting the redundancy set size, and the space complexity for it is

O(m).

[Oblivious traversal algorithm]

Input: feature values of target data and the identiﬁer of the data owner.

Output: pointer to the node that contains the data if there exists one; or null pointer.

1. lock and fetch the fixed public entry node of the data store, let it be PARENT; find the

root, let it be CURRENT.

2. select a redundancy set for CURRENT, lock the nodes in the set, let the empty

node in the set be EMPTY.

3. update PARENT’s pointer to refer to EMPTY, release the locks on the PARENT

level.

4. swap CURRENT with EMPTY.

5. if CURRENT contains the data, return CURRENT;

else

make CURRENT the new PARENT, find the child node to be traversed next (if there is

none, return the null pointer), let it be CURRENT, and repeat steps 2 to 5.

Fig. 3. Oblivious traversal algorithm
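The steps of Fig. 3 can be simulated in a single process as sketched below. Locking and encryption are omitted, the tree is assumed to be a plain binary search tree whose depth-k nodes live in storage level k, and the names Node, build and oblivious_find are hypothetical.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.children = {}   # "L" / "R" -> slot address in the next level

def build(keys, slots_per_level, rng):
    """Store a binary search tree so that depth k of the tree lives in level k."""
    levels, entry = [], {"root": None}   # entry plays the fixed public entry node
    for key in keys:
        parent, side, depth, addr = None, None, 0, entry["root"]
        while addr is not None:
            parent = levels[depth][addr]
            side = "L" if key < parent.key else "R"
            addr = parent.children.get(side)
            depth += 1
        if depth == len(levels):
            levels.append([None] * slots_per_level)
        slot = rng.choice([a for a, v in enumerate(levels[depth]) if v is None])
        levels[depth][slot] = Node(key)
        if parent is None:
            entry["root"] = slot
        else:
            parent.children[side] = slot
    return levels, entry

def oblivious_find(levels, entry, key, m, rng):
    """Walk the tree as in Fig. 3; trace lists the addresses the server sees."""
    trace = []
    parent, side, addr = None, None, entry["root"]
    for depth, level in enumerate(levels):
        if addr is None:
            return False, trace
        # Step 2: redundancy set with the target, one empty slot and m - 2 decoys.
        empty = rng.choice([a for a, v in enumerate(level) if v is None])
        decoys = rng.sample(
            [a for a in range(len(level)) if a not in (addr, empty)], m - 2)
        trace.append((depth, sorted(decoys + [addr, empty])))
        # Step 3: repoint the parent before the child actually moves.
        if parent is None:
            entry["root"] = empty
        else:
            parent.children[side] = empty
        # Step 4: swap the current node with the empty slot.
        node = level[addr]
        level[empty], level[addr] = node, None
        # Step 5: stop on a match, otherwise descend.
        if node.key == key:
            return True, trace
        parent, side = node, ("L" if key < node.key else "R")
        addr = node.children.get(side)
    return False, trace
```

Repeated lookups keep working even though every visited node changes address, because each parent pointer is rewritten to the child's future location before the move.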

5 Identifying Appropriate Values for the System Parameters:

Hiding a Single Query

Choosing the appropriate design parameter values for a hiding system depends on var-

ious system constraints, including the acceptable communication cost and the required

degree of hiding. Let us model a data owner’s data storage space as d levels. Suppose

the tree structure is an l-level tree. Then, the following parameters and constraints have

to be considered:

– the maximum probability, δ, for the server to be able to find the actual node that the

client is asking for from a redundancy set. We have: $1/m \le \delta$.

– the maximum probability, λ, for the server to find the path along which a client

walks the tree structure. We have: $(1/m)^l \le \lambda$.

We emphasize here that although it is easy for the data store to guess the target

node from the redundancy set if m is small, it is much harder to guess

the parent-child relations between sequential node accesses. The probability of

discovering a path decreases exponentially with the length of the path,

and hence should be slim even with a small value of m.

– the total communication cost ε clients are allowed to incur for each data retrieval.

We have: $(\mathrm{read}(m) + \mathrm{write}(m)) \times l \le \varepsilon$, where read(m)/write(m) denotes the

communication cost to read/write m nodes from/to the server.

– a node may contain multiple data points. We denote the node size, i.e. the number

of data points a node is able to contain, as s. Value of s can be determined by

considering the following:

Let c denote the function of one round-trip communication cost for data points to be

received from and sent to the server, e and d denote the encryption and decryption

cost functions, and w and r denote the write and read cost functions. Theoretically, they

are linear functions. Then:

total_cost_for_data_retrieval

= tree_depth * m * (communication + decryption + encryption + read + write cost per node)

= l * m * (c(s) + d(s) + e(s) + r(s) + w(s))

As node size s increases, tree depth l decreases while the cost per node increases.

If all other parameters are known, we can calculate the optimal node size to minimize

the total cost. However, as s increases (and l therefore decreases), the probability for the data store to find a

path, which is $(1/m)^l$, increases. Therefore, the value of s should be carefully chosen

to ensure that the security requirement is satisfied and the total cost is minimized as

much as possible.
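The trade-off between tree depth and per-node cost can be explored numerically. In the sketch below, the binary-tree depth model, the lumping of the five linear cost functions into one fixed and one per-point term, and all function names are assumptions made for illustration only.

```python
import math

def tree_depth(num_points, s):
    """Levels of a binary tree whose leaves hold up to s points (assumed model)."""
    leaves = math.ceil(num_points / s)
    return max(1, math.ceil(math.log2(leaves)) + 1)

def total_cost(num_points, s, m, fixed, per_point):
    """l * m * (c(s) + d(s) + e(s) + r(s) + w(s)), with the linear costs lumped."""
    return tree_depth(num_points, s) * m * (fixed + per_point * s)

def best_node_size(num_points, m, fixed, per_point, candidates):
    """Candidate node size with the smallest modelled total cost."""
    return min(candidates,
               key=lambda s: total_cost(num_points, s, m, fixed, per_point))
```

A node size chosen this way still has to be checked against the path probability bound, since a larger s means a shallower tree.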

Note that most of the above constraints are linear, and an appropriate parameter

setting can be easily identiﬁed using efﬁcient algorithms.
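For example, the smallest admissible redundancy set size can be found by a simple search, and the communication budget checked afterwards. The sketch assumes that reading and writing m nodes each cost m times a constant per-node cost, and that m is at least 2 so that a redundancy set can hold both the target and an empty node; the function names are hypothetical.

```python
def smallest_m(delta, lam, l):
    """Smallest m >= 2 with 1/m <= delta and (1/m)**l <= lam."""
    m = 2
    while 1 / m > delta or (1 / m) ** l > lam:
        m += 1
    return m

def fits_budget(m, l, per_node_cost, eps):
    """(read(m) + write(m)) * l <= eps with read(m) = write(m) = m * per_node_cost."""
    return 2 * m * per_node_cost * l <= eps
```

With δ = 0.25, λ = 0.001 and a 5-level tree, m = 4 already satisfies both probability bounds, because (1/4)^5 is slightly below 0.001.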

6 Experiment Results

To validate the protocol, we simulated it and conducted a set of experiments.

The computing environment consisted of a Linux server acting as a

data store and a 1.0 GHz laptop with 256 MB of memory generating client requests. They were connected

via a Wireless LAN system. We implemented a 2-dimensional k-d tree as the index

structure due to its simplicity. This simple structure enables us to observe experiment

results more effectively.

In this paper, we do not experiment with range queries, as we focus on path traversal.

We point out that using this protocol, range queries can be implemented as multiple

path traversals without deadlocks. We generated 40000 data points that were uniformly

distributed in the region (0,0) to (1000000, 1000000), and stored them into a data stor-

age space with capacity 30000 nodes. The size of redundancy set, m, is set to

Response time and node size: We executed a set of experiments to show the relationship

between node size and response time, i.e., the time between a client sending a data

retrieval request and getting the response.

Figure 4(a) shows the experiment result. In this ﬁgure, there are two sets of results.

The dark points denote the results of experiments with encryption/decryption imple-

mented by software. This set of results shows that when node size is set to around 50

data points, the minimum response time (about 38s) is achieved. This result

is consistent with the theoretical observation that an optimal node size must exist (Section 5).

Considering the probability for the malicious server to find the path (we denote it as

path probability, which is a function of node size, $(1/m)^{\log(num/s)}$, where m is the redundancy

parameter, num is the total number of data points stored, and s denotes node size), a suitable

node size can be chosen to satisfy security requirements and minimize response time.

The set of white points depicts experiments with efficient hardware encryption/decryption.

From the result, we found that encryption and decryption constitute a heavy cost, and that with

hardware assistance the response time can be greatly reduced, to about 8s.

(a) response time and node size (b) ratio of sequential process and concurrency control

Fig. 4. Experiment result

To compare our protocol with the one-server Private Information Retrieval (PIR) technique

[4], we also simulated PIR by transferring the whole database to a client. The

simulation was conducted in the same computing environment (same Linux server, same

laptop, same Wireless LAN connection). The transfer took about 3643s, compared with

response times of about 8s to 38s for our protocol, so our protocol is much more efficient in this setting.

Another interesting phenomenon we observe from Figure 4(a) is that although the

two sets of points differ greatly in their values, they have a similar zigzag pattern.

This shows that the discontinuities and sharp variations in response time values are mainly

determined by costs other than encryption/decryption (communication cost c(s), write cost w(s),

read cost r(s)) (Section 5).

We also notice that the response time for the set of black points has a strong tendency

to increase with the node size, while it increases only slightly for the white points. This can be

explained by the significant part encryption/decryption plays in the total cost and its

linear increase with the node size (Section 5).

Furthermore, we conducted a set of experiments to show the effect of concurrency

control. In this set of experiments, 50 retrieval requests for independently selected random

data points were issued one by one at intervals varying from 10ms to

300ms. We observed no deadlocks in these experiments. We also found that the

total time to finish all the requests was much less than when the server processed those

retrievals sequentially. As a sample result, when requests were issued every

20ms, the total time required to finish them was 734.8s, and the time to process them

sequentially was 1442.9s. Figure 4(b) gives the ratio of the time required to process the requests

sequentially to the time required by our protocol with concurrency control. We can see

that the ratio is about 2, which means that concurrency control roughly halves the total processing time.

Figure 4(b) also shows that this ratio increases with the time interval. This is consistent

with the common knowledge that the efficiency of the data store decreases as more

clients access trees at the same time.
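The sample figures for the 20ms interval can be checked directly. The variable names below are chosen for this sketch; the two timings are the ones reported in the text.

```python
sequential_s = 1442.9   # time to process the 50 requests sequentially
concurrent_s = 734.8    # time with concurrency control, requests every 20ms

speedup = sequential_s / concurrent_s          # about 1.96
time_saved = 1 - concurrent_s / sequential_s   # about 0.49
```

A ratio close to 2 thus corresponds to roughly twice the throughput, or equivalently to about half the elapsed time.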

7 Future Work on Hiding Correlated Queries

The protocol should be able to protect queries and the tree data structure from a polynomial-time

server. To study the security guarantee the protocol provides, suppose that

the server keeps a history of all redundancy sets users retrieved, and the server

tries to infer information about queries and data by statistical analysis of the history. We call each

redundancy set a call, and the history a view of the server. The security requirements

are defined as follows:

1. For any two different queries $Q_1$ and $Q_2$ posed in the view, the distributions of their

sequences of calls are indistinguishable in polynomial time.

2. For any two queries $Q_1$ and $Q_2$ posed in the view, it is hard to tell whether they are

identical by observing their sequences of calls.

If the data storage space is randomly initialized and queries are uniformly posed, tree nodes

will always be uniformly distributed in each layer of the data storage space. So for two

different queries, if their query path lengths are equal, the distributions of their sequences

of calls are identical, hence indistinguishable in polynomial time; if their query path

lengths are not equal, clients can execute dummy calls at deeper levels so that they always make

the same number of calls. We are currently studying how to improve the protocol when

queries are not uniformly distributed.
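Padding a short query with dummy calls can be sketched as follows. The trace format (a list of (level, addresses) pairs), the uniform choice of dummy addresses, and the function name are assumptions of this sketch.

```python
import random

def pad_with_dummy_calls(trace, depth, slots_per_level, m, rng):
    """Extend a query's calls with dummy redundancy sets down to level depth - 1."""
    padded = list(trace)
    for level in range(len(trace), depth):
        # A dummy call requests m uniformly chosen addresses at this level.
        padded.append((level, sorted(rng.sample(range(slots_per_level), m))))
    return padded
```

After padding, every query issues exactly one call per level, so the number of calls no longer depends on the length of the query path.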

As to the second security requirement, if two identical queries are posed consecu-

tively without any interfering calls, their calls at the same level will always intersect,

hence intersections will give some hint about identical queries. We are also currently

studying how to improve the protocol by methodically introducing intersections be-

tween non-identical queries to make intersections independent from identical queries.

8 Conclusion

In this paper, we propose a simple, adaptive and deadlock free protocol to hide tree

structured data and traversal of it from a data store. Since a lot of data such as XML

has a tree structure and queries can be expressed as traversal paths, this protocol can

be utilized to hide such data and queries. Compared with existing private information

retrieval techniques [4, 9], our protocol does not need replication of databases and it

requires less communication, and is thus practical. We provide an example of how to apply

it to hide XML documents and tree-path-based queries. Finally, we conduct experiments

and observe that the proposed techniques achieve hiding without generating unaccept-

able concurrency problems.

Acknowledgement

We thank Dr Rida A. Bazzi for his helpful comments on this paper.

References

1. Hacıgümüş, H., Iyer, B.R., Li, C., & Mehrotra, S. (2002). Executing SQL over Encrypted

Data in the Database-Service-Provider Model, Proceedings of 2002 ACM SIGMOD International

Conference on Management of Data, Madison, Wisconsin, USA, June 3-6, 2002.

pp. 216-227.

2. Oracle Corp. (1999). Database Security in Oracle8i. Retrieved February 26, 2004, from

http://otn.oracle.com/depoly/security/oracle8i/index.html.

3. Smith, S. W., & Safford, D. (2001). Practical Server Privacy with Secure Coprocessors.

IBM Systems Journal, Vol. 40, No. 3. pp. 683-695.

4. Chor, B., Goldreich, O., Kushilevitz, E., & Sudan, M. (1995). Private Information Retrieval,

Proceedings of 36th IEEE Conference on the Foundations of Computer Sciences,

Milwaukee, Wisconsin, USA, October 23-25, 1995. pp. 41-50.

5. Bouganim, L., & Pucheral, P. (2002). Chip-secured Data Access: Confidential Data on

Untrusted Servers, Proceedings of 28th Very Large Data Bases Conference, Hong Kong,

China, 2002. pp. 131-142.

6. Bayer, R., & Schkolnick, M. (1977). Concurrency of Operations on B-Trees, Acta Informatica,

Vol. 9, pp. 1-21.

7. Mohan, C. (1996). Concurrency Control and Recovery Methods for B+-Tree Indexes:

ARIES/KVL and ARIES/IM, In Kumar, V. (Ed.) Performance of Concurrency Control

Mechanisms in Centralized Database Systems, Prentice-Hall 1996, pp. 248-306.

8. Mohan, C. (2002). An Efficient Method for Performing Record Deletions and Updates

Using Index Scans, Proceedings of 28th Very Large Data Bases Conference, Hong Kong,

China, 2002. pp. 940-949.

9. Chor, B., Gilboa, N., & Naor, M. (1997). Private Information Retrieval by Keywords, Technical

Report TR CS0917. Technion Israel, 1997.
