Towards Efficient Reorganisation Algorithms of Hybrid Index Structures
Supporting Multimedia Search Conditions
Carsten Kropf
Institute of Information Systems, Hof University, Hof, Germany
Keywords:
Database Architecture and Performance, Indexing, Spatio-Textual Indexing, Hybrid Index, Data Structures
and Data Management Algorithms
Abstract:
This paper presents the optimization of the reorganisation algorithms of hybrid index structures supporting
multimedia search conditions. Multimedia refers, on the one hand, to the support of high dimensional feature
spaces and, on the other, to the mix of data of multiple types. We use an approach typically found in geographic
information retrieval (GIR) systems, which combines two-dimensional geographical points with textual data;
the dimensionality of the points may, however, be set arbitrarily. Currently, most of these access methods
implemented for use in database-centric application domains are validated with respect to their retrieval
efficiency in simulation-based environments. Most of the structures and experiments rely only on synthetic
validation in an artificial setup, and the focus of these tests is retrieval efficiency. We implemented such an
indexing method in a realistic database management system and observed an unacceptable runtime behaviour
of the reorganisation algorithms. Hence, a structured and iterative optimization procedure is set up to make
hybrid index structures suitable for use in real-world application scenarios. The final outcome is a set of
algorithms providing efficient approaches for the reorganisation of access methods for hybrid data spaces.
1 INTRODUCTION
Hybrid index structures, or, more generally, access methods for disk oriented systems with the ability to
efficiently explore hybrid data spaces, have been investigated intensively in recent years. These access
methods provide fast access to the data stored in the underlying data spaces. Many applications may use
them to search in a geographic information retrieval (GIR) context, utilizing a combined scheme of textual
and geographical data such as vector data (points, lines or polygons). A GIR system may use relational
database systems to persist the data. Yet, specialized indexing systems are necessary to support queries for
multidimensional ranges in conjunction with keywords. Typical search conditions in this application domain
mix textual and geographical predicates. Other types of combinations, for example normalized in combination
with non-normalized data sets, may also be supported efficiently by these hybrid access structures. Most of
them use some kind of index structure for normalized values, e.g. a B-Tree or an R-Tree, in combination with
one for non-normalized values, e.g. an inverted index or a radix tree. Several variants of this kind of access
method have been created and empirically evaluated. Typical query classes are boolean range queries which
check for the existence of keywords inside document texts in combination with the containment of
(geographical) points inside ranges or polygons or the overlap of ranges or polygons.
Unfortunately, the validation of these access structures is mostly performed in synthetic, simulation-based
environments. These experiments show an improved retrieval behaviour and fast access to the underlying
datasets; search processes executed with these access methods thus perform very efficiently. In realistic
environments, a GIR system might be used to store data generated by a web crawler. It could also serve for
persisting data from social networks, which allow geographical data to be attached at publishing time. Either
the use of standard web crawlers applying an analysis and filter chain or the integration of social media
produces several millions of data portions every day. Each of these postings must be included in the hybrid
index structure so that the data can be searched and retrieved based on the analysis results. Thus, for the
integration of such a
hybrid access method, it is essential to provide not only efficient retrieval mechanisms but also fast insertion
and reorganisation algorithms.
Simulation-based environments which validate the enhanced retrieval capabilities are most often constructed
in main memory and executed on server systems with large amounts of RAM. Sometimes, disk-oriented test
suites are also used to simulate the behaviour of the access methods. However, the most interesting feature
validated in the experiments is the quantity of disk I/Os, which may, e.g., be monitored through access
counting. Generally, disk I/Os may be used as an approximation of time since, even nowadays, positioning
the disk heads and performing a read/write operation on a rotating hard disk takes milliseconds to complete.
Thus, it is sufficient to validate the block access count using simulation-based synthetic environments.
One of the main issues which led to this work was a realistic GIR system backed by a relational database. We
tried to integrate this new kind of access structure, combining an R-Tree augmented with bitlists with an
inverted index lookup approach. During the implementation of the structure in a realistic database
environment, the reorganisation time of the hybrid index structure became conspicuous while monitoring the
performance of a web crawler which generated several thousands of postings per day. The performance of the
retrieval algorithms is superior to that of independent searches in multiple structures (inverted index and
R-Tree separated from each other), but the reorganisation performance is simply unacceptable for this task.
Due to the blocking behaviour of table modification operations, poor reorganisation performance also leads
to high retrieval times if searches are issued during insertions, updates or deletions. Therefore, the
reorganisation algorithms had to be reinvestigated in order to reduce the effort to a reasonable amount. It
must be noted that not only the construction of a data structure has to be investigated but also the algorithms
that realize construction and retrieval in a new kind of access structure. The reorganisation algorithms are
such a gap which has not been described in detail, yet. This paper describes the procedure and final outcome
of an optimization which starts from an initial implementation of a hybrid access method and evolves towards
acceptable performance bounds using an iterative approach. The term “iteration” is used here to describe the
single steps inside the entire optimization procedure.
The main contributions of this work are:
1. A discussion of the initial implementation as well
as the definition of a test suite for the optimiza-
tion of reorganisation algorithms of hybrid access
structures,
2. an overview of the optimization iterations per-
formed to limit the reorganisation effort to a rea-
sonable amount and
3. the final state of efficient reorganisation algo-
rithms with focus on insertions.
This work is structured as follows: Section 2 briefly discusses the related work, especially hybrid index
structure variants and experiments. An outline of the initial implementation as well as the definition of a test
suite and the outer circumstances, such as input data and parameters, is given in section 3. The procedure of
the actual optimization is presented in section 4, which also explains details of selected optimizations and
phases. The final state of the efficient reorganisation algorithms is explained in section 5.
2 RELATED WORK
Basic approaches for creating GIR systems are presented in (Göbel and de la Cruz, 2007; Kropf et al., 2011;
Vaid et al., 2005). These papers introduce the requirements as well as analysis techniques for GIR systems,
like toponym extraction, and already provide first insights into possible indexing schemes. Such a system may
utilize the enhanced capabilities of hybrid index structures to present the data to the user quickly.
Hybrid indexing techniques, either with focus on top-k or boolean queries, are presented in (Felipe et al.,
2008; Göbel et al., 2009; Göbel and Kropf, 2010; Rocha-Junior et al., 2011; Rocha-Junior and Nørvåg, 2012;
Wu et al., 2012; Zhang et al., 2009; Zhou et al., 2005). Several variants of hybrid access methods are presented
in these papers. The targets either include boolean queries which just ask for the existence of certain features
like keywords, or top-k queries which already employ ranking possibilities. The main construction of these
access methods is similar for most of them: one base access method is taken into account, managing data from
one part of the data space, e.g. the geographical/normalized data sets, and is then augmented with a
representation for another part, e.g. the texts/non-normalized data sets. We use one of these methods as the
base of our investigation and try to enhance the reorganisation performance.
An experimental evaluation of available structures and methods for spatial keyword processing is given as an
overview in (Chen et al., 2013). The classes of boolean range queries as well as boolean k-NN and top-k k-NN
queries are investigated, and an evaluation for each of the problem classes is given based on pre-defined
datasets. This analysis evaluates the query efficiency with respect to the respective query classes. Yet, the
construction and update efficiency, which may be described as reorganisation effort, is left out of scope there.
3 INITIAL IMPLEMENTATION
AND TEST SUITE DEFINITION
This section outlines the initial implementation of the hybrid index structure under examination. Only a
general overview is given here, discussing the respective phases as well as the setup of the test suite in
conjunction with test and input parameters.
The access structure under test is described from a conceptual point of view in (Göbel et al., 2009). The basic
concept of the hybrid index implemented for this work can be seen in figure 1. It is composed of an initial
inverted index separating frequently and rarely used terms based on Zipf's Law (Zipf, 1949). This empirical
law is used to, on the one hand, keep the bitlists small and, on the other, allow sequential searches for
elements to a pre-defined extent. In some cases, it might be more efficient to scan the results returned from
an inverted index linearly instead of proceeding with the retrieval in the hybrid R-Tree part. The tuples to be
stored in the index consist of a set of terms or words from textual documents in conjunction with a set of
(geographical) points. The structure consists of an initial inverted index which is the starting point for all
operations. All terms are indexed in this structure. For frequently occurring terms, only references to bit
indices, which are used in further operations, are stored, whereas for rarely occurring terms all document
entries they are present in are stored there directly. Thus, rarely occurring terms are searched for directly
inside this inverted index. The document heap structure persists all point values of documents currently not
present inside the hybrid R-Tree, so that sequential searches on point data are supported as well. The next
storage structure, which is only used for commonly occurring terms, is the hybrid R-Tree, which is a basic
R-Tree augmented with a bitlist. The bitlist is used to indicate the existence of a term, represented by its bit
index, inside a certain subtree. As the bitlist is of fixed size, multiple hybrid R-Tree instances for discrete bit
index ranges might exist. The root nodes of these R-Trees are stored inside the R-Tree Root Storage
component. The final destination of each term is inside the secondary inverted index structures. Each leaf
element of the hybrid R-Tree, which then represents a point value, has one of these secondary structures
assigned. The terms are finally persisted there to build the intersection of point and term assignment to
documents in an efficient manner. Each of the inverted index structures, initial as well as secondary, is
accessed via a B-Tree which serves as the directory.
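To make the role of the bitlist concrete, the following minimal Java sketch shows how a node-level bitlist could be used to decide whether a subtree may contain a given frequently occurring term, identified by its bit index. The class and field names are illustrative and are not taken from the actual implementation.

import java.util.BitSet;

// Illustrative sketch: a hybrid R-Tree node carries a bitlist marking which
// frequently occurring terms (identified by their bit index) occur somewhere
// in the subtree rooted at this node.
class HybridNodeSketch {
    private final BitSet bitlist;    // fixed capacity, derived from the page size
    private final boolean leaf;

    HybridNodeSketch(int capacity, boolean leaf) {
        this.bitlist = new BitSet(capacity);
        this.leaf = leaf;
    }

    // Called while distributing a term downwards: record its presence.
    void markTerm(int bitIndex) {
        bitlist.set(bitIndex);
    }

    // Search-time pruning: if the bit is not set, no element in this subtree
    // refers to the term, so the whole subtree can be skipped.
    boolean mayContainTerm(int bitIndex) {
        return bitlist.get(bitIndex);
    }

    boolean isLeaf() {
        return leaf;
    }
}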
However, based on the implementation inside a realistic database environment, some changes are introduced
compared to the original proposal. The database selected for the implementation of the access structures is a
modified version of the H2 Database Engine (http://www.h2database.com/html/main.html, accessed
2014-03-24). The main changes introduced to this database system result from the requirement of adding
user defined access methods and being able to load them from external places, e.g. a jar file residing in a
specific folder.
The modification of the initial inverted index will be referred to as the rootBTree phase. The phases
describing the modification of the hybrid R-Tree are called rtreeExpand for general R-Tree insertion
operations, distribute for the keyword distribution and generateList for the generation of the subset of terms
valid for a given point/region. The modification of the secondary indexes is referred to as putEntries in the
following. As mainly the insertion is inspected, the entire reorganisation process is called “add”.
We refer to “iteration” as one phase inside the entire optimization procedure. It must be noted that the
implementation before the first optimization iteration is really a naïve one. The inverted index structures,
initial and secondary, store the directory inside a B+-Tree. Each occurrence of one term and its relation to a
document is marked once directly inside the leaf nodes. Hence, many entries exist inside the B+-Tree mapping
from exactly the same term to one document it occurs in. This is obviously not the best option, but it is
worked on later during the optimization iterations. After the insertion of the entries at the initial inverted
index, the terms with an occurrence frequency greater than a pre-defined, user-defined limit are distributed to
the hybrid R-Tree. Hence, first the node elements and regions are created by inserting the points into the
existing R-Tree. Then, the distribution of the terms to the inner and finally the leaf nodes of the R-Tree is
executed. This is backed by the generation of a list of elements valid for the respective subtree, so that they
can be distributed to the subtrees where they spatially fit. The final step is the insertion at the secondary
inverted index, initially also executed using the naïve approach.

Figure 1: Conceptual Overview of the Hybrid Index Structure.
The test suite used here measures the results of specific insertion operations repeatedly. Generally, it
monitors many features of the hybrid index structure, such as the number of disk I/Os (no distinction is made
between read and write) differentiated based on the page type, the highest assigned bit index and other
internal states of the hybrid access structure. The information about the disk I/Os in particular had a major
influence on the decisions regarding the restructuring of the algorithms. However, the primary subject of
investigation for this study is the runtime of the algorithm. Hence, the test suite also contains profiling
facilities used repeatedly. Since the profiling produces large XML output files, an alternating measurement
scheme is introduced which is only applied at pre-defined points in time (50 document insertions without
profiling, then one with, continuing with 50 without, and so on). By importing the results into a graph
database and into adapted analysis tools, an output for the respective phases may be generated displaying
the total and the average runtime as well as the number of calls to each specific sub-routine.
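The alternating measurement scheme can be pictured with a small sketch; the class and counter names are made up for illustration only, and the batch size of 50 unprofiled insertions is taken from the description above.

// Illustrative sketch of the alternating measurement scheme: 50 insertions
// run without profiling, the 51st is profiled, and so on, keeping the large
// XML profiling output manageable.
class ProfilingScheduleSketch {
    private static final int UNPROFILED_RUN = 50; // batch size from the test suite description
    private long inserted = 0;

    // Returns true if the next document insertion should be profiled.
    boolean profileNextInsert() {
        inserted++;
        return inserted % (UNPROFILED_RUN + 1) == 0;
    }
}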
The input data result from an analysis of a Wikipedia dump
(http://dumps.wikimedia.org/enwiki/20111007/enwiki-20111007-pages-articles.xml.bz2, accessed
2011-11-07) which has unfortunately already been deleted from the servers. Newer dumps of the Wikipedia
are generated repeatedly and are freely available (http://dumps.wikimedia.org/enwiki/latest/, accessed
2014-05-14). The given analysis may be applied to newer versions of the dumps as well. Besides, a
pre-processed version of the Reuters TRC2 corpus (http://trec.nist.gov/data/reuters/reuters.html, accessed
2014-05-14) is used for verification, which is not presented in this paper. Thus, we also ensure that the given
optimizations are independent of the corpus used and represent generic approaches valid for any kind of
corpus.
Table 1: Settings for the Pre-test Test Suite Run.

Parameter         Value
HLimit            200
R-Tree Elements   5
Document Count    969
Splitting Method  R-Tree (iterations 1–2) / R*-Tree (iterations 3–7)
Choose Subtree    R-Tree (iterations 1–2) / R*-Tree (iterations 3–7)
An analysis scheme is executed which uses stop word removal and Porter stemming (see (Porter, 1997)).
After that, coordinates are extracted from the articles, on the one hand, by using the already assigned ones
which may be set up for points of interest inside Wikipedia. On the other hand, a toponym extraction
functionality is applied which detects place names and assigns discrete geographical coordinates to the place
names found in the articles. The articles, consisting of sets of terms and points, serve as input for the
database table backed by the hybrid access method.
Unfortunately, the first iteration took very long to complete the measurement runs. Hence, only a very
limited number of measurement runs could be executed; 38 repetitions of the test suite runs are performed.
Due to the alternating approach of not monitoring 50 documents (including queries after each 25th) and then
monitoring one, the total number of documents inserted into the database table is only
38/2 · 50 + 38/2 = 969. During the actual research work on the optimization, the quantity of documents
processed by the hybrid access method was, obviously, much bigger than the 969 used for this analysis.
Some parameters may be set on the test suite for control; they are summarized in table 1. The parameter
HLimit refers to the user defined limit separating rarely and frequently used terms. It is set to 200 based on
prior experiments. As a fixed block size is set for the database disk pages, either the R-Tree element count or
the number of entries inside the bitlist may be specified.
DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications
234
Figure 2: Overview of Average Required Times for Adding an Item (Iterations 2–7).
In this case, five R-Tree elements are configured and the size of the bitlist is calculated. The R-Tree may be
adapted to the user's needs by applying different split or choose subtree methods (e.g. (Ang and Tan, 1997;
Beckmann et al., 1990; Guttman, 1984)). These change between iteration 2 and 3 as a result of an
optimization.
As update operations in databases are often performed by marking one tuple as deleted and adding a new
one, we do not discuss these separately here. This holds (at least) for the inspected databases H2 and
PostgreSQL. The query efficiency of the changed structures and algorithms is also not discussed because one
main focus of the optimizations is to improve reorganisation efficiency whilst keeping the retrieval
performance constant. The measured times are only compared relative to each other. Hence, we omit a
detailed description of the hardware used for the tests.
4 EVOLUTION OF HYBRID
INDEX STRUCTURE
REORGANISATION
ALGORITHMS
An overview of the average duration of the reorganisation runs is presented in figure 2. As a side note,
iteration 1 is excluded from this plot and many of the following ones because the improvements obtained with
this iteration, including the cache mechanism, are simply too large to be displayed reasonably (approximately
0.531% relative runtime of the “add” phase for iteration 2 in comparison to iteration 1). It must be noted
that, although all following plots show average runtimes in milliseconds, the values cannot be taken as a
reference. This results from the chosen profiling approach.
Figure 3: Average Percentual Runtime per Phase (rootBTree, rtreeExpand, generateList, putEntries).
For the time measurements, additional code for monitoring as well as exporting is executed, which distorts
the absolute values because the profiling approach adds a certain time overhead. Hence, the values may only
be used to compare the times measured during the test suite executions and may not be taken as absolute
ones. Figure 2 describes the trend of iterations 2–7. It shows a clear decrease of the runtime for each
iteration except for iteration 4. Inside the remaining phases, the differences are comparably high. Yet, a clear
decreasing tendency, at least from iteration 5 to 7, can be seen in this plot. The runtime difference from
iteration 2 to 3 is only marginal, whereas it slightly worsens in iteration 4. This is a result of the size of the
document set. In principle, the quantity of documents is continuously raised between the particular
iterations during the real optimization iterations executed in external studies. The changes applied in
iteration 4 mainly profit from a larger document set. Thus, the relatively small cardinality of the document
set used in this test run does not produce the same effects as for larger sets.
The particular phases of the reorganisation algo-
rithm are now analyzed in detail with respect to the
changes performed during the optimization iterations.
The values of the first iteration are omitted because, as
already seen for the entire process, the differences be-
tween the first and the subsequent iterations are sim-
ply too big to be displayed reasonably.
Figure 3 shows the average percentage of the runtime contributed by the respective phases in relation to the
entire runtime. The initial inverted index (“rootBTree”) contributes only a small portion of the runtime,
whereas the remaining phases have more influence in total. Additionally, no iteration directly refers to this
structure. Therefore, this phase is left out of consideration in the following subsections.
TowardsEfficientReorganisationAlgorithmsofHybridIndexStructuresSupportingMultimediaSearchConditions
235
4.1 Iterations
The iterations described in this subsection are executed in the given order. That means that the sequence
presented in the following also corresponds to the one executed to reach the final goal of obtaining efficient
reorganisation algorithms. The outcome of each optimization is validated based on pre- and post-tests. The
facts presented here describe the final outcome of each of the sequentially executed optimization iterations.
As already seen in section 4, each of the optimization iterations introduces differences in the runtime.
However, it is of interest which kinds of adaptations lead to the respective improvements. A short list of
changes in the algorithms or data structures is given here. Each of the changes introduced for the respective
iterations is motivated based on observations leading to solution alternatives. One of them is selected on the
grounds of additional evaluations (e.g. further investigations or experiments).
The iterations seen in the figures correspond to the following list of adaptations:
1. Initial State using the naïve implementation as described in section 3. The values for this iteration are
omitted in the plots because of the tremendous differences.
2. Page and Value Caches introducing least recently used caching (e.g. (O'Neil et al., 1993)) for serialization
and de-serialization of objects as well as an adapted caching mechanism. The caches are introduced in
addition to the ones already present in the H2 database (a simplified cache sketch is given after this list).
3. R-Tree Distribution which considers different methods of selecting a proper subtree or splitting a node in
two. Three methods of splitting (R-Tree linear, R*-Tree and Ang and Tan) as well as two methods of subtree
selection (R-Tree linear and R*-Tree) were cross-evaluated. The R*-Tree splitting and choose subtree
methods work best for the datasets under test (Wikipedia and also an additional Reuters excerpt).
4. Inverted Index and Storage Changes adapting the internal storage of objects from a serialized version of
the entire object to a decomposition into individual objects. This makes direct access to stored data more
efficient. The storage mechanism of the postings inside the inverted index, which is used intensively in the
initial and secondary index storage, is changed based on experimental and theoretical considerations.
Depending on the size of the posting lists compared with the available block size, they are either stored
inside the directory B+-Tree or on external pages.
5. Further Inverted Index Manipulations. This iteration includes an optimization for searches inside the
directory B+-Tree: a binary search in inner elements is introduced, reducing this internal search effort from
linear O(n) to logarithmic O(log n) (a binary-search sketch is given after this list). Besides, empirical data
are taken to check the occurrence frequency of all points in the datasets (Wikipedia and Reuters), and the
outcome is that this frequency is also distributed according to Zipf's Law. A specialized storage mechanism
is included for points occurring in exactly one document.
6. Pre-Calculation of Item Insertions introduces a modified version of the distribution of term-to-document
mappings to the points where they fit inside the hybrid R-Tree. Initially, the data structure used is organized
around the terms pointing to the documents which refer to the points contained in them. This is changed to
a pre-calculated hash table of points referring to terms, which themselves reference the documents. Thus,
the pre-calculation is very similar to the structures required by the hybrid R-Tree and the secondary inverted
index, respectively. A more direct distribution algorithm is chosen which loads the leaf nodes the points
belong to, so that the pre-calculated inverted posting sets may be inserted into the secondary inverted index
structures without further calculations. In addition, bulk-loading oriented operations for secondary inverted
index structures are introduced.
7. Spatial Structures for Spatial Distributions are introduced in the last of the inspected iterations. The hash
table is replaced by a KD-Tree (Bentley, 1975), which is better suited for handling spatial objects. This
enhances the access to spatial objects for subset generation on the one hand, and enables a cache-hit query in
inner node elements on the other. Thus, for inner elements inside the hybrid R-Tree, it is sufficient that at
least one point of the set of objects to be distributed is contained inside the region described by the node
element for the distribution to descend further. Therefore, no subsets must be created, and spatial range
queries are efficiently supported, which is not the case for hash tables.
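For item 2, a least-recently-used value cache can be sketched in Java with a LinkedHashMap in access order; this is only a simplified illustration and not the caching code actually used, which additionally complements the caches already present in H2.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache for de-serialized objects (iteration 2), keyed,
// e.g., by page id; the least recently used entry is evicted on overflow.
class ValueCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    ValueCacheSketch(int capacity) {
        super(16, 0.75f, true);    // access-order map gives LRU behaviour
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict once the configured capacity is exceeded
    }
}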
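For item 5, the inner-node lookup can be pictured as a standard binary search over the sorted separator keys of an inner B+-Tree node. The sketch assumes the keys are held in a sorted array and follows the convention that child i covers keys smaller than separator i; both are assumptions for illustration, not details of the actual implementation.

import java.util.Arrays;

// Illustrative sketch for iteration 5: replacing the linear scan of the
// separator keys in an inner B+-Tree node with a binary search.
class InnerNodeSearchSketch {
    // Returns the index of the child pointer to follow for 'searchKey'.
    static int chooseChild(long[] sortedSeparators, long searchKey) {
        int pos = Arrays.binarySearch(sortedSeparators, searchKey);
        if (pos >= 0) {
            return pos + 1;    // exact match: descend to the right of the separator
        }
        return -(pos + 1);     // insertion point is the child slot to follow
    }
}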
4.2 R-Tree
The R-Tree operations are not investigated exclusively during the optimization. Only the distribution (split
and choose subtree) is changed, during iteration 3.
DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications
236
Figure 4: Overview of R-Tree Optimization Iterations.
The alterations of the runtime are mainly introduced as side effects of other optimizations. The results can
be seen in figure 4. Iteration 1 is omitted here because the caching mechanism introduced in iteration 1 has
such a large effect on the runtime that a proper display of the results is impossible. The cache mainly works
for the spatial objects stored inside the R-Tree (and some others, as well).
Although iteration 3 introduces an adapted version of the distribution, the runtime rises slightly here. This
results from the fact that the actual calculation of the improved distribution takes more time. The basic
R-Tree algorithms (at least the linear version implemented here) are not very computationally complex, and
the more complex algorithms of the R*-Tree contribute negatively to the runtime of the insertion operations.
Additionally, the change in the R-Tree element structure is meant to support the distribution of terms to
points in the distribute part of the hybrid index which follows the “rtreeExpand” phase shown here. Hence,
an optimization of the actual R-Tree modification is not intended for iteration 3 either.
The greatest effects for the actual R-Tree manipulations are given in iterations 4 and 7. The value for
iteration 6 is missing because in this iteration another insertion and distribution algorithm is applied; it is
performed together with the distribution, and no phase can be isolated for analysis here. The difference
between iterations 3 and 4 results from the change in the storage mechanisms, because the spatial objects
may then be modified more directly. Previously, a composed object of spatial component, bitlist and
secondary inverted index reference was stored inside the elements of an R-Tree node; the new storage
approach separates the storage into individual column representations. The gap between iterations 5 and 7
is explained by the improvements introduced by the KD-Tree. The algorithms are substantially changed in
this iteration.
Figure 5: Overview of Secondary Inverted Index Manipulation Iterations.
Up to iteration 5, the insertion and distribution of the new terms and points to be placed into the hybrid
R-Tree is executed by first iterating through all points present in the set of overflowing elements, looking up
the points and, if not present, inserting them. Subsequently, the distribution of terms to the points is
executed. Hence, all affected nodes are accessed at least twice. The new version first distributes the terms to
R-Tree point elements which are already present. Afterwards, all points (and affected terms) which have not
been handled yet are placed. Hence, the possibility of loading paths multiple times still exists but is lower,
based on the more direct way of placing only the remaining elements and not having to load the entire tree
(at least) twice.
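The reordered insertion procedure can be summarised in a short sketch; the interface and method names are purely illustrative. Terms are first distributed to points that already exist in the hybrid R-Tree, and only the points that were not matched are inserted afterwards, so affected paths are ideally loaded once.

import java.util.Map;
import java.util.Set;

// Illustrative two-pass insertion order: 1) distribute terms to points that
// are already present in the hybrid R-Tree, 2) insert only the remaining,
// previously unknown points together with their terms.
class TwoPassInsertSketch {
    record Point(double x, double y) {}

    interface HybridRTree {
        // Distributes the assignment to existing leaf entries and returns
        // the points that were not found in the tree.
        Set<Point> distributeToExisting(Map<Point, Set<String>> assignment);
        void insertPoint(Point p, Set<String> terms);
    }

    static void insert(HybridRTree tree, Map<Point, Set<String>> assignment) {
        Set<Point> remaining = tree.distributeToExisting(assignment); // pass 1
        for (Point p : remaining) {                                   // pass 2
            tree.insertPoint(p, assignment.get(p));
        }
    }
}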
4.3 Secondary Inverted Index
The effects of the optimization iterations for the secondary inverted index are presented prior to those of the
distribution phase because the secondary inverted index directly influences the runtime of the distribution of
elements. The secondary inverted index reorganisation itself, however, does not depend on other algorithms
and may thus be inspected in isolation.
Figure 5 shows an overview of the iterations of the secondary inverted index manipulations. Iteration 1 is
omitted again due to the huge differences in runtime. The secondary inverted index manipulation is executed
very frequently because all terms that are distributed to the hybrid index must be placed at a final
destination, which is the secondary inverted index. Hence, for each point affected by the distribution of
items, the manipulation of the secondary inverted index is carried out. The general tendency of this phase is
continuously decreasing, too. Iteration 4, where the runtime increases considerably, is an outlier. Basically, it
can be shown that for larger numbers of documents handled by the hybrid index structure, the effects
introduced with iteration 4 are also beneficial for the runtime.
The changes introduced in iteration 4 refer to the storage of a high quantity of documents. If there is a large
number of references from one particular item stored inside the directory to many entries in the document
heap, a strategy is applied to the elements inside the B-Tree which, depending on the occurrence frequency of
one term, decides whether to place the term inside the directory B+-Tree or in externalized posting lists.
Unfortunately, the management operations introduced with this approach for the inverted index become far
more complex than the initial ones. Originally, for each element a tuple is built which is simply inserted into
the secondary inverted index using the standard B+-Tree algorithms and placed directly inside the directory.
After the introduction of the new strategy, a decision must be taken whether to place the element in the leaf
node itself or to move it to an external page. References to these external pages must be updated if the
element is moved, which adds computational overhead. This indicates that the algorithms for element
placement in the new version of the inverted index are far more complex than the original ones, which simply
took each term-to-document reference and placed it instantly into the B+-Tree.
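The placement strategy itself can be sketched as a simple size check against the block size; the threshold and names below are illustrative. As long as the posting list of a term fits into one directory page, it stays inline; once it would exceed a page, the postings for that term are moved to an external posting page.

import java.util.List;

// Illustrative sketch of the posting placement strategy (iteration 4):
// small posting lists live inside the directory B+-Tree, large ones are
// moved to external posting pages to keep the directory lean.
class PostingPlacementSketch {
    private final int pageSizeBytes;          // fixed database block size

    PostingPlacementSketch(int pageSizeBytes) {
        this.pageSizeBytes = pageSizeBytes;
    }

    // Decide whether a term's postings still fit inline after adding one entry.
    boolean fitsInline(List<Long> currentPostings, int bytesPerPosting) {
        long projected = (long) (currentPostings.size() + 1) * bytesPerPosting;
        return projected <= pageSizeBytes;    // threshold: one directory page
    }
}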
The remaining tendency demonstrated there is that the average time required for one manipulation of the
secondary inverted index continuously decreases from iteration 2 to 7. An optimized way of distributing
point values inside the hybrid index also seems to have a side effect on the secondary inverted index
performance. Yet, the most significant changes for this manipulation phase occur between iterations 5 and 7.
The combination of a pre-calculation, an advanced inverted index and the introduction of spatial structures
for the distribution algorithms increases the performance of the secondary inverted index manipulation to a
large extent. The effects of the reorganisation of the storage mechanism for points referencing exactly one
tuple eminently affect the average runtime. The same obviously holds for the bulk-oriented insertions
introduced in iteration 6. Iteration 7 actually does not change much in the basic behaviour of these
algorithms and thus has nearly no influence there. Basically, iteration 7 only introduces an optimized way of
distributing elements throughout the hybrid index, which means that, besides a pre-calculation of elements,
the distribution itself may be executed in a more optimal way. This change optimizes the generation of valid
items for a given subtree, evoking, on the one hand, a lower amount of calculations of valid elements and, on
the other, an optimized way of treating pages inside the cache of the H2 database.
Figure 6: Overview of Distribution Optimization Iterations.
If fewer pages are visited during the reorganisation, fewer pages need to be managed inside the cache for the
remaining page types. This leaves more space available in this cache. Hence, it is possible to manage the
potentially high quantity of secondary inverted index pages more efficiently because a large amount of the
available cache can be used here. This subsequently leads to a more efficient utilization of the cache resource
of the database. Nevertheless, the most considerable difference can be detected between iterations 5 and 6,
which means that the bulk-loading as well as the appropriate storage of points occurring in only one
document lead to a tremendously improved performance for the secondary inverted index structures.
4.4 Distribution of Entries
Besides the inverted index manipulation, the distribution of the entries is the phase which receives the most
optimization work. It is related very closely to the secondary inverted index manipulation because the entries
are placed into the secondary inverted index at the end of the distribution. Hence, the items valid for the
particular subtree represented by an element are calculated and afterwards distributed to this subtree. In the
case of a leaf node, the entries are subsequently inserted into a secondary inverted index. Thus, for most
optimization iterations, except for iteration 6 where the distribution is omitted by including direct insertions
to leaf nodes, these two phases must be inspected in connection with each other.
The results of the measurements for the distribution phase are displayed in figure 6. This figure shows that
the performance of this part steadily increases as well (except for iteration 4). Iteration 1 is again omitted for
the reasons already mentioned. The distribution phase contains the recursive traversal through the tree as
well as the subset generation and the placement of the items into the secondary inverted index structures.
Thus, the efficiency of this structure directly affects the distribute phase. This influence explains the results
of iteration 4. It can be seen that the alteration of split and choose subtree methods has minor effects, at
least compared to the iterations carried out later on. Iteration 5, however, introduces a great performance
increase. In this iteration, the storage structures are adapted, which means that points occurring in only one
single document are stored without the necessity of secondary inverted index manipulations. As a side note,
this is also beneficial for the retrieval effort. Hence, skipping these manipulations leads to a major
performance improvement here. But the generation of the item lists is still executed there: for each R-Tree
element, a set of valid term ids and documents is generated to be distributed to the respective subtree. This
is omitted in iteration 6, where the pre-calculation in the shape of a hash table is introduced. From iteration
5 to 6, a relative performance enhancement of approximately 50% can be seen. This is a direct effect of the
introduction of the hash table pre-calculation and the direct insertions into secondary inverted index
structures; no subsets must be calculated during the distribution phase. The most influential part at this
stage is the introduction of the hash table. Although it has proved successful for the iteration, it is also its
main time consumer. This is primarily caused by the “random” access to the map entries, which has been
overcome by the use of a spatial structure (KD-Tree) in the end. The effects of the KD-Tree are tremendous,
as the difference between iterations 6 and 7 shows. In comparison to the map, the KD-Tree shows two
substantial distinctions. On the one hand, the list generation is skipped and replaced by a cache-like
approach; thus, the generation of potentially large lists for subtrees can be left out. On the other hand, the
spatial structure of the KD-Tree arranges the elements in a way better suited for queries towards specific
points or ranges. This leads to an enormous performance gain between iterations 6 and 7. Besides these
improved spatial distributions, the approach also benefits from not having to visit many subtrees and from
visiting them only once.
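The effect of the KD-Tree can be illustrated with a minimal two-dimensional containment check; this is a simplified sketch, not the production structure. For an inner R-Tree element it is sufficient to ask whether any pending point lies inside the element's rectangle; only at the leaf level are the pre-calculated assignments actually retrieved.

// Simplified 2D KD-Tree sketch (iteration 7): supports an "any point inside
// this rectangle" test, used as a cache-hit check for inner R-Tree elements.
class KdTreeSketch {
    private static final class Node {
        final double[] point;   // {x, y}
        Node left, right;
        Node(double[] p) { this.point = p; }
    }

    private Node root;

    void insert(double[] p) { root = insert(root, p, 0); }

    private Node insert(Node n, double[] p, int depth) {
        if (n == null) return new Node(p);
        int axis = depth % 2;
        if (p[axis] < n.point[axis]) n.left = insert(n.left, p, depth + 1);
        else                         n.right = insert(n.right, p, depth + 1);
        return n;
    }

    // Cache-hit check: is at least one stored point inside the rectangle
    // [minX, maxX] x [minY, maxY]?
    boolean anyInRange(double minX, double minY, double maxX, double maxY) {
        return anyInRange(root, new double[]{minX, minY}, new double[]{maxX, maxY}, 0);
    }

    private boolean anyInRange(Node n, double[] lo, double[] hi, int depth) {
        if (n == null) return false;
        if (n.point[0] >= lo[0] && n.point[0] <= hi[0]
                && n.point[1] >= lo[1] && n.point[1] <= hi[1]) {
            return true;
        }
        int axis = depth % 2;
        boolean hit = false;
        if (lo[axis] < n.point[axis]) {              // rectangle overlaps the left half-space
            hit = anyInRange(n.left, lo, hi, depth + 1);
        }
        if (!hit && hi[axis] >= n.point[axis]) {     // rectangle overlaps the right half-space
            hit = anyInRange(n.right, lo, hi, depth + 1);
        }
        return hit;
    }
}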
Figure 7 shows the results for the generation of the subsets valid for the particular subtrees. The general
tendency of this figure is comparable to the one of the distribution function if the put entries phase is
disregarded. Iteration 6 skips the generation of the sublists by introducing the hash table. This is changed in
iteration 7, where the KD-Tree mainly functions as a comparison cache.
Figure 7: Overview of Sublist Generation.
The queries for the respective entries valid for one spatial element inside the R-Tree are performed via the
KD-Tree in the last iteration. A query only checks for the presence of at least one element from the KD-Tree
inside, or equalling, the currently inspected element. Only at the leaf level are the sets generated, and these
are already pre-calculated before the distribution algorithm. Thus, at no stage of iteration 7 is a subset
actually generated, because the already prepared ones are chosen for insertion. It is obvious (see figure 7)
that this change is the most beneficial regarding the performance.
With the exception of iteration 4, the cost of the subset generation decreases considerably during the
optimization iterations. A great effect can also be seen between iterations 2 and 3, which indicates that it
profits much from the improved version of the R-Tree distributions. Due to the adapted versions of the
inverted index algorithms, the generation of subsets also benefits from the caching. The algorithm basically
first loads all documents contained in the sets and checks the point values. Thus, if fewer pages are loaded by
other parts, the subset generation profits by not requiring all the entries to be loaded from the document
heap repeatedly.
5 FINAL RESULT OF THE
REORGANISATION
ALGORITHMS
This section presents the final state of the reorganisation algorithms. Only the insertion is inspected in this
case. The algorithms are listed in pseudo-code which should be transferable to a wide variety of programming
languages. They are the final outcome of the seven implementation states discussed in the previous sections.
TowardsEfficientReorganisationAlgorithmsofHybridIndexStructuresSupportingMultimediaSearchConditions
239
Algorithm 1: Insert Document.

Function HybridIndex::Add(doc)
    Data: the document to be inserted to the hybrid index
    documentHeap.Place(doc.getPoints());
    toDistribute = initial.Insert(doc.getWords());
    if toDistribute != ∅ then
        hybrid.Insert(toDistribute);
Algorithm 1 shows the general hybrid index insertion algorithm. First, the document heap is manipulated.
This storage structure has not been discussed before: it is a persistence unit saving the points or normalized
values to support sequential searches when a term occurring inside one of the documents has an occurrence
frequency lower than HLimit. Thus, the sequential filtering done on the documents after inverted index
retrieval may be executed there. After this insertion step, the initial inverted index modification is executed,
which produces the set of values whose term frequency is higher than HLimit and which therefore must be
distributed to the hybrid R-Tree. Thereafter, this set is further processed by the hybrid R-Tree algorithms.
Algorithm 2: Insertion at Initial Inverted Index.

Function Initial::Insert(terms)
    Data: the terms representing the non-normalized values
    Result: set of items to be distributed to the hybrid R-Tree
    overflow = ∅;
    for term ∈ terms do
        inverted.FindLeaf(term);
        if found ∧ overflow then
            overflow.Add(element);
        else
            refcount = inverted.Insert(term);
            if refcount > HLimit then
                item.SetWordId(max(termid) + 1);
                inverted.Replace(inverted index item);
                overflow.Add(element);
    return overflow
A description of the initial inverted index manipulation is given in algorithm 2. For each term in the set of
terms assigned to a document, this algorithm checks whether the respective term is already in the set of
terms with a high frequency. If so, it is directly added to the set to be processed further. If not, the reference
count is determined by placing the new document reference in the postings list. If, after that, the term
frequency is higher than the given limit, a new bit index is created to indicate the presence or absence of this
term inside the bitlist of an R-Tree element, all references to this term are removed from the inverted index,
and the term is also added to the set of terms to be further processed.
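The decision taken at the initial inverted index can also be pictured by the following Java fragment; all class and field names are illustrative and the structure is heavily simplified. A term keeps its postings in the initial inverted index until its document frequency exceeds HLimit; at that point it receives a new bit index and is handed over for distribution to the hybrid R-Tree part.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the HLimit decision at the initial inverted index:
// rarely occurring terms keep their postings here, terms crossing the limit
// receive a bit index and are passed on for distribution.
class InitialIndexSketch {
    private static final int H_LIMIT = 200;            // value from Table 1
    private final Map<String, Set<Long>> postings = new HashMap<>();
    private final Map<String, Integer> bitIndex = new HashMap<>();
    private int nextBitIndex = 0;

    // Returns the terms of this document that must be distributed further.
    Set<String> insert(long docId, Set<String> terms) {
        Set<String> overflow = new HashSet<>();
        for (String term : terms) {
            if (bitIndex.containsKey(term)) {           // already a high-frequency term
                overflow.add(term);
                continue;
            }
            Set<Long> docs = postings.computeIfAbsent(term, t -> new HashSet<>());
            docs.add(docId);
            if (docs.size() > H_LIMIT) {                // promote: assign a new bit index
                bitIndex.put(term, nextBitIndex++);
                postings.remove(term);                  // references leave this index
                overflow.add(term);
            }
        }
        return overflow;
    }
}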
Algorithm 3: Insertion on Inverted Index.

Function Inverted::Insert(node, elem)
    Data: the node to insert the element at and the element itself
    Result: the frequency of the element
    if internal storage then
        if ! inverted.CheckSize(node, elem) then
            inverted.MoveExternal(items);
            inverted.PlaceExternal(elem);
        else
            inverted.PlaceToNode(elem);
    else
        inverted.PlaceExternal(elem);
    return frequency
The general inverted index manipulation used by the initial as well as the secondary inverted index is shown
in algorithm 3. Two strategies may be applied here, based on the size required by the set of documents
pointed to by one term: either internal storage, which means that the references are placed directly inside the
directory, or external storage, indicating that an external postings page is used. If the external storage
indicator is set, the new element is directly stored there. Otherwise, the size of the currently existing
elements plus the new one is determined and it is decided whether a limit is reached. Currently, this limit is
set to the size of a page, because if a second page had to be created to store all references to one term, it is
more efficient to keep the directory B+-Tree lean and place all entries in an external postings list.
Algorithm 4: Hybrid R-Tree Insertion.

Function Hybrid::Insert(overflow)
    Data: the overflowing list to be distributed
    kdtree = hybrid.GenerateAssignment(overflow);
    hybrid.Distribute(kdtree, root);
    hybrid.PlaceRemaining(kdtree);

The insertion of the overflowing set into the hybrid R-Tree is described in algorithm 4. First, a KD-Tree,
which serves for cache lookups and subset generation, is constructed and then distributed starting at the
R-Tree root node. This KD-Tree uses the point or normalized values as keys, each referencing a set of terms
which in turn refer to a set of documents. That means that the new entries for the secondary inverted index
structures are already pre-calculated there. Finally, the new items, i.e. terms co-occurring with points that
have not been placed yet, are inserted into the R-Tree using the appropriate splitting and subtree choosing
methods.

Algorithm 5: Distribution of Elements.

Function Hybrid::Distribute(kdtree, node)
    Data: the KD-Tree and the node to be modified
    for entry ∈ node do
        if ! isLeaf then
            if kDTree.CacheHit(entry) then
                subtree = hybrid.LoadSubtree(entry);
                hybrid.Distribute(kdtree, subtree);
        else
            assignment = kDTree.Candidates(entry, kdtree);
            secondary.PutEntries(entry, assignment);
        hybrid.UpdateBitlist();
Algorithm 5 shows how the distribution of the elements to the respective secondary inverted index structures
is executed. In the final version, for inner nodes only a check is performed whether at least one point from
the KD-Tree lies inside the region described by an element pointing to a particular subtree. If this is true,
the algorithm recurses into this subtree until a leaf node is hit. In the case of a leaf node, an assignment of
term ids to documents is built which is valid for the respective point. This assignment, consisting of
potentially multiple terms which may point to multiple documents, is then inserted into the secondary
inverted index structures.
Algorithm 6: Insertion Operation on Secondary Inverted Index.

Function Secondary::PutEntries(entry, assign)
    Data: the hybrid R-Tree entry and the assignment to be placed
    if secondary.CheckSingleDocument() then
        secondary.SetRowKey(document);
    else
        if row key not empty then
            secondary.MoveSec(entries);
        while assign != ∅ do
            toAdd = secondary.GetEntries(assign);
            inverted.Insert(toAdd);

The last algorithm to be executed when inserting a new document instance into the hybrid index is the
insertion operation on the secondary inverted index structures (see algorithm 6). It checks whether currently
only one document is pointed to by the entire secondary inverted index. If this is true and the passed
assignment also points solely to this document, only the bitlist of the R-Tree element which references this
secondary inverted index has to be updated. Otherwise, the respective assignment must be placed into the
secondary inverted index by pre-calculating all elements which may be placed in a bulk-loading oriented
procedure. This is executed by first searching a proper node in the directory B+-Tree and then inspecting the
respective node and possibly its successors if more than one element from the set of entries contained in the
assignment can be placed there.
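The single-document shortcut can be illustrated with a strongly simplified Java fragment; the names are made up and the real structure works page-based and bulk-oriented, which is omitted here.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the single-document shortcut of the secondary
// inverted index: as long as the whole index refers to exactly one document,
// only a row key is kept; otherwise the assignment is inserted term by term.
class SecondaryIndexSketch {
    private Long rowKey = null;                          // set while exactly one document is referenced
    private final Map<String, Set<Long>> postings = new HashMap<>();

    void putEntries(Map<String, Set<Long>> assignment) {
        Set<Long> docs = new HashSet<>();
        assignment.values().forEach(docs::addAll);

        // Shortcut: the index is empty or already bound to this single document.
        if (postings.isEmpty() && docs.size() == 1
                && (rowKey == null || rowKey.equals(docs.iterator().next()))) {
            rowKey = docs.iterator().next();             // only the R-Tree bitlist needs updating
            return;
        }
        if (rowKey != null) {
            // Shortcut no longer holds: the real structure first moves the
            // entries represented by the row key into the regular postings.
            rowKey = null;
        }
        for (Map.Entry<String, Set<Long>> e : assignment.entrySet()) {
            postings.computeIfAbsent(e.getKey(), t -> new HashSet<>())
                    .addAll(e.getValue());
        }
    }
}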
6 CONCLUSIONS
We presented an iterative approach for the optimization of hybrid index structures supporting spatial-keyword
queries. Finally, a set of algorithms is obtained which leads to efficient support for hybrid index structures in
realistic database environments. An analysis of queries during the optimizations is omitted because the
entire optimization work is focussed on keeping the retrieval performance at least constant or improving it;
the main focus was on optimizing the reorganisation behaviour. Reorganisation in this case refers to both
insertions and updates, because in many relational databases updates are performed by marking a row as
invalid and inserting a new one. Hence, besides the insertion operations, the updates are improved as well.
The main optimization effort was done in relational database management systems. However, the final
algorithms may be used in arbitrary environments where disk oriented persistence as well as buffer
management and allocation are an issue. Hence, they may also be ported to other kinds of persistence
systems like graph databases.
A data structure always consists of the basic storage structure definition in connection with a set of
algorithms handling the data flow inside this structure. The data structure in connection with retrieval
algorithms was already present prior to this work, yet the reorganisation algorithms were missing in detail.
This gap is now closed as we presented efficient algorithms for this task.
REFERENCES
Ang, C.-H. and Tan, T. C. (1997). New linear node splitting algorithm for R-trees. In SSD '97: Proceedings
of the 5th International Symposium on Advances in Spatial Databases, pages 339–349, London, UK.
Springer-Verlag.
Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. (1990). The R*-tree: An efficient and robust
access method for points and rectangles. SIGMOD Rec., 19(2):322–331.
Bentley, J. L. (1975). Multidimensional binary search
trees used for associative searching. Commun. ACM,
18(9):509–517.
Chen, L., Cong, G., Jensen, C. S., and Wu, D. (2013). Spa-
tial keyword query processing: an experimental eval-
uation. In Proceedings of the 39th international con-
ference on Very Large Data Bases, PVLDB’13, pages
217–228. VLDB Endowment.
Felipe, I. D., Hristidis, V., and Rishe, N. (2008). Keyword search on spatial databases. In International
Conference on Data Engineering, pages 656–665.
Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In SIGMOD '84: Proceedings
of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47–57, New York, NY,
USA. ACM.
Göbel, R. and de la Cruz, A. (2007). Computer science challenges for retrieving security related information
from the internet. Global Monitoring for Security and Stability (GMOSS), pages 90–101.
Göbel, R., Henrich, A., Niemann, R., and Blank, D. (2009). A hybrid index structure for geo-textual searches.
In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages
1625–1628, New York, NY, USA. ACM.
Göbel, R. and Kropf, C. (2010). Towards hybrid index structures for multi-media search criteria. In DMS,
pages 143–148. Knowledge Systems Institute.
Kropf, C., Ahmmed, S., Göbel, R., and Niemann, R. (2011). A geo-textual search engine approach assisting
disaster recovery, crisis management and early warning systems. In Geo-information for Disaster
Management (Gi4DM).
O'Neil, E. J., O'Neil, P. E., and Weikum, G. (1993). The LRU-K page replacement algorithm for database
disk buffering. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data,
SIGMOD '93, pages 297–306, New York, NY, USA. ACM.
Porter, M. F. (1997). An algorithm for suffix stripping. In Readings in Information Retrieval, pages 313–316.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Rocha-Junior, J. B. and Nørvåg, K. (2012). Top-k spatial keyword queries on road networks. In Proceedings
of the 15th International Conference on Extending Database Technology, EDBT '12, pages 168–179, New
York, NY, USA. ACM.
Rocha-Junior, J. B., Gkorgkas, O., Jonassen, S., and Nørvåg, K. (2011). Efficient processing of top-k spatial
keyword queries. In Proceedings of the International Symposium on Spatial and Temporal Databases (SSTD),
volume 6849 of LNCS, pages 205–222. Springer.
Vaid, S., Jones, C. B., Joho, H., and Sanderson, M. (2005).
Spatio-textual indexing for geographical search on the
web. In 9th International Symposium on Spatial and
Temporal Databases SSTD 2005, volume 3633 of Lec-
ture Notes in Computer Science, pages 218–235.
Wu, D., Yiu, M. L., Cong, G., and Jensen, C. S. (2012). Joint top-k spatial keyword query processing. IEEE
Transactions on Knowledge and Data Engineering, 24(10):1889–1903.
Zhang, D., Chee, Y. M., Mondal, A., Tung, A. K. H., and Kitsuregawa, M. (2009). Keyword search in spatial
databases: Towards searching by document. In International Conference on Data Engineering, pages
688–699.
Zhou, Y., Xie, X., Wang, C., Gong, Y., and Ma, W.-Y.
(2005). Hybrid index structures for location-based
web search. In CIKM ’05: Proceedings of the 14th
ACM international conference on Information and
knowledge management, pages 155–162, New York,
NY, USA. ACM.
Zipf, G. K. (1949). Human Behaviour and the Principle
of Least Effort: an Introduction to Human Ecology.
Addison-Wesley.
DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications
242