DISTRIBUTED APPROACH OF CONTINUOUS QUERIES WITH

KNN JOIN PROCESSING IN SPATIAL DATA WAREHOUSE

Marcin Gorawski and Wojciech Gębczyk

Silesian Technical University, Institute of Computer Science, Akademicka 16, 44-100 Gliwice, Poland

Keywords: kNN join processing, distributed spatial data warehouse, continuous query, mobile query, mobile object.

Abstract: The paper describes realization of distributed approach to continuous queries with kNN join processing in a

spatial telemetric data warehouse. Due to dispersion of the developed system, new structural members were

distinguished - the mobile object simulator, the kNN join processing service and the query manager.

Distributed tasks communicate using JAVA RMI. The kNN queries (k Nearest Neighbour) joins every point

from one dataset with its k nearest neighbours in the other dataset. In our approach we use the Gorder

method, which is a block nested loop join algorithm that exploits sorting, join scheduling and distance

computation filtering to reduce CPU and I/O usage.

1 INTRODUCTION

With expansion of location-aware technologies such

as the GPS (Global Positioning System) and

growing popularity and accessibility of the mobile

communication, location-aware data management

becomes a significant problem in the mobile

computing systems. Mobile devices become much

more available with concurrent growth of their

computational capabilities. It is expected that future

mobile applications will require scalable architecture

that will be able to process a very large and quickly

growing number of mobile objects and to evaluate

compound queries over their locations (Yiu,

Papdias, Mamoulis, Tao, 2006).

The paper describes realization of distributed

approach to the Spatial Location and Telemetric

Data Warehouse (SDW(l/t)), which bases on the

Spatial Telemetric Data Warehouse (STDW))

STDW consist of telemetric data containing

information about water, gas, heat or electricity

consumption (Gorawski, Wróbel, 2005). DSDW(l/t)

(Distributed Spatial Location and Telemetric Data

Warehouse) is supplied by datasets from Integrated

Meter Reading (IMR) data system and by mobile

objects locations.

Integrated Meter Reading data system enables

communication between medium meters and

telemetric database system. Using GPRS or SMS

technology, measurements from meters located on a

wide geographical area are transferred into database,

where they are processed and stored for further

analysis.

The SDW(l/t) supports tactical decisions making

process concerning medium productivity on the base

of short-termed consumption predictions.

Predictions are supported in the analysis of data

assembled in a data warehouse during the ETL

process, evaluated for datasets from the telemetric

database server.

2 DESIGNED APPROACH

First figure illustrates designed approach’s

architecture. We can observe multiple, concurrently

operating mobile objects (query points), the Gorder

(Chenyi, Hongjun, Beng Chin, Jing 2004) service is

responsible for processing simultaneous continuous

queries over k nearest neighbors, RMI’s SDWServer

and the central part of designed data system -

SDW(l/t), which is also described as a query

manager. Communication between SDW(l/t) and

query processing service is maintained with Java’s

Remote Method Invocation (RMI) solutions.

Principal notion of the described approach is to

distribute previously designed system over many

independent nodes. Asa result we expect faster and

more efficient processing of similarity join method

Gorder. In the previous approach all components of

the system shown in figure 1 were linked together on

a single computer station. All active processes used

131

Gorawski M. and GÄ

Zbczyk W. (2007).

DISTRIBUTED APPROACH OF CONTINUOUS QUERIES WITH KNN JOIN PROCESSING IN SPATIAL DATA WAREHOUSE.

In Proceedings of the Ninth International Conference on Enterprise Information Systems - DISI, pages 131-136

DOI: 10.5220/0002368501310136

 SciTePress

the same CPU. Because of high CPU usage and long

evaluation time we decided to distribute the SDW

(l/t) into independent services, linked together with

Java RMI technology. The most efficient solution

assumes running the Gorder service on a separate

computer because it is the most CPU consuming

element. Other components may be executed on

other computers or on the same computer due to

their insignificant CPU consumption.

Designed system works as follows. First we have

to upload a road map and defined meters into

database on Oracle Server, using SDW(l/t). Then we

start the SDWServer, the Gorder service and as

many mobile objects as we want to evaluate. Every

new mobile object is registered in the database. In

SDW(l/t) we define new queries for active mobile

objects. Queries are also registered in database. The

Gorder service periodically verifies if any queries

are defined. Every query is processed during each

cycle of the Gorder process. The results are sent to

SDW(l/t), where they are stored for further analysis.

SDWServer secures steady RMI connection between

running processes.

Figure 1: DSDW(l/t).

3 MOBILE OBJECT’S

SIMULATOR

For designed approach’s evaluation we developed a

mobile object simulator that corresponds to any

moving object like car, man or airplane. Being in

constant movement, mobile objects are perfect to act

as a query points. Continuous changes in their

locations forces data system to continuously process

queries to maintain up-to-date information about

object’s k nearest neighbours. While designing the

mobile object mechanism we made a few

assumptions. On the one hand, mobile objects are

not allowed to interfere in system’s behaviour, but

on the other hand, they provide everything that is

necessary to conduct system’s overall experiments.

They also prepare system for realistic, natural

conditions.

Mobile object’s simulator is a single process that

represents a moving object. It consantly changes its

actual location and destination. We assume that a

moving object has the ability to send updates on its

location to the Oracle server, which is the core of

DSDW (l/t). It is justifiable assumption because the

GPS devices are getting cheaper every day.

In real terms, the location-aware monitoring

systems are not aware of mobile object’s problem of

choosing the right direction, because it is not the

system that decides where specific object is aiming

to. System only receives information about current

object’s location and makes proper decisions on the

way of processing it. Since our project is not a real

life system, but only a simulation, which goal is to

evaluate new solutions, we do not have access to the

central system containing information about the

mobile objects positions. Therefore, we had to

develop an algorithm that will decide on mobile

object’s movement in order to make SDW(l/t) more

realistic.

4 GORDER QUERY SERVICE

k Nearest Neighbor (kNN) join combines each point

of one dataset R with its k nearest neighbors in the

other dataset S. Gorder is a block nested loop join

algorithm which achieves its efficiency due to data

sorting, join scheduling and distance computation

reduction. Firstly it sorts input datasets into order

called G-order (an order based on a grid). As a

result, the datasets are ready to be partitioned into

blocks that are proper for efficient scheduling for

join processing. Secondly, scheduled block nested

loop join algorithm is applied to find k nearest

neighbors for each block of R data points within data

blocks of S dataset.

Gorder achieves its efficiency due to inheritance

of strength of the block nested loop join in being

able to reduce random reads and due to a pruning

strategy, which reduces unpromising data blocks

using properties of G-ordered data. Furthermore,

Gorder utilizes two-tier partitioning strategy to

optimize CPU and I/O time and reduces distance

computation cost by pruning away redundant

computations.

jdbc

SDWServer

RMI

ORACLE

Server

Mobile

object

Mobile

object

Mobile

object

Mobile

object

SDW(l/t)

Gorder

jdbc RMI

ICEIS 2007 - International Conference on Enterprise Information Systems

132

4.1 G-ordering

The Gorder’s authors designed an ordering based on

a grid called G-ordering to group neighboring data

points together, so that in the scheduled block nested

loop join phase they can identify the partition of a

block of G-ordered data and schedule it for join.

Firstly, Gorder conducts the PCA transformation

(Principal Component Analysis) on input datasets.

Secondly, it applies a grid on a data space and

partitions it into l

square cells, where l is the

number of segments per dimension.

Definition 1. (kNN join) (Chenyi, Hongjun, Beng

Chin, Jing, 2004) Given two data sets R and S, an

integer k and the similarity metric dist(), the kNN-

join of R and S, denoted as R

kNN S, returns pairs of

points (p

; q

) such that p

is from the outer dataset R

and q

from the inner dataset S, and q

is one of the

K-nearest neighbours of p

∞≤≤

⎟

⎠

⎞

⎜

⎝

⎛

−=

∑

1,..),(

xqxpqpdist

(1)

For further notice we have to define the

identification vector, as a d-dimensional vector

v=<s

,...,s

>, where s

is the segment number to

which the cell belongs to in i

dimension. In our

approach we deal with a two-dimensional

identification vectors.

Bounding box of a data block B is described by

the lower left E = <e

, ..., e

> and upper right T =

, ..., t

> point of data block B (Böhm,

Braunmüller, Krebs, Kriegel 2001).

⎪

⎩

⎪

⎨

⎧

≤≤⋅−

kdla

)1.(

(2)

⎪

⎩

⎪

⎨

⎧

≤≤⋅

kdla

(3)

where α is an active dimension of the data block. In

designed approach points will be represented only

by two dimensions: E = <e

, e

>, T = <t

, t

4.2 Scheduled G-ordered Data Join

In the second phase of Gorder, G-ordered data from

R and S datasets is examined for joining. Let assume

that we allocate n

and n

buffer pages for data of R

and S. Next we partition R and S into blocks of the

allocated buffer sizes. Blocks for R are allocated

sequentially and iteratively into memory. Blocks for

S are loaded into memory in order based on their

similarity to blocks for R, which are already loaded.

It optimizes kNN processing by scheduling blocks

for S so that the blocks which are most likely to

contain nearest neighbors can be loaded into

memory and processed first.

Similarity of two G-ordered data blocks is

measured by the distance between their bounding

boxes. As shown in a previous section, bounding

box of a block of G-ordered data may be computed

by examining the first and the last point of data

block. The minimum distance between two data

blocks B

and B

is denoted as MinDist(B

, B

), and is

defined as the minimum distance between their

bounding boxes (Chenyi, Hongjun, Beng Chin, Jing,

2004). MinDist is a lower bound to the distance of

any two points from blocks of R and S.

,(),(,

srsrssrr

ppdistBBMinDistBpBp ≤

∈

∀

(4)

According to the corollary shown above we can

deduce two pruning strategies (Chenyi, Hongjun,

Beng Chin, Jing, 2004):

1. If MinDist(B

) > pruning distance of p, B

does

not contain any points belonging to the k-nearest

neighbors of the point p, and therefore the distance

computation between p and points in B

can be

filtered. Pruning distance of a point p is the distance

between p and its kth nearest neighbor candidate.

Initially, it is ∞.

2. If MinDist(B

) > pruning distance of B

, B

does

not contain any points belonging to the k-nearest

neighbors of any points in B

, and hence the join of

and B

can be pruned away. The pruning distance

of an R block is the maximum pruning distance of

the R points inside.

Join algorithm firstly sequentially loads blocks

of R into the main memory. For the block B

of R

loaded into memory, blocks of S are sorted in order

according to their distance to B

. At the same time

blocks with MinDist(B

) > pruning distance of B

are pruned (pruning strategy (2)). That is why only

remaining blocks are loaded into memory one by

one. For each pair of blocks of R and S the

MemoryJoin method is processed. After processing

all not pruned blocks of S with block of R, list of

kNN candidates for each point of B

, is returned as a

result.

DISTRIBUTED APPROACH OF CONTINUOUS QUERIES WITH KNN JOIN PROCESSING IN SPATIAL DATA

WAREHOUSE

133

4.3 Memory Join

To join blocks B

and B

each point p

in B

compared with B

. For each point p

in B

we find

out if MinDist(B

, B

) > pruning distance of p

. If the

condition holds, according to the first pruning

strategy, B

cannot contain any points that could be

candidates for k nearest neighbours of p

, so B

can

be skipped. In other way function CountDistance is

called for p

and each point p

in B

. Function

CountDistance inserts into a list of kNN candidates

of p

this p

, which dist(p

, p

) > pruning distance of

is a distance between the bounding boxes of

and B

on the α-th dimension, where

).,.min(

ααα

BB=

5 SDW(L/T)

The SDW(l/t) acts as a coordinator of all actions.

Configuration changes are initiated by this service. It

affects efficiency of the whole DSDW (l/t). The

SDW(l/t) is responsible for loading a virtual road

map in the database. All objects included in the

input dataset for the Gorder join processing service

are displayed on the map. In this application we can

define all query execution parameters that may

affect computation time. We correspond to this part

of system as a „query manager” because all queries

are defined and maintained here.

The SDW(l/t) enables generation of test datasets

for experimental runs. It is also an information

centre about all defined mobile objects and about

their current locations. One of the most important

feature of the SDW(l/t) enables tracing current

results for continuously processed queries.

Query manager provides information about

newly defined or removed queries to the

SDWServer. Afterwards, this information is fetched

by Gorder service, which recalculates input datasets

for kNN join and returns them for the query

processing.

6 EVALUATION OF

DISTRIBUTED SDW(L/T)

All experiments were performed on a road map of

size 15x15 km. Map was generated for 50 nodes per

100 km

and for 50 meters per 100 km

for each type

of medium (gas, electricity, water). Only evaluation

on effect of number of meters was conducted for a

few different maps. The number of segments per

dimension was set to 10. Block size was 50 data

points. This values has been considered as optimal

after performing additional tests that are not

described in this paper. In the study we performed

experiments for a non-distributed SDW(1/t) and

distributed versions of SDW(l/t) – DSDW(l/t). The

results illustrate influence of distribution on system

efficiency and query computation time.

6.1 Testing architecture DSDW(l/t)

Figure 2 illustrates hardware architecture used

during evaluation of DSDW(l/t). On first computer

we place Oracle 10g with the designed database,

RMI server SDWServer and the SDW(1/t) for

managing queries. On the separate computer we

place mobile objects because they do not consume

much of the computation power and many processes

can be run simultaneously. On the last computer we

run only the Gorder service for better evaluation

time.

SDW(l/t)

SDWServer

Oracle 10g

100 Mb/s

Mobile

Object

Simulators

Gorde

Service

Figure 2: Testing architecture DSDW(l/t).

6.2 Single Query Experiments

2 4 6 8 10 15 20 50

Value of k

Time [ms]

Energy meters All meters

Figure 3a: Effect of value k SDW(l/t).

ICEIS 2007 - International Conference on Enterprise Information Systems

134

For a single query experiments we define one

mobile object. Figure 3a. illustrates that average

evaluation time of a query concerning one type of

meters (1) is more or less on constant level for non-

distributed version SDW(l/t). We can notice

distractions for k equal 6 or 8 but this aberrations are

very small, measured in milliseconds. For query

encompassing all meters (2), for higher number of

meters, an average query evaluation time increases

with the growth of value k starting from value 8,

where minimum is achieved. However, this increase

is also measured in milliseconds. For DSDW(l/t) we

can observe a little higher average measure time, but

it is constant and it does not change with the

increase of value k.

1 5 10 25 50

Value of k

Time [ms]

Energy meters All meters

Figure 3b: Effect of value k DSDW(l/t).

While testing the influence of of the number of

meters on query evaluation time we set parameter k

to value 20 (Figure 4). Performed experiments show,

that with the growth of number of meters the query

evaluation time increases. However, time does not

grow up very quickly. After increasing the number

of meters six times, query evaluation time increased

for about 77 % for non-distributed SDW(l/t). For

DSDW(l/t) we can notice little higher average

evaluation time. That is caused by the need of

downloading all meters data to another computer.

6.3 Simultaneous Queries Experiments

The time of full Gorder process was measured

during experiments for simultaneous queries. It

means that we measured the average summary

evaluation time for all defined queries that are

processed during single run of Gorder process.

Figure 5 summarizes the effect of number of

simultaneous queries on average Gorder process

evaluation time. All queries were defined for the

same type of meters. That is why the evaluation time

of one cycle of the Gorder process was evaluated

during one single call of the Gorder algorithm.

Concurrently with previous results, the influence of

k value on the process evaluation time is

insignificant. However, with the growth of number

of simultaneous queries the time of conducted

computations increases. For SDW(l/t) experiments

were performed only for 5 mobile objects due to

high CPU usage caused by running entire system on

one computer. It was needless to run experiments

for a greater number of mobile objects. An average

evaluation time increases with the growth of number

of queries. Each additional query causes time growth

for about 10 ms. For distributed version of the

system we could process 12 objects and more. An

average evaluation time is little higher but it is more

constant and increases slowly.

50 100 150 200 250 300

Number of meters per 100 km^2

Time [ms]

100

120

140

160

50 100 150 200 250 300

Nu mb er o f meters p er 100 km^ 2

Time [ms]

Figure 4: Effect of number of meters per 100 km

–

SDW(l/t) (first figure) and DSDW(l/t) (second figure).

Differentiation of queries (Figure 6) caused that

in every single cycle of Gorder process, Gorder

algorithm was called separately for every type of

query. Therefore, for four queries about four

different types of meters Gorder process called

Gorder algorithm four times. Given results for non-

distributed SDW(l/t) proved that with the growth of

the number of differential queries, process

evaluation time increases significantly. We

processed only three queries with input datasets with

the same size.

In the DSDW(l/t) we performed experiments for

12 queries. 3 queries about water meters, 3 about gas

meters and 3 about electricity meters. Each of them

with the same input dataset size. We also added 3

queries encompassing all meters. By adding queries

successively, one by one, from each type of query,

we measured average evaluation time of the entire

DISTRIBUTED APPROACH OF CONTINUOUS QUERIES WITH KNN JOIN PROCESSING IN SPATIAL DATA

WAREHOUSE

135

process. Given results show that with the growth of

the number of different queries the average

evaluation time increases slowly. The growth is

much less significant than in non-distributed version

and we are able to process much more queries.

7 SUMMARY

Pilot system SDW(l/t) is being improved in terms of

searching for new simultaneously continuous queries

processing techniques. Distributed approach of

designed system DSDW(l/t) shows that this

development direction should be considered for

further analysis. Furthermore, using incremental

execution paradigm as the way to achieve high

scalability during simultaneous execution of

continuous spatio-temporal queries should be

considered. Queries should be grouped in the unique

list of continuous, spatio-temporal queries, so that

spatial join could be processed between moving

objects and moving queries.

We also consider implementing solution for

balanced, simultaneous and distributed query

processing to split execution of queries of the same

type on different computers, depending on their

CPU usage prediction.

REFERENCES

Yiu, M., Papadias, D., Mamoulis, N., Tao, Y.. Reverse

Nearest Neighbors in Large Graphs. IEEE

Transactions on Knowledge and Data Engineering

(TKDE),18(4), 540-553, 2006.

Mouratidis, K., Yiu, M., Papadias, D., Mamoulis, N.,

2006. Continuous Nearest Neighbor Monitoring in

Road Networks. To appear in the Proceedings of the

Very Large Data Bases Conference (VLDB), Seoul,

Korea, Sept. 12 - Sept. 15, 2006.

Gorawski M., Wróbel W., 2005. Realization of kNN Query

Type in Spatial Telemetric Data Warehouse. Studia

Informatica, vol.26, nr 2(63), pp.1-22, 2005.

Gorawski M., Malczok R., 2004. Distributed Spatial Data

Warehouse Indexed with Virtual Memory Aggregation

Tree. 5th Workshop on Spatial-Temporal DataBase

Management (STDBM_VLDB’04), Toronto, Canada

2004.

Chenyi Xia, Hongjun Lu, Beng Chin Ooi, Jing Hu, 2004.

GORDER: An Efficient Method for KNN Join

Processing, VLDB 2004, pp. 756-767.

Böhm Ch., Braunmüller B., Krebs F., Kriegel H., 2001.

Epsilon Grid Order: An Algorithm for the Similarity

Join on Massive High-Dimensional Data, Proc. ACM

SIGMOD INT. Conf. on Management of Data, Santa

Barbara, CA, 2001.

Hammad M. A., Franklin M. J., Aref W. G., Elmagarmid

A. K., 2003. Scheduling for shared window joins over

data streams. VLDB, 2003

12345

Number of simultaneous queries

Time [ms]

k = 5

k = 10

100

120

140

123456789101112

Number of simultaneous queries

Time [ms]

k=5 k=10

Figure 5: Effect of number of simultaneous queries –

SDW(l/t) (first figure) and DSDW(l/t) (second figure).

123

Number of different simultaneous queries

Time [ms]

k = 5 k=10

100

120

140

160

123456789101112

Number of simultaneous queries

Time [ms]

k=5 k=10

Figure 6: Effect of differentiation of simultaneous queries

– DSDW(l/t).

ICEIS 2007 - International Conference on Enterprise Information Systems

136