Determining Top-K Candidates by Reverse Constrained Skyline Queries

Ruei Sian Jheng, En Tzu Wang and Arbee L. P. Chen

Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan

Computational Intelligence Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan

Department of Computer Science, National Chengchi University, Taipei, Taiwan

Keywords: Top-K Queries, Range Queries, Skyline Queries, Reverse Skyline Queries, Quad-tree Index.

Abstract: Given a set of criteria, an object o is defined to dominate another object o' if o is no worse than o' in each

criterion and is strictly better in at least one criterion. A skyline query returns each object that is

not dominated by any other objects. Consider a scenario as follows. Given three types of datasets, including

residents in a city, existing restaurants in the city, and candidate places for opening new restaurants in the

city, where each restaurant and candidate place has its respective rank on a set of criteria, e.g., convenience

of parking, we want to find the top-k candidate places that have the most potential customers. The number of potential
customers of a candidate place is defined as the number of residents whose distance to this candidate is no
larger than a given distance r and who also regard this candidate as one of their skyline restaurants. In this paper, we

propose an efficient method based on the quad-tree index and use four pruning strategies to solve this

problem. A series of experiments are performed to compare the proposed method with a straightforward

method using the R-tree index. The experiment results demonstrate that the proposed method is very

efficient and that the pruning strategies are very powerful.

1 INTRODUCTION

In the past decade, various spatial queries on spatial

databases have attracted much attention, such as the

k-nearest-neighbor (kNN) queries, the reverse k-

nearest-neighbor queries (RkNN), the range queries,

and the skyline queries. There are also many studies

focusing on the skyline computation since it plays an

important role in the applications of multi-criteria

decision making. Given a d-dimensional dataset, a

data point p is said to dominate another data point q

if p is no worse than q in all dimensions and
strictly better than q in at least one dimension. A
data point is defined to be a skyline point if no other
data point dominates it. Many different approaches

have been proposed for efficient skyline

computation, such as BNL (Borzsonyi, 2001), D&C

(Borzsonyi, 2001), Bitmap (Tan, 2001), SFS

(Chomicki, 2003), LESS (Godfrey, 2005), BBS

(Papadias, 2005), SaLSA (Bartolini, 2006), ZSearch

(Lee, 2007), and OSP (Zhang, 2009).

Sharifzadeh and Shahabi (Sharifzadeh, 2006)

address spatial skyline queries, which consider

the Euclidean distances between a set of query

points and a set of data points. Deng et al. (Deng,

2007) address the multi-source skyline query and
propose efficient algorithms for processing it

in road networks. Chen and Lian (Chen, 2009) and

Fuhry et al. (Fuhry, 2009) propose the metric skyline

query, whose dynamic attributes are defined in the

metric space. Papadias et al. (Papadias, 2005) are the
first to mention the dynamic skyline, in
which an attribute value is considered better the
closer it is to the requirement of the user. Dellis and
Seeger (Dellis, 2007) introduce the reverse skyline
query, a special skyline query that is
conceptually similar to reverse k-nearest-neighbor queries.

Given a query point q, the reverse skyline query

returns the data points whose dynamic skyline

results contain q. For example, two features

including manufacture year and engine displacement

are considered while buying a second-hand car. The

operator of a second-hand car shop may want to

retrieve the number of customers who consider a

specific car as their dynamic skyline results for

deciding whether to import this car. We can make

more profitable decisions through the reverse

skyline query. On the other hand, the reverse k-nearest-neighbor (RkNN) query (Kang, 2007), (Korn,
2000), (Lin, 2003), (Stanoi, 2000), (Stanoi, 2001),
(Tao, 2004), (Wu, 2008), (Yang, 2001) has also
received significant research attention since it was
introduced in (Korn, 2000). An RkNN query
regarding a query point q finds all data points which
regard q as one of their corresponding k nearest
neighbors. Since q is close to such data points, q is
said to have high influence on these data points. The
RkNN answer set with respect to q is called the
influence set of q (Korn, 2000).

Jheng R., Wang E. and Chen A..
Determining Top-K Candidates by Reverse Constrained Skyline Queries.
DOI: 10.5220/0005498601010110
In Proceedings of 4th International Conference on Data Management Technologies and Applications (DATA-2015), pages 101-110
ISBN: 978-989-758-103-8
© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

In some applications, skyline queries may be

issued with a range constraint. Consider a scenario

as follows. There are some office buildings and

restaurants located in a city. Each restaurant has its

own scores in different criteria such as service or

average price. A lot of workers from the office

buildings have to find a restaurant for lunch. They

may issue a range query with a distance r to indicate

that only the restaurants within this distance will be

considered. Moreover, they most likely will choose

the skyline restaurants within this distance to have

lunch. That is, a worker may issue a constrained

skyline query to find their target restaurants. For a

restaurant, we define its popularity by the number of

times it appears as an answer in the constrained

skyline queries issued from the workers. The

popularity of a restaurant can be computed by

reverse constrained skyline queries.

Now assume we want to open new restaurants in

the city at several candidate locations. We want to

determine top-k candidates based on their popularity

such that a good business can be expected. For

solving this novel top-k query, in this paper, we

propose a basic method and an advanced method.

Three pruning strategies are provided for reducing

the number of competitors while computing the

number of potential customers for each candidate.

Moreover, a fourth pruning strategy eliminates
the customers that cannot be
potential customers of a target candidate. Building on
these four strategies, the advanced method
outperforms the basic method, substantially reducing
the computation time. The experiment results

demonstrate that the pruning strategies have a strong

pruning power.

The remainder of the paper is organized as

follows. The formal problem definition and a basic

solution to this problem are given in Section 2. An

advanced solution and its index structures are

described in Section 3. The performance evaluation

on the proposed algorithm is reported in Section 4.

Finally, Section 5 concludes this work.

2 PRELIMINARIES

In this section, we formally define the problem to be

solved and also propose a basic solution for it.

2.1 Problem Formulation

Referring to the scenario mentioned in Section 1, we

have two datasets including a set of office buildings

(customers) and a set of existing restaurants. In

addition, we have another dataset of candidates for

opening new restaurants. All of the datasets lie in a
two-dimensional space that represents their
locations. Moreover, the candidates
and the existing restaurants have n additional
attributes representing the features of the restaurants,
such as service or average price.

Assume each customer finds a restaurant within

a distance r from his/her location. This search area

forms a circle with the center being the location of

the customer and a radius of r as shown in Figure 1,

where the triangle point represents the customer. If a

restaurant is located within this search area and is

a skyline point among all restaurants in this area

considering the other n attributes, this restaurant gets

one point from the corresponding customer. For

example, there are five restaurants located in the

search area as shown in Figure 1. The values of the

other 2 attributes of these restaurants representing

service ranking and food ranking are (6, 3), (5, 4),

(4, 5), (7, 5), and (6, 6), respectively. As a result, the

three restaurants with attributes (6, 3), (5, 4), and (4,

5) are skyline restaurants in this search area

(assuming smaller values of the attributes are better).

Each of them gets one point from the corresponding

customer.
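The example above can be reproduced with a few lines of Python. This is a minimal sketch rather than the paper's implementation: the names `dominates` and `skyline` are illustrative, smaller attribute values are assumed to be better (as stated in the example), and the skyline is found by a plain quadratic scan.

```python
from typing import List, Sequence, Tuple

Attrs = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every attribute and strictly better in one.

    Assumption: smaller attribute values are better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Attrs]) -> List[Attrs]:
    """Return the points that no other point dominates (simple quadratic scan)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# (service ranking, food ranking) of the five restaurants in the search area
restaurants = [(6, 3), (5, 4), (4, 5), (7, 5), (6, 6)]
result = skyline(restaurants)
print(result)  # [(6, 3), (5, 4), (4, 5)]
```

Here (7, 5) is dominated by (5, 4) and (6, 6) is dominated by (6, 3), so only the first three restaurants receive a point from this customer.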

Figure 1: An illustration of the search area of a customer.

The problem of determining the top-k candidates

by reverse constrained skyline queries is formally

defined as follows. There are three sets of data

points in a two-dimensional space, representing

customers, competitors, and candidates. Moreover,

the competitors and candidates have the other n

DATA2015-4thInternationalConferenceonDataManagementTechnologiesandApplications

102

attributes. Given the above three datasets and a

distance r, we return k data points which have the

highest scores from the set of candidates. The

scoring function is as described above. In
addition, when we compute the score of a candidate,

we only consider the dominating relationship

between the candidate and competitors but not the

candidate and other candidates.
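The scoring rule described above can be written compactly as follows. The notation is an assumption introduced here for illustration and does not appear in the original: U denotes the set of customers, S the set of competitors, d(·,·) the Euclidean distance between locations, and s ≺ c means that competitor s dominates candidate c on the n attributes.

```latex
\mathrm{score}(c) \;=\; \Bigl|\bigl\{\, u \in U \;:\; d(u, c) \le r \ \wedge\ \nexists\, s \in S \ \text{such that}\ d(u, s) \le r \ \wedge\ s \prec c \,\bigr\}\Bigr|
```

The top-k query then returns the k candidates with the largest values of score(c).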

2.2 The Basic Solution

The basic approach to this novel top-k query is

based on the R-tree index (Guttman, 1984). We

assume that the R-tree indices of the set of customers

R and the set of competitors C are constructed in

advance. Each data point in the set of candidates is

kept in a sequence. We sequentially process each of

them to compute their corresponding scores and then

return the top-k results.

For a candidate, we traverse the index of customers
to find the customers whose distance to the
candidate is at most r. In other words, a range

query with a center equal to the location of the

candidate and a distance r is issued. The returned

customers are kept in another sequence. Then, for

each returned customer, another range query is

issued to find the competitors whose distance to the

customer is at most r. The index of competitors can

help to efficiently answer this query. After that, we

compare the candidate with the corresponding

competitors on the other n attributes to check

whether the candidate is a skyline point. If so, it
gets one point from the corresponding customer.

Following the above steps, the score of each

candidate can be computed.
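The steps above can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the authors' code: the two range queries are replaced by linear scans instead of R-tree lookups, "within distance r" is taken to mean a distance of at most r, smaller attribute values are assumed to be better, and the names `dominates` and `basic_score` are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Attrs = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse everywhere, strictly better somewhere (smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def basic_score(cand_loc: Point, cand_attrs: Attrs,
                customers: List[Point],
                competitors: List[Tuple[Point, Attrs]],
                r: float) -> int:
    """Count customers within r of the candidate for whom it is a skyline restaurant."""
    score = 0
    for u in customers:
        # Linear scan standing in for the range query around the candidate.
        if dist(u, cand_loc) > r:
            continue
        # Linear scan standing in for the range query around customer u.
        dominated = any(dist(u, loc) <= r and dominates(attrs, cand_attrs)
                        for loc, attrs in competitors)
        if not dominated:
            score += 1
    return score


customers = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
competitors = [((2.0, 0.0), (2, 4))]
print(basic_score((0.5, 0.0), (3, 5), customers, competitors, r=1.5))  # 1
```

In this toy input the customer at (1, 0) can reach the dominating competitor, the customer at (5, 0) is out of range of the candidate, and only the customer at (0, 0) contributes a point.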

3 TOP-K QUERY PROCESSING

In this section, an advanced approach to solving the

top-k query considering dominating relationship is

proposed. The index structure used in this approach

is discussed in Subsection 3.1 and then we detail this

approach in Subsection 3.2.

3.1 The Index Structures

The advanced approach is based on two types of
index structures. One is built for customers, while
the other is built for competitors. Each of them is a
quadtree. A quadtree index is a tree data

structure in which each internal node has four

children. Quadtree is often used to partition a two-

dimensional space by recursively subdividing it into

four quadrants (regions). Initially, we subdivide the

whole space into four regions with equal sizes. If a

new region still contains objects, it will be further

subdivided into four smaller regions as well. A
region is recursively subdivided into smaller
regions until it contains no objects or the
number of objects in it is less than a chosen
threshold.

We use a quadtree to index the customers and call
it the amount-quadtree. In each internal node, we
additionally record the number of customers in
each of its child nodes. An example is shown in Figs. 2(a)
and 2(b). The triangle points represent
customers. As mentioned, while constructing the
quadtree index, the whole space is recursively
subdivided into smaller regions. The circles
represent the internal nodes of the amount-quadtree,
and the square point is its root node.
As shown in Fig. 2(b), for the root
node and each internal node, we record the number
of customers in each of its child nodes.

(a) An example of the index structure of customers.

(b) An illustration of amount-quadtree.

Figure 2: Examples of the index structure of customers

and amount-quadtree.
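A rough sketch of how such an index might look and how the recorded amounts can shortcut a circular range count is given below. Everything here is an assumption for illustration: the class name `AmountNode`, the leaf capacity, the depth limit, and the choice to store each region's count on the node itself (so that a parent reads the amounts of its children from `child.count`).

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float]


class AmountNode:
    """Quadtree node over customers that also records how many customers it covers."""

    def __init__(self, x0, y0, x1, y1, points: List[Point],
                 capacity: int = 2, depth: int = 0, max_depth: int = 16):
        self.box = (x0, y0, x1, y1)
        self.count = len(points)          # recorded amount for this region
        self.points = points              # kept only in leaves
        self.children: List["AmountNode"] = []
        if len(points) > capacity and depth < max_depth:
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            quads = [(x0, y0, mx, my), (mx, y0, x1, my),
                     (x0, my, mx, y1), (mx, my, x1, y1)]
            buckets: List[List[Point]] = [[], [], [], []]
            for p in points:
                buckets[(p[0] >= mx) + 2 * (p[1] >= my)].append(p)
            self.children = [AmountNode(*q, b, capacity, depth + 1, max_depth)
                             for q, b in zip(quads, buckets)]
            self.points = []


def count_in_circle(node: AmountNode, center: Point, r: float) -> int:
    """Number of indexed customers whose distance to center is at most r."""
    if node.count == 0:
        return 0
    x0, y0, x1, y1 = node.box
    nearest = (min(max(center[0], x0), x1), min(max(center[1], y0), y1))
    if dist(nearest, center) > r:
        return 0                          # region lies entirely outside the circle
    farthest = (x0 if center[0] - x0 > x1 - center[0] else x1,
                y0 if center[1] - y0 > y1 - center[1] else y1)
    if dist(farthest, center) <= r:
        return node.count                 # region lies entirely inside: use the record
    if not node.children:
        return sum(1 for p in node.points if dist(p, center) <= r)
    return sum(count_in_circle(ch, center, r) for ch in node.children)


pts = [(1, 1), (2, 2), (3, 3), (8, 8), (9, 9), (9, 1)]
root = AmountNode(0, 0, 10, 10, pts)
print(root.count, count_in_circle(root, (2, 2), 2.0))  # 6 3
```

The second early return is where the recorded amounts pay off: a region fully covered by the query circle is answered from its stored count without visiting its descendants.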

We also use the quadtree structure to index

competitors, called superiority-quadtree. Again,

initially, we subdivide the whole space into four

regions with equal sizes. If a new region still

contains objects, it will be further subdivided into

four smaller regions as well. In each internal node,

DeterminingTop-KCandidatesbyReverseConstrainedSkylineQueries

103

we additionally record the best value in each of the n
dimensions with respect to its child nodes. An

example is shown in Figs. 3 and 4. The star points

are regarded as competitors. The circles are used to

represent the internal nodes of superiority-quadtree.

The square is the root of superiority-quadtree. As

shown in Fig. 4, we use the second quadrant to

explain the concept of superiority-quadtree. We find

the best values in x-dimension and y-dimension from

all of the child nodes of a corresponding internal

node. Suppose that we prefer the smaller value in

both x-dimension and y-dimension. The data point

(1, 4) is the competitor that has the best value in x-dimension.
The data point (2, 2) is the competitor

that has the best value in y-dimension. Then, we

record (1, 2) in the corresponding internal node.

Figure 3: An example of the index structure of competitors.

Figure 4: An illustration of superiority-quadtree.
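The record kept in an internal node of the superiority-quadtree can be illustrated with the two competitors named above. The sketch below makes several assumptions: smaller values are better, only the two named data points are used, and the function names are hypothetical. The skip test is a conservative check chosen for this illustration: if the record is strictly worse than the candidate in some dimension, then every competitor below that node is also worse in that dimension, so none of them can dominate the candidate.

```python
from typing import List, Sequence, Tuple

Attrs = Tuple[float, ...]


def best_record(attr_list: List[Attrs]) -> Attrs:
    """Best (here: smallest) value per dimension over a group of attribute vectors."""
    return tuple(min(column) for column in zip(*attr_list))


def branch_can_be_skipped(record: Sequence[float], candidate: Sequence[float]) -> bool:
    """True if no competitor summarized by record can dominate the candidate."""
    return any(best > own for best, own in zip(record, candidate))


record = best_record([(1, 4), (2, 2)])
print(record)                                  # (1, 2)
print(branch_can_be_skipped(record, (0, 3)))   # True: nobody reaches 0 in x
print(branch_can_be_skipped(record, (3, 5)))   # False: the branch must be visited
```

The same `best_record` function can combine the records of child nodes into the record of their parent, since the per-dimension minimum of minima is the overall minimum.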

3.2 The Advanced Solution

In the following, we first introduce four strategies

used in our advanced algorithm for efficiency

enhancement, three of which are used to reduce the

number of competitors and the other one is used to
reduce the number of customers when computing the
scores for candidates. Then, we detail the

advanced algorithm.

Property 1. Given a candidate, if the distance

between the candidate and a specific competitor is

larger than 2r, this competitor cannot affect the score

of the given candidate.■

This property is quite straightforward. The

influence region of the competitor, i.e., the circle

with a center equal to the location of the competitor

and a radius of r cannot overlap the influence region

of the candidate since the distance between the

candidate and competitor is larger than 2r.

Accordingly, this competitor cannot affect the score

of the candidate.
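Property 1 follows from the triangle inequality: a customer u that lies within r of both the candidate c and the competitor s would give d(c, s) ≤ d(c, u) + d(u, s) ≤ 2r. A minimal sketch of the resulting filter is shown below; the function name `may_affect` and the sample coordinates are assumptions made for illustration.

```python
from math import dist
from typing import Tuple

Point = Tuple[float, float]


def may_affect(cand_loc: Point, comp_loc: Point, r: float) -> bool:
    """False when the competitor is farther than 2r, so Property 1 prunes it."""
    return dist(cand_loc, comp_loc) <= 2 * r


candidate = (0.0, 0.0)
competitor_locs = [(3.0, 0.0), (3.1, 0.0), (1.0, 1.0)]
kept = [s for s in competitor_locs if may_affect(candidate, s, r=1.5)]
print(kept)  # [(3.0, 0.0), (1.0, 1.0)]
```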

Property 2. Given a candidate, the competitors

which cannot dominate the given candidate cannot

affect the score of the candidate.■

For a candidate, the competitors who cannot

dominate the candidate cannot decide whether the

candidate is a skyline result or not, no matter where

the locations of the competitors are. Since only the

skyline results can get the score from a customer,

these competitors cannot affect the score of the

candidate. On the other hand, if a candidate is

dominated by a specific competitor, we can ensure

that the candidate cannot get any score from the

customers located at the overlap of the two

corresponding influence regions. This is because, owing
to the competitor, the candidate cannot be a
skyline result with respect to the customers in the
overlap.

Figure 5: An example of Property 3.

Property 3. Given a competitor within a distance r

to a specific candidate, which dominates the specific

candidate, the competitor forms a do-not-care area

in which the other competitors need not be

considered while computing the score of the specific

candidate.■

DATA2015-4thInternationalConferenceonDataManagementTechnologiesandApplications

104

We use the example shown in Fig. 5 to introduce
the do-not-care area with respect to a competitor and
to detail the concept of Property 3. In the setting of
Property 3, the influence region of the star-shaped
competitor and that of the specific candidate
intersect at two points (denoted checkpoint1 and
checkpoint2 in Fig. 5). A line passing through
checkpoint1 and checkpoint2 partitions the circle
centered at the location of the candidate with a
radius of 2r into two regions, one of which contains
the candidate while the other contains the star-shaped
competitor. The do-not-care area with respect to the
star-shaped competitor is the region containing the
star-shaped competitor minus the influence region of
checkpoint1 and that of checkpoint2. It is shown as
the dark area in Fig. 5.

Any other competitor located in the do-not-care
area need not be compared with the
specific candidate. This is because the overlap of the
influence region of the candidate and that of another
competitor located in the do-not-care area is always
fully contained in the overlap of the influence region
of the candidate and that of the star-shaped
competitor. For the customers in the overlap of the
influence region of the candidate and that of another
competitor located in the do-not-care area, the
candidate need not be compared with that competitor,
since the candidate is already dominated by the
star-shaped competitor and thus cannot get scores
from these customers.
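The geometric test described above can be sketched as follows. This is an illustrative reading of the construction, not the authors' code: the function names are hypothetical, the two checkpoints are obtained from the standard intersection of two circles of equal radius r, and "the side containing the competitor" is tested by checking that a point is closer to the competitor than to the candidate, which is equivalent because the line through the checkpoints is the perpendicular bisector of the segment joining them.

```python
from math import dist, sqrt
from typing import Tuple

Point = Tuple[float, float]


def checkpoints(cand: Point, comp: Point, r: float) -> Tuple[Point, Point]:
    """Intersection points of the two influence circles (both of radius r)."""
    d = dist(cand, comp)
    if d == 0 or d > 2 * r:
        raise ValueError("the influence circles do not meet in two points")
    mx, my = (cand[0] + comp[0]) / 2, (cand[1] + comp[1]) / 2
    h = sqrt(r * r - (d / 2) ** 2)              # half-length of the common chord
    ux, uy = (comp[0] - cand[0]) / d, (comp[1] - cand[1]) / d
    return (mx - uy * h, my + ux * h), (mx + uy * h, my - ux * h)


def in_do_not_care_area(other: Point, cand: Point, comp: Point, r: float) -> bool:
    """True if `other` lies in the do-not-care area formed by `comp`.

    Precondition (from Property 3): comp dominates the candidate and lies
    within a distance r of it.
    """
    p1, p2 = checkpoints(cand, comp, r)
    return (dist(other, cand) <= 2 * r              # inside the 2r circle
            and dist(other, comp) < dist(other, cand)  # on the competitor's side
            and dist(other, p1) > r                 # outside checkpoint1's region
            and dist(other, p2) > r)                # outside checkpoint2's region


cand, comp = (0.0, 0.0), (1.0, 0.0)
print(in_do_not_care_area((1.5, 0.0), cand, comp, r=1.0))   # True
print(in_do_not_care_area((-1.0, 0.0), cand, comp, r=1.0))  # False: candidate's side
print(in_do_not_care_area((0.6, 0.9), cand, comp, r=1.0))   # False: near checkpoint1
```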

Figure 6: An example of Property 4.

Property 4. Given a competitor which dominates a

specific candidate, any customers located in the

overlap of the two corresponding influence regions

formed by the candidate and competitor cannot

contribute scores to the candidate.■

As shown in Fig. 6, suppose that we prefer small
values in both attributes. The candidate with (3, 5) is
dominated by the competitor with (2, 4). Any
customer located in the overlap of the two
corresponding influence regions formed by the
candidate and the competitor will not contribute a score
to the candidate since, for these customers, the
candidate can never be a skyline result because of
the competitor that dominates it.

Algorithm 1: (The ECE algorithm).

Input: amount-quadtree of customers R, superiority-quadtree of competitors S, candidates, r, and k

Output: Top-k results from candidates

Main function ( ):
  For each candidate c
    Traverse amount-quadtree to compute the number of customers located within a distance r of c
  Sort the candidates in decreasing order of the number of counted customers
  Compute the scores of the first k candidates in the sorted list. The smallest score is used as the threshold T and these k candidates are regarded as the potential results
  For each unchecked candidate c in the sorted list
    If the number of customers located within a distance r of c < T
      Prune c
    else
      Compute the score of c
      If the score of c is larger than T
        Replace the candidate with the smallest score in the potential results by c
        T = the smallest score in the potential results
  Return the top-k results

The advanced algorithm named ECE (Efficient

Candidate Elimination) detailed in the following is

based on the amount-quadtree index of customers

and the superiority-quadtree index of competitors.

We assume that the amount-quadtree index and the

superiority-quadtree index are constructed in

advance. The pseudocode of the ECE method is
shown in Algorithm 1.

First, we sequentially process each candidate by

tracing amount-quadtree from the root to get the

number of customers located in the influence region

of a candidate. By using the amount-quadtree index

structure, we can efficiently get the number of

customers located in the influence region of a

corresponding candidate because in some cases, we

can obtain the number of customers from the records

of the internal nodes, without descending to the leaves.

After that, we sort the candidates into a decreasing

order according to the number of customers located

in their corresponding influence regions. For each

DeterminingTop-KCandidatesbyReverseConstrainedSkylineQueries

105

candidate in the sorted list, we calculate its exact

score. Once we get k exact scores, the smallest score

is used as the lower bound. The candidates in the
sorted list whose number of customers located in
the corresponding influence regions is smaller than
this lower bound can be pruned
as they have no chance of being among the top-k results. In

the following, we focus on how to compute the exact

score of a candidate.
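The filtering loop described above might be sketched as follows. This is a hedged illustration, not the authors' C++ implementation: `upper_bound` and `exact_score` are hypothetical callables standing for the amount-quadtree count and the exact score computation, and a min-heap is used here to keep the current k best scores so that the lower bound is always the smallest score among the potential results. Because the candidates are sorted by their customer counts, the loop stops at the first candidate whose count falls below the lower bound, which has the same effect as pruning every remaining candidate one by one.

```python
import heapq
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")


def top_k_candidates(candidates: List[C], k: int,
                     upper_bound: Callable[[C], int],
                     exact_score: Callable[[C], int]) -> List[Tuple[C, int]]:
    """Return up to k (candidate, score) pairs with the highest exact scores."""
    if k <= 0:
        return []
    # Sort by the number of customers in range, which bounds the score from above.
    ranked = sorted(((upper_bound(c), c) for c in candidates),
                    key=lambda t: t[0], reverse=True)
    heap: List[Tuple[int, int, C]] = []     # min-heap of (score, position, candidate)
    for pos, (bound, c) in enumerate(ranked):
        if len(heap) < k:
            heapq.heappush(heap, (exact_score(c), pos, c))
            continue
        threshold = heap[0][0]              # smallest score among potential results
        if bound < threshold:
            break                           # every later candidate is pruned as well
        s = exact_score(c)
        if s > threshold:
            heapq.heapreplace(heap, (s, pos, c))
    return [(c, s) for s, _, c in sorted(heap, key=lambda t: (-t[0], t[1]))]


# Toy input: customer counts (upper bounds) and exact scores per candidate.
bounds = {"a": 10, "b": 8, "c": 6, "d": 2}
scores = {"a": 3, "b": 8, "c": 5, "d": 2}
evaluated = []


def score_and_log(name: str) -> int:
    evaluated.append(name)
    return scores[name]


result = top_k_candidates(list(bounds), 2, bounds.__getitem__, score_and_log)
print(result)     # [('b', 8), ('c', 5)]
print(evaluated)  # ['a', 'b', 'c']
```

In the toy run the exact score of candidate "d" is never computed, because its customer count of 2 is below the lower bound of 5 at that point.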

By Properties 1-3, we can prune most of the

competitors that need not be considered while

computing the exact score for a specific candidate.

For a specific candidate, we first prune the

competitors with a distance to it more than 2r by

Property 1. Then, by using the superiority-quadtree

index structure, we can efficiently get the

competitors that dominate the candidate because of

the record kept in the internal node. When traversing

superiority-quadtree, if the records kept in the

internal node are dominated by the target candidate,

we can prune the competitors in that branch by

Property 2 since the competitors who cannot

dominate the target candidate cannot affect the score

of the candidate. Then, by sequentially checking the
competitors that dominate the target candidate and lie
within a distance r of it, we can
prune the competitors located in the corresponding
do-not-care areas by Property 3. After the above checks, only the
remaining competitors are taken into account
when computing the exact score of the target candidate.
For the customers, we first issue a range query from
the target candidate to find the customers that we
need to check. Before processing each
of these customers, we can reduce the number
of customers to be checked by Property 4. After all
of the above pruning checks, we use the
basic method to compute the exact score of the target
candidate.
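The order of the pruning steps can be sketched on plain lists as follows. This is a simplified illustration under stated assumptions rather than the ECE implementation: linear scans replace both quadtrees, Property 3 is left out (it would only shrink the list of dominating competitors further), smaller attribute values are assumed to be better, and the names `dominates` and `pruned_score` are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Attrs = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse everywhere, strictly better somewhere (smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pruned_score(cand_loc: Point, cand_attrs: Attrs,
                 customers: List[Point],
                 competitors: List[Tuple[Point, Attrs]],
                 r: float) -> int:
    """Exact score of a candidate after pruning by Properties 1, 2, and 4."""
    # Properties 1 and 2: keep only competitors that are within 2r of the
    # candidate and that dominate it.
    threats = [loc for loc, attrs in competitors
               if dist(loc, cand_loc) <= 2 * r and dominates(attrs, cand_attrs)]
    # Range query from the candidate (a linear scan in this sketch).
    in_range = [u for u in customers if dist(u, cand_loc) <= r]
    # Property 4: a customer that can also reach a dominating competitor
    # contributes nothing; every other customer in range contributes one point.
    return sum(1 for u in in_range if all(dist(u, s) > r for s in threats))


customers = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
competitors = [((2.0, 0.0), (2, 4)),    # dominates (3, 5) and is close enough
               ((0.2, 0.0), (9, 9)),    # close but does not dominate
               ((9.0, 0.0), (1, 1))]    # dominates but is farther than 2r
print(pruned_score((0.5, 0.0), (3, 5), customers, competitors, r=1.5))  # 1
```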

4 EXPERIMENTS

In this section, a series of experiments are performed

to evaluate our approaches and the experiment

results are also presented and analyzed.

4.1 Experiment Setup

We use the data generator RandD to generate three

synthetic datasets with the independent, correlated,

and anti-correlated distributions as shown in Table

1. All objects in the datasets have coordinates within

the range of ([0, 2000], [0, 2000]). The competitors

and candidates have two attributes within a range of

([0, 2000], [0, 2000]). We also conduct the

experiments on a real dataset, obtained from the

website (http://www.census.gov/geo/www/tiger). Its

distribution is shown in Fig. 7. The real dataset

represents the resident locations in Los Angeles. The
size of this real dataset is approximately 360K.
We regard these data points as the customers. We also
generate 100K competitors and 1K
candidates with two attributes by the independent
generator RandD. The coordinates of each object

are within the range of ([0, 3100], [0, 1800]) and the

two attributes are within the range of ([0, 3100], [0,

3100]).

Four variables including r, number of customers,

number of competitors, and number of candidates

are used as the factors in the experiments, as

shown in Table 2. Moreover, k is set to 5 in the

experiments. All of the algorithms are implemented

in C++ and run on a PC with an Intel Core
i5-2500 3.30GHz CPU and 8GB main memory,
under the 64-bit Windows 7 operating system.

Table 1: The distributions of the test datasets.

Distribution    | Description
Independent     | The attributes of each data point are generated uniformly and randomly.
Correlated      | If a data point has an attribute with low value, the other attributes of this data point may likely have low values as well.
Anti-Correlated | If a data point has an attribute with a low value, the other attributes of this data point may likely have high values.

Table 2: Experimental factors.

Factors          | Default | Range
# of customers   | 200K    | 150K - 300K
# of competitors | 5K      | 3K - 6K
# of candidates  | 500     | 250 - 1000
r                | 200     | 160 - 240

Figure 7: The data distribution of the real dataset.

DATA2015-4thInternationalConferenceonDataManagementTechnologiesandApplications

106