AN IMPROVED METHOD TO SELECT CANDIDATES ON METRIC INDEX VP-TREE

Masami Shishibori, Samuel Sangkon Lee and Kenji Kita

Institute of Technology and Science, The University of Tokushima, Tokushima, Japan

Dept. of Computer Science and Engineering, Jeonju University, Jeonju, Republic of Korea

Keywords:

Multimedia retrieval systems, Indexing technique, Vantage point tree, Triangle inequality.

Abstract:

In multimedia databases, an efficient indexing method is an important technique for fast access. Metric indexing methods can be applied to various distance measures other than the Euclidean distance, and therefore offer higher flexibility than multi-dimensional indexing methods. We focus on the Vantage Point tree (VP-tree), which is one of the metric indexing methods. The VP-tree is an efficient metric space indexing method; however, the number of distance calculations at leaf nodes tends to increase. In this paper, we propose an efficient algorithm to reduce the number of distance calculations at leaf nodes of the VP-tree. The conventional VP-tree uses the triangle inequality at the leaf node in order to reduce the number of distance calculations, with the vantage point of the VP-tree serving as the reference point of the triangle inequality. The proposed algorithm uses the nearest neighbor (NN) point of the query instead of the vantage point as the reference point. With this method, the selection range given by the triangle inequality becomes small, and the number of distance calculations at leaf nodes can be cut down. However, it is impossible to specify the NN point in advance. Therefore, this method regards the nearest point to the query in the result buffer as the temporary NN point. If a nearer point is found during the retrieval process, the temporary NN point is replaced with the new one. Evaluation experiments using 10,000 images showed that the proposed method reduced the search time of the conventional VP-tree by 5% to 12%.

1 INTRODUCTION

In recent years, as a result of the lower price and larger

capacity of main memory and secondary storage de-

vices, it has become possible for personal comput-

ers to store large amounts of multimedia data such

as text, images, music, and videos. Consequently,

there is a growing need for technologies that permit

fast and accurate retrieval of the desired data from a

large amount of stored multimedia data. To improve

the search efﬁciency, it is necessary to ﬁrst extract the

desired features of the target data and to create an in-

dex based on these features. During the search pro-

cess, the appropriate data can be obtained by access-

ing only the index. The indexing technology has a

large inﬂuence on the retrieval efﬁciency.

The features extracted from multimedia data are in

general expressed as vectors, and the retrieval process

is executed based on the distances among feature vec-

tors. The methods used to transform feature vectors

into indices, i.e., the methods for indexing multidi-

mensional data, include the R-tree (Guttman, 1984),

R*-tree (Beckmann et al., 1990), SS-tree (White and

Jain, 1996), SR-tree (Katayama and Satoh, 1997), X-

tree (Berchtold et al., 1996), VA-FILE (Weber et al.,

1998), and others. However, these methods take for

granted the use of the Euclidean distance as a dis-

tance measure and they cannot cope with other mea-

sures. Examples of non-Euclidean distance measures

include the quadratic-form distance (Ioka, 1989) that

considers correlations among the dimensions of mul-

tidimensional data, the Edit distance that expresses

the similarity among strings of characters, and the

earth mover’s distance (Rubner et al., 1999).

In order to solve this issue, research on metric-space indexing has been conducted. The construction of multidimensional indices is based on the Cartesian values of the features in multidimensional space.

In contrast, the only requirement of the metric-space

index is the metric space postulates (Zezula et al.,

2006), and therefore it is possible to create such an

index using only information on the distances among

feature vectors. As a consequence, distance measures

other than the Euclidean distance can be used. The


Shishibori M., Sangkon Lee S. and Kita K..

AN IMPROVED METHOD TO SELECT CANDIDATES ON METRIC INDEX VP-TREE.

DOI: 10.5220/0003668803060311

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2011), pages 306-311

ISBN: 978-989-8425-79-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

metric-space index is in general represented as a hierarchical tree structure. The search space is narrowed during the search process by a recursive splitting of the space (data set) based on the distance information. Several schemes such as the M-tree (Ciaccia et al., 1995), VP-tree (Yianilos, 1993; Fu et al., 2000), MVP-tree (Bozkaya and Ozsoyoglu, 1997),

and MI-tree (Ishikawa et al., 1999) have been pro-

posed based on different space-splitting methods. In

the M-tree, the index tree is formed by a bottom-

up process during the space splitting. The disadvan-

tage of this method is that it introduces many com-

mon regions in the spaces resulting from the split,

which implies a lower search efﬁciency. The VP-

tree uses a pivot point called the vantage point and

splits the space in a top-down manner based on a hy-

persphere. The splitting does not introduce common

regions. During the search process, the nodes in the

search range are traversed starting from the root node.

Eventually, leaf objects linked to leaf nodes are ac-

cessed and their distances are calculated. These dis-

tances are used to decide whether or not the associ-

ated objects are located in the search range. However,

the calculation of the distances of the leaf nodes tra-

versed by the search process tends to increase the total

number of distance calculations, resulting in a lower

search speed. To solve this problem, in the VP-tree

a triangle inequality is applied to the leaf objects to

reduce the number of distance calculations.

In this paper, we propose an algorithm to reduce

the number of distance calculations at leaf nodes of

the VP-tree. In the conventional VP-tree, the van-

tage point is used as a reference point for the triangle

inequality. In the proposed method we note that the

selection range tends to get narrower as the distance

between the reference point of the triangle inequality

and the query object gets smaller, and thus we reduce

the number of distance calculations using the nearest

neighbor (NN) object (with respect to the query ob-

ject) as the reference point for the triangle inequality.

How can the NN object be specified as the reference point? It is impossible to identify the NN object in advance. Thus, the nearest object to the query in the result buffer is regarded as the temporary NN object. If a nearer object is found during the retrieval process, the temporary NN object is replaced with the new one.

Furthermore, in order to use the triangle inequality

with the temporary NN object as the reference point,

we must know all the distances between the tempo-

rary NN object and all the objects related to the leaf

nodes of these objects. Since the temporary NN ob-

ject cannot be determined in advance, in practice we

need all the distances among the objects. During the

construction of the indexing we construct a distance-

list ﬁle by computing the distances among objects. It

is worth noting, however, that the ﬁle size is reduced

by splitting this large ﬁle into a ﬁle for each object.

After explaining the VP-tree construction and

search algorithm in Section 2, we will describe the

selection method for leaf nodes. Section 3 introduces

an improvement to the leaf-node selection algorithm.

Section 4 describes experiments with and an evalua-

tion of the improvement method. Finally, Section 5

provides a summary and points to future issues.

2 VP-TREE

2.1 Construction Algorithm

We now explain the VP-tree construction algorithm.

Suppose that we want to perform indexing of a data

set S containing N data points. At each node of the

tree, the vantage point (hereafter referred to as vp) is

selected using a random algorithm as described be-

low.

1. Select a temporary random vp from the data set.

2. Calculate the distance from the provisional vp to

the other N − 1 objects.

3. Calculate the mean and variance of these dis-

tances.

4. Repeat steps 1 through 3 several times and select

as the vp the point that leads to the largest vari-

ance.
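The four selection steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `select_vantage_point`, the `trials` parameter (how many times steps 1 to 3 are repeated), the seeded random generator, and the use of the population variance are all assumptions introduced here.

```python
import math
import random
import statistics


def select_vantage_point(points, dist, trials=5, rng=None):
    """Return the sampled candidate whose distances to the other
    points have the largest variance (hypothetical helper)."""
    rng = rng or random.Random(0)  # assumption: fixed seed for repeatability
    best_vp, best_var = None, -1.0
    for _ in range(trials):
        cand = rng.choice(points)                                  # step 1
        dists = [dist(cand, p) for p in points if p is not cand]   # step 2
        # Step 3: pvariance computes the mean internally.
        var = statistics.pvariance(dists) if dists else 0.0
        if var > best_var:                                         # step 4
            best_vp, best_var = cand, var
    return best_vp


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
    print(select_vantage_point(pts, math.dist, trials=50))
```

A large variance means that the distances from the candidate are well spread, so a split at the mean distance separates the data more clearly.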

Let µ be the mean of the distances from the root node (selected as the vp) to all data points in S. If d(p, q) is the distance between points p and q, then the data set S is split into S1 and S2 as shown below:

S1 = {s ∈ S | d(s, vp) < µ}

S2 = {s ∈ S | d(s, vp) ≥ µ}

Continue by recursively applying this splitting oper-

ation to S1 and S2. All the subsets such as S1 and

S2 correspond to nodes in the VP-tree. Moreover, the

leaf nodes contain a number of objects.
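The recursive split into S1 and S2 can be sketched as below. This is a simplified illustration under stated assumptions: the `Node` class, the `leaf_size` parameter, and the helper `leaf_objects` are hypothetical names, and the vantage point is simply taken as the first element instead of using the variance-based selection described earlier.

```python
import math
import statistics
from dataclasses import dataclass, field


@dataclass
class Node:
    vp: tuple = None        # vantage point of this node
    mu: float = 0.0         # mean distance from vp to the node's data
    inner: "Node" = None    # subtree for S1: d(s, vp) < mu
    outer: "Node" = None    # subtree for S2: d(s, vp) >= mu
    bucket: list = field(default_factory=list)  # objects held by a leaf


def build(points, dist, leaf_size=2):
    """Recursively split `points` at the mean distance from the vp."""
    if len(points) <= leaf_size:
        return Node(bucket=list(points))
    vp = points[0]  # assumption: placeholder for the variance-based choice
    mu = statistics.fmean(dist(s, vp) for s in points)
    s1 = [s for s in points if dist(s, vp) < mu]
    s2 = [s for s in points if dist(s, vp) >= mu]
    if not s1 or not s2:  # degenerate split (e.g. duplicates): stop here
        return Node(vp=vp, bucket=list(points))
    return Node(vp=vp, mu=mu,
                inner=build(s1, dist, leaf_size),
                outer=build(s2, dist, leaf_size))


def leaf_objects(node):
    """Collect all objects stored in the leaves below `node`."""
    if node.inner is None and node.outer is None:
        return list(node.bucket)
    return leaf_objects(node.inner) + leaf_objects(node.outer)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
    tree = build(pts, math.dist)
    print(tree.mu, leaf_objects(tree))
```

Because S1 and S2 are disjoint by construction, every object ends up in exactly one leaf.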

2.2 Search Algorithm

We use the range search and k-nearest neighbor search

algorithms in the VP-tree. Range search involves

specifying the query object and the search radius (the

range) and retrieving the set of objects located at a

distance up to the specified radius from the query object. The k-nearest neighbor search involves designating the query object and the number of search hits k, so that the k objects with the smallest distances


are obtained in order, starting from the closest one.

The experiments of the present paper are based on k-

nearest neighbor search. However, since the k-nearest

neighbor search algorithm is based on range search,

both methods are explained here.

In the range search method, the nodes within the

search range are traversed starting from the root node.

The distances between the leaf objects linked to leaf

nodes and the query object are calculated, and the ob-

jects located inside the search range are found. On the

other hand, in the k-nearest neighbor search method,

the initial value of the search radius is set to inﬁn-

ity, and the objects that are traversed are appended

to the search-result list, starting from the root. If the

number of retrievals in the search-result list exceeds

a speciﬁed limit, the retrieved object with the maxi-

mum distance is removed, so that the number of items

in the list does not exceed the speciﬁed value. Here,

the maximum distance in the search-result list is taken

to be the search radius. The search radius is reduced

by repeating this process, resulting eventually in the

speciﬁed number of search results.
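The shrinking search radius of the k-nearest neighbor search can be illustrated with a bounded result list. The sketch below is one possible reading of the description above, not the authors' code: the class name `ResultList` and its methods are hypothetical, and the radius is assumed to stay infinite until k results have been collected.

```python
import heapq
import math


class ResultList:
    """Keeps the k closest objects seen so far (hypothetical helper)."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # max-heap emulated with negated distances

    @property
    def radius(self):
        # Assumption: the radius is infinite until the list holds k results;
        # afterwards it is the largest distance in the list.
        if len(self._heap) < self.k:
            return math.inf
        return -self._heap[0][0]

    def offer(self, obj, d):
        """Try to add `obj` at distance `d`; return True if it was kept."""
        if d >= self.radius:
            return False
        # Objects should be comparable (e.g. IDs) so that ties can be ordered.
        heapq.heappush(self._heap, (-d, obj))
        if len(self._heap) > self.k:
            heapq.heappop(self._heap)  # remove the object with maximum distance
        return True

    def items(self):
        """Results as (distance, object) pairs, closest first."""
        return sorted((-nd, obj) for nd, obj in self._heap)


if __name__ == "__main__":
    results = ResultList(k=2)
    for name, d in [("a", 5.0), ("b", 3.0), ("c", 4.0), ("d", 9.0)]:
        results.offer(name, d)
    print(results.items(), results.radius)
```

Each accepted object can only lower the radius, which is why later leaf nodes tend to be pruned more aggressively than earlier ones.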

2.3 Method to Select Candidates on Leaf-nodes

In the conventional VP-tree, result candidates in the

search range are selected by applying the triangle in-

equality to the objects corresponding to leaf nodes tra-

versed during the search (Yianilos, 1993). We now

describe this process. For each leaf node, the dis-

tances between the node’s vp object and each leaf ob-

ject are stored as a distance list when the VP-tree is

constructed (at the registration phase). By applying

the triangle inequality to the distances between the vp

object and each leaf object, we can reduce the num-

ber of distance calculations. The following theorem

holds, where q is a query object, r is the search range,

v is a vp object of a leaf node, and o is an object linked

to a leaf node. Moreover, v is called the reference point of the triangle inequality.

Theorem 1.

If |d(v, o) − d(v, q)| > r, then leaf object o is not

in the search range.

Proof.

From the triangle inequality d(v, q) + d(q, o) ≥ d(v, o), we have d(q, o) ≥ d(v, o) − d(v, q). Hence d(v, o) − d(v, q) > r implies that d(q, o) > r, and therefore object o is not in the search range.

Similarly, from the triangle inequality d(v, o) + d(o, q) ≥ d(v, q), d(v, q) − d(v, o) > r implies that d(q, o) > r.

Therefore, Theorem 1 is proved.

Figure 1: Selection of candidates using a vantage point as

the reference point of the triangle inequality.

Input : q, r, L
Output : L
SearchLeaf (q, r, L)
{
    foreach o (all objects in the leaf node) {
        if ( |d(v, o) − d(v, q)| ≤ r ) {
            if ( d(o, q) ≤ r ) {
                add o to L, set r to the max distance;
            }
        }
    }
}
q : query object
r : radius of search range
o : object in the leaf node
v : vp object
L : search result

Figure 2: Search algorithm on the leaf node for k-nearest

neighbor search.

As for Theorem 1, r is given by the user and d(v, o)

can be obtained from the distance list which is con-

structed at the registration phase. Moreover, d(v, q)

can be computed only once for each leaf node dur-

ing the search process. Therefore, the objects outside

the search range can be speciﬁed without calculating

d(q, o) by using the triangle inequality of Theorem

1, and the number of distance calculations can be re-

duced. The result candidate selection for leaf nodes

is shown in Fig.1. The k-nearest neighbor search al-

gorithm is illustrated in Fig.2. The non-shaded part

of Fig.1 corresponds to the portion where the inequal-

ity of Theorem 1 holds. The distance calculation can

be skipped for objects located in this area. The shaded

area corresponds to the portion where Theorem 1 does

not hold. The distance calculation is necessary for ob-

jects located in this area.
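The candidate selection of Theorem 1 can be sketched for a single leaf node as follows. For brevity the sketch performs a range search with a fixed radius r rather than the k-nearest neighbor variant of Fig.2; the function name `search_leaf_vp` and the counter of distance calculations are assumptions added for illustration.

```python
import math


def search_leaf_vp(q, r, leaf_objects, d_v_o, v, dist):
    """Range search in one leaf. d_v_o[i] is the stored distance
    d(v, leaf_objects[i]) taken from the distance list."""
    d_v_q = dist(v, q)  # computed only once per leaf node
    n_calcs = 1         # counts actual calls to dist
    hits = []
    for o, dvo in zip(leaf_objects, d_v_o):
        if abs(dvo - d_v_q) > r:
            continue    # Theorem 1: o cannot lie within r of q
        d = dist(o, q)
        n_calcs += 1
        if d <= r:
            hits.append((d, o))
    return hits, n_calcs


if __name__ == "__main__":
    v = (0.0, 0.0)
    leaf = [(1.0, 0.0), (5.0, 0.0), (0.0, 5.0), (9.0, 0.0)]
    d_v_o = [math.dist(v, o) for o in leaf]  # built at registration time
    print(search_leaf_vp((4.0, 0.0), 1.5, leaf, d_v_o, v, math.dist))
```

In this small example two of the four leaf objects are rejected by the inequality alone, while (0, 5) passes the filter and is discarded only after its distance to q has been computed.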

KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval


3 SELECTION METHOD USING NEAREST NEIGHBOR OBJECTS

If vp is used as the reference point in the triangle in-

equality explained in the previous section, the effect

of the selection tends to increase as the portion where

Theorem 1 does not hold (shaded area of Fig.1) gets

smaller. The external radius of this portion corre-

sponds to the distance from vp to the query object q

added to the search range radius r. Since r is ﬁxed

for a given search query, the selection becomes more

effective as the distance between vp and q decreases.

On the other hand, the object closest to q is the near-

est neighbor object. By using the object nearest to

q instead of vp as the reference point in the triangle

inequality, it is possible to reduce the area where The-

orem 1 does not hold. Therefore, we propose a candi-

date selection method that uses the triangle inequality

with the nearest neighbor object as a reference point.

The following theorem holds, where q is the query object, r is the search range radius, o_nn is the neighbor object in the search list closest to the query object, and o is an object linked to a leaf node.

Theorem 2.

If d(o_nn, o) − d(o_nn, q) > r, then leaf object o is not in the search range.

Proof.

From the triangle inequality d(o_nn, q) + d(q, o) ≥ d(o_nn, o), d(o_nn, o) − d(o_nn, q) > r implies that d(q, o) > r, and therefore object o is not in the search range.

Therefore, Theorem 2 is proved.

Thus, if d(o_nn, o) and d(o_nn, q) are known, it is possible to know whether or not an object is in the search range without calculating the distance to each leaf object. This situation is illustrated in Fig.3. In other words, for any leaf object located outside the shaded part of Fig.3, the computation of its distance to the query object can be skipped. The actual leaf-node search algorithm is illustrated in Fig.4. It is usually not possible to identify the nearest neighbor object in advance. Thus, the nearest neighbor object o_nn is provisionally assumed to be the object in the search result list L that has the shortest distance from the query object q, as shown in Fig.4. The provisional nearest neighbor object o_nn is renewed every time a new object is found in the search range during the search process. Theorem 2 also applies to the distance between the provisional nearest neighbor object and the query object.
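A leaf-node search that uses the provisional nearest neighbor object as the reference point might look like the sketch below. It is an illustration under several assumptions: the pairwise distances are held in an in-memory table `pair_dist` instead of per-object files, the result list is a plain sorted Python list, and the names `search_leaf_nn` and `n_calcs` are hypothetical.

```python
import math


def search_leaf_nn(q, k, leaf_ids, objects, pair_dist, results, dist):
    """k-NN search in one leaf. `results` holds (distance to q, object id)
    pairs sorted by distance and is updated in place."""
    n_calcs = 0
    for oid in leaf_ids:
        # Current search radius: infinite until k results are known.
        r = results[-1][0] if len(results) >= k else math.inf
        if results:
            d_nn_q, nn = results[0]  # provisional nearest neighbor object
            if pair_dist[nn][oid] - d_nn_q > r:
                continue             # Theorem 2: oid is outside the range
        d = dist(objects[oid], q)
        n_calcs += 1
        if d < r:
            results.append((d, oid))
            results.sort()
            del results[k:]          # keep only the k closest objects
    return n_calcs


if __name__ == "__main__":
    objs = [0.0, 1.0, 2.0, 8.0, 9.0]            # toy one-dimensional data
    d = lambda a, b: abs(a - b)
    table = [[d(a, b) for b in objs] for a in objs]
    found = []
    calcs = search_leaf_nn(0.4, 2, range(len(objs)), objs, table, found, d)
    print(found, calcs)
```

In this toy run only two distances to the query are computed; the remaining three objects are rejected from the precomputed table alone because they are far from the provisional nearest neighbor.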

Moreover, the value of d(o_nn, o) in Theorem 2 is known if a list exists containing distances between

Figure 3: Selection of candidates using a nearest neighbor

point as the reference point of the triangle inequality.

Input : q, r, L
Output : L
SearchLeaf (q, r, L)
{
    foreach o (all objects in the leaf node) {
        if ( d(o_nn, q) + r ≥ d(o_nn, o) ) {
            if ( d(o, q) ≤ r ) {
                add o to L, set r to the max distance;
            }
        }
    }
}
q : query object
r : radius of the search range
o : object in the leaf node
o_nn : nearest neighbor object in the search result
L : search result

Figure 4: Search algorithm on the leaf node by the proposed

method.

the nearest neighbor object and the objects in the leaf nodes. However, since it is not possible to know in advance which object will become the nearest neighbor of q, in practice all the objects must be considered as candidates for the nearest neighbor object o_nn. Therefore, for the indexing it is necessary to compute the distances from each leaf object to all the other objects and so create a distance-list file. Since it is difficult to hold such a large file in memory, we adopt the file structure of the distance list shown in Fig.5. In the proposed method, as shown in Fig.5, the distance-list file is split into a file for each object and each file is managed separately in the secondary memory. Thus, only the distance list related to the provisional nearest neighbor object is read into the memory. For example, suppose the provisional nearest neighbor object is O2. In this case, only the distance list related to O2 in the second line of Fig.5 is read. Moreover, the distance list is read only when the provisional nearest neighbor object is renewed. By forming the distance-list file according


Secondary Memory

Distance List of O1: d(O1, O1), d(O1, O2), …, d(O1, On)
Distance List of O2: d(O2, O1), d(O2, O2), …, d(O2, On)
…
Distance List of On: d(On, O1), d(On, O2), …, d(On, On)

Figure 5: Content of the distance list file.

to this scheme, we can reduce the size of the ﬁle and

the number of read accesses.

In the implementation, an ID was assigned to each

object, a distance list was created by considering each

object ID as a ﬁle name, and a distance-list ﬁle con-

sisting of a group of ﬁles was constructed. How-

ever, to avoid OS-related restrictions on the maximum

number of ﬁles in a directory, the lowest digits of the

IDs were divided into triples. The three lowest dig-

its formed the ﬁle name and the next three formed the

directory. In other words, we restricted the maximum

number of ﬁles per directory to 1,000.
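The mapping from an object ID to the location of its distance list can be sketched as follows. This is a guess at one workable layout, not the authors' code: the function name `distance_list_path`, the zero padding to three digits, and the extra directory levels for IDs of one million or more are assumptions.

```python
from pathlib import Path


def distance_list_path(root, object_id):
    """Map an object ID to <root>/.../<next three digits>/<lowest three digits>."""
    if object_id < 0:
        raise ValueError("object_id must be non-negative")
    file_part = f"{object_id % 1000:03d}"   # lowest three digits: file name
    rest = object_id // 1000
    dirs = [f"{rest % 1000:03d}"]           # next three digits: directory
    rest //= 1000
    while rest:                             # assumption: deeper levels for larger IDs
        dirs.append(f"{rest % 1000:03d}")
        rest //= 1000
    return Path(root).joinpath(*reversed(dirs), file_part)


if __name__ == "__main__":
    print(distance_list_path("dist", 1234).as_posix())
    print(distance_list_path("dist", 7).as_posix())
```

With this layout no directory holds more than 1,000 files, which matches the restriction described above.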

4 EVALUATION

4.1 Experimental Method

The proposed method was implemented in a VP-tree

and an experiment with a similar-image search task

was carried out. The computer used ran the Linux OS and was equipped with a 3.2 GHz Pentium D CPU and 2 GB of memory. A total of 10,000 photographic images (Corel,

2011) were used as registered images and features

were extracted from HSI histograms of the image fea-

ture vectors. The k-nearest neighbor search was per-

formed on 1,000 input images that had not been used

for the indexing in order to calculate the time, num-

ber of distance calculations, and average CPU time

required for one image search. The quadratic-form

distance was used as the distance measure.

A similar algorithm using nearest neighbor objects is AESA (Approximating and Eliminating Search Algorithm) (Vidal, 1986). The AESA uses a

distance-list ﬁle that is previously created and con-

tains the computed distances among objects, without

constructing an index tree such as the VP-tree. Since

the proposed method is based on an algorithm similar

to AESA, we use AESA as a benchmark to evaluate

the VP-tree candidate reduction improvement using

nearest neighbor objects.

[Plot: number of distance calculations (calc_num, vertical axis, ticks 200 to 2200) versus k (horizontal axis, 10 to 100) for vp_all, vp_nn, vp_all_nn, and AESA.]

Figure 6: Number of distance calculations on 96 dimensional data.

[Plot: CPU time in seconds (cpu-time[sec], vertical axis, ticks 0.01 to 0.04) versus k (horizontal axis, 10 to 100) for vp_all, vp_nn, and vp_all_nn.]

Figure 7: CPU-time on 96 dimensional data.

4.2 Experimental Results

Using the improved method, we conducted an exper-

iment on k-nearest neighbor search. The legends in

the graph correspond to the methods below.

• vp_all: reduction method using multiple vp.

• vp_nn: reduction method using nearest neighbor objects.

• vp_all_nn: reduction method using a combination of vp_all and vp_nn.

• AESA: reduction method based on AESA.

Experimental results for the number of distance

calculations for 96 dimensions are shown in Fig.6.

Fig.7 shows experimental results for the cpu time

for 96 dimensions. Fig.8 shows experimental results

for the cpu time using AESA for dimensions rang-

ing from 12 to 96. The horizontal axis k represents

the number of retrievals while the vertical axis “calc_num” shows the number of distance calculations.

The figures show that the number of distance calculations decreases in the order vp_all, vp_nn, vp_all_nn. Note that there is no significant difference between vp_all and vp_nn, but a 10% improvement is obtained in the case of vp_all_nn. A possible explanation is that candidate reduction based on nearest neighbor objects takes place in a different range than that for the usual vp, and the range is reduced more effectively if the methods are used in combination.

In terms of execution time, the improvement for

100 retrievals and 48 dimensions is about 5%. For


[Plot: CPU time in seconds (cpu-time[sec], vertical axis, ticks 0.02 to 0.14) versus k (horizontal axis, 10 to 100) for dim12, dim24, dim48, and dim96.]

Figure 8: CPU-time of AESA on each dimensional data.

100 retrievals and 96 dimensions, the improvement

increases to about 12%. Thus, an effective gain is ob-

tained with the present method even when the num-

ber of dimensions increases. The maximum number

of splits per leaf node was set to 10. The size of the

ﬁle that stores the distance list that is needed in the

candidate reduction method based on nearest neigh-

bor objects was 313 MB for all dimensions.

Fig.8 indicates that although AESA outperforms

the VP-tree in terms of the number of distance cal-

culations, the retrieval time is slower. A possible

reason for this is the difference in the number of

read accesses to the distance-list ﬁle. For AESA, the

distance-list ﬁle must be read at every iteration of the

process. In other words, this ﬁle is read as many

times as the number of distance calculations, and this

is thought to have a large inﬂuence on the retrieval

time. For the VP-tree, the distance-list ﬁle needs to be

read only for the reduction of leaf objects, and there-

fore the number of read accesses can be reduced to a

minimal level. Thus, the VP-tree resulted in a more significant improvement in the retrieval efficiency than did AESA.

5 CONCLUSIONS

We have proposed an improvement to the search al-

gorithm for the leaf nodes of a VP-tree. The results

show that the retrieval times were reduced by 5% to

12% for the task involving retrieval of similar images.

A topic for future work is the creation of a search al-

gorithm that permits further reductions in the distance

calculations with a smaller index size.

ACKNOWLEDGEMENTS

This work was supported in part by Grants-in-Aid for Scientific Research (#21500940, #21300036 and #20650143) from the Ministry of Education, Science and Culture, Japan.

REFERENCES

Beckmann, N., Kriegel, H. P., Schneider, R., and Seeger, B. (1990). The R*-tree: An efficient and robust access method for points and rectangles. In Proc. of the ACM SIGMOD ’90, pages 322–331.

Berchtold, S., Keim, D. A., and Kriegel, H. P. (1996). The X-tree: An index structure for high-dimensional data. In Proc. of the 22nd VLDB, pages 28–39.

Bozkaya, T. and Ozsoyoglu, M. (1997). Distance-based in-

dexing for high-dimensional metric spaces. In Proc.

of the ACM SIGMOD, pages 357–368.

Ciaccia, P., Patella, M., and Zezula, P. (1995). M-tree: An

efﬁcient access method for similarity search in metric

spaces. In Proc. of the ACM SIGMOD Int. Conf. on

the Management of Data, pages 71–79.

Corel (2011). Corel image gallery. http://www.corel.co.jp/.

Fu, A. W., Chan, P. M., Cheung, Y. L., and Moon, Y. S. (2000). Dynamic VP-tree indexing for n-nearest neighbor search given pair-wise distances. VLDB Journal, pages 2–8.

Guttman, A. (1984). A dynamic index structure for spatial

searching. In Proc. of the ACM SIGMOD ’84, pages

47–57.

Ioka, M. (1989). A method of deﬁning the similarity of

images on the basis of color information. Technical

Report RT-0030.

Ishikawa, M., Notoya, J., Chen, H., and Ohbo, N. (1999). A metric index MI-tree. Transactions of Information Processing Society of Japan, 40(SIG6(TOD3)):104–114.

Katayama, N. and Satoh, S. (1997). SR-tree: An index structure for nearest neighbor searching of high-dimensional point data. IEICE Transactions on Information and Systems, J80-D-I(8):703–717.

Rubner, Y., Tomasi, C., and Guibas, L. J. (1999). The earth mover’s distance, multi-dimensional scaling, and color-based image retrieval. In Proc. of the ARPA Image Understanding Workshop, pages 661–668.

Vidal, R. (1986). An algorithm for ﬁnding nearest neigh-

bours in approximately constant average time. Pattern

Recognition Letters, pages 145–157.

Weber, R., Schek, H. J., and Blott, S. (1998). A quantitative

analysis and performance study for similarity-search

methods in high-dimensional spaces. In Proc. of the

24th VLDB, pages 194–205.

White, D. A. and Jain, R. (1996). Similarity indexing with the SS-tree. In Proc. of the 12th Int. Conf. on Data Engineering, pages 516–523.

Yianilos, P. N. (1993). Data structures and algorithms for

nearest neighbor search in general metric spaces. In

Proc. of the ACM-SIAM SODA’93, pages 311–321.

Zezula, P., Amato, G., Dohnal, V., and Batko, M. (2006). Similarity Search: The Metric Space Approach. Springer.
