Fast Nearest Neighbor Search with Narrow 16-bit Sketch

Naoya Higuchi, Yasunobu Imamura, Tetsuji Kuboyama, Kouichi Hirata and Takeshi Shinohara

Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan

Gakushuin University, Mejiro 1-5-1, Toshima, Tokyo 171-8588, Japan

Keywords: Similarity Search, Sketch, Ball Partitioning, Hamming Distance, Dimension Reduction, Distance Lower Bound.

Abstract:

We discuss nearest neighbor search using sketches, which are a kind of locality sensitive hash (LSH). Nearest neighbor search using sketches is done in two stages. In the first stage, the top K candidates, whose sketches are close to that of the query, are selected, where K ≥ 1. In the second stage, the nearest object to the query among the K candidates is selected by performing actual distance calculations. Conventionally, accurate search has required sketches of 32 bits or wider. In this paper, we propose search methods using narrow 16-bit sketches, which enable efficient data management by buckets and a faster first stage. To keep accuracy, search using 16-bit sketches requires a larger K than search using 32-bit sketches. By sorting the data objects according to their sketch values, the cost of the increased number of candidates K can be reduced through improved memory locality in the second stage. The proposed method achieves about 10 times faster search while maintaining accuracy.

1 INTRODUCTION

To implement efficient similarity search in multidimensional spaces, sketches (Müller and Shinohara, 2009; Mic et al., 2016; Dong et al., 2008; Mic et al., 2015; Wang et al., 2007) have been developed. A sketch is a compact bit sequence representing multidimensional data, which can be considered a kind of locality sensitive hash (LSH). Ball partitioning (BP, for short) is a method of making sketches that assigns a bit to each data object: 0 if the object is inside a ball and 1 otherwise. BP is also used in the vantage point tree (Yianilos, 1993).

The similarity search using sketches consists of

two stages. The ﬁrst stage selects candidates depend-

ing on their Hamming distances between sketches.

The second stage selects the nearest neighbor by com-

paring the candidates with the query using distances

in the original space.

In the search using the sketch, the distance be-

tween the sketches cannot completely reﬂect the dis-

tance between the objects. Therefore, unlike retrieval

using a hierarchical spatial index R-tree (Guttman,

1984) or M-tree (Ciaccia et al., 1997), nearest neigh-

bor solutions cannot be accurately obtained by sketch-

based retrieval. In order to guarantee a certain level

of precision at a speed comparable to that of the hi-

erarchical spatial indexing method, the width of the

sketch has been considered to be 32 bits or 64 bits.

In this paper, we propose a method using a narrower 16-bit sketch. We assume that the size of the database is several million. The number of 32-bit patterns is $2^{32}$, about 4 billion. Unless the database is larger than that, there are too many empty buckets when 32-bit sketches are used as keys. Thus, the bucket method is not suitable for the 32-bit case. When sketches of 32 bits or wider are used, the first stage search keeps the sketches of all data as they are, calculates their distances from the sketch of a query, and selects solution candidates by a full scan.

On the other hand, the number of 16-bit patterns is only $2^{16}$, so we can efficiently manage data with the bucket method. Then, in the first stage search, it is enough to match against $2^{16}$ = 65,536 sketches, which can be done at high speed at a constant cost independent of the database size. In the first stage search, only a few of the 65,536 sketches are close to the sketch of a query. Therefore, if we use an algorithm that enumerates sketches in order of closeness without matching sketches against each other, the cost of the first stage becomes practically negligible.

Higuchi, N., Imamura, Y., Kuboyama, T., Hirata, K. and Shinohara, T. Fast Nearest Neighbor Search with Narrow 16-bit Sketch. DOI: 10.5220/0007377705400547. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 540-547. ISBN: 978-989-758-351-3.

Here, we briefly explain how the first stage search is sped up by enumerating sketches. Before search, we prepare all 16-bit patterns sorted in ascending order of the number of ON bits. Sketches can be enumerated in order of Hamming distance from the sketch of a query by calculating the bitwise exclusive OR (XOR) between the sketch of the query and these bit patterns. The second stage search is executed using only the initial part of this sequence. With this method, the first stage search no longer needs to calculate Hamming distances between sketches, and almost no search cost is required.

We now illustrate the enumeration method for speeding up the first stage with an example. Let the sketch width be 3 bits, and let the distance between sketches be the Hamming distance. The Hamming distance between two sketches is then one of 0, 1, 2, and 3. If the sketch of a query is s = 011, the sketches at the respective distances are as follows.

The sketch of distance 0
One sketch matches s, namely s itself, that is, 011.

The sketches of distance 1
Three sketches differ from s by 1 bit. They are obtained by XOR of s with the 3-bit strings that have one ON bit (001, 010, 100): s ⊕ 001 = 010, s ⊕ 010 = 001, s ⊕ 100 = 111.

The sketches of distance 2
Three sketches differ from s by 2 bits. They are obtained by XOR of s with the 3-bit strings that have two ON bits (011, 101, 110): s ⊕ 011 = 000, s ⊕ 101 = 110, s ⊕ 110 = 101.

The sketch of distance 3
One sketch differs from s by 3 bits, namely 100. It is obtained by s ⊕ 111.

In this way, 3-bit sketches can be enumerated in ascending order of Hamming distance to a query sketch by taking the XOR of the query sketch with the bit patterns 000, 001, 010, 100, 011, 101, 110 and 111, arranged in ascending order of the number of ON bits. Since the bit pattern sequence used for XOR does not depend on the query, it can be prepared in advance. Objects in the database can be managed by the bucket method with the sketches themselves as indexes, so the first stage search does not need to calculate the distance between the sketch of each object and the sketch of the query. As a result, the search cost of the first stage can in effect be ignored. Also, by sorting the objects in the database in sketch order, memory locality in the second stage search can be improved.
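The enumeration in the 3-bit example can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `masks_by_popcount` and `enumerate_by_hamming` are our own and do not appear in the paper, and ties between patterns with the same number of ON bits are broken by numeric value, which is an assumption.

```python
def masks_by_popcount(width):
    """All width-bit patterns in ascending order of the number of ON bits.

    Ties are broken by numeric value (an assumption made for this sketch).
    The table does not depend on the query, so it can be built once.
    """
    return sorted(range(1 << width), key=lambda m: (bin(m).count("1"), m))


def enumerate_by_hamming(query_sketch, width):
    """Yield (sketch, hamming_distance) in ascending order of distance."""
    for mask in masks_by_popcount(width):
        # XOR with a mask that has d ON bits gives a sketch at distance d.
        yield query_sketch ^ mask, bin(mask).count("1")
```

For the query sketch 011 and width 3, iterating over `enumerate_by_hamming(0b011, 3)` visits 011, 010, 001, 111, 000, 110, 101, 100, which matches the example above.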

2 PRELIMINARIES

Here, we brieﬂy introduce some necessary concepts

for our discussion.

2.1 Nearest Neighbor Search using Sketches

We assume that data in the given database are indexed by natural numbers 0 to n − 1. Thus, let $db = \{x_0, \ldots, x_{n-1}\} \subseteq U$ be the given database of n data, where U is a data space. The dissimilarity between two data $x_i$ and $x_j$ is defined as a distance $D(x_i, x_j)$. The nearest neighbor search for a query q ∈ U is to find x ∈ db such that D(q, x) ≤ D(q, y) for all y ∈ db. We can realize the nearest neighbor search using sketches as follows, where s is a function that maps a data object to its sketch, and K ≥ 1.

1. Preparation stage:
Calculate all the sketches $s(x_0), \ldots, s(x_{n-1})$.

2. First stage (filtering using the Hamming distances of sketches):
Select K candidates $x_{i_0}, \ldots, x_{i_{K-1}}$ which have the closest sketches $s(x_{i_0}), \ldots, s(x_{i_{K-1}})$ to the sketch s(q) of a query q.

3. Second stage (nearest neighbor search using actual distances):
Select the nearest neighbor data from the candidates $x_{i_0}, \ldots, x_{i_{K-1}}$.
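The three stages can be illustrated with a naive Python sketch. It assumes Euclidean distance for D and takes the sketch function as a parameter; the names `hamming`, `two_stage_search` and `sketch_of` are hypothetical and only serve this illustration. The first stage here scans every sketch, which is the conventional full search rather than the bucket-based method proposed in this paper.

```python
import heapq
import math


def hamming(a, b):
    """Hamming distance between two sketches stored as integers."""
    return bin(a ^ b).count("1")


def two_stage_search(query, database, sketch_of, k):
    """Return the index of the approximate nearest neighbor of query.

    database: list of coordinate tuples; sketch_of: vector -> int sketch.
    """
    # Preparation stage (normally done once, offline).
    sketches = [sketch_of(x) for x in database]
    # First stage: the K indices whose sketches are closest to s(q).
    sq = sketch_of(query)
    candidates = heapq.nsmallest(
        k, range(len(database)), key=lambda i: hamming(sq, sketches[i]))
    # Second stage: actual distances, computed for the candidates only.
    return min(candidates, key=lambda i: math.dist(query, database[i]))
```

A larger `k` makes the second stage examine more candidates, which is the accuracy versus speed trade-off discussed in this section.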

Sketches are relatively small structures compared with their original feature data. For example, we use 32-bit or 16-bit sketches for image feature data of 64 bytes in our experiments. In the first stage of the search, we use Hamming distances, which can be calculated with bit operations more easily than the actual distances between features. However, sketches cannot preserve all distance relations. Therefore, we use them as a filter. The accuracy of search is the probability that a correct nearest neighbor is obtained. A larger number K of candidates in the first stage gives a more accurate but slower search. Thus, one of the most important subjects for sketches is to achieve higher accuracy with smaller K, or equivalently, to speed up search within acceptable error.

2.2 Sketches based on Ball Partitioning

In this paper, we use sketches based on ball partitioning (BP). A pair (p, r) of a point and a radius is called a pivot. A ball partitioning $BP_{(p,r)}$ is defined as follows:


Figure 1: 2-bit sketches by two balls.

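$$BP_{(p,r)}(x) = \begin{cases} 0 & \text{if } D(p, x) \le r \\ 1 & \text{otherwise} \end{cases}$$

A BP-based sketch function $s_P$ of width w is defined by a set of w pivots $P = \{(p_0, r_0), \ldots, (p_{w-1}, r_{w-1})\}$ as follows:

$$s_P(x) = BP_{(p_{w-1}, r_{w-1})}(x) \cdots BP_{(p_0, r_0)}(x)$$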

Consider 4 points A, B, C, D on a Euclidean plane in Figure 1. Using a set of two pivots $P = \{(p_0, r_0), (p_1, r_1)\}$, their sketches are $s_P(A) = 01$, $s_P(B) = 00$, $s_P(C) = 10$, $s_P(D) = 11$. Let q be any query outside of both balls. Since $s_P(q) = 11$, the Hamming distances between the sketches of q and A, B, C, D are 1, 2, 1, 0, respectively. The conventional priority order in the first stage is D < A = C < B. Note that A and C cannot be distinguished by their Hamming distances from q.
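A ball-partitioning sketch function can be written directly from the definition. The following Python sketch assumes Euclidean distance for D and stores the sketch as an integer whose bit i is the value for pivot i; the function names and the coordinates used in the usage note are made up for illustration and are not taken from Figure 1.

```python
import math


def ball_partition(pivot, radius, x):
    """Return 0 if x lies inside the ball (pivot, radius), 1 otherwise."""
    return 0 if math.dist(pivot, x) <= radius else 1


def bp_sketch(pivots, x):
    """Sketch of x for pivots [(p_0, r_0), ..., (p_{w-1}, r_{w-1})].

    Bit i of the result is the ball-partitioning bit for pivot i, so the
    bit of the last pivot is the most significant one.
    """
    sketch = 0
    for i, (p, r) in enumerate(pivots):
        sketch |= ball_partition(p, r, x) << i
    return sketch
```

With the hypothetical pivots `[((0.0, 0.0), 1.0), ((1.5, 0.0), 1.0)]`, the point `(0.75, 0.0)` lies in both balls and gets sketch 00, while `(0.75, 3.0)` lies outside both and gets sketch 11.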

2.3 Distance Lower Bounds between Queries and Sketches

In this paper, we use the priority based on distance lower bounds (Higuchi et al., 2018) in the first stage, which gives more accurate search than the Hamming distance. This technique is based on the fact that sketches by ball partitioning can be regarded as quantized images of the dimension reduction mapping Simple-Map (Shinohara and Ishizaka, 2002). Let $P = \{(p_0, r_0), \ldots, (p_{w-1}, r_{w-1})\}$ be the set of pivots. We can get the lower bound $e_i(q, s_P(x))$ of D(q, x) as follows.

$$e_i(q, s_P(x)) = \begin{cases} 0 & \text{if } BP_{(p_i, r_i)}(q) = BP_{(p_i, r_i)}(x) \\ |D(p_i, q) - r_i| & \text{otherwise} \end{cases}$$

We propose priorities using the distance lower bounds $e_i(q, s_P(x))$ as the criteria to select candidates in the first stage. When we use as the priority

$$score_\infty(q, s_P(x)) = \max_{i=0}^{w-1} e_i(q, s_P(x))$$

which is the maximum of the lower bounds, we can safely prune some of the candidates because it is itself a distance lower bound. We can also use their sum as the priority

$$score_1(q, s_P(x)) = \sum_{i=0}^{w-1} e_i(q, s_P(x))$$

which is no longer a distance lower bound, but yields higher accuracy than $score_\infty$.
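The maximum-based and sum-based priorities can be computed from the per-bit lower bounds as in the following Python sketch. It assumes Euclidean distance for D and integer sketches whose bit i corresponds to pivot i; the names `lower_bounds`, `score_inf` and `score_1` are our own labels for this illustration.

```python
import math


def lower_bounds(pivots, q):
    """Per-bit lower bounds |D(p_i, q) - r_i| for pivots [(p_i, r_i), ...]."""
    return [abs(math.dist(p, q) - r) for p, r in pivots]


def score_inf(e, query_sketch, data_sketch):
    """Maximum of the bounds over the bit positions where the sketches differ."""
    diff = query_sketch ^ data_sketch
    return max((e[i] for i in range(len(e)) if (diff >> i) & 1), default=0.0)


def score_1(e, query_sketch, data_sketch):
    """Sum of the bounds over the bit positions where the sketches differ."""
    diff = query_sketch ^ data_sketch
    return sum(e[i] for i in range(len(e)) if (diff >> i) & 1)
```

Both scores depend on the data object only through its sketch, so they need to be evaluated once per distinct sketch rather than once per data object.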

2.4 Optimization of Sketches

In order to improve the accuracy of the search using sketches, we select the pivot set P so that the collision probability becomes small. A collision is said to occur when different data x and y have the same sketch, that is, $s_P(x) = s_P(y)$. In this paper, sketches are optimized using QBP (Higuchi et al., 2018). If optimized 16-bit sketches are used for databases of several million data, the number of data mapped to each sketch is expected to be roughly even.

3 FAST SEARCH USING 16-bit SKETCHES

Since the total number of 16-bit sketches is $2^{16}$ = 65,536, it is enough to match against all 16-bit sketches rather than against the sketches of millions of individual data. The cost is independent of the database size. In actual searching, only the small part of the sketches that are close to the sketch of the query is needed, because the number of data sharing each sketch is expected to be roughly even. So, by using an algorithm that enumerates the sketches in order of closeness to the sketch of the query, the search can be sped up because the cost of the first stage becomes almost negligible.

3.1 Speeding Up Search using Hamming Distances of 16-bit Sketches

Search using 16-bit sketches with the Hamming distance can be sped up relatively easily, as explained in the Introduction.


A nearest neighbor search method using 16-bit sketches is shown in Algorithm 1. It is assumed here that the database is managed using buckets with sketches as keys, as follows.

x[0], x[1], ..., x[n − 1]: Array of feature data, sorted in sketch order in memory. (It should not be an indirect sort via pointers.)

id[i]: Data ID of feature data x[i].

f[s]: First position in the array x of the data whose sketch is s.

n[s]: Number of data whose sketch is s.

In this way, the data whose sketch is s are stored contiguously as follows.

x[f[s]], x[f[s] + 1], ..., x[f[s] + n[s] − 1]

As preparation for search by the Hamming distance, we prepare all 16-bit patterns arranged in ascending order of the number of ON bits.

m[0], m[1], ..., m[$2^{16}$ − 1]
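The bucket layout and the search of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses Euclidean distance for D, Python lists instead of a contiguous memory array, and the names `build_buckets` and `nn_search_by_hamming` are hypothetical. The sketch of the query and the popcount-ordered pattern list `masks` are passed in as arguments.

```python
import math


def build_buckets(data, sketches, width):
    """Sort data by sketch and return the arrays (x, ids, f, n)."""
    order = sorted(range(len(data)), key=lambda i: sketches[i])
    x = [data[i] for i in order]        # feature data in sketch order
    ids = order                         # ids[i]: original ID of x[i]
    n = [0] * (1 << width)              # n[s]: bucket size of sketch s
    for s in sketches:
        n[s] += 1
    f = [0] * (1 << width)              # f[s]: bucket start of sketch s
    for s in range(1, 1 << width):
        f[s] = f[s - 1] + n[s - 1]      # prefix sums of the bucket sizes
    return x, ids, f, n


def nn_search_by_hamming(query, query_sketch, x, ids, f, n, masks, k):
    """Examine at most k data in bucket order and return the best ID."""
    best_id, best_dist, checked = None, math.inf, 0
    for mask in masks:                  # masks: ascending number of ON bits
        s = query_sketch ^ mask
        take = min(n[s], k - checked)
        for i in range(f[s], f[s] + take):
            d = math.dist(query, x[i])
            if d <= best_dist:
                best_id, best_dist = ids[i], d
        checked += take
        if checked >= k:
            break
    return best_id
```

Because the data of one bucket occupy consecutive positions of `x`, the inner loop reads memory sequentially, which is the locality effect that sorting by sketch is meant to provide.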

3.2 Speeding Up Search using Distance Lower Bounds of 16-bit Sketches

In the case of using $score_\infty$, the distance lower bounds are different for each query. So, unlike the case using Hamming distances, sketches cannot be enumerated by XOR with a prepared bit pattern sequence. However, as explained below, we can enumerate sketches in ascending order of $score_\infty$ with respect to the sketch of a query. First, we explain the algorithm using a concrete example for 3-bit sketches.

Let i be the position of $2^i$ when bit strings are considered as binary numbers (i = 0, 1, 2).

$(p_i, r_i)$: Pivot of the ball partitioning corresponding to position i.

$P = \{(p_0, r_0), (p_1, r_1), (p_2, r_2)\}$: Set of pivots.

q: Query.

$e_i = |D(q, p_i) - r_i|$: Distance lower bound between data and query when their sketch bits are different at the position i.

Note that, for arbitrary data x, the following implication holds:

$$BP_{(p_i, r_i)}(q) \ne BP_{(p_i, r_i)}(x) \rightarrow D(q, x) \ge e_i$$

For simplicity, it is assumed that these distance lower bounds satisfy the following.

$$e_2 \ge e_1 \ge e_0$$

In ascending order, there are only 4 possible values of $score_\infty$ between sketches: 0, $e_0$, $e_1$, or $e_2$. Assuming that the sketch of the query is $s_P(q) = 011$, the sketches with each $score_\infty$ are as follows.

The sketch whose $score_\infty$ is 0
011 itself.

The sketch whose $score_\infty$ is $e_0$
It differs from 011 only in the bit of position 0. That is, the sketch is 010. Note that $s_P(q)$ ⊕ 001 = 010.

The sketches whose $score_\infty$ is $e_1$
The bit of position 2 is the same as in 011, the bit of position 1 is different, and the bit of position 0 is arbitrary. That is, 001 and 000. These are the XORs of $s_P(q)$ with 010 and 011, respectively.

The sketches whose $score_\infty$ is $e_2$
The bit of position 2 is different from that of 011, and the rest are arbitrary. That is, 111, 110, 101, 100. These are the XORs of $s_P(q)$ with 100, 101, 110, 111, respectively.

In this way, sketches are enumerated in order of $score_\infty$ with respect to $s_P(q)$ by taking the XOR of $s_P(q)$ with the following bit pattern sequence in ascending order of binary numbers.

000, 001, 010, 011, 100, 101, 110, 111

Notice that $score_\infty$ is determined by the leftmost position of the ON bit of the pattern. We can rearrange the patterns in order of Gray code.

000, 001, 011, 010, 110, 111, 101, 100

This is because the order of the binary value and the order of the Gray code are both in ascending order of the leftmost position of the ON bit. The sequence of Gray codes can be generated by one-bit inverting operations as shown in Table 1.

Table 1: Gray code generation and sketch enumeration.

| Gray code | Sketch enumeration | $score_\infty$ |
|---|---|---|
| 000 | 011 | 0 |
| 000 ⊕ 001 = 001 | 011 ⊕ 001 = 010 | $e_0$ |
| 001 ⊕ 010 = 011 | 010 ⊕ 010 = 000 | $e_1$ |
| 011 ⊕ 001 = 010 | 000 ⊕ 001 = 001 | $e_1$ |
| 010 ⊕ 100 = 110 | 001 ⊕ 100 = 101 | $e_2$ |
| 110 ⊕ 001 = 111 | 101 ⊕ 001 = 100 | $e_2$ |
| 111 ⊕ 010 = 101 | 100 ⊕ 010 = 110 | $e_2$ |
| 101 ⊕ 001 = 100 | 110 ⊕ 001 = 111 | $e_2$ |

Algorithm 1: NNSEARCHBYHAMMING.

/* x[0], x[1], ..., x[n − 1]: Array of feature data sorted by sketches */
/* id[i]: Data ID of feature data x[i] */
/* f[s]: First position in the array x of the data whose sketch is s */
/* n[s]: Number of data whose sketch is s */
/* K: Number of candidates obtained in the first stage = number of actual distance calculations */
/* m[0], m[1], ..., m[2^w − 1]: All w-bit patterns in ascending order of the number of ON bits */
/* w: the width of sketches */
function NNSEARCHBYHAMMING(query)
    (NN, nearest, checked) ← (“none”, ∞, 0);
    for i = 0 to 2^w − 1 do
        s ← sketch(query) ⊕ m[i];
        (NN, nearest, checked) ← SEARCH(query, s, NN, nearest, checked);
        if checked ≥ K then
            return NN;
        end
    end
    return NN;
end
function SEARCH(query, s, NN, nearest, checked)
    for i = f[s] to f[s] + min(n[s], K − checked) − 1 do
        if D(query, x[i]) ≤ nearest then
            (NN, nearest) ← (id[i], D(query, x[i]));
        end
    end
    return (NN, nearest, checked + min(n[s], K − checked));
end

The sequence of bit positions to be inverted is 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, and so on. This property of the Gray code makes efficient enumeration possible. Once the ranking of the distance lower bounds of the bit positions for the query is decided, the bit inverting operation is applied to the bit position with the corresponding rank. The rank of the bit position to be inverted at step i can be obtained as bitcount(i ⊕ (i + 1)) − 1. Starting from the sketch $s_P(q)$ of the query, instead of the bit pattern of all 0s, it is possible to enumerate sketches in ascending order of the score from the query, as shown in Table 1. Using this property of the Gray code, the algorithm for sketch enumeration becomes very simple, since the next sketch in the sequence is obtained by just one bit inverting operation.

Here, we explain why the above method can correctly enumerate sketches. Let g(i) be the Gray code corresponding to i (i = 0, 1, ...). That is, g(0) = 000, g(1) = 001, ..., g(7) = 100. Then, $s_P(q)$ ⊕ g(i) is the i-th sketch in the enumeration to be generated. By the nature of ⊕, the following holds.

$$(s_P(q) \oplus g(i)) \oplus (s_P(q) \oplus g(i+1)) = g(i) \oplus g(i+1)$$

Note that g(i) ⊕ g(i + 1) is a bit pattern with just one ON bit. Thus, we have the following, where ≪ is the left logical shift operation.

$$g(i) \oplus g(i+1) = (1 \ll (\mathrm{bitcount}(i \oplus (i+1)) - 1))$$

Therefore, we can enumerate the desired sequence of sketches by starting from the sketch $s_P(q)$ of a query and applying the same bit inverting operations as for the sequence of Gray codes.
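The identity between consecutive Gray codes can be checked mechanically. The following Python sketch assumes the standard binary-reflected Gray code, g(i) = i ⊕ (i ≫ 1), which agrees with the 3-bit sequence listed in this section; the function names are our own.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def flip_position(i):
    """Bit position in which gray(i) and gray(i + 1) differ."""
    return bin(i ^ (i + 1)).count("1") - 1


def check_identity(width):
    """True if gray(i) ^ gray(i + 1) == 1 << flip_position(i) for all i."""
    return all(gray(i) ^ gray(i + 1) == 1 << flip_position(i)
               for i in range((1 << width) - 1))
```

For example, `[flip_position(i) for i in range(7)]` gives the flip sequence 0, 1, 0, 2, 0, 1, 0 for 3-bit codes, and `check_identity(16)` covers every step needed for 16-bit sketches.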

Since the distance lower bounds do not necessarily satisfy $e_{w-1} \ge \cdots \ge e_0$, we use the ranking of these lower bounds. In Algorithm 2, we assume that bidx[0], ..., bidx[w − 1] is the rearrangement of 0, 1, ..., w − 1 satisfying the following.

$$e_{bidx[w-1]} \ge \cdots \ge e_{bidx[1]} \ge e_{bidx[0]}$$

The function SEARCH is given in Algorithm 1.

As a prioritization, we know that $score_1$ is more accurate than $score_\infty$. The speedup method for $score_\infty$ uses the fact that $score_\infty$ takes only w + 1 = 17 values. On the other hand, $score_1$ may take $2^w = 2^{16}$ values. Therefore, we cannot achieve the same speedup by enumeration. In the experiments in this paper, in search by 16-bit sketches using $score_1$, $score_1$ is naively calculated for all $2^w = 2^{16}$ sketches.
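The naive treatment of the sum-based priority can be sketched as follows: compute the score of every possible sketch and sort. This is only an illustration of the idea, not the authors' code; the function name `rank_sketches_by_score_1` is hypothetical, and `e[i]` stands for the lower bound of bit position i for the current query.

```python
def rank_sketches_by_score_1(query_sketch, e):
    """Return all 2^w sketches sorted by the sum-based score for the query.

    The cost is proportional to 2^w log 2^w per query, independent of the
    database size, but it cannot be reduced to single bit flips.
    """
    w = len(e)

    def score(s):
        diff = s ^ query_sketch
        return sum(e[i] for i in range(w) if (diff >> i) & 1)

    return sorted(range(1 << w), key=score)
```

The sorted list can then be scanned bucket by bucket in the same way as in Algorithm 1.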

4 EXPERIMENTS

In this section, we report experiments using the following image and music data:

• Images: about 70 million 2D frequency spectra of 64 dimensions extracted from 2,900 videos.

• Music: about 70 million mel-frequency spectra of 96 dimensions extracted from 1,400 music CDs.


Algorithm 2: SEARCHBYSCOREINF.

/* w: the width of sketches */
/* K: Number of candidates obtained in the first stage = number of actual distance calculations */
function SEARCHBYSCOREINF(query)
    Prepare the distance lower bound rank order table bidx[0], ..., bidx[w − 1] for query;
    (NN, nearest, checked) ← (“none”, ∞, 0);
    s ← sketch(query);
    (NN, nearest, checked) ← SEARCH(query, s, NN, nearest, checked);
    if checked ≥ K then
        return NN;
    end
    for i = 0 to 2^w − 2 do
        s ← s ⊕ (1 ≪ bidx[bitcount(i ⊕ (i + 1)) − 1]);
        (NN, nearest, checked) ← SEARCH(query, s, NN, nearest, checked);
        if checked ≥ K then
            return NN;
        end
    end
    return NN;
end
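The enumeration loop of Algorithm 2 can be sketched in Python as a generator. This is an illustration under our own naming (`enumerate_by_score_inf`), with `e[i]` denoting the lower bound |D(q, p_i) − r_i| of bit position i; the loop performs 2^w − 1 flips so that the rank index always stays below w.

```python
def enumerate_by_score_inf(query_sketch, e):
    """Yield all 2^w sketches in ascending order of the maximum-based score."""
    w = len(e)
    # bidx[0] is the bit position with the smallest lower bound.
    bidx = sorted(range(w), key=lambda i: e[i])
    s = query_sketch
    yield s
    for i in range((1 << w) - 1):
        rank = bin(i ^ (i + 1)).count("1") - 1   # Gray code flip rank
        s ^= 1 << bidx[rank]                     # one bit flip per step
        yield s
```

With `e = [0.5, 2.0, 3.0]` and query sketch 011, the generator yields 011, 010, 000, 001, 101, 100, 110, 111, which is the sketch enumeration column of Table 1. In a full search, each yielded sketch would be passed to SEARCH until K data have been checked.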

In ICPRAM 2018 (Higuchi et al., 2018) we also used the ICPRAM color dataset. However, it contains only about 100,000 data and is not suitable for this research, so it is not used here. We adopted 32 bits and 16 bits as the widths of sketches. The sets of pivots are selected by QBP (Higuchi et al., 2018) so as to have smaller collision probability. Table 2 shows the state of the buckets for 16-bit sketches, where “average” is the average number of elements per bucket, “empty” is the number of empty buckets, and “≥ 10” is the ratio of buckets with 10 or more elements. Since many buckets have 10 or more elements, as observed in Table 2, sorting the data in sketch order is expected to speed up the second stage. In comparative experiments with and without sorting, a speedup of about 3 times was confirmed.

Table 2: Buckets for 16-bit Sketches.

| Data | Image | Music |
|---|---|---|
| average | 105 | 108 |
| empty | 908 (1.5%) | 2104 (3.1%) |
| ≥ 10 | 87% | 74% |

Randomly generated data are not appropriate for experiments on nearest neighbor search, because in higher dimensional spaces it is rare to find near data. Therefore, from two randomly selected data x and y, we prepare queries by mixing y into x as noise at levels 5%, 10%, ..., 50%. For example, a query at noise level 5% is a weighted sum of x and y with weights 95% and 5%, respectively. For each noise level, we prepare 1,000 queries. The average nearest neighbor distances for the queries are shown in Figure 2.
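The query generation can be sketched in Python as follows. The function names, the fixed random seed and the use of plain lists for feature vectors are assumptions made for this illustration; the paper does not specify how the two data are sampled beyond their being randomly selected.

```python
import random


def make_query(x, y, noise):
    """Weighted sum of x and y with weights (1 - noise) and noise."""
    return [(1.0 - noise) * a + noise * b for a, b in zip(x, y)]


def make_queries(database, noise, count, seed=0):
    """Generate count queries at the given noise level (0.05 for 5%)."""
    rng = random.Random(seed)
    queries = []
    for _ in range(count):
        x, y = rng.sample(database, 2)   # two distinct random data objects
        queries.append(make_query(x, y, noise))
    return queries
```

Calling `make_queries(database, 0.05, 1000)` would produce the 1,000 queries for noise level 5%, and repeating this for 0.10, ..., 0.50 covers the remaining levels.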

















          









Figure 2: The average of nearest neighbor distances.

The PC used for the experiments had an Intel(R) Xeon(R) E5-2640 CPU (2.5 GHz) and 64 GBytes of memory.

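The retrieval accuracies for image data and music data are shown in Table 3 and Table 4, respectively, where “score” is the search priority, “width” is the number of bits of the sketch, “K” is the ratio of the candidates in the first stage to the database size, “sketches” is the average number of sketches enumerated, given as a percentage (only for 16-bit; 100% for the naive method without enumeration), and “time” is the search time in milliseconds per query (“1st st.” is the search time for the first stage). When the enumeration method is used, the cost of the first stage cannot be separated and is therefore omitted. For K, which determines the number of actual distance calculations in the second stage search, we select 0.1% for 32-bit and 1.0% for 16-bit so that the search accuracy is about the same (about 70% for images and about 65% for music). Conventional search using 32-bit sketches with K = 0.1% achieves faster retrieval than R-tree based search, whose search time is on the order of hundreds of milliseconds.

Table 3: Precisions for image search.

| score | Hamming | Hamming | Hamming | $score_\infty$ | $score_\infty$ | $score_\infty$ | $score_1$ | $score_1$ |
|---|---|---|---|---|---|---|---|---|
| width | 32 | 16 | 16 | 32 | 16 | 16 | 32 | 16 |
| K | 0.1% | 1.0% | 1.0% | 0.1% | 1.0% | 1.0% | 0.1% | 1.0% |
| sketches | − | 100% | 0.76% | − | 100% | 0.73% | − | 100% |
| time (ms), 1st st. | 28.7 | 4.36 | − | 35.6 | 3.23 | − | 32.0 | 4.90 |
| time (ms), total | 29.8 | 7.16 | 2.85 | 36.9 | 6.06 | 2.68 | 33.2 | 7.76 |
| query noise: All | 70.2 | 73.4 | 73.0 | 74.3 | 79.7 | 79.7 | 80.2 | 85.1 |
| query noise: 5−10% | 99.8 | 99.7 | 99.6 | 100 | 100 | 100 | 100 | 100 |
| query noise: 15−20% | 96.9 | 94.4 | 94.8 | 99.1 | 99.3 | 99.3 | 100 | 100 |
| query noise: 25−30% | 80.4 | 80.3 | 79.3 | 85.7 | 88.8 | 88.8 | 92.7 | 94.4 |
| query noise: 35−40% | 46.0 | 53.9 | 53.3 | 52.4 | 63.5 | 63.5 | 63.7 | 73.2 |
| query noise: 45−50% | 28.0 | 38.5 | 37.9 | 34.4 | 47.2 | 47.2 | 44.7 | 58.0 |

Table 4: Precisions for music search.

| score | Hamming | Hamming | Hamming | $score_\infty$ | $score_\infty$ | $score_\infty$ | $score_1$ | $score_1$ |
|---|---|---|---|---|---|---|---|---|
| width | 32 | 16 | 16 | 32 | 16 | 16 | 32 | 16 |
| K | 0.1% | 1.0% | 1.0% | 0.1% | 1.0% | 1.0% | 0.1% | 1.0% |
| sketches | − | 100% | 0.55% | − | 100% | 0.48% | − | 100% |
| time (ms), 1st st. | 30.3 | 4.33 | − | 36.3 | 3.23 | − | 34.4 | 4.63 |
| time (ms), total | 31.7 | 7.89 | 3.58 | 38.0 | 6.82 | 3.38 | 36.0 | 8.30 |
| query noise: All | 65.5 | 66.8 | 66.0 | 63.6 | 75.1 | 73.4 | 72.3 | 77.8 |
| query noise: 5−10% | 99.7 | 98.7 | 98.2 | 99.5 | 99.8 | 99.7 | 100 | 100 |
| query noise: 15−20% | 93.3 | 89.1 | 89.3 | 92.2 | 95.3 | 94.8 | 97.2 | 97.2 |
| query noise: 25−30% | 70.6 | 68.2 | 67.6 | 67.4 | 80.3 | 78.5 | 79.8 | 83.0 |
| query noise: 35−40% | 40.2 | 44.2 | 43.7 | 36.9 | 55.6 | 53.1 | 51.2 | 61.5 |
| query noise: 45−50% | 23.9 | 33.7 | 31.4 | 21.9 | 44.5 | 41.2 | 33.2 | 47.4 |

Table 5: Precisions over 90% for image search.

| score | Hamming | Hamming | Hamming | $score_\infty$ | $score_\infty$ | $score_\infty$ | $score_1$ | $score_1$ |
|---|---|---|---|---|---|---|---|---|
| width | 32 | 16 | 16 | 32 | 16 | 16 | 32 | 16 |
| K | 2.0% | 6.5% | 6.5% | 1.5% | 5.0% | 5.0% | 1.0% | 2.5% |
| sketches | − | 100% | 5.8% | − | 100% | 4.1% | − | 100% |
| time (ms) | 139 | 22.0 | 17.5 | 106 | 16.8 | 12.8 | 107 | 12.0 |
| All | 91.5 | 90.2 | 90.2 | 90.1 | 92.0 | 90.6 | 93.8 | 91.4 |

Table 6: Precisions over 90% for music search.

| score | Hamming | Hamming | Hamming | $score_\infty$ | $score_\infty$ | $score_\infty$ | $score_1$ | $score_1$ |
|---|---|---|---|---|---|---|---|---|
| width | 32 | 16 | 16 | 32 | 16 | 16 | 32 | 16 |
| K | 3.5% | 10% | 10% | 3.5% | 8.0% | 8.0% | 1.5% | 4.0% |
| sketches | − | 100% | 7.4% | − | 100% | 5.5% | − | 100% |
| time (ms) | 227 | 37.4 | 32.7 | 188 | 30.1 | 25.1 | 160 | 18.7 |
| All | 90.8 | 90.4 | 90.1 | 90.3 | 92.2 | 90.6 | 92.0 | 90.9 |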

From these tables, it can be seen that as the noise level increases, that is, as the nearest neighbor gets farther away, the search precision becomes lower. This may be considered an effect of the “curse of dimensionality” in high dimensional spaces. When the Hamming distance is used for prioritization, for each database, 16-bit sketches (K = 1.0%) achieve about the same search precision as 32-bit sketches (K = 0.1%). Using $score_\infty$ or $score_1$ for prioritization improves search accuracy. The effectiveness of the speedup by enumerating 16-bit sketches can also be confirmed. There are slight differences in accuracy depending on whether enumeration is used or not; this is because the two methods select different candidates when several candidates have the same priority. In addition, it can be seen that the number of 16-bit sketches enumerated during search is very small. With the speedup by enumeration, retrieval is about 10 times faster than with the conventional 32-bit sketches. Since $score_1$ cannot be sped up by the enumeration method, it achieves only about 4 times faster search, but gives the highest accuracy.

In the conventional method, it was not possible to obtain higher precision by increasing K while still searching faster than other methods such as the R-tree. However, we can expect that the proposed method keeps search fast even if higher accuracy is required. The results for larger K, with precision exceeding 90%, are shown in Table 5 and Table 6. Search using 16-bit sketches needs larger K than search using 32-bit ones, but is about 8 times faster. Also, with $score_1$, high precision can be achieved without increasing K too much, and it is the fastest even though it does not use the enumeration method.

5 CONCLUDING REMARKS

By changing from 32-bit sketches to narrower 16-bit sketches, about 10 times faster search is achieved through an efficient first stage search and data management by the bucket method. When the Hamming distance or $score_\infty$ is used for prioritization with 16-bit sketches, the first stage search can be done in a very short time by enumerating sketches in order of priority. As future work, we should consider a similar enumeration algorithm for $score_1$.

With 16-bit sketches, maintaining the same degree of precision as in the conventional 32-bit case requires the number K of first stage candidates to be approximately three times as large. By sorting data with the sketch as a key, second stage retrieval becomes about three times faster. Therefore, it can be expected that the superiority of the proposed method is preserved for data of higher dimension than those used in the experiments.

We need to further investigate the relationship be-

tween database size n and optimal sketch width w. In

this paper we assumed n to be millions, but for larger

databases it may be better to make w greater than 16.

In the experiments in this study, we used the

heuristic method QBP (Higuchi et al., 2018) which

minimizes the collision probability as the evaluation

index for sketch optimization. By using AIR, a kind

of simulated annealing method, a pivot set of sketches

with smaller collision probability than QBP can be

obtained, but search accuracy is not improved (Ima-

mura et al., 2017). However, since the data man-

agement by the bucket method is performed in our

proposed method, as a merit of using a sketch with

smaller collision probability, there is a possibility of

improving the speed by localizing the memory access.

In any case, it seems necessary to further investigate

sketch optimization.

ACKNOWLEDGMENTS

This work was partially supported by JSPS KAKENHI Grant Numbers 16H02870, 17H00762, 16H01743, 17H01788, and 18K11443.

REFERENCES

Ciaccia, P., Patella, M., and Zezula, P. (1997). M-tree: An efficient access method for similarity search in metric spaces. In Proc. VLDB’97, pages 426–435.

Dong, W., Charikar, M., and Li, K. (2008). Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In Proc. ACM SIGIR’08, pages 123–130.

Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In Yormark, B., editor, Proc. SIGMOD’84, pages 47–57. ACM Press.

Higuchi, N., Imamura, Y., Kuboyama, T., Hirata, K., and Shinohara, T. (2018). Nearest neighbor search using sketches as quantized images of dimension reduction. In Proc. ICPRAM 2018, pages 356–363.

Imamura, Y., Higuchi, N., Kuboyama, T., Hirata, K., and Shinohara, T. (2017). Pivot selection for dimension reduction using annealing by increasing resampling. In Proc. LWDA 2017, pages 15–23.

Mic, V., Novak, D., and Zezula, P. (2015). Improving sketches for similarity search. In Proc. MEMICS’15, pages 45–57.

Mic, V., Novak, D., and Zezula, P. (2016). Speeding up similarity search by sketches. In Proc. SISAP 2016, pages 250–258.

Müller, A. and Shinohara, T. (2009). Efficient similarity search by reducing I/O with compressed sketches. In Proc. SISAP’09, pages 30–38.

Shinohara, T. and Ishizaka, H. (2002). On dimension reduction mappings for approximate retrieval of multi-dimensional data. In Progress of Discovery Science, LNCS 2281, pages 89–94.

Wang, Z., Dong, W., Josephson, W., Lv, Q., Charikar, M., and Li, K. (2007). Sizing sketches: A rank-based analysis for similarity search. In Proc. ACM SIGMETRICS’07, pages 157–168.

Yianilos, P. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. SODA 1993, pages 311–321. ACM Press.
