Indexing High-Dimensional Vector Streams

João Pedro V. Pinheiro (1,a), Lucas Ribeiro Borges (1), Bruno Francisco Martins da Silva (1),
Luiz André P. Paes Leme (2,b) and Marco Antonio Casanova (1,c)

(1) Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro RJ, Brazil
(2) Universidade Federal Fluminense, Niterói RJ, Brazil
(a) https://orcid.org/0000-0002-0909-4432
(b) https://orcid.org/0000-0001-6014-7256
(c) https://orcid.org/0000-0003-0765-9636
Keywords:
High-Dimensional Vector Streams, Approximate Nearest Neighbor Search, Product Quantization, Hierarchical
Navigable Small World Graphs, Classified Ad, Trading Platform.
Abstract: This paper addresses the vector stream similarity search problem, defined as: “Given a (high-dimensional) vector q and a time interval T, find a ranked list of vectors, retrieved from a vector stream, that are similar to q and that were received in the time interval T.” The paper first introduces a family of methods, called staged vector stream similarity search methods, or briefly SVS methods, to help solve this problem. SVS methods are continuous in the sense that they do not depend on having the full set of vectors available beforehand, but adapt to the vector stream. The paper then presents experiments to assess the performance of two SVS methods, one based on product quantization, called staged IVFADC, and another based on Hierarchical Navigable Small World graphs, called staged HNSW. The experiments with staged IVFADC use well-known image datasets, while those with staged HNSW use real data. The paper concludes with a brief description of a proof-of-concept implementation of a classified ad retrieval tool that uses staged HNSW.
1 INTRODUCTION
A classified ad is a textual description of a product, together with a few images of the product, the price, the sale conditions, and the required seller data (Figure 1). An
online trading platform, or simply a trading platform,
is used in this paper in the specific sense of a Web
application where sellers can post classified ads and
buyers can search for products and close transactions.
The transaction volume a trading platform must sup-
port can be significant. Indeed, a popular online prod-
uct trading platform typically processes thousands of
classified ads per day.
The motivation for this paper lies in the challenge
of creating a classified ad retrieval tool that receives
a classified ad and returns a ranked list of similar ads.
Similarity, in this case, would be computed with re-
spect to the textual description, the set of images, price,
etc. of the classified ads. The paper considers the sce-
nario where new ads are continuously included in the
platform, creating a stream of classified ads.
The construction of a classified ad retrieval tool, in
this scenario, must face two major difficulties. First,
the tool would have to combine text and content-based
image retrieval, since product descriptions contain text
and images. Albeit there are several well-tested text
retrieval tools, retrieving images by content similar-
ity is still challenging, especially when the number
of images is high. State-of-the-art content-based im-
age retrieval strategies (Hameed et al., 2021; Li et al.,
2021) assume that images are represented by high-
dimensional vectors, created using some Deep Learn-
ing technique. Alternatively, the tool might transform
both the text and the images (or in fact any other media
as well) of an ad into a single high-dimensional vector,
as in cross-modal retrieval techniques (Costa Pereira
et al., 2014; Zeng et al., 2020). The challenge then
becomes how to efficiently search a (large set) of high-
dimensional vectors or, more precisely, how to im-
plement approximated nearest neighbor search over
high-dimensional vectors (J
´
egou et al., 2011; Johnson
et al., 2021; Yang et al., 2020).
The second difficulty is that the set of classified ads is dynamic, in the sense that sellers continuously create new ads, often at a high rate, and ads may be short-lived, either because the product was actually sold, because the seller withdrew the ad, or simply because the ad became obsolete for some reason.
Figure 1: Example of a classified ad.
In fact, one may model this scenario as a classified ad stream, where the retrieval process occurs over the stream, perhaps limited to some point in the past, as in a sliding time window. This characteristic of trading platforms therefore requires retrieval strategies that support a dynamic scenario.
From a high-level point of view, this scenario would require solving the vector stream similarity search problem, defined as: “Given a (high-dimensional) vector q and a time interval T, find a ranked list of vectors, retrieved from a vector stream, that are similar to q and that were received in the time interval T”.
The main contribution of the paper is a family of methods, called staged vector stream similarity search methods, or briefly SVS methods, to help solve this problem. An SVS method uses a main memory cache C to temporarily store the vectors as they are received from the vector stream. When C becomes full, or a timeout occurs, the current stage terminates and the vectors in C are indexed and stored in secondary storage. The net result is a sequence of indexed sets of vectors, each set covering a specific time interval. Hence, an SVS method is incremental, in the sense that it does not depend on having the full set of vectors available beforehand, but it adapts to the vector stream, and it can cope with an unlimited number of vectors.
The paper discusses experiments that assess the performance of implementations of two SVS methods: one is based on IVFADC - “Inverted File with Asymmetric Distance Computation” (Jégou et al., 2011), and is called staged IVFADC; the other is based on HNSW - “Hierarchical Navigable Small World” graphs (Malkov and Yashunin, 2020), as implemented in Redis (https://redis.io), and is called staged HNSW. IVFADC and HNSW were chosen since they are well-known approximate vector similarity search methods.
The first set of experiments adopts the database and
query descriptors from the INRIA Holidays images
(Jegou et al., 2008), and assesses the overhead of a
staged implementation against a non-staged, equiva-
lent implementation. The second set of experiments
uses a test dataset constructed from real data, and pro-
vides a more realistic comparison between a staged
and a non-staged implementation.
The paper concludes with a brief description of a proof-of-concept implementation of a classified ad retrieval tool based on Jina (https://jina.ai), a framework to build applications leveraging neural search engines, to control the retrieval process, and on the staged HNSW implementation, to index the vector stream.
The rest of this paper is organized as follows. Section 2 covers related work. Section 3 outlines SVS. Section 4 introduces staged IVFADC and the associated experiments. Section 5 describes staged HNSW and the set of experiments with real data. Section 6 briefly describes the proof-of-concept implementation of a classified ad retrieval tool. Finally, Section 7 contains the conclusions.
2 RELATED WORK
2.1 Batch Vector Similarity Search
Similarity search in large scale, high dimensional
datasets is an essential feature of several Deep Learn-
ing applications (Bengio et al., 2013). Such applica-
tions represent objects as high-dimensional vectors and
use vector similarity search to find relevant objects.
However, an exhaustive search for a set of nearest neighbors can be prohibitively expensive (Beyer et al., 1999), and traditional indexing strategies do not fare much better (Jégou et al., 2011). Several algorithms (Muja and Lowe, 2009; Gionis et al., 1999; Datar et al., 2004) tried to tackle the time complexity problem by searching for the nearest neighbor with high probability, instead of performing an exact search. However, storing the indexed vectors in main memory still posed a serious limitation for large volumes of data.
The approach proposed in (Jégou et al., 2011) circumvents these memory constraints by storing in memory a short code, obtained through product quantization, instead of the original vectors. This results in a time- and memory-efficient solution for indexing vectors and performing approximate nearest neighbor search. The basic idea is to cluster the vectors and use each cluster centroid to index all vectors that belong to that cluster. In particular, IVFADC (Jégou et al., 2011) is an access method based on product quantization that has been implemented and successfully tested over billions of vectors (https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/). An implementation of product quantization that takes advantage of GPUs was also reported in (Johnson et al., 2021).
In more detail, IVFADC uses two quantizers, called a coarse quantizer and a product quantizer, to index and query vectors, using a set of inverted lists. The coarse quantizer determines which inverted list L each vector v should be added to, and the residual is passed through the product quantizer to generate the short code that is stored in L, together with the identifier of v. IVFADC is asymmetric because a query vector q is not quantized by the product quantizer. The coarse quantization of q is used to determine which set of at most w inverted lists should be searched, and the distances between residuals and short codes are directly computed. The k nearest neighbor vectors are then returned. Note that w and k are parameters of the query, and the search is not exhaustive, since only the entries in the selected inverted lists are searched.
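For concreteness, the following is a minimal sketch of IVFADC-style indexing and search using FAISS's IndexIVFPQ, which combines a coarse quantizer with a product quantizer; all sizes and parameter values below are illustrative and are not those used in the paper's experiments.

```python
# Hedged sketch: IVFADC-style indexing and approximate k-NN search with FAISS.
# The random data stands in for a real learning set and vector collection.
import numpy as np
import faiss

d, nlist, m, nbits = 128, 256, 8, 8        # dimension, #inverted lists, PQ subspaces, bits/code

coarse = faiss.IndexFlatL2(d)              # coarse quantizer (Euclidean distance)
index = faiss.IndexIVFPQ(coarse, d, nlist, m, nbits)

learning_set = np.random.rand(20000, d).astype('float32')
index.train(learning_set)                  # build coarse and product codebooks

vectors = np.random.rand(100000, d).astype('float32')
index.add(vectors)                         # store short codes in the inverted lists

index.nprobe = 8                           # w: number of inverted lists to visit
queries = np.random.rand(5, d).astype('float32')
distances, ids = index.search(queries, 10) # k = 10 nearest neighbors per query
```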
IVFFlat (https://github.com/facebookresearch/faiss/wiki/Faiss-indexes) is a simplified version of IVFADC, which only uses the coarse quantizer and thereby has a faster index construction and requires less storage space. Furthermore, if the query vector comes from the vector dataset, IVFFlat can achieve 100% recall.
Ge et al. (2013) introduced an optimized product
quantization that minimizes quantization distortions
w.r.t. the space decomposition and the quantization
codebooks. Cai et al. (2016) described another opti-
mized product quantization scheme which ensures a
better subspace partition.
In another direction, Malkov and Yashunin (2020) proposed the Hierarchical Navigable Small World (HNSW) index for approximate k-nearest neighbor search, based on navigable small-world graphs with controllable hierarchy. HNSW incrementally builds a multi-layer structure consisting of a hierarchical set of proximity graphs (layers) for nested subsets of the stored elements. HNSW starts a search by randomly selecting an entry node from the top layer and goes through all neighbor nodes of the entry node. It repeatedly explores the neighbors of new candidate nodes and maintains a list of the k best candidates found so far, ordered by distance to the target. HNSW performs very well even on a large dataset, and can obtain a higher speedup than a quantization-based algorithm. However, HNSW spends a relatively long time building the neighbor graphs, and graph storage is another bottleneck when the dataset is too large (Fu et al., 2019).
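As a point of comparison with the product-quantization sketch above, a minimal HNSW index can be built with FAISS's IndexHNSWFlat as sketched below; the parameter values and data are again illustrative only.

```python
# Hedged sketch: building and querying an HNSW graph index with FAISS.
import numpy as np
import faiss

d = 128                                    # vector dimension
M = 32                                     # max neighbors per node in the graph
index = faiss.IndexHNSWFlat(d, M)          # HNSW over raw (non-quantized) vectors
index.hnsw.efConstruction = 200            # beam width used while building the graph
index.hnsw.efSearch = 64                   # beam width used while searching

vectors = np.random.rand(100000, d).astype('float32')
index.add(vectors)                         # incremental, layer-by-layer insertion

queries = np.random.rand(5, d).astype('float32')
distances, ids = index.search(queries, 10) # approximate 10-NN per query
```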
Several libraries offer vector indexing methods. They differ in the methods and the similarity metrics supported, as well as in whether they are open source, offer a Python interface, and are stand-alone or run on a cluster. FAISS (https://github.com/facebookresearch/faiss/wiki/) is an open-source Python library developed at Meta which offers product quantization and several similarity metrics, including IVFADC and IVFFlat. FAISS also has a multi-GPU implementation. ScaNN (https://github.com/google-research/google-research/tree/master/scann) is a similar library developed at Google (Guo et al., 2020). NGT (https://morioh.com/p/8c38367453ae) - Neighborhood Graph and Tree for Indexing High-dimensional Data - was developed at Yahoo and implements a specific indexing method, with (NGTQ) or without (NGT) quantization, and with different similarity metrics.
Yang et al. (2020) described PASE, a scheme for extending the index types of PostgreSQL to support vector similarity search. PASE is used in an industrial
environment and offers, among other options, IVFFlat
and HNSW. The authors argued that IVFFlat is better
for high-precision applications, such as face recogni-
tion, whereas HNSW performs better in general sce-
narios including recommendations and personalized
advertisements, which is the scenario of this paper.
Milvus (https://milvus.io/docs/index.md) is another example of a vector database offering similarity search. It supports, among others, IVFFlat and HNSW.
Table 1 summarizes the main features of some well-known vector indexing libraries and search engines. A detailed comparison of these methods and tools can be found at ANN-Benchmarks (http://ann-benchmarks.com/index.html).
Table 1: A comparison of vector indexing libraries and search engines.

Tool      Open Source   Multiple Similarity Metrics   Quantization
FAISS     Y             Y*                            Y
ScaNN     Y             Y*                            Y
NGT       Y             Y*                            Y
PASE      Y             Y                             Y
Milvus    Y             Y                             Y
Weaviate  Y             Y
Qdrant    Y             Y
Elastic   Y             Y
(*) No support for cosine similarity.
As mentioned in the introduction, Section 4.1 dis-
cusses a main memory product quantization implemen-
tation of SVS, and assesses the overhead of a staged
implementation against a non-staged, equivalent imple-
mentation. Section 5 describes the proof-of-concept
implementation of staged HNSW that uses Redis with
HNSW, and a set of experiments using real data. There-
fore, these two sections cover implementations of SVS
using product quantization and HNSW, two frequently
used vector index methods.
2.2 Online Vector Similarity Search
Methods for batch similarity search of vectors were
designed to cover the scenario where the complete
set of vectors is known a priori. By contrast, online vector similarity search methods were introduced to overcome this limitation.
Xu et al. (2018) addressed the problem of creat-
ing quantization methods for databases that evolve.
They described an online product quantization (online
PQ) model that incrementally updates the quantization
codebook to accommodate the incoming streaming
data. Furthermore, the online PQ model supports both
data insertions and deletions over a sliding window.
Liu et al. (2020) also proposed an online, optimized
product quantization model to dynamically update the
codebooks and the rotation matrix.
Yukawa and Amagasa (2021) proposed a method
for updating the rotation matrix using SVD-Updating,
which can update the singular matrix using low-rank
approximations. Using SVD-Updating, instead of per-
forming multiple singular value decompositions on a
high-rank matrix, the authors showed how to update
the rotation matrix by performing only one singular
value decomposition on a low-rank matrix.
SVS, the proposed family of vector stream similarity search algorithms, follows a much simpler strategy. It generates a sequence of sets of indexed vectors, stores the indexes generated at each stage in secondary memory, and uses the stored indexes to process approximate nearest neighbor search over the high-dimensional vectors.
3 THE FAMILY OF STAGED
VECTOR STREAM
SIMILARITY SEARCH
METHODS
3.1 Non-Staged Vector Stream
Similarity Search
As a baseline, one may consider any vector similarity search method adapted to vector streams. Algorithm 1 summarizes, in pseudocode, the essence of a non-staged ingestion of a stream of vectors V. CREATEINDEX initializes the index and ADJUSTINDEX hides the details of how the index is adjusted when a new vector is read from the stream.
Algorithm 1: Non-staged ingestion of a stream of vectors V with incremental indexing.

1: procedure NI
2:   CREATEINDEX(I)
3:   repeat
4:     READ(V; v)
5:     t ← CLOCK
6:     ADJUSTINDEX(v, t, I)
7:     STORE((v, t))
8:   until shutdown
9: end procedure
The exact details of an implementation of Algo-
rithm 1 naturally depend on the index method chosen.
However, independently of the method adopted, the
index will grow unbounded since there is no limit on
the number of vectors to be processed (which come
from a stream). This is one of the problems that the
staged methods try to avoid.
3.2 Staged Vector Stream Similarity
Search
The family of staged vector stream similarity search methods, or briefly SVS methods, refers to similarity search methods for vector streams with the following characteristics. An SVS method uses a main memory cache C to store the vectors as they are received from the vector stream. When C becomes full, or a timeout occurs, the current stage terminates and the vectors in C are indexed and stored in secondary storage. The net result is a sequence of indexed sets of vectors, each set covering a specific time interval. Hence, an SVS method does not depend on having the full set of vectors available beforehand, and it can cope with an unlimited number of vectors.
Members of the SVS family basically differ in the exact vector indexing scheme they use. However, there are two broad alternatives: incremental, when the index I is incrementally constructed, in main memory, as the vectors are added to the cache; deferred, when the index I is constructed, in main memory, at the end of each stage, using all vectors in the cache. In either case, I is persisted in secondary storage when the stage ends, and reinitialized for the next stage.
SVS has four major operations:
INGESTION of a stream of vectors, and indexing
and storing the vectors in secondary storage
RETRIEVAL of vectors by similarity and times-
tamp, and ranking the retrieved vectors
DELETION of vectors
MERGING of indexes
Algorithms 2 and 3 are highly simplified descriptions of the INGESTION operation in pseudocode, for the incremental and deferred alternatives, respectively. They use seven auxiliary procedures: CLOCK, READ, ADDCACHE, CREATEINDEX, ADJUSTINDEX, REINITIALIZEINDEX, and STORE.
CLOCK is a function that returns the current wall clock value.
ADDCACHE adds the newly read vector v to the cache C, along with the current timestamp (which the RETRIEVAL operation uses to help filter the desired vectors). In the incremental alternative, ADJUSTINDEX immediately indexes v, that is, it updates the index I to register v. In the deferred alternative, v is not indexed at this point.
When the cache becomes full, or a timeout occurs, in the incremental alternative, STORE just stores I,
Algorithm 2: Staged ingestion of a stream of vectors V with incremental indexing.

 1: procedure SI(timeout)
 2:   C ← ∅                          ▷ cache
 3:   t_b ← CLOCK
 4:   CREATEINDEX(I)
 5:   repeat
 6:     READ(V; v)                   ▷ read v from stream V
 7:     t ← CLOCK
 8:     ADDCACHE((v, t), C)
 9:     ADJUSTINDEX(v, I)
10:     e ← (CLOCK - t_b)            ▷ cache elapsed time
11:     if C is full or e > timeout then
12:       for each (v, t) ∈ C do
13:         STORE((v, t))
14:       end for
15:       T ← (t_b, CLOCK)           ▷ time interval
16:       STORE((I, T))
17:       C ← ∅
18:       t_b ← CLOCK
19:       REINITIALIZEINDEX(I)
20:     end if
21:   until shutdown
22: end procedure
with the time interval T it covers, and REINITIALIZEINDEX reinitializes I. In the deferred alternative, CREATEINDEX is executed to create an index I, required to index the vectors in C; STORE stores, in secondary storage, I with the time interval T it covers. Finally, in both alternatives, STORE moves to secondary storage each vector v in the cache C, with the timestamp t when v was read.
Algorithm 4 is again a highly simplified description of the RETRIEVAL operation in pseudocode. The RETRIEVAL operation receives as input a query vector q and a time interval T and performs an approximate nearest-neighbor search over the stored vectors.
For each index I whose interval intersects T, RETRIEVEVECTORS uses I to perform an approximate nearest neighbor search to retrieve from secondary storage all vectors indexed by I that are similar to q and whose timestamp falls in T, returning a list L_I of all such vectors. The operation combines the partial results in a single list L. Finally, it ranks the vectors in L by the distance to q and by timestamp.
The RETRIEVAL operation may also search the
cache, if its time interval intersects T (not represented
in Algorithm 4 for simplicity).
The DELETION operation deletes a specific vector,
given its identifier. Once removed, the vector will no
longer be retrieved in a search.
Finally, as indexes can get sparse because of dele-
tions, the MERGE operation combines time-adjacent
Algorithm 3: Staged ingestion of a stream of vectors V with deferred indexing.

 1: procedure SD(timeout)
 2:   C ← ∅                          ▷ cache
 3:   t_b ← CLOCK
 4:   repeat
 5:     READ(V; v)                   ▷ read v from stream V
 6:     t ← CLOCK
 7:     ADDCACHE((v, t), C)
 8:     e ← (CLOCK - t_b)            ▷ cache elapsed time
 9:     if C is full or e > timeout then
10:       CREATEINDEX(C; I)
11:       for each (v, t) ∈ C do
12:         ADJUSTINDEX(v, I)
13:         STORE((v, t))
14:       end for
15:       T ← (t_b, CLOCK)           ▷ time interval
16:       STORE((I, T))
17:       C ← ∅
18:       t_b ← CLOCK
19:     end if
20:   until shutdown
21: end procedure
Algorithm 4: Retrieval of a ranked list of vectors L, given a query vector q and a time interval T.

1: procedure RET(q, T)
2:   L ← ∅
3:   for each index I that covers T do
4:     RETRIEVEVECTORS(q, I; L_c)
5:     L ← L ∪ L_c
6:   end for
7:   Rank L by similarity to q and by timestamp
8:   Return the ranked list
9: end procedure
indexes, according to a configurable size heuristic.
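To make the deferred alternative concrete, the following is a minimal Python sketch of one possible realization of Algorithm 3; read_stream, build_index, and persist are hypothetical placeholders for the stream reader, the index construction routine, and the secondary-storage writer, and are not part of the paper's implementation.

```python
# Hedged sketch of staged ingestion with deferred indexing (Algorithm 3).
# read_stream, build_index, and persist are hypothetical placeholders.
import time

def staged_ingestion_deferred(read_stream, build_index, persist,
                              cache_size, timeout):
    cache = []                        # C: list of (vector, timestamp) pairs
    t_begin = time.time()             # t_b: start of the current stage
    for v in read_stream():           # READ(V; v)
        t = time.time()
        cache.append((v, t))          # ADDCACHE((v, t), C)
        elapsed = time.time() - t_begin
        if len(cache) >= cache_size or elapsed > timeout:
            # CREATEINDEX(C; I): index built only from the vectors of this stage
            index = build_index([vec for vec, _ in cache])
            for vec, ts in cache:
                persist(("vector", vec, ts))                   # STORE((v, t))
            persist(("index", index, (t_begin, time.time())))  # STORE((I, T))
            cache = []                # start a new stage
            t_begin = time.time()
```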
3.3 Summary
In summary, the main characteristics of the alternatives
for the INGESTION operation are (see Table 2):
NI: the overall cost is dominated by the cost of
adjusting the index, since the number of vectors in
the stream is not bounded.
SI, SD: at each stage, the cost of adjusting the index
is bounded, since the number of vectors is bounded
by the cache size.
SD: at each stage, the overhead is not negligible,
since an index must be created using the vectors
in the cache; however, the index is specific to the
vectors in the cache.
SI: at each stage, the overhead is minimized, since
it reduces to reinitializing the index.
Table 2: Abbreviations for the ingestion strategies.

Abbr.  Alg.  Description
NI     1     Non-staged ingestion of a stream of vectors with incremental indexing
SI     2     Staged ingestion of a stream of vectors with incremental indexing
SD     3     Staged ingestion of a stream of vectors with deferred indexing
4 STAGED PRODUCT
QUANTIZATION
4.1 Implementations of the INGESTION
Operation based on Product
Quantization
IVFADC, outlined in Section 2.1, with some adjustments, would provide an implementation of the non-staged INGESTION operation (see Algorithm 1). CREATEINDEX would construct a codebook I upfront from a learning set V_0 of vectors (Jégou et al. (2011) used, for their experiments, a learning set with 100,000 images extracted from Flickr). ADJUSTINDEX would then index each vector v in the stream against I, which in IVFADC reduces to finding the centroid v_c in the coarse quantizer nearest to v, using Euclidean distance, codifying the residual r = v - v_c with the product quantizer into a code q(r), and adding the ID of v and the code q(r) to the inverted list associated with v_c.
However, since the number of vectors in the stream
is unknown, there is no limit on the size of the inverted
lists that IVFADC uses to keep the indexed vector IDs
and codes. Therefore, the inverted lists could be re-
placed by keeping the indexed vectors in a database. In
fact, this is how PASE (Yang et al., 2020) implements
IVFFlat in PostgreSQL. The disadvantage of this alter-
native is exactly that it uses an immutable codebook,
computed from a learning set of vectors, which may
decrease the performance of the retrieval operation
over the vectors received from the vector stream.
To circumvent this problem, an online product
quantization algorithm (see Section 2.2) that adjusts
the codebook could be adopted instead of IVFADC.
The disadvantage of this alternative is that adjusting
a codebook is an expensive operation, a problem that
is exacerbated by the fact that the algorithm has to
handle an unbounded vector stream.
IVFADC would also be an alternative to implement the staged INGESTION operation.
Consider first the incremental indexing alternative. One possibility would be to assume that CREATEINDEX constructs a codebook I upfront from a training set and that I is never changed. Then, ADJUSTINDEX would find the centroid v_c in the coarse quantizer nearest to v, using Euclidean distance, codify the residual r = v - v_c with the product quantizer into a code q(r), and add the ID of v and the code q(r) to the inverted list associated with v_c. Since the codebook I is assumed to be fixed, ADJUSTINDEX would not change I. However, contrasting with the discussion of the non-staged IVFADC, the size of the inverted lists is bounded, since it depends on the size of the cache. At the end of each stage, STORE would move the inverted lists to secondary storage, and REINITIALIZEINDEX would simply reinitialize the lists, keeping the original codebook. Hence, the staged method would differ both from the original IVFADC implementation and from PASE.
This implementation would have the disadvantage that it uses a fixed codebook, which might reduce recall if the codebook is created from a training set of vectors that turns out to be unrelated to the vectors observed in the stream. A second possibility would then be to adopt an online product quantization algorithm to circumvent this problem.
Consider now the deferred alternative. At the end of each stage, CREATEINDEX would construct a different codebook I for the vectors in the cache, rather than for a training set. ADJUSTINDEX would then index each vector v in the cache as before. Finally, STORE would move the lists and the codebook to secondary storage. This implementation would use a different codebook in each stage, and would avoid the overhead of online product quantization methods. The disadvantage would be the overhead of constructing a new codebook at each stage, which might be reduced by sampling the vectors used in the clustering algorithm.
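A minimal FAISS sketch of closing one stage under the deferred alternative (IVFADC-SD) could look as follows; the file naming, sample size, and parameter values are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of one IVFADC-SD stage: train a stage-specific codebook
# from the cached vectors, index them, and persist the index to disk.
import numpy as np
import faiss

def close_stage(cache_vectors, stage_id, d=128, nlist=256, m=8, nbits=8):
    xb = np.asarray(cache_vectors, dtype='float32')
    coarse = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(coarse, d, nlist, m, nbits)
    # Train on a sample of the cached vectors to reduce the per-stage overhead.
    sample_size = min(len(xb), 20000)
    sample = xb[np.random.choice(len(xb), sample_size, replace=False)]
    index.train(sample)
    index.add(xb)
    # "Move to secondary storage": persist the stage index (hypothetical path).
    faiss.write_index(index, f"stage_{stage_id}.ivfpq")
```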
In summary, the main characteristics of the product
quantization alternatives for the INGESTION operation
are (see Table 3):
IVFADC-NI: the overall cost is dominated by the
cost of updating the inverted lists, since the code-
book is fixed and created upfront from a train-
ing set of vectors, but the inverted lists grow un-
bounded.
IVFADC-SI, IVFADC-SD: at each stage, the cost
of updating the inverted lists is bounded, since the
number of entries in the lists is bounded by the
cache size.
IVFADC-SD: at each stage, the overhead is not
negligible, since a codebook is created using the
vectors in the cache; however, the codebook is
specific to the vectors in the cache, which might
increase recall.
IVFADC-SI: at each stage, the overhead is min-
imum when the codebook is fixed, since only a
reinitialization of the inverted lists is required;
when the codebook is updated by an online prod-
uct quantization algorithm, the overhead can be
non-negligible, however.
Table 3: Abbreviations for the ingestion strategies implemented with product quantization.

IVFADC-NI  IVFADC implementation of NI, the non-staged ingestion of a stream of vectors with incremental indexing
IVFADC-SI  IVFADC implementation of SI, the staged ingestion of a stream of vectors with incremental indexing
IVFADC-SD  IVFADC implementation of SD, the staged ingestion of a stream of vectors with deferred indexing
4.2 Experiments with Staged Product
Quantization
The experiments reported in this section assess the
performance of the IVFADC implementation of the
staged ingestion of a stream of vectors with deferred
indexing, referred to as IVFADC-SD in Table 3.
In more detail, the goals of the experiments are:
Build cost: evaluate the cost of IVFADC-SD, for various cache sizes.
Query cost: compare the cost of IVFADC-SD with a baseline when processing query sets.
Search quality: compare the mean average recall@R of IVFADC-SD with that of a baseline when processing a set of queries.
The experiments use:
Base dataset: a random partition of the 1 million INRIA Holidays images into 10 batches B_1, ..., B_10.
Query descriptors: from the INRIA Holidays images.
The base dataset and the query descriptors are as in (Jégou et al., 2011), except that the base dataset is partitioned into 10 sets.
The partition of the base dataset simulates the use of a cache that holds 100,000 images. That is, the implementation of IVFADC-SD simply processes each batch B_i and trains a codebook with a sample of B_i. Note that this simple strategy greatly facilitates the
Table 4: Recall values of the staged product quantization experiments.
recall@1 recall@5 recall@10 recall@20 recall@50 recall@100
Baseline 0.269 0.534 0.646 0.743 0.826 0.859
IVFADC-SD 0.251 0.504 0.618 0.725 0.823 0.865
experiments since it just simulates IVFADC-SD using
a standard implementation of IVFADC.
The baseline is taken as IVFADC applied to the non-partitioned base dataset, trained with a sample of the base dataset (rather than a separate training set as in (Jégou et al., 2011)).
Since the original experiments with this baseline
adopted Euclidean distance, it was also adopted here.
The search quality metric is the mean average recall@R, as in (Jégou et al., 2011). In general, given a query Q with a set r_Q of relevant vectors for Q, recall@R is the proportion of vectors in r_Q which are ranked in the first R positions. In this particular case, the set of relevant vectors for Q is a group of vectors closest to Q, as specified in (Jegou et al., 2008) (see also http://lear.inrialpes.fr/people/jegou/data.php).
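For concreteness, recall@R for a single query can be computed as sketched below, given the identifiers of its relevant vectors and the ranked result list; the helper name is illustrative.

```python
# Hypothetical helper: fraction of the relevant vectors of a query that
# appear among the first R results of the ranked list.
def recall_at_r(relevant_ids, ranked_ids, r):
    top_r = set(ranked_ids[:r])
    return len(top_r & set(relevant_ids)) / len(relevant_ids)

# Example: 2 of the 3 relevant vectors appear in the top 5 results.
print(recall_at_r({1, 2, 3}, [9, 2, 7, 1, 8, 3], 5))  # 0.666...
```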
In all experiments, the codebooks and the vectors
were stored in main memory. The experiments used a
PC with an Intel core i5-9600k CPU @ 3.7GHz pro-
cessor and 16 GB RAM (12GB for the VM), running
Ubuntu 20.04 on WSL2 VM and python 3.10.
The results were:
1. IVFADC-SD and the baseline IVFADC had equivalent total build costs, both in terms of training the codebooks and indexing the data.
2. IVFADC-SD had a significant increase of around 40% in query cost when querying across all 10 batches (the query costs are not detailed in a separate table since they were uniformly 40% higher).
3. Search quality, measured by mean average recall@R, saw a slight reduction for lower values of R, but was otherwise similar to the baseline, as observed in Table 4.
These results deserve a few comments. First, note that, if a new batch B_11 is considered, the baseline IVFADC would have to recompute the codebook. On the other hand, IVFADC-SD would only have the cost associated with processing B_11.
Second, the increase in query cost comes from the
need to query across the different codebooks, calculat-
ing the residual distance to each different product quan-
tizer centroid, and merging the individual result lists.
Evidently, the query cost depends on how many dif-
ferent batches span the relevant interval. Thus, narrow
intervals could significantly lower the query cost by
limiting the number of batches that must be searched.
Third, a possible reason for the decrease in search
quality would be that the partitioning and sampling
make the data too sparse, which gets accentuated at
lower values of R.
To conclude, these early experiments suggest that
the staged method does not incur significant overhead,
can achieve equivalent search quality, and has a query
cost that depends on the search interval. But the staged
method scales to vector streams of unbounded length,
whereas a non-staged method does not.
5 STAGED HNSW
This section describes experiments to compare the be-
havior of staged Hierarchical Navigable Small World
(staged HNSW) with a baseline, when the data volume
increases. A detailed discussion of implementation al-
ternatives for the INGESTION operation using HNSW
would follow along the lines of Section 4.1.
The experiments used data collected from a Brazilian online classified ads company, as follows. There are three main verticals: Real Estate, Vehicle, and Goods. The experiments target Goods ads, focusing on Electronics > Telephony & Cellphones (5.89% of approved ads). Daily, there is an average of 444k approved ads (about 5/sec) entering the platform. The datasets used in these experiments were constructed from approximately 50k, 100k, 250k, 500k, 750k, and 1MM ads collected from 2022/06/01 to 2022/07/10.
For brevity, this section reports experiments that used only the vectors obtained from encoding the ad texts (see Figure 1), all with the same dimension, equal to 768.
Staged HNSW was simulated in Redis with HNSW by dividing each dataset into batches. For each batch, Redis was used to create an HNSW index for the vector embeddings of the ad texts. The experiments divided each dataset into 5 batches.
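As an illustration of what creating and querying such a per-batch index might look like, the sketch below uses redis-py with the RediSearch vector similarity syntax (an HNSW vector field and a KNN query); the index name, key prefix, field name, and parameter values are illustrative, and the exact commands may vary across Redis Stack versions.

```python
# Hedged sketch: one per-batch HNSW index over 768-dim text embeddings in
# Redis (RediSearch vector similarity), plus a k-NN query against it.
import numpy as np
import redis

r = redis.Redis()

# Create an HNSW index over hashes whose keys start with "batch0:".
r.execute_command(
    "FT.CREATE", "idx_batch0", "ON", "HASH", "PREFIX", "1", "batch0:",
    "SCHEMA", "embedding", "VECTOR", "HNSW", "6",
    "TYPE", "FLOAT32", "DIM", "768", "DISTANCE_METRIC", "L2",
)

# Store one ad embedding (in practice, all embeddings of the batch).
vec = np.random.rand(768).astype(np.float32)
r.hset("batch0:ad42", mapping={"embedding": vec.tobytes()})

# Retrieve the k=4 closest vectors of this batch for a query embedding.
q = np.random.rand(768).astype(np.float32)
res = r.execute_command(
    "FT.SEARCH", "idx_batch0", "*=>[KNN 4 @embedding $qv]",
    "PARAMS", "2", "qv", q.tobytes(),
    "SORTBY", "__embedding_score", "DIALECT", "2",
)
```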
The baseline used Redis to store the set of vector
embeddings resulting from processing the ad texts,
without dividing the dataset into batches. Furthermore,
the baseline used the “flat” option, that is, Redis pro-
cesses queries by executing a full sequential scan of
the embeddings.
In all experiments, the vectors and the indices were
maintained in main memory, which required a much
Table 5: Indexing times (in ms) for various dataset and batch sizes (text-only).
Dataset size Batch size Batch 0 Batch 1 Batch 2 Batch 3 Batch 4 All batches HNSW
50,000 10,000 7,374 7,883 7,388 7,333 7,226 37,204 97,376
100,000 20,000 16,231 15,348 13,251 14,859 13,912 73,601 201,870
250,000 50,000 68,378 63,417 67,307 63,645 103,408 366,155 536,274
500,000 100,000 188,794 252,109 155,201 159,594 105,520 861,218 1,242,132
1,000,000 200,000 244,318 234,918 233,393 232,553 234,731 1,179,913 2,128,172
Table 6: Query processing times (in ms; dataset size = 1,000,000).
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Avg
FLAT (k=10) 3.4132 0.2694 0.2859 0.2629 0.2718 0.2855 0.2800 0.2687 0.3117 0.2714 0.2806
HNSW (k=10) 0.1241 0.0523 0.0935 0.0463 0.0697 0.0871 0.0793 0.0623 0.1401 0.0684 0.0796
HNSW batch 0 (k=4) 0.0831 0.0444 0.0676 0.0394 0.0534 0.0664 0.0600 0.0487 0.0925 0.0526 0.0595
HNSW batch 1 (k=4) 0.0689 0.0407 0.0577 0.0358 0.0462 0.0571 0.0519 0.0421 0.0776 0.0457 0.0513
HNSW batch 2 (k=4) 0.0660 0.0385 0.0533 0.0335 0.0412 0.0522 0.0470 0.0382 0.0764 0.0415 0.0472
HNSW batch 3 (k=4) 0.0660 0.0368 0.0521 0.0316 0.0382 0.0506 0.0457 0.0356 0.0765 0.0385 0.0454
HNSW batch 4 (k=4) 0.0660 0.0358 0.0519 0.0311 0.0380 0.0528 0.0454 0.0341 0.0765 0.0373 0.0452
larger HW configuration. Redis was run on a PC server
with OS GNU/Linux Ubuntu 16.04.6 LTS, a quad-
core processor Intel(R) Core(TM) i7-5820K CPU @
3.30GHz, with 64 GB of RAM and 1TB of SSD.
Table 5 shows the time spent on ingesting the vectors and building the indices in Redis. The lines correspond to the various dataset sizes. The column labeled “Batch size” indicates the batch size, which simulates the cache size. The columns labeled “Batch 0” through “Batch 4” show the time Redis took to ingest and build the HNSW index for the vectors in a given batch. The column labeled “All batches” shows the sum of the batch times. The column labeled “HNSW” shows the time Redis took to ingest and build the HNSW index for all vectors in a given dataset. Note that the sum of the times to ingest and index the vectors for all batches is roughly half of the time to ingest and index the full dataset, which is expected, given the way an HNSW index is built.
Table 6 shows the query processing times. The last column, labeled “Avg”, discards the outlier values min and max. Each line corresponds to a search alternative: “FLAT (k=10)” indicates that Redis used no index to retrieve the first k=10 vectors closest to Q_i, using Euclidean distance, from the full dataset with 1,000,000 vectors; “HNSW (k=10)” indicates that Redis used the HNSW index to retrieve the first k=10 vectors closest to Q_i, using Euclidean distance, from the full dataset with 1,000,000 vectors; and “HNSW batch i (k=4)” indicates that Redis used the HNSW index to retrieve the first k=4 vectors closest to Q_i, using Euclidean distance, from the 200,000 vectors in Batch i.
Note that the query processing times vary slightly from batch to batch. Also, note that the query processing times using HNSW are about 3.5x faster than using the “flat” option (no index). The query processing times using the HNSW batches in parallel are between 4.7x and 6.2x faster than using the “flat” option. It is important to stress that, since k=4 for the batches, processing the HNSW batches in parallel resulted in 5 × 4 = 20 vectors, which were sorted by score and filtered to obtain the top 10 vectors.
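A minimal sketch of this merge step, assuming each per-batch query returns a list of (vector id, distance) pairs, is shown below; the function name is illustrative.

```python
# Hypothetical merge of per-batch k-NN results: 5 batches x k=4 candidates
# are combined, sorted by distance, and cut to the global top 10.
def merge_batch_results(per_batch_results, top_k=10):
    # per_batch_results: list of lists of (vector_id, distance) pairs
    candidates = [hit for batch in per_batch_results for hit in batch]
    candidates.sort(key=lambda hit: hit[1])   # smaller distance = more similar
    return candidates[:top_k]
```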
To assess precision and recall, we selected, from the dataset with 1,000,000 vectors, 10 vectors to play the role of queries. For each query Q_i, we considered as the relevant vectors the top-10 vectors retrieved by Redis with the “flat” option from the dataset with 1,000,000 vectors, which amounts to the 10 vectors closest to Q_i in Euclidean distance, since Redis with the “flat” option performs a full dataset scan.
Euclidean distance was adopted for consistency with the experiments reported in Section 4.2.
Table 7 shows the mean average precision@k, for k = 1, 5, 10. Lines labeled “HNSW” correspond to processing each query Q_i using Redis with HNSW over the full dataset with 1,000,000 vectors. Likewise, lines labeled “HNSW batches” correspond to processing each query Q_i using Redis with HNSW over each batch with 200,000 vectors, keeping the first k=4 vectors and merging the results.
Table 8 shows the mean average recall@k, for k = 1, 5, 10, and is similarly organized.
Finally, a closer look at some query examples revealed that the ad used to create the first query Q1 was duplicated multiple times in the dataset. After a quick validation, it became clear that the same seller had posted several ads that were copies of each other. Thus, the first ten vectors retrieved came from these copies, with a Euclidean distance equal to 0. This suggests deduplicating the ads before constructing the test datasets.
Table 7: Precision values of the query experiments.
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Avg
HNSW
precision@1 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
precision@5 0.20 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.92
precision@10 0.50 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.94
HNSW batches
precision@1 1.00 0.00 1.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.50
precision@5 0.40 0.40 0.60 0.20 0.40 0.40 0.60 0.40 0.60 1.00 0.50
precision@10 0.50 0.50 0.60 0.40 0.30 0.40 0.60 0.30 0.50 0.80 0.49
Table 8: Recall values of the query experiments.
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Avg
HNSW
recall@1 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10
recall@5 0.10 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.46
recall@10 0.50 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.94
HNSW batches
recall@1 0.10 0.00 0.10 0.00 0.00 0.10 0.00 0.10 0.00 0.10 0.05
recall@5 0.20 0.20 0.30 0.10 0.20 0.20 0.30 0.20 0.30 0.50 0.25
recall@10 0.50 0.50 0.60 0.40 0.30 0.40 0.60 0.30 0.50 0.80 0.49
6 A CLASSIFIED AD RETRIEVAL
TOOL
This section briefly outlines a proof-of-concept classified ad retrieval tool (available at https://github.com/BrunoFMSilva/projeto-final-multimodal-clustering), based on Jina (https://jina.ai), a framework to build applications leveraging neural search engines, to control the retrieval process, and on the staged HNSW implementation introduced in Section 5, to index the vector stream.
The tool is structured into a main module and three auxiliary modules. The main module is responsible for controlling the task flow between the auxiliary modules and offers a user interface that allows the user to indicate the dataset to be used, among other features. The auxiliary modules are a text encoder, an image encoder, and a database.
Each ad has a key, a name, a brief textual descrip-
tion, and an image. The text and image encoders, as the
name implies, transform the text and image of an ad
into their vector representations. The vectors resulting
from the encodings are stored in Redis, as indicated in
Section 5. Each index depends on several parameters,
such as the embedding dimension, the metric used to
compute the distance between two vectors, and the
type of the index.
Following the staged strategy with deferred indexing, the tool buffers 50,000 ads before encoding and storing their text and images.
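As an illustration of the encoding step, a 768-dimensional text embedding can be produced with a sentence-transformers model, as sketched below; the model name is an illustrative assumption and not necessarily the encoder used by the tool.

```python
# Hedged sketch of encoding ad texts into 768-dimensional vectors with
# sentence-transformers; the model name is an illustrative assumption.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # 768-dim
texts = ["Vendo Motorola E7, três meses de uso, com nota fiscal"]
vectors = model.encode(texts)     # numpy array of shape (1, 768)
print(vectors.shape)
```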
As an example of the retrieval output, suppose that the user wants to find the top 3 ads most similar to the ad “Vendo Motorola E7. Vendo Motorola E7, Três meses de uso com nota fiscal, acompanha capinha e película de vidro” (“Sell Motorola E7. Sell Motorola E7, three months of use, with invoice, comes with a cover and a glass screen protector”). The tool will return:
1. “Motorola e7 muito conservado. Vendo Motorola e7 com 4 meses de uso nota fiscal e tudo” (“Motorola e7 in good condition. Sell Motorola e7 with 4 months of use, invoice and everything”).
2. “Motorola E7 semi novo na caixa. Vendo celular Motorola E7 na caixa no valor de 700.00 4 meses de uso” (“Motorola E7 almost new in the box. Sell Motorola E7 in the box for 700.00, 4 months of use”).
3. “Motorola E7 só hoje. Vendo esse Motorola E7 valor 500 reais .ele acompanha. Carregado original Fone de ouvido original Nota fiscal e caixa. Ele vai fazer 8 meses de uso. Motivo da venda *****. ZAP.********” (“Motorola E7 only today. Sell this Motorola E7 for 500 reais, it comes with. Original charger, original earphone, invoice and box. It will be 8 months old. Reason for sale *****. ZAP.********”).
Finally, to facilitate the experiments, the tool allows the user to test different configurations by varying the embedding dimensions, the type of the indices (“flat” or “HNSW”), the distance metrics adopted, and some optimization parameters, such as the construction of the indices in parallel and the amount of memory used.
7 CONCLUSIONS
The main contribution of this paper was a family
of algorithms, called staged vector stream similarity
search – SVS, to dynamically index a stream of high-
dimensional vectors and facilitate similarity search.
SVS is continuous in the sense that it does not depend
on having the full set of vectors available beforehand,
but adapts to the vector stream.
This family of algorithms provides an elegant solu-
tion to the vector stream similarity search problem that
does not depend on updating the underlying vector in-
dexing method, which is usually expensive, as pointed
out in the related work section. Indeed, the original
contribution of the paper stems from the observation
that a stream of vectors that become obsolete over
time requires an approach different from static vector
indexing methods or updating such data structures.
The paper discussed two sets of experiments to assess the performance of SVS. The first set of experiments used an IVFADC implementation and the same setup as in (Jégou et al., 2011), and the second set adopted an HNSW implementation over real data. These experiments suggested that the SVS implementations do not incur significant overhead and achieve reasonable search quality. Furthermore, unlike non-staged methods, SVS can support unbounded vector streams.
The paper concluded with a brief description of a
proof-of-concept implementation of a classified ad re-
trieval tool, based on Jina and Redis with HNSW. The
tool allows testing different configurations by varying
the embedding dimensions, the type of the indices,
the distance metrics adopted, and some optimization
parameters.
As future work, we plan to conduct further experi-
ments with the proof-of-concept retrieval tool, using
much larger datasets collected from the classified ad
platform and larger sets of realistic queries.
REFERENCES
Bengio, Y., Courville, A., and Vincent, P. (2013). Repre-
sentation learning: A review and new perspectives.
IEEE transactions on pattern analysis and machine
intelligence, 35(8):1798–1828.
Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U.
(1999). When is “nearest neighbor” meaningful? In
International conference on database theory, pages
217–235. Springer.
Cai, Y., Ji, R., and Li, S. (2016). Dynamic programming
based optimized product quantization for approximate
nearest neighbor search. Neurocomputing, 217:110–
118.
Costa Pereira, J., Coviello, E., Doyle, G., Rasiwasia, N.,
Lanckriet, G., Levy, R., and Vasconcelos, N. (2014).
On the role of correlation and abstraction in cross-
modal multimedia retrieval. Transactions of Pattern
Analysis and Machine Intelligence, 36(3):521–535.
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S.
(2004). Locality-sensitive hashing scheme based on
p-stable distributions. In Proceedings of the twentieth
annual symposium on Computational geometry, pages
253–262.
Fu, C., Xiang, C., Wang, C., and Cai, D. (2019). Fast
approximate nearest neighbor search with the nav-
igating spreading-out graph. Proc. VLDB Endow.,
12(5):461–474.
Ge, T., He, K., Ke, Q., and Sun, J. (2013). Optimized product
quantization for approximate nearest neighbor search.
In 2013 IEEE Conference on Computer Vision and
Pattern Recognition, pages 2946–2953.
Gionis, A., Indyk, P., Motwani, R., et al. (1999). Similarity
search in high dimensions via hashing. In Proc. 25th
International Conference on Very Large Data Bases,
page 518–529, San Francisco, CA, USA. Morgan Kauf-
mann Publishers Inc.
Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern,
F., and Kumar, S. (2020). Accelerating large-scale
inference with anisotropic vector quantization. In Proc.
37th International Conference on Machine Learning,
ICML’20, page 10. JMLR.org.
Hameed, I. M., Abdulhussain, S. H., and Mahmmod, B. M.
(2021). Content-based image retrieval: A review of
recent trends. Cogent Engineering, 8(1):1927469.
Jegou, H., Douze, M., and Schmid, C. (2008). Hamming
embedding and weak geometric consistency for large
scale image search. In Computer Vision – ECCV 2008,
pages 304–317.
Jégou, H., Douze, M., and Schmid, C. (2011). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128.
Johnson, J., Douze, M., and Jegou, H. (2021). Billion-scale
similarity search with gpus. IEEE Transactions on Big
Data, 7(03):535–547.
Li, X., Yang, J., and Ma, J. (2021). Recent developments of
content-based image retrieval (cbir). Neurocomputing,
452:675–689.
Liu, C., Lian, D., Nie, M., and Hu, X. (2020). Online
optimized product quantization. In 2020 IEEE Inter-
national Conference on Data Mining (ICDM), pages
362–371.
Malkov, Y. A. and Yashunin, D. A. (2020). Efficient and
robust approximate nearest neighbor search using hi-
erarchical navigable small world graphs. IEEE Trans.
Pattern Anal. Mach. Intell., 42(4):824–836.
Muja, M. and Lowe, D. G. (2009). Fast approximate near-
est neighbors with automatic algorithm configuration.
VISAPP (1), 2(331-340):2.
Xu, D., Tsang, I. W., and Zhang, Y. (2018). Online product
quantization. IEEE Transactions on Knowledge and
Data Engineering, 30(11):2185–2198.
Yang, W., Li, T., Fang, G., and Wei, H. (2020). Pase:
Postgresql ultra-high-dimensional approximate near-
est neighbor search extension. In Proc. 2020 ACM
SIGMOD International Conference on Management of
Data, page 2241–2253.
Yukawa, K. and Amagasa, T. (2021). Online optimized
product quantization for dynamic database using svd-
updating. In Database and Expert Systems Applica-
tions, pages 273–284.
Zeng, D., Yu, Y., and Oyama, K. (2020). Deep triplet neural
networks with cluster-cca for audio-visual cross-modal
retrieval. ACM Trans. Multimedia Comput. Commun.
Appl., 16(3).