SSTR: Set Similarity Join over Stream Data

Lucas Pacífico and Leonardo Andrade Ribeiro

Instituto de Informática, Universidade Federal de Goiás, Goiânia – Goiás, Brazil

Keywords:

Advanced Query Processing, Data Streams, Databases, Similarity Join

Abstract:

In modern application scenarios, large volumes of data are continuously generated over time at high speeds.

Delivering timely analysis results from such massive streams of data imposes challenging requirements for

current systems. Even worse, similarity matching can be needed owing to data inconsistencies, which is

computationally much more expensive than simple equality comparisons. In this context, this paper presents

SSTR, a novel similarity join algorithm for streams of sets. We adopt the concept of temporal similarity and

exploit its properties to improve efﬁciency and reduce memory usage. We provide an extensive experimental

study on several synthetic as well as real-world datasets. Our results show that the techniques we proposed

signiﬁcantly improve scalability and lead to substantial performance gains in most settings.

1 INTRODUCTION

In the current Big Data era, large volumes of data

are continuously generated over time at high speeds.

Very often, there is a need for immediate processing
of such data streams to deliver analysis results in

a timely fashion. Examples of such application sce-

narios abound, including social networks, Internet of

Things, sensor networks, and a wide variety of log

processing systems. Over the years, several stream

processing systems have emerged seeking to meet this

demand (Abadi et al., 2005; Carbone et al., 2015).

However, the requirements for stream processing

systems are often conﬂicting. Many applications de-

mand comparisons between historical and live data,

together with requirements for instantaneous
processing and fast response times (see Rules 5 and

8 in (Stonebraker et al., 2005)). To deliver results in

real-time, it is imperative to avoid extreme latencies

caused by disk accesses. However, maintaining all

data in the main memory is impractical for unbounded

data streams.

The problem becomes even more challenging in

the presence of stream imperfections, which have to be

handled without causing delays in operations (Rule 3

in (Stonebraker et al., 2005)). In the case of streams

coming from different sources, such imperfections

may include the so-called fuzzy duplicates, i.e., mul-

tiple and non-identical representations of the same

information. The identification of this type of redundancy requires similarity comparisons, which are computationally much more expensive than simple equality comparisons.

Table 1: Messages from distinct sources about a football match.

Source  Time  Message
X       270   Great chance missed within the penalty area.
Y       275   Shooting chance missed within the penalty area.
Z       420   Great chance missed within the penalty area.

Further, data streams have an intrinsic temporal nature.
A timestamp is typically associated with each

data object recording, for example, the time of its ar-

rival. This temporal attribute represents important se-

mantic information and, thus, can affect a given no-

tion of similarity. Therefore, it is intuitive to consider

that the similarity between two data objects decreases

with their temporal distance.

As a concrete example, consider a web site provid-

ing live scores and commentary about sporting events,

which aggregates streams from different sources. Because an event can be covered by more than one source, multiple arriving messages may actually describe the same moment. Posting such redundant messages is likely to annoy users and degrade their experience. This issue can be addressed by

performing a similarity (self-)join over the incoming streams: a similarity join returns all pairs of data objects whose similarity is not less than a specified threshold.

Pacíﬁco, L. and Ribeiro, L.

SSTR: Set Similarity Join over Stream Data.

DOI: 10.5220/0009420400520060

In Proceedings of the 22nd International Conference on Enterprise Information Systems (ICEIS 2020) - Volume 1, pages 52-60

ISBN: 978-989-758-423-7

Thus, a new message is only posted if there are no

previous ones that are similar to it. In this context,

temporal information is crucial for similarity assess-

ment because two textually similar messages might be

considered as distinct if the difference in their arrival

time is large. For example, Table 1 shows three mes-

sages about a soccer match from different sources. All

messages are very similar to one another. However,

considering the time of arrival of each message, one

can conclude that, while the ﬁrst two messages refer

to the same moment of the match, the third message,

despite being identical to the ﬁrst one, actually is re-

lated to a different moment.

Morales and Gionis introduced the concept of

temporal similarity for streams (Morales and Gio-

nis, 2016). Besides expressing the notion of time-

dependent similarity, this concept is directly used to

design efﬁcient similarity join algorithms for streams

of vectors. The best-performing algorithm exploits

temporal similarity to reduce the number of comparisons. Moreover, such time-dependent similarity allows establishing an "aging factor": after some time, a

given data object cannot be similar to any new data ar-

riving in the stream and, thus, can be safely discarded

to reduce memory consumption.

This paper presents an algorithm for similarity

joins over set streams. There is a vast literature on set

similarity joins for static data (Sarawagi and Kirpal,

2004; Chaudhuri et al., 2006; Xiao et al., 2011; Vernica et al., 2010; Ribeiro and Härder, 2011; Quirino et al., 2017; Ribeiro-Júnior et al., 2017; Mann et al.,

2016; Wang et al., 2017); however, to the best of our

knowledge, there is no prior work on this type of sim-

ilarity join for stream data. Here, we adapt the notion

of temporal similarity to sets and exploit its proper-

ties to reduce both comparison and memory spaces.

We provide an extensive experimental study on sev-

eral synthetic as well as real-world datasets. Our re-

sults show that the techniques we proposed signiﬁ-

cantly improve scalability and lead to substantial per-

formance gains in most settings.

The remainder of the paper is organized as fol-

lows. Section 2 provides background material. Sec-

tion 3 presents our proposed algorithm and tech-

niques. Section 4 describes the experimental study

and analyzes its results. Section 5 reviews relevant

related work. Section 6 summarizes the paper and

discusses future research.

2 BACKGROUND

In this section, we deﬁne the notions of set and tem-

poral similarity, together with essential optimizations

derived from these deﬁnitions. Then, we formally

state the problem considered in this paper.

2.1 Set Similarity

This work focuses on streams of data objects represented
as sets. Intuitively, the similarity between two

sets is determined by their intersection. Represent-

ing data objects as sets for similarity assessment is

a widely used approach for string data (Chaudhuri

et al., 2006).

Strings can be mapped to sets in several ways.

For example, the string “Great chance missed within

the penalty area” can be mapped to the set of

words {’Great’, ’chance’, ’missed’, ’within’, ’the’,

’penalty’, ’area’}. Another well-known method is

based on the concept of q-grams, i.e., substrings of

length q obtained by “sliding” a window over the

characters of the input string. For example, the string

“similar” can be mapped to the set of 3-grams {’sim’,

’imi’, ’mil’, ’ila’, ’lar’}. Henceforth, we generically

refer to a set element as a token.
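The two mappings described above can be sketched in a few lines of Python. This is only an illustration: the function names `word_tokens` and `qgrams` are hypothetical, splitting on whitespace is an assumed word-tokenization rule, and the handling of strings shorter than q is an assumption not covered by the text.

```python
def word_tokens(text: str) -> set:
    """Map a string to the set of its whitespace-separated words."""
    return set(text.split())


def qgrams(text: str, q: int = 3) -> set:
    """Map a string to the set of its substrings of length q (sliding window)."""
    if len(text) < q:
        # Assumption: a string shorter than q yields itself as its only token.
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


print(sorted(qgrams("similar")))
print(sorted(word_tokens("Great chance missed within the penalty area")))
```

Because the result is a set, repeated words or repeated q-grams in the input collapse into a single token.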

A similarity function returns a value in the interval

[0, 1] quantifying the underlying notion of similarity

between two sets; greater values indicate higher simi-

larity. In this paper, we focus on the well-known Jac-

card similarity; nevertheless, all techniques described

here apply to other similarity functions such as Dice
and Cosine (Xiao et al., 2011).

Definition 1 (Jaccard Similarity). Given two sets x and y, the Jaccard similarity between them is defined as

$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$.

Example 1. Consider the sets x and y below, derived from the first two messages in Table 1 (sources X and Y):

x = {'Great', 'chance', 'missed', 'within', 'the', 'penalty', 'area'},
y = {'Shooting', 'chance', 'missed', 'within', 'the', 'penalty', 'area'}.

Then, we have $J(x, y) = \frac{6}{7 + 7 - 6} = 0.75$.
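As a quick check of Example 1, the Jaccard similarity can be computed directly from the set sizes and the intersection size. The sketch below is illustrative only; the function name `jaccard` and the convention for two empty sets are assumptions.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two token sets."""
    if not x and not y:
        return 1.0  # assumption: two empty sets are treated as identical
    inter = len(x & y)
    # |x ∪ y| = |x| + |y| - |x ∩ y|
    return inter / (len(x) + len(y) - inter)


x = {"Great", "chance", "missed", "within", "the", "penalty", "area"}
y = {"Shooting", "chance", "missed", "within", "the", "penalty", "area"}
print(jaccard(x, y))  # 0.75
```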

A fundamental property of the Jaccard similarity is

that any predicate of the form J (x, y) ≥ γ, where γ is

a threshold, can be equivalently rewritten in terms of

an overlap bound.

Lemma 1 (Overlap Bound (Chaudhuri et al., 2006)). Given two sets x and y, and a similarity threshold γ, let O(x, y, γ) denote the corresponding overlap bound, for which the following holds:

$J(x, y) \geq \gamma \iff |x \cap y| \geq O(x, y, \gamma) = \frac{\gamma}{1 + \gamma} \times (|x| + |y|)$.
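For readers who want the intermediate steps, the equivalence can be derived from the definition of the Jaccard similarity by using $|x \cup y| = |x| + |y| - |x \cap y|$. The derivation below is a standard rearrangement written out here for convenience, not a quotation from the cited work.

```latex
\begin{align*}
J(x, y) \geq \gamma
  &\iff \frac{|x \cap y|}{|x| + |y| - |x \cap y|} \geq \gamma \\
  &\iff |x \cap y| \geq \gamma \times (|x| + |y|) - \gamma \times |x \cap y| \\
  &\iff (1 + \gamma) \times |x \cap y| \geq \gamma \times (|x| + |y|) \\
  &\iff |x \cap y| \geq \frac{\gamma}{1 + \gamma} \times (|x| + |y|).
\end{align*}
```

For the sets of Example 1 and an assumed threshold γ = 0.7, the bound is (0.7 / 1.7) × 14 ≈ 5.76; the two sets share 6 tokens, so the predicate holds, in agreement with J(x, y) = 0.75 ≥ 0.7.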

The overlap bound provides the basis for several filtering techniques. Arguably, the most popular and effective techniques are the size-based filter (Sarawagi and Kirpal, 2004), the prefix filter (Chaudhuri et al., 2006), and the positional filter (Xiao et al., 2011), which we review in the following.

2.2 Optimization Techniques

Intuitively, the difference in size between two simi-

lar sets cannot be too large. Thus, one can quickly

discard set pairs whose sizes differ enough.

Lemma 2 (Size-based Filter (Sarawagi and Kirpal, 2004)). For any two sets x and y, and a similarity threshold γ, the following holds:

$J(x, y) \geq \gamma \implies \gamma \leq \frac{|x|}{|y|} \leq \frac{1}{\gamma}$.

Preﬁx ﬁlter allows discarding candidate set pairs by

only inspecting a fraction of them. To this end, we

ﬁrst ﬁx a total order on the universe U from which all

tokens are drawn.

Lemma 3 (Prefix Filter (Chaudhuri et al., 2006)). Given a set x and a similarity threshold γ, let pref(x, γ) ⊆ x denote the subset of x containing its first $|x| - \lceil |x| \times \gamma \rceil + 1$ tokens. For any two sets x and y, and a similarity threshold γ, the following holds:

$J(x, y) \geq \gamma \implies pref(x, \gamma) \cap pref(y, \gamma) \neq \emptyset$.
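The size-based and prefix filters (Lemmas 2 and 3) can be illustrated with a small Python sketch. The function names, the plain string ordering used as the total order on tokens, and the tolerance `EPS` are assumptions made for this example; the tolerance only guards against floating-point noise when |x| × γ is mathematically an integer.

```python
import math

EPS = 1e-9  # tolerance against floating-point noise in gamma * |x|


def size_compatible(x_len: int, y_len: int, gamma: float) -> bool:
    """Lemma 2: sets can only be similar if gamma <= |x| / |y| <= 1 / gamma."""
    return gamma * y_len - EPS <= x_len <= y_len / gamma + EPS


def prefix(x_sorted: list, gamma: float) -> list:
    """Lemma 3: the first |x| - ceil(|x| * gamma) + 1 tokens of x.

    x_sorted must already follow the fixed total order on the token universe.
    """
    n = len(x_sorted)
    return x_sorted[: n - math.ceil(n * gamma - EPS) + 1]


# Sets of Example 1, ordered by Python's default string order (an assumption).
x = sorted(["Great", "chance", "missed", "within", "the", "penalty", "area"])
y = sorted(["Shooting", "chance", "missed", "within", "the", "penalty", "area"])
print(prefix(x, 0.75), prefix(y, 0.75))
```

With γ = 0.75, each prefix contains 7 − ⌈5.25⌉ + 1 = 2 tokens, and the two prefixes share the token 'area', as Lemma 3 requires for a pair with J(x, y) = 0.75.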

The positional ﬁlter also exploits token ordering for

pruning. This technique ﬁlters dissimilar set pairs us-

ing the position of matching tokens.

Lemma 4 (Positional filter (Xiao et al., 2011)). Given a set x, let w = x[i] be a token of x at position i, which divides x into two partitions, $x_l(w) = x[1, .., (i - 1)]$ and $x_r(w) = x[i, .., |x|]$. Thus, for any two sets x and y, and a similarity threshold γ, the following holds:

$J(x, y) \geq \gamma \implies |x_l(w) \cap y_l(w)| + \min(|x_r(w)|, |y_r(w)|) \geq O(x, y, \gamma)$.
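A minimal Python sketch of the positional check follows, assuming 1-based positions i and j of a common token w in x and y, and a counter of common tokens already found before w. All names and the tolerance `EPS` are hypothetical choices for this illustration.

```python
EPS = 1e-9  # tolerance for floating-point comparisons


def overlap_bound(x_len: int, y_len: int, gamma: float) -> float:
    """Overlap bound of Lemma 1: gamma / (1 + gamma) * (|x| + |y|)."""
    return gamma / (1.0 + gamma) * (x_len + y_len)


def passes_positional_filter(x_len: int, i: int, y_len: int, j: int,
                             matched_before: int, gamma: float) -> bool:
    """Check whether a pair can still reach the overlap bound.

    i and j are the 1-based positions of a common token w in x and y;
    matched_before counts the common tokens located strictly before w.
    """
    # w itself plus at most the shorter of the two remaining suffixes
    upper = 1 + min(x_len - i, y_len - j)
    return matched_before + upper >= overlap_bound(x_len, y_len, gamma) - EPS


# Two sets of 7 tokens and an assumed threshold of 0.7 (bound is about 5.76).
print(passes_positional_filter(7, 2, 7, 2, 0, 0.7))  # True
print(passes_positional_filter(7, 4, 7, 4, 0, 0.7))  # False
```

In the second call, the first common token appears only at position 4 in both sets, so at most 4 tokens can be shared and the pair is pruned without computing the full intersection.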

2.3 Temporal Similarity

Each set x is associated with a timestamp, denoted

by t (x), which indicates, for example, its arrival

time. Formally, the input stream is denoted by $S = \langle \ldots, (x_i, t(x_i)), (x_{i+1}, t(x_{i+1})), \ldots \rangle$.

The concept of temporal similarity captures the in-

tuition that the similarity between two sets diminishes

with their temporal distance. To this end, the differ-

ence in the arrival time is incorporated into the simi-

larity function.

Definition 2 (Temporal Similarity (Morales and Gionis, 2016)). Given two sets x and y, let $\Delta t = |t(x) - t(y)|$ be the difference in their arrival time. The temporal similarity between x and y is defined as

$J_{\Delta t}(x, y) = J(x, y) \times e^{-\lambda \times \Delta t}$,

where λ is a time-decay parameter.

Example 2. Consider again the sets x and y from Example 1, obtained from sources X and Y, respectively, in Table 1, and λ = 0.01. We have $\Delta t = 5$ and, thus, $J_{\Delta t}(x, y) = 0.75 \times e^{-\lambda \times 5} \approx 0.71$. Consider now set z obtained from source Z. Despite sharing all tokens, x and z have a relatively large temporal distance, i.e., $\Delta t = 150$. As a result, we have $J_{\Delta t}(x, z) = 1 \times e^{-\lambda \times 150} \approx 0.22$.
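The numbers in Example 2 can be reproduced with a few lines of Python. The function name `temporal_similarity` and its argument layout are assumptions made for this sketch.

```python
import math


def temporal_similarity(jaccard_sim: float, t_x: float, t_y: float,
                        lam: float) -> float:
    """Jaccard similarity damped by exp(-lam * |t_x - t_y|) (Definition 2)."""
    return jaccard_sim * math.exp(-lam * abs(t_x - t_y))


# Sources X and Y: J = 0.75, arrival times 270 and 275
print(round(temporal_similarity(0.75, 270, 275, 0.01), 2))  # 0.71
# Sources X and Z: identical sets (J = 1), arrival times 270 and 420
print(round(temporal_similarity(1.0, 270, 420, 0.01), 2))   # 0.22
```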

Note that $J_{\Delta t}(x, y) = J(x, y)$ when $\Delta t = 0$ or λ = 0, and its limit is 0 as $\Delta t$ approaches infinity, at an exponential rate modulated by λ. The time-decay factor, together with the similarity threshold, allows defining a time filter: given a set x, after a certain period, called time horizon, no newly arriving set can be similar to x.

Lemma 5 (Time Filter (Morales and Gionis, 2016)). Given a time-decay factor λ, let $\tau = \frac{1}{\lambda} \times \ln \frac{1}{\gamma}$ be the time horizon. Thus, for any two sets x and y, the following holds:

$J_{\Delta t}(x, y) \geq \gamma \implies \Delta t \leq \tau$.
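The time horizon follows from the fact that J(x, y) ≤ 1, so a pair can only reach the threshold while $e^{-\lambda \times \Delta t} \geq \gamma$. A small Python sketch, with the hypothetical helper name `time_horizon` and assuming λ > 0 and 0 < γ ≤ 1:

```python
import math


def time_horizon(gamma: float, lam: float) -> float:
    """Time horizon tau = (1 / lam) * ln(1 / gamma); requires lam > 0."""
    return math.log(1.0 / gamma) / lam


tau = time_horizon(0.5, 0.01)
print(round(tau, 2))  # 69.31
# Messages X and Z of Table 1 arrive 150 time units apart, beyond the horizon.
print(150 > tau)      # True
```

Under these assumed parameters (γ = 0.5, λ = 0.01), the message from source X can be discarded roughly 69 time units after its arrival, well before the message from source Z arrives.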

Note that the time horizon establishes a temporal win-

dow of ﬁxed size, which slides as a new set arrives.

While in the traditional sliding window model (Bab-

cock et al., 2002) the amount of stream data is ﬁxed,

the number of sets can vary widely across different

temporal windows.

2.4 Problem Statement

We are now ready to formally deﬁne the problem con-

sidered in this paper.

Deﬁnition 3 (Similarity Join over Set Streams).

Given a stream of timestamped sets S , a similarity

threshold γ, and a time-decay factor λ, a similarity

join over S returns all set pairs (x, y) in S such that $J_{\Delta t}(x, y) \geq \gamma$.

3 SIMILARITY JOIN OVER SET STREAMS

In this section, we present our proposal to solve the

problem of efﬁciently answering similarity joins over

set streams. We ﬁrst describe a baseline approach

based on a straightforward adaptation of an existing

set similarity join algorithm for static data. Then, we

present the main contribution of this paper, a new al-

gorithm deeply integrating characteristics of tempo-

ral similarity to improve runtime and reduce memory

consumption.

ICEIS 2020 - 22nd International Conference on Enterprise Information Systems

Algorithm 1: The PPJoin algorithm over set streams.

Input: Set stream S, threshold γ, decay λ
Output: All pairs (x, y) ∈ S s.t. J_Δt(x, y) ≥ γ

 1  I_i ← ∅ (1 ≤ i ≤ |U|)
 2  while true do
 3      x ← read(S)
 4      M ← empty map from set id to int
 5      for i ← 1 to |pref(x, γ)| do
 6          k ← x[i]
 7          foreach (y, j) ∈ I_k do
 8              if |y| < |x| × γ then
 9                  continue
10              ubound ← 1 + min(|x| − i, |y| − j)
11              if M[y] + ubound ≥ O(x, y, γ) then
12                  M[y] ← M[y] + 1
13              else
14                  M[y] ← −∞
15          I_k ← I_k ∪ (x, i)
16      R′ ← Verify(x, M, γ)
17      R ← ApplyDecay(R′, γ, λ)
18      Emit(R)

3.1 Baseline Approach

Most state-of-the-art set similarity join algorithms

follow a ﬁltering-veriﬁcation framework (Mann et al.,

2016). In this framework, the input set collection

is scanned sequentially, and each set goes through

the ﬁltering and veriﬁcation phases. In the ﬁltering

phase, tokens of the current set (called henceforth

probe set) are used to ﬁnd potentially similar sets that

have already been processed (called henceforth candi-

date sets). The ﬁlters discussed in the previous section

are then applied to reduce the number of candidates.

This phase is supported by an inverted index, which

is incrementally built as the sets are processed. In the

veriﬁcation phase, the similarity between the probe

set and each of the surviving candidates is fully cal-

culated, and those pairs satisfying the similarity pred-

icate are sent to the output.

A naive way to perform set similarity join in a

stream setting is to simply carry out the ﬁltering and

veriﬁcation phases of an existing algorithm on each

incoming set. The temporal decay is then applied to

the similarity of the pairs returned by the veriﬁcation

in a post-processing phase, before sending results to

the output.

Algorithm 1 describes this naive approach for

PPJoin (Xiao et al., 2011), one of the best perform-

ing algorithms in a recent empirical evaluation (Mann

et al., 2016). The algorithm continuously processes

sets from the input stream as they arrive. The ﬁlter-

ing phase uses preﬁx tokens (Line 5) to probe the in-

verted index (Line 7). Each set found in the associated

inverted list is considered a candidate and checked

against conditions using the size-based ﬁlter (Line 8)

and the positional ﬁlter (Lines 10–11). A reference to

the probe set is appended to the inverted list associ-

ated with each preﬁx token (Line 15). Not shown in

the algorithm, the veriﬁcation phase (Line 16) can be

highly optimized by exploiting the token ordering in

a merge-like fashion and the overlap bound to deﬁne

early stopping conditions (Ribeiro and Härder, 2011).

Finally, the temporal decay is applied, and a last check

against the threshold is performed to produce an out-

put (Line 17).

Clearly, the above approach has two serious draw-

backs. First, space consumption of the inverted in-

dex can be exorbitant and quickly exceed the avail-

able memory. Even worse, a large part of the index

can be stale entries, i.e., entries referencing sets that

will not be similar to any set arriving in the future.

Second, temporal decay is applied only after the ver-

iﬁcation phase. Therefore, much computation in the

veriﬁcation is wasted on set pairs that cannot be sim-

ilar owing to the difference in their arrival times.

3.2 The SSTR Algorithm

We now present our proposed algorithm called SSTR

for similarity joins over set streams. SSTR exploits

properties of the temporal similarity deﬁnition to

avoid the pitfalls of the naive approach. First, SSTR

dynamically removes old entries from the inverted

lists that are outside the window induced by the probe

set and the time horizon. Second, it uses the tempo-

ral decay to derive a new similarity threshold between

the probe set and each candidate set. This new thresh-

old is greater than the original, which increases the

effectiveness of the size-based and positional ﬁlters.
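The candidate-specific threshold follows directly from Definition 2. A short derivation is given below, writing γ′ for the new threshold; the symbol is chosen here for readability and is an assumption about notation.

```latex
\begin{align*}
J_{\Delta t}(x, y) \geq \gamma
  &\iff J(x, y) \times e^{-\lambda \times \Delta t} \geq \gamma \\
  &\iff J(x, y) \geq \gamma \times e^{\lambda \times \Delta t} = \gamma'.
\end{align*}
```

Since $e^{\lambda \times \Delta t} \geq 1$, we have γ′ ≥ γ, with equality only when Δt = 0 or λ = 0. As an illustration with assumed values γ = 0.7, λ = 0.01, and Δt = 5, the threshold rises to γ′ ≈ 0.736, and the overlap bound for two sets of 7 tokens rises from about 5.76 to about 5.93.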

The steps of SSTR are formalized in Algorithm 2. References to sets whose difference in arrival time with the probe set is greater than the time horizon are removed as the inverted lists are scanned (Line 8).

Note that the entries in the inverted lists are sorted in

increasing timestamp order. Thus, all stale entries are

grouped at the beginning of the lists. For each can-

didate set, a new threshold value is calculated (Line

10), which is used in the size-based ﬁlter and to cal-

culate the overlap bound (Lines 11 and 15, respec-

tively). In the same way, such increased, candidate-

speciﬁc threshold is also used in the veriﬁcation phase

to obtain greater overlap bounds and, thus, improve

the effectiveness of the early-stop conditions. For this reason, the time-decay parameter is passed to the Verify procedure (Line 20), which now directly produces output pairs.

Algorithm 2: The SSTR algorithm.

Input: Set stream S, threshold γ, decay λ
Output: All pairs (x, y) ∈ S s.t. J_Δt(x, y) ≥ γ

 1  τ = (1/λ) × ln(1/γ)
 2  I_i ← ∅ (1 ≤ i ≤ |U|)
 3  while true do
 4      x ← read(S)
 5      M ← empty map from set id to int
 6      for i ← 1 to |pref(x, γ)| do
 7          k ← x[i]
 8          Remove all (y, j) from I_k s.t. Δt > τ
 9          foreach (y, j) ∈ I_k do
10              γ′ ← γ / e^(−λ×Δt)
11              if |y| < |x| × γ′ then
12                  M[y] ← −∞
13                  continue
14              ubound ← 1 + min(|x| − i, |y| − j)
15              if M[y].s + ubound ≥ O(x, y, γ′) then
16                  M[y].s ← M[y].s + 1
17              else
18                  M[y] ← −∞
19          I_k ← I_k ∪ (x, i)
20      R ← Verify(x, M, γ, λ)
21      Emit(R)
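To make the control flow of SSTR concrete, the following Python sketch mirrors the main steps: stale-entry removal, candidate-specific threshold, size-based and positional filters, and verification. It is a simplified illustration under stated assumptions, not the authors' implementation: the name `sstr_join`, the tuple layout of the stream, and the tolerance `EPS` are hypothetical; postings keep a reference to the whole token list; and verification computes the full intersection instead of the optimized early-stopping merge.

```python
import math
from collections import defaultdict, deque

EPS = 1e-9  # tolerance for floating-point comparisons


def sstr_join(stream, gamma, lam):
    """Yield (x_id, y_id, sim) for all pairs with temporal similarity >= gamma.

    stream yields (set_id, timestamp, tokens) in nondecreasing timestamp
    order; tokens is a duplicate-free list sorted by one fixed global order.
    """
    tau = math.log(1.0 / gamma) / lam        # time horizon
    index = defaultdict(deque)               # token -> postings (id, t, tokens, pos)
    for x_id, t_x, x in stream:
        overlap = {}                         # candidate id -> count, None if pruned
        cands = {}                           # candidate id -> (timestamp, tokens)
        plen = len(x) - math.ceil(len(x) * gamma - EPS) + 1
        for i in range(1, plen + 1):
            postings = index[x[i - 1]]
            # Postings are in arrival order, so stale entries sit at the front.
            while postings and t_x - postings[0][1] > tau:
                postings.popleft()
            for y_id, t_y, y, j in postings:
                seen = overlap.get(y_id, 0)
                if seen is None:
                    continue                 # already pruned
                g = gamma * math.exp(lam * (t_x - t_y))  # candidate-specific threshold
                bound = g / (1.0 + g) * (len(x) + len(y))
                ubound = 1 + min(len(x) - i, len(y) - j)
                if len(y) < len(x) * g - EPS or seen + ubound < bound - EPS:
                    overlap[y_id] = None     # size-based or positional filter
                else:
                    overlap[y_id] = seen + 1
                    cands[y_id] = (t_y, y)
            postings.append((x_id, t_x, x, i))
        for y_id, count in overlap.items():
            if count is None:
                continue
            t_y, y = cands[y_id]
            inter = len(set(x) & set(y))     # simplified verification
            sim = inter / (len(x) + len(y) - inter) * math.exp(-lam * (t_x - t_y))
            if sim >= gamma - EPS:
                yield x_id, y_id, sim


msgs = [
    ("X", 270, "Great chance missed within the penalty area"),
    ("Y", 275, "Shooting chance missed within the penalty area"),
    ("Z", 420, "Great chance missed within the penalty area"),
]
stream = [(sid, t, sorted(set(text.split()))) for sid, t, text in msgs]
results = list(sstr_join(stream, gamma=0.7, lam=0.01))
print(results)
```

On the messages of Table 1 with the assumed parameters γ = 0.7 and λ = 0.01, only the pair (Y, X) is reported. When Z arrives, the entries of X and Y are removed from the inverted lists it probes because they lie beyond the time horizon, so Z produces no candidates at all.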

Even with the removal of stale entries from the inverted lists, SSTR can still incur high memory consumption for temporal windows containing

too many sets. This situation can happen due to very

small time-decay parameters leading to large win-

dows or at peak data stream rate leading to ”dense”

windows. In such cases, sacriﬁcing timeliness by re-

sorting to some approximation method, such as batch

processing (Babcock et al., 2002), is inevitable. Nev-

ertheless, considering a practical scenario where a

memory budget has been deﬁned, the SSTR algorithm

can dramatically reduce the frequency of such batch

processing modes in comparison to the baseline ap-

proach, as we empirically demonstrate next.

In our implementation, we avoid repeated calculations

of candidate-speciﬁc thresholds and overlap bounds by stor-

ing them in the map M.

Table 2: Dataset statistics.

Name      Population   Avg. set size  Timestamp
DBLP         350 000   76             Poisson
WIKI       1 000 000   53             Uniform
TWITTER    2 824 998   90             Publishing Date
REDDIT    19 456 493   53             Publishing Date

4 EXPERIMENTS

We now present an experimental study of the tech-

niques proposed in this paper. The goal of our ex-

periments is to evaluate the effectiveness of our pro-

posed techniques for reducing the comparison space

and memory consumption. To this end, we com-

pare our SSTR algorithm with the baseline approach,

which is abbreviated to SPPJ (streaming PPJoin). In

this context, we also evaluate the effect of the parameters
γ and λ on the resulting execution times.

4.1 Datasets and Setup

We used four datasets: DBLP, containing information about computer science publications; WIKI, an encyclopedia containing generalized information about different topics; TWITTER, geocoded tweets collected during the 2018 Brazilian elections; and REDDIT, a social news aggregation, web content rating, and discussion website (Baumgartner, 2019). For

DBLP and WIKI, we started by randomly selecting

70k and 200k article titles, respectively. Then, we

generated four fuzzy duplicates from each string by

performing transformations on string attributes, such

as character insertions, deletions, or substitutions. We
ended up with 350k and 1M strings for DBLP and

WIKI, respectively. Finally, we assigned artiﬁcial

timestamps to each string in these datasets, sampled

from a Poisson (DBLP) and Uniform (WIKI) dis-

tribution function. For this reason, we call DBLP

and WIKI (semi)synthetic datasets. For TWITTER

and REDDIT, we used the complete dataset available

without applying any modiﬁcation, where the publi-

cation time available for each item was used as times-

tamp. For this reason, we call TWITTER and RED-

DIT real-world datasets. The datasets are heteroge-

neous, exhibiting different characteristics, as summa-

rized in Table 2.

For the similarity threshold γ, we explore a range of values in [0.5, 0.95], while for the time-decay factor λ we use exponentially increasing values in the range [$10^{-4}$, $10^{-1}$]. For all datasets, we tokenized the strings into sets of 3-grams, hashed the tokens into four-byte values, and ordered them within each set lexicographically.

Dataset sources:
DBLP: dblp.uni-trier.de/xml
WIKI: https://en.wikipedia.org/wiki/Wikipedia:Database_download
TWITTER: https://developer.twitter.com/en/products/tweets
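The preprocessing pipeline described above can be sketched as follows. This is an assumption-laden illustration: the paper does not name the hash function, so CRC-32 from Python's `zlib` module stands in for it, the function name `to_token_list` is hypothetical, and ordering the hashed values numerically is one possible reading of the ordering step.

```python
import zlib


def to_token_list(text: str, q: int = 3) -> list:
    """Tokenize into q-grams, hash each to a 32-bit value, and sort the set."""
    grams = {text[i:i + q] for i in range(len(text) - q + 1)}
    # zlib.crc32 returns an unsigned 32-bit integer, i.e., a four-byte value.
    hashed = {zlib.crc32(g.encode("utf-8")) for g in grams}
    return sorted(hashed)


tokens = to_token_list("similar")
print(len(tokens), tokens == sorted(tokens))
```

Fixed-width integer tokens make comparisons cheap and give every set the same global order, which the prefix and positional filters rely on.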

We conduct our experiments on an Intel E5-2620

@ 2.10GHz with 15MB of cache, 16GB of RAM,

running Ubuntu 16.04 LTS. We report the average

runtime over ﬁve runs. All algorithms were imple-

mented in Java SDK 11.

Some parameter conﬁgurations were very trouble-

some to execute in our hardware environment, both in

terms of runtime and memory; this is particularly the

case for SPPJ on the largest datasets. As a result, we

were unable to ﬁnish the execution of the algorithms

in some settings. In this study, the experiments have

a timeout of 3 hours for each execution.

4.2 Results

We ﬁrst analyze the results on the synthetic datasets.

Figure 1 plots the runtimes for SSTR and SPPJ on

DBLP and WIKI datasets. As expected, SPPJ was

only able to ﬁnish its execution for very high thresh-

old values. In contrast, SSTR successfully terminated

in all settings on WIKI. Higher threshold values in-

crease the effectiveness of the preﬁx ﬁlter, which ben-

eﬁts both SPPJ and SSTR. Yet, in most cases where

SPPJ was able to terminate, SSTR was up to three or-

ders of magnitude faster. These results highlight the

effectiveness of our techniques in drastically reduc-

ing the number of similarity comparisons as well as

memory usage.

On the DBLP dataset, SSTR terminates within the

time limit for all threshold values only for λ = 0.1.

The reason is that the Poisson distribution generates

some very dense temporal windows, with set objects

temporally very close to each other. For small time-

decay values, temporal windows are large and more

sets have to be kept in the inverted lists and compared

in the veriﬁcation phase. Conversely, greater time-

decay values translate into a smaller time horizon and,

thus, narrower temporal windows. As a result, the

time ﬁlter is more effective for pruning stale entries

from the inverted lists. Moreover, greater time-decay values
lead to greater candidate-specific thresholds, which,

in turn, improve the pruning power of the size-based

and positional ﬁlters.

We now analyze the results on the real-world

datasets. Figure 2 plots the runtimes for SSTR on

TWITTER and REDDIT. We do not show the results

for SPPJ because it failed due to lack of memory on

these datasets in all settings. Obviously, as SPPJ does

not prune stale entries from the index, it cannot di-

rectly handle the largest datasets in our experimental

setting. Note that we can always reconstruct the in-

verted index, for example after having reached some

space limit. However, this strategy sacriﬁces timeli-

ness, accuracy, or both. While resorting to such batch

processing mode is inevitable in stressful scenarios,

the results show that the SSTR algorithm can nev-

ertheless sustain continuous stream processing much

longer than SPPJ.

Another important observation is that, overall,

SSTR successfully terminates in all settings on real-

world datasets; the only exception is on REDDIT for

the smallest λ value. Moreover, even though those

datasets are larger than DBLP and WIKI, SSTR is up

to two orders of magnitude faster on them. The explanation lies in the timestamp distribution of the real-world datasets, which exhibits more "gaps" compared to the synthetic ones. Hence, the induced temporal windows are sparser on those datasets, which is effectively exploited by the time filter to dynamically keep the inverted lists as short as possible. The other trends remain the same: execution times increase as the similarity threshold or the time-decay parameter decreases, and decrease as either one increases.

5 RELATED WORK

There is a long line of research on efﬁciently answer-

ing set similarity joins (Sarawagi and Kirpal, 2004;

Chaudhuri et al., 2006; Xiao et al., 2011; Vernica

et al., 2010; Ribeiro and Härder, 2011; Quirino et al., 2017; Ribeiro-Júnior et al., 2017; Mann et al., 2016;

Wang et al., 2017). Popular optimizations, such as

size-based filtering, prefix filtering, and positional filtering, were incorporated into our algorithm. Recently, Wang et al. (2017) exploited set relations to improve performance; the key insight is that similar sets produce similar results. However, one of the underlying techniques, the so-called index-level skipping, relies on building the whole inverted index before processing starts and, thus, cannot be

used in our context where new sets are continuously

arriving.

Further, set similarity join has been addressed

in a wide variety of settings, including: distributed

platforms (Vernica et al., 2010; do Carmo Oliveira

et al., 2018); many-core architectures (Quirino et al.,

2017; Ribeiro-Júnior et al., 2017); relational DBMS,

either declaratively in SQL (Ribeiro et al., 2016b)

or within the query engine as a physical operator

(Chaudhuri et al., 2006); cloud environments (Sidney et al., 2015); integrated into clustering algorithms (Ribeiro et al., 2016a; Ribeiro et al., 2018); and probabilistic, either for increasing performance (at the expense of missing some valid results) (Broder et al., 1998; Christiani et al., 2018) or modeling uncertain data (Lian and Chen, 2010). However, none of these previous studies considered similarity join over set streams.

Figure 1: Runtime results of the algorithms SPPJ and SSTR on synthetic datasets.

Figure 2: Runtime results of the SSTR algorithm on real-world datasets.

Previous work on similarity join over streams fo-

cused on data objects represented as vectors, where

the similarity between two vectors is measured
using Euclidean distance (Lian and Chen, 2009;

Lian and Chen, 2011) or cosine (Morales and Gionis,

2016). Lian and Chen (Lian and Chen, 2009) pro-

posed an adaptive approach based on a formal cost

model for multi-way similarity join over streams. The

same authors later addressed similarity joins over un-

certain streams (Lian and Chen, 2011).

Morales and Gionis (Morales and Gionis, 2016)

introduced the notion of time-dependent similarity.

The authors then adapted existing similarity join al-

gorithms for vectors, namely AllPairs (Bayardo et al.,

2007) and L2AP (Anastasiu and Karypis, 2014), to

incorporate this notion and exploit its properties to re-

duce the number of candidate pairs and dynamically

remove stale entries from the inverted index. We fol-

low a similar approach here, but the details of these

optimizations are not directly applicable to our con-

text, as we focus on a stream of data objects repre-

sented as sets.

Processing the entire data of possibly unbounded

streams is clearly infeasible. Therefore, some method

has to be used to limit the portion of stream history

processed at each query evaluation. The sliding win-

dow model is popularly used in streaming similarity

processing (Lian and Chen, 2009; Lian and Chen,

2011; Shen et al., 2014). As already mentioned, only

a ﬁxed amount of recent stream data is computed at

each query evaluation in this model (Babcock et al.,

2002). In contrast, the temporal similarity adopted

here induces a ﬁxed temporal window (i.e., the time

horizon) with a variable amount of stream data.

Streaming similarity search ﬁnds all data objects

that are similar to a given query (Kraus et al., 2017).

To some extent, similarity join can be viewed as a

sequence of searches using each arriving object as a

query object. A fundamental difference in this con-

text is that the threshold is ﬁxed for joins, while it can

vary along distinct queries for searches.

Top-k queries have also been studied in the

streaming setting (Shen et al., 2014; Amagata et al.,

2019). Focusing on streams of vectors, Shen et al.

(Shen et al., 2014) proposed a framework supporting

queries with different similarity functions and win-

dow sizes. Amagata et al. (Amagata et al., 2019)

presented an algorithm for kNN self-join, a type of top-k query that finds the k most similar objects for each object. This work assumes objects represented as sets; however, the dynamic scenario considered is very different:

ferent: instead of a stream of sets, the focus is on a

stream of updates continuously inserting and deleting

elements of existing sets.

Finally, duplicate detection in streams is a well-

studied problem (Metwally et al., 2005; Deng and

Raﬁei, 2006; Dutta et al., 2013). A common ap-

proach to dealing with unbounded streams is to em-

ploy space-preserving, probabilistic data structures,

such as Bloom Filters and Quotient Filters together

with window models. However, these proposals aim

at detecting exact duplicates and, therefore, similarity

matching is not addressed.

6 CONCLUSIONS AND FUTURE WORK

This paper presented a new algorithm called SSTR

for set similarity join over set streams. To the best of

our knowledge, set similarity join has not been previ-

ously investigated in a streaming setting. We adopted

the concept of temporal similarity and exploited its

properties to reduce processing cost and memory us-

age. We reported an extensive experimental study on

synthetic and real-world datasets, whose results con-

ﬁrmed the efﬁciency of our solution. Future work is

mainly oriented towards designing a parallel version

of SSTR and an algorithmic framework for seamless

integration with batch processing models.

ACKNOWLEDGEMENTS

This work was partially supported by the Brazilian agency CAPES.

REFERENCES

Abadi, D. J., Ahmad, Y., Balazinska, M., Çetintemel, U.,

Cherniack, M., Hwang, J., Lindner, W., Maskey, A.,

Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and

Zdonik, S. B. (2005). The Design of the Borealis

Stream Processing Engine. In Proceedings of the Con-

ference on Innovative Data Systems Research, pages

277–289.

Amagata, D., Hara, T., and Xiao, C. (2019). Dynamic Set

kNN Self-Join. In Proceedings of the IEEE Interna-

tional Conference on Data Engineering, pages 818–

829.

Anastasiu, D. C. and Karypis, G. (2014). L2AP: fast co-

sine similarity search with preﬁx L-2 norm bounds. In

Proceedings of the IEEE International Conference on

Data Engineering, pages 784–795.

Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom,

J. (2002). Models and Issues in Data Stream Systems.

In Proceedings of the ACM Symposium on Principles

of Database Systems, pages 1–16.

Baumgartner, J. (2019). Reddit May 2019 Submissions.

Harvard Dataverse.

Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up

All Pairs Similarity Search. In Proceedings of the In-

ternational World Wide Web Conferences, pages 131–

140. ACM.

Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzen-

macher, M. (1998). Min-Wise Independent Permuta-

tions (Extended Abstract). In Proceedings of the ACM

SIGACT Symposium on Theory of Computing, pages

327–336. ACM.

Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi,

S., and Tzoumas, K. (2015). Apache Flink: Stream

and Batch Processing in a Single Engine. IEEE Data

Engineering Bulletin, 38(4):28–38.

Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A Primi-

tive Operator for Similarity Joins in Data Cleaning. In

Proceedings of the IEEE International Conference on

Data Engineering, page 5. IEEE Computer Society.

Christiani, T., Pagh, R., and Sivertsen, J. (2018). Scalable

and Robust Set Similarity Join. In Proceedings of the

IEEE International Conference on Data Engineering,

pages 1240–1243. IEEE Computer Society.

Deng, F. and Raﬁei, D. (2006). Approximately Detecting

Duplicates for Streaming Data using Stable Bloom

Filters. In Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data, pages

25–36.

do Carmo Oliveira, D. J., Borges, F. F., Ribeiro, L. A., and

Cuzzocrea, A. (2018). Set similarity joins with com-

plex expressions on distributed platforms. In Proceed-

ings of the Symposium on Advances in Databases and

Information Systems, pages 216–230.


Dutta, S., Narang, A., and Bera, S. K. (2013). Streaming

quotient ﬁlter: A near optimal approximate duplicate

detection approach for data streams. Proceedings of

the VLDB Endowment, 6(8):589–600.

Kraus, N., Carmel, D., and Keidar, I. (2017). Fishing in

the Stream: Similarity Search over Endless Data. In

Proceedings of the IEEE International Conference on Big Data, pages 964–969.

Lian, X. and Chen, L. (2009). Efﬁcient Similarity Join

over Multiple Stream Time Series. IEEE Transactions

on Knowledge and Data Engineering, 21(11):1544–

1558.

Lian, X. and Chen, L. (2010). Set Similarity Join on Prob-

abilistic Data. Proceedings of the VLDB Endowment,

3(1):650–659.

Lian, X. and Chen, L. (2011). Similarity Join Process-

ing on Uncertain Data Streams. IEEE Transactions

on Knowledge and Data Engineering, 23(11):1718–

1734.

Mann, W., Augsten, N., and Bouros, P. (2016). An Em-

pirical Evaluation of Set Similarity Join Techniques.

Proceedings of the VLDB Endowment, 9(9):636–647.

Metwally, A., Agrawal, D., and El Abbadi, A. (2005). Du-

plicate Detection in Click Streams. In Proceedings of

the International World Wide Web Conferences, pages

12–21.

Morales, G. D. F. and Gionis, A. (2016). Streaming Similar-

ity Self-Join. Proceedings of the VLDB Endowment,

9(10):792–803.

Quirino, R. D., Ribeiro-Júnior, S., Ribeiro, L. A., and Martins, W. S. (2017). fgssjoin: A GPU-based Algorithm

for Set Similarity Joins. In International Conference

on Enterprise Information Systems, pages 152–161.

SCITEPRESS.

Ribeiro, L. A., Cuzzocrea, A., Bezerra, K. A. A., and

do Nascimento, B. H. B. (2016a). Sjclust: To-

wards a framework for integrating similarity join al-

gorithms and clustering. In International Confer-

ence on Enterprise Information Systems, pages 75–80.

SCITEPRESS.

Ribeiro, L. A., Cuzzocrea, A., Bezerra, K. A. A., and

do Nascimento, B. H. B. (2018). SjClust: A Frame-

work for Incorporating Clustering into Set Similarity

Join Algorithms. LNCS Transactions on Large-Scale

Data- and Knowledge-Centered Systems, 38:89–118.

Ribeiro, L. A. and Härder, T. (2011). Generalizing Prefix

Filtering to Improve Set Similarity Joins. Information

Systems, 36(1):62–78.

Ribeiro, L. A., Schneider, N. C., de Souza Inácio, A., Wagner, H. M., and von Wangenheim, A. (2016b). Bridging Database Applications and Declarative Similarity

Matching. Journal of Information and Data Manage-

ment, 7(3):217–232.

Ribeiro-Júnior, S., Quirino, R. D., Ribeiro, L. A., and Martins, W. S. (2017). Fast Parallel Set Similarity Joins

on Many-core Architectures. Journal of Information

and Data Management, 8(3):255–270.

Sarawagi, S. and Kirpal, A. (2004). Efﬁcient Set Joins

on Similarity Predicates. In Proceedings of the ACM

SIGMOD International Conference on Management

of Data, pages 743–754.

Shen, Z., Cheema, M. A., Lin, X., Zhang, W., and Wang,

H. (2014). A Generic Framework for Top-k Pairs and

Top-k Objects Queries over Sliding Windows. IEEE

Transactions on Knowledge and Data Engineering,

26(6):1349–1366.

Sidney, C. F., Mendes, D. S., Ribeiro, L. A., and Härder,

T. (2015). Performance Prediction for Set Similarity

Joins. In Proceedings of the ACM Symposium on Ap-

plied Computing, pages 967–972.

Stonebraker, M., Çetintemel, U., and Zdonik, S. B. (2005).

The 8 Requirements of Real-time Stream Processing.

SIGMOD Record, 34(4):42–47.

Vernica, R., Carey, M. J., and Li, C. (2010). Efﬁcient Paral-

lel Set-similarity Joins using MapReduce. In Proceed-

ings of the ACM SIGMOD International Conference

on Management of Data, pages 495–506. ACM.

Wang, X., Qin, L., Lin, X., Zhang, Y., and Chang, L.

(2017). Leveraging Set Relations in Exact Set Sim-

ilarity Join. Proceedings of the VLDB Endowment,

10(9):925–936.

Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G.

(2011). Efﬁcient Similarity Joins for Near-duplicate

Detection. ACM Transactions on Database Systems,

36(3):15:1–15:41.

ICEIS 2020 - 22nd International Conference on Enterprise Information Systems