processing (Lian and Chen, 2009; Lian and Chen,
2011; Shen et al., 2014). As already mentioned, only
a fixed amount of recent stream data is processed at
each query evaluation in this model (Babcock et al.,
2002). In contrast, the temporal similarity adopted
here induces a fixed temporal window (i.e., the time
horizon) with a variable amount of stream data.
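The contrast between the two window models can be illustrated with a minimal sketch. The function names, the toy stream, and the parameters `size` and `horizon` are hypothetical and chosen only for illustration; this is not the implementation used in this paper.

```python
from collections import deque

def count_window(stream, size):
    """Count-based sliding window: keep at most the `size` most recent items."""
    window = deque(maxlen=size)
    for timestamp, item in stream:
        window.append((timestamp, item))
        yield list(window)

def time_window(stream, horizon):
    """Time-based window: keep every item no older than `horizon` time units."""
    window = deque()
    for timestamp, item in stream:
        window.append((timestamp, item))
        # Evict items whose age exceeds the time horizon.
        while window and timestamp - window[0][0] > horizon:
            window.popleft()
        yield list(window)

# Toy stream of (timestamp, item) pairs with a burst followed by a gap.
stream = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (11, "e")]
count_sizes = [len(w) for w in count_window(stream, 3)]
time_sizes = [len(w) for w in time_window(stream, 5)]
print(count_sizes)  # [1, 2, 3, 3, 3]
print(time_sizes)   # [1, 2, 3, 1, 2]
```

Once full, the count-based window always holds three items, whereas the number of items within the time horizon follows the arrival rate of the stream.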
Streaming similarity search finds all data objects
that are similar to a given query (Kraus et al., 2017).
To some extent, similarity join can be viewed as a
sequence of searches using each arriving object as a
query object. A fundamental difference in this
context is that the threshold is fixed for joins, whereas
it can vary across queries for searches.
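The view of a join as a sequence of searches can be sketched as follows. This is a naive illustration assuming Jaccard similarity over sets and a linear scan instead of an index; the helper names `jaccard`, `search`, and `stream_join` are hypothetical and do not correspond to the algorithms of the cited works.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def search(indexed, query, threshold):
    """Similarity search: positions of all indexed sets similar to `query`.
    The threshold is a per-query parameter."""
    return [i for i, s in enumerate(indexed) if jaccard(s, query) >= threshold]

def stream_join(stream, threshold):
    """Similarity join: each arriving set is used as a query against the
    sets seen so far, always with the same fixed threshold."""
    seen, pairs = [], []
    for obj in stream:
        for i in search(seen, obj, threshold):
            pairs.append((i, len(seen)))
        seen.append(obj)
    return pairs

sets = [frozenset("abcd"), frozenset("abce"), frozenset("xyz"), frozenset("abcd")]
pairs = stream_join(sets, 0.5)
print(pairs)  # [(0, 1), (0, 3), (1, 3)]

# A search, in contrast, may use a different threshold for each query.
strict = search(sets, frozenset("abcd"), 0.9)
print(strict)  # [0, 3]
```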
Top-k queries have also been studied in the
streaming setting (Shen et al., 2014; Amagata et al.,
2019). Focusing on streams of vectors, Shen et al.
(2014) proposed a framework supporting queries
with different similarity functions and window
sizes. Amagata et al. (2019) presented an algorithm
for kNN self-join, a type of top-k query that finds
the k most similar objects for each object. That work
also assumes objects represented as sets; however,
the dynamic scenario considered is very different:
instead of a stream of sets, the focus is on a
stream of updates continuously inserting and deleting
elements of existing sets.
Finally, duplicate detection in streams is a well-
studied problem (Metwally et al., 2005; Deng and
Rafiei, 2006; Dutta et al., 2013). A common approach
to dealing with unbounded streams is to employ
space-efficient probabilistic data structures, such as
Bloom filters and quotient filters, together with
window models. However, these proposals aim
at detecting exact duplicates and, therefore, similarity
matching is not addressed.
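A minimal sketch of why such structures capture only exact duplicates is given below. It assumes a textbook Bloom filter with hypothetical parameters (`num_bits`, `num_hashes`) and a string encoding of each set as the key; it is not the design of any of the cited proposals.

```python
import hashlib

class BloomFilter:
    """Textbook Bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive `num_hashes` bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def key(s):
    """Canonical string encoding of a set (an assumption of this sketch)."""
    return " ".join(sorted(s))

seen = BloomFilter()
seen.add(key({"a", "b", "c", "d"}))
exact_hit = seen.might_contain(key({"d", "c", "b", "a"}))
near_hit = seen.might_contain(key({"a", "b", "c", "e"}))
print(exact_hit)  # True: the same set always maps to the same bits
```

Because the hash positions of `{a, b, c, e}` are unrelated to those of `{a, b, c, d}`, the value of `near_hit` reflects only a chance collision, not the high overlap between the two sets.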
This paper presented a new algorithm called SSTR
for set similarity join over set streams. To the best of
our knowledge, set similarity join has not been previ-
ously investigated in a streaming setting. We adopted
the concept of temporal similarity and exploited its
properties to reduce processing cost and memory us-
age. We reported an extensive experimental study on
synthetic and real-world datasets, whose results con-
firmed the efficiency of our solution. Future work is
mainly oriented towards designing a parallel version
of SSTR and an algorithmic framework for seamless
integration with batch processing models.
This work was partially supported by the Brazilian
agency CAPES.
