processing (Lian and Chen, 2009; Lian and Chen,
2011; Shen et al., 2014). As already mentioned, only
a fixed amount of recent stream data is processed at
each query evaluation in this model (Babcock et al.,
2002). In contrast, the temporal similarity adopted
here induces a fixed temporal window (i.e., the time
horizon) with a variable amount of stream data.
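The contrast between the two window models can be illustrated with a minimal sketch. It is not part of SSTR: the function names, the (timestamp, item) stream layout, and the parameters n and horizon are assumptions made only for illustration. A count-based window retains a fixed number of recent items, whereas a time-based window retains whatever arrived within the time horizon, so its size varies with the arrival rate.

```python
from collections import deque

def count_window(stream, n):
    """For each arrival, yield the last n items (fixed amount of data)."""
    window = deque(maxlen=n)  # the oldest item is evicted automatically
    for ts, item in stream:
        window.append((ts, item))
        yield list(window)

def time_window(stream, horizon):
    """For each arrival, yield the items at most `horizon` time units old."""
    window = deque()
    for ts, item in stream:
        window.append((ts, item))
        # Evict items that fell out of the time horizon.
        while window and ts - window[0][0] > horizon:
            window.popleft()
        yield list(window)

# Hypothetical stream of (timestamp, item) pairs with a burst and a gap.
stream = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (11, "e")]
count_sizes = [len(w) for w in count_window(stream, 3)]
time_sizes = [len(w) for w in time_window(stream, 2)]
print(count_sizes)  # [1, 2, 3, 3, 3]
print(time_sizes)   # [1, 2, 3, 1, 2]
```

After the gap between timestamps 2 and 10, the time-based window shrinks to a single item, while the count-based window still holds three items regardless of their age.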
Streaming similarity search finds all data objects
that are similar to a given query (Kraus et al., 2017).
To some extent, similarity join can be viewed as a
sequence of searches using each arriving object as a
query object. A fundamental difference in this context
is that the threshold is fixed for joins, whereas it can
vary across queries for searches.
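The relationship between the two operations can be sketched as follows. This is a naive illustration under stated assumptions, not the algorithm proposed in this paper: Jaccard is assumed as the similarity function, candidates are found by a linear scan, temporal aspects are ignored, and the names jaccard, search, and stream_join are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (assumed similarity function)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def search(seen, query, threshold):
    """Similarity search: indexes of all seen sets similar to the query."""
    return [i for i, s in enumerate(seen) if jaccard(s, query) >= threshold]

def stream_join(stream, threshold):
    """Similarity join: one search per arriving set, with a fixed threshold."""
    seen, pairs = [], []
    for j, obj in enumerate(stream):
        for i in search(seen, obj, threshold):
            pairs.append((i, j))
        seen.append(obj)
    return pairs

stream = [frozenset("abc"), frozenset("abd"), frozenset("xyz"), frozenset("abcd")]
pairs = stream_join(stream, 0.5)
print(pairs)  # [(0, 1), (0, 3), (1, 3)]

# A standalone search may use a different threshold for each query.
strict = search(stream, frozenset("abc"), 0.9)
print(strict)  # [0]
```

The join calls search with the same threshold for every arriving set, whereas independent searches are free to choose a threshold per query.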
Top-k queries have also been studied in the
streaming setting (Shen et al., 2014; Amagata et al.,
2019). Focusing on streams of vectors, Shen et al.
(Shen et al., 2014) proposed a framework supporting
queries with different similarity functions and win-
dow sizes. Amagata et al. (Amagata et al., 2019)
presented an algorithm for kNN self-join, a type of top-k
query that finds the k most similar objects for each object.
That work also assumes objects represented as sets;
however, the dynamic scenario considered is very different:
instead of a stream of sets, the focus is on a
stream of updates continuously inserting and deleting
elements of existing sets.
Finally, duplicate detection in streams is a well-
studied problem (Metwally et al., 2005; Deng and
Rafiei, 2006; Dutta et al., 2013). A common approach
to dealing with unbounded streams is to employ
space-efficient, probabilistic data structures,
such as Bloom Filters and Quotient Filters, together
with window models. However, these proposals aim
at detecting exact duplicates and, therefore, similarity
matching is not addressed.
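The limitation can be illustrated with a minimal Bloom filter sketch. It does not reproduce any of the cited proposals: the class, its parameters (num_bits, num_hashes), the SHA-256-based hashing, and the canonical string key for a set are all assumptions made for illustration, and the window model is omitted.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))

def canonical(s):
    """Hypothetical canonical key: equal sets map to equal strings."""
    return " ".join(sorted(s))

bf = BloomFilter()
bf.add(canonical({"a", "b", "c"}))
exact_seen = canonical({"c", "b", "a"}) in bf  # same set, hence same key
near_seen = canonical({"a", "b", "d"}) in bf   # similar set, different key
```

An exact duplicate always tests positive because it maps to the same key. A near-duplicate maps to an unrelated key, so it is reported only by chance as a false positive; the filter carries no notion of similarity.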
6 CONCLUSIONS AND FUTURE WORK
This paper presented a new algorithm called SSTR
for set similarity join over set streams. To the best of
our knowledge, set similarity join has not been previ-
ously investigated in a streaming setting. We adopted
the concept of temporal similarity and exploited its
properties to reduce processing cost and memory us-
age. We reported an extensive experimental study on
synthetic and real-world datasets, whose results con-
firmed the efficiency of our solution. Future work is
mainly oriented towards designing a parallel version
of SSTR and an algorithmic framework for seamless
integration with batch processing models.
ACKNOWLEDGEMENTS
This work was partially supported by the Brazilian
agency CAPES.
REFERENCES
Abadi, D. J., Ahmad, Y., Balazinska, M., Çetintemel, U.,
Cherniack, M., Hwang, J., Lindner, W., Maskey, A.,
Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and
Zdonik, S. B. (2005). The Design of the Borealis
Stream Processing Engine. In Proceedings of the Con-
ference on Innovative Data Systems Research, pages
277–289.
Amagata, D., Hara, T., and Xiao, C. (2019). Dynamic Set
kNN Self-Join. In Proceedings of the IEEE Interna-
tional Conference on Data Engineering, pages 818–
829.
Anastasiu, D. C. and Karypis, G. (2014). L2AP: fast co-
sine similarity search with prefix L-2 norm bounds. In
Proceedings of the IEEE International Conference on
Data Engineering, pages 784–795.
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom,
J. (2002). Models and Issues in Data Stream Systems.
In Proceedings of the ACM Symposium on Principles
of Database Systems, pages 1–16.
Baumgartner, J. (2019). Reddit May 2019 Submissions.
Harvard Dataverse.
Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up
All Pairs Similarity Search. In Proceedings of the In-
ternational World Wide Web Conferences, pages 131–
140. ACM.
Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzen-
macher, M. (1998). Min-Wise Independent Permuta-
tions (Extended Abstract). In Proceedings of the ACM
SIGACT Symposium on Theory of Computing, pages
327–336. ACM.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi,
S., and Tzoumas, K. (2015). Apache Flink™: Stream
and Batch Processing in a Single Engine. IEEE Data
Engineering Bulletin, 38(4):28–38.
Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A Primi-
tive Operator for Similarity Joins in Data Cleaning. In
Proceedings of the IEEE International Conference on
Data Engineering, page 5. IEEE Computer Society.
Christiani, T., Pagh, R., and Sivertsen, J. (2018). Scalable
and Robust Set Similarity Join. In Proceedings of the
IEEE International Conference on Data Engineering,
pages 1240–1243. IEEE Computer Society.
Deng, F. and Rafiei, D. (2006). Approximately Detecting
Duplicates for Streaming Data using Stable Bloom
Filters. In Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data, pages
25–36.
do Carmo Oliveira, D. J., Borges, F. F., Ribeiro, L. A., and
Cuzzocrea, A. (2018). Set similarity joins with com-
plex expressions on distributed platforms. In Proceed-
ings of the Symposium on Advances in Databases and
Information Systems, pages 216–230.
SSTR: Set Similarity Join over Stream Data