in relation to the state-of-the-art algorithms in the literature. Our
experiments on standard datasets revealed good speedups, with scalable
behavior as the size of the datasets increases. Despite the good results
reported in this paper, many improvements are possible. We did not explore
some optimizations specific to many-core architectures, such as the use of
so-called shared memory in CUDA or local memory in OpenCL (both are parallel
development frameworks), or memory coalescing. One observation from this
research is the inherently sequential nature of positional filtering
techniques, which hinders a higher level of parallelism. In future work, we
plan to remove the positional filtering techniques from our filtering phase
and to increase the degree of parallelism by assigning one processing core to
each token, instead of to each set, so that coalesced memory accesses become
possible. We hope that the gain from the higher degree of parallelism
compensates for the loss in filtering capacity. We also plan to use
shared/local memory to increase locality and, hence, achieve greater
speedups. Finally, we plan to implement a multi-GPU version (to run on GPU
clusters) and to process bigger datasets.
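The expected effect of moving from one core per set to one core per token on memory coalescing can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the paper's implementation: the flat back-to-back token layout, the warp size of 32, the 4-byte tokens, the 128-byte memory segments, and all function names are hypothetical choices made for illustration.

```python
# Simplified model of which token indices one warp reads in a single step.
# Assumption: all sets are stored back to back in one flat token array.

def build_offsets(set_sizes):
    """Return the start offset of each set in the flat token array."""
    offsets = [0]
    for size in set_sizes:
        offsets.append(offsets[-1] + size)
    return offsets

def addresses_set_per_thread(offsets, step, warp_size=32):
    """Thread t scans set t; at iteration `step` it reads that set's step-th token."""
    addrs = []
    for t in range(min(warp_size, len(offsets) - 1)):
        if step < offsets[t + 1] - offsets[t]:
            addrs.append(offsets[t] + step)
    return addrs

def addresses_token_per_thread(offsets, step, warp_size=32):
    """Thread t reads the t-th token of the step-th block of consecutive tokens."""
    total = offsets[-1]
    base = step * warp_size
    return list(range(base, min(base + warp_size, total)))

def segments_touched(addrs, token_bytes=4, segment_bytes=128):
    """Count the distinct memory segments that cover the given token indices."""
    return len({(a * token_bytes) // segment_bytes for a in addrs})

# Hypothetical input: 32 sets of 40 tokens each.
offsets = build_offsets([40] * 32)
per_set = segments_touched(addresses_set_per_thread(offsets, step=0))
per_token = segments_touched(addresses_token_per_thread(offsets, step=0))
```

In this model, the one-thread-per-set mapping makes neighboring threads read tokens that are a whole set apart, so each read falls in a different segment. The one-thread-per-token mapping makes them read adjacent tokens that share a segment. The model says nothing about the filtering capacity lost by dropping positional filtering, which is the other side of the trade-off discussed above.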
fgssjoin: A GPU-based Algorithm for Set Similarity Joins