fgssjoin: A GPU-based Algorithm for Set Similarity Joins
Rafael D. Quirino, Sidney R. Junior, Leonardo A. Ribeiro, Wellington S. Martins
2017
Abstract
Set similarity join is a core operation for text data integration, cleaning, and mining. Most state-of-the-art solutions rely on inherently sequential, CPU-based algorithms. In this paper, we propose a parallel algorithm for the set similarity join problem that harnesses the power of GPU systems through filtering techniques and divide-and-conquer strategies and scales well with data size. Experiments show substantial speedups over the fastest algorithms in the literature.
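To make the problem statement concrete, the following is a minimal CPU-side sketch of what a set similarity join computes. It is not the fgssjoin algorithm: it assumes Jaccard similarity as the similarity function (the abstract does not name one), uses hypothetical helper names (`jaccard`, `naive_similarity_join`), and applies only a simple size filter as one illustration of a filtering technique.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_similarity_join(sets, threshold):
    """Return all index pairs (i, j), i < j, with Jaccard similarity >= threshold.

    Illustrative quadratic baseline (an assumption for exposition),
    not the GPU algorithm proposed in the paper.
    """
    result = []
    for (i, a), (j, b) in combinations(enumerate(sets), 2):
        # Size filter: Jaccard >= t implies min(|a|, |b|) >= t * max(|a|, |b|),
        # so pairs with very different sizes are skipped without verification.
        if min(len(a), len(b)) < threshold * max(len(a), len(b)):
            continue
        if jaccard(a, b) >= threshold:
            result.append((i, j))
    return result


# Made-up token sets, used only to exercise the sketch.
records = [
    frozenset("gpu based set similarity join".split()),
    frozenset("gpu based set similarity joins".split()),
    frozenset("principles of data integration".split()),
]
pairs = naive_similarity_join(records, 0.6)
print(pairs)
```

The nested comparison is quadratic in the number of sets, which is the cost that filtering and parallel execution aim to reduce.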
Paper Citation
in Harvard Style
Quirino R., Junior S., Ribeiro L. and Martins W. (2017). fgssjoin: A GPU-based Algorithm for Set Similarity Joins. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 152-161. DOI: 10.5220/0006339001520161
in Bibtex Style
@conference{iceis17,
author={Rafael D. Quirino and Sidney R. Junior and Leonardo A. Ribeiro and Wellington S. Martins},
title={fgssjoin: A GPU-based Algorithm for Set Similarity Joins},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={152-161},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006339001520161},
isbn={978-989-758-247-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - fgssjoin: A GPU-based Algorithm for Set Similarity Joins
SN - 978-989-758-247-9
AU - Quirino R.
AU - Junior S.
AU - Ribeiro L.
AU - Martins W.
PY - 2017
SP - 152
EP - 161
DO - 10.5220/0006339001520161