SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering

Leonardo Andrade Ribeiro, Alfredo Cuzzocrea, Karen Aline Alves Bezerra, Ben Hur Bahia do Nascimento

Abstract

A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. A popular approach to duplicate identification employs a similarity join to find pairs of similar records, followed by a clustering algorithm that groups together records referring to the same entity. However, the clustering algorithm is used strictly as a post-processing step, which slows overall performance and produces results only at the end of the whole process. In this paper, we propose SJClust, a framework that integrates similarity join and clustering into a single operation. Our approach smoothly accommodates a variety of cluster representation and merging strategies in set similarity join algorithms, while fully leveraging state-of-the-art optimization techniques.
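
The conventional two-step pipeline described above can be illustrated with a minimal sketch. It is not the SJClust framework itself: it assumes whitespace tokenization, Jaccard similarity, a naive quadratic self-join without any filtering optimizations, and connected-components clustering via union-find. The function names (`jaccard`, `similarity_join`, `cluster_pairs`), the sample records, and the threshold are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(records, threshold):
    """Naive self-join: index pairs whose Jaccard similarity >= threshold."""
    # Assumption: records are tokenized on whitespace after lowercasing.
    token_sets = [set(r.lower().split()) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


def cluster_pairs(n, pairs):
    """Post-processing step: group indices into connected components."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the component representative.
            parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical sample data: three variants of one entity plus one other record.
records = [
    "john smith new york",
    "john smith new york city",
    "jon smith new york",
    "mary jones boston",
]
pairs = similarity_join(records, threshold=0.5)
clusters = cluster_pairs(len(records), pairs)
print(pairs)
print(clusters)
```

In this sketch, clustering cannot start until the join has produced every pair, which is the post-processing limitation the abstract points out.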

Paper Citation


in Harvard Style

Ribeiro L., Cuzzocrea A., Bezerra K. and Nascimento B. (2016). SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering. In Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-187-8, pages 75-80. DOI: 10.5220/0005868700750080


in Bibtex Style

@conference{iceis16,
author={Leonardo Andrade Ribeiro and Alfredo Cuzzocrea and Karen Aline Alves Bezerra and Ben Hur Bahia do Nascimento},
title={SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering},
booktitle={Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2016},
pages={75-80},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005868700750080},
isbn={978-989-758-187-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering
SN - 978-989-758-187-8
AU - Ribeiro L.
AU - Cuzzocrea A.
AU - Bezerra K.
AU - Nascimento B.
PY - 2016
SP - 75
EP - 80
DO - 10.5220/0005868700750080