7 CONCLUSIONS AND FUTURE WORK
In this paper, we presented SJClust, a framework to
integrate clustering into set similarity join algorithms.
Our framework provides flexibility and extensibility
to accommodate different clustering methods, while
fully leveraging existing optimization techniques and
avoiding undesirable blocking behavior.
Future work is mainly oriented towards enriching
our framework with advanced features such as
uncertain data management (e.g., (Leung et al., 2013)),
adaptiveness (e.g., (Cannataro et al., 2002)), and
execution time prediction (e.g., (Sidney et al., 2015)).
ACKNOWLEDGEMENTS
This research was partially supported by the Brazilian
agencies CNPq and CAPES.
REFERENCES
Altwaijry, H., Kalashnikov, D. V., and Mehrotra, S. (2013).
Query-driven approach to entity resolution. PVLDB,
6(14):1846–1857.
Altwaijry, H., Mehrotra, S., and Kalashnikov, D. V. (2015).
QuERy: A framework for integrating entity resolution
with query processing. PVLDB, 9(3):120–131.
Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up
all pairs similarity search. In Proc. of the 16th Intl.
Conf. on World Wide Web, pages 131–140.
Cannataro, M., Cuzzocrea, A., Mastroianni, C., Ortale, R.,
and Pugliese, A. (2002). Modeling adaptive hypermedia
with an object-oriented approach and XML. Second
International Workshop on Web Dynamics.
Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A
primitive operator for similarity joins in data cleaning. In
Proc. of the 22nd Intl. Conf. on Data Engineering,
page 5.
Cuzzocrea, A. (2013). Analytics over big data:
Exploring the convergence of datawarehousing, OLAP and
data-intensive cloud infrastructures. In 37th Annual
IEEE Computer Software and Applications Conference,
COMPSAC 2013, Kyoto, Japan, July 22-26,
2013, pages 481–483.
Cuzzocrea, A., Bellatreche, L., and Song, I. (2013a). Data
warehousing and OLAP over big data: current
challenges and future research directions. In Proceedings
of the sixteenth international workshop on Data
warehousing and OLAP, DOLAP 2013, San Francisco, CA,
USA, October 28, 2013, pages 67–70.
Cuzzocrea, A., Saccà, D., and Ullman, J. D. (2013b).
Big data: a research agenda. In 17th International
Database Engineering & Applications Symposium,
IDEAS ’13, Barcelona, Spain - October 09 - 11, 2013,
pages 198–203.
Doan, A., Halevy, A. Y., and Ives, Z. G. (2012). Principles
of Data Integration. Morgan Kaufmann.
Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S.
(2007). Duplicate record detection: A survey. TKDE,
19(1):1–16.
Hassanzadeh, O., Chiang, F., Miller, R. J., and Lee, H. C.
(2009). Framework for evaluating clustering
algorithms in duplicate detection. PVLDB,
2(1):1282–1293.
Idreos, S., Papaemmanouil, O., and Chaudhuri, S. (2015).
Overview of data exploration techniques. In Proc. of
the SIGMOD Conference, pages 277–281.
Kazimianec, M. and Augsten, N. (2011). Pg-skip:
Proximity graph based clustering of long strings. In Proc. of
the DASFAA Conference, pages 31–46.
Koudas, N., Sarawagi, S., and Srivastava, D. (2006). Record
linkage: Similarity measures and algorithms. In Proc.
of the SIGMOD Conference, pages 802–803.
Leung, C. K., Cuzzocrea, A., and Jiang, F. (2013).
Discovering frequent patterns from uncertain data
streams with time-fading and landmark models. T.
Large-Scale Data- and Knowledge-Centered Systems,
8:174–196.
Mazeika, A. and Böhlen, M. H. (2006). Cleansing databases
of misspelled proper nouns. In Proc. of the First Int’l
VLDB Workshop on Clean Databases.
Ribeiro, L. and Härder, T. (2009). Efficient set similarity
joins using min-prefixes. In Proc. of ADBIS
Conference, pages 88–102.
Ribeiro, L. A. and Härder, T. (2011). Generalizing prefix
filtering to improve set similarity joins. Information
Systems, 36(1):62–78.
Sarawagi, S. and Kirpal, A. (2004). Efficient set joins on
similarity predicates. In Proc. of the SIGMOD
Conference, pages 743–754.
Schneider, N. C., Ribeiro, L. A., de Souza Inácio, A.,
Wagner, H. M., and von Wangenheim, A. (2015).
Simdatamapper: An architectural pattern to integrate
declarative similarity matching into database
applications. In Proc. of the SBBD Conference,
pages 967–972.
Sidney, C. F., Mendes, D. S., Ribeiro, L. A., and Härder,
T. (2015). Performance prediction for set similarity
joins. In Proc. of the ACM Symposium on Applied
Computing, pages 967–972.
Wang, J., Li, G., and Feng, J. (2012). Can we beat the
prefix filtering?: an adaptive framework for similarity
join and search. In Proc. of the SIGMOD Conference,
pages 85–96.
Xiao, C., Wang, W., Lin, X., and Yu, J. X. (2008). Efficient
similarity joins for near duplicate detection. In Proc.
of the 17th Intl. Conf. on World Wide Web,
pages 131–140.
ICEIS 2016 - 18th International Conference on Enterprise Information Systems