Authors:
Luiz Olmes Carvalho; Lucio F. D. Santos; Willian D. Oliveira; Agma J. M. Traina and Caetano Traina Jr.
Affiliation:
University of São Paulo, Brazil
Keyword(s):
Similarity Search, Similarity Join, Query Operators, Wide-join, Near-duplicate Detection.
Related Ontology Subjects/Areas/Topics:
Databases and Information Systems Integration; Enterprise Information Systems; Query Languages and Query Processing
Abstract:
Crowdsourced information is increasingly employed to improve and support decision making in emergency situations. However, the gathered records quickly become too similar to one another, and handling many similar reports adds no valuable knowledge to assist the personnel at the control center in their decision-making tasks. The usual approaches to detecting and handling such near-duplicate data rely on costly twofold processing. Aiming to reduce the cost and improve duplicate detection, we developed a framework model based on the similarity wide-join database operator. We extended the wide-join definition so that it overcomes its restrictions and also handles the near-duplicate detection task. In this paper, we also provide an efficient pivot-based algorithm that speeds up the entire process and enables retrieving the most similar elements in a single pass. Experiments using real datasets show that our framework is up to three orders of magnitude faster than the competing techniques in the literature, while also improving the quality of the result by about 35 percent.