difference was 0.016. In the Single-Link algorithm
(Figure 12), we did not observe any relevant difference
among the quality results.
5 RELATED WORK
Recent research has focused on the use of queries,
indexing techniques or both to reduce the volume of
data to be processed (Bhattacharya and Getoor, 2007;
Altwaijry et al., 2013; Christen, 2012a; Ramadan et
al., 2015; Vieira, 2016). Different indexing
techniques are summarized in (Christen, 2012a).
However, most of these techniques target the
traditional ER process with batch algorithms, and only
a few studies focus on incremental ER (Gruenheid
et al., 2014; Whang et al., 2013; Altowim et al., 2014;
Whang and Garcia-Molina, 2014).
In (Bhattacharya and Getoor, 2007; Altwaijry et
al., 2013), query-time ER is proposed, but
indexing to reuse previous classifications is not
considered. In (Whang et al., 2013; Gruenheid et al.,
2014), an incremental ER approach is proposed, but
the indexing is static and the ER is not query-driven.
In (Ramadan et al., 2015), dynamic indexes are
proposed. Both papers focused on information
retrieval and not on the data integration process
(Christen, 2012). In addition, only attribute and
similarity values are indexed, not clusters of
tuples that refer to the same real-world entity.
Our indexes differ in three aspects. First,
our focus is the data integration process and
incremental ER over query results. Second, we
propose to index tuple identifiers rather than attribute
values. In scenarios with a large volume of data, using
multiple attributes for similarity index functions can
be very costly and time-consuming (Christen, 2012;
Ribeiro et al., 2016). Third, we propose to index
similarity values and previous ER results for tuples
from multiple data sources.
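The idea of keying both indexes by tuple identifiers can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the class layout, the method names, the helper cached_similarity, and the (source, local id) form of a tuple identifier are all assumptions made for illustration.

```python
class SimilarityIndex:
    """Hypothetical sketch: caches similarity values keyed by an
    unordered pair of tuple identifiers, not by attribute values."""

    def __init__(self):
        self._values = {}

    def get(self, id_a, id_b):
        # frozenset makes the key independent of argument order.
        return self._values.get(frozenset((id_a, id_b)))

    def put(self, id_a, id_b, value):
        self._values[frozenset((id_a, id_b))] = value


class ClusterIndex:
    """Hypothetical sketch: maps each tuple identifier to the cluster
    it was assigned to by a previous ER run."""

    def __init__(self):
        self._cluster_of = {}
        self._members = {}

    def assign(self, tuple_id, cluster_id):
        self._cluster_of[tuple_id] = cluster_id
        self._members.setdefault(cluster_id, set()).add(tuple_id)

    def cluster_of(self, tuple_id):
        # None means the tuple has not been resolved before.
        return self._cluster_of.get(tuple_id)

    def members(self, cluster_id):
        return set(self._members.get(cluster_id, ()))


def cached_similarity(index, id_a, id_b, compute):
    """Return the stored similarity, calling compute only on a miss."""
    value = index.get(id_a, id_b)
    if value is None:
        value = compute(id_a, id_b)
        index.put(id_a, id_b, value)
    return value
```

Under these assumptions, a tuple identifier such as ("src1", 10) names a tuple from one of the data sources, so a later query that returns the same pair of tuples can reuse the stored similarity value and cluster assignment without comparing attribute values again.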
6 CONCLUSIONS
In this paper, we presented two indexes for incremental
ER over query results: the Cluster Index and the
Similarity Index. We evaluated the quality and the
efficiency of the ER process and investigated the
impact of the Similarity Index size on the incremental
ER process. We showed, on a real dataset, that our
indexes are suitable for the incremental ER process.
The incremental ER achieved the same quality as
traditional processes without indexes, but was more
efficient. As future work, we intend to analyze the
indexes with other datasets and to evaluate other
incremental ER algorithms.
REFERENCES
Altowim, Y.; Kalashnikov, D. V.; Mehrotra, S. (2014).
Progressive Approach to Relational Entity Resolution.
In: VLDB. Hangzhou, China.
Altwaijry, H.; Kalashnikov, D. V.; Mehrotra, S. (2013).
Query-Driven Approach to Entity Resolution. In:
VLDB. Trento, Italy.
Bhattacharya, I.; Getoor, L. (2007). Query-time Entity
Resolution. Journal of Artificial Intelligence Research.
V 30, issue 1, pp 621-657.
Bhattacharya, I.; Getoor, L. (2007a). Entity Resolution In
Graphs. In: Mining Graph Data. John Wiley & Sons,
Inc.
CDDB (2016). Available at: http://hpi.de/naumann/
projects/repeatability/datasets/cd-datasets.html.
Christen, P. (2008). Febrl – An Open Source Data Cleaning,
Deduplication and Record Linkage System with a
Graphical User Interface. In: KDD. Las Vegas, USA.
Christen, P. (2012). Data Matching: Concepts and
Techniques for Record Linkage, Entity Resolution, and
Duplicate Detection. Springer.
Christen, P. (2012a). A Survey of Indexing Techniques for
Scalable Record Linkage and Deduplication. In: TKDE.
V 24, issue 9, pp 1537-1555.
FreeDB (2016). Available at: http://www.freedb.org/
Gruenheid, A.; Dong, X. L.; Srivastava, D. (2014).
Incremental Record Linkage. In: VLDB. Hangzhou,
China.
Guo, S.; Dong, X.; Srivastava, D.; Zajac, R. (2010). Record
linkage with uniqueness constraints and erroneous
values. In: PVLDB. Singapore.
Ramadan, B. et al. (2015). Dynamic Sorted Neighbourhood
Indexing for Real-Time Entity Resolution. In: Journal
of Data and Information Quality. V 6, issue 4, nº 15.
Ribeiro, L. A. et al. (2016). SJClust: Towards a Framework
for Integrating Similarity Join Algorithms and
Clustering. In: ICEIS. Rome, Italy.
Su, W.; Wang, J.; Lochovsky, F. H. (2010). Record
Matching Over Query Results from Multiple Web
Databases. In: TKDE. V 22, issue 4, pp 578-589.
Tan, P.; Steinbach, M.; Kumar, V. (2006). Introduction to
Data Mining. Pearson.
Vieira, P. K. M.; Salgado, A. C.; Lóscio, B. F. (2016). A
Query-driven and Incremental Process for Entity
Resolution. In: AMW. Panama City, Panama.
Whang, S. E.; Marmaros, D.; Garcia-Molina, H. (2013).
Pay-As-You-Go Entity Resolution. In: TKDE. V 25,
issue 5, pp 1111-1124.
Whang, S. E.; Garcia-Molina, H. (2014). Incremental entity
resolution on rules and data. In: VLDB Journal. V 23,
issue 1, pp 77-102.