A Strategy for Selecting Relevant Attributes for Entity Resolution in Data Integration Systems

Gabrielle Karine Canalle, Bernadette Farias Lóscio, Ana Carolina Salgado

Abstract

Data integration is essential for achieving a unified view of data stored in heterogeneous, distributed data sources. A key step in this process is Entity Resolution, which consists of identifying instances that refer to the same real-world entity. In general, similarity functions are used to discover equivalent instances. The quality of the Entity Resolution result depends directly on the set of attributes selected for comparison, yet selecting those attributes can be challenging. This work proposes a strategy for selecting the relevant attributes to be considered in Entity Resolution, more precisely in the instance matching phase. The strategy considers characteristics of the attributes, such as the number of duplicated and null values, to identify the ones most relevant to instance matching. In our experiments, the proposed strategy achieved good results: the attributes classified as relevant were the ones that contributed to finding the greatest number of true matches with few incorrect matches.
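
To make the idea of profiling attributes by their duplicated and null values concrete, the following is a minimal Python sketch. It is not the authors' method: the abstract does not state how the characteristics are measured or combined, so the function names (`attribute_profile`, `relevance_score`, `rank_attributes`) and the scoring formula (completeness multiplied by distinctness) are assumptions made only for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Optional[Any]]


def attribute_profile(records: List[Record], attribute: str) -> Dict[str, float]:
    """Measure how many values of an attribute are null and how many are distinct."""
    values = [record.get(attribute) for record in records]
    # Treat None and the empty string as null values (an assumption).
    non_null = [v for v in values if v is not None and v != ""]
    null_ratio = 1 - len(non_null) / len(values) if values else 1.0
    # Share of distinct values among the non-null ones; a low share means
    # many repeated (duplicated) values.
    distinct_ratio = len(set(non_null)) / len(non_null) if non_null else 0.0
    return {"null_ratio": null_ratio, "distinct_ratio": distinct_ratio}


def relevance_score(profile: Dict[str, float]) -> float:
    """Hypothetical score: attributes that are complete and discriminating rank higher."""
    return (1 - profile["null_ratio"]) * profile["distinct_ratio"]


def rank_attributes(records: List[Record], attributes: List[str]) -> List[Tuple[str, float]]:
    """Return (attribute, score) pairs sorted from most to least relevant."""
    scored = [(a, relevance_score(attribute_profile(records, a))) for a in attributes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under this assumed score, an attribute that is mostly null or that repeats the same value across most instances ranks low, since comparing it says little about whether two instances refer to the same entity. Any threshold for labelling an attribute as relevant would have to be chosen empirically.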


Paper Citation


in Harvard Style

Canalle G. K., Lóscio B. and Salgado A. (2017). A Strategy for Selecting Relevant Attributes for Entity Resolution in Data Integration Systems. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 80-88. DOI: 10.5220/0006316100800088


in Bibtex Style

@conference{iceis17,
author={Gabrielle Karine Canalle and Bernadette Farias Lóscio and Ana Carolina Salgado},
title={A Strategy for Selecting Relevant Attributes for Entity Resolution in Data Integration Systems},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={80-88},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006316100800088},
isbn={978-989-758-247-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - A Strategy for Selecting Relevant Attributes for Entity Resolution in Data Integration Systems
SN - 978-989-758-247-9
AU - Canalle G. K.
AU - Lóscio B.
AU - Salgado A.
PY - 2017
SP - 80
EP - 88
DO - 10.5220/0006316100800088