Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning

Eduardo N. Borges, Rafael F. Pinheiro, Graçaliz P. Dimuro

2017

Abstract

This paper presents a method that identifies duplicate contacts, i.e., records representing the same person or organization, automatically collected from multiple data sources. Contacts are compared using similarity functions, whose scores are combined by a classification model based on decision trees, avoiding the need for an expert to manually configure similarity thresholds. The experiments show that the proposed method correctly identified up to 92% of duplicate contacts.
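The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity functions (difflib's ratio and a token Jaccard), the contact fields `name` and `email`, the sample contacts, and the helper names `features`, `fit_stump`, and `predict` are all assumptions made for illustration. The classifier is a one-level decision tree (a stump) that learns its threshold from labeled pairs, standing in for the full decision-tree model the paper describes.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity in [0, 1] (difflib ratio; an assumed stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Jaccard similarity over whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def features(c1, c2):
    """Similarity scores for one pair of contacts (hypothetical field names)."""
    return [
        char_similarity(c1["name"], c2["name"]),
        token_jaccard(c1["name"], c2["name"]),
        char_similarity(c1["email"], c2["email"]),
    ]


def fit_stump(X, y):
    """Learn a one-level decision tree: the (feature, threshold) pair
    with the fewest errors on the labeled training pairs."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct scores.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            errors = sum((row[j] >= t) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    if best is None:
        raise ValueError("training data has no feature with two distinct values")
    return best[1], best[2]


def predict(stump, x):
    """Classify a feature vector as duplicate (True) or not (False)."""
    j, t = stump
    return x[j] >= t


# Hypothetical labeled pairs: (contact, contact, is_duplicate).
john = {"name": "John Smith", "email": "john.smith@example.com"}
john2 = {"name": "Smith, John", "email": "john.smith@example.com"}
maria = {"name": "Maria Souza", "email": "maria@example.com"}
maria2 = {"name": "Maria S. Souza", "email": "maria@example.com"}
ana = {"name": "Ana Lima", "email": "ana.lima@example.org"}
carlos = {"name": "Carlos Pereira", "email": "cp@example.net"}

pairs = [(john, john2, True), (maria, maria2, True),
         (john, maria, False), (ana, carlos, False)]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

stump = fit_stump(X, y)  # the threshold is learned, not set by hand
```

Calling `predict(stump, features(c1, c2))` then labels a new pair. The point of the sketch is only that the decision boundary comes from the labeled examples rather than from a manually tuned threshold; a real model would use more fields, more similarity functions, and a deeper tree.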



Paper Citation


in Harvard Style

Borges E., Pinheiro R. and Dimuro G. (2017). Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 64-72. DOI: 10.5220/0006275100640072


in Bibtex Style

@conference{iceis17,
author={Eduardo N. Borges and Rafael F. Pinheiro and Graçaliz P. Dimuro},
title={Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={64-72},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006275100640072},
isbn={978-989-758-247-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning
SN - 978-989-758-247-9
AU - Borges E.
AU - Pinheiro R.
AU - Dimuro G.
PY - 2017
SP - 64
EP - 72
DO - 10.5220/0006275100640072