Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning
Eduardo N. Borges, Rafael F. Pinheiro, Graçaliz P. Dimuro
2017
Abstract
This paper presents a method that identifies duplicate contacts, i.e., records representing the same person or organization, automatically collected from multiple data sources. Contacts are compared using similarity functions, whose scores are combined by a classification model based on decision trees, avoiding the need for an expert to manually configure similarity thresholds. The experiments show that the proposed method correctly identified up to 92% of duplicate contacts.
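The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity functions (difflib's ratio and token-level Jaccard), the contact fields (`name`, `email`), the toy training pairs, and the one-level decision tree (a "stump") are all assumptions chosen for brevity. The point it shows is that the decision threshold is learned from labeled pairs instead of being set by hand.

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    """Character-level similarity in [0, 1] (difflib ratio, an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_jaccard(a, b):
    """Jaccard coefficient over whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def features(c1, c2):
    """Similarity scores for one pair of contacts (hypothetical fields)."""
    return [
        char_similarity(c1["name"], c2["name"]),
        token_jaccard(c1["name"], c2["name"]),
        char_similarity(c1["email"], c2["email"]),
    ]

def train_stump(X, y):
    """Learn a one-level decision tree: the (feature, threshold) split
    with the fewest errors on the labeled training pairs."""
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between consecutive observed scores.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            errors = sum((row[f] > t) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    if best is None:
        raise ValueError("no feature varies across the training pairs")
    return best[1], best[2]

def predict(stump, x):
    """Classify a feature vector as duplicate (True) or not (False)."""
    f, t = stump
    return x[f] > t

# Toy labeled pairs (invented for illustration): True means "same contact".
training_pairs = [
    ({"name": "Maria Silva", "email": "maria@example.com"},
     {"name": "Maria da Silva", "email": "maria@example.com"}, True),
    ({"name": "João Souza", "email": "joao@example.com"},
     {"name": "Joao Souza", "email": "joao.souza@example.com"}, True),
    ({"name": "Maria Silva", "email": "maria@example.com"},
     {"name": "Pedro Alves", "email": "pedro@example.com"}, False),
    ({"name": "Ana Costa", "email": "ana@example.com"},
     {"name": "Carlos Lima", "email": "clima@example.org"}, False),
]

X = [features(a, b) for a, b, _ in training_pairs]
y = [label for _, _, label in training_pairs]
stump = train_stump(X, y)

def is_duplicate(c1, c2):
    """Apply the learned split to a new pair of contacts."""
    return predict(stump, features(c1, c2))
```

A full decision tree learner such as C4.5 applies this kind of split selection recursively and with an information-based criterion, so it can combine several similarity scores; the stump above only keeps the single best one.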
Paper Citation
in Harvard Style
Borges E., Pinheiro R. and Dimuro G. (2017). Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 64-72. DOI: 10.5220/0006275100640072
in Bibtex Style
@conference{iceis17,
author={Eduardo N. Borges and Rafael F. Pinheiro and Graçaliz P. Dimuro},
title={Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={64-72},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006275100640072},
isbn={978-989-758-247-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning
SN - 978-989-758-247-9
AU - Borges E.
AU - Pinheiro R.
AU - Dimuro G.
PY - 2017
SP - 64
EP - 72
DO - 10.5220/0006275100640072