ana Gouveia, Adriana Jouris) = 0.93. This and many
other distinct pairs of contacts that consist of only two
names and share exactly the same first name
return very high scores for this function. This behavior
occurs because JaroWinkler takes into account the length
of the prefix that the two strings have in common.
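The prefix effect described above can be illustrated with a
minimal sketch of the textbook Jaro and Jaro-Winkler
definitions. The function names, the scaling factor p = 0.1,
and the prefix cap of 4 characters are assumptions taken from
the standard formulation, not from the prototype evaluated in
this paper, so the value obtained need not match the reported
0.93.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (case-sensitive, no normalization)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted in proportion to the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


if __name__ == "__main__":
    a, b = "Adriana Gouveia", "Adriana Jouris"
    print("Jaro:        ", round(jaro(a, b), 3))
    print("Jaro-Winkler:", round(jaro_winkler(a, b), 3))
```

Because the two names share the whole first name, the prefix
bonus is applied at its maximum, which pushes an already high
Jaro score even closer to 1 although the surnames differ.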
In D5, the distinct contacts Fernando Luis Martins and
Luiz Fernando Tusnski are misclassified
because the score returned by the Jaccard
similarity function is 0.5. This value is higher than the
maximum score of 0.33 expected by the classification
model. This type of error was the most frequent for this
dataset (1375/2926 = 47%), followed by the cases
in which the similarity between e-mails computed by
the Levenshtein function was not used (1354/2926 = 46.3%).
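The Jaccard score discussed above can be sketched as the ratio
between shared and total distinct name tokens. The tokenization
and the normalization step below are assumptions made only for
illustration: a score of 0.5 is consistent with two of four
distinct tokens being shared, which would happen if Luis and
Luiz were treated as the same token, but the paper's actual
preprocessing is not shown in this section. The helper names
are hypothetical.

```python
from typing import Callable


def tokens(name: str, normalize: Callable[[str], str] = str.lower) -> set:
    """Split a contact name on whitespace and normalize each token."""
    return {normalize(tok) for tok in name.split()}


def jaccard(a: str, b: str, normalize: Callable[[str], str] = str.lower) -> float:
    """Jaccard similarity between the token sets of two names."""
    ta, tb = tokens(a, normalize), tokens(b, normalize)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def fold_z(tok: str) -> str:
    """Hypothetical normalization: lowercase and treat 'z' as 's'."""
    return tok.lower().replace("z", "s")


if __name__ == "__main__":
    a, b = "Fernando Luis Martins", "Luiz Fernando Tusnski"
    print(jaccard(a, b))          # exact tokens: 1 shared out of 5 -> 0.2
    print(jaccard(a, b, fold_z))  # Luis == Luiz: 2 shared out of 4 -> 0.5
```

With only three tokens per name, a single additional shared
token moves the score from 0.2 to 0.5, which is why a fixed
maximum of 0.33 is easily exceeded by names that merely reuse
common given names.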
6 CONCLUSION
This paper presented a method for deduplication of
contacts that facilitates the integration process and
considerably reduces the time a user would take to
manually associate contacts from different accounts.
The experiments show that, using textual similarity
functions and machine learning algorithms, it was
possible to correctly identify up to 92.1% of duplicate
contact pairs that do not share telephone numbers or
e-mail addresses. The contribution of the proposed
work, compared with the tools presented in Section
2, is evident because none of those tools can detect
these pairs of contacts.
However, other identification errors can still appear.
For example, the contact with name = “Mom”
stored on the SIM card with the home phone number
would not be detected as a duplicate of the record
containing her proper name and cell phone number. There
may also be homonyms that do not represent the same
person, as in the case of “Orlando Marasciulo” (records
6, 7 and 8 of Figure 1).
As future work, we intend to adopt a cloud architecture
that stores the users' local learning models and
integrates them into a global model. The hits
and errors of the individual deduplication processes will
be combined in order to improve deduplication
for all users. Also, the prototype will be reimplemented
as a service that allows efficient incremental
deduplication on each insertion or deletion of a
contact.
Finally, the graphical interface will only allow the
user to set up parameters and interact with the
integration algorithm, choosing between two or more
representations of a duplicate contact name.
REFERENCES
Accaci, A. (2016). Duplicate contacts, v. 3.23.
http://play.google.com/store/apps/details?id=com.accaci.
Available: November, 2016.
Bilenko, M. and Mooney, R. J. (2003). Adaptive dupli-
cate detection using learnable string similarity mea-
sures. In Proceedings of the ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data
Mining, pages 39–48.
Borges, E. N., Becker, K., Heuser, C. A., and Galante, R.
(2011). A classification-based approach for biblio-
graphic metadata deduplication. In Proceedings of
the IADIS Int. Conference WWW/Internet, pages 221–
228.
Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R.
(2003). Robust and efficient fuzzy match for online
data cleaning. In Proceedings of the ACM SIGMOD
International Conference on Management of Data,
pages 313–324.
Christen, P. (2012). A survey of indexing techniques
for scalable record linkage and deduplication. IEEE
Transactions on Knowledge and Data Engineering,
24(9):1537–1555.
Cohen, W. W., Ravikumar, P., and Fienberg, S. E. (2003).
A comparison of string distance metrics for name-
matching tasks. In Proceedings of the IJCAI Workshop
on Information Integration, pages 73–78.
Dabhi, P. (2015). Duplicate contacts delete, v. 1.1.
http://play.google.com/store/apps/details?id=com.don.contactdelete.
Available: November, 2016.
Dal Bianco, G., Galante, R., Gonçalves, M. A., Canuto, S.,
and Heuser, C. A. (2015). A practical and effective
sampling selection strategy for large scale deduplication.
IEEE Transactions on Knowledge and Data Engineering,
27(9):2305–2319.
de Carvalho, M. G., Laender, A. H. F., Gonçalves, M. A.,
and da Silva, A. S. (2008). Replica identification using
genetic programming. In Proceedings of the ACM
Symposium on Applied Computing, pages 1801–1806.
Dorneles, C. F., Nunes, M. F., Heuser, C. A., Moreira, V. P.,
da Silva, A. S., and de Moura, E. S. (2009). A strategy
for allowing meaningful and comparable scores in ap-
proximate matching. Information Systems, 34(8):673–
689.
Goadrich, M. H. and Rogers, M. P. (2011). Smart smartphone
development: iOS versus Android. In Proceedings
of the 42nd ACM Technical Symposium on Computer
Science Education, SIGCSE ’11, pages 607–612,
New York. ACM.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann,
P., and Witten, I. H. (2009). The WEKA data mining
software: An update. SIGKDD Explor. Newsl.,
11(1):10–18.
Kowalski, G. J. and Maybury, M. T. (2002). Information
Storage and Retrieval Systems: Theory and Implementation.
Springer, Boston, MA, USA.
Lenzerini, M. (2002). Data integration: a theoret-
ical perspective. In Proceedings of the ACM
Contact Deduplication in Mobile Devices using Textual Similarity and Machine Learning