same street would be matched. The same quality is-
sue for Address will result in false hits across all types
of matches even where the distance threshold is set to
0.
The Client-Family Rule is generally not part of
record linkage research. As expected, in sparse
datasets (datasets with low numbers of client-client
matches), the system detected more Client-Family
matches. Interestingly, the optimum distance thresh-
old is different. While we still have a significant
change with threshold setting 2 and 3, there is enough
deterioration in results between 1 and 2 to select a
threshold setting of 1 (CF1). However, in Exp1.1,
there were 30,330 matches detected with an accuracy
of 72.14%. Increasing the distance threshold to 1
detects an extra 5,727 records, but only 1,239 of these
were accurate, so overall accuracy drops to 64.12%.
For this category, the choice is not definitive: CF0
gives the same number of matches in all experiments
(so we choose 3.3 as the most efficient), while CF1
yields more matches at the cost of more checking and
more false positives (here we choose 2.2 as the most
efficient, combined with the higher number of matches).
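The drop in accuracy can be checked directly from the figures reported above. The following is a minimal sketch of that arithmetic. The variable names are ours, and the count of accurate matches at threshold 0 is an assumption, recovered by rounding the reported 72.14% of 30,330.

```python
# Figures reported in the text for the Client-Family Rule.
base_matches = 30_330    # matches detected at threshold 0 (Exp1.1)
base_accuracy = 0.7214   # reported accuracy at threshold 0
extra_matches = 5_727    # additional matches detected at threshold 1
extra_correct = 1_239    # of those, matches judged accurate

# Assumption: accurate matches at threshold 0, inferred from the reported rate.
base_correct = round(base_matches * base_accuracy)

total_matches = base_matches + extra_matches
total_correct = base_correct + extra_correct

# Accuracy over all matches at threshold 1, and over the extra matches alone.
overall_accuracy = total_correct / total_matches
marginal_accuracy = extra_correct / extra_matches

print(f"overall accuracy at threshold 1: {overall_accuracy:.2%}")
print(f"accuracy of the extra matches:   {marginal_accuracy:.2%}")
```

Under this assumption the calculation reproduces the reported 64.12%, and it shows that only about 21.6% of the extra matches were accurate, which is why the additional matches pull the overall figure down.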
The Client-Domicile Rule did not perform well on either
accuracy or the number of true matches.
The accuracy for co-habitants is very low even though
the threshold was set to 0. The poor quality of
Address is problematic for this match type, because
SM_Address is the only similarity matrix used in
this rule. Our fuzzy matching (threshold greater
than 0) can handle abbreviations such as ’rd’ for ’road’
and ’st’ for ’saint’ in the Client-Client Rule and
Client-Family Rule only because those rules have a
higher dimensionality (they use additional similarity
metrics). In summary, while the number of false hits is
high, the Client-Domicile Rule succeeded in providing a
new dimension to the relationship graph for our industry partner.
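To illustrate why fuzzy matching on Address alone is risky, the sketch below compares an address-only rule with a rule that also checks a second dimension. It is not the system described in this paper. The use of Levenshtein edit distance as the distance measure, the two rule functions, and the three records are all hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def address_only_match(r1: dict, r2: dict, threshold: int) -> bool:
    """Hypothetical single-dimension rule: address distance only."""
    return levenshtein(r1["address"], r2["address"]) <= threshold


def name_and_address_match(r1: dict, r2: dict, threshold: int) -> bool:
    """Hypothetical two-dimension rule: address and name must both be close."""
    return (levenshtein(r1["address"], r2["address"]) <= threshold
            and levenshtein(r1["name"], r2["name"]) <= threshold)


# Hypothetical records: b is a with an unabbreviated address, c is a neighbour.
a = {"name": "mary byrne", "address": "12 main rd"}
b = {"name": "mary byrne", "address": "12 main road"}
c = {"name": "john kelly", "address": "14 main rd"}

for other in (b, c):
    print(other["address"],
          levenshtein(a["address"], other["address"]),
          address_only_match(a, other, 2),
          name_and_address_match(a, other, 2))
```

In this example the abbreviated address is at distance 2 from its expanded form, while the neighbouring house is at distance 1. Any threshold loose enough to absorb the abbreviation therefore also admits the neighbour when Address is the only dimension, whereas the second dimension rejects it.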
7 CONCLUSIONS
Strategic business knowledge such as Customer Life-
time Values for a customer database cannot be de-
livered without building full customer records, which
contain the entire history of transactions. In our work,
we use real world customer datasets from the insur-
ance sector with the goal of uniting client records by:
connecting all records (various policy data) for the
same client; connecting clients to family members
(where both have policies); and connecting clients
with co-habitants (where the co-habitant is also a
client). As data is never clean, this is a significant
task, even for relatively large datasets.
In this research, our goal was to segment the over-
all dataset so as to reduce matching complexity but
to do so in a manner that kept “matching” records in
the same segment. Early experiments were quite clear
that an aggregated similarity matrix did not provide
the required matching granularity to deliver accurate
results. For this reason, we created a multidimensional
similarity matrix and applied a set of rules to assist the
matching process. Our results show very good matching
when comparing client-to-client data; quite good
results when matching clients with family members;
and mixed results when trying to detect cohabiting
policy holders. Evaluation was provided by our
industry partner who, as a result of our work, is
building far larger customer graphs (customer pro-
files) than was previously possible. For future work,
our goal is to develop an auto-validation method sim-
ilar to (McCarren et al., 2017) to remove anomalies
while replacing the current human checking process
performed by our industry partner. This work will
also incorporate precision and recall in larger datasets.
REFERENCES
Baxter, R., Christen, P., Churches, T., et al. (2003). A com-
parison of fast blocking methods for record linkage.
In ACM SIGKDD, volume 3, pages 25–27. Citeseer.
Bhattacharya, I. and Getoor, L. (2007). Collective entity
resolution in relational data. ACM Transactions on
Knowledge Discovery from Data (TKDD), 1(1):5.
Bilenko, M., Kamath, B., and Mooney, R. J. (2006). Adap-
tive blocking: Learning to scale up record linkage.
In Data Mining, 2006. ICDM’06. Sixth International
Conference on, pages 87–96. IEEE.
Chen, I. J. and Popovich, K. (2003). Understanding
customer relationship management (CRM): People,
process and technology. Business Process Management
Journal, 9(5):672–688.
Cohen, W., Ravikumar, P., and Fienberg, S. (2003). A
comparison of string metrics for matching names and
records. In KDD Workshop on Data Cleaning and Object
Consolidation, volume 3, pages 73–78.
Day, W. H. and Edelsbrunner, H. (1984). Efficient
algorithms for agglomerative hierarchical clustering
methods. Journal of Classification, 1(1):7–24.
Di Benedetto, C. A. and Kim, K. H. (2016). Customer eq-
uity and value management of global brands: Bridg-
ing theory and practice from financial and marketing
perspectives: Introduction to a journal of business re-
search special section. Journal of Business Research,
69(9):3721–3724.
Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N.,
Murphy, K., Strohmann, T., Sun, S., and Zhang, W.
(2014). Knowledge vault: A web-scale approach to
probabilistic knowledge fusion. In Proceedings of
the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 601–
610. ACM.
ICEIS 2019 - 21st International Conference on Enterprise Information Systems