the achievable savings. By contrast, avoiding the most
frequent 0/1-bits for LSH blocking proved to be ben-
eficial for very large datasets. For future work, we
plan to investigate further P3RL approaches and make
them available in a toolbox for use in applications and
for a comparative evaluation.
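
To illustrate the remark on frequent 0/1-bits, the following is a minimal sketch of Hamming LSH blocking over bit vectors, in which the positions that are most often 1 and most often 0 across the dataset are excluded before the blocking keys are sampled. It is not the implementation evaluated in this paper: the function names, the parameters `prune_fraction`, `num_keys` and `key_length`, and the choice to prune the same number of positions on both sides are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def bit_frequencies(filters):
    """Return, for each bit position, the fraction of records with a 1 there."""
    n = len(filters)
    m = len(filters[0])
    return [sum(f[i] for f in filters) / n for i in range(m)]


def candidate_positions(filters, prune_fraction):
    """Drop the positions most often 0 and most often 1 (assumed pruning rule).

    prune_fraction is the share of positions removed at EACH end of the
    frequency ranking, so 2 * prune_fraction of all positions are excluded.
    """
    freqs = bit_frequencies(filters)
    m = len(freqs)
    k = int(m * prune_fraction)
    if 2 * k >= m:
        raise ValueError("prune_fraction leaves no positions to sample from")
    # Ascending by 1-frequency: the first k are the most frequent 0-bits,
    # the last k are the most frequent 1-bits.
    order = sorted(range(m), key=lambda i: freqs[i])
    return sorted(order[k:m - k])


def make_hash_functions(positions, num_keys, key_length, seed=0):
    """Each hash function is a random sample of key_length bit positions."""
    rng = random.Random(seed)
    return [rng.sample(positions, key_length) for _ in range(num_keys)]


def build_blocks(filters, hash_functions):
    """Group record ids by (hash function index, selected bit values)."""
    blocks = defaultdict(list)
    for rec_id, f in enumerate(filters):
        for h_idx, positions in enumerate(hash_functions):
            key = (h_idx, tuple(f[p] for p in positions))
            blocks[key].append(rec_id)
    return blocks
```

In this sketch, records become match candidates when they share at least one block. Positions that are almost always 0 or almost always 1 contribute little to separating records, so keys built from them tend to produce very large blocks; restricting the sample to the remaining positions is one simple way to avoid that.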
Parallel Privacy-preserving Record Linkage using LSH-based Blocking