Automated Cryptanalysis of Bloom Filter Encryptions of Health Records

Martin Kroll and Simone Steinmetzer

Research Methodology Group, University of Duisburg-Essen, Essen, Germany

Keywords:

Bloom Filter, Privacy-preserving Record Linkage, Anonymity, Hash Function, Cryptographic Attack.

Abstract:

Privacy-preserving record linkage with Bloom ﬁlters has become increasingly popular in medical applications,

since Bloom ﬁlters allow for probabilistic linkage of sensitive personal data. However, since evidence indi-

cates that Bloom ﬁlters lack sufﬁciently high security where strong security guarantees are required, several

suggestions for their improvement have been made in literature. One of those improvements proposes the

storage of several identiﬁers in one single Bloom ﬁlter. In this paper we present an automated cryptanalysis of

this Bloom ﬁlter variant. The three steps of this procedure constitute our main contributions: (1) a new method

for the detection of Bloom filter encryptions of bigrams (so-called atoms), (2) the use of an optimization algorithm for the assignment of atoms to bigrams, (3) the reconstruction of the original attribute values by linkage

against bigram sets obtained from lists of frequent attribute values in the underlying population. To sum up,

we provide the first convincing attack on Bloom filter encryptions of records built from more than one

identiﬁer.

1 INTRODUCTION

Record linkage between databases containing infor-

mation on individual people is popular in a large num-

ber of medical applications, for example the iden-

tiﬁcation of patient deaths (Jones et al., 2005), the

evaluation of disease treatment (Newman and Brown,

1997) and the linkage of cancer registries in epidemi-

ology (Van Den Brandt et al., 1990). In many ap-

plications data sets are merged using personal iden-

tiﬁers such as forenames, surnames, place and date

of birth. Due to privacy concerns this has to be done

via privacy-preserving record linkage (PPRL). How-

ever, since personal identiﬁers often contain typing or

spelling errors, encrypting the identiﬁer values and

linking only those that match exactly does not pro-

vide satisfactory results. Therefore, to allow for errors

in encrypted personal identiﬁers, in many European

countries encrypted phonetic codes, such as Soundex

codes, are commonly used, especially by cancer registries. As the performance of these codes is still unsatisfactory, several novel privacy-preserving record linkage methods have been suggested in recent years. For example, Schnell et al. (Schnell et al., 2009)

developed a method based on Bloom ﬁlters. Bloom-

ﬁlter-based record linkage has already been used in

medical applications in a number of different coun-

tries (Kuehni et al., 2012; Rocha, 2013; Randall et al.,

2014; Schnell et al., 2014).

Another frequently applied privacy-preserving

record linkage method uses anonymous linking codes

(Herzog et al., 2007). The basic principle of an anony-

mous linking code is to standardize all particular iden-

tiﬁers of a record (removal of certain characters and

diacritics, use of upper case letters), to concatenate

them to a single string and ﬁnally to put this sin-

gle string into a cryptographic hash function. By

combining this principle with Bloom ﬁlters, Schnell

et al. (Schnell et al., 2011) ﬁrst developed a novel

error-tolerant anonymous linking code, called Crypto-

graphic Longterm Key (CLK). Instead of encrypting

every single identiﬁer from a record of several iden-

tiﬁers through a Bloom ﬁlter, multiple identiﬁers are

stored in one single Bloom ﬁlter, called CLK. Tests on

several databases showed that CLKs yield good link-

age properties, superior to well-known anonymous

linking codes (Schnell et al., 2011).

Only recently Randall et al. (Randall et al., 2014)

presented a study on 26 million records of hospital

admissions data and showed that privacy-preserving

record linkage with Bloom ﬁlters built from multiple

identiﬁers is applicable to large real-world databases

without loss in linkage quality.

However, only little research on the security of

Bloom ﬁlters built from more than one identiﬁer

has yet been published (see subsection 2.2). In

several countries, this lack of research prevents the

widespread use of Bloom ﬁlter encryptions for real-

Kroll M. and Steinmetzer S..

Automated Cryptanalysis of Bloom Filter Encryptions of Health Records.

DOI: 10.5220/0005176000050013

In Proceedings of the International Conference on Health Informatics (HEALTHINF-2015), pages 5-13

ISBN: 978-989-758-068-0

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

world medical databases (such as cancer registries)

where the anonymity of the individuals has to be guar-

anteed. For example, in its Beyond 2011 Programme

the British Ofﬁce for National Statistics investigated

several methods for linking sensitive data sets (Ofﬁce

for National Statistics, 2013). The investigators came

to the conclusion that none of the “(...) recent inno-

vations, such as bloom ﬁlter encryption (...)” can be

recommended because they “(...) have not been fully

explored from an accreditation perspective”. Thus,

research showing drawbacks of the recent Bloom ﬁl-

ter techniques is important because it guides the di-

rection for future research and might motivate further

development of the recent procedures. In this paper,

we intend to investigate this issue in detail by giving

the ﬁrst convincing cryptanalysis of Bloom ﬁlter en-

cryptions built from more than one identiﬁer.

2 BACKGROUND

In 1970, Burton Howard Bloom (Bloom, 1970) intro-

duced a novel approach that permits the efﬁcient test-

ing of set membership through a probabilistic space-

efﬁcient data structure. A Bloom ﬁlter is a bit array of

length $L$, which at first contains zeros only. Let $S \subseteq U$ be a subset of a universe $U$. Then $S$ can be stored in a Bloom filter $B = B(S) = (b_0, \dots, b_{L-1})$ in the following way: Each element $s \in S$ is mapped via $k$ different hash functions $h_0, \dots, h_{k-1} : S \to \{0, \dots, L-1\}$ and all the corresponding bit positions $b_{h_0(s)}, \dots, b_{h_{k-1}(s)}$

are set to one. Once a bit position is set to one this

value no longer changes.

Furthermore, to test whether an item u ∈ U from

the universe is contained in S, u is hashed through the

$k$ hash functions $h_0, \dots, h_{k-1}$ as well. Consequently, if all bit positions $b_{h_0(u)}, \dots, b_{h_{k-1}(u)}$ in the Bloom filter are set to one, then $u \in S$ holds with high probability. However, false positives can occur when the ones at positions $h_0(u), \dots, h_{k-1}(u)$ are caused by two or more different elements from $S$. Then the test indicates $u \in S$ although this is not the case. Otherwise, if at least one of these bit positions is zero, $u$ is clearly not a member of $S$.
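The insertion and membership test described above can be sketched in Python. This is a minimal illustration only: the class name `BloomFilter`, its method names and the salted SHA-256 hash functions are assumptions made for the sketch and are not the hash functions used in this paper.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array of length L with k hash functions."""

    def __init__(self, length: int, num_hashes: int):
        self.length = length          # L
        self.num_hashes = num_hashes  # k
        self.bits = [0] * length      # initially all zeros

    def _positions(self, item: str):
        # Assumed construction: h_i(item) = SHA-256("i:item") mod L.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.length

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1        # a bit set to one never changes back

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] == 1 for pos in self._positions(item))


bf = BloomFilter(length=1000, num_hashes=15)
for bigram in ["_P", "PE", "ET", "TE", "ER", "R_"]:
    bf.add(bigram)
```

Every stored element is reported as present, while a `False` answer is always reliable; only a `True` answer for an element that was never stored is a false positive.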

2.1 PPRL with Bloom Filters Built from

Multiple Identiﬁers

In (Schnell et al., 2009) Bloom ﬁlters were used

in privacy-preserving record linkage for the ﬁrst

time. This approach was expanded to Cryptographic

Longterm Keys in (Schnell et al., 2011).

In common PPRL protocols two data owners A

and B agree on a set of identiﬁers that occur in both

of their databases. Next, these identiﬁers are stan-

dardized, then padded with blanks at the beginning

and the end, and ﬁnally split into substrings of two

characters. Each substring of the ﬁrst identiﬁer corre-

sponding to a record is mapped to the ﬁrst Bloom ﬁl-

ter via several hash functions. Afterwards, each sub-

string of the second identiﬁer, corresponding to the

same record, is mapped through another set of hash

functions to the ﬁrst Bloom ﬁlter as well. This proce-

dure is repeated until all identiﬁers of the ﬁrst record

are stored in the ﬁrst Bloom ﬁlter. Next, all identiﬁers

corresponding to the second record of the database

are mapped through the utilized hash functions to a

second Bloom ﬁlter and so on. Performing this pro-

cedure for all entries of the database results in a set

of Bloom ﬁlters where each Bloom ﬁlter is built from

multiple identiﬁers. Thus, the similarity of the Bloom

ﬁlters is a measure for the similarity of the encoded

identiﬁers. Usually, the linkage of the two databases

is conducted by a third party C.

Because of the speciﬁc structure of Bloom ﬁlters,

record linkage based on Bloom ﬁlters built from mul-

tiple identiﬁers allows for errors in the encrypted data.

Therefore, they can be applied to linking large data

sets such as national medical databases (Randall et al.,

2014).

2.2 Extant Research: Attacks on Bloom

Filters of One or More Identiﬁers

To the best of our knowledge, only two ways of at-

tacking Bloom ﬁlters of one identiﬁer and one way

of attacking Bloom ﬁlters of multiple identiﬁers are

known so far.

The ﬁrst cryptanalysis of Bloom ﬁlters was pub-

lished in 2011. Kuzu et al. (Kuzu et al., 2011) sam-

pled 20,000 records from a voter registration list and

encrypted the substrings of two characters from the

forenames through 15 hash functions and Bloom ﬁl-

ters of length 500 bits. Their attack consisted in solv-

ing a constraint satisfaction problem (CSP). Through

a frequency analysis of the forenames and the Bloom

ﬁlters and by applying their CSP solver to the prob-

lem, Kuzu et al. were able to decipher approximately

11% of the data.

In contrast, Niedermeyer et al. (Niedermeyer

et al., 2014) proposed an attack on 10,000 Bloom ﬁl-

ters built from encrypted German surnames that were

considered to be a random sample of a known popu-

lation. For the generation of the Bloom ﬁlters 15 hash

functions and Bloom ﬁlter length 1,000 were used.

Then they conducted a manual attack based on the fre-

quencies of the substrings of length two, which they

derived from the German surnames. Thus, Nieder-

HEALTHINF2015-InternationalConferenceonHealthInformatics

meyer et al. deciphered the 934 most frequent sur-

names of 7,580 different ones, which corresponds to

approximately 12% of the data set. However, their

attack is not limited to the most frequent names and

could be extended to the decipherment of nearly all

names.

In 2012 Kuzu et al. (Kuzu et al., 2012) showed

an attack on Bloom ﬁlters built from multiple identi-

ﬁers. They applied their constraint solver to forename

and surname, as well as forename, surname, city and

ZIP code, of 50,000 randomly selected records from

the North Carolina voter registration list. However,

they were not able to mount a successful attack. Thus,

Kuzu et al. supposed that combining multiple per-

sonal identiﬁers into a single Bloom ﬁlter would of-

fer a protection mechanism against frequency attacks.

Although they suspected that their attack did not un-

cover all vulnerabilities of the Bloom ﬁlter encodings,

they showed that the CSP for multiple identiﬁers is in-

tractable to solve by their constraint solver.

2.3 Our Contribution

In this paper we present a fully automated attack on a

database containing forenames, surnames and places of birth. All records are considered

to be a random sample of a known population. We

suppose that the attacker only knows some publicly

available lists of the most common forenames, sur-

names and locations. The attack is based on analyzing

the frequencies and the combined occurrence of substrings of length two from the identifiers of these lists.

Furthermore, we are interested in recovering as many

identiﬁers as possible. Our cryptanalysis was imple-

mented using the programming languages Python and

C++.

3 ENCRYPTION

In this section some basic notation is introduced and

the encrypting procedure is described.

In record linkage scenarios, strings are usually

standardized through transformations such as capital-

ization of characters or removal of diacritics (Randall

et al., 2013). After this preprocessing step all strings

contain only tokens from some predeﬁned alphabet Σ.

Throughout this article, we use the canonical alphabet $\Sigma := \{\mathrm{A}, \mathrm{B}, \dots, \mathrm{Z}, \_\}$, where $\_$ denotes the padding blank. Thus, for example the popular German surname Müller is transformed to MUELLER in the preprocessing step. As usual, we denote substrings of two characters with bigrams and the set containing all the bigrams with $\Sigma^2$, i.e.
$$\Sigma^2 = \{\_\_, \_\mathrm{A}, \dots, \_\mathrm{Z}, \mathrm{A}\_, \dots, \mathrm{Z}\_, \mathrm{AA}, \dots, \mathrm{ZZ}\}.$$

The Bloom ﬁlter encryption of a record from a

database is created by storing the bigram set associ-

ated with this record into a Bloom ﬁlter. The bigram

set associated with a record is deﬁned as the set con-

taining the bigrams from all the identiﬁers. Here, a

distinction between the bigrams occurring in different identifiers is made. Thus, if the set of identifiers is denoted with $I$, the bigram set of a record is a subset of $I \times \Sigma^2$.
For example, if we have $I = \{\text{surname}, \text{forename}\}$ and the database contains a record, Peter Müller, the bigram set $S_{\text{record}}$ associated with this record would contain the bigrams $\mathrm{\_P}_f$, $\mathrm{PE}_f$, $\mathrm{ET}_f$, $\mathrm{TE}_f$, $\mathrm{ER}_f$, $\mathrm{R\_}_f$, $\mathrm{\_M}_s$, $\mathrm{MU}_s$, $\mathrm{UE}_s$, $\mathrm{EL}_s$, $\mathrm{LL}_s$, $\mathrm{LE}_s$, $\mathrm{ER}_s$ and $\mathrm{R\_}_s$ (the subscript $f$ indicates the bigrams occurring in the forename identifier, the subscript $s$ the ones occurring in the surname identifier).
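The construction of the bigram set of a record can be sketched as follows. The function names `bigrams` and `record_bigram_set` are hypothetical, and the padding blank is written as `_`; each element is an (identifier, bigram) pair, i.e. an element of $I \times \Sigma^2$.

```python
def bigrams(value: str, pad: str = "_") -> set:
    """Return the set of bigrams of a standardized string padded with blanks."""
    padded = pad + value + pad
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def record_bigram_set(record: dict) -> set:
    """Bigram set of a record as a set of (identifier, bigram) pairs."""
    return {(ident, bg) for ident, value in record.items() for bg in bigrams(value)}


record = {"forename": "PETER", "surname": "MUELLER"}
s_record = record_bigram_set(record)
```

For this record the set has 14 elements: six for the forename and eight for the surname. The bigram ER appears twice, once per identifier, because the identifier is part of each element.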

Next, this bigram set is stored into a Bloom filter $(b_0, \dots, b_{L-1})$ of length $L$ by means of $k$ independent hash functions
$$h_i : I \times \Sigma^2 \to \{0, \dots, L-1\}$$
for $i = 0, \dots, k-1$. In practice, one could alternatively use different hash functions $h_i : \Sigma^2 \to \{0, \dots, L-1\}$

for the distinct identiﬁers in order to guarantee that the

hash values for distinct identiﬁers are not the same.

Further, as in (Niedermeyer et al., 2014) we in-

troduce the term atom for the speciﬁc Bloom ﬁlters

which occur as the fundamental building blocks of the

encryption method.

Definition 3.1 (Atom). Let $L, k \in \mathbb{N}$ and let hash functions $h_0, \dots, h_{k-1}$ be defined as above. Then, a Bloom filter
$$B := (b_0, \dots, b_{L-1}) \in \{0, 1\}^L$$
is termed an atom if there exists a bigram $\beta \in I \times \Sigma^2$ such that $b_j = 1 \Leftrightarrow h_i(\beta) = j$ for some $i = 0, \dots, k-1$.

Such a Bloom ﬁlter is called the atom realized by the

bigram β and denoted with B(β).

Thus, atoms are special Bloom ﬁlters. Since each

bigram is hashed via each $h_i$ for $i = 0, \dots, k-1$, at

most k positions in an atom can be set to one.

By combining the atoms of the underlying bigram

set of a record with the bitwise OR operation, the

Bloom ﬁlter of a record is composed as

$$B(\text{record}) = \bigvee_{\beta \in S_{\text{record}}} B(\beta),$$
where $\bigvee$ denotes the bitwise OR operator.

Note that the same bigram from $\Sigma^2$ is hashed differently if it occurs in distinct identifiers. This is illustrated in Figure 1 for the example of the bigram ER

AutomatedCryptanalysisofBloomFilterEncryptionsofHealthRecords

Figure 1: Two different atoms of the bigram ER. These atoms are realized when instances of ER occur in distinct identifiers.

  0000100000...0000000010   $B(\mathrm{\_P}_f)$
∨ 0001000001...0100000100   $B(\mathrm{PE}_f)$
∨ 0101010101...0001010101   $B(\mathrm{ET}_f)$
∨ 0001000010...0001000010   $B(\mathrm{TE}_f)$
∨ 0100010001...0000000100   $B(\mathrm{ER}_f)$
∨ 0101010101...0000000001   $B(\mathrm{R\_}_f)$
= 0101110111...0101010111   $B(\text{Peter})$

  0000000100...0000000001   $B(\mathrm{\_M}_s)$
∨ 0010000000...0100000000   $B(\mathrm{MU}_s)$
∨ 0000100000...0010000010   $B(\mathrm{UE}_s)$
∨ 1000000010...0010000000   $B(\mathrm{EL}_s)$
∨ 0100001000...0100001000   $B(\mathrm{LL}_s)$
∨ 1000000100...0001000000   $B(\mathrm{LE}_s)$
∨ 1001001001...0000100100   $B(\mathrm{ER}_s)$
∨ 0010001000...0000000010   $B(\mathrm{R\_}_s)$
= 1111101111...0111101111   $B(\text{Müller})$

Figure 2: Bloom filters of the forename Peter and the surname Müller, composed of the atoms belonging to the underlying bigrams.

which occurs in the record Peter Müller both in the surname and the forename identifier.

Mapping each bigram of the forename Peter with

$k$ hash functions results in six atoms; for the surname Müller, we get eight atoms. Thus, the separate

Bloom ﬁlters for these identiﬁers might be composed

as illustrated in Figure 2.

The final Bloom filter for the record Peter Müller is composed by applying the bitwise OR operation to the separate Bloom filter encryptions of the

distinct identiﬁers. This is demonstrated in Figure 3.

In practice, the Bloom ﬁlter encryption of a record

might contain a mixture of string valued identiﬁers

(such as forename, surname or place of birth) and

also numerical identiﬁers, such as date of birth. How-

ever, in this paper we restrict ourselves to the case of

string valued attributes only, albeit our cryptanalysis

proposed below is not limited to such attributes.

Assumptions

In many record linkage scenarios, it is supposed that

a semi-trusted third party conducts the record link-

age between two encrypted databases. In this paper

we assume a data set containing Bloom ﬁlters built

from multiple identiﬁers that is sent to a semi-trusted

third party. This third party acts as the adversary and

tries to infer as much information as possible from

the record encryptions. We further suppose that the

attacker has knowledge of the encryption process.

  0101110111...0101010111   $B(\text{Peter})$
∨ 1111101111...0111101111   $B(\text{Müller})$
= 1111111111...0111111111   $B(\text{entire record})$

Figure 3: The Bloom filter of the record Peter Müller is obtained by applying the bitwise OR operation to the Bloom filter encryptions of the separate identifiers.

For our scenario we generated 100,000 Bloom filters built from standardized German forenames, surnames and cities according to the distribution in the population. The identifiers were truncated after the tenth letter, padded with blanks, respectively, and were broken into bigrams. Then the bigrams were hashed through $k = 20$ hash functions into Bloom filters of length $L = 1{,}000$. As proposed in (Schnell et al., 2009) and (Schnell et al., 2011), we used the so-called double hashing scheme for the generation of $k$ hash functions from two hash functions $f$ and $g$. This double hashing scheme is defined via the equation
$$h_i = (f + i \cdot g) \bmod L \quad \text{for } i = 0, \dots, k-1 \tag{1}$$

and was originally proposed in (Kirsch and Mitzen-

macher, 2008) as a simple hashing method for Bloom

ﬁlters yielding satisfactory performance results.
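The encryption of a record with the double hashing scheme can be sketched in Python. The choice of keyed HMACs (SHA-1 and SHA-256) for $f$ and $g$, the key `b"secret"`, and the function names `double_hash_positions`, `atom` and `encode_record` are assumptions made for this illustration; the text above only fixes equation (1), $k = 20$, $L = 1{,}000$ and the truncation after the tenth letter. Bloom filters are represented as Python integers used as bit masks.

```python
import hashlib
import hmac


def double_hash_positions(token: str, k: int, length: int, key: bytes) -> list:
    """Positions h_i = (f + i*g) mod L for i = 0..k-1, as in equation (1)."""
    # Assumption: f and g are keyed HMACs; the concrete functions are illustrative.
    msg = token.encode("utf-8")
    f = int.from_bytes(hmac.new(key, msg, hashlib.sha1).digest(), "big")
    g = int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big")
    return [(f + i * g) % length for i in range(k)]


def atom(identifier: str, bigram: str, k: int = 20, length: int = 1000,
         key: bytes = b"secret") -> int:
    """Atom B(beta) as an integer bit mask; the identifier is part of the hashed token."""
    mask = 0
    for pos in double_hash_positions(f"{identifier}:{bigram}", k, length, key):
        mask |= 1 << pos
    return mask


def encode_record(record: dict, k: int = 20, length: int = 1000) -> int:
    """Bloom filter of a record: bitwise OR of the atoms of all its bigrams."""
    bloom = 0
    for identifier, value in record.items():
        padded = "_" + value[:10] + "_"   # truncate after the tenth letter, pad with blanks
        for i in range(len(padded) - 1):
            bloom |= atom(identifier, padded[i:i + 2], k, length)
    return bloom


bloom = encode_record({"forename": "PETER", "surname": "MUELLER"})
```

Because the identifier is hashed together with the bigram, the same bigram in different identifiers is mapped to different atoms, and every atom of the record is contained in the record's Bloom filter.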

In our cryptanalysis we assume that the adversary

knows that the hash values are generated in accor-

dance with equation (1). It is self-evident that s/he

must not have direct access to the hash functions f

and g since this would permit the adversary to check

whether a speciﬁc bigram is contained in a given

Bloom ﬁlter.

Note that the double hashing scheme has also been

used for the generation of Bloom ﬁlters by Kuzu et

al. (Kuzu et al., 2012). However, in that paper the

knowledge of the double hashing scheme was not ex-

ploited in their cryptanalysis.

4 CRYPTANALYSIS

This section provides a detailed description of the de-

ciphering process. At ﬁrst we try to detect the atoms

that are contained in the given Bloom ﬁlters. Then,

we assign bigrams to these atoms by means of an op-

timization algorithm. Finally, the original attributes

are reconstructed from the atoms.

Our approach for the development of a fully auto-

mated attack is based on previous results on the au-

tomated cryptanalysis of simple substitution ciphers

presented by Jakobsen (Jakobsen, 1995). We give a

short account of Jakobsen’s results in order to moti-

vate our procedure.

4.1 Automated Cryptanalysis of Simple

Substitution Ciphers

The encryption of a plaintext message through a sim-

ple substitution cipher is deﬁned by a permutation of

the underlying alphabet Σ. For instance, the message

HELLO LISBON with tokens from the alphabet

$\Sigma = \{\_, \mathrm{A}, \mathrm{B}, \dots, \mathrm{Z}\}$ could be encrypted as

RVUUYJUOWAYL.

It is well known that this kind of encryption can be

broken easily by means of a frequency analysis. However, just replacing the $i$-th most frequent character in the ciphertext with the $i$-th most frequent character in the underlying language will usually not lead to the correct decipherment (even for longer messages). This

is commonly compensated for by taking bigram fre-

quencies into consideration as well.

The expected bigram frequencies can be obtained

from a training data set composed of the underlying

language and stored in a quadratic matrix E (in the

above example a $27 \times 27$ matrix), where the entry $e_{ij}$ is equal to the relative proportion of the bigram $c_i c_j$ in the training text corpus and $c_i$ denotes the $i$-th character of the alphabet. Analogously, the bigram frequencies of the ciphertext can be stored in a matrix $D$.
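As an illustration of how such a matrix might be computed, the following sketch counts the bigrams of a text over the 27-symbol alphabet. The function name `bigram_matrix` is hypothetical, the blank is written as `_`, and the short example message stands in for a real training corpus.

```python
import string

ALPHABET = "_" + string.ascii_uppercase  # 27 symbols, "_" is the blank


def bigram_matrix(text: str) -> list:
    """27 x 27 matrix of relative bigram frequencies of a text over ALPHABET."""
    index = {c: i for i, c in enumerate(ALPHABET)}
    n = len(ALPHABET)
    counts = [[0] * n for _ in range(n)]
    for first, second in zip(text, text[1:]):
        counts[index[first]][index[second]] += 1
    total = max(len(text) - 1, 1)         # number of bigrams in the text
    return [[c / total for c in row] for row in counts]


E = bigram_matrix("HELLO_LISBON")
```

The same function applied to the ciphertext would yield the matrix $D$; the entries of each matrix sum to one.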

The algorithm proposed by Jakobsen (Jakobsen,

1995) was intended to find a permutation $\sigma_{\mathrm{opt}}$ of the alphabet such that the objective function $f$ defined via
$$f(\sigma) := \sum_{i,j} |d_{\sigma(i)\sigma(j)} - e_{ij}| \tag{2}$$

was minimized. The algorithm starts with the initial

permutation that reﬂects the best assignment between

single characters in the plaintext and the ciphertext

with respect to their relative frequency. In each step of

the algorithm two elements of the currently best permutation $\sigma_{\mathrm{opt}}$ are swapped, leading to a new candidate permutation $\sigma$. If $f(\sigma) < f(\sigma_{\mathrm{opt}})$ holds, the current permutation is updated to $\sigma$, otherwise $\sigma$ is discarded and a new candidate $\sigma$ is generated by swapping two other elements of $\sigma_{\mathrm{opt}}$. This is repeated until no swap

leads to a further improvement of the objective func-

tion f . Throughout this paper we use the same strat-

egy as Jakobsen in (Jakobsen, 1995), in order to de-

termine the elements of the current permutation to be

swapped. For a more detailed description of Jakob-

sen’s method in the case of simple substitution ciphers

we refer the reader to the original paper (Jakobsen,

1995). Figure 2 in (Jakobsen, 1995) shows that a ci-

phertext of length 600 built by a simple substitution

cipher can be entirely broken by this method. It is

clear that some modiﬁcation of Jakobsen’s original al-

gorithm is necessary in order to make it applicable in

our setting as well. In particular, the deﬁnitions of the

matrices D and E must be changed. Their adapted definitions are introduced in subsection 4.3.

4.2 Atom Detection

As in (Niedermeyer et al., 2014), the basic principle

of our approach consists in the detection of atoms,

which represent the encryption of one single bigram

only. Since the Bloom ﬁlter of a string is created by

the superposition of at least a few atoms, the recon-

struction of the atoms given only a set of Bloom ﬁl-

ters turns out to be difﬁcult. Note that this task can-

not be solved in a satisfactory manner if Bloom ﬁlters

are considered in isolation or in small groups because

in this case too many binary vectors will be wrongly

classiﬁed as atoms.

Let us give a short motivation for our novel

method aiming at atom detection. If the bitwise AND

operation is applied to a set of Bloom ﬁlters that have

one bigram β in common, at least all positions set to

one by β are equal to one in the result. However, for

prevalent bigrams it should be expected that all the

other positions are set to zero if a sufﬁcient number of

Bloom ﬁlters are considered, i.e., the result would be

exactly the atom induced by the bigram β.

Of course, if an adversary has access to a set

of Bloom ﬁlters, s/he does not a priori know which

Bloom ﬁlters have a bigram in common. This obsta-

cle can be avoided as follows: Under the assumption

that the double hashing scheme is being used, the ad-

versary is able to determine for each combination of

bit positions from equation (1) the set of Bloom ﬁlters

for which all these positions are set to 1. Then, the

bitwise AND operation is applied to the set of these

Bloom filters. If the result coincides with the candidate atom, i.e. exactly the considered bit positions are set to one, it is considered to be the realization of a bigram by the adversary.
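The detection step described above can be sketched as follows, with Bloom filters represented as integer bit masks. The function names and the parameter `min_support` (a minimum number of matching Bloom filters) are assumptions added for the illustration, and the toy parameters $L = 32$, $k = 3$ replace the values used in this paper so that the example runs instantly.

```python
from functools import reduce


def candidate_mask(f: int, g: int, k: int, length: int) -> int:
    """Bit mask of the positions (f + i*g) mod L, i = 0..k-1, from equation (1)."""
    mask = 0
    for i in range(k):
        mask |= 1 << ((f + i * g) % length)
    return mask


def detect_atoms(filters: list, k: int, length: int, min_support: int = 2) -> set:
    """Return the candidate masks that equal the AND of all filters containing them."""
    atoms = set()
    for f in range(length):
        for g in range(length):
            mask = candidate_mask(f, g, k, length)
            if mask in atoms:
                continue                   # same position set reached via another (f, g)
            matching = [b for b in filters if (b & mask) == mask]
            if len(matching) < min_support:
                continue
            if reduce(lambda x, y: x & y, matching) == mask:
                atoms.add(mask)
    return atoms


# Toy data: three atoms, each Bloom filter is the OR of two of them.
a1 = candidate_mask(1, 5, 3, 32)
a2 = candidate_mask(2, 7, 3, 32)
a3 = candidate_mask(4, 3, 3, 32)
filters = [a1 | a2, a1 | a3, a2 | a3]
found = detect_atoms(filters, k=3, length=32)
```

In this toy setting exactly the three atoms are recovered. The double loop over `f` and `g` makes the number of candidates quadratic in $L$.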

The resulting set of atoms was further reduced by

discarding atoms of Hamming weight $\sum_{i=0}^{999} b_i$ equal

to 1, 2, 4 or 5 and keeping only atoms of Hamming

weight equal to 8, 10 or 20.

Otherwise, too many binary vectors would have

been classiﬁed incorrectly as atoms. The probability

that an atom has Hamming weight less than 8 in our

setting is equal to 0.008. This value can be derived in

analogy to Lemma A.1 and the subsequent example

in (Niedermeyer et al., 2014). We denote the num-

ber of atoms found by n. For our speciﬁc data set we

got n = 1,776. This result seems reasonable, because

the total number of possible atoms is bounded from

above by 2,187 and obviously not all of these atoms,

in particular atoms realized by rare bigrams, occur in

our simulated data. As we checked later on, 1,337

of the 1,776 extracted conjectured atoms were indeed

true atoms, that is to say atoms generated by one of

Figure 4: Absolute frequencies of the 10 most frequent bigrams in our training data set (horizontal axis: bigram; vertical axis: count). Bigrams in the order shown: name:A_, loc:EN, surname:ER, loc:N_, loc:ER, name:AN, surname:R_, name:E_, name:AR, name:_M.

the 2,187 bigrams. The subsequent analysis demon-

strates that this percentage of correct atom detection

is sufﬁcient for a successful cryptanalysis. For each

atom α we determined the set of Bloom ﬁlters con-

taining this atom, i.e. Bloom ﬁlters for which all bit

positions of the atom are set to 1. We denote the atoms

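with $\alpha_1, \dots, \alpha_{1776}$ according to decreasing frequency. In order to give an illustrative example, we assert that in the Bloom filter No. 850 the atoms $\alpha_{\dots}$, $\alpha_{106}$, $\alpha_{110}$, $\alpha_{123}$, $\alpha_{138}$, $\alpha_{169}$, $\alpha_{194}$, $\alpha_{197}$, $\alpha_{218}$, $\alpha_{254}$, $\alpha_{309}$, $\alpha_{313}$, $\alpha_{317}$, $\alpha_{334}$, $\alpha_{335}$, $\alpha_{396}$, $\alpha_{398}$, $\alpha_{453}$, $\alpha_{607}$, $\alpha_{668}$, $\alpha_{705}$, $\alpha_{782}$, $\alpha_{821}$, $\alpha_{960}$ and $\alpha_{1131}$ were detected.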
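The Hamming weights 1, 2, 4, 5, 8, 10 and 20 and the probability 0.008 mentioned above follow from the double hashing scheme: the positions $(f + i \cdot g) \bmod L$ start to repeat as soon as $i \cdot g$ is a multiple of $L$. The sketch below checks this by enumeration for $k = 20$ and $L = 1{,}000$, under the assumption (made here for illustration) that $g \bmod L$ is uniformly distributed on $\{0, \dots, 999\}$; the function name `atom_weight` is hypothetical.

```python
def atom_weight(g: int, k: int = 20, length: int = 1000) -> int:
    """Number of distinct positions (f + i*g) mod L, i = 0..k-1; independent of f."""
    return len({(i * g) % length for i in range(k)})


weights = [atom_weight(g) for g in range(1000)]
possible_weights = sorted(set(weights))
prob_below_8 = sum(w < 8 for w in weights) / 1000
```

Exactly 8 of the 1,000 residues of $g$ (namely 0, 200, 250, 400, 500, 600, 750 and 800) lead to fewer than 8 distinct positions, which gives the probability $8/1000 = 0.008$.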

In the subsequent section we explain how correlations between the occurrences of atoms in the Bloom

ﬁlters and bigrams in a training data set can be used

to give adequate deﬁnitions of the matrices D and E

that serve as the input of Jakobsen’s algorithm.

4.3 Correlation of Atoms and Bigrams

A naive assignment of bigrams to atoms is possible

only for few frequent bigrams. For example, if Ger-

man surnames, given names and birth locations are

considered together, the most frequent bigram is $\mathrm{A\_}_f$ (the bigram $\mathrm{A\_}$ in the forename identifier) such that

the most frequent atom is likely to be the encryption

of this bigram. The absolute frequencies of the 10

most frequent bigrams in the considered training data

are illustrated in Figure 4.

Except for the ﬁrst few bigrams, the bigram fre-

quencies are too close together such that naive match-

ing is not promising for automatic decipherment.

In the example of Bloom ﬁlter No. 850 already

introduced above, this naive assignment would lead to

the conjecture that the corresponding record contains

the following bigrams: N, R, CH, N, HE, S, E, L, BE, NI, AR, W, P, NG, IR, MI, NI, VE, OS, NS, UN, AT, V, LH, OW, ZB, RR, DY and MR. However, from this list of

bigrams it is obviously impossible to reconstruct any

meaningful information.

For this reason, we also took correlations between

bigrams into account. For example, for records sam-

pled from the population of Germany the appearance

of the bigram $\mathrm{CH}_s$ in a record makes the appearance of the bigram $\mathrm{SC}_s$

in the same record more likely be-

cause the trigram SCH frequently appears in German

surnames.

We model this kind of information on the correlation of atoms and bigrams by means of two matrices $D$ and $E$. Assume that the attribute values of the records built from tokens of the alphabet $\Sigma = \{\_, \mathrm{A}, \mathrm{B}, \dots, \mathrm{Z}\}$ are to be encrypted. Thus, for

each (string valued) identiﬁer we have 729 possible

bigrams. Since the same bigram is encrypted differ-

ently for each identiﬁer we have to distinguish be-

tween different instances of the same bigram. In our

setting we denote the bigram $\beta$ for the surname, forename and location identifier with $\beta_s$, $\beta_f$ and $\beta_l$, respectively. Altogether, the set $I \times \Sigma^2$ containing all possible bigrams consists of $3 \cdot 729 = 2{,}187$ elements.

Let us now introduce the matrix E containing in-

formation about the expected bigram correlations ob-

tained from the training data set. Note that the train-

ing data should be as similar to the encrypted data as

possible, e.g. a random sample from the same under-

lying population as the encrypted data. If the prevail-

ing Bloom ﬁlters are known to contain encryptions

of records from the German population, an attacker

would try to get access to a comparable database con-

taining the same identiﬁers. The attribute values of

this training data set are preprocessed analogously to

the preprocessing routine before the encryption pro-

cess. Then, the bigram sets for all the attribute values

are created. We denote the bigrams with $\beta_1, \dots, \beta_{2187}$ according to decreasing frequency. Let $T$ be the total number of records in the training data set and $t_{ij}$ the number of records that contain both bigram $\beta_i$ and bigram $\beta_j$. Then the matrix $E = (e_{ij})_{i,j=1,\dots,2187}$ is defined via
$$e_{ij} = \begin{cases} t_{ij}/T & \text{if } i \neq j, \\ 0 & \text{if } i = j. \end{cases}$$

The matrix D is formed in a similar way on the ba-

sis of joint appearances of atoms in the Bloom ﬁlters.

Let N be the number of Bloom ﬁlters for which atoms

have been extracted. We denote the number of Bloom filters that contain both atom $\alpha_i$ and atom $\alpha_j$ by $b_{ij}$. The matrix $D = (d_{ij})_{i,j=1,\dots,2187}$ is defined through
$$d_{ij} = \begin{cases} b_{ij}/N & \text{if } i \neq j \text{ and } i, j \leq 1776, \\ 0 & \text{if } i = j \text{ or } \max(i, j) > 1776. \end{cases}$$
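Both matrices are co-occurrence matrices and could be computed by the same routine. In the sketch below the function name `cooccurrence_matrix` is hypothetical, items are represented by their 0-based rank (0 for the most frequent bigram or atom), and the four toy sets stand in for the bigram sets of training records or the atom sets of Bloom filters.

```python
from itertools import combinations


def cooccurrence_matrix(item_sets: list, size: int) -> list:
    """m[i][j] = share of sets containing both item i and item j; zero diagonal."""
    total = len(item_sets)
    counts = [[0] * size for _ in range(size)]
    for items in item_sets:
        for i, j in combinations(sorted(set(items)), 2):
            counts[i][j] += 1
            counts[j][i] += 1
    return [[c / total for c in row] for row in counts]


# Four toy records over four items; item 3 never occurs.
toy_sets = [{0, 1, 2}, {0, 1}, {0, 2}, {1}]
m = cooccurrence_matrix(toy_sets, size=4)
```

For $D$, choosing `size` as 2,187 while only 1,776 atoms occur leaves the remaining rows and columns at zero, which matches the second case of the definition above.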

The procedure suggested by Jakobsen which was

described above can now directly be applied to the

matrices D and E:

OPTIMIZATION ALGORITHM.
Input: $D$, $E$ as defined in section 4.3
Output: $\sigma_{\mathrm{opt}} \in S_{2187}$ minimizing $f(\sigma) = \sum_{i,j} |d_{\sigma(i)\sigma(j)} - e_{ij}|$

    σ_opt(i) = i ∀i                          ▷ Initialization
    min ← f(σ_opt)
    a, b ← 1
    repeat
        σ ← σ_opt
        a ← a + 1
        if a + b ≤ 2187 then
            σ(a) ← σ_opt(b), σ(b) ← σ_opt(a)
        else
            a ← 1, b ← b + 1
        if f(σ) < f(σ_opt) then              ▷ Update
            min ← f(σ)
            σ_opt ← σ
            a, b ← 1
    until b = 2187
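A swap-based hill climbing of this kind can be sketched in Python. The sketch is an illustration rather than the implementation used for the results: it swaps the entries at positions $a$ and $a + b$, which is the swap order we assume from Jakobsen's description, it recomputes $f$ from scratch for every candidate, and it uses 0-based indices. The function names and the $3 \times 3$ toy matrices are hypothetical.

```python
def objective(d, e, sigma):
    """f(sigma) = sum over i, j of |d[sigma[i]][sigma[j]] - e[i][j]|."""
    n = len(sigma)
    return sum(abs(d[sigma[i]][sigma[j]] - e[i][j]) for i in range(n) for j in range(n))


def optimize(d, e):
    """Hill climbing over permutations by pairwise swaps, nearest positions first."""
    n = len(e)
    best = list(range(n))              # identity: i-th atom assigned to i-th bigram
    best_value = objective(d, e, best)
    a, b = 0, 1                        # candidate swap: positions a and a + b
    while b < n:
        if a + b >= n:                 # distance b exhausted, try the next distance
            a, b = 0, b + 1
            continue
        cand = best[:]
        cand[a], cand[a + b] = cand[a + b], cand[a]
        value = objective(d, e, cand)
        if value < best_value:         # improvement: accept and restart the scan
            best, best_value = cand, value
            a, b = 0, 1
        else:
            a += 1
    return best, best_value


e = [[0.0, 0.5, 0.3],
     [0.5, 0.0, 0.1],
     [0.3, 0.1, 0.0]]
hidden = [1, 2, 0]                     # hidden[i]: index of the atom that encodes bigram i
d = [[0.0] * 3 for _ in range(3)]
for i in range(3):
    for j in range(3):
        d[hidden[i]][hidden[j]] = e[i][j]
sigma_opt, value = optimize(d, e)
```

In the toy example the search recovers the hidden assignment with objective value zero. On real data the procedure only reaches a local minimum, and for matrices of size 2,187 an incremental update of $f$ after each swap would be needed to keep the running time acceptable.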

The progress of the optimization algorithm is il-

lustrated by means of Figure 5.

The result of the algorithm will be the final assignment between atoms and bigrams defined by a permutation $\sigma_{\mathrm{opt}} \in S_{2187}$ and the assignment rule $\alpha_{\sigma_{\mathrm{opt}}(i)} \mapsto \beta_i$. This assignment is used to reconstruct the original

bigram sets encrypted in the Bloom ﬁlters.

For example, the bigrams ER, R, CH, N, HE, SC, S, E, HE, K, RL, AR, Z, ON, SI, F, LS, HW, SO, RU, UR, IM, KA, MO, AV, FI, HH, SR, UZ and MR were assigned to the Bloom filter No. 850.

Figure 5: Progress of the optimization algorithm for our data set. The initial value of the objective function is 370.99 and 2,812 updating steps were performed. The final value of the objective function $f(\sigma_{\mathrm{opt}})$ was equal to 168.5.

In the following section we describe how attribute

values were reassembled from the reconstructed bi-

gram sets.

4.4 Reconstruction of Attribute Values

In order to reconstruct the original attribute values of

the records, we separated the bigrams belonging to

different identiﬁers for each Bloom ﬁlter.

In the example of Bloom filter No. 850, we obtained the bigrams N, S, ON, SI, SO, IM, MO for the forename identifier, the bigrams ER, R, SC, HE, Z, F, IS, UR, FI, MR for the surname identifier and finally the bigrams HE, E, RL, AR, LS, HW, RU, KA, UH, HH, SR, UZ

for the location identiﬁer. From this list it is already

possible to guess the original identiﬁer values at ﬁrst

glance.

Our fully automated approach to reconstructing the original identifier values was to compare the obtained bigram sets with a list of bigram sets generated from reference lists of forenames, surnames and locations. For Bloom filter No. 850, for example, an adversary would correctly conclude that this Bloom filter encrypts a record belonging to the person Simon Fischer from the German city Karlsruhe.
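The comparison against reference lists can be sketched as a nearest-neighbour search over bigram sets. The sketch below makes several assumptions that the text does not confirm: the Dice coefficient as similarity measure, the underscore as padding character, and a small made-up reference list. The function names `bigrams`, `dice` and `best_match` are likewise hypothetical.

```python
def bigrams(value, pad="_"):
    """Set of bigrams of the upper-cased value, padded on both sides.

    The padding character is an assumption made for this sketch.
    """
    s = pad + value.upper() + pad
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(x, y):
    """Dice coefficient of two sets (0.0 if both are empty)."""
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def best_match(recovered, reference_values):
    """Reference value whose bigram set is most similar to `recovered`."""
    return max(reference_values, key=lambda v: dice(recovered, bigrams(v)))


# A noisy bigram set in the spirit of the forename example above:
# it contains one wrong bigram ("SO"); padded forms are assumed.
recovered = {"_S", "SI", "IM", "MO", "ON", "N_", "SO"}
reference = ["SIMONE", "TIMO", "SIMON", "MONIKA"]  # hypothetical list
print(best_match(recovered, reference))
```

Because the similarity is computed on sets, a few wrongly assigned bigrams lower the score of the correct value only slightly, which is why the linkage step tolerates errors made by the optimization step.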

4.5 Results

Using the approach described above, we were able to reconstruct 59.6% of the forenames, 73.9% of the surnames and 99.7% of the locations correctly. For 44% of the 100,000 records, all identifier values were recovered successfully.
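As a quick check, the unweighted mean of the three per-identifier rates reproduces the overall figure of roughly 77.7% quoted in the conclusion. Reading that figure as an unweighted mean is an assumption, since the text does not state how it was computed:

```latex
\[
\frac{59.6 + 73.9 + 99.7}{3} = \frac{233.2}{3} \approx 77.7
\]
```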

5 CONCLUSION

In this paper we demonstrated a successful, fully automated attack on Bloom filters built from multiple identifiers. We were able to recover approximately 77.7% of the original identifier values. Contrary to the assumption in (Kuzu et al., 2012) and (Niedermeyer et al., 2014) that storing all identifiers in a single Bloom filter makes it more difficult to attack, we needed only moderate computational effort and publicly available lists of forenames, surnames, and locations to reconstruct the identifiers. Note that the size of the database containing the Bloom filters has no major impact on the attack. For our cryptanalysis it is sufficient to perform the attack on a subset of the given Bloom filters (100,000, as in our example, should be adequate in most cases). For the remaining Bloom filters it is then sufficient to check which atoms they contain and to reconstruct the attribute values, since most assignments of atoms to bigrams are already known. Thus, the time needed for cryptanalysis is linear in the number of input Bloom filters. The time needed for the detection of atoms is $O(L^2)$, since there are L possible values for each of the hash functions f and g in equation (1). Furthermore, the detection of atoms could easily be parallelized to make the computation faster. Values of L significantly larger than L = 1,000, as considered in this paper, would also have negative effects on the time needed for performing the linkage between two databases (note that in the large-scale study reported in (Randall et al., 2014) a Bloom filter length of only 100 was considered). Thus, the most time-consuming step in our cryptanalysis should be the optimization algorithm presented in subsection 4.3. Indeed, in the chosen parameter setup this procedure took about 402 minutes on a notebook with a 2.80 GHz Intel Core processor running Ubuntu 14.04 LTS.
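The decoding of the remaining Bloom filters from an already known atom-to-bigram assignment can be sketched as a containment check per atom. The representation below is an assumption made for illustration: Bloom filters and atoms are Python integers used as bit masks, the filter length is tiny, and the atoms and their bigrams are made up. The function name `decode` is hypothetical.

```python
def decode(bloom, atom_to_bigram):
    """Bigrams whose atom is fully contained in the Bloom filter.

    `bloom` and the keys of `atom_to_bigram` are integers interpreted as
    bit masks; an atom is contained if all of its bits are set in `bloom`.
    """
    return {bigram for atom, bigram in atom_to_bigram.items()
            if atom & bloom == atom}


# Hypothetical atoms for a toy filter of length 7 (two bits per bigram).
atom_to_bigram = {
    0b0000011: "SI",
    0b0001100: "IM",
    0b0110000: "MO",
    0b1000001: "ON",
}

# A Bloom filter that encodes exactly the bigrams SI and IM.
bloom = 0b0000011 | 0b0001100
print(decode(bloom, atom_to_bigram))
```

Each filter is handled with a single pass over the known atoms, independent of all other filters, which is consistent with the linear dependence on the number of input Bloom filters stated above. In a real setting an atom can also be contained by chance, so the decoded set may include some wrong bigrams.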

To sum up, we do not recommend the use of Bloom filters built from one or more identifiers and generated with the double hashing scheme in applications where high security standards are required. However, we applied our attack in a very specific scenario, since the generated databases were encrypted using the double hashing scheme. Thus, there are options for improving the setting.

For example, Niedermeyer et al. (Niedermeyer et al., 2014) proposed several methods, such as fake injections, salting or randomly selected hash values, to harden the Bloom filters. Hence, we are confident that methods like these show promise in preventing attacks like the one presented in this paper.

ACKNOWLEDGEMENTS

Research of both authors was ﬁnancially supported

by the research grant SCHN 586/19-1 of the Ger-

man Research Foundation (DFG) awarded to the head

of the Research Methodology Group, Rainer Schnell.

We thank him and the three anonymous reviewers for

their helpful comments.

REFERENCES

Bloom, B. H. (1970). Space/time trade-offs in hash coding

with allowable errors. Communications of the ACM,

13(7):422–426.

Herzog, T. N., Scheuren, F. J., and Winkler, W. E.

(2007). Data Quality and Record Linkage Techniques.

Springer, New York.

HEALTHINF2015-InternationalConferenceonHealthInformatics

Jakobsen, T. (1995). A fast method for the cryptanalysis of

substitution ciphers. Cryptologia, 19(3):265–274.

Jones, M., McEwan, P., Morgan, C. L., Peters, J. R., Good-

fellow, J., and Currie, C. J. (2005). Evaluation of the

pattern of treatment, level of anticoagulation control,

and outcome of treatment with warfarin in patients

with non-valvar atrial ﬁbrillation: a record linkage

study in a large British population. Heart, 91(4):472–

477.

Kirsch, A. and Mitzenmacher, M. (2008). Less hashing,

same performance: Building a better Bloom ﬁlter.

Random Structures & Algorithms, 33(2):187–218.

Kuehni, C. E., Rueegg, C. S., Michel, G., Rebholz, C. E.,

Strippoli, M.-P. F., Niggli, F. K., Egger, M., and

von der Weid, N. X. (2012). Cohort proﬁle: The Swiss

childhood cancer survivor study. International Jour-

nal of Epidemiology, 41(6):1553–1564.

Kuzu, M., Kantarcioglu, M., Durham, E., and Malin, B. (2011). A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. In Fischer-Hübner, S. and Hopper, N., editors, Privacy Enhancing Technologies, volume 6794 of Lecture Notes in Computer Science, pages 226–245. Springer, Berlin.

Kuzu, M., Kantarcioglu, M., Durham, E. A., Toth, C., and

Malin, B. (2012). A practical approach to achieve

private medical record linkage in light of public re-

sources. Journal of the American Medical Informatics

Association, 20(2):285–292.

Newman, T. B. and Brown, A. N. (1997). Use of commer-

cial record linkage software and vital statistics to iden-

tify patient deaths. Journal of the American Medical

Informatics Association, 4(3):233–237.

Niedermeyer, F., Steinmetzer, S., Kroll, M., and Schnell, R. (2014). Cryptanalysis of basic Bloom filters used for privacy preserving record linkage. Working Paper NO.WP-GRLC-2014-04, German Record Linkage Center, Nürnberg.

Ofﬁce for National Statistics (2013). Beyond 2011: Match-

ing anonymous data. Methods & Policies M9, ONS,

London.

Randall, S. M., Ferrante, A. M., Boyd, J. H., Bauer, J. K.,

and Semmens, J. B. (2014). Privacy-preserving record

linkage on large real world datasets. Journal of

Biomedical Informatics.

Randall, S. M., Ferrante, A. M., Boyd, J. H., and Semmens,

J. B. (2013). The effect of data cleaning on record

linkage quality. BMC Medical Informatics and Deci-

sion Making, 13(64).

Rocha, M. C. N. (2013). Vigilância dos Óbitos Registrados com Causa Básica Hanseníase. Master thesis, Universidade de Brasília, Brasília.

Schnell, R., Bachteler, T., and Reiher, J. (2009). Privacy-

preserving record linkage using Bloom ﬁlters. BMC

Medical Informatics and Decision Making, 9(41):1–

11.

Schnell, R., Bachteler, T., and Reiher, J. (2011). A novel error-tolerant anonymous linking code. Working Paper NO.WP-GRLC-2011-02, German Record Linkage Center, Nürnberg.

Schnell, R., Richter, A., and Borgs, C. (2014). Performance

of different methods for privacy preserving record

linkage with large scale medical data sets. Presenta-

tion at International Health Data Linkage Conference,

Vancouver.

Van Den Brandt, P. A., Schouten, L. J., Goldbohm, R. A.,

Dorant, E., and Hunen, P. M. H. (1990). Develop-

ment of a record linkage protocol for use in the Dutch

cancer registry for epidemiological research. Interna-

tional Journal of Epidemiology, 19(3):553–558.

AutomatedCryptanalysisofBloomFilterEncryptionsofHealthRecords