the application of the Reduce function to a single key
and its associated values). For each key, the Reduce
function checks whether the key is associated with
two values. If so, i.e., if tuple t is present in both
relations, the public cloud produces the pair (−,t) and
sends it to the user. The dash “−” denotes the empty
value; we use it to remain consistent with the key-value
result form required by the MapReduce paradigm. Hence,
the tuples received by the user are exactly those that
belong to both relations. Since the key is irrelevant at
the end of the protocol, we often omit it. We illustrate
this approach with the following example, which
considers three relations.
Example: We consider three relations: NSA,
GCHQ, and Mossad. Each relation is owned by its
respective data owner. The three relations share the
same schema, composed of a single attribute, namely
“Suspect’s ID”. They are defined as follows: NSA =
{F654, U840, X098}, GCHQ = {F654, M349, P027},
and Mossad = {F654, M349, U840}. An external user,
called Interpol, wants to receive the intersection of
these three relations denoted Interpol. We illus-
trate the execution of intersection computation with
MapReduce for this setting in Figure 1. First, each
data owner outsources its relation to the public
cloud. Then, the public cloud runs the map function
on each relation and sends the output to the master
controller, which sorts the key-value pairs by key and
sends all pairs sharing the same key to the same
reducer. In our example, we obtain five reducers since
there are five distinct suspect identities. The reducer
associated with the key F654 has three values since the
identity F654 is present in the three relations NSA,
GCHQ, and Mossad. The reducers associated with the
keys M349 and U840 each have two values, since M349 is
present only in GCHQ and Mossad and U840 only in NSA
and Mossad. The remaining reducers (keys P027 and
X098) have a single value since the corresponding
suspect's identity is present in only one relation. For
each reducer, the public cloud runs the reduce function
and sends the pair (−, ID) to the user if the suspect's
identity ID is present in all three relations;
otherwise, the public cloud sends nothing. In our
example, the user Interpol receives only the pair
(−,F654), since F654 is the only identity present in
the three relations NSA, GCHQ, and Mossad.
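The example above can be replayed with a minimal single-process sketch in Python. It is an illustration only, not a Hadoop job: the function names (`map_relation`, `shuffle`, `reduce_key`) are hypothetical, the map step is assumed to emit the pair (t, t) for each tuple t as in the standard formulation, and each relation is assumed to be a set (no duplicate tuples), so that counting values per key counts relations.

```python
from collections import defaultdict


def map_relation(relation):
    """Map step (assumed form): emit the pair (t, t) for each tuple t."""
    return [(t, t) for t in relation]


def shuffle(pairs):
    """Master controller: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_key(key, values, n):
    """Reduce step: emit (empty, t) only if t occurs in all n relations.

    None plays the role of the empty value written "−" in the text.
    """
    return [(None, key)] if len(values) == n else []


# Relations from the example (each one is a set of suspect IDs).
NSA = {"F654", "U840", "X098"}
GCHQ = {"F654", "M349", "P027"}
MOSSAD = {"F654", "M349", "U840"}
relations = [NSA, GCHQ, MOSSAD]

pairs = [pair for rel in relations for pair in map_relation(rel)]
groups = shuffle(pairs)  # one entry per reducer

result = []
for key in sorted(groups):
    result.extend(reduce_key(key, groups[key], len(relations)))

print(len(groups), "reducers")
print(result)
```

Only the length of each value list matters to the reduce step, which is why the value carried with each key can be anything in this plaintext setting.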
1.2 Problem Statement
We assume n + 2 parties: n data owners, the public
cloud, and the external user (simply referred to as the
user in the following). Each data owner is trusted
(i.e., it dutifully follows the protocol and does not
collude with any other party) and outsources a relation
$R_i$, with $i \in \llbracket 1, n \rrbracket$, to the
public cloud, denoted C. We denote by $\mathcal{R}_i$
the owner of the relation $R_i$ for
$i \in \llbracket 1, n \rrbracket$. A user, denoted U,
who does not know the individual relations $R_i$, is
authorized to query the intersection of these n
relations.
We assume that the public cloud is semi-honest
(Lindell, 2017), i.e., it dutifully executes the
computation task but tries to learn as much information
as possible about the relations $R_i$ and their
intersection.
In the original protocol (Leskovec et al., 2014), tuples
of each relation are not encrypted, hence the public
cloud learns all the content of each relation and the
result of the intersection that it sends to the user as
illustrated in Figure 1. To preserve data owners’ pri-
vacy, the cloud should not learn any plain input data,
contrary to what happens for the original protocol.
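To make this leakage concrete, the following sketch models a semi-honest cloud that follows the plaintext protocol faithfully while recording everything it observes. The class `SemiHonestCloud` and its methods are hypothetical names introduced for illustration, and the data reuses the suspect IDs from the example; the sketch only shows what the cloud's view contains, not how to protect it.

```python
class SemiHonestCloud:
    """Executes the task as specified, but keeps a transcript of its view."""

    def __init__(self):
        self.view = []  # everything the cloud has observed so far

    def receive(self, owner, relation):
        # Tuples arrive unencrypted, so the cloud sees them as they are.
        self.view.append(("input", owner, frozenset(relation)))

    def intersect(self):
        inputs = [data for kind, _, data in self.view if kind == "input"]
        result = frozenset.intersection(*inputs)
        # The result sent to the user is also visible to the cloud.
        self.view.append(("output", "user", result))
        return result


cloud = SemiHonestCloud()
cloud.receive("NSA", {"F654", "U840", "X098"})
cloud.receive("GCHQ", {"F654", "M349", "P027"})
cloud.receive("Mossad", {"F654", "M349", "U840"})
result = cloud.intersect()

# Every ID the cloud has seen, including those outside the intersection.
leaked = set().union(*(data for _, _, data in cloud.view))
print(sorted(result))
print(sorted(leaked))
```

The gap between `leaked` and `result` is exactly the information the problem statement asks to hide from the cloud.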
Moreover, we assume that the public cloud can
collude with the user, i.e., they share all their
respective private information. We require that the
user who queries the intersection of these n relations
learn nothing other than this intersection, even in
case of collusion with the public cloud.
1.3 Contributions
We revisit the standard protocol for the computa-
tion of intersection with MapReduce (Leskovec et al.,
2014) and propose a new protocol called SI (for Se-
cure Intersection) that satisfies our aforementioned
problem statement. More precisely:
• Our protocol SI guarantees that the user who
queries the intersection of the n relations learns
only the final result. Moreover, the public cloud
learns nothing about the input data of the data
owners beyond the cardinality of each relation and
of the intersection. SI also meets these
requirements when the user and the public cloud
collude. The security proof of our protocol is given
in the extended version available online¹.
• To show the practical scalability of SI, we present
experimental results using the MapReduce open-
source implementation Apache Hadoop 3.2.0.
• Our protocol SI is efficient in terms of both
computation and communication. The computation
overhead is linear in the number of tuples per
relation, while the communication complexity is the
same as in the standard protocol (Leskovec et al.,
2014). Our technique is based on classical
cryptographic tools such as pseudo-random
functions, asymmetric
¹ https://hal.archives-ouvertes.fr/hal-02129141