ciphertext values is necessary for linear regression. Libraries
such as HElib (https://github.com/shaih/HElib) offer this
operation, but the message and key sizes as well as the running time
are large. For example, one could apply the method
of Encrypted X&y with θ encrypted. In this case,
the most costly operation per gradient descent itera-
tion step is the multiplication of an encrypted n-by-n
matrix with an encrypted vector of length n. Imple-
menting this as proposed in (Halevi and Shoup, 2014),
gives a lower bound of the running time per iteration
of 25s for CBM and 8s for CCPP with HElib’s default
configuration for 32-bit plaintext integers. Thus, this
method is at least 10,400 times slower than plaintext
operations for CCPP and at least 178,000 times slower for CBM.
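For orientation, the iteration whose cost we bound has the following plaintext form; the precomputation A = X^T X and b = X^T y, the step size, and all names are our illustrative assumptions, not the exact update from the implementation:

```python
import numpy as np

def gd_step(A, b, theta, alpha):
    """One gradient-descent step for least squares on the precomputed
    normal-equation form A = X^T X, b = X^T y.
    Fully encrypted, the product A @ theta alone takes n*n
    ciphertext-ciphertext multiplications (both operands encrypted),
    which dominates the cost of each iteration."""
    return theta - alpha * (A @ theta - b)

# Toy usage with n = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
A, b = X.T @ X, X.T @ y
theta = np.zeros(3)
for _ in range(500):
    theta = gd_step(A, b, theta, alpha=1e-3)
assert np.allclose(theta, [1.0, -2.0, 0.5], atol=1e-4)
```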
With a naive encoding of numbers (e.g., HElib’s
current encoding), around 9 GB of encrypted data
would need to be sent for the training task with CCPP.
Different methods to compute the inverse of a matrix
would need to be considered to decrease the commu-
nication cost. These results clearly show the substan-
tial difference in performance when either X or θ is
left in plaintext as opposed to encrypting X, y, and θ.
5 RELATED WORK
Privacy-preserving techniques for outsourcing ma-
chine learning tasks received a lot of attention in
a variety of scenarios. In this section, we discuss
the most closely related approaches for regression.
To the best of our knowledge, existing work em-
ploys either protocols with additional parties, such as
two-server or multi-party computation solutions under
non-collusion assumptions, e.g., (Damgard et al.,
2015; Du et al., 2004; Hall et al., 2011; Karr et al.,
2009; Nikolaenko et al., 2013; Peter et al., 2013;
Samet, 2015), or protocols based on fully homomor-
phic encryption, e.g., (Graepel et al., 2012; Bost et al.,
2014).
Nikolaenko et al. consider the scenario where both
the dependent and independent variables are confi-
dential and the model is computed in plaintext (Niko-
laenko et al., 2013). They propose a two-server solu-
tion for ridge regression using the partially homomor-
phic Paillier cryptosystem (Paillier, 1999) and garbled
circuits (Goldwasser et al., 1987; Yao, 1986). Under
the assumption that the two servers do not collude,
they provide methods for the parameter-free Cholesky
decomposition to compute the pseudo-inverse. On
the same data sets and on data sets of similar di-
mensions, their approach can take 100 to 1,000 times
longer than ours, even though they use shorter keys.
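For reference, the plaintext computation their protocol realizes amounts to the sketch below (lam denotes the ridge parameter; names and the sanity check are ours). That Cholesky avoids pivoting, i.e., data-dependent control flow, is one reason it fits a computation fixed in advance as a circuit:

```python
import numpy as np

def ridge_cholesky(X, y, lam):
    """Solve (X^T X + lam*I) theta = X^T y via the factorization A = L L^T.
    On a symmetric positive-definite A, Cholesky needs no pivoting, so
    the control flow is data-independent -- convenient when the whole
    computation must be fixed in advance as a circuit."""
    n = X.shape[1]
    A = X.T @ X + lam * np.eye(n)
    b = X.T @ y
    L = np.linalg.cholesky(A)        # lower-triangular factor
    z = np.linalg.solve(L, b)        # forward substitution
    return np.linalg.solve(L.T, z)   # back substitution

# Sanity check against a direct solve.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 4)), rng.normal(size=50)
assert np.allclose(ridge_cholesky(X, y, 0.1),
                   np.linalg.solve(X.T @ X + 0.1 * np.eye(4), X.T @ y))
```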
Other solutions for privacy-preserving computation
with multiple servers include encryption schemes
with trapdoors (Peter et al., 2013) and multi-party
computation schemes on shared data, e.g., (Damgard
et al., 2015; Du et al., 2004; Hall et al., 2011; Karr
et al., 2009; Samet, 2015).
Graepel et al. present an approach enabling the
computation of machine learning functions as long as
they can be expressed as or approximated by a poly-
nomial of bounded degree with leveled homomorphic
encryption (Graepel et al., 2012), using the library
HElib based on the Brakerski-Gentry-Vaikuntanathan
scheme (Brakerski et al., 2012). They focus on binary
classification (linear means classification and Fisher’s
linear discriminant classifier). Moreover, they assume
that it is known for two encrypted training examples
whether they are labeled with the same classification
(without revealing which one it is). In contrast, we
apply simpler encryption methods that are several or-
ders of magnitude faster on the data set BCW.
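To make the bounded-degree constraint concrete, the following sketch evaluates a polynomial by Horner's rule; the depth remark is standard for leveled schemes, and the concrete polynomial is invented for illustration:

```python
def horner_eval(coeffs, x):
    """Evaluate p(x) = coeffs[0] + coeffs[1]*x + ... + coeffs[d]*x**d.
    With x encrypted under a leveled scheme, each loop step costs one
    ciphertext-ciphertext multiplication, so the key must support
    multiplicative depth d; a balanced evaluation tree trades more
    multiplications for depth about log2(d)."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# Degree-3 example polynomial (coefficients invented for illustration).
assert horner_eval([1, 2, 0, 5], 3) == 1 + 2*3 + 5*3**3  # = 142
```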
Bost et al. consider privacy-preserving classification (pre-
dictions but no training) (Bost et al., 2014). They
combine different encryption schemes into building
blocks for the computation of comparisons, argmax,
and the dot product. These building blocks require
messages to be exchanged between the client and the
server, which is not necessary in the computation of
predictions with our algorithms.
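To illustrate why no interaction is needed when the feature vector is public and only the model is encrypted, the following sketch computes an encrypted dot product under textbook Paillier; the toy key size and the simplified encoding (nonnegative, unscaled integers) are our assumptions, not the configuration evaluated in this paper:

```python
import math
import random

# Textbook Paillier with toy primes; real keys need n of >= 2048 bits.
p, q = 1_000_003, 1_000_033
n, n2 = p * q, (p * q) ** 2
g = n + 1                                          # standard generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, n)

def enc(m):
    r = random.randrange(1, n)                  # blinding; gcd(r, n) = 1 w.h.p.
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# Server side: the model coefficients arrive encrypted, the features are
# public. Enc(theta_j)^x_j = Enc(x_j * theta_j), and multiplying ciphertexts
# adds plaintexts, so Enc(<theta, x>) is computed without any round trip.
theta = [3, 1, 4]                               # integer model coefficients
x = [2, 7, 1]                                   # public feature vector
c_pred = 1
for c_j, x_j in zip([enc(t) for t in theta], x):
    c_pred = (c_pred * pow(c_j, x_j, n2)) % n2
assert dec(c_pred) == 3*2 + 1*7 + 4*1           # = 17
```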
6 CONCLUSION
We have proposed methods to train a regression
model and use it for predictions in scenarios where
part of the data and the model are confidential and
must be encrypted. By exploiting the fact that not
everything is encrypted, our methods work with par-
tially homomorphic encryption and thereby achieve
a significantly lower slow-down factor than state-of-
the-art methods applicable to scenarios where every-
thing must be encrypted. We have further presented
an evaluation of our methods on two data sets and
found the times needed to train a model and make
predictions small enough for practical use. Our main
contribution is hence addressing the problem in ways
that enable the use of partially homomorphic encryp-
tion and a single server. To the best of our knowledge,
there is no existing work for scenarios where indepen-
dent variables can be public and the dependent vari-
ables and the model must be encrypted. The trade-offs
among the methods we propose are of interest, since
each method suits different dataset properties.
In this paper, we have provided the details for lin-
ear regression only; however, it is important to note