data clustering. Of note is the 3-D Updatable Dis-
tance Matrix (UDM) introduced in (Almutairi et al.,
2017). However, use of the UDM featured two disad-
vantages: (i) a substantial memory requirement, be-
cause the first two dimensions of the matrix were cor-
related to the number of records in the given dataset
thus limiting the scalability and (ii) the potential for
reverse engineering, given that a UDM is essentially a
(very large) set of linear equations.
Given the above, this paper proposes the idea of
the Secure Chain Distance Matrix (SCDM) which
provides for secure third party data mining us-
ing a proposed Order Preserving Encryption (OPE)
scheme, which can limit recourse to data owners dur-
ing the processing of the data (depending on the na-
ture of clustering) and features none of the mem-
ory requirement and security disadvantages associ-
ated with the UDM concept proposed in (Almutairi
et al., 2017). The SCDM concept has two novel elements.
The first is the chaining mechanism, which means that
the storage requirement, compared with UDMs, is
reduced by a factor equivalent to the number of input
data records (minus one). The second is the proposed
OPE scheme with which the matrix is encoded, which
allows for third party record comparison without the
risk of potential reverse engineering present in the
case of the UDM.
The SCDM concept is fully described and evaluated.
The evaluation is conducted in the context of three
different clustering algorithms (Nearest Neighbour,
DBSCAN and k-Means); however, the SCDM idea
clearly has wider application.
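
The chaining idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that each row of the chain holds the attribute-wise differences between two consecutive records, so that the difference between any two records is recovered by summing the intervening rows. The names build_chain, attribute_differences and distance are hypothetical, values are kept in plaintext (the proposed OPE encoding is omitted), and the Manhattan distance is an arbitrary choice; the actual SCDM construction is the one given in Section 4.

```python
from typing import List

Record = List[int]


def build_chain(records: List[Record]) -> List[Record]:
    """Assumed layout: chain[i][j] = records[i][j] - records[i + 1][j]."""
    return [
        [a - b for a, b in zip(records[i], records[i + 1])]
        for i in range(len(records) - 1)
    ]


def attribute_differences(chain: List[Record], x: int, y: int) -> Record:
    """Recover records[x][j] - records[y][j] by summing chain rows."""
    n_attr = len(chain[0]) if chain else 0
    lo, hi = min(x, y), max(x, y)
    # Summing rows lo..hi-1 telescopes to records[lo] - records[hi].
    diffs = [sum(chain[i][j] for i in range(lo, hi)) for j in range(n_attr)]
    return diffs if x <= y else [-d for d in diffs]


def distance(chain: List[Record], x: int, y: int) -> int:
    """Manhattan distance between records x and y, using only the chain."""
    return sum(abs(d) for d in attribute_differences(chain, x, y))


records = [[2, 9], [5, 4], [11, 6], [7, 1]]
chain = build_chain(records)
chain_entries = sum(len(row) for row in chain)  # (n - 1) * a values stored
```

Under this assumed layout the chain for n records with a attributes holds (n - 1) x a values, and a third party holding only the chain can still obtain the difference between any pair of records.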
The rest of this paper is structured as follows. Sec-
tion 2 provides a review of related research. Sec-
tion 3 presents the data encryption schemes used to
provide for proposed secure clustering methods. The
proposed Secure Chain Distance Matrix (SCDM) idea
is then detailed in Section 4. The utilisation of the
SCDM concept, in the context of secure data cluster-
ing, is presented in Section 5. Section 6 then reports
on the experiments conducted to evaluate the SCDM
concept and the results obtained (in the context of
secure data clustering). The paper is concluded in
Section 7, with a summary of the main findings and
suggestions for future work.
2 PREVIOUS WORK
This section presents a review of previous work on
secure data clustering that uses HE schemes as a data
confidentiality preservation method. The main chal-
lenge of HE-based privacy preserving data clustering
(and other forms of data mining), is that HE schemes
support only a limited number of operations. Several
solutions have been proposed to address this challenge,
mostly in the context of collaborative data clustering,
whereas the work presented in this paper is directed
at third party data clustering. These solutions can be
broadly categorised into: (i) involving data owners
when unsupported operations are required, and (ii)
utilising the concept of “secret sharing” to delegate a
key and operations to semi-honest and non-colluding
parties that collaboratively perform operations on the
data owners’ behalf. Both have limitations in terms of
communication complexity and security threats.
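
The limited-operations challenge noted above can be made concrete with a toy example. The sketch below uses the Paillier cryptosystem, chosen here only because it is a widely known additively homomorphic scheme; it is an assumption of this illustration and is not claimed to be the scheme adopted in any of the cited work. The primes are deliberately tiny, so the code is insecure and for illustration only. Ciphertexts can be added, and multiplied by a plaintext constant, without the secret key; no comparable ciphertext-only operation is available for comparing two encrypted values, which is the kind of unsupported operation that requires recourse to data owners.

```python
import math
import random

# Toy key material: tiny primes, for illustration only (not secure).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
G = N + 1                                           # standard simple generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)                                # modular inverse (Python 3.8+)


def encrypt(m: int) -> int:
    """Paillier encryption of m (0 <= m < N) with fresh randomness."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Paillier decryption; requires the secret values LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def add(c1: int, c2: int) -> int:
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % N2


def scale(c: int, k: int) -> int:
    """Homomorphic multiplication of the plaintext by a public constant k."""
    return pow(c, k, N2)


total = decrypt(add(encrypt(123), encrypt(456)))
scaled = decrypt(scale(encrypt(7), 6))
```

A third party could, for example, sum encrypted attribute values towards a centroid in this way, but it cannot decide from the ciphertexts alone which of two encrypted distances is the smaller.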
The main feature of the first category is the main-
tenance of data confidentiality by allowing a third
party to only manipulate cyphertexts using HE prop-
erties (no access to any secret key). In this case, in
the context of data clustering, data owner participa-
tion becomes a necessity. In some cases, the majority
of the work is done by data owners. For example,
a number of authors have proposed mechanisms for
k-means clustering using Secure Multi-Party Compu-
tation (SMPC), where data owners repeatedly cluster
their own data and only share encrypted data centroids
so that an eventual global clustering can be arrived at
(Jha et al., 2005; Mittal et al., 2014). A similar idea
is used in (Tong et al., 2018) to implement DBSCAN
where data owners independently apply DBSCAN on
their local data. The resulting boundary records and
their labels are then shared (in plaintext) with the third
party who then determines global boundary records
which are returned to the individual data owners so that
they can update their local clusters. However, shar-
ing boundary data records in plaintext form presents
a security threat. Secure nearest neighbour clustering
is presented in (Shaneck et al., 2009) using SMPC
primitives: a secure product for distance calculation
and Yao’s millionaires’ protocol for data comparison.
A significant drawback of these proposed solutions
is that they introduce a computation/communication
overhead because of the amount of data owner partic-
ipation required.
In (Erkin et al., 2009; Liu et al., 2014; Almutairi
et al., 2017; Rahman et al., 2017) the basic idea was
for the third party to do as much of the clustering as
possible (centroid calculation, data aggregation and
so on), using the properties of a selected HE scheme, and
involve data owners only when the properties of the
particular HE scheme used do not support the desired
analysis. For example, in the case of (Erkin et al.,
2009), in the context of collaborative clustering, the
adopted HE scheme does not support record similar-
ity checking, thus this is done by a randomly selected
data owner. The number of data owner participation
instances is given by n × |C| × i, where n is the num-