subspace of features and restricts the expansion of
clusters to a priori constraints. Incorporating a priori knowledge into the clustering process can significantly improve the results and align the outcome with the objective of the data analysis, as demonstrated in Section 4. Our approach combines techniques from subspace, correlation, and constrained clustering. Specifically, we introduce two user-defined parameters to the original DBSCAN algorithm: one defines the dimensions of the subspace in which density-based clusters are discovered, and one defines the dimensions of the subspace in which constraints are applied to the cluster expansion. Further, we modify the cluster expansion step of the original DBSCAN algorithm so that it is restricted by these user-defined constraints. Our validation of the algorithm on an experimental and a real-world dataset demonstrates that it is especially suited for spatio-temporal data, where one subspace of features defines the spatial extent of the data and another captures correlations between features.
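To make the modification concrete, the following Python sketch shows one plausible reading of the constrained expansion step. The function name, the parameter names, and the maximum-deviation constraint (`constraint_tau`) are illustrative assumptions for exposition, not the paper's actual notation or constraint model: neighborhood queries are evaluated only on the density subspace, and a neighbor joins a cluster only if it additionally satisfies the constraint on the constraint subspace.

```python
import numpy as np

def constrained_subspace_dbscan(X, eps, min_pts,
                                density_dims, constraint_dims, constraint_tau):
    """Sketch of a DBSCAN variant with two subspace parameters.

    density_dims    -- feature indices used for the eps-neighborhood query
    constraint_dims -- feature indices the expansion constraint is checked on
    constraint_tau  -- hypothetical per-dimension maximum-deviation threshold
    """
    n = len(X)
    labels = np.full(n, -1)  # -1 marks unvisited / noise
    cluster_id = 0

    def neighbors(i):
        # eps-range query restricted to the density subspace
        d = np.linalg.norm(X[:, density_dims] - X[i, density_dims], axis=1)
        return np.where(d <= eps)[0]

    def satisfies_constraint(p, j):
        # Illustrative constraint: j may be reached from p only if both
        # points deviate by at most constraint_tau in the constraint subspace.
        return np.all(np.abs(X[p, constraint_dims] - X[j, constraint_dims])
                      <= constraint_tau)

    for i in range(n):
        if labels[i] != -1:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            continue                      # not a core point (for now)
        labels[i] = cluster_id
        queue = [(i, j) for j in seeds]   # (reached-from, candidate) pairs
        while queue:
            p, j = queue.pop()
            # Modified expansion step: the neighbor joins the cluster only
            # if it also satisfies the constraint w.r.t. the expanding point.
            if labels[j] != -1 or not satisfies_constraint(p, j):
                continue
            labels[j] = cluster_id
            j_nb = neighbors(j)
            if len(j_nb) >= min_pts:      # j is a core point; keep expanding
                queue.extend((j, k) for k in j_nb)
        cluster_id += 1
    return labels
```

For spatio-temporal data of the kind described above, `density_dims` would typically select the spatial coordinates and `constraint_dims` the correlated measurement features, so that spatially dense regions are only merged where the measurements agree within the threshold.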
In the future, we plan to evaluate different constraints in terms of their feasibility and the overhead they introduce relative to the improvement in clustering quality, and to propose a machine-learning-based selection of suitable constraints according to the inherent structure of the data. In addition, we plan to work on an optimized implementation of the algorithm that allows us to provide additional runtime measurements and detailed comparison studies with other algorithms in the fields of subspace, correlation, and constrained clustering.