An Information Theory Subspace Analysis Approach with Application to
Anomaly Detection Ensembles
Marcelo Bacher, Irad Ben-Gal and Erez Shmueli
Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv, Israel
Keywords:
Subspace Analysis, Rokhlin, Ensemble, Anomaly Detection.
Abstract:
Identifying anomalies in multi-dimensional datasets is an important task in many real-world applications. A
special case arises when anomalies are occluded in a small set of attributes (i.e., subspaces) of the data and
not necessarily over the entire data space. In this paper, we propose a new subspace analysis approach named
Agglomerative Attribute Grouping (AAG) that aims to address this challenge by searching for subspaces that
comprise highly correlative attributes. Such correlations among attributes represent a systematic interaction
among the attributes that can better reflect the behavior of normal observations and hence can be used to im-
prove the identification of future abnormal data samples. AAG relies on a novel multi-attribute metric derived
from information theory measures of partitions to evaluate the “information distance” between groups of data
attributes. The empirical evaluation demonstrates that AAG outperforms state-of-the-art subspace analysis
methods, when they are used in anomaly detection ensembles, both in cases where anomalies are occluded
in relatively small subsets of the available attributes and in cases where anomalies represent a new class (i.e.,
novelties). Finally, and in contrast to existing methods, AAG does not require any tuning of parameters.
1 INTRODUCTION
Anomaly detection refers to the problem of find-
ing patterns in data that do not conform to an ex-
pected norm behavior. These non-conforming pat-
terns are often referred to as anomalies, outliers, or
unexpected observations, depending on the applica-
tion domains (Chandola et al., 2007). Algorithms for
detecting anomalies are extensively used in a wide va-
riety of application domains such as machinery mon-
itoring (Ge and Song, 2013), sensor networks (Ba-
jovic et al., 2011), and intrusion detection in data net-
works (Jyothsna et al., 2011).
In a typical anomaly detection setting, only nor-
mal or expected observations are available, and con-
sequently, some assumptions regarding the distribu-
tion of anomalies must be made to discriminate nor-
mal from anomalous observations (Steinwart et al.,
2006). Traditional approaches for anomaly detec-
tion (see, e.g., (Aggarwal, 2015)) often assume that
anomalies occur sporadically and are well separated
from the normal data observations, or that anoma-
lies are uniformly distributed around the normal ob-
servations. However, in complex environments, such
assumptions may not hold. For instance, if during
the analysis of the multi-attribute data generated in
a complex system only a few sensors fail to function
normally, only some of the data attributes will be af-
fected. From a data analysis perspective, these mal-
functions can be seen as data samples corrupted by
noise over a subset of data attributes. Consequently,
anomalies in the generated data might only be no-
ticeable in some projections of the data into a lower-
dimensional space, called a subspace, and not in the
entire data space. Also, consider a case where anoma-
lies represent a new, previously unknown, class of
data observations, commonly called novelties (Chan-
dola et al., 2007). Similarly to the malfunctions ex-
ample above, deviations from the original data obser-
vations might only be evident along a subset of at-
tributes. Yet, these attributes will often be correlated
in some sense, and therefore cannot be treated as cor-
rupted by noise.
Based on this observation, ensembles were pro-
posed as a novel paradigm in the anomaly detec-
tion field (Aggarwal and Yu, 2001). Ensembles for
anomaly detection typically follow three general steps
(Lazarevic and Kumar, 2005). First, a set of sub-
spaces is generated (e.g., by randomly selecting sub-
sets of attributes). This step is commonly referred to
as subspace analysis. Then, classical anomaly detec-
tion algorithms are applied on each subspace to com-
pute local anomaly scores. Finally, these local scores
are aggregated to derive a global anomaly score (e.g.,
using voting). Here, we focus on the subspace anal-
ysis stage, which aims at finding a representative
set of subspaces, among a very large number of pos-
sible subspaces, such that anomalies can be identified
effectively and efficiently.
Several methods for subspace analysis have been
proposed in the literature. These methods can be clas-
sified into three broad approaches. The most basic
one is based on a random selection of attributes (see,
e.g., (Lazarevic and Kumar, 2005)). Other methods
search for subspaces by giving anomalous grades to
data samples, thus coupling the search for meaningful
subspaces with the anomaly detection algorithm (see,
e.g., (Müller et al., 2010)). Recent methods search for subspaces comprising highly correlative attributes (e.g., (Nguyen et al., 2013a)).
However, all of the above methods suffer from
one or more of the following limitations: (i) Some
attributes might not be included in the generated set
of subspaces. This might impact the effectiveness
of the ensemble since anomalies might occur any-
where in the data space. (ii) The set of generated sub-
spaces might contain thousands of subspaces which
may make the training and operation phases of the
ensemble computationally prohibitive. (iii) They re-
quire, prior to their execution, the tuning of param-
eters such as the number of subspaces, the maxi-
mal size of each subspace, the number of clusters or
information-theory thresholds.
To address the challenges mentioned above, we
developed the Agglomerative Attribute Grouping
method (AAG) for subspace analysis. Motivated by
previous works, AAG searches for subspaces that
comprise highly correlative attributes. As a gen-
eral measure for attribute association, AAG applies a
metric derived from concepts of information-theory
measures of partitions (see e.g. (Simovici, 2007)). In
particular, AAG makes use of the Rokhlin distance
(Rokhlin, 1967) to evaluate the smallest distance be-
tween subspaces in the case of two attributes, and a
multi-attribute distance, which is proposed here, for
cases where more than two attributes are involved.
Then, AAG applies a variation of the well-known ag-
glomerative clustering algorithm where subspaces are
greedily searched by minimizing the multi-attribute
distance. We also propose a pruning mechanism that
aims at improving the convergence time of the pro-
posed algorithm while limiting the size of the sub-
spaces.
Several important characteristics differentiate
AAG from existing state-of-the-art approaches. First,
due to the agglomerative approach in the subspace
search, none of the data attributes is discarded, and attributes are combined in an effective manner to generate the set of subspaces. Second, the set
of subspaces that AAG generates is relatively “com-
pact” in comparison to existing methods for two main
reasons: the use of the agglomerative approach re-
sults in a relatively small number of subspaces, and
the pruning mechanism results in a relatively small
number of attributes in each subspace. Finally, as a
result of combining the agglomerative approach with
the minimization of the suggested distance measure,
AAG does not require any tuning of parameters.
To evaluate the proposed AAG method, we con-
ducted extensive experiments on publicly available
datasets, while using seven different classical and
state-of-the-art subspace analysis methods as bench-
marks. The results of our evaluation show that an
AAG-based ensemble for anomaly detection: (i) out-
performs classical and state-of-the-art subspace anal-
ysis methods when used in anomaly detection ensem-
bles, in cases where anomalies are occluded in small
subsets of the data attributes, as well as in cases where
anomalies represent a new class (i.e., novelties); and
(ii) generates fewer subspaces with a smaller (on aver-
age) number of attributes, in comparison to the bench-
mark approaches, thus, resulting in a faster training
time for the anomaly detection ensemble.
The rest of the paper is organized as follows: Sec-
tion 2 discusses the background and related work.
Section 3 provides a review of relevant theoretical el-
ements whereas in Section 4 we detail the proposed
approach. Section 5 describes the experimental eval-
uation of our technique as well as provides a discus-
sion on the obtained results. Finally, we conclude in
Section 6 and suggest future research directions.
2 BACKGROUND
In Machine Learning applications, anomaly detection
methods aim to detect data observations that consid-
erably deviate from normal data observations (Aggar-
wal, 2015). Well-documented surveys on anomaly
detection techniques are (Markou and Singh, 2003;
Chandola et al., 2007; Maimon and Rockach, 2005;
Pimentel et al., 2014) and (Aggarwal, 2015). While
these techniques are widely used in real-world appli-
cations, they share a major limitation: the underly-
ing assumption that abnormal observations are spo-
radic and isolated with respect to normal data sam-
ples. That is, abnormal observations are usually seen
as the result of additive random noise in the full data
space (Chandola et al., 2007). Under this assumption,
anomalies can often be identified by building a sin-
gle model that describes normal data along all of its
dimensions.
However, in complex environments, such assump-
tions may not hold anymore. That is, in such cases,
abnormal data samples might be occluded in some
combination of attributes, and may only become ev-
ident in lower-dimensional projections, or subspaces.
One of the first approaches that aimed to identify
such relevant subspaces was presented by Aggarwal
and Yu in (Aggarwal and Yu, 2001). Several works
followed (Aggarwal and Yu, 2001) by proposing en-
hanced methods where data observations were ranked
based on many possible subspace projections to de-
cide on their “anomalous grade” (see, e.g., (Kriegel
et al., 2009b; Müller et al., 2010) and (Müller et al., 2011)). These approaches assumed that anomalous
observations are mixed together with normal data
samples, and therefore, the resulting set of subspaces
depends on the anomalous grade of each observation.
Focusing on the search for relevant subspaces, sev-
eral inspiring mechanisms have been presented. For
example, Feature Bagging (FB) (Lazarevic and Ku-
mar, 2005), High Contrast Subspaces (HiCS) (Keller
et al., 2012), Cumulative Mutual Information (CMI)
(Nguyen et al., 2013b) and 4S (Nguyen et al., 2013a).
In all these approaches, the anomaly detection chal-
lenge was divided into three main stages: subspace
analysis, anomaly score computation and score ag-
gregation. Therefore, the task of finding “good” sub-
spaces can be isolated from the anomaly detection
algorithm as well as from the strategy for aggregat-
ing scores that are used at later stages. Although these methods achieve relatively good results, they are not exempt from limitations. For example, some attributes might not be present in the subspaces due to the random selection strategy (see, e.g., (Lazarevic and Kumar, 2005), (Keller et al., 2012)). Furthermore, due to the search strategy applied in (Keller et al., 2012), thousands of subspaces can potentially be generated, which makes the training phase of the ensemble for anomaly detection computationally prohibitive. Finally, the methods in (Nguyen et al., 2013b) and (Nguyen et al., 2013a) require specifying, prior to execution, the set of clusters and the maximal number of attributes within a subspace, respectively. As no prior information regarding anomalies usually exists, this selection strategy might lead the ensemble to a performance deterioration because attributes might be discarded. Subspace analysis has also been addressed
as a clustering problem. In particular, subspace clus-
tering is an extension of traditional clustering that
seeks to find clusters in different subspaces within
a dataset (see, e.g., (Parsons et al., 2004; Kriegel
et al., 2009a), (Deng et al., 2016)). The clusters are
usually described by a group of attributes that con-
tribute the most to the compactness of the data ob-
servations within the subspace. Aggarwal et al. pre-
sented CLIQUE in (Agrawal et al., 1998) as one of
the first methods that aimed to find the exact set of at-
tributes in each cluster. In (Cheng et al., 1999), based
on concepts from (Agrawal et al., 1998), ENCLUS
was presented as a method that searches for subspaces
with low Shannon entropy (see, e.g., (Cover
and Thomas, 2006)). Other algorithms perform the
subspace clustering by assigning a weighting factor to each attribute in proportion to its contribution to
the formation of a particular cluster (see, e.g., FSC
(Gan et al., 2006), EWKM (Jing et al., 2007), and
AFG-k-means (Gan and Ng, 2015)). One of the ma-
jor drawbacks of subspace clustering algorithms is the
challenge of tuning specific parameters for each algo-
rithm, as well as the identification of the correct num-
ber of clusters that the clustering algorithm requires
prior to its execution.
3 INFORMATION THEORY
MEASURES OF PARTITIONS
In this section, we follow (Kagan and Ben-Gal, 2014)
and discuss how to use the informational measures
among partitions of a generic dataset to compute dis-
tances among sets of attributes. To maintain a self-
contained text, we start this section by providing a
brief review of the partitioning concepts as well as
the used notation.
3.1 Preliminaries
Let D be a finite sample space composed of N observations (rows) and p column attributes {A_1, A_2, ..., A_p}, and let χ be a set of partitions of the sample space D. For each partition α = {a_1, a_2, ..., a_K} ∈ χ, K ≤ N, it holds that a_j ∩ a_m = ∅ for j, m = 1, 2, ..., K and j ≠ m. The partitions are defined by the attribute values as follows. For an attribute A_i ∈ {A_1, A_2, ..., A_p}, the elements a ∈ α_i of the corresponding partition α_i = {a_1, a_2, ..., a_K} ∈ χ are the sets of indices of unique values of the attribute A_i. To define the informational distance between the partitions, and so between the attributes, it is necessary to specify a probability distribution induced by the partition. For finite sets, the empirical probability distribution induced by the partition α_i ∈ χ is defined as (Simovici, 2007) p_{α_i} = (|a_1|/|α_i|, |a_2|/|α_i|, ..., |a_K|/|α_i|), where |·| represents the cardinality of a set and Σ_{j=1}^{K} |a_j|/|α_i| = 1.
Then, the Shannon entropy of the random variable induced by the partition α_i is defined as H(A_i) = −Σ_{a_j ∈ α_i} p_{α_i}(a_j) log_2[p_{α_i}(a_j)], where by the usual convention 0 log_2 0 = 0. Let α_i and β_j be two partitions generated by the attributes A_i and A_j, respectively, where α_i = {a_1, a_2, ...} and β_j = {b_1, b_2, ...}. Then, the conditional entropy of the partition α_i with respect to the partition β_j is defined as H(A_i|A_j) = H(A_i, A_j) − H(A_j). It follows that the Rokhlin (Rokhlin, 1967) distance is computed as,

d_R(A_i, A_j) = H(A_i|A_j) + H(A_j|A_i)    (1)
Notice that a direct implementation of the Rokhlin metric between partitions appears in (Kagan and Ben-Gal, 2013) for constructing a search algorithm, and in (Kagan and Ben-Gal, 2014) for creating testing trees. The Rokhlin distance is directly related to the mutual information as d_R(A_i, A_j) = H(A_i, A_j) − I(A_i; A_j) (Kagan and Ben-Gal, 2014). Consequently, (1) can be interpreted as the grade of mutual dependence between two attributes. A small distance value means a small joint entropy and a high mutual information value. That is, the attributes necessarily have to be correlative (Cover and Thomas, 2006).
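As an illustration of (1), the following minimal Python sketch (our own, not code from the paper) estimates the Rokhlin distance between two discrete attributes from their empirical partitions; it uses the identity d_R(A_i, A_j) = 2H(A_i, A_j) − H(A_i) − H(A_j), which follows directly from the conditional-entropy definition above. The pandas-based estimation and the toy column names are assumptions.

```python
import numpy as np
import pandas as pd

def entropy(*columns):
    """Empirical joint Shannon entropy (in bits) of one or more discrete columns."""
    counts = pd.Series(list(zip(*columns))).value_counts().to_numpy()
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def rokhlin_distance(a_i, a_j):
    """d_R(A_i, A_j) = H(A_i|A_j) + H(A_j|A_i) = 2*H(A_i, A_j) - H(A_i) - H(A_j)."""
    return 2.0 * entropy(a_i, a_j) - entropy(a_i) - entropy(a_j)

# Toy example: the two columns induce the same partition, so d_R = 0.
df = pd.DataFrame({"A_i": [0, 1, 0, 1, 0], "A_j": ["R", "G", "R", "G", "R"]})
print(rokhlin_distance(df["A_i"], df["A_j"]))  # -> 0.0
```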
3.2 Multi-attribute Distance
The symmetric difference between the two partitions α_i and β_j is defined as follows: α_i △ β_j = (α_i \ β_j) ∪ (β_j \ α_i), which considers the differing elements of the partitions α_i and β_j (see e.g. (Kuratowski, 1961)). Correspondingly, the Hamming distance (see, e.g., (Simovici, 2007)) between the partitions α_i and β_j is defined as the cardinality |α_i △ β_j| of the set α_i △ β_j. Following a similar analysis, the symmetric difference of three partitions is presented in the Venn diagram of Fig. 1, where the resulting union of sets without the successive intersections is shown in gray. Fig. 1 also shows the information theoretical relationships among the attributes A_i, A_j, and A_k, induced by the partitions α_i, β_j, and γ_k, respectively. I(A_i; A_j) is the mutual information between attributes A_i and A_j, and I(A_i; A_j | A_k) is the conditional mutual information (see e.g. (Cover and Thomas, 2006)).
We then define the multi-attribute distance, d_MA, for three attributes as,

d_MA(A_i, A_j, A_k) = H(A_i | A_j, A_k) + H(A_j | A_i, A_k) + H(A_k | A_i, A_j) + II(A_i; A_j; A_k)    (2)
where II(·) is the multi-variate mutual information that was introduced in the seminal work of McGill (McGill, 1954) as a measure of the higher-order interaction among random variables. The first three terms on the right side of (2) compute the degree of uncertainty among the attributes, whereas the last term computes the shared information among them.

Figure 1: Symmetric difference of three non-empty sets shown in gray, together with the information theoretical relationships among the attributes A_i, A_j, and A_k (the entropies H(A_i), H(A_j), H(A_k), the conditional entropies H(A_i|A_j,A_k), H(A_j|A_i,A_k), H(A_k|A_i,A_j), the conditional mutual informations I(A_i;A_j|A_k), I(A_i;A_k|A_j), and the multi-variate mutual information II(A_i;A_j;A_k)).

The extension of (2) for p attributes is derived from the definition of the symmetric difference for p sets (see, e.g., (Kuratowski, 1961)) as,

d_MA(A) = Σ_{i=1}^{p} H(A_i | A \ A_i) + II(A)    (3)
where A = {A_1, A_2, ..., A_p} denotes the set of attributes in D, A_i ∈ A, and II(A) is the multi-variate mutual information (Jakulin, 2005). We further define the general case when A contains p ≥ 2 attributes by merging (3) and (1) into one generalized multi-attribute distance d, namely,

d(A_i, A_j) = d_R(A_i, A_j)  if |A_i| = |A_j| = 1,
d(A_i, A_j) = d_MA(A_i, A_j)  if |A_i ∪ A_j| > 2    (4)
That is, d(A_i, A_j) is computed using the Rokhlin distance defined in (1) in the case where each subspace comprises a single attribute. Otherwise, Eq. (3) is used. The definition of Eq. (4) allows us to claim:
Lemma 1. d(A_i, A_j) is a metric.
The proof is omitted in this paper due to space considerations. Several benefits of using the proposed d to analyze subspaces are worth mentioning. First, differently from classical and state-of-the-art approaches such as 4S (Nguyen et al., 2013a), we propose searching for subspaces that minimize the proposed metric instead of maximizing information measures such as Total Correlation (Watanabe, 1960). On the one hand, this is possible since the proposed d holds the metric axioms. On the other hand, minimizing a metric avoids the setting of challenging and critical parameters, such as information thresholds. Second, the minimization of the proposed multi-attribute distance tends to delegate the combination of attributes with very low information content or, equivalently, a large number of symbols, to later stages where their influence on the results is less critical. Finally, it will later be empirically shown that the minimization of d tends to generate, on average, a smaller set of subspaces than other approaches, especially in the case of datasets whose attributes have significantly different numbers of values. A direct consequence of this characteristic is, on average, a reduced training time of the ensemble for anomaly detection.
The time complexity of computing (3) is exponential with respect to the number of attributes in D, which makes it prohibitive for real-world scenarios. Moreover, as p grows, the necessary probability distributions become higher dimensional, and hence the estimation of the multi-attribute distance becomes less reliable. Therefore, in this subsection, we propose simple rules to bound the suggested metric. Given two sets of attributes A_i and A_j:
Lemma 2. A_j ⊆ A_i ⇒ d_MA(A_j) ≥ d_MA(A_i).
We omit the corresponding proof due to space considerations. Two immediate results follow: 1) if A_1 ⊆ A_2 ⊆ ... ⊆ A_i then d_MA(A_i) ≤ d_MA(A_2) ≤ d_MA(A_1), and 2) d_MA(A_i) ≤ min_{A_j ⊆ A_i} {d_MA(A_j)}.
The approximation of the multi-attribute distance results in an increase of the distance value over the reduced attribute set with respect to the full subspace. Additionally, approximating d_MA has a direct consequence, namely, that the multi-attribute distance preserves the metric characteristics over the reduced set of attributes, but it ends up being loose with respect to the entire data space. That is, d_MA becomes a pseudo-metric for which the triangle inequality is not always guaranteed. Nevertheless, the approximated d_MA can be applied to compute the information distance within a subspace and consequently be used in the search for highly correlative subspaces, as shown in Section 5.2.
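For concreteness, here is a hedged sketch of the three-attribute distance in (2), expressed through empirical joint entropies; it reuses the entropy helper from the earlier Rokhlin sketch, the sign of the interaction term follows the Venn-diagram reading of Fig. 1, and it is our illustration rather than the authors' implementation.

```python
def d_ma_three(a, b, c):
    """d_MA(A,B,C) = H(A|B,C) + H(B|A,C) + H(C|A,B) + II(A;B;C), as in Eq. (2)."""
    h_a, h_b, h_c = entropy(a), entropy(b), entropy(c)
    h_ab, h_ac, h_bc = entropy(a, b), entropy(a, c), entropy(b, c)
    h_abc = entropy(a, b, c)
    # the three conditional entropies, e.g. H(A|B,C) = H(A,B,C) - H(B,C)
    cond = 3.0 * h_abc - h_bc - h_ac - h_ab
    # triple-overlap (interaction) term of Fig. 1, via inclusion-exclusion
    ii = h_a + h_b + h_c - h_ab - h_ac - h_bc + h_abc
    return cond + ii
```

As a quick sanity check, for three mutually independent attributes the conditional terms reduce to the marginal entropies and the interaction term vanishes, so d_MA equals H(A_i) + H(A_j) + H(A_k).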
4 AGGLOMERATIVE
ATTRIBUTE GROUPING
The pseudo-code of the proposed AAG method is
shown in Algorithm 1. The algorithm receives as in-
put a dataset D composed of N observations and p at-
tributes. The algorithm returns as output a set of sub-
spaces with highly correlated attributes denoted by T .
The algorithm begins by initializing the result set of subspaces T to the empty set (line 1). Then, in line 2, the algorithm creates a set of p subspaces, each of which is composed of a single attribute. This set constitutes the first agglomeration level, and is denoted by S^(1) (lines 2-3). Then, the algorithm iteratively generates the subspaces of agglomeration level t+1, denoted by S^(t+1), by combining subspaces from the current agglomeration level, S^(t) (lines 4-27). Each such iteration begins by updating the result set T to also contain the subspaces from the previous agglomeration level (line 5). Then, in line 6 we initialize the set of subspaces of the next agglomeration level to the empty set. Next, in line 7, we maintain a copy of the current agglomeration level, denoted by S^(t)_0. The rationale behind this step will be explained later. Notice that S^(t)_0, S^(t), and S^(t+1), as well as T, contain the indices of the data attributes in the subspaces, whereas A_i denotes the corresponding projection of the data samples. The algorithm continues by searching for the two subsets in the current agglomeration level that have the lowest multi-attribute distance (line 8), and adds the unified set to the next agglomeration level instead of the two individual subsets (lines 8-11).
In lines 13-25, the algorithm continues to combine subspaces iteratively, until there are no more subsets left in S^(t). However, now, the algorithm checks whether it is better to unify a subset from S^(t) and a subset from S^(t+1), denoted by A_i and A_j, or two subsets from S^(t), denoted by A_i and A_k. The motivation behind this is to avoid merging single subspaces in each agglomeration level and to allow the combination of multiple subspaces. In doing so, we avoid the permanent selection of subspaces with a higher number of attributes to be combined. Once all subspaces have been assigned at agglomeration level t, the algorithm proceeds with subsequent levels of agglomeration (lines 4-27) until no subspace combination is further required (line 4). The AAG algorithm finishes by returning the set of subspaces T at line 28.
Note that in some cases (line 10 and line 18), we choose not to add the unified set; we refer to this decision as pruning, and describe it in detail in the next subsection.
The normalized multi-attribute distance, denoted by ˜d(·), used in lines 8, 14, 15 and 16, is shown in (5).

˜d(A_i, A_j) = d(A_i, A_j) / H(A_i ∪ A_j)    (5)

The normalization factor, i.e. H(A_i ∪ A_j), in (5) allows us to compare subspaces with a different number of attributes. In the general case, the distance computation is done based on Lemma 2, where we select a fixed-size subset of attributes (e.g., three), rather than taking all attributes in the unified set.
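A hedged sketch of the normalized distance (5) between two subspaces follows, reusing the entropy, rokhlin_distance, and d_ma_three helpers from the previous sketches. How the fixed-size subset of Lemma 2 is chosen is not fully specified above, so taking the tightest bound over all 3-attribute subsets of the union is our assumption; disjoint subspaces are assumed, as in Algorithm 1.

```python
from itertools import combinations

def normalized_d(df, subspace_i, subspace_j, max_attrs=3):
    """~d(A_i, A_j) = d(A_i, A_j) / H(A_i U A_j), with d chosen as in Eq. (4)."""
    union = list(subspace_i) + [c for c in subspace_j if c not in subspace_i]
    h_union = entropy(*(df[c] for c in union))
    if len(subspace_i) == len(subspace_j) == 1:
        dist = rokhlin_distance(df[subspace_i[0]], df[subspace_j[0]])
    elif len(union) == 3:
        dist = d_ma_three(*(df[c] for c in union))
    else:
        # Lemma 2: d_MA over any fixed-size subset upper-bounds d_MA of the
        # full union; this sketch takes the tightest such 3-attribute bound.
        dist = min(d_ma_three(df[a], df[b], df[c])
                   for a, b, c in combinations(union, max_attrs))
    return dist / h_union if h_union > 0 else 0.0
```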
To illustrate the operation of Algorithm 1, con-
sider the following example with a dataset D, com-
Table 1: A dataset D.
A1 A2 A3 A4 A5 A6 A7
0 R 1 F a 3 9
1 G 2 E a 3 9
0 R 3 F a 5 25
1 G 4 E a 5 25
0 R 5 F a 7 49
0 G 6 E b 8 64
1 R 7 F b 10 100
0 B 8 E b 10 100
0 B 9 E b 11 121
1 G 10 E a 11 121
prising N = 10 records and p = 7 attributes, as
shown in Table 1.
AAG starts by initializing the set of subspaces of level 1, denoted by S^(1), to {{A_1}, {A_2}, {A_3}, {A_4}, {A_5}, {A_6}, {A_7}} (line 2). Then, AAG searches for a pair of two subspaces in S^(1) that minimizes ˜d(·) (line 8). Going over all 21 possible pairs, and using the upper part of Eq. (4), we find that the pair {A_6} and {A_7} minimizes this distance (note that A_7 = A_6^2 and therefore ˜d({A_6}, {A_7}) ≈ 0). Then, the two subspaces {A_6} and {A_7} are removed from S^(1) (line 9) and the unified subspace {A_6, A_7} is added to the set of subspaces of level 2, denoted by S^(2) (line 11). At this point S^(1) = {{A_1}, {A_2}, {A_3}, {A_4}, {A_5}} and S^(2) = {{A_6, A_7}}.
Next, AAG keeps searching for meaningful subspaces by trying to combine subspaces from S^(1) with subspaces from S^(2) (line 14). Going over all five possible pairs, we find that the pair {A_6, A_7} and {A_1} minimizes the distance with ˜d({A_6, A_7}, {A_1}) = 0.292. Next, the algorithm checks whether {A_1} is closer to other subspaces in S^(1)_0 than to {A_6, A_7} (line 15). As a practical note, notice that all the distances involving {A_1} and other subspaces in S^(1)_0 have already been computed in the previous iteration when {A_6, A_7} was chosen. Such computations can be stored in a look-up table and reduce future computations considerably. We find that the subspace in S^(1)_0 that minimizes the distance is {A_3}. However, because ˜d({A_1}, {A_3}) > ˜d({A_1}, {A_6, A_7}), {A_1} is combined with {A_6, A_7}, yielding the new subspace {A_6, A_7, A_1} (line 16). Then, {A_1} is removed from S^(1) (line 22) and the new subspace {A_6, A_7, A_1} replaces the subspace {A_6, A_7} in S^(2), leading to S^(1) = {{A_2}, {A_3}, {A_4}, {A_5}} and S^(2) = {{A_6, A_7, A_1}}.
Next, Algorithm 1 proceeds to search for a subspace in S^(1) that, if combined with {A_6, A_7, A_1}, will keep it highly correlative. Because the combined subspaces now contain four attributes, when computing the distance we apply Lemma 2 and select only three attributes. More specifically, we compute ˜d({A_1, A_7}, {A_i}), where A_i ∈ S^(1), and find that ˜d({A_1, A_7}, {A_3}) is minimal. The subspace {A_6, A_7, A_1} is therefore replaced with {A_6, A_7, A_1, A_3} and {A_3} is removed from S^(1) (line 22). Because S^(1) ≠ ∅ the algorithm continues to search for combinations of subspaces from S^(1) and S^(2) (lines 14-25). The AAG method finds that combining {A_6, A_7, A_1, A_3} and {A_4} yields the minimum distance with ˜d({A_3, A_1}, {A_4}) ≈ 0.443. However, in line 15, it finds that ˜d({A_2}, {A_4}) < ˜d({A_3, A_1}, {A_4}), and therefore, it decides to unify the two subspaces {A_2} and {A_4}, add their union to S^(2) and remove the latter two single-attribute subspaces from S^(1). At this point S^(2) = {{A_6, A_7, A_1, A_3}, {A_2, A_4}} and S^(1) = {{A_5}}. Similarly, because ˜d({A_3, A_1}, {A_5}) < ˜d({A_2, A_4}, {A_5}) (lines 14-15), {A_5} is unified with {A_6, A_7, A_1, A_3}, replacing {A_6, A_7, A_1, A_3} in S^(2). It results that S^(2) = {{A_6, A_7, A_1, A_3, A_5}, {A_2, A_4}} and S^(1) = ∅. The condition in line 13 becomes false, causing the loop to break. Then, the next agglomeration level starts at line 4 with t = 2. S^(2) contains only two subspaces, which line 8 returns as A_i and A_j, and they are afterwards unified into S^(3). After removing the remaining subspaces from S^(2) (line 9), S^(2) becomes empty, and the loop at line 13 breaks. Finally, the algorithm terminates and returns T = {{A_6, A_7, A_1, A_3, A_5}, {A_2, A_4}, {A_6, A_7, A_1, A_3, A_5, A_2, A_4}}.
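The walk-through above can be mimicked with a deliberately simplified greedy loop. The sketch below is our own simplification, not a faithful reimplementation of Algorithm 1: it repeatedly merges the closest pair of current subspaces under the normalized_d sketch above and records every intermediate subspace, but omits the level-wise bookkeeping (S^(t) versus S^(t+1)), the cached distance look-up table, and the pruning rule of Section 4.1.

```python
from itertools import combinations

def aag_simplified(df):
    """Greedy agglomeration of attribute subspaces under the normalized distance ~d."""
    subspaces = [(c,) for c in df.columns]      # level 1: one subspace per attribute
    found = list(subspaces)
    while len(subspaces) > 1:
        # pick the pair of subspaces with the smallest normalized distance
        s_i, s_j = min(combinations(subspaces, 2),
                       key=lambda pair: normalized_d(df, pair[0], pair[1]))
        merged = s_i + s_j
        subspaces = [s for s in subspaces if s not in (s_i, s_j)] + [merged]
        found.append(merged)
    return found

# e.g., with the Table 1 data loaded into a DataFrame `table1`:
#   for subspace in aag_simplified(table1): print(subspace)
```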
4.1 Pruning
The agglomerative approach used in the previous section has the inherent property that the number of attributes in the subspaces grows with the agglomeration level. This property has two major limitations: (1) it may have a great impact on the efficiency of the anomaly detection ensemble (see Section 5) and (2) recall that (5) becomes less accurate when the number of attributes grows significantly.
To overcome these limitations, we propose a simple rule to determine whether to proceed with unifying two subspaces or not. This rule is embedded in Algorithm 1 at lines 10 and 18. According to this rule, two candidate subspaces are unified only if their union increases the subspace's quality with respect to the two individual subspace candidates. More specifically, we evaluate the Total Correlation (TC) (Watanabe, 1960) of the two individual subspaces A_i and A_j and compare their sum to the TC of their union A_i ∪ A_j:
Algorithm 1: Agglomerative Attribute Grouping.
Input: A dataset D with N observations and p attributes
Output: A set of subspaces T
1: T ← ∅
2: S^(1) ← {{A_1}, {A_2}, ..., {A_p}}
3: t ← 1
4: while (S^(t) ≠ ∅) do
5:   T ← T ∪ S^(t)
6:   S^(t+1) ← ∅
7:   S^(t)_0 ← S^(t)
8:   {A_i, A_j} = argmin_{A_i, A_j ∈ S^(t)} ˜d(A_i, A_j)
9:   S^(t) ← S^(t) \ {A_i, A_j}
10:  if t ≤ 2 OR (Eq. (6) is FALSE) then
11:    S^(t+1) ← S^(t+1) ∪ {A_i ∪ A_j}
12:  end if
13:  while S^(t) ≠ ∅ do
14:    {A_i, A_j} = argmin_{A_i ∈ S^(t), A_j ∈ S^(t+1)} ˜d(A_i, A_j)
15:    A_k = argmin_{A_k ∈ S^(t)_0 \ A_i} ˜d(A_k, A_i)
16:    if (˜d(A_i, A_k) ≤ ˜d(A_i, A_j)) then
17:      S^(t) ← S^(t) \ {A_i, A_k}
18:      if t ≤ 2 OR (Eq. (6) is FALSE) then
19:        S^(t+1) ← S^(t+1) ∪ {A_i ∪ A_k}
20:      end if
21:    else
22:      S^(t) ← S^(t) \ A_i
23:      A_j ← A_i ∪ A_j  (replace A_j in S^(t+1) with the union)
24:    end if
25:  end while
26:  t ← t + 1
27: end while
28: return T
TC(A_i ∪ A_j) < ν_i TC(A_i) + ν_j TC(A_j),    (6)

where J(·) is the well-known Jaccard Index and ν_i = J(A_i; A_i ∪ A_j) and ν_j = J(A_j; A_i ∪ A_j) serve as soft thresholds, allowing Algorithm 1 to combine subspaces whose sum of individual TCs is marginally higher than the TC of their union.
Note that the proposed rule does not require tuning
parameters. Moreover, its usage in Algorithm 1 does
not lead to discarded attributes, since all at-
tributes are already combined in the previous level of
agglomeration. As we noted before, this is an impor-
tant property in anomaly detection applications where
all attributes are required.
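A hedged sketch of the pruning test of Eq. (6) follows; the total correlation and the Jaccard weights follow the definitions above, while the function names, the DataFrame interface, and the reuse of the earlier entropy helper are our assumptions.

```python
def total_correlation(df, subspace):
    """TC(A) = sum_i H(A_i) - H(A_1, ..., A_k) (Watanabe, 1960)."""
    cols = list(subspace)
    return sum(entropy(df[c]) for c in cols) - entropy(*(df[c] for c in cols))

def prune_union(df, s_i, s_j):
    """Return True when Eq. (6) holds, i.e. the union should NOT be added."""
    union = list(s_i) + [c for c in s_j if c not in s_i]
    nu_i = len(s_i) / len(union)   # Jaccard index J(A_i; A_i U A_j) = |A_i| / |A_i U A_j|
    nu_j = len(s_j) / len(union)
    return total_correlation(df, union) < (nu_i * total_correlation(df, s_i) +
                                           nu_j * total_correlation(df, s_j))
```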
4.2 Complexity Analysis
In this subsection we analyze the runtime complex-
ity of Algorithm 1. Note that since we focus on the
worst-case scenario, the pruning mechanism is ig-
nored.
In line 8, Algorithm 1 searches for the two subspaces with the minimal distance ˜d(·) among all possible pairs of single-attribute subspaces. Because we have p attributes in total, the runtime complexity of line 8 is O(p^2) times the cost of computing ˜d(·). Although the algorithm searches only for the pair of subspaces with the minimal distance, the distances between all pairs are recorded in a matrix M. Because ˜d(·) is symmetric, only p(p-1)/2 entries are actually stored. The algorithm makes use of the previously computed matrix M, and, taking into account the inherent nature of the agglomerative clustering embedded in the proposed AAG, the runtime complexity of the entire algorithm is O(n^2 log n) (see, e.g., (Cormen, 2009)).
The computation of ˜d requires the estimation of the conditional entropies among attributes as well as the multi-variate mutual information. We start by analyzing the runtime complexity of the conditional entropy between two attributes A_i and A_j. In this case A_i partitions the dataset D by identifying its m_i unique values. The run time to find the m_i unique elements in an array of size N can be estimated by O(m_i log N) (see, e.g., (Cormen, 2009)). Following this, the m_j unique elements of the second attribute A_j in each one of the m_i partitions are identified. This step again requires a run time of O(m_j log N). Thus, the run time complexity can be estimated as O(m_i m_j log^2 N) for two attributes, which can be further generalized as O(m^2 log^2 N), where m = max{m_i, m_j}. Following this analysis, the cost of computing ˜d(·) for three attributes can be estimated as O(m^3 log^3 N). Applying the chain rule (see, e.g., (Cover and Thomas, 2006)) one can show that the multi-variate mutual information II(·) as well as the normalization factors do not require new computations. Combining the analysis done for both the agglomerative strategy of AAG and the computation of ˜d(·), the complexity of AAG can finally be estimated as O(n^2 m^3 log^3 N log n).
5 EVALUATION
5.1 Experimental Setting
All of our experiments were conducted on 10 real-
world datasets (see Table 2), taken from the UCI
repository (Bache and Lichman, 2013). Although
these datasets are usually used in the context of clas-
sification tasks, previous studies (e.g., (Aggarwal and
Yu, 2001; Lazarevic and Kumar, 2005; Keller et al.,
2012; Nguyen et al., 2013b; Nguyen et al., 2013a))
have also used these datasets in the context of en-
sembles for anomaly detection. This was achieved
by first identifying the majority class for each one of
the datasets and using the records associated with it
as normal observations. Then, abnormal observations
were generated in one of the following forms: (1) per-
turbing normal data samples with synthetic random
noise to generate anomalies or (2) using observations
from the remaining set of classes as novelties.
Table 2: Datasets’ characteristics.
Dataset Classes Instances Attributes
Features - Fourier 10 10000 74
Faults 7 1941 27
Satimage 7 6435 36
Arrhythmia 16 452 279
Pen Digits 10 10992 64
Features - Pixels 10 10000 240
Letters 26 20000 16
Waveform 3 5000 22
Sonar 2 208 60
Thyroid 5 7200 29
After identifying the majority class, the normal
observations associated with it were split into training
and test sets, where the training set had approximately
70% of the whole normal data. The training set
was used as input for the subspace analysis algorithm
and to train the anomaly detection algorithm in each
one of the subspaces. Because AAG assumes dis-
crete variables, we discretized the continuous-valued
attributes in the training set using the Equal Frequency technique, following the recommendations in (García et al., 2014).
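For reproducibility, the equal-frequency discretization can be sketched as below (a hedged example using pandas' quantile binning; the choice of 10 bins is illustrative only and is not a value reported above).

```python
import pandas as pd

def equal_frequency_discretize(df, n_bins=10):
    """Discretize continuous columns into (approximately) equally populated bins."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]) and out[col].nunique() > n_bins:
            out[col] = pd.qcut(out[col], q=n_bins, labels=False, duplicates="drop")
    return out
```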
We implemented the Minimum Volume Set ap-
proach (MV-Set) presented in (Park et al., 2010) as
the anomaly detection algorithm, defining a probabil-
ity threshold for a specific false alarm rate (α). In
particular, the MV-Set method based on the Plug-In
estimator provides, in the asymptotic sense, the smallest possible type-II error (false negative error) for any given fixed type-I error (false positive error).
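As a rough illustration only (not the exact estimator of (Park et al., 2010)), a generic plug-in minimum-volume-set style detector can be sketched as follows: estimate the density of the normal training data, e.g. with a kernel density estimator, and flag as anomalous any point whose density falls below the empirical α-quantile of the training densities.

```python
import numpy as np
from scipy.stats import gaussian_kde

class PlugInMVSet:
    """Density-level-set detector with a plug-in threshold at false-alarm rate alpha."""
    def __init__(self, alpha=0.05):
        self.alpha = alpha

    def fit(self, X_train):                       # X_train: (n_samples, n_features)
        self.kde_ = gaussian_kde(X_train.T)       # assumes continuous-valued projections
        self.threshold_ = np.quantile(self.kde_(X_train.T), self.alpha)
        return self

    def predict(self, X):
        """1 = anomaly (outside the estimated minimum volume set), 0 = normal."""
        return (self.kde_(X.T) < self.threshold_).astype(int)
```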
After training the anomaly detection algorithm in
each subspace, a weighting factor was computed to
aggregate the ensemble elements at the test stage.
More specifically, we computed the accuracy of the
anomaly detection method in each subspace and used
these values as weighting factors to combine the en-
semble elements (Menahem et al., 2013). The accu-
racy was obtained from a validation set or from the
same training dataset.
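The aggregation step can be sketched as below; the accuracy-based weights follow the description above, while the 0.5 decision threshold and the function interface are our assumptions.

```python
import numpy as np

def ensemble_predict(detectors, weights, X, subspaces):
    """detectors[k] is trained on the columns in subspaces[k]; weights[k] is its accuracy."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    votes = np.stack([det.predict(X[:, cols]) for det, cols in zip(detectors, subspaces)])
    score = w @ votes                     # weighted fraction of "anomaly" votes per sample
    return score, (score >= 0.5).astype(int)
```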
We selected seven classical and state-of-the-art al-
gorithms representing a wide range of techniques to
benchmark the proposed AAG method. Specifically,
we selected FB (Lazarevic and Kumar, 2005) to repre-
sent the random selection of attributes. Representing
the A-Priori (Agrawal et al., 1994) based technique,
we selected HiCS (Keller et al., 2012). To represent
the clustering based techniques we selected ENCLUS
(Cheng et al., 1999), EWKM (Jing et al., 2007) and
AFG-k-means (Gan and Ng, 2015). Finally, to rep-
resent a category of algorithms that search for sub-
spaces based on information theoretical measures we
selected CMI (Nguyen et al., 2013b) and 4S (Nguyen
et al., 2013a). The implementation and parameter set-
ting of all benchmark approaches followed the cor-
responding description in the original publications.
With regard to AAG, we used three attributes in the
evaluation of (5), which seemed to be a good trade off
between high quality subspaces and reasonable run-
time.
As measures of performance we assessed the receiver operating characteristic (ROC) curve and estimated the Area Under the ROC Curve (AUC), as it has often been used to quantify the quality of novelty detection methods (see, e.g., (Goldstein and Uchida, 2016)). To obtain the ROC curve, we varied α in a linearly spaced grid of 100 values in the range [0, 1] and then we computed the AUC. All experiments were executed 20 times and the average is reported, where in each repetition the dataset was re-split randomly into training and test sets. The experiments were conducted on a standard MacBook Pro running Mac OS X version 10.6.8, with a 2.53 GHz Intel Core 2 Duo processor and 8 GB of DRAM.
We evaluated the AAG method under two dif-
ferent settings. In the first setting we simulated a
case where anomalies are created by adding zero-
mean Gaussian noise to normal observations, but only
along a subset of the attributes and not to the en-
tire data space. More specifically, after splitting the
normal dataset into training and test sets, we further
split the test set into two equally-sized datasets. One of the newly split test sets was kept as is, representing normal observations. For the other split, we randomly selected K attributes from the entire data space and added zero-mean Gaussian noise on the projected subspace, representing anomalies. The Variance-Covariance matrix of the Gaussian noise was set to be diagonal with the variances of the K attributes in the selected subspace as the diagonal elements. By adding noise in this fashion, the correlation among the K attributes is broken, generating abnormal observations. We repeated this procedure for K from 1 to n, where n is the total number of attributes in the dataset.
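A hedged sketch of this anomaly-generation procedure (our illustration; the random seed and the array interface are assumptions):

```python
import numpy as np

def perturb_subspace(X_test, k, seed=0):
    """Add zero-mean Gaussian noise to k randomly selected attributes of X_test."""
    rng = np.random.default_rng(seed)
    X_anom = X_test.astype(float).copy()
    cols = rng.choice(X_test.shape[1], size=k, replace=False)
    stds = X_test[:, cols].std(axis=0)              # diagonal of the noise covariance
    X_anom[:, cols] += rng.normal(0.0, stds, size=(X_test.shape[0], k))
    return X_anom, cols
```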
In the second setting, we simulated a case where
the abnormal observations represent a previously un-
seen class, i.e. novelties. To this end, we utilized the
complete test set (i.e., the 30% of the observation as-
sociated with the majority class) to represent normal
observations. Then, 10% of the observations associ-
ated with the remaining classes (i.e., not the majority
class) were added to the test set to represent novelties.
5.2 Results
Detecting Anomalies
Figure 2 shows the resulting AUC score values as a
function of the percentage of attributes synthetically
perturbed by additive zero-mean Gaussian noise on
one out of the 10 datasets from Table 2. The x-
axis indicates the percentage of perturbed attributes
w.r.t. the total number of attributes, and the y-axis
shows the averaged AUC values over the 20 repeti-
tions of the experiment. As seen in the figure, the
proposed AAG method significantly outperforms the
other methods when the percentage of perturbed at-
tributes is lower than 40%. When the percentage
of perturbed attributes is higher than 40%, AAG's per-
formance remains stable, and becomes comparable to
that of HiCS. Furthermore, it seems that AAG’s per-
formance is less affected by the percentage of per-
turbed attributes (i.e., lower variance in its AUC val-
ues), whereas the other methods are more affected.
Table 3 shows the averaged AUC values obtained by the different subspace analysis methods, for all 10 datasets, when zero-mean Gaussian noise is added to 10% of the attributes. In each row of the table (i.e. dataset), the AUC results obtained by the two best subspace analysis methods are emphasized in bold. As seen from the table, in all 10 datasets, AAG is included in the list of the two best performing subspace analysis methods. In three cases, ENCLUS is included with AAG among the two best performing methods, but in all of these three cases, AAG outperforms it. In four other cases, HiCS is included with AAG in the list of the two best performing methods, but only in one of these cases does it manage to outperform AAG. CMI is also included two times with AAG in the list of the two best performing methods, but in both of these cases, AAG outperforms it. All other methods are left far behind.
Figure 2: AUC as a function of the percentage of attributes synthetically perturbed by additive zero-mean Gaussian noise, for the different subspace analysis methods (AAG, FB, HiCS, ENCLUS, EWKM, AFG-k-means, CMI, and 4S) on the Features - Fourier dataset. The x-axis shows the factor of the number of perturbed attributes and the y-axis shows the AUC (α ∈ [0, 1]).
Detecting Novelties
Table 4 shows the averaged AUC values obtained
under the novelty detection setting. Recall that the
reported values are the average over 20 repetitions.
Here as well, the two best results for each row (i.e.
dataset) are emphasized in Bold.
As seen from the table, in 8 out of the 10 datasets,
AAG is included in the list of two best performing
subspace analysis methods, and in 6 cases, it even
achieves the best performance. FB seems to be the
second best method in the novelty detection setting,
reaching the list of the two best performing meth-
ods in 5 of the datasets, outperforming other state-
of-the-art subspace analysis methods such as HiCS
or 4S. HiCS and ENCLUS come next, both included
in the list of the two best performing methods two
times. 4S and CMI were found to be less effective
in detecting novelties, and were included in the list
of the two best performing methods twice and once
respectively. The soft subspace clustering (SSC) ap-
proaches EWKM and AFG-k-means were also found
to be less effective in detecting novelties. Interest-
ingly however, the SSC methods managed to achieve
relatively high AUC values in datasets with single-
type attributes such as Fourier and Waveform. This is most likely due to their use of the k-means algorithm.
In comparison to the previous experiment (i.e., Detecting Anomalies), the detection performance of AAG is slightly less striking, and we attribute this to the fact that in most of the analyzed datasets the novelties seem to be “better” distributed along the entire data space and hence the performance of the benchmark approaches becomes notably better.
To support our findings in Table 3 and Table 4,
Table 3: AUC performance results for the anomaly detection ensemble using MV-Set as the local anomaly detection algorithm (α ∈ [0.0, 1.0]). Normal data samples were synthetically perturbed by additive zero-mean Gaussian noise in one subspace comprising 10% of the total number of attributes, i.e., p. The two best results per dataset are shown in bold.
Dataset AAG FB HiCS ENCLUS EWKM AFG-k-means 4S CMI
Features-Fourier 0.613 0.256 0.370 0.294 0.147 0.202 0.238 0.244
Faults 0.759 0.484 0.424 0.564 0.448 0.550 0.594 0.601
Satimage 0.383 0.186 0.314 0.365 0.323 0.303 0.234 0.214
Arrhythmia 0.761 0.004 0.643 0.510 0.593 0.592 0.239 0.244
Pen Digits 0.725 0.402 0.293 0.627 0.543 0.524 0.241 0.301
Features-Pixels 0.693 0.497 0.381 0.452 0.474 0.365 0.504 0.533
Letters 0.664 0.289 0.564 0.640 0.425 0.337 0.416 0.419
Waveform 0.602 0.468 0.548 0.490 0.433 0.431 0.442 0.455
Sonar 0.546 0.246 0.499 0.373 0.232 0.299 0.391 0.405
Thyroid 0.843 0.252 0.750 0.289 0.236 0.591 0.632 0.470
we followed the statistical significance tests proposed in (Demšar, 2006). By applying the non-parametric Friedman test to each table, we obtained the F-statistics F = 35.025 and F = 55.945, respectively. Based on the critical value of 3.245 at a significance level of 0.05, the null hypothesis can be rejected for both experiments. The statistical values obtained for the post-hoc tests between AAG and each one of the benchmarked approaches showed a p-value < 0.05 in all cases, leading to the conclusion that AAG outperforms all of its competitors in the selected cases.
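In outline, the significance analysis can be reproduced with scipy as sketched below; auc_table is a hypothetical array with one row per dataset and one column per method, and the pairwise post-hoc comparisons use Wilcoxon signed-rank tests as a stand-in for the exact post-hoc procedure of (Demšar, 2006).

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

def significance_analysis(auc_table, method_names, aag_index=0):
    """auc_table: (n_datasets, n_methods) array of AUC values (hypothetical data)."""
    stat, p_value = friedmanchisquare(*auc_table.T)
    print(f"Friedman statistic = {stat:.3f}, p = {p_value:.4f}")
    for j, name in enumerate(method_names):
        if j == aag_index:
            continue
        _, p = wilcoxon(auc_table[:, aag_index], auc_table[:, j])
        print(f"AAG vs {name}: p = {p:.4f}")
```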
Detailed Comparison
The rest of this subsection provides a more detailed
comparison of AAG to the other benchmark ap-
proaches.
With respect to the FB approach, the results obtained as an anomaly detection ensemble reveal relatively low performance, whereas for novelty detection ensembles it achieved, on average, comparable results. Novel classes, as opposed to random noise, manifest a certain correlation among different attributes that FB manages to detect. However, in more complex datasets where attributes are of mixed nature, and the data dimensionality is relatively high in comparison to the number of samples, FB's performance degrades. A possible reason for this behavior can be attributed to the different and larger sizes of the subspaces, since more data samples are needed to avoid the curse of dimensionality (Scott, 1992).
Unlike HiCS, AAG succeeds in finding a smaller number of subspaces that can be directly applied. The reason for this lies in the search strategy of HiCS, which is based on the A-Priori approach and then randomly permutes attributes to reduce the algorithm complexity. HiCS retrieves several hundreds of subspaces that afterwards have to be filtered in some fashion. This can be observed in the results obtained in the anomaly detection evaluation, where, on average, the HiCS method fails to find moderate deviations in the dataset.
With respect to ENCLUS, although it does not require setting the number of generated subspaces in advance, it does require three other parameters as input, such that their tuning requires an extensive grid search over the support of the parameters. In contrast to FB, we see that it performs better in the case of anomaly detection applications, but its performance degrades as a subspace analysis method for novelty detection ensembles. Finally, we note that ENCLUS presented the highest training run-time for one of the evaluated sets of parameters.
The subspace clustering approaches EWKM and AFG-k-means obtained the worst performance values in anomaly as well as in novelty detection ensembles. The poorer performance with respect to other approaches is due to the fact that attributes are discarded from the set of subspaces. Consequently, neither novel nor abnormal samples can be efficiently identified. Additionally, we found that it was not trivial to set the number of clusters, a critical parameter for both approaches. In both subspace clustering methods, the number of clusters has the major impact on the subspaces selected when optimizing the extended k-means cost objective.
For their part, CMI and 4S resulted in lower performance than the proposed AAG, both for novelty and for anomaly detection. Only in one case did both approaches manage to outperform all other benchmark approaches. Whereas the 4S method requires setting the maximal number of attributes, in CMI the number of clusters in the k-means turns out to be critical for finding high-quality subspaces. This is manifested in the obtained results for both evaluations, i.e. anomaly and novelty detection ensembles.
Table 4: AUC performance results for the novelty detection ensemble using MV-Set as the local anomaly detection algorithm (α ∈ [0.0, 1.0]). The two best results are shown in bold.
Dataset AAG FB HiCS ENCLUS EWKM AFG-k-means 4S CMI
Features-Fourier 0.948 0.896 0.877 0.764 0.736 0.804 0.851 0.855
Faults 0.695 0.299 0.575 0.568 0.349 0.392 0.528 0.511
Satimage 0.989 0.979 0.920 0.882 0.836 0.840 0.814 0.809
Arrhythmia 0.653 0.187 0.573 0.605 0.348 0.496 0.574 0.554
Pen Digits 0.908 0.990 0.874 0.895 0.772 0.658 0.842 0.833
Features-Pixels 0.991 0.996 0.839 0.534 0.956 0.919 0.768 0.755
Letters 0.374 0.307 0.485 0.486 0.572 0.539 0.486 0.445
Waveform 0.896 0.853 0.799 0.731 0.843 0.804 0.824 0.831
Sonar 0.634 0.427 0.588 0.508 0.531 0.503 0.492 0.501
Thyroid 0.589 0.449 0.490 0.299 0.207 0.194 0.638 0.643
5.3 Run-time Evaluation
We have also evaluated the time taken to train each of
the ensembles on the 10 datasets used in this paper.
Table 5 shows the runtime results for the three benchmark methods that obtained the best detection results in the previous two subsections (i.e., FB, HiCS, and ENCLUS). Best results are shown in bold.
As seen in Table 5, in the majority of the 10 stud-
ied cases, the ensemble using AAG as the subspace
analysis approach finished its execution faster than
that using ENCLUS or HiCS. One reason for this
superiority is that AAG finds, on average, a smaller
number of subspaces than the other two competitors.
This is most likely due to the A-priori method that the
other two methods employ to search for subspaces,
as opposed to the Agglomerative approach that AAG
employs.
Table 5: Runtime evaluation (in seconds) of the training
phase for the ensembles based on AAG, FB, ENCLUS and
HiCS on the 10 datasets. Best result shown in Bold.
Dataset AAG FB HiCS ENCLUS
Fourier 316.5 146.4 317.8 5817.5
Faults 77.5 52.9 387.7 2166.7
Satimage 201.8 135.7 1370.2 1944.8
Arrhythmia 331.2 1863.8 1370.4 1540.9
Pen Digits 34.5 101.5 871.6 404.1
Pixels 826.4 6144.7 323.2 7122.3
Letters 278.5 123.1 1869.5 1236.5
Waveform 55.2 78.8 360.9 2694.5
Sonar 243.3 175.7 218.8 48667.8
Thyroid 162.8 855.9 1655.2 1862.9
6 DISCUSSION AND FUTURE
WORK
In this paper, we presented the Agglomerative At-
tribute Grouping (AAG) subspace analysis algo-
rithm that aims to find high-quality subspaces for
anomaly detection ensembles. Similar to other re-
cent state-of-the-art approaches for subspace analy-
sis, AAG searches for subspaces with highly corre-
lated attributes. To assess how correlative a sub-
set of attributes is, AAG uses a metric derived from
information-theory measures of partitions. Due to the
time complexity of the proposed metric with respect
to the number of attributes, we suggested a way to ap-
proximate the metric in cases where the number of at-
tributes is large. Equipped with the newly suggested
metric, AAG applies a variation of the well-known
agglomerative algorithm to search for highly corre-
lated subspaces. Our variation of the agglomerative
algorithm also applies a pruning rule that reduces re-
dundancy in the final set of subspaces.
As a result of combining the agglomerative ap-
proach with the suggested metric, AAG avoids any
tuning of parameters. Moreover, as our extensive
evaluation shows, AAG manages to outperform other
classical and state-of-the-art subspace analysis algo-
rithms when used as part of an anomaly detection
ensemble, both in its better ability to distinguish be-
tween normal and abnormal observations, as well as
in its faster training time. Finally, as demonstrated
in our evaluation, AAG also outperforms other sub-
space analysis techniques when used as part of nov-
elty detection ensembles.
The AAG algorithm presented in this paper ad-
dresses the case where no separation is made between
normal observations (i.e., there is only one normal
class). In future work we aim to extend AAG to be
applicable for datasets with multi-class normal obser-
vations. While the trivial way of doing so is to ap-
ply AAG on each one of the normal classes separately
and unify the sets of subspaces, we would like to uti-
lize the information available in the different classes
to find higher quality subspaces.
REFERENCES
Aggarwal, C. C. (2015). Outlier analysis. In Data Mining,
pages 237–263. Springer.
Aggarwal, C. C. and Yu, P. S. (2001). Outlier detection for
high dimensional data. pages 37–46.
Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P.
(1998). Automatic subspace clustering of high dimen-
sional data for data mining applications, volume 27.
ACM.
Agrawal, R., Srikant, R., et al. (1994). Fast algorithms for
mining association rules. 1215:487–499.
Bache, K. and Lichman, M. (2013). UCI machine learning
repository. http://archive.ics.uci.edu/ml.
Bajovic, D., Sinopoli, B., and Xavier, J. (2011). Sensor
selection for event detection in wireless sensor net-
works. IEEE Transactions on Signal Processing, 59.
Chandola, V., Banerjee, A., and Kumar, V. (2007).
Anomaly detection: A survey. Technical report, De-
partment of Computer Science and Engineering, Uni-
versity of Minnesota.
Cheng, C.-H., Fu, A. W., and Zhang, Y. (1999). Entropy-
based subspace clustering for mining numerical data.
In KDD, pages 84–93.
Cormen, T. H. (2009). Introduction to algorithms. MIT
press.
Cover, T. M. and Thomas, J. A. (2006). Elements of Infor-
mation Theory. John Wiley and Sons, Inc.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30.
Deng, Z., Choi, K.-S., Jiang, Y., Wang, J., and Wang, S.
(2016). A survey on soft subspace clustering. Infor-
mation Sciences, 348:84–106.
Gan, G. and Ng, M. K.-P. (2015). Subspace clustering
with automatic feature grouping. Pattern Recognition,
48(11):3703–3713.
Gan, G., Wu, J., and Yang, Z. (2006). A fuzzy subspace
algorithm for clustering high dimensional data. pages
271–278.
García, S., Luengo, J., Sáez, J. A., López, V., and Herrera, F. (2014). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4):734–750.
Ge, Z. Q. and Song, Z. H. (2013). Multivariate Statistical
Process Control: Process Monitoring Methods and
Applications. Springer London Dordrecht Heidelberg
New York.
Goldstein, M. and Uchida, S. (2016). A comparative eval-
uation of unsupervised anomaly detection algorithms
for multivariate data. PloS one, 11(4):e0152173.
Jakulin, A. (2005). Machine learning based on attribute
interactions. Univerza v Ljubljani.
Jing, L., Ng, M. K., and Huang, J. Z. (2007). An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering, 18(8):1026–1041.
Jyothsna, V., Prasad, V. V. R., and Prasad, K. M. (2011). A
review of anomaly based intrusion detection systems.
International Journal of Computer Applications.
Kagan, E. and Ben-Gal, I. (2013). Probabilistic Search for
Tracking Targets: Theory and Modern Applications.
John Wiley and Sons, Inc.
Kagan, E. and Ben-Gal, I. (2014). A group testing algorithm
with online informational learning. IIE Transactions,
46(2):164–184.
Keller, F., Müller, E., and Böhm, K. (2012). HiCS: High contrast subspaces for density-based outlier ranking. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering.
Kriegel, H.-P., Kröger, P., and Zimek, A. (2009a). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1.
Kriegel, H.-P., Schubert, E., Zimek, A., and Kröger, P. (2009b). Outlier detection in axis-parallel subspaces of high dimensional data. pages 831–838.
Kuratowski, C. (1961). Introduction to set theory and topol-
ogy.
Lazarevic, A. and Kumar, V. (2005). Feature bagging for
outlier detection. In ACM, editor, KDD’05.
Maimon, O. and Rockach, L. (2005). Data Mining and
Knowledge Discovery Handbook: A Complete Guide
for Practitioners and Researchers, chapter Outlier de-
tection. Kluwer Academic Publishers.
Markou, M. and Singh, S. (2003). Novelty detection: A review - part 1: Statistical approaches. Signal Processing, 83:2481–2497.
McGill, W. J. (1954). Multivariate information transmis-
sion. Psychometrika, 19(2):97–116.
Menahem, E., Rokach, L., and Elovici, Y. (2013). Combin-
ing one-class classifiers via meta learning.
Müller, E., Schiffer, M., and Seidl, T. (2010). Adaptive outlierness for subspace outlier ranking. In CIKM, pages 1629–1632.
Müller, E., Schiffer, M., and Seidl, T. (2011). Statistical selection of relevant subspace projections for outlier ranking. In 2011 IEEE 27th International Conference on Data Engineering.
Nguyen, H. V., Müller, E., and Böhm, K. (2013a). 4S: Scalable subspace search scheme overcoming traditional apriori processing. IEEE International Conference on Big Data, pages 359–367.
Nguyen, H. V., Müller, E., Vreeken, J., Keller, F., and Böhm, K. (2013b). CMI: An information-theoretic contrast measure for enhancing subspace cluster and outlier detection. SIAM.
Park, C., Huang, J. Z., and Ding, Y. (2010). A computable plug-in estimator of minimum volume sets for novelty detection. Operations Research, INFORMS.
Parsons, L., Haque, E., and Liu, H. (2004). Subspace
clustering for high dimensional data: a review. Acm
Sigkdd Explorations Newsletter, 6(1):90–105.
Pimentel, M. A. F., Clifton, D. A., Clifton, L., and
Tarassenko, L. (2014). A review of novelty detection.
Signal Processing, 99:215–249.
Rokhlin, V. A. (1967). Lectures on the entropy theory of
measure-preserving transformations. A Series Of Ar-
ticles On Ergodic Theory.
Scott, D. W. (1992). Multivariate Density Estimation - The-
ory, Practice, and Visualization. John Wiley and Sons,
Inc.
Simovici, D. (2007). On generalized entropy and entropic
metrics. Journal of Multiple Valued Logic and Soft
Computing, 13(4/6):295.
Steinwart, I., Hush, D., and Scovel, C. (2006). A classifi-
cation framework for anomaly detection. Journal of
Machine Learning Research, (6):211–232.
Watanabe, S. (1960). Information theoretical analysis of
multivariate correlation. IBM Journal of research and
development, 4(1):66–82.