Clustering and Density Estimation for Streaming Data
using Volume Prototypes
Maiko Sato, Mineichi Kudo and Jun Toyama
Division of Computer Science
Graduate School of Information Science and Technology, Hokkaido University
Kita-14, Nishi-9, Kita-ku, Sapporo 060-0814, Japan
Abstract. The authors have proposed volume prototypes as a compact expression
of a huge dataset or a data stream, together with a one-pass algorithm to find them. A
reasonable number of volume prototypes can be used, instead of an enormous
number of samples, for many applications including classification, clustering and
density estimation. In this paper, two algorithms using volume prototypes, called
VKM and VEM, are introduced for clustering and density estimation. Compared
with other algorithms for such huge data, we show that our algorithms are advantageous
in processing speed while keeping the same degree of performance, and that both
applications are available from the same set of volume prototypes.
1 Introduction
In recent years, we often deal with an enormous amount of data or a data stream in a large
variety of pattern recognition tasks. Such data require a huge amount of memory
and computation time for processing. It is therefore preferable to find a compact expression
of the data that is acceptable in time and space without losing the characteristics of the original
dataset. For this goal, we have proposed volume prototypes [1, 2] as an extension of
conventional point prototypes. Each volume prototype is a geometric configuration
that represents the many data points inside it.
Volume prototypes can be generated by a single-pass algorithm, and thus they are
particularly effective for streaming data that can be accessed only once. As another ad-
vantage, volume prototypes typically converge to the modes of the underlying distribution,
so we do not need to determine the number of prototypes beforehand; unlike mixture
models, it suffices to take a sufficiently large number of initial prototypes.
In our earlier studies [1, 2], we conducted experiments and analyzed the typical
behavior of volume prototypes. In this paper, we investigate the applicability of volume
prototypes further. In particular, we apply volume prototypes to clustering and to the
construction of mixture models.
2 Related Works
There have been many studies of clustering and density estimation for huge datasets or
data streams.
Density estimation algorithms for huge datasets are found in the literature [3–6].
Zhang et al. extended a kernel method to achieve fast density estimation for very
large databases [3]. Arandjelović and Cipolla realized real-time density estimation by
incremental learning of a GMM [4]. The incremental EM algorithm [5] attempts to reduce
the computational cost of the EM algorithm. This is achieved by adopting partial E-steps:
the whole dataset is divided into blocks and a partial E-step is performed on
each block in a cyclic way. The lazy EM algorithm [6] follows the same strategy as the
incremental EM algorithm, but performs a partial E-step only on a significant
subset of the data.
Clustering methods for huge datasets are also found in the literature [7–10]. Charikar
et al. proposed a one-pass clustering algorithm in the streaming model [7]. It produces
a constant-factor approximation of the solution of the k-median problem in storage
space O(k poly log n) for data of size n. Bradley et al. showed an efficient clustering
method for large databases [8]. Their method is based on classifying regions into three
kinds: regions that must be maintained as they are, regions that can be
compressed, and regions that can be discarded. This algorithm requires at most one
scan of the database. The algorithm BIRCH [9] by Zhang et al. achieves efficient
clustering for a huge dataset by constructing a CF-tree, which is a hierarchical summary of
clusters. The data is scanned only once to construct the CF-tree. The algorithm FEKM [10]
by Goswami et al. produces almost the same clusters as the original k-means algorithm.
Their algorithm is basically one-pass, and obtains clusters close to the correct clusters of
k-means with a small number of additional scans.
These approaches would be effective for a huge dataset in several situations. How-
ever, some of them require keeping the whole dataset, and others require a constant number
of rescans of the data or of a reduced subset. If we are allowed to access each datum only
once, some of these methods are not applicable. Another important point is that we do
clustering for one goal, e.g., data compression, density estimation for another goal,
e.g., classification, and sometimes we do both. In that case, it is desirable to be able
to carry out both efficiently. The goal of this paper is to show that our approach using
volume prototypes gives one possible solution.
3 Algorithm for Volume Prototypes
In this section, we describe the concrete algorithm VP that generates volume prototypes. The
single-pass algorithm is shown in Fig. 1.
The outline is as follows. With M scans over the first N samples in different random
orders, we obtain M seed prototypes (ME step). Here N is relatively small, say
N = 500, and M is taken to be ten times or so the number of expected modes, say
M = 100. We bring up those seed prototypes by updating them with the samples
that fall in their acceptance regions (C step). Finally, we carry out prototype
selection by greedy set covering to obtain L final prototypes.
3.1 Details
Dataset: We consider a data stream $x_1, x_2, \ldots, x_N, x_{N+1}, \ldots$ of $d$-dimensional vectors.
Inputs: (unlimited samples) $x_1, x_2, \ldots, x_N, \ldots$
  $N$ is the number of samples for the ME step.
  $M$ is the number of seed prototypes.
  $N_v$ is the number of latest samples to be kept.
  $\theta$ is the value that determines the radius $r$ of the acceptance region.
Procedure for obtaining volume prototypes
ME (mode estimation) step:
  Repeat the following $M$ times for the first $N$ samples.
  1. Choose a random permutation $\sigma$ to reorder the samples to $x_{\sigma_1}, x_{\sigma_2}, \ldots, x_{\sigma_N}$.
  2. Initialize a prototype in the way described in the detailed description.
  3. Use $x_{\sigma_2}, \ldots, x_{\sigma_N}$ in order to update the seed prototype in the same way as in 2
     (using only the samples in the acceptance region (1)).
C (Convergence) step:
  For each sample $x_i$ ($i = N+1, N+2, \ldots$) do the following.
  1. Update the seed prototypes in which $x_i$ falls by (2)–(4).
  2. Keep the latest $N_v$ samples.
Exploit time:
  From the $M$ prototypes, greedily select $L$ ($\ll M$) prototypes until no improvement is
  found for covering the $N_v$ samples. Count the number of samples falling in each pro-
  totype among the $N_v$ samples. Here a sample is divided by $k$ when $k$ prototypes share it.
Fig. 1. Volume Prototype Algorithm (VP).
Prototype: Let p = (µ, Σ, r, n) be a prototype. Here, µ is the prototype center, Σ is
the covariance matrix, r is the Mahalanobis radius and n is the number of samples
inside the prototype.
Included Sample Set: Let $S_p$ be the set of data points inside prototype $p$, which is
specified by $S_p = \{x_i \mid (x_i - \mu)' \Sigma^{-1} (x_i - \mu) \le r^2\}$. Then $n = |S_p|$. We
approximate $n$ by the number of samples used for updating the prototype.
Acceptance Region: We specify the acceptance region $A_p$ of prototype $p = (\mu, \Sigma, r, n)$
by the acceptance Mahalanobis radius $R$ as
$$A_p = \{x \mid (x - \mu)' \Sigma^{-1} (x - \mu) \le R^2\}. \qquad (1)$$
Here, $R$ ($\ge r$) is determined from $r$ and $n$ by $R = r + \sqrt{\chi^2_d(0.95)/n}$.
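As a concrete illustration, the following is a minimal sketch of the radius rule and the acceptance test (1) in Python with NumPy/SciPy; the helper names are ours, and only the formulas above come from the text.

```python
import numpy as np
from scipy.stats import chi2

def acceptance_radius(r, n, d):
    """Acceptance Mahalanobis radius R = r + sqrt(chi2_d(0.95) / n)."""
    return r + np.sqrt(chi2.ppf(0.95, df=d) / n)

def in_acceptance_region(x, mu, Sigma, R):
    """Test whether x lies in the acceptance region A_p of Eq. (1)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff) <= R ** 2
```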
Updating Procedure: When a sample $x$ falls into the acceptance region $A_{p}^{(t)}$ of pro-
totype $p^{(t)}$ at the $(t+1)$th update, $p^{(t)}$ is updated to $p^{(t+1)}$ by
$$n^{(t+1)} = n^{(t)} + 1, \qquad (2)$$
$$\mu^{(t+1)} = \frac{1}{n^{(t+1)}}\left(n^{(t)} \mu^{(t)} + x\right), \qquad (3)$$
$$\Sigma^{(t+1)} = \frac{1}{n^{(t+1)}}\left(n^{(t)} \Sigma^{(t)} + x x' + n^{(t)} \mu^{(t)} \mu^{(t)\prime} - n^{(t+1)} \mu^{(t+1)} \mu^{(t+1)\prime}\right). \qquad (4)$$
The radius $R$ of the acceptance region is also updated.
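The update (2)–(4) translates directly into code; below is a small sketch assuming NumPy, with a function name of our choosing.

```python
import numpy as np

def update_prototype(mu, Sigma, n, x):
    """One update of a prototype p = (mu, Sigma, n) with a sample x, following Eqs. (2)-(4)."""
    n_new = n + 1                                              # Eq. (2)
    mu_new = (n * mu + x) / n_new                              # Eq. (3)
    Sigma_new = (n * Sigma + np.outer(x, x)                    # Eq. (4)
                 + n * np.outer(mu, mu)
                 - n_new * np.outer(mu_new, mu_new)) / n_new
    return mu_new, Sigma_new, n_new
```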
Initialization of ME Step: We initialize the center $\mu^{(0)}$ and the covariance matrix
$\Sigma^{(0)}$ by $\mu^{(0)} = x_1$ (the first element after random reordering) and $\Sigma^{(0)} = \lambda^2 I$.
Here, $\lambda^2 = \frac{1}{dN}\sum_{i=1}^{N} \min_{j \ne i,\, j \in \{1,2,\ldots,N\}} \|x_i - x_j\|^2$ ($\|\cdot\|^2$ is the squared Eu-
clidean norm) and $I$ is the unit matrix. In addition, we set the initial value of $n^{(0)}$
to $n^{(0)} = d + 1$.
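A possible implementation of this seeding rule, again only a sketch under our own naming (the O(N^2) pairwise-distance computation is acceptable here because N is small):

```python
import numpy as np

def initial_prototype(X):
    """Seed a prototype from the first N (reordered) samples: mu = x_1, Sigma = lambda^2 I, n = d + 1."""
    N, d = X.shape
    # lambda^2 = (1/(dN)) * sum_i min_{j != i} ||x_i - x_j||^2
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dists, np.inf)
    lam2 = sq_dists.min(axis=1).sum() / (d * N)
    return X[0].copy(), lam2 * np.eye(d), d + 1
```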
4 Applications of Volume Prototypes
Volume prototypes are an intermediate representation between data and clusters, although
they are closer to the data themselves. We can use them for costly applications in place
of a large amount of original samples. In this section, we show two applications:
clustering and density estimation.
Let us assume that we have a huge amount of data and that they are represented by a
reasonable number of volume prototypes. In the following, we consider the correspon-
dence between a sample and a volume prototype. The big difference is that the latter
contains many samples and has a volume specified by a covariance matrix and a Maha-
lanobis radius.
4.1 EM Algorithm with Volume Prototypes (VEM)
We first consider modeling a given density by a finite mixture of Gaussian com-
ponent densities. The number $K$ of component densities is assumed to be given be-
forehand. Let the $k$th component be a Gaussian $N(x; \nu_k, \Sigma_k)$, where $\nu_k$ is the mean
and $\Sigma_k$ is the covariance matrix. Then the density is estimated by a mixture
$f(x) = \sum_{k=1}^{K} \pi_k N(x; \nu_k, \Sigma_k)$ with prior probabilities $\pi_k$. To estimate the parameter values
from given volume prototypes, we present a volume prototype version (VEM) of the EM
algorithm. The concrete algorithm is shown in Fig. 2.
In the following, we use suffix $j$ for volume prototypes and $k$ for component densi-
ties. Thus, the $j$th volume prototype is denoted by $p_j = (\mu_j, \Sigma_j, n_j)$, omitting the radius $r_j$.
In the Maximization step (M step), we estimate the mean and covariance of the $k$th
component density from the currently estimated membership value $m_k(p)$ of a volume
prototype $p$ to the $k$th component. Here, $\sum_{k=1}^{K} m_k(p) = 1$. The parameter values are
estimated by the maximum likelihood estimators. The updating procedure is given as
$$\pi_k = \sum_j m_k(p_j)\, n_j \Big/ \sum_j n_j, \qquad \nu_k = \sum_j m_k(p_j)\, n_j\, \mu_j \Big/ \sum_j m_k(p_j)\, n_j,$$
$$\Sigma_k = \sum_j m_k(p_j)\, n_j \left\{\Sigma_j + (\mu_j - \nu_k)(\mu_j - \nu_k)'\right\} \Big/ \sum_j m_k(p_j)\, n_j.$$
In the Expectation step (E step), we re-estimate the membership values $m_k(p)$ from
the currently estimated $K$ component densities. To do this, we use a kind of distance
measure between $p$ and the $k$th component.
Inputs: $L$ volume prototypes $p_1, p_2, \ldots, p_L$ ($p_l = (\mu_l, \Sigma_l, n_l)$, $l = 1, 2, \ldots, L$);
  $K$ is the number of components.
Outputs: $f(x) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k)$ ($f_k$ is a Gaussian with $\theta_k = (\nu_k, \Sigma_k)$).
Procedure for obtaining a mixture model
Initialize step:
  Set randomly the membership $m_k(p_l)$ for every $k$th component and $l$th prototype.
Repeat the following M step and E step in turn until convergence.
M step: For the $k$th component ($k = 1, 2, \ldots, K$), update the following:
  $\pi_k = \sum_l m_k(p_l)\, n_l / \sum_l n_l$
  $\nu_k = \sum_l m_k(p_l)\, n_l\, \mu_l / \sum_l m_k(p_l)\, n_l$
  $\Sigma_k = \sum_l m_k(p_l)\, n_l \{\Sigma_l + (\mu_l - \nu_k)(\mu_l - \nu_k)'\} / \sum_l m_k(p_l)\, n_l$.
E step: For the $l$th prototype ($l = 1, 2, \ldots, L$), update the membership by
  $m_k(p_l) = \pi_k (2\pi)^{-D/2} |\Sigma_k|^{-1/2}\, e^{-\frac{1}{2}\left(\mathrm{tr}\,\Sigma_k^{-1}\Sigma_l + (\mu_l - \nu_k)'\Sigma_k^{-1}(\mu_l - \nu_k)\right)}$.
Fig. 2. Volume prototype EM algorithm (VEM).
Inputs: $L$ volume prototypes $p_1, p_2, \ldots, p_L$; $K$ is the number of clusters to be found.
Outputs: $K$ clusters specified by means $\nu_1, \nu_2, \ldots, \nu_K$.
Procedure K-means with volume prototypes
  $t \leftarrow 0$.
  Choose the first $K$ prototypes and assign their centers to $\nu^{(0)}_1, \nu^{(0)}_2, \ldots, \nu^{(0)}_K$.
  Repeat the following until convergence.
    Assign prototype $p_l$ ($l = 1, 2, \ldots, L$) to the nearest cluster $C_s$ with $\nu^{(t)}_s$.
    Here, the nearness between $p_l = (\mu_l, \Sigma_l, n_l)$ and $\nu^{(t)}_s$ is measured by $\|\nu^{(t)}_s - \mu_l\|$.
    $t \leftarrow t + 1$.
    Update $\nu^{(t)}_k$ ($k = 1, 2, \ldots, K$) by $\nu^{(t)}_k = \frac{1}{N_k}\sum_{p_l \in C_k} n_l\, \mu_l$, where $N_k = \sum_{p_l \in C_k} n_l$.
Fig. 3. Volume prototype k-means algorithm (VKM).
In the original EM algorithm, the Mahalanobis distance between a sample $x$ and the mean
$\nu_k$ is used. Here, since $p$ has a volume, we use the expected Mahalanobis distance on prototype $p_j$:
$$E_{X \sim p_j}\left[(X - \nu_k)' \Sigma_k^{-1} (X - \nu_k)\right]
= E_{X \sim p_j}\left[(X - \mu_j + \mu_j - \nu_k)' \Sigma_k^{-1} (X - \mu_j + \mu_j - \nu_k)\right]
= \mathrm{tr}\,\Sigma_k^{-1}\Sigma_j + (\mu_j - \nu_k)' \Sigma_k^{-1} (\mu_j - \nu_k).$$
Then, with the prior probability $\pi_k$, the membership value is updated as
$$m_k(p_j) = \pi_k (2\pi)^{-D/2} |\Sigma_k|^{-1/2}\cdot\exp\left(-\frac{1}{2}\left(\mathrm{tr}\,\Sigma_k^{-1}\Sigma_j + (\mu_j - \nu_k)'\Sigma_k^{-1}(\mu_j - \nu_k)\right)\right).$$
Here, we note that the expected Mahalanobis distance converges to the ordinary Ma-
halanobis distance $(\mu_j - \nu_k)' \Sigma_k^{-1} (\mu_j - \nu_k)$ of the single point $\mu_j$ under the assumption
that $\Sigma_j = \epsilon I$ and $\epsilon \to 0$. Therefore, this way of assigning membership values is an
extension of the traditional E step.
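To make the procedure concrete, here is a rough sketch of one VEM round in NumPy. Only the update rules of Fig. 2 are taken from the text; the array layout, the function name and the explicit normalization of the memberships (implied by $\sum_k m_k(p) = 1$) are our own choices.

```python
import numpy as np

def vem_step(mus, Sigmas, ns, m):
    """One round of VEM (Fig. 2): an M step from the current memberships m[l, k] = m_k(p_l),
    followed by an E step that recomputes them from the expected Mahalanobis distances."""
    L, d = mus.shape                                      # centers (L, d), covariances (L, d, d), counts (L,)
    K = m.shape[1]

    # M step: weighted maximum-likelihood estimates with weights m_k(p_l) * n_l
    w = m * ns[:, None]                                   # (L, K)
    pi = w.sum(axis=0) / ns.sum()
    nu = (w.T @ mus) / w.sum(axis=0)[:, None]             # (K, d)
    Sig = np.empty((K, d, d))
    for k in range(K):
        diff = mus - nu[k]
        Sig[k] = np.einsum('l,lij->ij', w[:, k],
                           Sigmas + np.einsum('li,lj->lij', diff, diff)) / w[:, k].sum()

    # E step: m_k(p_l) proportional to pi_k times the Gaussian term with the trace correction
    m_new = np.empty((L, K))
    for k in range(K):
        inv = np.linalg.inv(Sig[k])
        diff = mus - nu[k]
        dist = np.einsum('lij,ji->l', Sigmas, inv) + np.einsum('li,ij,lj->l', diff, inv, diff)
        m_new[:, k] = pi[k] * (2 * np.pi) ** (-d / 2) \
                      * np.linalg.det(Sig[k]) ** (-0.5) * np.exp(-0.5 * dist)
    m_new /= m_new.sum(axis=1, keepdims=True)             # enforce sum_k m_k(p_l) = 1
    return pi, nu, Sig, m_new
```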
4.2 k-means Algorithm with Volume Prototypes (VKM)
It is also easy to make the k-means algorithm applicable to volume prototypes (VKM).
Here, it is also assumed that the number K of clusters is determined beforehand.
When $J$ volume prototypes $\{p_j = (\mu_j, \Sigma_j, n_j)\}$ ($j = 1, 2, \ldots, J$) are combined
into one, we have a combined volume prototype $p = (\mu, \Sigma, n)$:
$$n = \sum_{j=1}^{J} n_j, \qquad \mu = \sum_{j=1}^{J} \frac{n_j}{n}\,\mu_j, \qquad \Sigma = \sum_{j=1}^{J} \frac{n_j}{n}\left\{\Sigma_j + (\mu_j - \mu)(\mu_j - \mu)'\right\}. \qquad (5)$$
These relations hold even if $\{p_j\}$ are samples $\{x_j\}$ (easily verified with $\Sigma_j = 0$, $\mu_j = x_j$ and $n_j = 1$).
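Eq. (5) is straightforward to implement; the following small sketch assumes NumPy arrays of shapes (J, d), (J, d, d) and (J,), with names of our choosing.

```python
import numpy as np

def merge_prototypes(mus, Sigmas, ns):
    """Combine J volume prototypes into one prototype (mu, Sigma, n) following Eq. (5)."""
    n = ns.sum()
    mu = (ns[:, None] * mus).sum(axis=0) / n
    diff = mus - mu
    Sigma = np.einsum('j,jab->ab', ns, Sigmas + np.einsum('ja,jb->jab', diff, diff)) / n
    return mu, Sigma, n
```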
To extend the sample-based k-means to volume prototypes, all we need is to define
the distance between a volume prototype and a cluster center. Under the assumption
that a volume prototype (and every sample in it) belongs to one cluster only, the
squared-distance minimization criterion $Q$ used for samples in the original k-means
can also be used for volume prototypes:
$$Q = \sum_{k=1}^{K}\sum_{x \in C_k}\|x - \nu_k\|^2 = \sum_{k=1}^{K}\sum_{x \in C_k}\mathrm{tr}\,(x - \nu_k)(x - \nu_k)' \qquad (C_k \text{ is the } k\text{th cluster})$$
$$= \sum_{k=1}^{K}\sum_{p_j \in C_k} n_j\left(\mathrm{tr}\,\Sigma_j + \|\mu_j - \nu_k\|^2\right) \qquad (\text{Eq. (5)})$$
$$= \sum_{p_j} n_j\,\mathrm{tr}\,\Sigma_j + \sum_{k=1}^{K}\sum_{p_j \in C_k} n_j\,\|\mu_j - \nu_k\|^2.$$
The first term is the irreducible error when volume prototypes are used. Therefore,
when we adopt the k-means algorithm to find a sub-optimal solution of this optimiza-
tion problem, we may assign a volume prototype to the nearest cluster in the dis-
tance $\|\mu_j - \nu_k\|$. By setting $\partial Q / \partial \nu_k = 0$, we obtain the estimated mean for each iteration:
$\nu_k = \frac{1}{N_k}\sum_{p_j \in C_k} n_j\,\mu_j$, where $N_k = \sum_{p_j \in C_k} n_j$. The concrete algorithm is shown
in Fig. 3.
In the VKM algorithm, the cluster centers can be initialized with the first K prototypes.
This strategy is effective because the VP algorithm has already selected these K proto-
types, in order, so as to cover as many samples as possible.
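As an illustration, here is a minimal NumPy sketch of VKM under these conventions; the fixed iteration cap and the convergence test are simplifications of ours.

```python
import numpy as np

def vkm(mus, ns, K, max_iter=100):
    """k-means over volume prototypes (VKM, Fig. 3): assign each prototype to the nearest
    center by ||nu_k - mu_l||, then recompute centers as n_l-weighted means of the centers."""
    nu = mus[:K].copy()                          # initialize with the first K prototype centers
    labels = np.zeros(len(mus), dtype=int)
    for _ in range(max_iter):
        d2 = ((mus[:, None, :] - nu[None, :, :]) ** 2).sum(axis=-1)   # (L, K) squared distances
        labels = d2.argmin(axis=1)
        nu_new = nu.copy()
        for k in range(K):
            mask = labels == k
            if mask.any():
                nu_new[k] = (ns[mask, None] * mus[mask]).sum(axis=0) / ns[mask].sum()
        if np.allclose(nu_new, nu):              # stop when the centers no longer move
            break
        nu = nu_new
    return nu, labels
```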
5 Experiments
In the following experiments, we used three 2-dimensional artificial datasets:
1. Circle: Two clusters. One cluster is concentrated around the origin and is surrounded
by the other cluster at a distance.
2. 4-Cross: Four Gaussians that cross at four corners.
3. 5-Gaussian: Five Gaussians. Some of them are close.
In each dataset, 100,000 samples were generated according to the specified distribution.
We chose $M = 100$ seed volume prototypes with radius $r = \chi^2_2(0.9)$ in the VP algorithm.
The first $N = 1000$ samples were used for the ME step and prototype selection.
The obtained volume prototypes are shown in Fig. 4 (1st column). Only the selected
volume prototypes are shown, so the number $L$ is considerably smaller than $M = 100$.
We can see that 1) the set of volume prototypes represents the distribution well, 2)
they are located inside the distribution because of $\theta = 0.9$, and 3) their number (25, 23
and 24, respectively) is far smaller than the number of samples, $10^5$.
5.1 Experiment 1: Clustering
We applied our VKM clustering method to the selected final prototypes and compared
it with a one-pass k-means algorithm, SKM (scalable k-means) [11]. SKM was applied
to all the data. We examined two different numbers of clusters. The results are shown
in Fig. 4 (2nd and 3rd columns).
From Fig. 4, we see:
1. When the specified number K of clusters is the same as that of the underlying
distribution, the cluster centers found by VKM and SKM are comparable.
2. If K is larger than the correct number, the cluster centers found by VKM stay inside
the distribution, in contrast to those found by SKM.
5.2 Experiment 2: Mixture Models
We then applied our VEM method to the same selected final prototypes, using the
results of VKM to initialize the membership values of the mixture model. We
compared it with the incremental EM algorithm [5], in which the number of blocks
over which a partial E-step was carried out was set to 100. Two different numbers of
components were examined. The results are also shown in Fig. 4 (4th and 5th columns).
From Fig. 4, we see the following:
1. The incremental EM and VEM are almost comparable when K is correct.
2. If K is larger than the correct number, the components found by VEM are more stably
estimated than those of the incremental EM. Note that the number of components is
automatically reduced to a nearly optimal number. This is because volume prototypes
limit the excessive production of components containing only a small number of samples
(see the results of 4-Cross with K = 8 and 5-Gaussian with K = 10).
3. The components obtained by VEM are narrower than those of the incremental EM.
This is because the components in VEM are generated from volume prototypes, which
give an inner approximation of the distribution.
Overall, our VEM algorithm is more stable than the incremental EM algorithm.
Fig. 4. Results of clustering and mixture models in three datasets (θ = 0.90). In clustering,
the cluster centers are shown as black dots. In mixture models, the covariance matrix of each
component is shown.
5.3 Computation Time
The calculation time is shown in Table 1. It is clear that our two algorithms of VKM and
VEM are faster than SKM and the incremental EM algorithm, respectively, as long as
we do not take into account the time consumed for obtaining volume prototypes. Even
if we add the time for obtaining volume prototypes, VEM is faster than the incremental
EM. The time advantage of VKM and VEM would be increased if we use a larger
Table 1. Time comparison in clustering and mixture models.

Dataset            K    Time (sec.)
                        VP      VKM     SKM     VEM     Incremental EM
Circle              5   5.908   0.001   4.204   0.012   10.136
(25 prototypes)    10           0.004   4.216   0.048   19.133
4-Cross             4   7.440   0.001   2.204   0.028    7.732
(23 prototypes)     8           0.001   2.532   0.184   15.332
5-Gaussian          5   7.928   0.001   2.544   0.012    9.808
(24 prototypes)    10           0.008   3.520   0.104   18.825
dataset, because VP is a completely one-pass algorithm. It should be noted that VKM
and VEM are separately applicable to the same set of volume prototypes. In addition,
we can try several values of K very efficiently with VKM and VEM for model selection.
6 Conclusions
In this paper, we have presented two algorithms for clustering and density estimation on
the basis of volume prototypes, which are obtained by a single-pass algorithm and can be
used in place of a huge dataset. The necessary number of volume prototypes is far smaller
than the number of given samples; therefore, our algorithms work very efficiently for
huge datasets or data streams.
One of the proposed algorithms is a volume prototype version (VEM) of the EM algorithm,
intended for density estimation. Since each prototype has a volume, we extended the original
algorithm so as to take into account the volume and the number of samples included.
The other algorithm is a k-means algorithm for volume prototypes (VKM). In this al-
gorithm, we developed a distance measure between a volume prototype and a cluster
center as a natural extension of its point version.
We confirmed the efficiency of both algorithms in experiments with 2-di-
mensional artificial data. The main advantage of these algorithms is that, once the volume
prototypes are given, both can be carried out at low cost. We will further
investigate their applicability to high-dimensional real-world datasets.
References
1. Tabata, K., Kudo, M.: Information compression by volume prototypes. The IEICE Technical
Report, PRMU, 106 (2006) 25–30 (in Japanese)
2. Sato, M., Kudo, M., Toyama, J.: Behavior Analysis of Volume Prototypes in High Dimen-
sionality. In: Structural, Syntactic and Statistical Pattern Recognition, Lecture Notes in Com-
puter Science. Volume 5342., Springer (2008) 884–894
3. Zhang, T., Ramakrishnan, R., Livny, M.: Fast density estimation using CF-kernel for very
large databases. Proceedings of the fifth ACM SIGKDD international conference on Knowl-
edge discovery and data mining (1999) 312–316
4. Arandjelović, O., Cipolla, R.: Incremental learning of temporally-coherent Gaussian mix-
ture models. Proceedings of the IAPR British Machine Vision Conference (2005) 759–768
5. Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and
other variants. Learning in Graphical Models (1999) 355–368
6. Thiesson, B., Meek, C., Heckerman, D.: Accelerating EM for Large Databases. Machine
Learning 45 (2001) 279–299
7. Charikar, M., O’Callaghan, L., Panigrahy, R.: Better Streaming Algorithms for Cluster-
ing Problems. Proceedings of the thirty-fifth annual ACM symposium on Theory of comput-
ing (2003) 30–39
8. Bradley, P.S., Fayyad, U.M., Reina, C.A.: Scaling clustering algorithms to large databases.
Knowledge Discovery and Data Mining (1998) 9–15
9. Zhang, T., Ramakrishnan, R., Livny, M.: Birch: an efficient data clustering method for very
large databases. SIGMOD Rec., 25 (1996) 103–114
10. Goswami, A., Jin, R., Agrawal, G.: Fast and exact out-of-core k-means clustering. IEEE
International Conference on Data Mining (2004) 83–90
11. Farnstrom, F., Lewis, J., Elkan, C.: Scalability for clustering algorithms revisited. SIGKDD
Explor. Newsl., 2 (2000) 51–57