
1.2. For n = 1, . . . , N − 1, [for j = n + 1, . . . , N, compute α_n = min_j d(n, j) and β_n = arg min_j d(n, j)].
2. Iteration: For k = N, N − 1, . . . , K, do
2.1. get the cluster with minimum distance: for i = 1, . . . , k, find C_n such that α_n = min(α_i);
2.2. merge clusters: form C_{<n,β_n>} by merging C_n and C_{β_n}, and set C_n ← C_{<n,β_n>};
2.3. update the clusters preceding C_n: for n′ = 1, . . . , n − 1, [compute d(n′, n) and update α_{n′} and β_{n′} if necessary; if β_{n′} = β_n then recompute α_{n′} and β_{n′}];
2.4. update the newly formed cluster C_n: for n′ = n + 1, . . . , k, compute d(n, n′) and update α_n and β_n;
2.5. update the clusters following C_n: for n′ = n + 1, . . . , β_n − 1, if β_{n′} = β_n then recompute α_{n′} and β_{n′};
2.6. erase cluster C_{β_n} from the cluster list.
3. Finish: For k = 1, . . . , K, output θ_k, π_k and C_k.
The log-likelihood distance depends only on the objects of the clusters being merged, and all the other distances remain unchanged. However, the time complexity of the algorithm is between O(N²) and O(N³) (Meilă and Heckerman, 1998).
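To make the bookkeeping concrete, here is a minimal Python sketch of the agglomerative loop. The function names, the representation of clusters as lists of object indices, and the placeholder distance(ci, cj) (standing in for the log-likelihood distance) are our assumptions; for brevity the nearest-neighbour arrays α and β are recomputed after every merge rather than updated incrementally as in steps 2.3–2.5, so the sketch sits at the O(N³) end of the range quoted above.

```python
import math

def nearest_neighbours(clusters, distance):
    """Step 1.2: for each cluster n, alpha[n] is the smallest distance to a
    later cluster and beta[n] is the index of the cluster attaining it."""
    k = len(clusters)
    alpha, beta = [math.inf] * k, [-1] * k
    for n in range(k - 1):
        for j in range(n + 1, k):
            d = distance(clusters[n], clusters[j])
            if d < alpha[n]:
                alpha[n], beta[n] = d, j
    return alpha, beta

def hac(num_objects, K, distance):
    """Merge singleton clusters until only K remain (steps 2.1, 2.2, 2.6)."""
    clusters = [[n] for n in range(num_objects)]       # step 1.1: one object per cluster
    while len(clusters) > K:
        alpha, beta = nearest_neighbours(clusters, distance)
        n = min(range(len(clusters) - 1), key=alpha.__getitem__)  # 2.1: minimal alpha_n
        clusters[n] = clusters[n] + clusters[beta[n]]  # 2.2: C_n <- merge(C_n, C_beta_n)
        del clusters[beta[n]]                          # 2.6: erase C_beta_n
        # steps 2.3-2.5 would instead update alpha and beta incrementally here
    return clusters
```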
Another issue is the computation of the distance between two clusters that each contain one object. For a nominal (unordered) attribute, the Hamming distance is used to calculate the difference between two observed values; for an ordinal attribute, the normalized Manhattan distance is applied; for a frequency aggregate attribute that takes a vector value, the normalized Euclidean distance between the two vector values is calculated, with a normalizing constant of 1/√2.
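As an illustration, the sketch below computes the three attribute-level distances just described. The function names, the integer coding of ordinal levels, and the assumption that frequency-aggregate vectors are relative frequencies summing to one (so that the 1/√2 constant rescales the Euclidean distance into [0, 1]) are ours, not the paper's.

```python
import math

def nominal_distance(a, b):
    """Hamming distance for a nominal attribute: 0 if the values match, else 1."""
    return 0.0 if a == b else 1.0

def ordinal_distance(a, b, n_levels):
    """Normalized Manhattan distance for an ordinal attribute whose values
    are coded as integer levels 0 .. n_levels - 1."""
    return abs(a - b) / (n_levels - 1)

def frequency_distance(u, v):
    """Normalized Euclidean distance between two frequency vectors, scaled by
    1/sqrt(2); assumes u and v are relative frequencies that each sum to 1,
    so the unscaled distance is at most sqrt(2)."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v))) / math.sqrt(2)
```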
In practice, HAC based on the classification model often gives good but suboptimal partitions. The EM algorithm can further refine and relocate partitions when started sufficiently close to the optimal value. The mixture clustering likelihood is used as the basis for the EM algorithm, because it models a conditional probability τ_nk that an object x(n) belongs to a cluster C_k; in contrast, τ_nk is assumed to be either 1 or 0 in the classification model. The EM algorithm is a general approach for maximizing the likelihood in the presence of hidden variables and missing data (Fraley and Raftery, 1998), i.e. the class label attribute, τ_nk and π_k.
1. E-step: for n = 1, . . . , N and k = 1, . . . , K, compute the conditional expectation of τ_nk by

$$\hat\tau_{nk} = \frac{\hat\pi_k\, P_k(x(n)\,|\,\hat\theta_k)}{P(x(n))} = \frac{\hat\pi_k \prod_{m=1}^{M}\prod_{q=1}^{q_m} \hat\theta_{kmq}^{\,x_{mq}(n)}}{\sum_{k'=1}^{K} \hat\pi_{k'} \prod_{m=1}^{M}\prod_{q=1}^{q_m} \hat\theta_{k'mq}^{\,x_{mq}(n)}},$$

where x_mq(n) stands for the value (1 or 0) of x(n).A_m in its q-th category.
2. M-step: for k = 1, . . . , K, estimate the expectation of π_k and θ_k by

$$\hat\pi_k = \frac{1}{N}\sum_{n=1}^{N} \hat\tau_{nk}, \qquad \hat\theta_{kmq} = \frac{\sum_{n=1}^{N} \hat\tau_{nk}\, x_{mq}(n)}{\sum_{n=1}^{N} \hat\tau_{nk}}.$$
The iteration will converge to a local maximum of the likelihood under mild conditions, although the convergence rate may be slow in most cases.
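For reference, a single EM iteration of the above can be sketched in NumPy as follows, assuming the data are one-hot encoded as an array X[n, m, q] (zero-padded where an attribute has fewer than max q_m categories) and the parameters are stored as pi[k] and theta[k, m, q]; the array names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def em_step(X, pi, theta, eps=1e-12):
    """One EM iteration for the categorical mixture.
    X     : (N, M, Q) one-hot data, x_mq(n) in {0, 1}
    pi    : (K,)      mixing proportions pi_k
    theta : (K, M, Q) category probabilities theta_kmq
    """
    N = X.shape[0]
    # E-step: log of prod_m prod_q theta_kmq ** x_mq(n), for every (n, k)
    log_p = np.einsum('nmq,kmq->nk', X, np.log(theta + eps))
    log_tau = np.log(pi + eps) + log_p
    log_tau -= log_tau.max(axis=1, keepdims=True)     # guard against underflow
    tau = np.exp(log_tau)
    tau /= tau.sum(axis=1, keepdims=True)             # tau_hat_nk

    # M-step: re-estimate the mixing proportions and category probabilities
    pi_new = tau.sum(axis=0) / N                      # pi_hat_k
    theta_new = np.einsum('nk,nmq->kmq', tau, X)      # weighted category counts
    theta_new /= tau.sum(axis=0)[:, None, None]       # theta_hat_kmq
    return pi_new, theta_new, tau
```

In use, such a step would be repeated from the HAC partition until the change in log-likelihood falls below a tolerance, as in the experiments of Section 5.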
The BIC provides a kind of score function that not only measures the goodness of fit of the model to the data, but also penalizes the model complexity, e.g. the total number of model parameters or the storage space of the model structure. We apply BIC to both the classification model (Equation (4)) and the mixture clustering model (Equation (5)). Accordingly, the smaller the value of BIC, the stronger the model. BIC_C, in model-based HAC, is used to compute the upper bound (stopping rule) of the EM; and BIC_M, in the EM algorithm, is applied to find the optimal number of clusters. A decisive first local minimum indicates strong evidence for a model with optimal parameters and number of clusters (see Figure 4 for an example).
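Equations (4) and (5) are not reproduced in this excerpt; as a sketch only, assuming the common form BIC = −2 log L̂ + ν log N (which matches the "smaller is better" convention above), the mixture score BIC_M could be computed as below, with ν counting the K − 1 free mixing proportions plus K·Σ_m (q_m − 1) free category probabilities under the θ_kmq parameterization. Both the parameter count and the one-hot data layout are our assumptions.

```python
import numpy as np

def mixture_bic(X, pi, theta, q_m, eps=1e-12):
    """BIC_M = -2 log L + nu * log N for a K-cluster categorical mixture,
    with X one-hot encoded as (N, M, Q) and q_m the list of category counts."""
    N, K = X.shape[0], pi.shape[0]
    log_p = np.einsum('nmq,kmq->nk', X, np.log(theta + eps))  # per-cluster log P_k(x(n))
    m = log_p.max(axis=1, keepdims=True)                      # log-sum-exp for stability
    log_lik = np.sum(m[:, 0] + np.log((pi * np.exp(log_p - m)).sum(axis=1) + eps))
    nu = (K - 1) + K * sum(q - 1 for q in q_m)                # number of free parameters
    return -2.0 * log_lik + nu * np.log(N)
```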
5 EXPERIMENTAL RESULTS
We apply the approach to a real-world relational dataset, which contains about 10,000 records of survey information on various types of dwellings. As mentioned in Section 2, the data are modelled using a relational aggregate schema, where the Dwelling table plays the role of the base class, with the Occupants table and the Rooms table being two part classes. We chose several significant attributes from the three tables and processed their value domains so that all the attributes are categorical. After aggregating the attributes of the Occupants table and the Rooms table, we obtained a set of composite objects with the aggregate attributes (OccupantsNo, AdultsNo, PfaGender, PfaReligion, TotalIncome, RoomNo, BedroomNo, PfaDefect), in which PfaGender, PfaReligion and PfaDefect are three partial frequency aggregate attributes with vector values (see an example in Figure 3).
Table 1: Experimental Results

number of objects         1,000    3,000    5,000    9,530
(lower, upper) bound      (2,7)    (2,11)   (3,13)   (2,15)
number of clusters        6        9        11       14
HAC running time (sec.)   50       427      1,159    4,024
After removing the objects that have missing data, we were left with 9,530 aggregate objects, from which four groups were selected for clustering: 1,000 objects, 3,000 objects, 5,000 objects, and the whole dataset. The EM algorithm runs until either the difference between successive log-likelihood values is less than 10⁻⁵ or 100 iterations are reached. The results for the four groups are listed in Table 1. Figure 4 shows the two plots of the mixture BIC scores and −2·log-likelihood values against the number of clusters for the last two