structures, the user will be unable to assimilate the
changes. In addition, the user will lose confidence in the
learning algorithm. The incremental versions of the
SAHN algorithm do not provide a stable structure.
Since their goal is to obtain the same tree we would
obtain with the original method they must, in many
cases, rebuild the entire tree or an entire branch of it.
The purpose of our work is to present a stable
incremental version of the SAHN algorithm. The proposed
algorithm adds a new case without changing the
previously obtained structures.
To evaluate the proposed algorithm we answer two
main questions in this paper. What is the quality of
the incremental dendrograms compared to those built
with the SAHN method? How does the arrival order
of the new cases affect the updating algorithm? We
experimented with 11 synthetic and 6 real datasets
and 3 algorithms based on the SAHN method. Re-
sults confirm that the incremental algorithm is able to
update dendrograms with no quality loss. Therefore,
we claim that this algorithm is a good choice in real
incremental environments where stability and
comprehensibility are important factors.
The next section describes the SAHN method and
some previous incremental approaches. The algorithm
we propose is described in Section 3. Section 4
describes the experiments designed to measure the
quality of the proposed algorithm, and Section 5
reports the results. Finally, Sections 6 and 7 discuss
the results and draw conclusions.
2 HIERARCHICAL CLUSTERING
In this section we first describe the SAHN method: a
method to obtain cluster hierarchies. Then, we briefly
describe two previously published incremental ver-
sions.
The SAHN method is based on two main steps.
First, each individual point of the input dataset forms
a cluster on its own. That is, if n denotes the number
of data points in the dataset the method begins with
a partition of n singleton clusters. Second, the two
closest clusters are merged. This second step is
repeated until all the points are merged into a single
cluster. This method automatically obtains a set of nested
partitions forming a hierarchy. The output of this pro-
cedure is a set of exactly n partitions, from the n sin-
gleton clusters to the all-in-one cluster partition.
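The two steps above can be sketched in Python as follows. This is a minimal, naive illustration and not the implementation used in the paper: the names sahn_partitions and single_linkage are hypothetical, the points are assumed to be one-dimensional numbers, and single-linkage is assumed as the cluster proximity measure.

```python
from itertools import combinations


def single_linkage(points, a, b):
    """Distance between the two nearest points, one from each cluster.

    Assumption: points are plain numbers, so |x - y| is the point distance.
    """
    return min(abs(points[i] - points[j]) for i in a for j in b)


def sahn_partitions(points, cluster_distance):
    """Return the nested partitions produced by naive agglomeration.

    Clusters are frozensets of indices into `points`.
    """
    # Step 1: every point forms a singleton cluster.
    clusters = [frozenset([i]) for i in range(len(points))]
    partitions = [list(clusters)]
    # Step 2: merge the two closest clusters until one cluster remains.
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(points, pair[0], pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        partitions.append(list(clusters))
    return partitions
```

For a dataset of n points the returned list holds n partitions, starting with the n singleton clusters and ending with the single all-in-one cluster.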
To measure the proximity of two clusters we
must define a cluster proximity measure. The SAHN
method is often used with one of the following
proximity measures: single-linkage (nearest neighbour),
complete-linkage (farthest neighbour) and
average-linkage (group average). Single-linkage computes the
distance between two clusters as the distance between
their two nearest points, one taken from each cluster.
Similarly, complete-linkage uses the distance between
their two farthest points, one taken from each cluster.
Finally, average-linkage computes the average distance
over all pairs formed by a point of one cluster and a
point of the other cluster.
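As an illustration, the three proximity measures can be written in a few lines of Python. This sketch assumes Euclidean distance between points and represents each cluster as a list of coordinate tuples; the function names are hypothetical and chosen only for this example.

```python
import math


def pairwise_distances(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [math.dist(p, q) for p in a for q in b]


def single_linkage(a, b):
    """Nearest neighbour: distance between the two nearest points."""
    return min(pairwise_distances(a, b))


def complete_linkage(a, b):
    """Farthest neighbour: distance between the two farthest points."""
    return max(pairwise_distances(a, b))


def average_linkage(a, b):
    """Group average: mean distance over all cross-cluster pairs."""
    distances = pairwise_distances(a, b)
    return sum(distances) / len(distances)
```

For the clusters [(0, 0), (1, 0)] and [(3, 0), (5, 0)] the cross-cluster distances are 3, 5, 2 and 4, so the three measures give 2, 5 and 3.5 respectively.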
Ribert et al. (1999) proposed an incremental
variation of the SAHN method. The tree they obtain is
exactly the one that would be obtained by applying the
SAHN method to the union of the initial dataset and the
new case. Although no analysis of the computational cost
of the incremental method is given, empirical results
show behaviour similar to that of SAHN. On the other
hand, memory usage is considerably reduced.
El-Sonbaty and Ismail (1998) described another
incremental version of the single-linkage algorithm.
Its computational cost is $O(n^2)$, as opposed to the $O(n^3)$
cost of the SAHN method. They empirically showed
that the algorithm obtains the same tree as the single-
linkage algorithm for some values of an internal pa-
rameter of the algorithm. Nevertheless, the new
method cannot be applied to other proximity mea-
sures commonly used with the SAHN method.
3 SIHC ALGORITHM
The aim of the algorithm we present, called SIHC, is
to incrementally add new cases to dendrograms built
with the SAHN method. The novelty of this incre-
mental algorithm is that the updated tree keeps its
main structure. This way users will easily assimilate
the small variations produced by the learned cases.
Although the algorithm can be used on its own, we
designed it as an updating method for SAHN-based
dendrograms.
SIHC is a top-down process described in Algo-
rithm 1. It is a recursive procedure that begins at the
root node. This procedure computes the distance be-
tween the new case and the cluster represented by the
current node. This distance can be based on any
proximity measure, so the method adapts to any
SAHN-based algorithm. If the height of the node is less
than or equal to the computed distance, the recursive
procedure stops: a new node, whose children are the new
case and the current node, is created and replaces
the current node in the dendrogram.
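The following Python sketch illustrates the stopping rule above together with the descent step described in the next paragraph. It is not the authors' Algorithm 1. Node, insert_case and single_linkage are hypothetical names, single-linkage with Euclidean distance is assumed as the proximity measure, and giving the new node a height equal to the computed distance is an assumption. The heights of traversed nodes are deliberately left unchanged, since that detail is not covered by this sketch.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    height: float = 0.0              # leaves have height 0
    point: Optional[Point] = None    # set only for leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def points(self):
        """All data points in the cluster represented by this node."""
        if self.point is not None:
            return [self.point]
        return self.left.points() + self.right.points()


def single_linkage(case, members):
    """Assumed proximity measure: distance to the nearest cluster member."""
    return min(math.dist(case, p) for p in members)


def insert_case(node, case, proximity=single_linkage):
    """Insert `case` below `node` and return the root of the updated subtree."""
    d = proximity(case, node.points())
    if node.height <= d:
        # Stop: a new node with the new case and the current node as
        # children replaces the current node (height = d is an assumption).
        return Node(height=d, left=node, right=Node(point=case))
    # Otherwise recurse into the child nearest to the new case.
    # Heights of traversed nodes are NOT updated in this sketch.
    if proximity(case, node.left.points()) <= proximity(case, node.right.points()):
        node.left = insert_case(node.left, case, proximity)
    else:
        node.right = insert_case(node.right, case, proximity)
    return node
```

Because a leaf has height 0 and distances are non-negative, the recursion always stops at a leaf at the latest, so the descent never reaches a missing child.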
If the height of the current node is greater than the
distance from the new case to it, the recursive
procedure is repeated on the child nearest to the new case.
In this case the height of the traversed node should be
SIHC: A STABLE INCREMENTAL HIERARCHICAL CLUSTERING ALGORITHM