general, if any, no more than the upper two to
three levels of such structures are defined at
company level, for example organized by products,
clients, or temporal aspects. All deeper structures are
created individually, leading to the well-known
problems of incomprehensible folder structures and
hence long search times and the danger of missing
relevant information. The SEEK!SDM system
allows information resources to be stored at all
nodes.
These resources are folders, called ‘dossiers’,
containing the actual information objects, which
may be of various formats, e.g. text or images, but also
personal or organisational data. The topics (nodes)
of the topic tree, the dossiers, and the structure of the
dossiers are created manually, guided by personal
opinions.
Hence, different versions of the same
information object, or even the very same
information object, might be stored in different
dossiers (and under different nodes), aggravating the
problem of finding all relevant information objects.
Searching for information is time consuming;
the added risk of not finding relevant documents at
all, or of finding a relevant but not the latest document,
motivates a hierarchical structure of information that is (a)
independent of personal opinions and (b)
complete with respect to filing related information
objects (e.g. all versions and all formats of a document)
under the same node.
Rather than searching blindly in inexplicable
hierarchical structures, always uncertain whether the right
information object has been found, searching in an
objectively comprehensible structure may decrease
retrieval time and the risk of not finding everything.
With our approach of automatically clustering
information objects, we provide such an objectively
comprehensible structure.
3 RELATED WORK
3.1 Hierarchical Clustering of
Documents
Hierarchical agglomerative clustering (HAC) (Cios
et al. 1998) is a well-known and popular
method for grouping data objects by similarity. HAC
is initialized by assigning each object to its own
cluster and then, in each iteration, merging the two
most similar clusters into a new cluster. This
procedure results in a so-called dendrogram, a binary
tree of clusters in which each branching reflects the
fact that two child nodes were merged into a parent
node in a given iteration of the algorithm.
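The HAC procedure described above can be illustrated by the following minimal sketch: a pure-Python single-linkage clustering of one-dimensional points that records each merge, i.e. each branching of the dendrogram. This is an illustrative toy, not the system's implementation; the function name and the choice of single linkage on 1-D data are assumptions for brevity.

```python
# Minimal sketch of hierarchical agglomerative clustering (HAC):
# every object starts in its own cluster; the two closest clusters
# (single linkage, 1-D points) are merged until one cluster remains.
# Illustrative only -- not the code of the cited system.

def hac(points):
    """Return the merge history: a list of (cluster_a, cluster_b)
    tuples, one per branching of the resulting dendrogram."""
    clusters = [[p] for p in points]   # one singleton cluster per object
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage
        # distance (minimum distance between any two members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return merges

history = hac([1.0, 1.1, 5.0, 5.2, 9.0])
# five objects yield four merges, i.e. a binary dendrogram
```

For real document collections one would instead use an optimized library routine (e.g. `scipy.cluster.hierarchy.linkage`) over vector representations of the documents.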
When the data objects are documents, a
dendrogram can be used as a means of navigation
within a document collection (see e.g. (Alfred et al.
2014)).
Alternative hierarchical clustering methods have
also been proposed for navigation, e.g. scatter/gather
(Cutting et al. 1993), where the user can influence
the clustering through interaction at run-time.
It has been recognized by many researchers that
binary trees are not an adequate representation of the
similarities and latent hierarchical relationships
between elements and clusters (Blundell et al. 2010).
Therefore, a number of approaches have been
proposed that cluster elements into multi-way trees.
Many of these approaches come from the area of
probabilistic latent semantic analysis, e.g. based on
Latent Dirichlet processes (Zavitsanos et al. 2011).
Other probabilistic approaches are based on greedy
algorithms, e.g. Bayesian Rose Trees (Blundell et al.
2010).
Another approach, similar to ours, derives a
non-binary tree by partitioning the dendrogram
resulting from HAC (Chuang & Chien 2004).
In this approach, for the current (sub-)tree, an optimal
cut level in the corresponding dendrogram is chosen
so as to maximize the coherence and
minimize the overlap of the resulting clusters.
Then, this procedure is applied to the (binary) sub-
trees of the resulting clusters. The approach has been
shown to be effective, but it has a number of free
parameters that are hard for end users to understand.
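The idea of cutting a dendrogram to obtain a multi-way partition can be sketched as follows. For single-linkage clustering of one-dimensional points, cutting the dendrogram at height t is equivalent to splitting the sorted points at gaps larger than t; recursing with a smaller threshold inside each cluster would yield the multi-way tree. This is a simplified illustration under those assumptions; the cited approach instead chooses the cut level automatically by optimizing cluster coherence and overlap.

```python
# Sketch of "cutting" a single-linkage dendrogram at height t.
# For sorted 1-D points this reduces to splitting at gaps larger
# than t, which yields a flat (possibly non-binary) partition.
# Illustrative only -- not the cited algorithm.

def cut(points, t):
    """Split sorted 1-D points into clusters at gaps larger than t."""
    pts = sorted(points)
    clusters, current = [], [pts[0]]
    for p in pts[1:]:
        if p - current[-1] > t:       # gap exceeds cut height: new cluster
            clusters.append(current)
            current = [p]
        else:
            current.append(p)
    clusters.append(current)
    return clusters

top = cut([1.0, 1.1, 5.0, 5.2, 9.0], 2.0)
# one cut produces three top-level clusters instead of a binary split
```

Applying `cut` again with a smaller threshold inside each resulting cluster corresponds to the recursive partitioning of the sub-trees described above.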
A problem shared by all these approaches is that data
elements are not allowed to reside in inner
nodes of the tree – something that users usually
expect and that naturally happens when hierarchies are
created manually.
3.2 Learning Topic Trees
Hierarchical structures for organizing document
collections only become useful when each node in
such a structure has a meaningful label – only then
is it possible for users to navigate and locate desired
content. We call a hierarchical organization of
documents (a tree) a topic tree if the nodes of the
tree have labels.
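A topic tree in this sense can be sketched as a labeled, multi-way tree in which any node, inner nodes included, may hold documents, as noted above for manually created hierarchies. The class and field names below are hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a topic tree: every node carries a label,
# and any node -- not only the leaves -- may hold documents.

@dataclass
class TopicNode:
    label: str                                    # meaningful node label
    documents: list = field(default_factory=list) # documents filed here
    children: list = field(default_factory=list)  # sub-topics

    def all_documents(self):
        """Collect documents from this node and all descendants."""
        docs = list(self.documents)
        for child in self.children:
            docs.extend(child.all_documents())
        return docs

root = TopicNode("projects", documents=["overview.txt"])
root.children.append(TopicNode("client-a", documents=["contract.pdf"]))
```

Here the inner node "projects" holds a document itself, which the binary-tree clustering approaches discussed in Section 3.1 cannot represent.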
A number of researchers have explored the
challenge of labeling clusters in a flat (i.e. non-
hierarchical) clustering of textual documents
(Popescul & Ungar 2000), (Radev et al. 2004),
(Muller et al. 1999). These approaches are based on
term frequency statistics, selecting descriptors that