
RDF with natural language to be more familiar for
users.
In this context, we study how to automatically
structure collections by deriving concept hierarchies
from a document collection and how to automatically
generate a document hierarchy from them. The concept
hierarchy relies on the discovery of
“specialization/generalization” relations between the
concepts that appear in the documents of a corpus.
The concepts themselves are automatically identified
from the set of documents.
This method creates “specialization /
generalization” links between documents and
document parts. It can be considered a technique for
the automatic creation of specific typed links between
information parts. Such typed links have been
advocated by several authors as a means for
structuring and navigating collections. The method also
associates with each document a set of keywords
representative of the main concepts in the document.
The proposed method is fully automatic: the
hierarchies are directly extracted from the corpus, so it
can be used for any document collection. It could
also serve as a basis for a manual organization.
The paper is organized as follows. In section 2 we
review previous related work. In section 3, we
describe our algorithm for the automatic generation of
typed “specialization/generalization” relations between
concepts and documents, and of the corresponding
hierarchies. In section 4 we discuss how our algorithm
answers some questions of Semantic Web research.
In section 5 we propose numerical criteria for
measuring the relevance of our method. Section 6
describes experiments performed on small corpora
extracted from the Looksmart and New Scientists
hierarchies.
2 PREVIOUS WORK
In this section we present related work on
automatically structuring document collections. We
discuss work on the generation of concept hierarchies
and on the discovery of typed links between
document parts. Since we identify concepts by
segmenting documents into homogeneous
themes, we also briefly present this problem and
describe the segmentation method we use. We also
give some pointers to work on natural language
annotations for the Semantic Web.
Many authors agree on the importance of typed
links in hypertext systems. Such links can prove
useful for providing a navigation context or for
improving the performance of search engines.
Some authors have developed link typologies.
[Randall Trigg, 1983] proposes a set of types useful for
scientific corpora, many of which can be
adapted to other corpora. [C. Cleary, R. Bareiss, 1996]
propose a set of types inspired by conversational
theory. These links are usually created manually.
[J. Allan, 1996] proposes an automatic method for
inferring a few typed links (revision and
abstract/expansion links). His philosophy is close to
the one used in this paper, in that he chose to avoid
complex text analysis techniques. He deduces the type
of a link between two documents by analysing the
similarity graph of their subparts (paragraphs). We too
use similarity graphs (although of a different nature)
and corpus statistics to infer relations between
concepts and documents.
The generation of hierarchies is a classical problem
in information retrieval. In most cases the hierarchies
are built manually and only the classification of
documents into the hierarchy is automatic. Clustering
techniques have been used to create hierarchies
automatically, as in the Scatter/Gather algorithm [D.
R. Cutting et al., 1992]. Using related ideas but within
a probabilistic formalism, [A. Vinokourov, M.
Girolami, 2000] propose a model that infers a
hierarchical structure for the unsupervised
organization of document collections. Hierarchical
clustering techniques have been widely used to organize
corpora and to support information retrieval. All these
methods cluster documents according to their
similarity; they cannot be used to produce topic
hierarchies or to infer generalization/specialization
relations.
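To illustrate the similarity-based clustering mentioned above, here is a minimal greedy agglomerative sketch in Python (the toy corpus, the cosine measure, and all names are our own illustrative choices, not the Scatter/Gather algorithm itself):

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(docs):
    """Greedy agglomerative clustering: repeatedly merge the two most
    similar clusters. Returns the merge history as a nested tuple."""
    clusters = [(Counter(d.split()), d) for d in docs]  # (vector, label)
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: cosine(clusters[p[0]][0], clusters[p[1]][0]),
        )
        vi, li = clusters[i]
        vj, lj = clusters[j]
        merged = (vi + vj, (li, lj))
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters[0][1]

tree = agglomerate(["cat dog cat", "dog cat",
                    "stock market", "market stock stock"])
print(tree)  # (('cat dog cat', 'dog cat'), ('stock market', 'market stock stock'))
```

Note that the resulting tree only groups documents by lexical similarity: no node in it is "more general" than its children, which is exactly why such methods cannot produce topic hierarchies.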
Recently, it has been proposed to develop topic
hierarchies similar to those found in e.g. Yahoo. As in
Yahoo, each topic is identified by a single term. These
term hierarchies are built from
“specialization/generalization” relations between the
terms, automatically discovered from the corpus.
[Lawrie and Croft 2000, Sanderson and Croft 1999]
propose to build term hierarchies based on the notion
of subsumption between terms. Given a set of
documents, some terms occur frequently among
the documents, while others occur in only a few
documents. Some of the frequently occurring terms
provide a lot of information about topics within the
documents: a few terms broadly define
the topics, while others which co-occur with such a
general term describe aspects of a topic. Subsumption
attempts to harness the power of these words. A
subsumption hierarchy reflects the topics covered
within the documents: a parent term is more general
than its child. The key idea of Croft and co-workers
has been to use a very simple but efficient subsumption
measure. Term x subsumes term y if the following
relation holds:
P(x|y) > t and P(y|x) < P(x|y), where t is a preset
threshold. Using related ideas, [K. Krishna, R.
ICEIS 2004 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS
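The subsumption test above can be sketched directly from document co-occurrence counts. In this minimal Python sketch, the conditional probabilities are estimated as document frequencies, and the threshold value t = 0.8 is an illustrative assumption, not a value taken from the cited papers:

```python
def subsumes(x, y, docs, t=0.8):
    """Term x subsumes term y when P(x|y) > t and P(y|x) < P(x|y).
    docs is a collection of sets of terms; t = 0.8 is an illustrative
    threshold choice."""
    docs_y = [d for d in docs if y in d]
    docs_x = [d for d in docs if x in d]
    if not docs_y or not docs_x:
        return False
    # Estimate P(x|y): fraction of documents containing y that also contain x.
    p_x_given_y = sum(1 for d in docs_y if x in d) / len(docs_y)
    # Estimate P(y|x): fraction of documents containing x that also contain y.
    p_y_given_x = sum(1 for d in docs_x if y in d) / len(docs_x)
    return p_x_given_y > t and p_y_given_x < p_x_given_y

# "science" occurs in every document; "physics" occurs in only two,
# always together with "science": science subsumes physics, not vice versa.
corpus = [{"science"},
          {"science", "physics"},
          {"science", "physics"},
          {"science", "biology"}]
print(subsumes("science", "physics", corpus))  # True
print(subsumes("physics", "science", corpus))  # False
```

The asymmetry of the test is what makes the parent more general than the child: the specific term rarely appears without the general one, while the general term often appears alone.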