
 
RDF with natural language to be more familiar for 
users. 
In this context, we study how to automatically
structure collections by deriving concept hierarchies
from a document collection and how to automatically
generate a document hierarchy from them. The concept
hierarchy relies on discovering
“specialization/generalization” relations between the
concepts that appear in the documents of a corpus.
The concepts themselves are automatically identified
from the set of documents.
This method creates “specialization/
generalization” links between documents and
document parts. It can be considered a technique for
the automatic creation of specific typed links between
information parts; such typed links have been
advocated by several authors as a means for
structuring and navigating collections. The method also
associates with each document a set of keywords
representative of the main concepts in the document.
The proposed method is fully automatic: the
hierarchies are extracted directly from the corpus, so
the method can be applied to any document collection.
It could also serve as a basis for a manual organization.
The paper is organized as follows. In section 2 we
review related work. In section 3 we describe our
algorithm for the automatic generation of typed
“specialization/generalization” relations between
concepts and documents, and of the corresponding
hierarchies. In section 4 we discuss how our algorithm
addresses some questions of Semantic Web research.
In section 5 we propose numerical criteria for
measuring the relevance of our method. Section 6
describes experiments performed on small corpora
extracted from the LookSmart and New Scientist
hierarchies.
2 PREVIOUS WORK 
In this section we present related work on
automatically structuring document collections. We
discuss work on the generation of concept hierarchies
and on the discovery of typed links between
document parts. Since identifying concepts requires
segmenting documents into homogeneous themes, we
also briefly introduce this problem and describe the
segmentation method we use. Finally, we give some
pointers to work on natural language annotations for
the Semantic Web.
Many authors agree on the importance of typed
links in hypertext systems. Such links can prove
useful for providing a navigation context or for
improving the performance of search engines.
Some authors have developed link typologies.
[Randall Trigg, 1983] proposes a set of types useful for
scientific corpora, many of which can be
adapted to other corpora. [C. Cleary, R. Bareiss, 1996]
propose a set of types inspired by conversational
theory. These links are usually created manually.
[J. Allan, 1996] proposes an automatic method for
inferring a few typed links (revision and
abstract/expansion links). His philosophy is close to
the one used in this paper, in that he chose to avoid
complex text-analysis techniques. He deduces the type
of a link between two documents by analysing the
similarity graph of their subparts (paragraphs). We too
use similarity graphs (although of a different nature)
and corpus statistics to infer relations between
concepts and documents.
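To make the similarity-graph idea concrete, the following is a minimal sketch in the spirit of Allan's approach, not his actual implementation: paragraphs are compared by cosine similarity of term-frequency vectors, and paragraph pairs above a threshold become graph edges whose pattern can then suggest a link type. The function names and the threshold value are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def tf_vector(text):
    """Term-frequency vector for a paragraph (lowercased whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_graph(doc_a, doc_b, threshold=0.3):
    """Edges between paragraphs of two documents whose cosine similarity
    exceeds the threshold. The density and shape of the resulting edge
    set is what an Allan-style method analyses to assign a link type
    (e.g. many strong edges suggest a revision-like relation)."""
    vecs_a = [tf_vector(p) for p in doc_a]
    vecs_b = [tf_vector(p) for p in doc_b]
    edges = []
    for i, va in enumerate(vecs_a):
        for j, vb in enumerate(vecs_b):
            s = cosine(va, vb)
            if s > threshold:
                edges.append((i, j, round(s, 2)))
    return edges
```

For two documents given as lists of paragraph strings, `similarity_graph` returns `(paragraph_index_a, paragraph_index_b, similarity)` triples; the 0.3 threshold is an arbitrary placeholder that would need tuning on a real corpus.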
The generation of hierarchies is a classical problem
in information retrieval. In most cases the hierarchies
are built manually and only the classification of
documents into the hierarchy is automatic. Clustering
techniques have been used to create hierarchies
automatically, as in the Scatter/Gather algorithm [D.
R. Cutting et al., 1992]. Using related ideas but a
probabilistic formalism, [A. Vinokourov, M.
Girolami, 2000] propose a model that infers a
hierarchical structure for the unsupervised
organization of document collections. Hierarchical
clustering techniques have been widely used to organize
corpora and to support information retrieval. All these
methods cluster documents according to their
similarity; they cannot be used to produce topic
hierarchies or to infer generalization/specialization
relations.
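The similarity-based clustering these methods rely on can be sketched as a greedy agglomerative procedure; this is a generic single-link illustration under simple assumptions (term-frequency cosine similarity, pairwise merging), not the Scatter/Gather algorithm itself, which uses more efficient cluster-seeking steps.

```python
from collections import Counter
from math import sqrt

def tf(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    n = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / n if n else 0.0

def agglomerate(docs):
    """Greedy single-link agglomerative clustering: repeatedly merge the
    two most similar clusters, recording each merge as a nested pair of
    document indices. The nesting of pairs is the cluster hierarchy."""
    clusters = [(i, [tf(d)]) for i, d in enumerate(docs)]  # (label, member vectors)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: similarity of the closest pair of members
                s = max(cosine(a, b) for a in clusters[i][1] for b in clusters[j][1])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        (la, va), (lb, vb) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [((la, lb), va + vb)]
    return clusters[0][0]
```

Note that the resulting tree groups documents purely by lexical overlap; as the text above points out, nothing in such a tree says which cluster is more *general* than another, which is exactly why subsumption-based methods were introduced.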
Recently, it has been proposed to develop topic
hierarchies similar to those found in, e.g., Yahoo. As in
Yahoo, each topic is identified by a single term. These
term hierarchies are built from
“specialization/generalization” relations between the
terms, automatically discovered from the corpus.
[Lawrie and Croft, 2000; Sanderson and Croft, 1999]
propose to build term hierarchies based on the notion
of subsumption between terms. Given a set of
documents, some terms occur frequently across the
documents, while others occur in only a few.
Some of the frequently occurring terms
provide much information about the topics within the
documents: some terms broadly define the topics,
while others that co-occur with such a
general term describe aspects of a topic. Subsumption
attempts to harness the power of these words. A
subsumption hierarchy reflects the topics covered
within the documents; a parent term is more general
than its child. The key idea of Croft and co-workers
was to use a very simple but effective subsumption
measure. Term x subsumes term y if the following
relation holds:
P(x|y) > t and P(y|x) < P(x|y), where t is a preset
threshold. Using related ideas, [K. Krishna, R.
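The subsumption test above is straightforward to implement; the sketch below assumes, as in Sanderson and Croft's original formulation, that the conditional probabilities are estimated from document frequencies (the fraction of documents containing one term that also contain the other). The function name and the default threshold of 0.8 are illustrative assumptions.

```python
def subsumes(docs, x, y, t=0.8):
    """Test whether term x subsumes term y, i.e. P(x|y) > t and
    P(y|x) < P(x|y), estimating each probability from document
    frequencies. `docs` is a list of sets of terms, one set per document."""
    docs_with_x = [d for d in docs if x in d]
    docs_with_y = [d for d in docs if y in d]
    if not docs_with_x or not docs_with_y:
        return False
    both = sum(1 for d in docs_with_y if x in d)
    p_x_given_y = both / len(docs_with_y)  # how often y's documents also contain x
    p_y_given_x = both / len(docs_with_x)  # how often x's documents also contain y
    return p_x_given_y > t and p_y_given_x < p_x_given_y
```

Intuitively, a general term such as "animal" appears in nearly every document that mentions "cat" (high P(x|y)), while "cat" appears in only some of the documents that mention "animal" (lower P(y|x)), so the general term becomes the parent in the hierarchy.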
ICEIS 2004 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS