Web, and bootstrapping);
• the Correspondence Analysis framework has been
employed for the computation of distributional
similarity between terms, which is then used as
a basis to extract different types of relationships.
In Section 2 we present the main approaches to the
problem of the extraction of concept hierarchies that
have been investigated in the context of our project.
In Section 3 we explain in detail the steps of our so-
lution for the extraction of concept hierarchies from
free text. In Section 4 we present a summary of the
results for the tests we performed on the different plu-
gins of our tool. In Section 5 we draw our conclusions
and summarize directions of our future work.
2 ONTOLOGIES FROM TEXT
The extraction of ontologies from text can be super-
vised by a human expert or use any sort of structured
data as an additional source. In the former case, we
speak of assisted or semi-automatic learning; in the
latter, we refer to oracle-guided learning; if the
algorithm uses neither structured sources nor human
help, it is considered an automatic learner. As an
orthogonal distinction, when the objective of the algo-
rithm is the expansion of a pre-constructed ontology,
we talk about bootstrapping instead of learning.
A large number of methods for ontology learning
from text are based on the same conceptual approach,
known as Distributional Similarity. It is based
on Harris’ Distributional Hypothesis (Harris, 1968):
“Words are similar to the extent that they
share similar context”.
The idea is to analyze the co-occurrence of words
within the same sentence, paragraph, document, or other
type of context. The more similar two words are in
their distribution over these contexts, the more
semantically similar they are expected to be, and the
more likely a directed relation is to hold between them.
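This intuition can be made concrete with a small sketch (the toy sentences and the choice of sentence-level contexts are our own illustration, not taken from any of the systems discussed below): each word is represented by a vector of co-occurrence counts over its contexts, and distributional similarity is measured as the cosine between two such vectors.

```python
from collections import Counter
from math import sqrt

def context_vector(word, sentences):
    """Count co-occurrences of `word` with every other word
    appearing in the same sentence (the chosen context unit)."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sentences = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat ate the fish",
    "the dog ate the bone",
]

# "cat" and "dog" share most of their contexts, so their
# distributional similarity is high.
sim = cosine(context_vector("cat", sentences),
             context_vector("dog", sentences))
```

Real systems differ mainly in the choice of context (sentence, paragraph, document, or syntactic role) and of the similarity measure; the cosine over raw counts used here is only the simplest option.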
In Distributional Similarity approaches, concepts
are extracted and organized using some representation
according to their distributional similarity. Then, dif-
ferent methods can be used to identify relationships
between neighbors. ASIUM (Faure D., 1998), for ex-
ample, is a system for the generation of concept hi-
erarchies that uses as context the verb of the sentence
in which a concept appears and the syntactic function
(i.e., subject, object, or other complement) of the
concept itself. The tool presents semantically simi-
lar words to the user, who can then suggest their
hierarchical organization through an interface, thus
leaving relation discovery to the user.
Caraballo (Caraballo, 1999) presents an approach
to build a hierarchy of concepts extracted from a cor-
pus of articles from the Wall Street Journal, with the
parser described in (Caraballo and Charniak, 1998).
That model uses as context the paragraph in which
the terms appear, while for the generation of the hi-
erarchy it looks in the text for the so-called Hearst
Patterns (Hearst, 1992). An example of such a pat-
tern is "t₁s, such as t₂ . . . ", whose occurrence sug-
gests that term t₂ is a hyponym¹ of term t₁. A
different approach is Learning by Googling:
Hearst patterns can not only be found within docu-
ment corpora, but they can also be searched on the
Web. PANKOW (Cimiano P., 2004), for instance, is a
system that looks for these patterns on Google and
decides, according to the number of results returned
by the engine, whether a subsumption relation can be
confirmed or denied.
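The pattern-matching step itself can be sketched with a deliberately naive regular expression over raw tokens (our own toy illustration; the cited systems match parsed noun phrases rather than bare words):

```python
import re

# Toy extractor for one Hearst pattern: "Xs, such as Y (and Z)".
# Plain word tokens stand in for the noun phrases a real
# system would obtain from a parser.
PATTERN = re.compile(r"(\w+), such as ((?:\w+)(?:(?:, | and )\w+)*)")

def hearst_pairs(text):
    """Return (hyponym, hypernym) pairs found in `text`."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        hyponyms = re.split(r", | and ", m.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs

pairs = hearst_pairs("He works with languages, such as Python and Haskell.")
```

In a Learning-by-Googling setting such as PANKOW, the same pattern is instantiated with candidate term pairs and submitted as a query, and the hit count returned by the engine plays the role that corpus frequency plays here.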
Another model is presented by Fionn Murtagh
in (Murtagh, 2005) and (Murtagh, 2007). This is a
Distributional Similarity approach that relies on Cor-
respondence Analysis (a multivariate statistical tech-
nique developed by J.-P. Benzécri in the 1960s (Benzécri,
1976)) to calculate the semantic similarity between
concepts. The generation of the hierarchy starts from
the assumption that terms appearing in more docu-
ments are more general than others, and the solution
of Murtagh places them in a higher position in the hi-
erarchy.
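To give an idea of the technique, the following is a generic textbook formulation of Correspondence Analysis on a toy term-by-document matrix (not Murtagh's actual implementation): the count matrix is turned into standardized residuals, an SVD is taken, and each term is embedded so that Euclidean distances between terms approximate chi-square distances between their document profiles.

```python
import numpy as np

def ca_row_coords(N):
    """Correspondence Analysis of a term-by-document count matrix.
    Returns principal row (term) coordinates, whose Euclidean
    distances reproduce chi-square distances between term profiles."""
    P = N / N.sum()
    r = P.sum(axis=1)            # row (term) masses
    c = P.sum(axis=0)            # column (document) masses
    # Standardized residuals from the independence model.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the rows (terms).
    return (U * sv) / np.sqrt(r)[:, None]

# Toy counts: rows = terms, columns = documents.
N = np.array([[4.0, 4.0, 0.0, 0.0],   # "cat"
              [3.0, 5.0, 0.0, 0.0],   # "dog"
              [0.0, 0.0, 5.0, 3.0],   # "bond"
              [0.0, 1.0, 4.0, 4.0]])  # "stock"
coords = ca_row_coords(N)
```

Terms with similar document profiles ("cat" and "dog" above) end up close together in the resulting space; document frequency, which Murtagh uses to decide a term's level in the hierarchy, is a separate signal on top of these coordinates.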
A strongly different approach from Distributional
Similarity is Formal Concept Analysis, as described
by Philipp Cimiano (Cimiano, 2006). It is based on
different assumptions and largely relies on Natural
Language Processing (NLP) algorithms. The idea in
FCA is to identify the actions that a concept can
perform or undergo. Words are organized into groups
according to the actions they share; the groups are then
ordered in a hierarchy, again according to these actions
(e.g., the group of entities that can run and eat and the
group of entities that can fly and eat are placed together
under the more general group of entities that can eat). Finally,
the user is asked to label every node of the formed
hierarchy (or other automatic methods can be used to
perform this operation), and the final hierarchy is ob-
tained. The paper also describes an algorithm which
generates hierarchies from text by using, as a sort
of prompter, a pre-constructed ontology. What the
model obtains is not an extension of the pre-existent
ontology (as for bootstrapping methods), but a new
¹ When an is-a relationship in an ontology starts from
term t₁ and arrives at term t₂, t₁ is defined as a hyponym
of t₂, while t₂ is defined as a hypernym of t₁.