adjectival phrases and prepositional phrases.
In Romanian, as in other languages, a
noun phrase can be nested within another noun
phrase, with no depth limit. This nesting is
represented in the grammar by recursive rules. Our
noun phrase chunker works well on the sublanguage
of meat processing and product descriptions. For
instance, consider the sentence: “Oferta de produse
cuprinde aproximativ 65 de sortimente, punctul forte
fiind reprezentat de specialitatile si produsele crud
uscate.” (The product offer includes about 65
assortments, the strong point being the
specialties and the raw-dried (dry-cured) products.) The
chunker identifies “Oferta de produse”,
“sortimente”, “punctul forte”, “specialitati”, and
“produse crud uscate” as noun phrases.
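The recursive nesting described above can be illustrated with a minimal sketch. It is not the chunker's actual grammar: the coarse tag set (`N`, `ADJ`, `PREP`, `V`) and the two rules NP -> N ADJ* PP* and PP -> PREP NP are assumptions made for the example, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, str]  # (word, coarse POS tag); the tag set is an assumption


def parse_np(tokens: List[Tagged], i: int) -> Optional[int]:
    """Return the index just past the noun phrase starting at i, or None."""
    if i >= len(tokens) or tokens[i][1] != "N":
        return None
    j = i + 1
    # Romanian adjectives usually follow the noun they modify.
    while j < len(tokens) and tokens[j][1] == "ADJ":
        j += 1
    # PP -> PREP NP: the recursive call lets noun phrases nest to any depth.
    while j < len(tokens) and tokens[j][1] == "PREP":
        end = parse_np(tokens, j + 1)
        if end is None:
            break
        j = end
    return j


def chunk(tokens: List[Tagged]) -> List[str]:
    """Collect maximal noun phrases from left to right."""
    chunks = []
    i = 0
    while i < len(tokens):
        end = parse_np(tokens, i)
        if end is None:
            i += 1
        else:
            chunks.append(" ".join(word for word, _ in tokens[i:end]))
            i = end
    return chunks


sentence = [("oferta", "N"), ("de", "PREP"), ("produse", "N"),
            ("cuprinde", "V"), ("sortimente", "N")]
noun_phrases = chunk(sentence)
```

Because the prepositional-phrase rule calls back into the noun-phrase rule, "oferta de produse" is returned as a single chunk containing the nested phrase "produse".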
3.1.3 Morphological Analysis
Since the concepts of our taxonomy are designated
by noun phrases, we decided to do morphological
analysis only for nouns, adjectives and pronouns.
The morphological analysis is done in three steps. In
fact, it is not a proper morphological analysis, but
rather a lemmatizing process. For each token (word)
we extract its lemma – the base form of the word,
with no suffixes like definite articles or plural
endings. This would be a simple task for
English, since the plural ending is usually
“-s” (with some exceptions for irregular
nouns). However, lemma extraction for
Romanian is quite complex. Unlike in
English, the definite article is a suffix, so it must
be removed. What actually complicates things is the
fact that Romanian is the only neo-Latin language
that has preserved the three genders (masculine,
feminine and neuter). When removing plural
endings, the problem is that neuter nouns have
plural endings similar to those of feminine
nouns; when removing the article from singular
nouns, neuter nouns have suffixes similar to
those of masculine nouns. In addition, a noun
can take different suffixes in the nominative
and accusative cases than in the genitive or
dative cases, which must also be taken into account.
Lemmatizing would be much easier if
more information about the nouns, such as gender
or case, were available. The only
information currently available in the pre-processed
texts is the number: singular or plural. To
extract the lemma, we have written a lex lemmatizer
that works in three steps and is
implemented as a set of regular-expression patterns.
The first step removes the
definite article from both singular and plural
nouns. The second step removes the plural endings
from plural nouns. The third step looks
for adjectives, removes their plural endings, and
then looks for the noun modified by each
adjective, trying to keep gender and case agreement
between them.
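The three steps can be sketched as follows. This is only an illustration in Python, not the lex lemmatizer itself: the suffix rules are a deliberately tiny, hypothetical subset, diacritics are omitted as in the example sentence above, and the gender and case agreement check of the third step is left out.

```python
import re

# Hypothetical rule sets for illustration; a real lemmatizer needs many more
# patterns and has to deal with the gender ambiguities discussed above.
ARTICLE_RULES = {
    "sg": [(r"ul$", "")],                 # punctul -> punct
    "pl": [(r"le$", ""), (r"ii$", "i")],  # produsele -> produse
}
NOUN_PLURAL_RULES = [(r"ati$", "ate"), (r"uri$", ""), (r"e$", "")]
ADJ_PLURAL_RULES = [(r"e$", ""), (r"i$", "")]


def apply_first(rules, word):
    """Apply the first rule whose pattern matches; otherwise return word."""
    for pattern, replacement in rules:
        new_word, count = re.subn(pattern, replacement, word)
        if count:
            return new_word
    return word


def lemmatize(word, pos, number):
    """pos is 'N' or 'ADJ'; number is 'sg' or 'pl' (the only feature given)."""
    word = word.lower()
    if pos == "N":
        # Step 1: strip the suffixed definite article.
        word = apply_first(ARTICLE_RULES[number], word)
        # Step 2: strip the plural ending of plural nouns.
        if number == "pl":
            word = apply_first(NOUN_PLURAL_RULES, word)
    elif pos == "ADJ" and number == "pl":
        # Step 3 (partial): strip the plural ending of adjectives.
        word = apply_first(ADJ_PLURAL_RULES, word)
    return word


lemmas = [lemmatize("produsele", "N", "pl"),
          lemmatize("specialitatile", "N", "pl"),
          lemmatize("uscate", "ADJ", "pl")]
```

With these toy rules, "produse" and "produsele" both reduce to "produs". The ordering of the rules matters: the more specific pattern `ati$` has to be tried before the generic `e$` and `i$` patterns.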
We use the words' lemmas and
enforce agreement between adjective and
noun in order to avoid redundant information.
Redundancy arises when two inflected
forms (for example a plural form and a form with
a definite article) of the same noun phrase are
counted as occurrences of two different tokens
rather than of the same token. Moreover, because we
look up the common hypernym of two
taxonomy siblings in WordNet, the lookup is
performed on the words' lemmas.
WordNet uses a morphological component to
remove the plural endings of the words searched,
but this works only for English words. Since we populate
the WordNet database with the Romanian lemmas of the
words (nouns and noun phrases), lemma
extraction is needed.
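The effect of redundancy on term counts can be seen in a small sketch. The lemma table below is a hypothetical hand-written lookup used only for the illustration, and `count_terms` is not part of the described system.

```python
from collections import Counter

# Hypothetical lemma table (diacritics omitted).
LEMMAS = {
    "produsele": "produs",
    "produse": "produs",
    "produsul": "produs",
    "sortimente": "sortiment",
}


def count_terms(tokens, lemmas=None):
    """Count tokens by surface form, or by lemma when a table is given."""
    if lemmas is None:
        return Counter(tokens)
    return Counter(lemmas.get(token, token) for token in tokens)


tokens = ["produse", "produsele", "produsul", "sortimente"]
surface_counts = count_terms(tokens)        # four distinct tokens
lemma_counts = count_terms(tokens, LEMMAS)  # two distinct tokens
```

Counting by surface form treats the three inflected forms of "produs" as unrelated terms, while counting by lemma merges them into one term with three occurrences.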
3.2 Taxonomy Building and Pruning
For learning the domain ontology, we use the SOTA
algorithm, an unsupervised neural network with a
binary tree topology which is available as SOTArray
(Herrero, 2001). SOTArray places the initial
data set only in the leaves of the binary tree that it
grows; the inner nodes are left empty. We
therefore label the inner nodes bottom-up,
from the leaves to the root, by
searching the WordNet database for the
most specific common hypernym of each pair of
sibling nodes. We consider an isA relationship
between a node and its parent. Since WordNet
contains only English words, we have modified the
WordNet database by populating it with Romanian
nouns and noun phrases. A more detailed description
of the learning process is presented in the following
sections.
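The bottom-up labelling of inner nodes can be sketched as below. This is a minimal illustration under stated assumptions: the `HYPERNYM` table is a hypothetical stand-in for the WordNet database (English placeholder terms, one hypernym per term), and `Node`, `common_hypernym` and `label_bottom_up` are names invented for the example rather than part of SOTArray or WordNet.

```python
from typing import Optional

# Hypothetical isA links standing in for the WordNet hypernym relation.
HYPERNYM = {
    "salami": "sausage",
    "frankfurter": "sausage",
    "sausage": "meat product",
    "ham": "cured meat",
    "bacon": "cured meat",
    "cured meat": "meat product",
    "meat product": "food",
}


def ancestors(term):
    """The term followed by its hypernyms, most specific first."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def common_hypernym(a, b) -> Optional[str]:
    """Most specific term on both hypernym chains (a term counts as its own)."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    return None


class Node:
    def __init__(self, label=None, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right


def label_bottom_up(node):
    """Label empty inner nodes from the leaves towards the root."""
    if node.left is None and node.right is None:
        return node.label  # leaves already hold the classified terms
    left_label = label_bottom_up(node.left)
    right_label = label_bottom_up(node.right)
    if left_label is not None and right_label is not None:
        node.label = common_hypernym(left_label, right_label)
    return node.label


tree = Node(left=Node(left=Node("salami"), right=Node("frankfurter")),
            right=Node(left=Node("ham"), right=Node("bacon")))
root_label = label_bottom_up(tree)
```

Each inner node receives its label only after both children are labelled, so the label of a parent is always a hypernym of, or identical to, the labels of its children, which matches the isA reading of the parent-child relation.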
3.2.1 The Learning Process
Taxonomy learning is based on the Self-Organizing
Tree Algorithm (SOTA) (Herrero,
2001). A learned SOTA hierarchy plays the role
of a learned taxonomy.
In our setting, the noun phrases identified in the
corpus are considered as terms, and these terms are
classified in a SOTA tree during the process of
taxonomy building. To make possible the SOTA