TAXONOMY LEARNING FOR THE ROMANIAN LANGUAGE
USING SOTA AND WORDNET
Viorica R. Chifu, Ioan Salomie
Department of Computer Science, Technical University of Cluj-Napoca, Romania
Emil Şt. Chifu, Corina Grumazescu
Department of Computer Science, Technical University of Cluj-Napoca, Romania
Keywords: Ontologies, taxonomy learning, unsupervised neural network.
Abstract: Ontologies are widely used today in various domains such as information retrieval, semantic Web, NLP
tasks or for describing specific domains like certain branches of medicine. While there are many tools that
can be used for learning domain ontologies for English, when learning domain specific ontologies for
Romanian, we face a lack of available tools and resources. Moreover, due to the complexity of the
Romanian grammar, processing of Romanian text corpora is also difficult. This paper focuses on building a
domain specific ontology for the Romanian language using machine learning techniques. The taxonomy
learning process is based on an unsupervised neural network. The resulting modules are intended to be used
for semantic annotations of traceability services in meat industry.
1 INTRODUCTION
Today, digital technologies generate a huge quantity
of information. Most of this information can be
found in text format on the Internet as Web pages
and other resources. This large amount of
information would be useless if it could not be
searched, extracted and processed. The problem is
that usually the extracted information is not always
relevant for users’ purposes. In order to ensure an
information extraction that is useful for the user, a
search mechanism that provides maximum of
relevant information and minimum of irrelevant one
is necessary. In the traditional way of using popular
search engines such as Google, the information is
looked up by keywords specified by the user. Thus,
the retrieved documents are only those that contain
the specified keywords, although many documents
contain the desired semantic information where
synonyms are used instead of the searched
keywords. A solution to this problem would be the
embedding/association of semantic information
into/with the documents, the so called Semantic
Web.
This can be done by documents annotation using
a set of specific markup tags and using ontologies to
specify the meaning of the annotations. Such an
approach implies the creation of a domain specific
ontology. Ontology building is a time consuming
and complex task which requires a high degree of
human intervention. This is the reason why
nowadays there is a considerable research effort in
the domain of automatic ontology building.
This paper presents the automatic building of a
domain specific taxonomy out of textual
descriptions from Web sites (Maestro, 2007)
(CrisTim, 2007) of Romanian meat industry
companies. Our taxonomy learning tool is based on
SOTA (Self-Organizing Tree Algorithm) (Herrero,
2001) and WordNet. The learned taxonomy is part
of an ontology used for semantic annotation of Web
services. The concepts in the taxonomy are semantic
descriptions of the inputs and outputs of the
operations provided by a Web service.
The paper is organized as follows. Section 2
reviews several ontology learning frameworks,
while section 3 presents the taxonomy learning tool.
Section 4 gives a qualitative evaluation of the
experimental results. Conclusions and future
directions are presented in sections 5.
169
R. Chifu V., Salomie I., ¸St. Chifu E. and Grumazescu C. (2008).
TAXONOMY LEARNING FOR THE ROMANIAN LANGUAGE USING SOTA AND WORDNET.
In Proceedings of the Fourth International Conference on Web Information Systems and Technologies, pages 169-174
DOI: 10.5220/0001521601690174
Copyright
c
SciTePress
2 RELATED WORK
There is a multitude of reported ontology learning
frameworks investigated by survey works of
(Buitelaar, 2005) (Gómez-Pérez, 2003). We only
enumerate two such frameworks as being the most
related to ours.
In (Alfonseca, 2002), the terms are represented
with distributional (contextual) signatures, similar
with our vectors of occurrences in different
documents (contexts). The ontology learning is a
top-down process. The cited work uses decision tree
learning.
Witschel combines decision trees and text
mining techniques to extend a taxonomy (Witschel,
2005). The word similarities are calculated by
comparing words only on sentence-based co-
occurrences. A small sub-tree of the GermaNet
hierarchy (the German equivalent of WordNet) is
used as training data.
Another category of approaches is based on
lexico-syntactic patterns, known as Hearst patterns
(Hearst, 1992), which contain phrases suggesting
taxonomic relations: such as, (and | or) other,
including, especially, is a. In (Cimiano and Staab,
2005) (Cimiano, 2005) a combination of clustering
and Hearst patterns is used.
Our ontology learning is based on SOTA and
WordNet. In the case of SOTA, the concept
clustering is done in a top-down manner, the upper
levels being generated before the lower levels
(which are more detailed). The growth of the tree
can be stopped at any desired level of the hierarchy,
thus obtaining a more general or a more specific
ontology. We use SOTA and WORNET to learn a
domain specific ontology for Romanian language.
3 TAXONOMY BUILDING
Our domain taxonomy has been automatically built
from a domain text corpus consisting of html pages
with information about meat products. The pages
were colleted from Web sites of Romanian meat
industry companies (Maestro, 2007) (CrisTim,
2007). The taxonomy learning process has two
steps: term extraction, and taxonomy building and
pruning. In the term extraction step, the relevant
terms (words or phrases) for the taxonomy building
are extracted from the domain text corpus. These
extracted terms become the candidates for the
concept names in the final learnt taxonomy.
In the
taxonomy building and pruning step, the identified
terms become concepts, and taxonomic (isA)
relations are established between them, by actually
building a tree having the concepts in its nodes. The
pruning phase avoids the potentially uninteresting
concepts for the taxonomy. The term extraction
process and the taxonomy building and pruning are
presented in detail in the following two subsections.
3.1 Term Extraction
The candidates for concept names are identified in a
three phase text mining process over the domain
corpus. In the first phase a linguistic analysis is
performed on the corpus, in the second phase a set of
linguistic patterns are applied in order to identify
domain specific terms, while in the third phase a
morphological analysis is performed.
3.1.1 Linguistic Analysis
In the linguistic analysis phase, the domain text
corpus is first annotated with information about the
part of speech (POS) of every word with the help of
the Brill POS tagger (Brill, 1999). Brill tagger can
only be trained by a supervised learning process
starting from an already POS tagged corpus. In order
to train Brill tagger for Romanian, we used ROCO,
an annotated Romanian text corpus which consists
of articles from Romanian newspapers (a collection
of 40 million words) collected from the Web. The
ROCO corpus was tokenized and POS tagged with
the RACAI tools (Tufiş, 1999), having an annotation
accuracy of 98%.
Our original (untagged) corpus consists of 130
documents collected from Web sites of Romanian
meat industry companies (Maestro, 2007) (CrisTim,
2007). The evaluation of the trained tagger was
performed on our corpus and the accuracy,
calculated as the ratio of correct tags out of the total
number of tags, was 91%.
3.1.2 Identifying Domain Specific Terms
The phase of identifying domain specific terms is
based on recognizing linguistic patterns (noun
phrases) in the domain text corpus. To extract
domain specific terms from the corpus, we have
implemented a noun phrase (NP) chunker which
identifies noun phrases in the linguistically
annotated text corpus. Our NP chunker is written by
using lex and yacc.
The written yacc syntax rules of
the grammar essentially consist of a head noun
together with its pre/post-modifiers (attributes). The
pre-modifiers of a head noun can be indefinite
determiners and adjectives. The post-modifiers of
the head noun can be possessive pronouns,
WEBIST 2008 - International Conference on Web Information Systems and Technologies
170
adjectival phrases and prepositional phrases.
In the
Romanian language, like in the other languages, a
noun phrase can be nested within another noun
phrase, with no depth limit. This nesting process is
represented in the grammar by recursive rules. Our
noun phrase chunker works well on the sublanguage
of meat processing and product descriptions. For
instance, consider the sentence: “Oferta de produse
cuprinde aproximativ 65 de sortimente, punctul forte
fiind reprezentat de specialitatile si produsele crud
uscate.” (The product offer includes about 65
assortments, the strong point being represented by
the specialties and the dry cruel products.) The
chunker identifies “Oferta de produse”,
sortimente”, “punctul forte”, “specialitati”, and
produse crud uscate” as noun phrases .
3.1.3 Morphological Analysis
Since the concepts of our taxonomy are designated
by noun phrases, we decided to do morphological
analysis only for nouns, adjectives and pronouns.
The morphological analysis is done in three steps. In
fact, it is not a proper morphological analysis, but
rather a lemmatizing process. For each token (word)
we extract its lemma – the base form of the word,
with no suffixes like definite articles or plural
endings. This would be a simple task in case of the
English language, since the plural ending is usually
“-s” (with some exceptions in case of irregular
nouns). However, the lemma extraction for the
Romanian language is quite complex. Unlike
English, the determined article is a suffix, so it must
be removed. What actually complicates things is the
fact that Romanian is the only neo-Latin language
that has preserved the three genders (masculine,
feminine and neuter). When considering removing
the plural endings, the problem lies in the fact that
neuter nouns have similar plural endings with
feminine nouns, while considering removing an
article for singular nouns the neuter nouns will have
similar suffixes with the
masculine nouns. Also, the
case of a noun having different suffixes in
nominative and accusative case than in genitive or
dative case should be considered.
A lemmatizing process would be much easier if
more information concerning the nouns would be
available, such as gender or case. The only
information currently available in the pre-processed
texts is the number: singular or plural. In order to
extract the lemma, we have written a lex lemmatizer
working in a three step approach which is
implemented as a set of regular patterns. The first
step of the lemmatizing process was to remove the
definite article from both singular nouns and plural
nouns. In the second step, the plural endings from
the plural nouns were removed. The third step looks
for adjectives and removes their plural endings and
then looks for the nouns determined by each
adjective trying to keep a gender and case agreement
between them.
We have used the words' lemmas and we have
enforced the preservation between adjective and
noun in order to avoid redundant information. The
redundant information is that when two flexional
forms (for example a plural form and a form with
definite article) of the same noun phrase are
considered as occurrences of two different tokens,
not as the same token. Moreover, because we use
WordNet lookup for the common hypernym of two
taxonomy siblings, it will search for the word's
lemma which is common for the two siblings.
WordNet uses a morphological component, in order
to remove the plural endings of the words searched,
but this works for English words. As we populate
the WordNet database with Romanian lemmas of the
words (nouns and nounphrases), it is obvious why
lemma extraction is needed.
3.2 Taxonomy Building and Prunning
For learning the domain ontology, we use the SOTA
algorithm, an unsupervised neural network with a
binary tree topology which is available as SOTArray
(Herrero, 2001). The SOTArray classifies the initial
data set only in the leaves of the binary tree that it
develops, the inner nodes being empty. Because of
this, we decided to label the inner nodes starting
from the leaves to the root (bottom-up), and to do
that we will search the WordNet database for the
most specific common hypernym of every two
sibling nodes. We consider an isA relationship
between a node and its parent. Since WordNet
contains only English words, we have modified the
WordNet database, by populating it with Romanian
nouns and noun phrases. A more detailed description
of the learning process is presented in the following
sections.
3.2.1 The Learning Process
The taxonomy learning is based on the Self-
Organizing Tree Algorithm (SOTA) (Herrero,
2001). A learned SOTA hierarchy is playing the role
of a learned taxonomy.
In our setting, the noun phrases identified in the
corpus are considered as terms, and these terms are
classified in a SOTA tree during the process of
taxonomy building. To make possible the SOTA
TAXONOMY LEARNING FOR THE ROMANIAN LANGUAGE USING SOTA AND WORDNET
171
classification of the terms, a term document matrix
representation for each term has to be chosen. The
term document matrix contains a term vector on
each row. On each row, the first column entry is the
actual term (noun or noun phrase), then the next
entries represent the number of occurrences of the
word in each document. So the size of the
association dictionary will be n x m, where n is the
total number of terms taken into consideration
(nouns and noun phrases) and m is the total number
of documents + 1.
In order to obtain the term document matrix, a
processing of the corpus is needed. We have written
some C and yacc programs for this task. The term
document matrix is given as input to SOTArray and
the ontology is obtained.
The SOTA algorithm combines aspects of
hierarchical clustering with SOM (Dittenbach,
2002). The clustering algorithm in SOTA is a top-
down process: the tree grows starting from its root,
and then developing into more detailed
classifications in the lower hierarchical levels. This
growing stops when a predefined level of
classification detail is reached. The level of detail is
set according to the distribution of probability
obtained by randomization of the data set to be
classified. The SOTA algorithm can be applied to
any data set in which the data items can be encoded
as numerical vectors in a vector space. Moreover, a
measure of similarity between data items in the
vector space has to be defined. The output of SOTA
is a binary tree that follows the principles of the
growing cell structures algorithm (Fritzke, 1994). In
this algorithm, a binary tree is trained by adapting its
nodes to the characteristics latent in the input data
set. In both SOTA and the growing cell structures,
the tree-like output space can freely grow until
adapting as much as possible to the variability of the
input data space.
When classifying terms in a SOTA tree, the
development of new nodes can be stopped at a
taxonomic level corresponding to the desired level
of classification detail. Alternatively, new nodes can
grow until reaching a complete classification of the
terms extracted from the corpus, i.e. until having a
single term in every leaf of the tree.
Taxonomy pruning is achieved by avoiding
terms occurring in too few documents of the corpus,
specifically in less than 1-2% of the total number of
documents in the corpus. Such terms cannot be
considered as relevant to become concepts of the
domain.
3.2.2 Integrating WordNet with SOTArray
The ontology structure obtained with the SOTA
algorithm has the form of a binary tree of noun
phrases. In every leaf we have one noun phrase. This
means that SOTArray classifies the terms so that
only the leaves have terms from the corpus. This is
why we need to label all the interior nodes by
integrating WordNet into SOTArray. The labelling
process is achieved in a bottom-up recursive tree
traversal. The goal is to find the most specific
common WordNet hypernym for every pair of
sibling nodes, starting from the leaves and ascending
towards the root in the tree hierarchy. A list of all
the direct and indirect hypernyms for all the
meanings of the term is obtained for each sibling
node by querying WordNet. The two lists associated
to the two siblings are compared and intersected
such that all the common hypernyms are found and
only the most specific of them is kept. This most
specific hypernym, which is the lowest in the
WordNet taxonomy, will be associated to the parent
node of the two sibblings.
In order to be able to label the tree inner nodes
with most specific common hypernyms of the
descendants, such hypernyms must be present in the
WordNet database. As the WordNet database is for
English, we had to populate it with Romanian nouns
and noun phrases. This was done by using WNgrind.
The WNgrind tool generates WordNet database files
from WordNet source files, also known as
“lexicographer files”. The lexicographer files
contain synsets (sets of synonyms). The format of a
lexicographer file for Romanian language is
presented below.
{parizer_pentru_copil, parizer,@ }
{parizer_cu_ciuperci, parizer,@ }
{parizer_cu_ardei, parizer,@ }
..................
Figure 1: A part of a lexicographer file for the Romanian
language.
There is only one synset per line. Inside a synset, the
entities are separated by spaces or tabs. There are
also two types of supported pointers: lexical
pointers, which illustrate relations between word
forms, and semantic pointers, which illustrate
relations between word semantics. We are more
interested in semantic pointers (hypernym) in order
to focus on the isA relationships in a taxonomy.
Figure 3 presents the populated WordNet with
Romanian concepts.
WEBIST 2008 - International Conference on Web Information Systems and Technologies
172
Figure 2: The populated WordNet with Romanian
concepts.
3.2.3 OWL Translation
The obtained taxonomy is in the SOTA Newick
format and we decided to translate it into the OWL
standardized format (OWL, 2006). The reason for
our choice is that OWL ontologies are used in a
large variety of applications, since OWL is
recommended by W3C. We have written a Newick -
OWL translator in C++ to translate the taxonomy
into the OWL format.
Figure 3: A part of learned taxonomy.
4 EXPERIMENTAL RESULTS
In Figure 3 a part of the learned ontology view with
the Protégé (Noy, 2003) ontology editor is illustrated.
The English translations of some taxonomy concepts
are given in italics.
Table 1 shows the lexical precision and recall of
the learned taxonomy. Recall (1) is defined as the
ratio of (manually classified as) relevant terms that
are correctly extracted from the analyzed corpus
over all the terms extracted from the corpus, and
Precision (2) is the ratio of correctly extracted terms
over all the extracted terms.
Recall
=
corpus
extracted
all
correct
(1)
Precision
=
extracted
extracted
all
correct
(2)
In our method the precision is much better than the
recall. The modest value of the recall is due to the
low grammatical quality of the corpus (spelling and
punctuation mistakes, the lack of diacritical marks).
Table 1: Evaluation results for learned taxonomy.
Recall Precision Learned
Taxonomy
62% 87%
5 CONCLUSIONS
In this paper we have presented an unsupervised
taxonomy learning approach for the automatic
building of a domain specific taxonomy from textual
descriptions presented in Web sites of Romanian
meat industry companies. The taxonomy has been
built in the framework of the Maestro project
(Maestro, 2007).
The proposed learning method is based on the
SOTA algorithm and WordNet. WordNet is obtained
by populated WordNet with concepts for the
Romanian language. The advantages of SOTA are
its efficiency and the short convergence time.
Moreover, SOTA was used, since one of the
interests of this paper is the correlation between
terms which is fulfilled by clustering. This method is
rather general and can be applied to any set of data
that can be encoded as data item vectors and allows
the measurement of the similarity between data
items. We are especially interested in the correlation
between terms (the similarity of terms in a given
context) rather than in term synonymy. The obtained
ontology is a set of concepts and relations between
TAXONOMY LEARNING FOR THE ROMANIAN LANGUAGE USING SOTA AND WORDNET
173
them and it can offer an abstract view of the
application domain.
As future directions, we intend to process the
Romanian texts in order to exploit more complex
lexical information about the parts of speech, like
gender, case, and so on. This would be very helpful
in lemma extraction and other tasks concerning
automated text analysis. Another development
possibility would be a more complex and complete
WordNet database for Romanian. This would help to
illustrate more complex relations between concepts
(like part-of) and will lead to a more complex
ontology. Also, we plan to experiment in the future
with other corpora from different domains.
ACKNOWLEDGEMENTS
This work was supported by the PI-1020 (333/2007)
project within the framework of the “Research of
Excellence” program initiated by the Romanian
Ministry of Education and Research.
REFERENCES
Alfonseca E., and Manandhar, S., 2002. Extending a
lexical ontology by a combination of distributional
semantics signatures. In A. Gómez-Pérez, V.R.
Benjamins, eds., 13th International Conference on
Knowledge Engineering and Knowledge Management,
LNAI, Springer, pp. 1-7.
Brill, E.,1999. A simple rule-based part-of-speech tagger.
ANLP’92, 3rd Conference on Applied Natural
Language Processing, pp. 152-155.
Buitelaar, P., Cimiano, P., Grobelnik, M., and Sintek, M.,
2005. Ontology learning from text. Tutorial at the
ECML/PKDD workshop on Knowledge Discovery and
Ontologies.
Cimiano, P., Staab, S.,2005. Learning concept hierarchies
from text with a guided hierarchical clustering
algorithm. ICML workshop on Learning and
Extending Lexical Ontologies with Machine Learning
Methods.
Cimiano, P., Pivk, A., Schmidt-Thieme, L., and Staab, S.,
2005. Learning taxonomic relations from
heterogeneous sources of evidence. In P. Buitelaar, P.
Cimiano, B. Magnini, eds., Ontology Learning from
Text: Methods Applications and Evaluation, IOS
Press, pp. 59-73.
M. Dittenbach, D. Merkl, and A. Rauber, Organizing and
exploring high-dimensional data with the Growing
Hierarchical Self-Organizing Map”, In L. Wang, et al.,
eds., 1st International Conference on Fuzzy Systems
and Knowledge Discovery, vol. 2, 2002, pp. 626-630.
Gómez-Pérez, A., and Manzano-Macho, D., 2003. A
survey of ontology learning methods and techniques.
OntoWeb Deliverable 1.5.
Hearst, M. A.,1992. Automatic acquisition of hyponyms
from large text corpora. 14th International Conference
on Computational Linguistics.
Herrero, J., Valencia, A., and Dopazo, J., 2001. A
hierarchical unsupervised growing neural network for
clustering gene expression patterns. Bioinformatics 17,
pp.:126–136.
Khan, L., and Luo, F., 2002. Ontology construction for
information selection. 14th IEEE International
Conference on Tools with Artificial Intelligence, pp.
122-127.
Miller, G., A., Beckwith, R., Fellbaum, C., Gross, D., and
Miller, K., 1993. Introduction to WordNet: An on-line
lexical database. Technical report, Princeton. CSL
Report 43, revised March.
Noy, N.F., Crubézy, M.,et al., 2003. Protégé-2000: An
Open-Source Ontology-Development and Knowledge-
Acquisition Environment. AMIA Annual Symposium
Proceedings.
Studer, R., Benjamins, V., and Fensel, D., 1998.
Knowledge Engineering: Principles and Methods,
Data and Knowledge Engineering, 25(1-2), pp.:161 –
197.
Tufiş, D., 1999. Tiered Tagging and Combined Classifiers,
In F. Jelinek and E. Nöth, eds., Text, Speech and
Dialogue, Lecture Notes in Artificial Intelligence
1692, Springer.
Witschel, H. F.,2005. Using Decision Trees and Text
Mining Techniques for Extending Taxonomies. In
Proceedings of Learning and Extending Lexical
Ontologies by Using Machine Learning Methods
,
Workshop at ICML-05.
Maestro: http://www.maestro.ro/, 2007
CrisTim: http://www.cirstim.ro/, 2007.
OWL: http://www.w3.org/TR/owl-guide/, 2006
WEBIST 2008 - International Conference on Web Information Systems and Technologies
174