2 RELATED PRELIMINARY WORK
Although a number of studies have explored ontology,
most of them used a specific ontology to assist their
work rather than building one. Other studies (Trent,
2002, Rowena, 2005, Dave, 2001, Sin-Jae, 2001,
Yan-Hwang, 2005, Alexander, 2000, Riichiro, 2003,
Thanh Tho, 2006, Prieto-Diaz, 2003, Yuri A., 2003
and Ju-in Youn, 2004) did address building ontology.
Their approaches can be classified into two
categories (strictly speaking, some of them merely
propose a schema of object entities). The first
classifies documents into their domains based on key
terms, which are formed from several words in the
documents (Florian, 2002, Dave, 2001, Weipeng, 2001,
Yin-Fu, 2007, Thanh Tho, 2006 and Ju-in, 2004). The
second classifies keywords to construct a taxonomy
structure based on the documents they belong to,
thesauri, or a pre-built ontology (Trent, 2002,
Rowena, 2005, Sin-Jae, 2001, Yan-Hwang, 2005,
Alexander, 2000, Prieto-Diaz, 2003, Vaclav, 2005 and
Yuri A., 2003).
Youn et al. (Ju-in Youn, 2004) first constructed an
ontology using a fuzzy function and relations, and
then classified documents based on this ontology. In
fact, the ontology constructed there is just a word
relation tree similar to that proposed in (Yin-Fu
Huang, 2007). In addition, two papers (Florian, 2002
and Yin-Fu, 2007) provide schemas of documents, and
their document classifications share the same
characteristic: each cluster of documents (or each
tree node in the word relation tree) implies the
same term feature. However, their methodologies
differ: one concerns how to select term features for
clustering, and the other concerns how to extend the
current level to the next one.
Since building an ontology is such a tremendous
task, an ontology should be maintained incrementally
rather than rebuilt from scratch. Some learning
techniques for refining a built ontology have been
proposed (P. Buitelaar, 2005, Asunción, 2003 and
Alexander, 2001), and even general relationship
learning (not restricted to Is-A or Part-of
relationships) has been discussed (M. Kavalec, 2004,
David, 2006 and A. Schutz, 2005). In our framework,
new documents can be imported periodically, and the
learning process then uses them to refine word
relationships in the same way.
2.1 Key Terms for Generating Ontology
A Term-Document-Matrix (TDM) records the frequency
with which each key term appears in each document;
it is also called a weighted word histogram
(Weipeng, 2001). Key terms and documents are the two
dimensions of a TDM. If we take documents as the
classification target, key terms can be viewed as
features (Florian, 2002, Dave, 2001, Weipeng, 2001
and Teuvo, 2000), and vice versa. It is usually
necessary to build an ontology to present the
overall context structure of web pages. Tijerino et
al. developed an information-gathering engine,
TANGO, which exploits tables and filled-in forms to
generate domain-specific ontologies (Yuri A., 2003).
In our framework, the TDM is treated as the implicit
feature for evaluating word correlations.
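As an illustration of the TDM described above, the following minimal sketch builds a small matrix with one row per key term and one column per document, and then compares two term rows. The function names, the toy documents, the restriction to single-word key terms, and the use of cosine similarity as the correlation measure are all assumptions made for this example; they are not taken from the framework itself.

```python
from collections import Counter


def build_tdm(documents, key_terms):
    """Return a TDM as a list of rows: one row per key term,
    one column per document, each cell a raw term frequency."""
    # Whitespace tokenisation is a simplifying assumption.
    counts = [Counter(doc.lower().split()) for doc in documents]
    return [[c[term] for c in counts] for term in key_terms]


def cosine(u, v):
    """Cosine similarity of two term rows (an assumed correlation measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus and key terms.
documents = [
    "ontology learning builds ontology from text",
    "text clustering groups documents by term",
    "ontology term taxonomy",
]
key_terms = ["ontology", "text", "term"]

tdm = build_tdm(documents, key_terms)
similarity = cosine(tdm[0], tdm[1])  # "ontology" vs. "text"
```

Reading the matrix row by row treats documents as the features of each key term; reading it column by column treats key terms as the features of each document, which is the "vice versa" case mentioned above.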
FOLDOC (http://foldoc.org/) is an online computing
dictionary in which each keyword and its related
terms are tagged to show their relationships. Apted
and Kay followed these original relationships
between words and converted all the keywords in the
dictionary into a clear relation graph of keywords
(Trent Apted, 2002). Although FOLDOC has stored
about 14,000 computing terms so far, many computing
terms are still missing from it.
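The idea of turning tagged dictionary entries into a relation graph can be sketched as follows. This is only an illustration under stated assumptions: the entry format (a mapping from a keyword to its related keywords), the sample entries, and the choice of an undirected graph are hypothetical, and the sketch is not the procedure used by Apted and Kay.

```python
from collections import defaultdict


def build_relation_graph(entries):
    """Build an undirected keyword graph from dictionary-style entries.

    `entries` maps each keyword to the keywords tagged as related to it
    (a hypothetical format chosen for this example).
    """
    graph = defaultdict(set)
    for term, related in entries.items():
        graph[term].update(related)
        for other in related:
            # Store the reverse edge so the relation is symmetric.
            graph[other].add(term)
    return dict(graph)


# Hypothetical entries, not taken from FOLDOC.
entries = {
    "compiler": ["parser", "linker"],
    "parser": ["lexer"],
}

graph = build_relation_graph(entries)
```

A keyword that appears only as a related term (here "lexer" and "linker") still becomes a node, which shows how gaps in the dictionary limit the graph: a term with no entry and no incoming reference never appears at all.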
2.2 Features of Key Terms
Besides the documents used as the input source,
additional dictionaries are required to build an
ontology (Sin-Jae, 2001 and Alexander, 2000). The
features of key terms retrieved from documents and
dictionaries help to build the ontology, and they
can be grouped into three kinds: document vectors,
sememes, and meanings taken from dictionaries.
Sememes are defined as the smallest basic semantic
units in HowNet (K. W. Gan, 2002). Some papers (Yi,
2002 and Yan-Hwang, 2005) took sememes as features
for further processing. However, many computing
terms are specialized terminology whose meanings can
differ from those of their constituent words. Thus,
using the sememes of computing terms as features
would not be feasible here. Finally, since FOLDOC
does not contain enough computing terms for our
work, the information in it is inadequate to provide
further features. Therefore, we choose The Free
Dictionary instead as the explicit feature provider.