STRUCTURING TAXONOMIES BY USING LINGUISTIC PATTERNS
AND WORDNET ON WEB SEARCH
Ana B. Rios-Alvarado, Ivan Lopez-Arevalo and Victor Sosa-Sosa
Information Technology Laboratory, CINVESTAV, Cd. Victoria, Tamaulipas, Mexico
Keywords:
Text mining, Knowledge representation.
Abstract:
Finding an appropriate structure for representing the information contained in texts is not a trivial task. Ontologies provide a structured organization of knowledge to support the exchange and sharing of information. A crucial element within an ontology is the taxonomy. For building a taxonomy, the identification of hypernymy/hyponymy relations between terms is essential. Previous works have used specific lexical patterns or have focused on identifying new patterns. Recently, the use of the Web as a source of collective knowledge has emerged as a good option for finding appropriate hypernyms. This paper introduces an approach to find hypernymy relations between terms belonging to a specific knowledge domain. The approach combines WordNet synsets and context information for building an extended query set, which is sent to a web search engine in order to retrieve the most representative hypernym for a term.
1 INTRODUCTION
At the beginning of the 21st century, the ease of access to digital information resources has motivated an exponential growth in the available unstructured information. This growth is not only present in web resources, but can also be seen inside organizations, institutions, and companies. In an organization, for example, documents represent a significant source of collective expertise (know-how). In order to store, retrieve, or infer knowledge from this information, it is necessary to represent it using a conceptual structure. This can be achieved by means of taxonomies or ontologies.
An ontology can be built manually by knowledge engineers and domain experts, resulting in long and tedious development stages, which can lead to a knowledge acquisition bottleneck (Maedche and Staab, 2001). As a consequence, ontology learning is nowadays an important research area. Ontology learning is defined as a set of methods used for building an ontology from scratch, or for enriching or adapting an existing one, in a semi-automatic fashion using heterogeneous information resources (Sánchez, 2009). Ontology learning deals with the discovery of entities and with how such entities can be grouped, related, and subdivided according to their similarities and differences. In ontology learning, an unsupervised way to build conceptual structures is to use text (term) clustering techniques.
Syntactic patterns or grammatical classes could, for example, be used to provide candidates for term detection. However, these approaches do not consider that words are ambiguous and share a semantic context. In this sense, Pantel and Lin (Pantel and Lin, 2002) proposed a soft clustering algorithm called Clustering by Committee (CBC), which can assign words to different clusters using sets of representative elements (called committees) that try to discover unambiguous centroids describing the members of a possible class. This method only creates clusters of terms; it does not create a hierarchical structure. Cicurel et al. (Cicurel et al., 2007) evaluated CBC, concluding that it is a good technique for identifying word senses. Its disadvantage is that it requires adjusting some parameters, for example the threshold between the centroid and any element for grouping. However, the use of unsupervised learning techniques makes it possible to calculate these parameters.
According to Gruber (Gruber, 1993), “ontologies are often equated with taxonomic hierarchies of classes”; thus, it can be said that the key component of an ontology is the taxonomy. Taxonomies, as the main component of an ontology, provide an organizational model for a domain (domain ontology), or a model suitable for specific tasks or problem-solving methods (ontologies of tasks and methods) (Burgun and Bodenreider, 2001). Nevertheless, constructing
a taxonomy is a very hard task.
The identification of hypernymy/hyponymy rela-
tions between terms (in this work only nouns are con-
sidered as terms) is mandatory for building a tax-
onomy. A hyponym can be defined as: a word of
more specific meaning than a general or superordi-
nate term applicable to it. By contrast, a hypernym is
a word with a broad meaning constituting a category
under which more specific words fall. For example,
Mercury, Jupiter, and Mars are hyponyms of Planet
whereas Planet is a hypernym of Mercury, Jupiter,
and Mars. Other names for the hyponym relation-
ship are is-a, parent-child, or broader-narrower rela-
tionships (Cederberg and Widdows, 2003). Caraballo
(Caraballo, 1999) claimed that according to WordNet,
“a word A is said to be a hypernym of a word B if na-
tive speakers of English accept the sentence B is a
(kind of) A”.
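As an illustration only (not part of the original method description), this relation can be inspected directly in WordNet. The following minimal sketch assumes NLTK's WordNet interface and that the WordNet data has been downloaded:

```python
from nltk.corpus import wordnet as wn  # assumes nltk.download("wordnet") has been run

# Check whether some noun sense of "jupiter" has "planet" among its hypernym ancestors.
for synset in wn.synsets("jupiter", pos=wn.NOUN):
    ancestors = {s for path in synset.hypernym_paths() for s in path}
    if any(a.name().startswith("planet.") for a in ancestors):
        print(synset.name(), "-> hypernym ancestor 'planet' found")
```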
In recent years, the Web has become a source of collective knowledge, which makes it a good option for finding suitable hypernyms. In addition to using the Web and lexical patterns, some works (Snow et al., 2005; Ortega-Mendoza et al., 2007) identify new lexical patterns that make it possible to obtain more specific hyponyms; but it is necessary to rely on known hyponymy relationships for training a classifier, which is not always possible. In this paper, an approach to find hypernym relations between terms from texts belonging to a knowledge domain is presented. In particular, this approach combines WordNet synsets and contextual information to build an extended query set. With this query set, a web search is executed in order to retrieve the most representative hypernym for a term.
The rest of this document is structured as follows. In Section 2, a brief description of the related work on the automatic discovery of hypernyms is given. In Section 3, the approach and the method to find hypernyms are described. Later, in Section 4, the experiments and preliminary results are presented. Finally, Section 5 gives some conclusions and outlines future work.
2 RELATED WORK
One of the first ideas for automatically discovering hypernyms from text was proposed by Hearst (Hearst,
1992). She proposed a method to identify a set of
lexico-syntactic patterns occurring frequently in the
text. Caraballo (Caraballo, 1999) proposed to auto-
matically build a noun hierarchy from text using data
on conjunctions and appositives appearing in the Wall
Street Journal corpus. Both methods are limited by
the number of patterns used. Pantel et al. (Pantel
et al., 2004) showed how to learn syntactic patterns
for identifying hypernym relations and binding them
with clusters that were built from co-occurrence infor-
mation. Blohm and Cimiano (Blohm and Cimiano,
2007) proposed a procedure to find lexico-syntactic
patterns indicating hypernym relations from the Web.
Building on this work, Ortega-Mendoza et al. (Ortega-Mendoza et al., 2007) and Sang (Sang, 2007) developed methods to extract hyponyms and hypernyms, respectively, using lexical patterns. Snow et
al. (Snow et al., 2005) generated hypernym patterns
and combined them with noun clusters to generate
high-precision suggestions for unknown noun inser-
tion into WordNet. Ritter et al. (Ritter et al., 2009) presented a method based on lexical patterns that finds hypernyms for arbitrary noun phrases. They used a Support Vector Machine classifier to find the correct hypernyms from matches to the Hearst patterns. Most of these studies are limited by the manual selection of pairs of terms holding a hypernym relationship, which serve as the initial seeds for discovering new patterns. In this sense, the automatic acquisition of terms is essential. The approach of Schutz and Buitelaar (Schutz and Buitelaar, 2005) uses linguistic analysis and a predefined ontology for relation extraction, with the purpose of extending a domain ontology. Cimiano and Staab (Cimiano and Staab, 2004) showed that a potential way to avoid the knowledge acquisition bottleneck is to acquire collective knowledge from the Web using a search engine. This idea was followed by Sánchez (Sánchez, 2009), who used the Web to acquire taxonomic and non-taxonomic relationships.
3 THE METHOD
In ontology learning, two of the main components of an ontology are concepts and relationships. These elements should be relevant to the domain of the input corpus. This section introduces a method for extracting relevant hypernyms from the information given by a specific corpus, complemented with knowledge retrieved from the Web.
3.1 The Representation Model
Typically, text is represented using the bag-of-words model. This model assumes that the order of words has no significance. However, current applications consider that a semantic representation focused on Natural Language Processing (NLP) has a major potential for new developments. Thus, word-context matrices and pair-pattern matrices are more suitable
for measuring the semantic similarity of word pairs and patterns (Turney and Pantel, 2010). In the approach presented in this paper, a syntactic parser is used to extract the grammatical context in which each word occurs. Of special interest are the dependency relationships <subject, verb> and <verb, object>. With these relationships, representative pairs of words in a context (topic) are identified. Verbs are considered because they specify the interaction between two participants in an action and express their relationship (Schutz and Buitelaar, 2005). A pair-term matrix is used as the representation model (see Figure 1).
Figure 1: Example of pair verb-noun matrix.
Figure 2: Example of values of pair verb-noun matrix.
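For illustration only, the following minimal sketch (an assumption of this text, not the authors' implementation) counts <verb, noun> pairs from dependency triples that are assumed to have already been extracted by a parser such as Minipar:

```python
from collections import Counter

def pair_term_counts(triples):
    """Build the pair verb-noun co-occurrence counts from
    (subject, verb, object) dependency triples; both the subject and the
    object of a verb contribute a <verb, noun> pair."""
    counts = Counter()
    for subject, verb, obj in triples:
        for noun in (subject, obj):
            if noun:
                counts[(verb, noun)] += 1
    return counts

# Hypothetical triples, e.g. extracted from sentences of the input corpus.
triples = [("museum", "display", "painting"),
           ("museum", "house", "collection"),
           ("visitor", "admire", "painting")]
print(pair_term_counts(triples))
```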
By means of mutual information, it is possible to find two related terms. Pointwise Mutual Information (PMI) is the measure used for the strength of association between two words (w_1, w_2). Using Equation 1, the values of mutual information were calculated:
PMI(w_1, w_2) = \log_2 \frac{p(w_1 \wedge w_2)}{p(w_1)\, p(w_2)}     (1)
For each verb-noun pair (Figure 1), its PMI is calculated; thus, each verb-noun pair is mapped to a numerical value, as Figure 2 shows. The representation model is obtained over the whole corpus.
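As a hedged illustration of Equation 1 (with the assumption, not stated in the paper, that probabilities are estimated as relative frequencies over all observed pairs), the counts from the previous sketch can be turned into PMI values as follows:

```python
import math
from collections import Counter

def pmi_values(counts):
    """counts: Counter mapping (verb, noun) -> co-occurrence frequency.
    Returns a dict mapping (verb, noun) -> PMI(verb, noun) as in Equation 1,
    with p(.) estimated as relative frequencies over all observed pairs."""
    total = sum(counts.values())
    verb_freq, noun_freq = Counter(), Counter()
    for (verb, noun), c in counts.items():
        verb_freq[verb] += c
        noun_freq[noun] += c
    return {(v, n): math.log2((c / total) /
                              ((verb_freq[v] / total) * (noun_freq[n] / total)))
            for (v, n), c in counts.items()}
```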
3.2 Querying the Web
In order to obtain results close to the domain of the input corpus, the construction of an extended query set is proposed, considering the most representative terms in the input corpus and in the WordNet synsets. The obtained results (pages) are processed to extract relevant hypernyms. In general, discovering hypernyms consists of the following phases (see Figure 3).
Figure 3: Method for discovering hypernyms.

Pre-processing: It is performed to identify dependencies between nouns sharing a verb in the same context. These dependencies are obtained using the Minipar parser (http://webdocs.cs.ualberta.ca/~lindek/minipar.htm). A pair-pattern matrix is used as the representation model. In the pair-pattern matrix,
trix, the pairs correspond to the terms appearing in
a triple term structure <subject> verb <object>.
A noun can be a subject or an object within a sen-
tence. The representative nouns are obtained from pairs of the form <subject-verb> and <verb-object>.
Topic extraction: The topics from the corpus are
inferred using an adaptation of the CBC algorithm
proposed by Pantel (Pantel, 2003).
Discovering hypernyms: For each topic, a taxon-
omy is constructed. For each noun in the topic,
a set of queries is generated. The following is considered (a minimal code sketch of steps 2-5 is given after this list):
1. Hearst's patterns have shown good evidence for identifying that an entity A (noun) is a hyponym of B. However, Snow et al. (Snow et al., 2005) also identified other possible patterns as a result of their method for discovering
hypernyms (see Table 1). Both sets of patterns are considered in this work.
Table 1: Lexical patterns.

Hearst's patterns   | Other patterns
A, and other B      | B, called A
A, or other B       | B, particularly A
A is a B            | B, for example A
B, such as A        | B, among which A
B, including A      |
B, especially A     |
2. A general web query like such as <hyponym> is not enough to obtain interesting and precise information. In order to get useful information, the query needs to be more specific (Sang, 2007). This is the reason why related information is added to the query: 1) contextual information and 2) supervised information. The contextual information is given by the terms with the highest frequencies in the corpus (without stopwords and after a lemmatization process). The supervised information is given by the most representative terms in the WordNet synset corresponding to the term. To extract terms from WordNet, the gloss of the term is POS-tagged; the first three words labeled as nouns are considered as supervised information. If a term has more than one synset, the first synset is taken.
3. Query sets are constructed using the lexical patterns and the related information. Each query is sent to a web search engine, using the Web as a source of knowledge.
4. For each query in the hypernym query set, the first n pages are retrieved. The text of each page is cleaned and parsed, removing non-essential information (images, videos, banners, etc.). Each sentence is POS-tagged using the Stanford tagger (http://nlp.stanford.edu/software/tagger.shtml); in this way, the lexical pattern of the query and its candidate hypernym are identified. A term is selected as a candidate hypernym if it is a noun and not a stopword.
5. The list of candidate hypernyms is evaluated using a new query set, in which each possible hypernym is substituted into the lexical pattern. Using this query set and the number of hits obtained from the web search, each candidate hypernym (CH) is evaluated by means of the following score for a candidate hypernym (SCH) (Cimiano and Staab, 2004) (Equation 2):
SCH = \frac{hits(LexicalPattern(term, CH))}{hits(CH)}     (2)
where LexicalPattern(term, CH) represents a query such as <term>,+and+other+<CandidateHypernym>, in which and other corresponds to one of the lexical patterns. The total score for a CH is given by the sum of the scores obtained for each lexical pattern. Thus, the candidate with the highest total score will be the hypernym associated with the term.
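The sketch below illustrates steps 2-5 under explicit assumptions: NLTK (with the WordNet, punkt, and POS-tagger data downloaded) stands in for the tools mentioned above, and hit_count is a user-supplied function wrapping whatever web search engine is available, since no particular search API is prescribed here. It is an illustration, not the authors' implementation.

```python
from nltk import pos_tag, word_tokenize        # assumes punkt and the POS tagger data
from nltk.corpus import wordnet as wn          # assumes the WordNet data

# Hearst patterns and the additional patterns of Table 1 as query templates.
PATTERNS = [
    "{hypo}, and other {hyper}", "{hypo}, or other {hyper}", "{hypo} is a {hyper}",
    "{hyper}, such as {hypo}", "{hyper}, including {hypo}", "{hyper}, especially {hypo}",
    "{hyper}, called {hypo}", "{hyper}, particularly {hypo}",
    "{hyper}, for example {hypo}", "{hyper}, among which {hypo}",
]

def fill(pattern, hypo, hyper=""):
    """Instantiate a pattern; an empty hypernym slot is simply dropped."""
    return " ".join(pattern.format(hypo=hypo, hyper=hyper).split()).strip(" ,")

def gloss_nouns(term, k=3):
    """Supervised information (step 2): the first k nouns in the gloss of the
    term's first WordNet noun synset."""
    synsets = wn.synsets(term, pos=wn.NOUN)
    if not synsets:
        return []
    tagged = pos_tag(word_tokenize(synsets[0].definition()))
    return [word for word, tag in tagged if tag.startswith("NN")][:k]

def discovery_queries(term, context_terms):
    """Step 3: pattern queries for `term`, extended with corpus context
    terms and with WordNet gloss nouns (as in Tables 2 and 3)."""
    related = [" ".join(context_terms), " ".join(gloss_nouns(term))]
    return [fill(p, term) + " " + extra for p in PATTERNS for extra in related if extra]

def score_candidate(term, candidate, hit_count):
    """Step 5 / Equation 2: sum over the patterns of
    hits(pattern(term, CH)) / hits(CH). `hit_count` is an assumed callable
    mapping a query string to the hit count reported by a search engine."""
    denominator = hit_count(candidate) or 1    # avoid division by zero
    return sum(hit_count(fill(p, term, candidate)) for p in PATTERNS) / denominator
```

For instance, discovery_queries("museum", ["cash", "travel", "product"]) produces queries of the kind shown in Tables 2 and 3, and score_candidate("museum", "attraction", hit_count) performs the kind of computation summarized in Tables 4 and 5, for whatever hit counts the chosen search engine reports.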
4 EXPERIMENTS AND RESULTS
A sample of the Lonely Planet corpus (http://olc.ijs.si/lpReadme.html) was used in the experiments. To illustrate the experiment, the term museum was considered. The terms with the highest frequencies in the sample corpus were: cash, travel,
frequencies in the sample corpus were: cash, travel,
and product. The extracted words from the WordNet
synset for museum were: collection, object, and dis-
play; their lexical pattern query set is shown in Ta-
ble 2 and Table 3. Using the query set with only
lexical patterns, the list of candidate hypernyms was:
<site, place, attraction, department of history>. Us-
ing a query with added information, the new candi-
date hypernyms were: <depository, institution>. A
new lexical pattern query set was created using each
one. Then, using the number of obtained hits in the
web search, the corresponding score was computed
for each candidate hypernym. For example, for the
term attraction, the obtained hits are shown in Table
4.
Table 2: Example of a web query set for term museum using
the higher frequency terms in the Lonely Planet Corpus.
museum,+and+other+cash+travel+product
museum,+or+other+cash+travel+product
museum+is+a+cash+travel+product
such+as+museum+cash+travel+product
including+museum+cash+travel+product
especially+museum+cash+travel+product
called+museum+cash+travel+product
particularly+museum+cash+travel+product
for+example+museum+cash+travel+product
among+which+museum+cash+travel+product
In Table 5, it can be seen that the best hypernym for museum is attraction, which within the tourist context could be a good option; however, it is important to note that the second-best candidate is institution.
Table 3: Example of a web query set for the term museum
using WordNet synsets.
museum,+and+other+collection+object+display
museum,+or+other+collection+object+display
museum+is+a+collection+object+display
such+as+museum+collection+object+display
including+museum+collection+object+display
especially+museum+collection+object+display
called+museum+collection+object+display
particularly+museum+collection+object+display
for+example+museum+collection+object+display
among+which+museum+collection+object+display
Table 4: Example of the web query set for evaluating the
term attraction.
Query Hits
museum,+and+other+attraction 12300000
museum,+or+other+attraction 12300000
museum+is+a+attraction 26900000
attraction+such+as+museum 26900000
attraction+including+museum 26900000
attraction+especially+museum 26900000
attraction+called+museum 11600000
attraction+particularly+museum 26800000
attraction+for+example+museum 3780000
attraction+among+which+museum 12500000
Table 5: Total score of candidate hypernyms for term mu-
seum.
Candidate hypernym Total score
attraction 3.74220
institution 3.65833
depository 1.50125
department of history 0.82055
place 0.21463
site 0.09794
According to different authors, the definitions of museum are:

“...a museum is a building or institution which houses and cares for a collection of artifacts and other objects of scientific, artistic, or historical importance and makes them available for public viewing through exhibits that may be permanent or temporary...” (Alexander, E. P. and Alexander, M. Museums in Motion: An Introduction to the History and Functions of Museums. Rowman & Littlefield, 2008. ISBN 0-7591-0509-X)

“Museums enable people to explore collections for inspiration, learning and enjoyment. They are institutions that collect, safeguard and make accessible artefacts and specimens, which they hold in trust for society...” (http://www.museumsassociation.org/about/frequently-asked-questions)

“The museum is an empowering institution, meant to incorporate all who would become part of our shared cultural experience...” (Lilla, M. The Great Museum Muddle. New Republic, April 8, 1985, pp. 25-29)
According to the information added to the queries,
the term institution is a good candidate hypernym for museum. Figure 4 shows the taxonomy created for the group of terms <art, culture, library, science, book, travel> related to museum. The taxonomy is constructed following these steps: pairs of terms are used to build an extended query set, which is sent to the web search engine; the method determines which of the two terms in a pair is a hypernym of the other; the method is then repeated excluding the hypernym found previously. The group of terms is the result of the CBC clustering algorithm. A minimal sketch of this iterative construction is given below.
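The following sketch assumes hypernym_of_pair is the pairwise web-based decision described above, returning the hypernym of a pair or None; it is an illustration under that assumption, not the authors' code.

```python
from itertools import combinations

def build_taxonomy_levels(terms, hypernym_of_pair):
    """Greedily peel off, at each step, the term that most often acts as the
    hypernym within the remaining group; each peeled term becomes the parent
    of the terms remaining at that level."""
    levels, remaining = [], list(terms)
    while len(remaining) > 1:
        votes = {t: 0 for t in remaining}
        for a, b in combinations(remaining, 2):
            h = hypernym_of_pair(a, b)          # may return a, b, or None
            if h in votes:
                votes[h] += 1
        parent = max(votes, key=votes.get)
        levels.append((parent, [t for t in remaining if t != parent]))
        remaining.remove(parent)                # repeat without the found hypernym
    return levels
```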
Figure 4: Taxonomy created for the group of terms related
with museum.
Figure 5: Taxonomy created for the group of terms related
with plant.
Continuing the experiments, a query set was constructed for the term plant and its group of related terms, using the WordNet synset terms flora, botany, and organism. Table 6 shows the hypernyms obtained for each term, together with the appropriate WordNet hypernym, for the group of terms <plant, vegetation, park, garden, region, safari, environment>. The resulting hierarchical structure is shown in Figure 5. Note that these taxonomies (Figures 4 and 5) correspond only to the information extracted from the input corpus (Lonely Planet); they do not cover the general domain, and they can be enhanced by using additional corpora.
Table 6: Hypernyms obtained and WordNet hypernyms for the group of terms related to the term plant.

Term         | Hypernym obtained | WordNet hypernym
plant        | organism          | organism, being
park         | plant             | tract, piece of land
garden       | park              | vegetation
region       | park              | location
safari       | garden            | expedition, travel
environment  | garden            | geographical area
vegetation   | plant             | collection, aggregation
5 CONCLUSIONS
This paper describes an approach to discover hypernyms. The use of related information in web queries seems a good approximation for narrowing the search results. These queries are more specific and indicate that 1) there is a relation between the terms and 2) the terms and their hypernym are in the same context. The method can be applied to any knowledge domain. WordNet seems to be limited because it does not include nouns composed of more than one term and it only includes some proper nouns. The obtained results can be improved by resolving ambiguous terms. Adding new lexical patterns to the queries and extending the search to frequently-asked-questions blogs and Wikipedia are good options to explore. The created taxonomies are consistent with the input corpus. This makes it possible for the taxonomies to be used in applications where the structure of the corpus content is crucial. Finally, future work will consider additional experimentation and comparison with other state-of-the-art approaches.
REFERENCES
Blohm, S. and Cimiano, P. (2007). Learning Patterns
from the Web-Evaluating the Evaluation Functions-
Extended Abstract. OTT’06, 1:101.
Burgun, A. and Bodenreider, O. (2001). Aspects of the tax-
onomic relation in the biomedical domain. In Pro-
ceedings of the international conference on Formal
Ontology in Information Systems-Volume 2001, page
233. ACM.
Caraballo, S. (1999). Automatic construction of a
hypernym-labeled noun hierarchy from text. In Pro-
ceedings of the 37th annual meeting of the Association
for Computational Linguistics, pages 120–126. Asso-
ciation for Computational Linguistics.
Cederberg, S. and Widdows, D. (2003). Using LSA and
noun coordination information to improve the preci-
sion and recall of automatic hyponymy extraction. In
Proceedings of the seventh conference on Natural lan-
guage learning at HLT-NAACL 2003-Volume 4, page
118. Association for Computational Linguistics.
Cicurel, L., Bloehdorn, S., and Cimiano, P. (2007). Cluster-
ing of polysemic words. Advances in Data Analysis,
pages 595–602.
Cimiano, P. and Staab, S. (2004). Learning by googling.
ACM SIGKDD explorations newsletter, 6(2):24–33.
Gruber, T. (1993). A translation approach to portable ontol-
ogy specifications. Knowledge acquisition, 5(2):199–
220.
Hearst, M. (1992). Automatic acquisition of hyponyms
from large text corpora. In Proceedings of the 14th
conference on Computational linguistics-Volume 2,
pages 539–545. Association for Computational Lin-
guistics.
Maedche, A. and Staab, S. (2001). Ontology learning
for the semantic web. Intelligent Systems, IEEE,
16(2):72–79.
Ortega-Mendoza, R., Villaseñor-Pineda, L., and Montes-y-Gómez, M. (2007). Using lexical patterns for extracting hyponyms from the web. MICAI 2007: Advances in Artificial Intelligence, pages 904–911.
Pantel, P. (2003). Clustering by committee. PhD thesis,
University of Alberta.
Pantel, P. and Lin, D. (2002). Discovering word senses from
text. In Proceedings of the eighth ACM SIGKDD in-
ternational conference on Knowledge discovery and
data mining, pages 613–619. ACM.
Pantel, P., Ravichandran, D., and Hovy, E. (2004). Towards
terascale knowledge acquisition. In Proceedings of
the 20th international conference on Computational
Linguistics, page 771. Association for Computational
Linguistics.
Ritter, A., Soderland, S., and Etzioni, O. (2009). What is
this, anyway: Automatic hypernym discovery. In Pro-
ceedings of AAAI-09 Spring Symposium on Learning
by Reading and Learning to Read, pages 88–93.
Sánchez, D. (2009). Domain ontology learning from the web. The Knowledge Engineering Review, 24(04):413–413.
Sang, E. (2007). Extracting hypernym pairs from the web.
In Proceedings of the 45th Annual Meeting of the
ACL on Interactive Poster and Demonstration Ses-
sions, pages 165–168. Association for Computational
Linguistics.
Schutz, A. and Buitelaar, P. (2005). Relext: A tool for re-
lation extraction from text in ontology extension. In
Gil, Y., Motta, E., Benjamins, V., and Musen, M., ed-
itors, The Semantic Web ISWC 2005, volume 3729 of
Lecture Notes in Computer Science, pages 593–606.
Springer Berlin / Heidelberg.
Snow, R., Jurafsky, D., and Ng, A. (2005). Learning
syntactic patterns for automatic hypernym discovery.
Advances in Neural Information Processing Systems,
17:1297–1304.
Turney, P. D. and Pantel, P. (2010). From frequency to
meaning: vector space models of semantics. J. Artif.
Int. Res., 37:141–188.