MULTI-LABELED PATENT DOCUMENT CLASSIFICATION USING TECHNICAL TERM THESAURUS

Yoshimi Suzuki and Fumiyo Fukumoto

Interdisciplinary Graduate School of Medicine and Engineering, University of Yamanashi, 4-3-11 Takeda, Kofu, Japan

Keywords:

Thesaurus, Patent, Document classification.

Abstract:

This paper presents a method for patent document classification that uses an expanded technical term thesaurus. Structural information is very useful for classifying structured documents such as patent documents. However, when a document is divided by applicant tag, the number of words in each part is limited. For example, the ‘Title of invention’ tag is very important for patent document classification, but it contains very few words. To deal with this problem, we employ two methods: one classifies applicant tags into semantic tags, and the other performs word expansion using an expanded technical term thesaurus. For thesaurus expansion, our system integrates technical terms into a thesaurus using patent documents. The classification results showed that the method using the expanded thesaurus performed better than the method without a thesaurus. Although our method is very simple, it is comparable to other methods. These results suggest that a thesaurus, and our method for expanding it, can be useful for patent document classification.

1 INTRODUCTION

Patent document classification is an important issue in NLP. Several workshops on patent document classification have been held, and many researchers have proposed various methods. However, few of these methods use a thesaurus, because patent document classification requires a thesaurus of technical terms.

Many machine-readable thesauri are currently available, e.g. WordNet (Fellbaum, 1998), BunruiGoiHyo (National Language Research Institute, 1964) and the EDR concept dictionary. However, they are thesauri of common words: they contain few technical terms, or their hierarchical semantic features are not designed for technical terms. JST Thesaurus (Japan Science and Technology Agency, 1999) is a technical term thesaurus. It consists of 43,314 index words, but many technical terms are not listed in it. Therefore, it is necessary to construct a thesaurus of technical terms.

There are many studies on thesaurus construction and thesaurus expansion. Tokunaga et al. (Tokunaga, 1997) and Uramoto (Uramoto, 1996) proposed methods for extending an existing thesaurus by classifying new words in terms of that thesaurus. However, their studies target commonly used words rather than technical terms.

Thesaurus construction and thesaurus expansion require the extraction of similar word pairs. Several methods based on dependency relationships have been proposed for this purpose (Hindle, 1990; Lin, 1998; Hagiwara et al., 2006). However, these methods target commonly used words, and their authors did not mention whether the methods are effective for technical terms.

For extracting similar words of a technical term, we have to deal with the following two difficulties.

• Some technical terms do not appear frequently.

• Some technical terms are used in the same contexts.

Therefore, it is difficult to extract various dependency relationships of technical terms.

In this paper, we use an expanded thesaurus for text categorization. We propose a method for integrating new technical terms into a core thesaurus, and a method for using the thesaurus in patent document classification. We perform experiments to confirm that the expanded thesaurus is effective for multi-labeled patent document classification, and we compare the results of our system with those of other methods. Although our method is very simple, our system is competitive with other systems.


Suzuki Y. and Fukumoto F.

MULTI-LABELED PATENT DOCUMENT CLASSIFICATION USING TECHNICAL TERM THESAURUS.

DOI: 10.5220/0003658504250428

In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD-2011), pages 425-428

ISBN: 978-989-8425-80-5

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 1: System overview.

2 SYSTEM DESIGN

Our system consists of two phases: “Thesaurus expansion” and “Document classification”. Figure 1 shows an overview of the system.

2.1 Thesaurus Expansion

Some technical terms do not appear frequently even in large corpora, so we use the hierarchical semantic features of the thesaurus as a smoothing technique. First, we extract semantically similar word pairs using the dependency relationships between two words. For example, Lin proposed the “dependency triple” (Lin, 1998). A dependency triple consists of two words, $w$ and $w'$, and the grammatical relationship $r$ between them in the input sentence. $\|w, r, w'\|$ denotes the frequency count of the dependency triple $(w, r, w')$, and $\|w, r, *\|$ denotes the total number of occurrences of $(w, r)$ relationships in the corpus, where “$*$” indicates a wild card.

We used 10 kinds of Japanese case particles as $r$. Table 1 lists the case particles that we used as $r$.

In order to find the semantic feature corresponding to a new word, we extract dependency triples of the new word and of the extracted words. Using several extracted words yields many types of dependency triples. To extract the similar words from the core thesaurus, $I(w, r, w')$ is first calculated using Formula (1).

Table 1: 10 case particles.

case particle (r)   description
ga                  nominative case
no                  genitive case
wo                  accusative case
ni                  dative case
he                  goal
to                  comitative case
kara                elative case
yori                from, at
de                  inessive case
ya                  coordination

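\[
\begin{aligned}
I(w, r, w') &= -\log\bigl(P_{MLE}(B)\,P_{MLE}(A \mid B)\,P_{MLE}(C \mid B)\bigr) - \bigl(-\log P_{MLE}(A, B, C)\bigr) \\
            &= \log \frac{\|w, r, w'\| \times \|*, r, *\|}{\|w, r, *\| \times \|*, r, w'\|}
\end{aligned}
\tag{1}
\]

where $P_{MLE}$ is the maximum likelihood estimation of a probability distribution. Following Lin (1998), $A$, $B$ and $C$ are the events that a randomly selected word is $w$, that a randomly selected dependency type is $r$, and that a randomly selected word is $w'$, respectively.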
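The count-based form of Formula (1) can be illustrated with a minimal Python sketch. The function names, the toy triples and the choice to return negative infinity for unseen triples are assumptions made for this illustration; they are not taken from the system described in this paper.

```python
import math
from collections import Counter


def build_counts(triples):
    """Aggregate ||w,r,w'||, ||w,r,*||, ||*,r,w'|| and ||*,r,*|| from a list of triples."""
    c_wrw, c_wr, c_rw, c_r = Counter(), Counter(), Counter(), Counter()
    for w, r, w2 in triples:
        c_wrw[(w, r, w2)] += 1
        c_wr[(w, r)] += 1
        c_rw[(r, w2)] += 1
        c_r[r] += 1
    return c_wrw, c_wr, c_rw, c_r


def mutual_information(w, r, w2, counts):
    """I(w, r, w') = log(||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||))."""
    c_wrw, c_wr, c_rw, c_r = counts
    joint = c_wrw[(w, r, w2)]
    if joint == 0:
        # Assumption: an unseen triple carries no evidence, so it gets -infinity.
        return float("-inf")
    return math.log(joint * c_r[r] / (c_wr[(w, r)] * c_rw[(r, w2)]))


# Toy triples (hypothetical data; r is a case particle from Table 1).
triples = [
    ("filter", "wo", "design"),
    ("filter", "wo", "design"),
    ("filter", "wo", "use"),
    ("circuit", "wo", "design"),
    ("circuit", "wo", "use"),
    ("circuit", "wo", "use"),
]
counts = build_counts(triples)
mi = mutual_information("filter", "wo", "design", counts)
print(mi)
```

In this toy corpus the triple (filter, wo, design) occurs more often than independence would predict, so its value is positive, while (filter, wo, use) occurs less often and its value is negative.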

Let $T(w)$ be the set of pairs $(r, w')$ such that $\log \frac{\|w, r, w'\| \times \|*, r, *\|}{\|w, r, *\| \times \|*, r, w'\|}$ is positive. The similarity $Sim(w_1, w_2)$ between two words $w_1$ and $w_2$ is defined by Formula (2).

\[
Sim(w_1, w_2) = \frac{\sum_{(r, w) \in T(w_1) \cap T(w_2)} \bigl(I(w_1, r, w) + I(w_2, r, w)\bigr)}{\sum_{(r, w) \in T(w_1)} I(w_1, r, w) + \sum_{(r, w) \in T(w_2)} I(w_2, r, w)}
\tag{2}
\]
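Formula (2) can be sketched in Python as follows. The sketch assumes that the values $I(w, r, w')$ have already been computed and stored in a nested dictionary; the dictionary layout, the function names and the numbers are hypothetical.

```python
def positive_features(info, w):
    """T(w): the (r, w') pairs of w whose mutual information is positive."""
    return {feat: value for feat, value in info.get(w, {}).items() if value > 0}


def lin_similarity(info, w1, w2):
    """Sim(w1, w2) as in Formula (2), computed over precomputed I values."""
    t1 = positive_features(info, w1)
    t2 = positive_features(info, w2)
    denominator = sum(t1.values()) + sum(t2.values())
    if denominator == 0:
        # Assumption: words without any positive feature are treated as dissimilar.
        return 0.0
    shared = t1.keys() & t2.keys()
    numerator = sum(t1[feat] + t2[feat] for feat in shared)
    return numerator / denominator


# Hypothetical I(w, r, w') values, keyed by word and then by (r, w').
info = {
    "filter": {("wo", "design"): 1.0, ("no", "output"): 0.5, ("ga", "work"): -0.2},
    "circuit": {("wo", "design"): 0.5, ("ni", "connect"): 2.0},
}
sim = lin_similarity(info, "filter", "circuit")
print(sim)
```

Only the shared pair (wo, design) contributes to the numerator, and the pair with a negative value is excluded from $T(w)$ altogether. A word compared with itself scores 1, which is the upper bound of the measure.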

Finally, candidate semantic features for the new word are detected using the hierarchical semantic features of the core thesaurus.
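The paper does not spell out how the candidates are scored, so the following Python sketch shows only one plausible reading: each semantic feature is scored by the summed similarity of the new word's k most similar core-thesaurus words. The voting rule, the parameter k, the feature labels and the similarity values are all assumptions for illustration.

```python
from collections import defaultdict


def candidate_features(similarities, feature_of, k=3):
    """Rank semantic features by the summed similarity of the k nearest core words."""
    nearest = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:k]
    scores = defaultdict(float)
    for word, sim in nearest:
        if word in feature_of:
            scores[feature_of[word]] += sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical similarities between a new term and words in the core thesaurus.
similarities = {
    "Fourier transform": 0.42,
    "Laplace transform": 0.37,
    "wavelet": 0.30,
    "amplifier": 0.05,
}
# Hypothetical semantic feature (position in the hierarchy) of each core word.
feature_of = {
    "Fourier transform": "mathematics/transform",
    "Laplace transform": "mathematics/transform",
    "wavelet": "signal processing",
    "amplifier": "electronics/circuit",
}
ranked = candidate_features(similarities, feature_of, k=3)
print(ranked)
```

Summing over several neighbours rather than copying the single nearest word's feature is one way to limit the effect of a noisy similarity score for a rare term.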

2.2 Document Classiﬁcation

In the document classification phase, we classify each document into relevant themes.

Patent documents contain many applicant tags: “Title of invention”, “Abstract”, “Purpose”, “Claims”, and so on. Each document has about 56 applicant tags. Most applicant tags are used in most documents, but there are some notational variants among them. We classify these applicant tags into 6 semantic tags (Kim et al., 2005). The label of each semantic tag and examples of the applicant tags classified into it are shown in Table 2. In Table 2, “# of nouns” means the average number of nouns per document.
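The grouping of applicant tags into semantic tags can be sketched as a lookup table. The sketch below uses only the example pairs listed in Table 2, written in English for readability; the real applicant tags are Japanese, and the normalization step and function names are assumptions.

```python
# Example applicant tags from Table 2, mapped to their semantic tags.
SEMANTIC_TAG_OF = {
    "industrial application field": "Technological field",
    "title of the invention": "Purpose",
    "purpose of the invention": "Purpose",
    "means of solving the problem": "Method",
    "claim": "Claim",
    "composition": "Explanation",
    "embodiment example": "Example",
}


def normalize(tag):
    """Fold case and whitespace so that simple notational variants match."""
    return " ".join(tag.lower().split())


def group_by_semantic_tag(sections):
    """Collect the text of each (applicant tag, text) section under its semantic tag."""
    grouped = {}
    for applicant_tag, text in sections:
        semantic_tag = SEMANTIC_TAG_OF.get(normalize(applicant_tag))
        if semantic_tag is None:
            continue  # Assumption: applicant tags outside the table are skipped.
        grouped.setdefault(semantic_tag, []).append(text)
    return grouped


sections = [
    ("Title of the  Invention", "Centrifugal separator"),
    ("Purpose of the invention", "To separate particles efficiently."),
    ("Claim", "A separator comprising a rotor."),
    ("Drawing list", "Fig. 1"),
]
grouped = group_by_semantic_tag(sections)
print(grouped)
```

Merging the title and the purpose statement under one semantic tag gives the classifier more words per unit than either applicant tag provides on its own.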

Table 3 shows some examples of themes. Many themes correlate with the “Purpose” semantic tag. Therefore, we treated “Purpose” as the most important semantic tag, and in document classification we applied word expansion using the expanded thesaurus to the text under the “Purpose” tag.

KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development

Table 2: Examples of applicant tags classified into semantic tags.

Semantic tag          Examples of applicant tag                          # of nouns
Technological field   industrial application field                       80.5
Purpose               title of the invention, purpose of the invention   134.1
Method                means of solving the problem                       71.2
Claim                 claim                                              151.2
Explanation           composition                                        166.4
Example               embodiment example                                 72.5

Table 3: Examples of themes.

Theme code   Description
2B011        Mushroom Cultivation
3C036        Drilling and boring
4D057        Centrifugal separators
5E022        Connector installation
5K068        Stereo-broadcasting methods

For document classification, we used “Bag of Words” features and the distribution of words. Although we use a large amount of training data, many words in the test data do not appear in the training data. Therefore, we use word expansion based on the expanded thesaurus. Word expansion is useful, but expanding all words with the thesaurus would introduce noise and degrade the results. Therefore, we apply word expansion only to the text under the “Purpose” tag. We used a Naive Bayes classifier for document classification. The relevant themes $\widehat{\mathit{theme}}$ are selected using the following equation.

\[
\widehat{\mathit{theme}} = \arg\max_{\mathit{theme} \in \mathit{themes}} P(\mathit{theme}) \prod_{i} P(w_i \mid \mathit{theme})
\tag{3}
\]

where $w_i$ and $w'_i$ denote the $i$-th word and a related word of the $i$-th word, respectively. If $w_i$ is a word in the expanded thesaurus, the words that are its next neighbours in the thesaurus are also used as $w'_i$.
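The ranking in Formula (3) combined with word expansion can be sketched in Python as below. This is a sketch under stated assumptions and not the authors' implementation: add-one smoothing, log-space scoring, expanding only the test document, and the toy documents and neighbour list are all choices made for this illustration. The theme codes are taken from Table 3.

```python
import math
from collections import Counter, defaultdict


def expand(words, neighbours):
    """Append the thesaurus neighbours w' of every word w found in the thesaurus."""
    expanded = list(words)
    for w in words:
        expanded.extend(neighbours.get(w, []))
    return expanded


class NaiveBayes:
    """Multinomial Naive Bayes that ranks themes for multi-labeled documents."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Assumption: add-alpha (Laplace) smoothing.
        self.theme_docs = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        self.n_docs = 0

    def fit(self, docs):
        for words, themes in docs:
            self.n_docs += 1
            self.vocab.update(words)
            for theme in themes:
                self.theme_docs[theme] += 1
                self.word_counts[theme].update(words)

    def rank(self, words):
        """Return themes sorted by log P(theme) + sum_i log P(w_i | theme)."""
        scores = {}
        vocab_size = len(self.vocab)
        for theme, n in self.theme_docs.items():
            total = sum(self.word_counts[theme].values())
            score = math.log(n / self.n_docs)
            for w in words:
                count = self.word_counts[theme][w]
                score += math.log((count + self.alpha) / (total + self.alpha * vocab_size))
            scores[theme] = score
        return sorted(scores, key=scores.get, reverse=True)


# Toy training documents (hypothetical word lists).
train = [
    (["mushroom", "cultivation", "bed"], ["2B011"]),
    (["centrifugal", "separator", "rotor"], ["4D057"]),
    (["drill", "boring", "bit"], ["3C036"]),
]
neighbours = {"centrifuge": ["separator"]}  # hypothetical thesaurus entry

model = NaiveBayes()
model.fit(train)
purpose_words = ["centrifuge"]  # unseen in training
ranking = model.rank(expand(purpose_words, neighbours))
print(ranking)
```

In this toy setting the word "centrifuge" never occurs in the training data, so without expansion every theme receives the same score. After expansion the neighbour "separator" ties the document to theme 4D057. Taking the top-ranked themes instead of a single argmax is how a ranking of this kind can serve the multi-labeled setting.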

3 EXPERIMENTS

We conducted an experiment to evaluate the effectiveness of the expanded thesaurus for patent document classification. In the experiment, we selected the corresponding themes for each patent document out of about 2,900 themes.

Table 4: The number of related words.

related words          # of words
NT (narrower terms)    102,645
BT (brother terms)     122,606
RT (related terms)     26,958

Figure 2: A part of JST Thesaurus (Fourier transform).

3.1 Experimental Setup

For the experiments, we used Japanese patent documents and the technical term thesaurus expanded by our method.

We used patent publication bulletins written in Japanese (1993-1999), which were provided by the patent retrieval task of NTCIR Workshop 5 (Iwayama et al., 2005). The training data consisted of the documents from 1993 to 1997, and the test data consisted of patent documents from 1998 to 1999. There were 1,707,194 training documents and 2,008 test documents. Each training and test document had multiple themes. The number of themes was 2,903, and each document had about 2.26 themes on average.

First, we obtained similar word pairs and integrated them into a core thesaurus. As the core thesaurus we used JST Thesaurus (Japan Science and Technology Agency, 1999), which has 43,314 index words. Each index word has about 6 related words on average. The related words are classified into 3 categories: NT (narrower terms), BT (brother terms) and RT (related terms). Table 4 shows the number of related words.

Figure 2 illustrates an index word (Fourier transform) of JST Thesaurus and its related words. The index words appeared about 2 million times in the 2,008 patent documents (1998-1999).

3.2 Document Classiﬁcation Results

We retrieved relevant documents from the Japanese patent documents using thesaurus information, and we compared our results with the results in NTCIR5.


Table 5: Classification results (MAP).

Method                              MAP
cosine                              0.45
cosine + JST Thesaurus              0.46
cosine + expanded thesaurus         0.46
Naive Bayes                         0.63
Naive Bayes + JST Thesaurus         0.64
Naive Bayes + expanded thesaurus    0.64
k-NN (BOLA1 (Kim et al., 2005))     0.69
Naive Bayes (JSPAT2)                0.66
k-NN (WGLAB9)                       0.62
VSM (FXDM3)                         0.49

We used Mean Average Precision (MAP) to compare these results. MAP is defined by the following equation:

\[
MAP(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} Precision(R_{jk})
\tag{4}
\]

where $Q$ is the set of test documents, $m_j$ is the number of relevant documents of document $j$, and $R_{jk}$ means the $k$-th ranked retrieval results of document $j$.
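The MAP computation can be illustrated with a short Python sketch. It follows the standard definition, in which precision is measured at the rank of each relevant item and relevant items that are never returned contribute zero. The function names and the two toy rankings are assumptions; the theme codes are borrowed from Table 3.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values measured at the rank of each relevant item."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Relevant items missing from the ranking add nothing to the sum.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average the per-document average precision over all test documents."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Two toy test documents: (ranked themes, set of correct themes).
runs = [
    (["4D057", "2B011", "3C036"], {"4D057", "3C036"}),
    (["5E022", "5K068", "2B011"], {"5K068"}),
]
map_score = mean_average_precision(runs)
print(map_score)
```

For the first document the correct themes appear at ranks 1 and 3, giving (1/1 + 2/3)/2; for the second the only correct theme appears at rank 2, giving 1/2. The mean of the two values is 2/3.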

Table 5 shows the results of document classification. In Table 5, BOLA1, JSPAT2, WGLAB9 and FXDM3 are run IDs from the NTCIR5 Patent Retrieval Task. BOLA1 used k-NN and the structure of patent documents. JSPAT2 used Naive Bayes. WGLAB9 used k-NN, where the retrieval model is BM11 or the vector space model. FXDM3 used the vector space model.

4 DISCUSSION

We expanded a technical term thesaurus using Japanese patent documents. To confirm that our thesaurus is useful for text classification, we compared the results obtained with our thesaurus against the results obtained without a thesaurus. As a result, we found that our thesaurus is effective for document classification: MAP rose from 0.63 to 0.64 with Naive Bayes and from 0.45 to 0.46 with the cosine method (Table 5). We also compared our method with the methods in the NTCIR5 patent classification task. Although our method is very simple, we found that our system is competitive with other systems. We classified applicant tags into 6 semantic tags in the experiments, and applied word expansion to the “Purpose” tag.

Future work includes (i) applying the method to other data for quantitative evaluation, and (ii) comparing the method with other classification techniques to evaluate its effectiveness.

ACKNOWLEDGEMENTS

The authors would like to thank the referees for their comments on the earlier version of this paper. This work was partially supported by The Telecommunications Advancement Foundation.

REFERENCES

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. Bradford Books.

Hagiwara, M., Ogawa, Y., and Toyama, K. (2006). Selection of effective contextual information for automatic synonym acquisition. In Proc. of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 353–360.

Hindle, D. (1990). Noun classification from predicate-argument structures. In Proceedings of 28th Annual Meeting of the Association for Computational Linguistics, pages 268–275.

Iwayama, M., Fujii, A., and Kando, N. (2005). Overview of classification subtask at NTCIR-5 patent retrieval task. In Proceedings of NTCIR-5 Workshop Meeting.

Japan Science and Technology Agency (1999). JST (JICST) Thesaurus 1999. http://jois.jst.go.jp/JOIS/html/thesaurus index.htm.

Kim, J.-H., Huang, J.-X., Jung, H.-Y., and Choi, K.-S. (2005). Patent document retrieval and classification at KAIST. In Proceedings of NTCIR-5 Workshop Meeting.

Lin, D. (1998). Automatic retrieval and clustering of similar words. In Proceedings of 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 768–774.

National Language Research Institute (1964). Bunruigoihyo. Shuei publisher (In Japanese).

Tokunaga, T. (1997). Extending a thesaurus by classifying words. In Proceedings of the ACL-EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources, pages 16–21.

Uramoto, N. (1996). Positioning unknown words in a thesaurus by using information extracted from a corpus. In Proceedings of COLING’96, pages 956–961.
