TEXT CATEGORIZATION USING EARTH MOVER’S DISTANCE AS SIMILARITY MEASURE

Hidekazu Yanagimoto and Sigeru Omatu

Osaka Prefecture University, 1-1, Sakai, Osaka, Japan

Keywords:

Text Categorization, Earth Mover’s Distance.

Abstract:

We propose a text categorization system that uses the Earth Mover’s Distance (EMD) as the similarity measure between documents. Many text categorization systems adopt the Vector Space Model and use cosine similarity as the similarity measure between documents. This approach assumes that the words in a document are uncorrelated, because the vector space is treated as orthogonal. However, the assumption is undesirable when documents include many synonyms and polysemous words. The EMD does not require the assumption because it is computed as the solution of a transportation problem. To compute the EMD while taking the dependency among words into account, we define the distance between words, which is needed to compute the EMD, using the co-occurrence frequency of the words. We evaluate the proposed method on the ModApte split of the Reuters-21578 text categorization test collection and confirm that it improves the precision rate for text categorization.

1 INTRODUCTION

Text categorization systems automatically categorize information using human-labeled documents. Such a system relies on a similarity measure between an unlabeled document and the labeled documents. The documents are represented as vectors with the Vector Space Model (VSM), and cosine similarity is used for text categorization. When the cosine similarity is computed, we assume that the vector space is an orthogonal vector space. However, this assumption is not always fulfilled because a document includes many synonyms and polysemous words. Hence, we need a new similarity measure that does not assume that the words are uncorrelated.

We propose a text categorization system using the Earth Mover’s Distance (EMD) as the similarity measure between documents. The EMD does not assume that the words are uncorrelated. Instead, it requires a distance between the words. We define the distance according to the relationship between the words. To capture this relationship, we use the co-occurrence frequency of the words. The co-occurrence is represented as a co-occurrence probability, and the distance is computed from that probability. To evaluate the proposed method, we carry out an experiment in which the ModApte split of the Reuters-21578 text categorization test collection is categorized with the proposed method and with a conventional method. We confirm that the proposed method is superior to the conventional method in terms of precision rate.

2 PREVIOUS WORK

Text categorization is the activity of labeling natural language texts with predefined categories. Various machine learning algorithms have been applied to text categorization (Sebastiani, 2002). Expert Network (Yang and Chute, 1994), a text categorization system based on the k-nearest neighbor method (Mitchell, 1997), achieves high recall and precision. The Expert Network uses a cosine measure between an input document and a labeled document.

Text categorization systems usually deal with texts represented as vectors by the Vector Space Model (VSM) (Salton et al., 1975). In the VSM, documents


Yanagimoto H. and Omatu S. (2007).

TEXT CATEGORIZATION USING EARTH MOVER’S DISTANCE AS SIMILARITY MEASURE.

In Proceedings of the Ninth International Conference on Enterprise Information Systems - ISAS, pages 632-635

DOI: 10.5220/0002406606320635

 SciTePress

are represented as real-valued vectors whose elements are weights for indexing words. When a retrieval system retrieves documents that relate to a query, the similarity between a document and the query is calculated with a cosine measure. In this case we assume that the vector space consists of orthogonal basis vectors. However, this assumption does not always hold because of synonyms, polysemous words, and co-occurrence of words. Dependency among words distorts the similarity measure between documents.

Wan and Peng (2005) propose a similarity measure that takes the dependency between words into account. They use the Earth Mover’s Distance (EMD) (Rubner et al., 2000) to calculate the similarity between documents. They use the electronic lexical database WordNet (Miller et al., 1990) and define the distance between words according to their semantic proximity. However, it is difficult to quantify semantic proximity because it is defined in linguistic terms. A well-known problem with thesaurus-based methods is that general-purpose thesauri do not have sufficient vocabulary coverage across different applications. Thesaurus-based methods also cannot cover neologisms, since almost all thesauri are maintained manually. Hence, we tackle these problems by improving the similarity measure.

3 TEXT CATEGORIZATION USING EMD

3.1 Vector Space Model

In the VSM, all documents written in natural language are represented as vectors so that they can be processed on computers. An element of the document vector denotes the weight of an indexing word included in the document. The weight of an indexing word is calculated by tf*idf.

$$ w_{ij} = \mathrm{tf}_{ij} \log \frac{N}{\mathrm{df}_j} \tag{1} $$

where $w_{ij}$ is the weight of a term $T_j$ in a document $Doc_i$, $\mathrm{tf}_{ij}$ is the term frequency of $T_j$ in $Doc_i$, $\mathrm{df}_j$ is the document frequency of $T_j$, and $N$ is the total number of documents. A document $Doc_i$ is represented as the document vector $d_i = (w_{i1}, w_{i2}, \cdots, w_{iV})$, where $V$ denotes the vocabulary size of the corpus.
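The weighting of Equation (1) can be illustrated with a minimal sketch. The function name `tfidf_vectors`, the toy documents, and the use of the natural logarithm are assumptions made for illustration; the paper does not state the base of the logarithm.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with w_ij = tf_ij * log(N / df_j)."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # df_j: number of documents that contain term T_j at least once
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: occurrences of T_j in Doc_i
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Toy corpus of already tokenized documents (hypothetical data).
docs = [["oil", "price", "oil"], ["wheat", "price"], ["oil", "export"]]
vocab, vectors = tfidf_vectors(docs)
for doc_vector in vectors:
    print(dict(zip(vocab, doc_vector)))
```

A term that appears in every document receives weight 0 under this scheme, because log(N / df_j) vanishes when df_j = N.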

The cosine measure is often used as a similarity measure between documents. The cosine measure $sim_{cos}(d_i, d_j)$ between documents $Doc_i$ and $Doc_j$ is calculated as follows.

$$ sim_{cos}(d_i, d_j) = \frac{d_i^T d_j}{\| d_i \| \, \| d_j \|} \tag{2} $$

where $T$ denotes transposition and $\| d_i \|$ denotes the Euclidean norm of $d_i$. Since we normalize all document vectors to unit norm in the experiment, Equation (2) reduces to

$$ sim_{cos}(d_i, d_j) = d_i^T d_j \tag{3} $$
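The step from Equation (2) to Equation (3) can be checked numerically. The sketch below uses hypothetical helper names (`cosine_similarity`, `normalize`, `dot`) and toy vectors; it only shows that the cosine of two vectors equals the plain inner product of their unit-norm versions.

```python
import math

def dot(u, v):
    """Inner product d_i^T d_j."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(d_i, d_j):
    """Equation (2); returns 0.0 when either vector is all zeros."""
    norm_i, norm_j = math.sqrt(dot(d_i, d_i)), math.sqrt(dot(d_j, d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot(d_i, d_j) / (norm_i * norm_j)

def normalize(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as is)."""
    norm = math.sqrt(dot(vec, vec))
    return [w / norm for w in vec] if norm > 0 else list(vec)

d_i, d_j = [3.0, 4.0, 0.0], [4.0, 3.0, 0.0]
full = cosine_similarity(d_i, d_j)              # Equation (2)
reduced = dot(normalize(d_i), normalize(d_j))   # Equation (3)
print(full, reduced)

# Two documents that share no indexing word always score 0,
# even if their words are synonyms.
disjoint = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The last line illustrates the limitation discussed in Section 2: under the orthogonality assumption, related but distinct words contribute nothing to the similarity.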

3.2 Earth Mover’s Distance

The EMD is a method for evaluating the dissimilarity between two multi-dimensional distributions in a feature space where a distance between features can be defined. The distributions are represented as signatures, which consist of feature vectors and weights for the features. A multi-dimensional distribution $P$ represented as a signature is $P = \{(p_1, w_{p_1}), (p_2, w_{p_2}), \cdots, (p_m, w_{p_m})\}$, where $p_i$ is a feature vector and $w_{p_i}$ is the weight of that feature vector. Now, let $Q = \{(q_1, w_{q_1}), (q_2, w_{q_2}), \cdots, (q_n, w_{q_n})\}$ be the second distribution.

A distance between the features is defined in the feature space and is called the ground distance. Let $D = [d_{ij}]$ be the ground distance matrix, where $d_{ij}$ is the ground distance between the features $p_i$ and $q_j$. Let $F = [f_{ij}]$ be a flow matrix, where $f_{ij}$ is the flow between $p_i$ and $q_j$. We want to find an optimal flow $F^*$ that minimizes the following cost function.

$$ \mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij} \tag{4} $$

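The cost function is minimized under the following constraints:

$$ f_{ij} \geq 0, \quad 1 \leq i \leq m, \; 1 \leq j \leq n \tag{5} $$

$$ \sum_{j=1}^{n} f_{ij} \leq w_{p_i}, \quad 1 \leq i \leq m \tag{6} $$

$$ \sum_{i=1}^{m} f_{ij} \leq w_{q_j}, \quad 1 \leq j \leq n \tag{7} $$

$$ \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left( \sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{n} w_{q_j} \right) \tag{8} $$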

Using the optimal flow $F^*$, the Earth Mover’s Distance $\mathrm{EMD}(P, Q)$ is defined as

$$ \mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f^*_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f^*_{ij}} \tag{9} $$

Since the resulting work $\mathrm{WORK}(P, Q, F^*)$ depends on the size of the signatures, it needs to be normalized. Without normalization, smaller signatures would be favored, which is undesirable for a similarity measure.
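The paper does not state which solver was used to obtain the optimal flow. As one possibility, the transportation problem of Equations (4) to (8) can be solved as a minimum-cost flow with successive shortest paths. The sketch below is only an illustration under that assumption: the function name `emd`, the argument layout, and the use of exact rational arithmetic (`fractions.Fraction`) are choices made here, not details taken from the paper, and the weights are assumed to be positive.

```python
from fractions import Fraction

def emd(weights_p, weights_q, dist):
    """Earth Mover's Distance of Eq. (9), solved by successive shortest paths."""
    m, n = len(weights_p), len(weights_q)
    supply = [Fraction(w) for w in weights_p]   # weight still to be moved from p_i
    demand = [Fraction(w) for w in weights_q]   # weight q_j can still receive
    cost = [[Fraction(d) for d in row] for row in dist]
    flow = [[Fraction(0)] * n for _ in range(m)]
    total = min(sum(supply), sum(demand))       # total flow required by Eq. (8)
    remaining = total
    while remaining > 0:
        # Bellman-Ford on the residual graph; nodes 0..m-1 are p_i, m..m+n-1 are q_j.
        best = [None] * (m + n)
        prev = [None] * (m + n)
        for i in range(m):
            if supply[i] > 0:
                best[i] = Fraction(0)
        for _ in range(m + n):
            changed = False
            for i in range(m):
                for j in range(n):
                    v = m + j
                    # Forward edge p_i -> q_j with cost d_ij.
                    if best[i] is not None and (
                            best[v] is None or best[i] + cost[i][j] < best[v]):
                        best[v], prev[v], changed = best[i] + cost[i][j], i, True
                    # Backward edge q_j -> p_i undoes existing flow at cost -d_ij.
                    if flow[i][j] > 0 and best[v] is not None and (
                            best[i] is None or best[v] - cost[i][j] < best[i]):
                        best[i], prev[i], changed = best[v] - cost[i][j], v, True
            if not changed:
                break
        # Cheapest q_j that can still receive weight.
        target = min((m + j for j in range(n) if demand[j] > 0),
                     key=lambda v: best[v])
        path = [target]
        while prev[path[-1]] is not None:
            path.append(prev[path[-1]])
        path.reverse()
        edges = list(zip(path, path[1:]))
        # The augmentation is limited by supply, demand, and any flow being undone.
        amount = min([remaining, supply[path[0]], demand[target - m]] +
                     [flow[v][u - m] for u, v in edges if u >= m])
        for u, v in edges:
            if u < m:
                flow[u][v - m] += amount
            else:
                flow[v][u - m] -= amount
        supply[path[0]] -= amount
        demand[target - m] -= amount
        remaining -= amount
    work = sum(cost[i][j] * flow[i][j] for i in range(m) for j in range(n))
    return float(work / total)                  # normalization of Eq. (9)

# Two unit-weight features on each side; the cheapest plan sends p_1 to q_2
# and p_2 to q_1 (work 2 + 1 = 3, total flow 2).
print(emd([1, 1], [1, 1], [[1, 2], [1, 4]]))  # 1.5
```

For vocabularies of realistic size, a dedicated linear programming or transportation solver would be a more practical choice than this small routine.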


3.3 Earth Mover’s Distance for Texts

Documents need to be represented as signatures in order to compute the similarity between them using the EMD. Let a feature vector $p_i$ be an indexing word, and let the weight of the feature be the weight of the indexing word. A document $Doc_i$ is represented as a signature $D_i = \{(T_1, w_{i1}), (T_2, w_{i2}), \cdots, (T_n, w_{in})\}$ that includes only the indexing words occurring in the document $Doc_i$.

We need to define the ground distance between indexing words to compute the EMD. We define the ground distance according to the relationship between the indexing words, and this relationship is defined based on their co-occurrence. Let the co-occurrence frequency of $T_i$ and $T_j$ be $occur(T_i, T_j)$. The conditional probability $P(T_j | T_i)$, which is the probability that $T_j$ occurs in a sentence including $T_i$, is defined as

$$ P(T_j | T_i) = \frac{occur(T_i, T_j)}{\sum_{k} occur(T_i, T_k)} \tag{10} $$

The ground distance $d_{ij}$ from $T_i$ to $T_j$ is defined using the conditional probability $P(T_j | T_i)$:

$$ d_{ij} = 1 - P(T_j | T_i) \tag{11} $$
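As an illustration of Equations (10) and (11), the following sketch derives ground distances from sentence-level co-occurrence counts. The function names, the toy sentences, the reading of the conditional probability as P(T_j | T_i), and the convention that a word is at distance 0 from itself are assumptions; the paper does not specify how the distance of a word to itself is handled.

```python
from collections import Counter
from itertools import permutations

def cooccurrence_counts(sentences):
    """occur(T_i, T_j): number of sentences that contain both words (i != j)."""
    occur = Counter()
    for sentence in sentences:
        for t_i, t_j in permutations(sorted(set(sentence)), 2):
            occur[(t_i, t_j)] += 1
    return occur

def ground_distance(occur, t_i, t_j):
    """d_ij = 1 - P(T_j | T_i), following Eqs. (10) and (11)."""
    if t_i == t_j:
        return 0.0  # assumption: a word is at distance 0 from itself
    total = sum(count for (first, _), count in occur.items() if first == t_i)
    if total == 0:
        return 1.0  # T_i never co-occurs with any other word
    return 1.0 - occur[(t_i, t_j)] / total

# Toy tokenized sentences (hypothetical data).
sentences = [["oil", "price", "crude"], ["oil", "crude"], ["wheat", "price"]]
occur = cooccurrence_counts(sentences)
print(ground_distance(occur, "oil", "crude"))    # 1 - 2/3
print(ground_distance(occur, "wheat", "price"))  # 1 - 1/1
print(ground_distance(occur, "price", "wheat"))  # 1 - 1/3
```

The last two calls show that a distance defined this way is not symmetric in general: "wheat" only ever co-occurs with "price", whereas "price" co-occurs with several words.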

Since the EMD is a distance rather than a similarity measure, the similarity between documents is derived from the EMD. When the weights of the indexing words in each signature sum to 1, the following inequality holds.

$$ \mathrm{EMD}(D_i, D_j) \leq 1 \tag{12} $$

where $\sum_{k} w_{ik} = 1$ and $\sum_{k} w_{jk} = 1$. We define the similarity measure $sim_{EMD}(D_i, D_j)$ between $Doc_i$ and $Doc_j$ as

$$ sim_{EMD}(D_i, D_j) = 1 - \mathrm{EMD}(D_i, D_j) \tag{13} $$
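The bound in Equation (12) can be checked directly from the definitions. The short derivation below is a sketch that assumes every ground distance of Equation (11) lies between 0 and 1, which holds because it is one minus a probability. The indices $k$ and $l$ range over the indexing words of the two signatures.

```latex
0 \le d_{kl} \le 1, \quad f^*_{kl} \ge 0
\quad\Longrightarrow\quad
\mathrm{EMD}(D_i, D_j)
  = \frac{\sum_{k}\sum_{l} d_{kl}\, f^*_{kl}}{\sum_{k}\sum_{l} f^*_{kl}}
  \le \frac{\sum_{k}\sum_{l} f^*_{kl}}{\sum_{k}\sum_{l} f^*_{kl}}
  = 1
```

When both signatures have total weight 1, Equation (8) fixes the total flow at 1, so the EMD equals the optimal work itself. Together with the non-negativity of the distances, the bound places $sim_{EMD}$ of Equation (13) in the interval from 0 to 1.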

3.4 Text Categorization

The proposed system uses the architecture of the Expert Network (ExpNet) (Yang and Chute, 1994). The ExpNet consists of two steps: a similarity computing step and a category ranking step. In the similarity computing step, the ExpNet computes the cosine measure between an unlabeled document and each labeled document. It selects the K labeled documents with the highest cosine measure and uses them to decide a category for the unlabeled document. Our system uses the EMD-based similarity instead to select the K labeled documents.

In the category ranking step, the ExpNet uses the K labeled documents to decide a category for the unlabeled document. In the ExpNet, $P(c_k | d_j)$ is defined as the conditional probability of a category $c_k$ being related to a document $Doc_j$, as previously judged by humans. Given the labeled documents, the conditional probability is estimated as

$$ P(c_k | d_j) = \frac{\text{frequency of category } c_k \text{ for } d_j}{\text{frequency of } d_j \text{ in labeled documents}} \tag{14} $$

Table 1: Content of the Reuters-21578 documents used in the experiments.

             documents   unique words
Training       7,733        17,488
Test           3,008        10,731

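The relevance measure of each category, $rel(c_k | x)$, is defined as a weighted sum of the cosine similarities.

$$ rel_{cos}(c_k | x) = \sum_{j=1}^{K} P(c_k | d_j) \, sim_{cos}(x, d_j) \tag{15} $$

Here $x$ denotes the document vector of the unlabeled document, and the sum runs over the K labeled documents selected in the similarity computing step. Since our method uses the EMD, $rel(c_k | X)$ is defined as

$$ rel_{EMD}(c_k | X) = \sum_{j=1}^{K} P(c_k | d_j) \, sim_{EMD}(X, D_j) \tag{16} $$

where $X$ denotes the signature of the unlabeled document.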
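The category ranking step of Equations (15) and (16) can be sketched as follows. The function name `rank_categories`, the data layout, and the toy values are hypothetical. The sketch assumes that each labeled document occurs once in the training set, so that Equation (14) gives probability 1 for every category assigned to it. The similarities may come from either the cosine measure or the EMD-based measure.

```python
from collections import defaultdict

def rank_categories(similarities, labels, k):
    """Rank categories by rel(c | x): sum over the top-k labeled documents
    of P(c | d_j) * sim(x, d_j)."""
    # Indices of the k labeled documents most similar to the unlabeled one.
    top_k = sorted(range(len(similarities)),
                   key=lambda j: similarities[j], reverse=True)[:k]
    rel = defaultdict(float)
    for j in top_k:
        for category, prob in labels[j].items():
            rel[category] += prob * similarities[j]
    return sorted(rel.items(), key=lambda item: item[1], reverse=True)

# sim(x, d_j) for four labeled documents and their category probabilities P(c | d_j).
similarities = [0.9, 0.2, 0.7, 0.6]
labels = [{"grain": 1.0}, {"crude": 1.0}, {"crude": 1.0},
          {"crude": 1.0, "grain": 1.0}]
ranking = rank_categories(similarities, labels, k=3)
predicted = ranking[0][0]  # category with the maximum relevance
print(ranking)
```

With k = 3 the second document is ignored, and "grain" (0.9 + 0.6) outranks "crude" (0.7 + 0.6), so `predicted` is "grain".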

4 EXPERIMENTS

We carried out text categorization on the Reuters-21578 text categorization test collection, using its ModApte split. The ModApte split consists of 9,603 training documents and 3,299 test documents. In the experiment we used the 7,733 training documents and 3,008 test documents that contain more than one indexing word after removing the words in the SMART stop list and applying the Porter stemming algorithm (Porter, 1980). On average, a training document and a test document are each labeled with 1.2 categories. There are 115 different categories in the training documents and 93 different categories in the test documents, and 3 categories in the test documents do not exist in the training documents. Table 1 shows the content of the training documents and the test documents. Since 3,407 of the 10,731 unique words in the test documents were not included in the training documents, we used only the remaining 7,324 words to compute the similarity between a training document and a test document.

In the experiment, each test document was labeled with the single category $\hat{c}$ that has the maximum $rel_{cos}(c_k | x)$ or the maximum $rel_{EMD}(c_k | X)$, although test documents can be labeled with several categories. Hence, when the estimated category $\hat{c}$ is included in the correct categories of a test document, we consider that the text categorization algorithm labels the document correctly.

[Figure 1 plot: precision (vertical axis) against K from 0 to 50 (horizontal axis), with one curve each for "EMD" and "VSM".]

Figure 1: Precision rate on K=1, 10, 20, 30, 40, 50.

Table 2: The number of correct documents and error documents for VSM and EMD.

               VSM correct   VSM error
EMD correct       2,359         217
EMD error            94         338

To evaluate each method on this basis, we used the precision rate over the test documents.

$$ \mathrm{precision} = \frac{\text{the number of correctly labeled documents}}{\text{the number of all test documents}} \tag{17} $$

The precision rate depends on the value of K in Equation (15) and Equation (16). Figure 1 shows the precision rates for cosine similarity (VSM) and our proposed method (EMD) for K = 1, 10, 20, 30, 40, 50. Our proposed method, EMD, was superior to the conventional method, VSM, for every value of K. The precision rate was highest at K = 10 for both VSM and EMD: 81.6% for VSM and 85.6% for EMD. Hence, the difference between the precision rates is about 4.0 percentage points.

To discuss the text categorization ability in more detail, Table 2 shows the number of error documents and correct documents for VSM and EMD. The number of error documents for EMD (432) was smaller than that for VSM (555), which reflects the improvement in the precision rate.
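The counts in Table 2 can be tied back to Equation (17) with a few lines of arithmetic. The dictionary layout and variable names below are illustrative only; the four counts are the ones reported in Table 2.

```python
# Counts from Table 2; keys are (EMD outcome, VSM outcome).
table2 = {
    ("correct", "correct"): 2359,
    ("correct", "error"): 217,
    ("error", "correct"): 94,
    ("error", "error"): 338,
}

total = sum(table2.values())
emd_correct = sum(n for (emd_result, _), n in table2.items() if emd_result == "correct")
vsm_correct = sum(n for (_, vsm_result), n in table2.items() if vsm_result == "correct")

# Equation (17): correctly labeled documents divided by all test documents.
emd_precision = emd_correct / total
vsm_precision = vsm_correct / total
print(f"EMD: {emd_correct}/{total} = {emd_precision:.4f}")
print(f"VSM: {vsm_correct}/{total} = {vsm_precision:.4f}")
```

The four cells add up to the 3,008 test documents, and the resulting ratios agree with the reported precision rates to within about 0.1 percentage point.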

Figure 1 confirms that taking the dependency of indexing words into account can improve the precision rate. Table 2, however, suggests that our method can still be improved. Our method sometimes relates a word to too many words or to contextually unrelated words. Hence, our method could not correctly label 94 documents that VSM labeled correctly. Relating a word to appropriate words increases the similarity between documents that the VSM cannot evaluate; hence, 217 documents that VSM mislabeled are labeled correctly by our method. On the other hand, relating a word to too many words boosts the similarity between documents that are unrelated. This increases the number of error documents that do not occur with the VSM. We need to discuss how to define the distance between the indexing words other than by the conditional probability $P(T_j | T_i)$.

5 CONCLUSION

We proposed a text categorization method using the Earth Mover’s Distance as a similarity measure. Using the Earth Mover’s Distance, we computed the similarity between documents while taking the dependency of words into account. The distance between words is defined with the conditional probability that one word occurs with the other word in the same sentence. We confirmed on the Reuters-21578 text categorization test collection that the proposed method is superior to a conventional method using cosine similarity.

In future work, we will discuss how to define the distance between words other than by the conditional probability and improve our proposed method.

REFERENCES

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–312.

Mitchell, T. M. (1997). Machine Learning. McGraw Hill, New York, US.

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Rubner, Y., Tomasi, C., and Guibas, L. (2000). The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121.

Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Wan, X. and Peng, Y. (2005). The earth mover’s distance as a semantic measure for document similarity. In the 14th ACM International Conference on Information and Knowledge Management, pages 301–302. ACM Press.

Yang, Y. and Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3):252–277.
