HANDLING THE IMPACT OF LOW FREQUENCY EVENTS ON
CO-OCCURRENCE BASED MEASURES OF WORD SIMILARITY
A Case Study of Pointwise Mutual Information
François Role and Mohamed Nadif
LIPADE, Université Paris Descartes, 45, rue des Saints Pères, 75006 Paris, France
Keywords:
Text mining, Word similarity measures, Pointwise mutual information.
Abstract:
Statistical measures of word similarity are widely used in many areas of information retrieval and text mining.
Among popular word co-occurrence based measures is Pointwise Mutual Information (PMI). Although widely
used, PMI has a well-known tendency to give excessive relatedness scores to word pairs that involve low-
frequency words. Many variants have therefore been proposed that correct this bias empirically. In
contrast to this empirical approach, we propose formulae and indicators that describe the behavior of these
variants in a precise way so that researchers and practitioners can make a more informed decision as to which
measure to use in different scenarios.
1 INTRODUCTION
The ability to automatically discover semantically as-
sociated words is at the heart of many information
retrieval and extraction applications. For example,
when querying a search engine, many terms can be
used to express the same or very similar thing (e.g.
"city" and "town"). As a consequence, some relevant
documents may not be retrieved just because they do
not contain exactly the same words as those in the
query text. A common technique for overcoming this
lack of flexibility is to expand an initial query using
synonyms and related words. Being able to identify
semantically similar words can also be useful in clas-
sification applications to reduce feature dimension-
ality by grouping similar terms into a small number
of clusters. Many measures of word semantic simi-
larity rely on word co-occurrence statistics computed
from a large corpus of text: the assumption is that
words that frequently appear together in text are con-
ceptually related (Manning and Schutze, 1999; Lee,
1999; Thanopoulos et al., 2002; Evert, 2004; Pecina
and Schlesinger, 2006; Bullinaria and Levy, 2007;
Hoang et al., 2009). Among the most popular word
co-occurrence based measures is Pointwise Mutual
Information (PMI). For two words (or terms) a and
b, PMI is defined as:

PMI(a, b) = log [ p(a, b) / (p(a) p(b)) ]
p(a) (resp. p(b)) is the probability that word a
(resp. b) occurs in a text window of a given size
while p(a, b) denotes the probability that a and b co-
occur in the same window. PMI is thus the log of
the ratio of the observed co-occurrence frequency to
the frequency expected under independence. It measures the extent to which two words co-occur more often than would be expected by chance. The assumption is
that if two words co-occur more than expected un-
der independence there must be some kind of seman-
tic relationship between them. Initially used in lexi-
cography (Church and Hanks, 1990), PMI has since
found many applications in various areas of informa-
tion retrieval and text mining where there is a need
for measuring word semantic relatedness. In (Vech-
tomova and Robertson, 2000), the authors show how
to use PMI to combine corpus-derived information on word
collocations with the probabilistic model of informa-
tion retrieval. In the context of passage retrieval PMI
is used to expand the queries using lexical affinities
computed using statistics generated from a terabyte
corpus (Terra and Clarke, 2005). Other examples in-
clude the use of PMI in projects devoted to extracting
entities from search query logs, discovering indexing
phrases, predicting user click behavior to simulate hu-
man visual search behavior, etc. Last but not least, au-
tomatic synonym discovery is another area where PMI
has been extensively used (Turney, 2001; Terra and
Clarke, 2003).
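To make the definition concrete, here is a minimal sketch in Python of how PMI can be estimated from window co-occurrence counts; the corpus statistics used below are purely illustrative and not taken from any of the corpora discussed in this paper.

```python
import math

# Hypothetical corpus statistics (illustrative values only):
# total number of text windows, individual word counts, joint co-occurrence counts.
n_windows = 1_000_000
word_count = {"city": 4_000, "town": 2_500}
pair_count = {("city", "town"): 300}

def pmi(a, b):
    """Pointwise mutual information: log of observed vs. expected co-occurrence."""
    p_a = word_count[a] / n_windows
    p_b = word_count[b] / n_windows
    p_ab = pair_count[(a, b)] / n_windows
    return math.log2(p_ab / (p_a * p_b))

print(pmi("city", "town"))  # positive: the words co-occur more often than chance predicts
```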
However, a well-known problem with PMI is its
tendency to give very high association scores to pairs
involving low-frequency words, as the denominator
is small in such cases, while one generally prefers a
higher score for pairs of words whose relatedness is
supported by more evidence.
The impact of the bias is apparent in tasks directly
related to query transformation. For example, in
(Croft et al., 2010), while talking about query expan-
sion, the authors take a collection of TREC news as
example, and show that, according to PMI, the most
strongly associated words for "tropical" in this corpus are: "trmm", "itto", "ortuno", "kuroshio", "biofunction", etc., even though the collection contains words such as "forest", "tree", "rain", "island", etc. They then conclude that these low-frequency words "are unlikely to be much use for many queries".
Nonetheless, the simplicity of PMI compared to other approaches is still appealing, and several empirical variants have therefore been proposed to overcome this limitation. Since the product of the two marginal probabilities in the denominator favors pairs with low-frequency words, a common feature of these variants is to assign more weight to the joint probability p(a, b), either by raising it to some power k in the numerator (log [ p(a, b)^k / (p(a) p(b)) ]) or by using it to globally weight PMI, as in the case of the so-called "Expected Mutual Information" (p(a, b) log [ p(a, b) / (p(a) p(b)) ]). However, as pointed out by (Croft et al., 2010), the correction introduced may result in too general words being top-ranked.
Whether it is preferable to discover specialized
or general related terms depends on the context. In
any case, the point is that failing to precisely quantify
the impact of the bias and its possible corrections in-
evitably leads to empirical results that are very depen-
dent on the data. The aim of this paper is therefore to
propose precise indicators of sensitivity to frequency.
The plan of the paper is as follows. In section 2,
we review PMI and some common variants in order
to give insight into how each measure works, factors
influencing them, and their differences. We also pro-
pose formulae for assessing the impact of the correc-
tions brought by several widely used variants. Section
3 provides experimental validation of these formulae
and shows how simple visualizations can hint at the differences in behaviour to be expected when moving from one measure to another. We con-
clude by summarizing our contribution and indicating
directions for future research.
2 A FORMAL STUDY OF SOME IMPORTANT VARIANTS OF PMI
Although widely used, PMI¹ has two main limitations: first, it may take positive or negative values and lacks
fixed bounds, which complicates interpretation. Sec-
ondly, it has a well-known tendency to give higher
scores to low-frequency events. While this may be
seen as beneficial in some situations, one generally
prefers a higher score for pairs of words whose relat-
edness is supported by more evidence.
In order to overcome these limitations, several
variants of PMI have therefore been proposed over
the years. In contrast to more general relatedness
measures for which numerous comparative studies are
available (Pecina and Schlesinger, 2006; Hoang et al.,
2009; Thanopoulos et al., 2002; Lee, 1999; Petro-
vic et al., 2010; Evert, 2004), no systematic and for-
mal comparison specifically addressing these variants
seems to have been conducted so far.
Among the most widely used variants are those of the so-called PMI^k family (Daille, 1994). These variants consist in introducing one or more factors of p(a, b) inside the logarithm to empirically correct the bias of PMI towards low-frequency events. The commonly employed PMI^2 and PMI^3 measures are defined as follows:

PMI^2(a, b) = log [ p(a, b)^2 / (p(a) p(b)) ]

and

PMI^3(a, b) = log [ p(a, b)^3 / (p(a) p(b)) ].
Note that from the expression of PMI^2(a, b), a simple derivation shows that it is in fact equal to 2 log p(a, b) - (log p(a) + log p(b)), and thus to PMI(a, b) + log p(a, b). That is to say, the correction is obtained by adding to PMI a value that increases with p(a, b), namely log p(a, b), which will boost the scores of frequent pairs. However, for comparison purposes it may be more convenient to express PMI^2(a, b) as:

PMI^2(a, b) = PMI(a, b) - (-log p(a, b))   (1)
¹ PMI is not to be confused with the Mutual Information between two discrete random variables X and Y, denoted I(X;Y), which is the expected value of PMI:

I(X;Y) = Σ_{a,b} p(a, b) log [ p(a, b) / (p(a) p(b)) ] = Σ_{a,b} p(a, b) PMI(a, b).
The transformation may seem purely cosmetic, but it will allow us to better analyze the PMI variants in information-theoretic terms. Indeed, frequent word pairs convey less information (in the information-theoretic sense) than infrequent ones. Therefore, when writing

PMI(a, b) - (-log p(a, b)),

the reasoning is that the more frequent a pair, the less information we subtract from the PMI. Note that a generalization of (1) is easily obtained by writing:

PMI^k(a, b) = PMI(a, b) - (-(k - 1) log p(a, b)).
This expression of the PMI^k measures allows us to compare them in a precise way to the NPMI (Normalized PMI) recently proposed in (Bouma, 2009). The
main motivation for this new variant is to give PMI a fixed upper bound of 1 in the case of perfect dependence, that is to say when two words only occur together. In this case

PMI(a, b) = -log p(a) = -log p(b) = -log p(a, b).

One option to normalize PMI is then to divide it by -log p(a, b), which results in the following definition:

NPMI(a, b) = PMI(a, b) / (-log p(a, b))   (2)
Another common variant consists of removing the log that appears in the definition of PMI. We then obtain a new positive variant of PMI, noted PPMI and defined as:

PPMI(a, b) = 2^(PMI(a, b) - (-log p(a, b)))   (3)
We eventually end up with expressions of the vari-
ants which exhibit what they have in common.
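As an illustration, the following Python sketch (using base-2 logarithms, consistent with the power of 2 in equation (3), and purely illustrative probabilities) computes PMI, PMI^2, PMI^3, NPMI and PPMI, and checks the rewriting of PMI^2 given in (1).

```python
import math

def pmi(p_a, p_b, p_ab):
    return math.log2(p_ab / (p_a * p_b))

def pmi_k(p_a, p_b, p_ab, k):
    # PMI^k family: the joint probability is raised to the power k inside the log.
    return math.log2(p_ab ** k / (p_a * p_b))

def npmi(p_a, p_b, p_ab):
    # Normalized PMI: divide PMI by -log p(a, b), giving an upper bound of 1.
    return pmi(p_a, p_b, p_ab) / (-math.log2(p_ab))

def ppmi(p_a, p_b, p_ab):
    # The PPMI variant of equation (3): 2 to the power PMI^2(a, b).
    return 2 ** pmi_k(p_a, p_b, p_ab, 2)

# Illustrative probabilities (not taken from the paper's corpus).
p_a, p_b, p_ab = 1e-4, 2e-4, 5e-5
# Identity (1): PMI^2(a, b) = PMI(a, b) + log p(a, b).
assert abs(pmi_k(p_a, p_b, p_ab, 2) - (pmi(p_a, p_b, p_ab) + math.log2(p_ab))) < 1e-9
print(pmi(p_a, p_b, p_ab), pmi_k(p_a, p_b, p_ab, 2), pmi_k(p_a, p_b, p_ab, 3),
      npmi(p_a, p_b, p_ab), ppmi(p_a, p_b, p_ab))
```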
We conclude this review by comparing PMI^2, PPMI and NPMI, based on how their sensitivity to low frequencies is affected by the correction factors we have just described. In table 1, we first recap the values taken by the variants in three cases: complete dependence (both words only occur together), independence, and when p(a, b) = 0.
Table 1: Values taken by PMI, NPMI, PMI^2 and PPMI under complete dependence (Comp.), independence (Ind.), and when p(a, b) = 0 (Null).

Variant    Comp.           Ind.            Null
PMI        -log p(a, b)    0               -∞
NPMI       1               0               -1
PMI^2      0               log p(a, b)     -∞
PPMI       1               p(a, b)         0
Let us now assess to what extent the correction factors used by the three different variants influence the scores assigned to the pairs. In order to quantify this influence of frequencies, it is easy to prove the following proposition concerning PMI^2:

Proposition 1. Let k ≥ 0. If PMI(c, d) = PMI(a, b) + k, then
* PMI^2(c, d) = PMI^2(a, b) if log [ p(a, b) / p(c, d) ] = k,
* PMI^2(c, d) > PMI^2(a, b) if log [ p(a, b) / p(c, d) ] < k,
* PMI^2(c, d) < PMI^2(a, b) if log [ p(a, b) / p(c, d) ] > k.
Using the second implication, we can also show that PMI^2 increases with p(a, b). Indeed, in the case where k = 0, the implication says that if

log [ p(a, b) / p(c, d) ] < 0,

that is, if p(c, d) > p(a, b), then

PMI^2(c, d) > PMI^2(a, b).

In addition, since PPMI(a, b) = 2^(PMI^2(a, b)), the same implications can be used for PPMI.
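As a quick check, the following sketch (with arbitrary but feasible probabilities) verifies numerically that the sign of PMI^2(c, d) - PMI^2(a, b) is the sign of k - log [ p(a, b) / p(c, d) ], as stated in Proposition 1.

```python
import math

def pmi(p_a, p_b, p_ab):
    return math.log2(p_ab / (p_a * p_b))

def pmi2(p_a, p_b, p_ab):
    return math.log2(p_ab ** 2 / (p_a * p_b))

# Two illustrative pairs: (a, b) is frequent, (c, d) is rare but has the larger PMI.
pa, pb, pab = 2e-4, 3e-4, 1e-4
pc, pd, pcd = 1e-5, 2e-5, 8e-6

k = pmi(pc, pd, pcd) - pmi(pa, pb, pab)   # k >= 0, as assumed in Proposition 1
ratio = math.log2(pab / pcd)              # log [ p(a, b) / p(c, d) ]
diff = pmi2(pc, pd, pcd) - pmi2(pa, pb, pab)

print(k, ratio, diff)                     # here ratio < k, so diff is positive
```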
In the same manner, the following proposition quantifies the influence of the frequencies for the NPMI measure.
Proposition 2. Let PMI(z, t) = k PMI(x, y). We have:
* NPMI(z, t) = NPMI(x, y)
  - if log p(z, t) = k log p(x, y).
* NPMI(z, t) > NPMI(x, y)
  - if (k > 0, PMI(z, t) > 0 and log p(z, t) / log p(x, y) < k),
  - or (k > 0, PMI(z, t) < 0 and log p(z, t) / log p(x, y) > k),
  - or (k < 0 and PMI(z, t) > 0).
* NPMI(z, t) < NPMI(x, y)
  - if (k > 0, PMI(z, t) > 0 and log p(z, t) / log p(x, y) > k),
  - or (k > 0, PMI(z, t) < 0 and log p(z, t) / log p(x, y) < k),
  - or (k < 0 and PMI(z, t) < 0).
3 NUMERICAL EXPERIMENTS
The k thresholds derived in the previous section allow
us to predict the consequences of switching from one
measure to another. Take for example the following
pairs (ba=basidiomycota, champ=champignon) and
(cha=chanson, si=single)² extracted from the French
Wikipedia corpus described and used later.
The marginal and joint probabilities, and the com-
puted PMI and PMI^2 measures of each pair are re-
ported in table 2. As we have
PMI(ba, champ) - PMI(cha, si) = 3.05 = k

and

log [ p(cha, si) / p(ba, champ) ] = 3.16 > 3.05,

we observe that

PMI^2(cha, si) > PMI^2(ba, champ),

despite

PMI(cha, si) < PMI(ba, champ).
Table 2: Marginal and joint probabilities, PMI and PMI^2 for the two pairs.

              (basidiomycota,    (chanson,
               champignon)        single)
p(a, b)       3.70 × 10^-5       3.31 × 10^-4
p(a)          3.95 × 10^-5       5.68 × 10^-4
p(b)          7.21 × 10^-5       3.72 × 10^-4
PMI(a, b)     13.66              10.61
PMI^2(a, b)   -1.05              -0.94
Given that (chanson, single) is a much more frequent pair than (basidiomycota, champignon), PMI^2 was able to correct the bias of PMI so as to assign a higher score to the more frequent pair.

Continuing with the previous example, this time using PMI and NPMI, one can show that in this case k = 1.28 and log p(basidiomycota, champignon) > 1.28 × log p(single, chanson); since both logarithms are negative, the ratio log p(basidiomycota, champignon) / log p(single, chanson) is smaller than k. This leads to NPMI(basidiomycota, champignon) = 0.92 > NPMI(single, chanson) = 0.91. The more frequent pair is ranked after the less frequent one. In contrast to PMI^2, NPMI was not able to correct the bias of PMI.
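These values can be recomputed directly from the probabilities reported in table 2, as in the short sketch below (base-2 logarithms); the small discrepancies with the figures quoted in the text come from the probabilities themselves being rounded.

```python
import math

def pmi(p_a, p_b, p_ab):
    return math.log2(p_ab / (p_a * p_b))

def pmi2(p_a, p_b, p_ab):
    return math.log2(p_ab ** 2 / (p_a * p_b))

def npmi(p_a, p_b, p_ab):
    return pmi(p_a, p_b, p_ab) / (-math.log2(p_ab))

# Marginal and joint probabilities from table 2: (p(a), p(b), p(a, b)).
pairs = {
    "(basidiomycota, champignon)": (3.95e-5, 7.21e-5, 3.70e-5),
    "(chanson, single)":           (5.68e-4, 3.72e-4, 3.31e-4),
}
for name, (p_a, p_b, p_ab) in pairs.items():
    print(name, round(pmi(p_a, p_b, p_ab), 2),
          round(pmi2(p_a, p_b, p_ab), 2), round(npmi(p_a, p_b, p_ab), 2))
```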
To stress how dramatic an impact this divergence
may have on the ranking of related terms, table 3
shows the most strongly associated words (obtained
using a large set of Wikipedia titles and categories we
used as a corpus) for ’football’ according to PMI and
PMI^3, which is a commonly used variant of PMI. As can be seen, we end up with two very different lists. PMI tends to rank domain-specific words at the top of the list whereas the top-ranked terms for PMI^3 are much more general.
² Translations: basidiomycota = basidiomycota; champignon = mushroom; chanson = song; single = single.
Table 3: Most strongly associated words for 'football' ranked according to the scores assigned by PMI (left) and PMI^3 (right). PMI considers the pair ("football", "midfielder") to be the most strongly connected pair whereas PMI^3 ranks the pair ("football", "league") first.

PMI                     PMI^3
midfielder    5.581     league      -19.840
midfielders   5.575     clubs       -20.667
cornerbacks   5.545     england     -20.915
goalkeepers   5.543     players     -21.326
safeties      5.530     season      -21.677
goalkeeper    5.529     team        -21.922
linebackers   5.475     college     -22.043
striker       5.475     club        -22.244
defenders     5.408     national    -22.891
defender      5.270     managers    -23.142
quarterbacks  5.262     cup         -23.498
This suggests a more global way of looking at the levels of correction brought about by the different variants for a particular corpus. First, build a large set of word pairs. For each association measure, compute the score for every pair and rank the pairs by their scores. Then plot the rankings against each other so as to see whether the measures agree on which pairs contain the most strongly associated words. In order to obtain a sufficiently large set of word pairs, we parsed the XML dump of the French Wikipedia dated September 14, 2009 (5 GB of text). After filtering, we kept about one million articles that had at least one Wikipedia category associated with them³. We then parsed the articles, and for each word appearing in a category name, we recorded its co-occurrences with all the other words also appearing in a category name. After removing French grammatical stop words as well as words occurring in fewer than 100 documents, we eventually ended up with a set of 70 922 word pairs. We then ranked these pairs according to the different measures and plotted the resulting ranks against each other.
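A minimal sketch of this rank-comparison procedure is given below, assuming a dictionary of word and pair counts has already been extracted from the corpus; the function and variable names, the toy counts and the window count are illustrative only.

```python
import math
from collections import Counter

def rank_by_measure(pair_counts, word_counts, n_windows):
    """Rank all word pairs under several PMI variants; returns one rank dict per measure."""
    def scores(p_a, p_b, p_ab):
        pmi = math.log2(p_ab / (p_a * p_b))
        return {
            "PMI": pmi,
            "PMI2": pmi + math.log2(p_ab),        # identity (1)
            "PMI3": pmi + 2 * math.log2(p_ab),
            "NPMI": pmi / (-math.log2(p_ab)),
        }

    table = {}
    for (a, b), n_ab in pair_counts.items():
        table[(a, b)] = scores(word_counts[a] / n_windows,
                               word_counts[b] / n_windows,
                               n_ab / n_windows)

    ranks = {}
    for measure in ("PMI", "PMI2", "PMI3", "NPMI"):
        ordered = sorted(table, key=lambda pair: table[pair][measure], reverse=True)
        ranks[measure] = {pair: r for r, pair in enumerate(ordered, start=1)}
    return ranks

# Toy usage with made-up counts; on real data, the rank vectors can be plotted
# against each other (or compared with a rank correlation) as in the figures.
word_counts = Counter({"football": 900, "league": 700, "midfielder": 60})
pair_counts = Counter({("football", "league"): 500, ("football", "midfielder"): 55})
print(rank_by_measure(pair_counts, word_counts, n_windows=100_000))
```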
The scatter plot in figure 1 shows that, compared to PMI, NPMI only slightly reduces the bias towards low-frequency pairs. The graph follows a rough diagonal. In contrast, figure 2 shows that moving from NPMI to PMI^3 has a dramatic effect as far as the assigned ranks are concerned. By looking at this kind of graph, one can get visual insight into the global effects of switching from one variant to another.
To get a more precise view, it is useful to focus
on the most frequent pairs. The plots in figure 3 and
figure 4 compare the ranks assigned by NPMI, PMI^2 and PMI^3 to the 500 most frequent pairs in the corpus.
³ Categories are short phrases that represent the main topics of an article. Usually several categories are assigned to a given article.
From figure 3, it is clear that these pairs are ranked among the first by PMI^3 while the ranks assigned by NPMI are more variable. This confirms figure 2 and shows that using PMI^3 to find words related to a given word would certainly result in more general terms than using NPMI.
Figure 1: Scatter plot of the ranks assigned by PMI (vertical
axis) and NPMI (horizontal axis). The plot reveals a near-
linear relationship. The correction brought about by NPMI
is not very significant for this corpus.
Figure 2: Scatter plot of the ranks assigned by NPMI (ver-
tical axis) and PMI^3 (horizontal axis).
Figure 3: The 500 most frequent pairs as ranked by NPMI and PMI^3.

Figure 4: The 500 most frequent pairs as ranked by PMI^3 and PMI^2. The most frequent pairs are clearly ranked first by both measures.

4 CONCLUSIONS

We have proposed formulae that help better understand and predict the behaviour of some important word association measures based on PMI, especially concerning the way these measures behave with respect to frequency. More specifically, to the best of our knowledge, it is the first time that the notorious bias of PMI towards low-frequency events and the ways to correct it have been examined in detail. We also feel that this study will help fill a gap in the literature by providing a comprehensive formalization and comparison of several important variants of PMI. In future work, we plan to use rank correlation methods such as Spearman's rho and Kendall's tau to obtain additional indicators of differences in sensitivity to frequency. Additional visualization methods will also be investigated in order to determine how best to help researchers and corpus practitioners choose the right PMI-based association measure given the corpus and the task at hand.
REFERENCES
Bullinaria, J. and Levy, J. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.
Church, K. W. and Hanks, P. (1990). Word association
norms, mutual information, and lexicography. Com-
putational Linguistics, 16(1):22–29.
Croft, W. B., Metzler, D., and Strohman, T. (2010). Search
engines: information retrieval in practice. Pearson -
Addison Wesley.
Daille, B. (1994). Approche mixte pour l’extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
Evert, S. (2004). The statistics of word cooccurrences: word pairs and collocations. Unpublished doctoral dissertation, Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.
Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. In Proceedings of the Biennial GSCL Conference, pages 31–40, Gunter Narr Verlag. Chiarcos, Eckart de Castilho & Stede (eds).
Hoang, H. H., Kim, S. N., and Kan, M.-Y. (2009). A re-
examination of lexical association measures. In MWE
’09: Proceedings of the Workshop on Multiword Ex-
pressions, pages 31–39, Morristown, NJ, USA. Asso-
ciation for Computational Linguistics.
Lee, L. (1999). Measures of distributional similarity. In
37th Annual Meeting of the Association for Computa-
tional Linguistics, pages 25–32.
Manning, C. D. and Schutze, H. (1999). Foundations of
Statistical Natural Language Processing. The MIT
Press.
Pecina, P. and Schlesinger, P. (2006). Combining associa-
tion measures for collocation extraction. In Proceed-
ings of the COLING/ACL on Main conference poster
sessions, pages 651–658, Morristown, NJ, USA. As-
sociation for Computational Linguistics.
Petrovic, S., Snajder, J., and Basic, B. D. (2010). Extend-
ing lexical association measures for collocation ex-
traction. Computer Speech & Language, 24(2):383–
394.
Terra, E. and Clarke, C. L. A. (2003). Frequency estimates
for statistical word similarity measures. In NAACL
’03: Proceedings of the 2003 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics on Human Language Technology,
pages 165–172, Morristown, NJ, USA. Association
for Computational Linguistics.
Terra, E. and Clarke, C. L. A. (2005). Comparing query
formulation and lexical affinity replacements in pas-
sage retrieval. In SIGIR Workshop on Methodolo-
gies and Evaluation of Lexical Cohesion Techniques
in Real-world Applications. Salvador, Brazil, August
2005, pages 11–18. ACM Press.
Thanopoulos, A., Fakotakis, N., and Kokkinakis, G. (2002).
Comparative evaluation of collocation extraction met-
rics. In Proceedings of the LREC 2002 Conference,
pages 609–613.
Turney, P. D. (2001). Mining the Web for synonyms: PMI-
IR versus LSA on TOEFL. In ECML-01, pages 491–
502, Freiburg, Germany.
Vechtomova, O. and Robertson, S. (2000). Integration
of collocation statistics into the probabilistic retrieval
model. In Proceedings of the 22nd British Computer
Society - Information Retrieval Specialist Group Con-
ference 2000, pages 165–177.