tendency to give very high association scores to pairs
involving low-frequency words, as the denominator
is small in such cases, while one generally prefers a
higher score for pairs of words whose relatedness is
supported by more evidence.
The impact of the bias is apparent in tasks directly related to query transformation. For example, in (Croft et al., 2010), while discussing query expansion, the authors take a collection of TREC news stories as an example and show that, according to PMI, the words most strongly associated with "tropical" in this corpus are "trmm", "itto", "ortuno", "kuroshio", "biofunction", etc., even though the collection contains words such as "forest", "tree", "rain", "island", etc. They conclude that these low-frequency words "are unlikely to be much use for many queries".
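The effect is easy to reproduce. As a minimal sketch with invented co-occurrence counts (not the paper's TREC data), a rare word that co-occurs only with the target word can outscore a frequent, well-attested neighbour:

```python
import math

N = 100_000                      # invented total number of observed word pairs
count_tropical = 200             # invented overall count of the target word
# (co-occurrence count with the target, overall count of the other word)
counts = {
    "forest":   (60, 400),       # frequent word, strong association
    "rainwear": (2, 2),          # rare word, seen only with the target
}

def pmi(pair_count, count_a, count_b, n=N):
    """PMI(a, b) = log p(a, b) / (p(a) p(b)), estimated from raw counts."""
    p_ab = pair_count / n
    return math.log(p_ab / ((count_a / n) * (count_b / n)))

for word, (pair_c, word_c) in counts.items():
    # rainwear outscores forest despite only 2 co-occurrences
    print(word, round(pmi(pair_c, count_tropical, word_c), 2))
```

The small denominator p(a)p(b) of the rare pair dominates, exactly as described above.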
Nonetheless, the basic straightforwardness of PMI over other approaches is still appealing, and several empirical variants have therefore been proposed to overcome this limitation. Since the product of the two marginal probabilities in the denominator favors pairs with low-frequency words, a common feature of these variants is to assign more weight to the joint probability p(a, b), either by raising it to some power k in the numerator (log [p(a, b)^k / (p(a)p(b))]) or by using it to globally weight PMI, as in the case of the so-called "Expected Mutual Information" (p(a, b) log [p(a, b) / (p(a)p(b))]). However, as pointed out by (Croft et al., 2010), the correction introduced may result in too-general words being top-ranked.
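Both corrections can be sketched side by side on invented probabilities (chosen purely for illustration) to show how they re-rank a rare pair against a frequent one:

```python
import math

def pmi_k(p_ab, p_a, p_b, k=1):
    """log p(a, b)^k / (p(a) p(b)); k = 1 is plain PMI."""
    return math.log(p_ab**k / (p_a * p_b))

def emi(p_ab, p_a, p_b):
    """'Expected Mutual Information' weighting: p(a, b) * PMI(a, b)."""
    return p_ab * pmi_k(p_ab, p_a, p_b)

# Invented probabilities: a frequent pair and a rare, perfectly co-occurring pair.
pairs = {
    "frequent": (6e-4, 2e-3, 4e-3),
    "rare":     (2e-5, 2e-3, 2e-5),
}
for name, (p_ab, p_a, p_b) in pairs.items():
    print(name,
          round(pmi_k(p_ab, p_a, p_b), 2),        # plain PMI prefers the rare pair
          round(pmi_k(p_ab, p_a, p_b, k=2), 2),   # k = 2 flips the ranking
          round(emi(p_ab, p_a, p_b), 5))          # EMI also favors the frequent pair
```

Under plain PMI the rare pair wins; under both corrections the frequent pair is ranked first, illustrating the extra weight given to p(a, b).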
Whether it is preferable to discover specialized or general related terms depends on the context. In any case, the point is that failing to precisely quantify the impact of the bias and of its possible corrections inevitably leads to empirical results that are highly dependent on the data. The aim of this paper is therefore to propose precise indicators of sensitivity to frequency.
The plan of the paper is as follows. In Section 2, we review PMI and some common variants in order to give insight into how each measure works, the factors influencing it, and their differences. We also propose formulae for assessing the impact of the corrections introduced by several widely used variants. Section 3 provides experimental validation of these formulae and investigates how to give simple visual hints of the differences in behaviour to be expected when migrating from one measure to another. We conclude by summarizing our contribution and indicating directions for future research.
2 A FORMAL STUDY OF SOME IMPORTANT VARIANTS OF PMI
Although widely used, PMI¹ has two main limitations: first, it may take positive or negative values and lacks fixed bounds, which complicates interpretation. Secondly, it has a well-known tendency to give higher scores to low-frequency events. While this may be seen as beneficial in some situations, one generally prefers a higher score for pairs of words whose relatedness is supported by more evidence.
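Both limitations are visible on trivial invented probabilities: PMI is exactly 0 for independent words, and for a perfectly associated pair it equals −log p(a, b), which grows without bound as the pair gets rarer:

```python
import math

def pmi(p_ab, p_a, p_b):
    """PMI(a, b) = log p(a, b) / (p(a) p(b))."""
    return math.log(p_ab / (p_a * p_b))

# Independent words: p(a, b) = p(a) p(b), so PMI = 0.
print(pmi(0.01 * 0.02, 0.01, 0.02))          # → 0.0

# Perfect association: p(a, b) = p(a) = p(b) = p gives PMI = -log p,
# which is unbounded above as p -> 0.
for p in (1e-2, 1e-4, 1e-6):
    print(round(pmi(p, p, p), 2))            # → 4.61, 9.21, 13.82
```

The maximal attainable score is thus driven entirely by rarity, which is precisely the bias discussed above.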
To overcome these limitations, several variants of PMI have been proposed over the years. In contrast to more general relatedness measures, for which numerous comparative studies are available (Pecina and Schlesinger, 2006; Hoang et al., 2009; Thanopoulos et al., 2002; Lee, 1999; Petrovic et al., 2010; Evert, 2004), no systematic and formal comparison specifically addressing these variants seems to have been conducted so far.
Among the most widely used variants are those of the so-called PMI^k family (Daille, 1994). These variants consist in introducing one or more factors of p(a, b) inside the logarithm to empirically correct the bias of PMI towards low-frequency events. The commonly employed PMI^2 and PMI^3 measures are defined as follows:

PMI^2(a, b) = log [p(a, b)^2 / (p(a)p(b))]

and

PMI^3(a, b) = log [p(a, b)^3 / (p(a)p(b))].
Note that from the expression of PMI^2(a, b), a simple derivation shows that it is in fact equal to 2 log p(a, b) − (log p(a) + log p(b)), and thus to PMI(a, b) + log p(a, b). That is to say, the correction is obtained by adding to PMI a value that increases with p(a, b), namely log p(a, b), which will boost the scores of frequent pairs. However, for comparison purposes it may be more convenient to express PMI^2(a, b) as:

PMI^2(a, b) = PMI(a, b) − (−log p(a, b))    (1)
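The identity underlying equation (1) can be checked numerically; the probabilities below are invented for illustration only:

```python
import math

def pmi(p_ab, p_a, p_b):
    """PMI(a, b) = log p(a, b) / (p(a) p(b))."""
    return math.log(p_ab / (p_a * p_b))

def pmi2(p_ab, p_a, p_b):
    """PMI^2(a, b) = log p(a, b)^2 / (p(a) p(b))."""
    return math.log(p_ab**2 / (p_a * p_b))

p_ab, p_a, p_b = 5e-4, 2e-3, 3e-3            # invented probabilities

# PMI^2 = PMI + log p(a, b) = PMI - (-log p(a, b))
print(math.isclose(pmi2(p_ab, p_a, p_b),
                   pmi(p_ab, p_a, p_b) + math.log(p_ab)))   # → True
```

Since −log p(a, b) is always positive, the subtracted term in (1) is a penalty that shrinks as the pair becomes more frequent.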
1 PMI is not to be confused with the Mutual Information between two discrete random variables X and Y, denoted I(X;Y), which is the expected value of PMI:

I(X;Y) = Σ_{a,b} p(a, b) log [p(a, b) / (p(a)p(b))] = Σ_{a,b} p(a, b) PMI(a, b).
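The footnote's relation is straightforward to verify on a small invented joint distribution, computing I(X;Y) as the expectation of PMI under p(a, b):

```python
import math

# Invented 2x2 joint distribution over (X, Y); marginals derived from it.
joint = {("a1", "b1"): 0.4, ("a1", "b2"): 0.1,
         ("a2", "b1"): 0.2, ("a2", "b2"): 0.3}
p_x = {"a1": 0.5, "a2": 0.5}
p_y = {"b1": 0.6, "b2": 0.4}

def pmi(a, b):
    """PMI(a, b) = log p(a, b) / (p(a) p(b))."""
    return math.log(joint[(a, b)] / (p_x[a] * p_y[b]))

# I(X;Y) = sum over (a, b) of p(a, b) * PMI(a, b).
mi = sum(p * pmi(a, b) for (a, b), p in joint.items())
print(round(mi, 4))              # → 0.0863
```

Individual PMI values here are both positive and negative, but their expectation I(X;Y) is, as always, non-negative.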
HANDLING THE IMPACT OF LOW FREQUENCY EVENTS ON CO-OCCURRENCE BASED MEASURES OF WORD SIMILARITY - A Case Study of Pointwise Mutual Information