would result in thousands of phrases representing the
same knowledge as the one maximal sequence.
4 RELATED WORK ON COLLOCATION ACQUISITION
The initial work on collocation extraction is that
of Choueka et al. (1983). Their definition of a collocation
was “a sequence of adjacent words that frequently
appear together”. The sequences were theoretically
of any length, but were limited to size 6 in practice,
due to the cost of repeated frequency counting. The
technique was tested on an 11-million-word corpus
from the New York Times archive and found thousands
of common expressions such as “home run”, “fried
chicken”, “Magic Johnson”, etc. Beyond the
limited size of the sequences, one can also regret the
impossibility of extracting discontinuous sequences
such as “knock ... door”, due to the adjacency requirement
in the definition. Finally, selection or rejection is
simply based on a frequency threshold, which makes
the result depend on the size of the corpus.
Church and Hanks (1990) described a collocation
as a pair of correlated words, that is, a pair
of words that occur together more often than would be
expected by chance. The technique is based on the notion
of mutual information, as defined in Information Theory
(Shannon, 1948; Fano, 1961). This set of techniques
makes it possible to retrieve interrupted sequences of
words as well as continuous ones. Unfortunately, the
set of candidate sequences is restricted to pairs of words.
In other words, we can only acquire collocations of
size 2, whereas Choueka’s technique handled sequences
of up to 6 words.
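As an illustration, the association measure of Church and Hanks can be sketched as follows. This is a minimal reconstruction of pointwise mutual information from raw co-occurrence counts, not their original implementation; the function name, the window size, and the probability estimates are simplifying assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, window=5):
    """Pointwise mutual information I(x, y) = log2(P(x, y) / (P(x) P(y)))
    for ordered word pairs co-occurring within a fixed-size window."""
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        # Pair each word with the next `window` words after it.
        for v in tokens[i + 1:i + 1 + window]:
            pairs[(w, v)] += 1
    scores = {}
    for (x, y), c in pairs.items():
        p_xy = c / n              # rough joint probability estimate
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

A pair such as (“fried”, “chicken”) scores high because its joint probability far exceeds the product of the individual word probabilities.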
Smadja proposed a more advanced technique,
built on Choueka’s. It resulted in Xtract (Smadja,
1993), a tool combining a frequency-based metric and
several filters based on linguistic properties. The metric
used by Smadja was the z-score. The z-score of a
pair is calculated by computing the average frequency
of the words occurring within a radius of 5 words
around a given word (either forward or backward), and then
determining, for each word pair, how many standard
deviations its frequency lies above that average. Pairs with
a z-score under a certain threshold were pruned away.
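A minimal sketch of such a filter could look as follows; it reproduces only the frequency part of Xtract’s first stage (Smadja’s statistics also take the distribution of relative positions into account, which is omitted here), and the function name and threshold value are assumptions.

```python
import statistics
from collections import Counter

def zscore_collocates(tokens, target, radius=5, threshold=1.0):
    """Count how often each word occurs within +/- radius positions of
    `target`, then keep the words whose count lies more than `threshold`
    standard deviations above the average count over all collocates."""
    counts = Counter()
    for i, w in enumerate(tokens):
        if w != target:
            continue
        lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    if len(counts) < 2:
        return {}
    mean = statistics.mean(counts.values())
    sd = statistics.pstdev(counts.values())
    if sd == 0:
        return {}
    # z-score of each collocate; prune those under the threshold
    return {w: (c - mean) / sd for w, c in counts.items()
            if (c - mean) / sd > threshold}
```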
Then, linguistic filters were applied to get rid of those
pairs that are not true lexical collocates. For example,
for the same noun-verb pair, the technique differentiates
between the cases where the noun is the subject or the
object of the verb. Semantically related pairs (such as
doctors-hospitals) were also removed. After the identification
of these word pairs, the collocation set was
recursively extended to longer phrases, by searching
for the words that co-occurred significantly with
an already identified collocation. A lexicographer
was asked to evaluate Xtract’s results. After the
full processing, including the statistical stages and the
linguistic filtering, 80% of the phrases were evaluated as
good collocations. The score was only 40% before the
syntactic filtering, illustrating the primary importance
of combining statistical and syntactic information
in order to find accurate lexical collocates.
Of course, our technique is not as strict as
Smadja’s regarding the definition of a collocation,
and most of his linguistic filtering can be regarded
as unnecessary for our purpose. Indeed, we are not
fundamentally aiming at the discovery of collocations
from a document collection, but rather at using
collocation-based techniques to estimate the value of
document descriptors. As a matter of fact, and as
mentioned earlier, we will stick to Benson’s
definition of a collocation, which is probably the most
appropriate for statistical techniques: an arbitrary and
recurrent word combination. Based on this approach,
we will now compare maximal frequent sequences to
other types of descriptors.
4.1 Specificities of MFSs as Collocations
Among the most satisfactory aspects of MFS extraction
is the possibility of discovering phrases of any
size. From this point of view, it improves on both
Choueka’s and Church’s approaches. Another clear
strength, as opposed to Choueka et al., is the ability to
compose phrases from non-adjacent words. This is due to two
factors. First, the use of a gap, i.e., the maximal number
of words allowed between two words for them still
to be considered a pair. Second, the use of a list of
stop words, which prunes away most of the less informative
words. The negative aspect of this stop word
filtering is that most of the collocations following the
verb+adverb pattern (e.g., “take ... off”, “turn ... on”)
will be missed. A solution would be part-of-speech-based
preprocessing, so as to make sure we
keep the adverbs corresponding to these possibly relevant
phrases (and only those). Our technique also has
the advantage over Smadja’s that it does
not require the computationally heavy combination of
frequency and distance. Indeed, using windows of radius
5 implies, for each word of the corpus, forming
10 word pairs and calculating their frequencies.
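To make the role of the gap concrete, here is a sketch of how the candidate word pairs underlying MFS extraction can be collected; the stop word list, the gap value, and the fact that the gap is measured over the already pruned word sequence are all illustrative assumptions, and the level-wise growth of pairs into longer maximal sequences is omitted.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on"}  # illustrative list

def gapped_pairs(sentences, gap=2):
    """Count ordered word pairs separated by at most `gap` remaining words,
    after pruning stop words; these 2-sequences are the seeds from which
    longer maximal frequent sequences would be grown."""
    pairs = Counter()
    for sent in sentences:
        words = [w for w in sent if w.lower() not in STOP_WORDS]
        for i, w in enumerate(words):
            for v in words[i + 1:i + 2 + gap]:
                pairs[(w, v)] += 1
    return pairs

# "knock ... door" is recovered despite the intervening stop words.
print(gapped_pairs([["knock", "on", "the", "door"]]))
# Counter({('knock', 'door'): 1})
```

Note that, unlike the radius-5 windows discussed above, each word here forms at most gap + 1 forward pairs, which keeps the candidate set small.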
Another difference from Smadja’s Xtract is that
our technique does not merge a pair and its inverted
form into one and the same phrase. We consider sequences
rather than phrases. For example, noun-verb
and verb-noun are different in our view, whereas in
Smadja’s approach, they are first gathered together and then
possibly pruned by the z-score threshold. Given a pair,