Measuring term dependency through phrases or n-grams captures dependency only between adjacent terms. However, genuine term dependencies do not exist only between adjacent words. They may also occur between more distant words, such as between “powerful” and “computers” in powerful multiprocessor computers. This work aims to capture term dependencies between adjacent as well as distant terms. Proximity can be viewed as an indirect measure of dependence between terms. (Beeferman et al., 1997) shows that dependencies between terms are strongly influenced by the proximity between them. The intuition is that if two words occur with some proximity in one document and with similar proximity in another document, then a combined feature of these two words, when added to the original document vectors, should contribute to the similarity between these two documents.
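The proximity intuition above can be made concrete with a small sketch. The 1/distance weighting and the helper names here are illustrative assumptions, not the weighting scheme used in this paper:

```python
def min_pair_distance(tokens, w1, w2):
    # Smallest positional gap between any occurrence of w1 and any of w2.
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    if not pos1 or not pos2:
        return None  # at least one of the terms is absent
    return min(abs(i - j) for i in pos1 for j in pos2)

def proximity_weight(tokens, w1, w2):
    # Illustrative weight: decays with distance; adjacent terms score 1.0.
    d = min_pair_distance(tokens, w1, w2)
    return 0.0 if d is None else 1.0 / d

doc = "powerful multiprocessor computers".split()
print(proximity_weight(doc, "powerful", "computers"))  # distance 2 -> 0.5
```

Under such a weighting, the pair (“powerful”, “computers”) receives a non-trivial weight even though the two words are not adjacent, which is exactly the case that phrase- and n-gram-based features miss.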
We also suggest a feature generation process to limit the number of word pairs considered for inclusion as features. Cosine similarity is then used to measure the similarity between two document vectors, and finally the Group Hierarchical Agglomerative Clustering (GHAC) algorithm is used to cluster the documents.
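As a rough sketch of the pipeline just described, document vectors can be extended with proximity-weighted term-pair features and then compared with cosine similarity. The window size, the 1/distance weighting, and all function names below are assumptions made for illustration; the paper's actual weighting scheme and the GHAC step are not reproduced here:

```python
import math
from collections import Counter
from itertools import combinations

def term_pair_vector(tokens, window=5):
    # Single-term counts plus proximity-weighted term-pair features.
    vec = Counter(tokens)
    positions = {}
    for i, t in enumerate(tokens):
        positions.setdefault(t, []).append(i)
    for w1, w2 in combinations(sorted(positions), 2):
        d = min(abs(i - j) for i in positions[w1] for j in positions[w2])
        if d <= window:
            vec[(w1, w2)] = 1.0 / d  # closer pairs get larger weights
    return vec

def cosine(u, v):
    # Standard cosine similarity over sparse dict vectors.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

d1 = term_pair_vector("powerful multiprocessor computers".split())
d2 = term_pair_vector("powerful parallel computers".split())
print(cosine(d1, d2))
```

A term pair shared with similar proximity in both documents, such as (“computers”, “powerful”) above, contributes to the dot product and hence to the measured similarity, which is the intuition stated earlier.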
To the best of our knowledge, no work has been done so far to utilize term proximity between distant terms for improved clustering of text documents. The contribution of this paper is twofold. First, we introduce a new kind of feature called the Term-Pair feature. A Term-Pair feature consists of a pair of terms, which may be adjacent to each other or distant, and is weighted on the basis of a term proximity measure between the two terms. With the help of different weights, we show how clustering is improved in a simple yet effective manner. Second, we discuss how, from the large number of such possible features, only the most important ones are selected and the remaining ones are discarded.
The rest of the paper is organized as follows. Section 2 briefly describes the related work. Section 3 explains the notion of term proximity in a text document. Section 4 describes our approach to the calculation of similarity between two documents. Sections 5 and 6 describe the experimental results and the conclusion, respectively.
2 RELATED WORK
Many vector space model based document clustering approaches make use of single-term analysis only. To improve document clustering beyond treating a document as a bag of words, including term dependency in the calculation of document similarity has gained attention. Most of the work dealing with term dependency or proximity in text document clustering uses phrases (Hammouda and Kamel, 2004), (Chim and Deng, 2007), (Zamir and Etzioni, 1999) or n-gram models. (Hammouda
and Kamel, 2004) does so by introducing a new document representation model called the Document Index Graph, while (Chim and Deng, 2007) and (Zamir and Etzioni, 1999) do so with the use of suffix trees. In (Bekkerman and Allan, 2003), the authors discuss the use of bi-grams to improve text classification. (Ahlgren and Colliander, 2009) and (Andrews and Fox, 2007) analyze existing approaches for calculating inter-document similarity. In all of the above-mentioned clustering techniques, semantic association between distant terms has either been ignored or is limited to adjacent words or sequences of adjacent words.
Most of the existing information retrieval models are primarily based on various term statistics. In traditional models - from classic probabilistic models (Croft and Harper, 1997), (Fuhr, 1992) through vector space models (Salton et al., 1975) to statistical language models (Lafferty and Zhai, 2001), (Ponte and Croft, 1998) - these term statistics have been captured directly in the ranking formula. The idea of including term dependencies between distant words (distance between term occurrences) in the measurement of document relevance has been explored in some works by incorporating these dependency measures into various models, such as vector space models (Fagan, 1987) as well as probabilistic models (Song et al., 2008). In the literature, efforts have been made to extend the state-of-the-art probabilistic model BM25 to include term proximity in the calculation of the relevance of a document to a query (Song et al., 2008), (Rasolofo and Savoy, 2003). (Hawking et al., 1996) makes use of distance-based relevance formulas to improve the quality of retrieval. (Zhao and Yun, 2009) proposes a new proximity-based language model and studies the integration of term proximity information into unigram language modeling. We aim to make use of term proximity between distant words in the calculation of similarity between text documents represented using the vector space model.
3 BASIC IDEA
The basis of the work presented in this paper is the measure of proximity among words that are common to two documents. This implies that two documents will be considered similar if they have many
KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval