
3.4 HAL-based System (System II)
Given the limits of the bigram-based approach,
whose context for a term is small and restricted
to the adjacent terms, we decided to expand this
context to a window of N terms, using the
Hyperspace Analogue to Language (HAL)
approach (Bruza and Song, 2002). The co-occurrence
matrix is generated as follows: once a term is given,
its co-occurrence is calculated with the N terms to
its right (or to its left). In particular, given a term
t and the window of N terms to its right (or left)
f_t = {w_1, ..., w_N}, we get co-oc(t, w_i) = w_i/i,
i = 1, ..., N, so that the weight of a pair decreases
with the distance i between the terms. During the
testing phase, N was given
a value of 10. As in the bigram-based approach, pair
(a, b) is equal to pair (b, a): hence even in this case
the co-occurrence matrix is symmetrical. For each
one of the training documents a co-occurrence matrix
is generated, whose lines are then normalized. The
matrices of the single documents are then summed
up, generating one single co-occurrence matrix
representing the entire training corpus. As before,
the text is broken down into nominal expressions,
but instead of gathering all the terms into one single
document, the breakdown into nominal expressions is
kept intact: the HAL algorithm is then applied
separately to each nominal expression of the document.
We want to ascertain whether, and to what extent, the
addition of semantic information such as the breakdown
into nominal expressions can enhance performance,
while still employing the co-occurrence weighting
system based on the joint use of IDF and c-index.
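A minimal sketch of this construction follows; the 1/i distance weight is our reading of the w_i/i formula above, and the helper names are illustrative, not the authors' implementation:

```python
from collections import defaultdict

def hal_matrix(terms, n=10):
    """Symmetric HAL-style co-occurrence matrix for one document.

    For every term, co-occurrence is counted with the n terms to its
    right; since pair (a, b) equals pair (b, a), filling both rows in
    one right-hand pass keeps the matrix symmetric. The 1/i weight
    makes closer terms contribute more (an assumption of ours).
    """
    cooc = defaultdict(lambda: defaultdict(float))
    for pos, t in enumerate(terms):
        for i, w in enumerate(terms[pos + 1:pos + 1 + n], start=1):
            cooc[t][w] += 1.0 / i
            cooc[w][t] += 1.0 / i
    return cooc

def normalize_rows(cooc):
    """Normalize each line (row) of a document's matrix to sum to 1."""
    out = {}
    for t, row in cooc.items():
        total = sum(row.values())
        out[t] = {w: v / total for w, v in row.items()}
    return out

def sum_matrices(matrices):
    """Sum the per-document matrices into one corpus-level matrix."""
    corpus = defaultdict(lambda: defaultdict(float))
    for m in matrices:
        for t, row in m.items():
            for w, v in row.items():
                corpus[t][w] += v
    return corpus
```

The corpus-level matrix of the section would then be obtained as `sum_matrices(normalize_rows(hal_matrix(doc)) for doc in training_docs)`.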
3.5 System based on Co-occurrence at
Page Level (System III)
The systems implemented so far base the construc-
tion of the co-occurrence matrix on the proximity of
words: in the case of bigrams, co-occurrence is lim-
ited to two adjacent words, while co-occurrence in the
HAL-based approach is extended to a window of N
terms. Both methods rely on the concept of word
proximity: the closer two words are in the text, the
higher the probability that they are semantically
linked. In the approach we are about to describe,
we pursue a totally different method of building the
co-occurrence matrix, one that overcomes the limit of
considering two terms as co-occurring only if they
are close to each other in the text.
Indeed, we tried to implement a system
which exploits co-occurrence at a page level, namely
trying to track down the pairs of words that usually
co-occur within the same training document, regard-
less of the distance between them; each term in a doc-
ument is considered co-occurring with all the other
terms in that very document. The number of times each
term appears in the document is counted, and the
vector o_v = <(t_1, tf_{t_1}), ..., (t_N, tf_{t_N})>
is generated, where N stands for the number of
distinct stemmed terms within the training document
under discussion. This vector consists of pairs
(t_i, tf_{t_i}), i = 1, ..., N, where t_i stands for
a term present in the document and tf_{t_i} for the
number of times it appears in the document. The
benefits of this weighting mechanism are evident when the
QE is done. As seen before, for each query term, the
co-occurrence vector is calculated. These vectors are
then summed up. The weighting mechanism makes
the contribution of the query terms that are scarcely
relevant in the document corpus less important than
the contribution of the relevant ones. Hence, for
each training document, a co-occurrence matrix is
generated. These matrices are
then summed up so as to form one single matrix of co-
occurrences, which is used for the QE. Once the tex-
tual information is obtained from the training links,
the POS tagger extracts the nouns, proper nouns and
adjectives. Not all these terms are selected, only the
first k are used, following an order based on t f × id f.
Co-occurrences at a page level are then calculated ex-
clusively using these first k keywords where k is a
fixed parameter of the system, which is the same for
any page to be analyzed.
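The page-level construction and the subsequent QE step can be sketched as follows. The tf_a * tf_b pair weight and all function names are illustrative assumptions on our part; the text states only that the term frequencies tf_t drive the weighting:

```python
import math
from collections import Counter, defaultdict

def top_k_keywords(doc_terms, doc_freq, n_docs, k):
    """Keep only the first k of a page's terms, ordered by tf * idf.

    doc_freq maps each term to the number of training documents that
    contain it; n_docs is the corpus size (illustrative names).
    """
    tf = Counter(doc_terms)
    score = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
    keywords = sorted(score, key=score.get, reverse=True)[:k]
    return keywords, tf

def page_level_matrix(keywords, tf):
    """Each kept term co-occurs with every other kept term on the page,
    regardless of the distance between them. The tf_a * tf_b weight is
    one plausible way to let term frequencies enter the matrix."""
    cooc = defaultdict(lambda: defaultdict(float))
    for a in keywords:
        for b in keywords:
            if a != b:
                cooc[a][b] += tf[a] * tf[b]
    return cooc

def expand_query(query_terms, cooc, n_expansions=5):
    """Sum the co-occurrence vectors of the query terms and return the
    best-scoring terms not already in the query."""
    scores = defaultdict(float)
    for q in query_terms:
        for w, v in cooc.get(q, {}).items():
            scores[w] += v
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [w for w in ranked if w not in query_terms][:n_expansions]
```

As in the section above, one such matrix would be built per training document and the matrices summed before `expand_query` is applied.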
3.6 System based on Co-occurrence at
Page Level and Term Proximity
(System IV)
System III is exclusively based on the concept of co-
occurrence at page level: it attempts to track down the
terms that are usually present simultaneously in the
same pages, without even considering the distance be-
tween the words within the text. Ignoring term prox-
imity within the same document can lead to a consid-
erable loss of information, since two words that are
close to each other are more likely to be correlated,
from a semantic viewpoint too. For this reason we
decided to use a hybrid approach that does not rely
on page-level co-occurrence alone, but also considers
term proximity, as the bigram-based and HAL-based
systems do. Following this idea, we implemented and
tested a hybrid approach, starting from the extraction
of nominal expressions and exclusively considering
nouns, proper nouns and adjectives. Moreover, the
weighting mechanism based on IDF and c− index is
used. In order to carry out the QE, the two vectors of
co-occurrence with the query terms are obtained sepa-
rately, following the HAL-based approach and the ap-
QUERY EXPANSION WITH MATRIX CORRELATION TECHNIQUES - A Systematic Approach