tion retrieval performance, mainly, by detecting rela-
tionships between terms in a corpus of documents.
We focus on the application of the least squares
method to a corpus of textual data in order to obtain
expressive semantic relationships between terms.
In order to check the validity and the performance
of our method, an experimental procedure was set up.
The evaluation is then based on a comparison between
the list of documents retrieved by an information
retrieval system that we developed and the documents
deemed relevant.
To evaluate the method on real examples, we used a
textual database of 1400 documents, the Cranfield
collection (Ahram, 2008) (Sanderson, 2010). This test
collection includes a set of documents, a set of queries
and, for each query, the list of relevant documents in
the collection.
Each document of the collection is processed and
analyzed to reduce it to lemmas, which serve as the
index terms. Once each document is represented as a
bag of words, we obtain a set of 4300 terms over the
whole collection. Hence, matrix X has size 1400 × 4300.
We then apply the least squares method to each term
in order to determine its vector A. The obtained values
$A_i$ indicate the relationship between $\mathrm{term}_i$
and the remaining terms in the corpus.
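The indexing step can be illustrated as follows. This is a minimal sketch, assuming that the documents are already tokenized and lemmatized and that X holds raw term counts (the weighting actually used is not stated here); the function name `build_term_matrix` and the toy documents are hypothetical.

```python
from collections import Counter


def build_term_matrix(docs):
    """Build a document-term count matrix from lemmatized documents.

    docs: list of documents, each a list of lemma strings.
    Returns (vocab, X) where X[d][t] is the count of vocab[t] in document d.
    Raw counts are an assumption; the source does not state the weighting.
    """
    vocab = sorted({lemma for doc in docs for lemma in doc})
    index = {lemma: t for t, lemma in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in docs]
    for d, doc in enumerate(docs):
        for lemma, count in Counter(doc).items():
            X[d][index[lemma]] = count
    return vocab, X


# Toy corpus of two lemmatized documents (illustrative only).
docs = [
    ["wing", "flow", "wing"],
    ["flow", "tunnel"],
]
vocab, X = build_term_matrix(docs)
```

On the Cranfield collection the same construction would give one row per document and one column per lemma, that is, a 1400 × 4300 matrix.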
We thus obtain a square matrix T with 4300 rows that
expresses the semantic relationships between terms
as follows:
$$\forall i \in \{1, 2, \ldots, 4300\}: \quad \mathrm{term}_i = \sum_{j=1}^{4300} T[i,j] \cdot \mathrm{term}_j \qquad (9)$$
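The construction of T can be sketched as follows. This is a minimal illustration, assuming NumPy's `np.linalg.lstsq` as the solver (the solver actually used is not stated here) and assuming that each term is regressed on all the other terms, with a zero diagonal; the function name `term_relationships` and the toy matrix are hypothetical.

```python
import numpy as np


def term_relationships(X):
    """Regress each term column on all other columns by least squares.

    X: (n_docs, n_terms) array. Returns T of shape (n_terms, n_terms),
    where T[i, j] is the weight of term j in the approximation of term i.
    The diagonal stays at zero: a term is not used to explain itself.
    """
    X = np.asarray(X, dtype=float)
    n_terms = X.shape[1]
    T = np.zeros((n_terms, n_terms))
    for i in range(n_terms):
        others = [j for j in range(n_terms) if j != i]
        # Least squares solution a of X[:, others] @ a ~= X[:, i]
        a, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
        T[i, others] = a
    return T


# Toy matrix: the third column is the sum of the first two,
# so T[2] is approximately [1, 1, 0].
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 2],
              [2, 1, 3]])
T = term_relationships(X)
```

When there are more terms than documents, as with 4300 terms and 1400 documents, each regression is underdetermined and `np.linalg.lstsq` returns the minimum-norm solution.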
Example of obtained semantic relationships:
Term airborn = 0.279083 action + 0.222742 airforc +
0.221645 alon + 0.259213 analogu + 0.278371 assum +
0.275861 attempt + 0.210211 behaviour + 0.317462 cantilev +
0.215479 carrier + 0.277437 centr + 0.216453 chapman +
0.22567 character + 0.23094 conecylind + 0.347057 connect +
0.239277 contact + 0.225988 contrari + 0.217225 depth +
0.283544 drawn + 0.204302 eighth + 0.26399 ellipsoid +
0.312026 fact + 0.252312 ferri + 0.211903 glauert +
0.230067 grasshof + 0.223152 histori + 0.28336 hovercraft +
0.380206 inch + 0.238555 inelast + 0.205513 intermedi +
0.275635 interpret + 0.235573 interv + 0.216454 ioniz +
0.319457 meksyn + 0.200089 motion + 0.223062 movement +
0.233753 multicellular + 0.376881 multipli + 0.436183 nautic +
0.219787 orific + 0.414204 probabl + 0.214005 propos +
0.305503 question + 0.204316 read + 0.222911 reciproc +
0.256728 reson + 0.237344 review + 0.202781 spanwis +
0.351152 telemet + 0.226465 termin + 0.212812 toroid +
0.339988 tunnel + 0.25228 uniform + 0.233854 upper +
0.20262 vapor.
We notice that the obtained relationships between
terms are meaningful: the terms linked by a relationship
belong to the same context. For example, the
relationship between the lemma airborn and the other
lemmas (airforc, conecylind, action, tunnel, ...)
concerns the subject of airborne aircraft carriers.
To test these relationships, we calculate the error
rate (Err) of each one:
$$\mathrm{Err}(\mathrm{term}_i) = \sum_{j=1}^{1400} \left( X[j,i] - \left( \sum_{k=1}^{i-1} X[j,k] \times T[i,k] + \sum_{q=i+1}^{4300} X[j,q] \times T[i,q] \right) \right)^2 \qquad (10)$$
The obtained values are all close to zero. For example,
the error rate of the relationship between the term
(account) and the remaining terms is $1.5 \times 10^{-7}$,
and that of the term (capillari) is $5.23 \times 10^{-11}$.
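The error rate of Eq. (10) can be computed directly from X and T. The following is a minimal sketch, assuming X and T are available as NumPy-compatible arrays with 0-based indices; the function name `error_rate` and the toy data are hypothetical.

```python
import numpy as np


def error_rate(X, T, i):
    """Sum of squared residuals of term i, as in Eq. (10).

    X: (n_docs, n_terms) document-term matrix.
    T: (n_terms, n_terms) relationship matrix.
    i: 0-based index of the term being reconstructed.
    """
    X = np.asarray(X, dtype=float)
    weights = np.array(T[i], dtype=float)  # copy of row i of T
    weights[i] = 0.0  # term i itself is excluded from both sums
    residual = X[:, i] - X @ weights
    return float(np.sum(residual ** 2))


# Toy data: the third column of X is the sum of the first two.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 2],
              [2, 1, 3]])
T = np.array([[0.0, -1.0, 1.0],
              [-1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
err = error_rate(X, T, 2)  # exact reconstruction, so the error is zero
```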
To check whether the obtained relations improve
information retrieval results, we implemented a vector
space information retrieval system that runs the queries
provided with the Cranfield collection.
The aim of this kind of system is to retrieve documents
that are relevant to the user queries. To achieve
this aim, the system assigns a value to each candidate
document and then ranks the documents in decreasing
order of this value. This value is called the Retrieval
Status Value (RSV) (Imafouo and Tannier, 2005) and is
calculated with four measures (cosine, Dice, Jaccard
and overlap).
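The four measures are not written out here, so the sketch below assumes their usual vector space forms (inner product normalized in four different ways). The function names `rsv` and `rank_documents` and the toy vectors are hypothetical.

```python
import math


def _dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rsv(query, doc, measure="cosine"):
    """Retrieval Status Value of a document vector for a query vector.

    The formulas are the usual vector space forms of each measure,
    which is an assumption about the system described in the text.
    """
    qd = _dot(query, doc)
    qq = _dot(query, query)
    dd = _dot(doc, doc)
    if qq == 0 or dd == 0:
        return 0.0  # an empty vector matches nothing
    if measure == "cosine":
        return qd / math.sqrt(qq * dd)
    if measure == "dice":
        return 2 * qd / (qq + dd)
    if measure == "jaccard":
        return qd / (qq + dd - qd)
    if measure == "overlap":
        return qd / min(qq, dd)
    raise ValueError("unknown measure: " + measure)


def rank_documents(query, docs, measure="cosine"):
    """Return document indices sorted by decreasing RSV."""
    scores = [(rsv(query, d, measure), idx) for idx, d in enumerate(docs)]
    # Highest RSV first; ties are broken by the lower document index.
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


query = [1, 1, 0]
docs = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
ranking = rank_documents(query, docs, "cosine")
```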
Our system performs two kinds of evaluation. First,
it calculates the similarity (RSV) of a document
vector to a query vector. Then, it calculates the
similarity of a document vector to an expanded query
vector. The expansion is based on the relevant documents
retrieved by the first model (Wasilewski, 2011) and on
the relationships obtained by the least squares method.
Specifically, if a term of the collection is strongly
related to a query term (α ≥ 0.5) and appears in the
relevant returned documents, we add it to the query.
Mean Average Precision (MAP) is used to measure the
precision of each evaluation. Table 1 shows the
obtained results.
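The expansion rule can be sketched as follows. This is an illustration under stated assumptions: the strength of the relationship between a query term i and a candidate term j is read from T[i][j], and the relevant returned documents are passed in as a list of row indices of X. The function name `expand_query`, its parameters and the toy data are hypothetical.

```python
def expand_query(query_terms, T, vocab, X, feedback_docs, alpha=0.5):
    """Add to the query every term that is strongly related to a query
    term (weight >= alpha) and occurs in at least one feedback document.

    query_terms: list of lemmas in the original query.
    T: relationship matrix, T[i][j] = weight of term j for term i.
    vocab: list of lemmas, aligned with the columns of X and T.
    X: document-term matrix, X[d][j] = weight of term j in document d.
    feedback_docs: row indices of the relevant returned documents.
    """
    index = {term: t for t, term in enumerate(vocab)}
    expanded = list(query_terms)
    for q in query_terms:
        if q not in index:
            continue  # query term unknown to the collection
        i = index[q]
        for j, weight in enumerate(T[i]):
            candidate = vocab[j]
            in_feedback = any(X[d][j] > 0 for d in feedback_docs)
            if weight >= alpha and in_feedback and candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Toy data (illustrative only).
vocab = ["flow", "tunnel", "wing"]
X = [[1, 0, 2],
     [1, 1, 0]]
T = [[0.0, 0.7, 0.6],
     [0.1, 0.0, 0.2],
     [0.3, 0.1, 0.0]]
expanded = expand_query(["flow"], T, vocab, X, feedback_docs=[1])
```

In this toy case "tunnel" is added because it is related to "flow" with weight 0.7 and occurs in the feedback document, whereas "wing" is not, because it does not occur there.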
This evaluation shows that the relationships obtained
by the least squares method are meaningful and can
improve the information retrieval process: the MAP
values increase when these relations are used in the
information retrieval system. For example, with the
cosine measure and α > 0.6, our method reaches
MAP = 0.21826, compared with the basic VSM
model (MAP = 0.20858).
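For reference, MAP can be computed from the ranked lists and the relevance judgments as sketched below, using the standard definition (mean over queries of the average precision). The function names and the toy rankings are hypothetical and do not reproduce the values reported above.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list of document ids.

    ranked: document ids in decreasing order of RSV.
    relevant: ids of the documents judged relevant for the query.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position  # precision at this relevant document
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries (illustrative only).
runs = [
    ([1, 0, 2], {1, 2}),
    ([0, 1, 2], {2}),
]
map_value = mean_average_precision(runs)
```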
Comparing our results with other works, we note
WEBIST 2014 - International Conference on Web Information Systems and Technologies