XML document and an XML query. The main motivation for its application is twofold: first, it allows structure comparison, which is necessary in XML retrieval; second, it provides a ranked list of document trees (or document sub-trees), where the score is the inverse of the total cumulated cost of correcting the document tree into the query tree (or the query tree into the document tree). However, for practical reasons, this approach cannot be applied to XML retrieval: tree-to-tree correction algorithms are not scalable and cannot be used for retrieval on very large corpora.
The structural retrieval process should look for the deepest and widest sub-tree shared by both representations (Selkow, 1977). Instead of applying a flexible tree-to-tree correction algorithm to the original trees, we use a dual process: an exact match algorithm applied to flexible representations of each tree. A flexible representation of an XML tree makes explicit all the hidden relations that could exist between nodes. We call such a representation the extended tree. This operation is performed when the XML corpus is indexed.
To do so, we add to each path $A \rightarrow B$ a weight reflecting the importance of the relation between nodes $A$ and $B$. For a direct parent-child relation, this weight is equal to 1. The farther $A$ is from $B$ in the original path, the lower the weight: it depends on the distance between the two nodes in the path. We use the weighting function $f$ defined by $f(A \rightarrow B) = \exp(1 - d(A,B))$, where $d(A,B)$ is the distance that separates node $A$ from node $B$. We denote this path by $A \xrightarrow{w} B$, where $w = \exp(1 - d(A,B))$ is the weight of the path $A \rightarrow B$.
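The weighting scheme above can be illustrated with a minimal sketch. It assumes that node labels are unique within a tree and that the distance d(A,B) is the number of edges between an ancestor A and a descendant B; the names `path_weight` and `extend_tree` and the dictionary-based tree encoding are hypothetical choices made for this illustration, not the authors' implementation.

```python
import math


def path_weight(distance: int) -> float:
    """Weight of a path whose end nodes are `distance` edges apart: exp(1 - d)."""
    return math.exp(1 - distance)


def extend_tree(children: dict, root: str) -> dict:
    """Return {(ancestor, descendant): weight} for every ancestor-descendant pair.

    `children` maps a node label to the list of its child labels
    (assumption: labels are unique within the tree).
    """
    extended = {}

    def visit(node, ancestors):
        # `ancestors` is ordered from the root down to the parent of `node`,
        # so the reversed list starts with the closest ancestor (distance 1).
        for distance, ancestor in enumerate(reversed(ancestors), start=1):
            extended[(ancestor, node)] = path_weight(distance)
        for child in children.get(node, []):
            visit(child, ancestors + [node])

    visit(root, [])
    return extended


# Hypothetical three-level document: article / section / paragraph
tree = {"article": ["section"], "section": ["paragraph"]}
for path, weight in extend_tree(tree, "article").items():
    print(path, round(weight, 4))
```

In this example the two parent-child paths receive weight 1, while the added path from `article` to `paragraph` (distance 2) receives exp(-1), so the relation made explicit by the extension counts less than the original ones.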
We apply an exact match algorithm to the extended trees (Ben Aouicha et al., 2006; Ben Aouicha et al., 2009). Since exact match algorithms are less complex than approximate matching algorithms, we use them to compare the two flexible representations. All the complexity is thus moved to the representation phase, which corresponds to the indexing process. This phase is performed only once; the querying process, in contrast, consists of applying an exact match algorithm to the extended tree representations stored in the index. Starting from tree $T$ and tree $T'$, this algorithm looks for the deepest and widest sub-tree shared by $T$ and $T'$ and computes the similarity from the weights of the paths appearing in each one. Relevant fragments are those whose structure is similar to that of the user query.
This can be done by looking in the extended document for fragments having exactly the same structure as the extended query or a part of it. This allows flexible XML retrieval (Bordogna and Pasi, 2000). The search strategy we adopt is iterative. We start from the extended document and query trees and build a returned fragment incrementally: starting from potentially common roots, we build for each one its common child nodes, then for each child node its own common child nodes, and so on down to the leaf nodes. While building relevant candidate fragments, we compute the structure-based score by cumulating the products of the path weights in the document and those of their matched paths in the query tree.
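The iterative search can be sketched as follows, under several assumptions that go beyond the text: each extended tree is stored as a dictionary mapping (ancestor, descendant) label pairs to path weights, nodes are matched by identical labels, and "cumulating the product" is read as summing the products of the matched weights. The function name `structure_score` is hypothetical.

```python
import math


def structure_score(doc_ext: dict, query_ext: dict, root: str) -> float:
    """Sum w_doc * w_query over the paths shared by both extended trees,
    descending from `root` through the matched nodes only."""
    score = 0.0
    visited = set()
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue
        visited.add(node)
        for (source, target), w_query in query_ext.items():
            # A query path leaving `node` is matched only if the extended
            # document contains exactly the same path (exact match).
            if source == node and (source, target) in doc_ext:
                score += doc_ext[(source, target)] * w_query
                frontier.append(target)
    return score


# Hypothetical extended document for article / section / paragraph
doc_ext = {
    ("article", "section"): 1.0,
    ("section", "paragraph"): 1.0,
    ("article", "paragraph"): math.exp(-1),
}
# Hypothetical query asking for a paragraph directly under an article
query_ext = {("article", "paragraph"): 1.0}

print(structure_score(doc_ext, query_ext, "article"))
```

Here the query path does not exist in the original document tree, yet it is matched exactly in the extended tree, with a score reduced by the weight exp(-1) of the added path. This is how an exact match on extended representations yields a flexible match on the original trees.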
The final score is a linear combination of the structure-based score $r_s$ and the content-based score $r_c$:

$$r(f_q, f_d) = \alpha \times r_c + (1 - \alpha) \times r_s \qquad (3)$$

where $\alpha$ is a parameter that balances the emphasis of the retrieval process between the keywords and the structure constraints. For our experiments, we use $\alpha = 0.6$.
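As a small illustration of equation (3), the combination can be written as a single function. The name `combined_score` and the example score values are hypothetical; only the formula and the default α = 0.6 come from the text.

```python
def combined_score(content_score: float, structure_score: float,
                   alpha: float = 0.6) -> float:
    """Equation (3): alpha * r_c + (1 - alpha) * r_s."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return alpha * content_score + (1 - alpha) * structure_score


# Hypothetical fragment with r_c = 0.5 and r_s = 0.8
print(combined_score(0.5, 0.8))
```

With α = 1 only the content-based score counts, and with α = 0 only the structure-based score counts; α = 0.6 gives slightly more weight to the keywords than to the structure.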
4 AUTOMATIC QUERY REFORMULATION
In our approach, we focus on the structure of the original query and on that of the document fragments judged relevant with respect to the user's structure hints. Studying these structures allows us to reinforce their importance in the reformulated query and thus to better identify the fragments most relevant to the user's needs. The analysis of structures allows us to identify the most relevant nodes and the relationships involved.
4.1 Line of Descent Matrix
In most approaches to automatic query reformulation, the query is constructed by building one representative structure for the relevant objects and another for the irrelevant ones, and then building a representation that is close to the first and far from the second.
For example, Rocchio's method (Rocchio, 1971) represents a document set by its centroid. A linear combination of the original query and the centroids of the relevant and irrelevant documents can then be taken as a potentially suitable expression of the user need.
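For reference, the classical vector-space form of this linear combination can be sketched as below. The coefficient names `w_query`, `w_rel`, `w_irr` and their default values are illustrative assumptions, not values taken from this paper, and negative term weights are left unclipped for simplicity.

```python
def centroid(vectors):
    """Component-wise mean of equal-length term-weight vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def rocchio(query, relevant, irrelevant,
            w_query=1.0, w_rel=0.75, w_irr=0.15):
    """Move the query toward the centroid of the relevant documents
    and away from the centroid of the irrelevant ones."""
    zero = [0.0] * len(query)
    rel_centroid = centroid(relevant) if relevant else zero
    irr_centroid = centroid(irrelevant) if irrelevant else zero
    return [
        w_query * q + w_rel * r - w_irr * i
        for q, r, i in zip(query, rel_centroid, irr_centroid)
    ]


# Hypothetical two-term vocabulary
new_query = rocchio(
    query=[1.0, 0.0],
    relevant=[[1.0, 1.0], [0.0, 1.0]],
    irrelevant=[[0.0, 2.0]],
)
print(new_query)
```

The second term, absent from the original query, acquires a positive weight because it occurs in both relevant documents, even though the irrelevant document pulls it down.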
We propose to represent the documents and the query in a matrix format instead of as a weighted term vector, as in Rocchio's method, which is better suited to flat documents. For each document, we build a matrix called the line of descent matrix (LDM), which must show all the existing ties of kinship between the different nodes. This representation should also reflect the