EXTENDING AN XML MEDIATOR WITH TEXT QUERY

Clément Jamard, Georges Gardarin

Laboratoire PRiSM, Université de Versailles, 45 avenue des Etats-Unis, 78000 VERSAILLES

Keywords: XML, mediation, indexing technique.

Abstract: Supporting full-text query in an XML mediator is a difficult problem. This is because most data-sources do

not provide keyword search and ranking. In this paper, we report on the integration of the main

functionalities of the emerging XQuery Text standard in XLive, a full XML/XQuery mediator. Our

approach is to index on keywords virtual documents in views. Selected virtual documents are on demand

mapped to data source objects. Thus, the mediator selection operator is efficiently extended to support full-

text search on views. Keyword search and result ranking are integrated. We rank results using a relevance

formula adapted to XPath, based on number of keywords in elements and distance from the searched nodes.

1 INTRODUCTION

As XQuery becomes the standard for querying

XML, new needs appear to perform full-text search

in XML. A task force, Buxton and Rys (2003), is

currently specifying new full-text search predicates

and functions to be included in XQuery, so as to

express searching on multiple keywords, ranking

results on relevance, searching on suffix or prefix of

terms, etc. TexQuery, Amer-Yahia (2004), can be

seen as a precursor of the future language.

Some text search functionalities are very

common and present in most DBMSs, such as single

keyword search. Data from distributed system has to

be recomposed before applying text search;

important functionalities often required by

applications are not possible with distributed

systems. These concern ranking query results,

multiple conjunctive keywords searches, and

searches dealing with stemming, prefix or suffix on

terms. An increasing number of XQuery-based

information integration platforms are available like

BEA (2004), IBM DB2 (2004), Papakonstantinou

(2003) or XQuare (2005). They are mostly based on

a global as view architecture and support a

significant subset of XQuery. At the best of our

knowledge, none of them support fully XQuery

Text. However, many data integration applications

are full-text oriented and requires full support.

The goal of a mediator is to federate sources

around an integrated architecture fulfilling the lacks

of some sources. Most data sources support single

word search, some multi-keyword search, but most

mediated systems have different capabilities for

searching full text. For example, the XLive mediator

can currently query Google as a (large) virtual XML

collection through a Web service wrapper. This

search engine is very powerful in multi-keyword

search and in ranking results, compared to common

relational databases. We also federate Xyleme,

Abiteboul (2002), an XML native database system

that supports efficiently multi-keyword searches. All

these systems have some capabilities, but none

propose the full set of XQuery text functionalities.

Thus, there is a strong need to integrate uniform full-

text search on all sources. Moreover the integration

of the ranking systems is very difficult, as all

integrated system have their own ranking scheme.

In this paper, we address the problem of

extending an XML mediator for querying text-

oriented sources using XQuery Text. We base our

implementation on XLive, Dang-Ngoc and Gardarin

(2003). The XLive system integrates and query

relational or XML sources in XQuery. A large

subset of XQuery is supported including FLWR

expressions and nested queries. Sources are wrapped

in a subset of XQuery. The query runtime is

dataflow-oriented and built around an extended

relational algebra for XML, known as the XAlgebra.

The basic idea of this algebra is to model XML

documents as tuples of paths referencing virtual

DOM trees, called XTuples. The mediator evaluates

query plans of XAlgebra operators on collections of

XTuples and constructs XQuery results.

Jamard C. and Gardarin G. (2006).

EXTENDING AN XML MEDIATOR WITH TEXT QUERY.

In Proceedings of WEBIST 2006 - Second International Conference on Web Information Systems and Technologies - Internet Technology / Web

Interface and Applications, pages 38-45

DOI: 10.5220/0001246000380045

 SciTePress

Data retrieved through XLive are distributed on

multiples sources. An important issue in integrating

full-text search in XLive is the management of

sources capabilities. We propose to unify these

capabilities through views. The mediator defines

views of distributed XML data and provides XQuery

Text support through these views. The mediator

does not materialize the views to avoid replicating

sources data; but, it indexes their contents and

structures. Several indexing schemes have been

proposed in centralized systems for fast retrieval of

elements on keywords. The interested reader can

find a survey in Gardarin and Yeh (2003). We

propose an efficient distributed indexing scheme that

relies on a viewguide, an invariant abstract DTD-

like summary derived from the query defining the

view. This index scheme is particularly adapted for

text search over distributed data.

XQuery/IR, Bremer and Gertz (2002), is an

efficient integration of information retrieval

techniques within XQuery. It uses an indexing

scheme adapted to XML tree structures allowing

solving tree pattern queries. Such a system does not

provide a solution for our mediation context as data

structures are centralized and homogeneous.

Another approach to support XQuery Text query is

to define function operators. TexQuery, Amer-Yahia

(2004), uses Boolean operators on XML data flows

to determine the presence of keywords in elements

and distance between keywords. Scores functions

are also defined as operators to rank results. Such

operators are not easy to adapt to mediation; the

mediator has to manipulate a huge amount of data

through complex operations. Both systems do not

provide solutions to reconcile data coming from

different sources before applying text search

functions. Numerous works focus on reducing index

size in centralized systems (Chen and al. 2003,

Chung and al. 2002, Cooper and al. 2001, Milo 1999

or Kaushik and al. 2002). In summary, although

functional and efficient solutions have been studied

for supporting XQuery Text, they are not easily

applicable to mediation systems.

Managing view requires integration of data

available in different schemas. Relationship

(mappings) between schemas must be specified, to

determine correspondence between elements in

source schemas and elements in target schemas. A

lot of work has been done on unifying source

schemas under a target schema. A survey can be

found in Rahm and Bernstein (2001). Defining rules

mapping paths from one schema to paths of another

is a simple but effective approach. Our system

provides for this kind of mapping techniques to

create integrated views.

This paper is organized as follows. Section 2

presents the integration of indexed views to support

full-text search in XLive. Section 3 develops the

query processing algorithm for querying views and

ranking results. Section 4 gives some experimental

results of our system. Section 5 summarizes the

contributions and introduces future work.

2 INDEXING VIRTUAL VIEWS

The key question in a mediation context is how to

integrate a keyword path indexing technique within

the distributed architecture. In mediation systems,

views are often used to focus the search on relevant

data source parts. To combine the power of views

with keyword search, a key decision of our design is

to index virtual views of sources by keywords.

We choose to index the view content. Through

the index, the mediator knows the locations of terms,

which helps in answering text queries efficiently.

Indexing important terms avoids replicating entire

sources in the mediator. It avoids huge data transfer

between sources and the mediator. The index

determines relevant results, which avoids complex

full text search operation on data in the mediator. A

compact and fast index is the focus of our approach

in order to avoid managing large data sets in the

mediator.

Identifiers used in index entries are mapped to

objects in sources through additional structures

maintained at the mediator and source layers. We

use these additional structures that determine where

data composing a document in the view are located.

It helps in recomposing the document efficiently.

2.1 Index Overview

We choose to index view content at creation time

and to maintain index when view sources are

updated. Term positions in the view are memorized

in the index at mediator level. The index determines

relevant elements addresses; it avoids huge data

transfer between source and mediator; only relevant

data are transferred. Thus, the mediator does not

manipulate the whole data, through complex

information retrieval operations.

Identifiers used in our index reference objects on

sources using intermediate structures managed by

wrappers. These structures allow view data

localization, extraction, and reconstruction from

EXTENDING AN XML MEDIATOR WITH TEXT QUERY

sources. When an update occurs on a source, the

source reports to the mediator in order to update

index identifiers. Triggers or periodic polling is used

to detect updates on sources. The reporting

functionality depends on the source wrapper.

2.2 Location of Words in Views

To index content of the view, the position of a word

has to be identified precisely. A word is located by a

path and a document instance of the view. We

propose our own numbering scheme to encode this

position. We now detail the index structure used for

determining and locating the relevant view instances

2.2.1 Numbering Scheme

We first introduce the numbering scheme

implemented for identifying virtual nodes in a view.

Any element in a view instance is addressed by a

global document identifier (GDID) plus a node

identifier (NID) determining the path reaching the

element.

Global document identifier (GDID): Unique

integer allocated by the mediator identifying a

virtual document instance of a view.

Node identifier (NID): Unique identifier of a

node element in a document determined by the view

definition.

To encode a NID, we make use of a Viewguide

summarizing the structure of a view.

Viewguide: Tree giving the common structures

of all documents in a view, whose nodes correspond

to elements or attributes and edges to simple or

multi-valued (marked with*) imbrications of

elements.

Attributes are treated as elements with a name

prefixed by @. All children of a node have different

names as duplicates are removed. In addition, edges

are marked by the maximum cardinality of the

element (1 by default, and * if multiple). Thus, each

distinct path of the view is represented once and

only once in the viewguide.

The viewguide is somehow similar to a

DataGuide (Widom and al), but: (i) It is a pure

structural summary. (ii) It is derived from a view

definition (i.e., the query defining the view) and not

from instances. (iii) It is annotated with cardinalities

of elements. It is used to assign a compact and stable

unique node identifier (NID) to each element of an

instance of a view. Viewguide nodes are numbered

by means of a preorder traversal (see figure 1). We

select this structure as it is easy to derive from a

query with a fully specified return clause. View

definitions are restricted to fully specified return

clauses, as detailed in the sequel.

III

V VI

VII

VIII IX

critic

book

review

author

genres

isbn title

rating

author

for $b in collection(“catalog”)/book

return

<book>

{ for $a in $b/author

return <author> { $a/text () } </author>

}

<genres> { $b/genres/text() } </genres>

<title> { $b/title/text () } </title>

</book>

{ for $rev in collection(“review”)/review

where $b/@isbn = $rev/book/@isbn

return

{ for $p in $rev/book/p

return <p> { $p/text () } </p>

}

<rating> { $rev/book/rating/text () } </rating>

<author> { $rev/book/author/text () } </author>

</review>

}

</critic>

Figure 1: View definition XQuery and ViewGuide.

To facilitate logical operation and XPath

encoding, we implement a structure for NIDs. A

NID is composed of a prefix and an optional suffix:

- The prefix is the node number assigned to the

node in preorder traversal of the viewguide.

- The suffix corresponds to the cardinality of the

traversed multi-valuated elements from the root

to the node.

The document identifier determines a document

instance in the view while the node identifier

encodes the path to reach the node from the

document root. Then a <GDID-NID> pair identifies

a unique element in the view.

For example, an element identified by the XPath

critic/review/p is assigned the path I/VII/VIII. Only

the leave number is kept as node identifier, i.e., VIII.

Nodes with edges marked with multiple occurrences

are additionally identified by a suffix added to the

identifier. Therefore the path critic/review[1]/p[2],

which corresponds to the numbering I/VII[1]/VIII[2]

WEBIST 2006 - INTERNET TECHNOLOGY

is encoded as VIII[1,2]. We keep suffixes only for

multi-valuated elements; a mono-valuated element

has no suffix, for example critic/book/title is

encoded as VI. Such identifiers are compact and do

not change while the view definition does not

change. The <GDID-NID> pair identifying the

position of the author of the second review in the

second document of the view is <2-X[2]>.

To translate XPath expressions selecting several

nodes in path identifiers, we introduce the concept of

identifier pattern (NID pattern called NIP). This

structure is used further for query processing.

Node identifier pattern (NIP): Profile of node

identifier with * in place of indices, meaning that

any indice is valid.

A node identifier pattern is simply a node

identifier in which stars replace one or more indice

of the suffix. A star in a suffix means that any

number is valid. For example, the path

critic/review/p[1], selecting the first p element in

any review of a critic will be encoded VIII[*,1].

2.2.2 Word Index

The mediator stores words positions in the view in

the Word Index.

Word Index: B-tree structure giving for each

keyword the virtual addresses of the nodes

containing these keywords.

The Word Index is a classical inverted list

addressing element locations in virtual documents.

An address is a <GDID-NID> pair determined by

our numbering scheme. Keywords are determined by

a thesaurus giving important words to be indexed,

which can be used in queries. It is populated with

location of all words at view creation time.

In a more detailed way, entries of the word index are

pairs (term, position record). The position record is a

table with column GDID, NID prefix, sorted list of

NID suffix. Each tuple corresponds to an element

containing the term with possibly multiple instances

if the element is multi-valued. Table 1 illustrates two

position records. This structure has been selected for

fast evaluation of intersection and union operations

detailed further in query processing section.

Table 1: Two position records, rec1 and rec2.

GDID Prefix Suffix list GDID Prefix Suffix list

120 VIII (1,4) (3,5) 120 IV -

120 IX (2) 120 VI -

121 VI - 120 VIII (1,2) (2,3) (2,5)

121 VIII (3,4) (4,1) 120 IX (1)

2.3 Location on Data Sources

The Source Map maintains the mapping between a

global document in the view and local documents in

the sources used to compose the view instance. More

precisely, we refer local documents through local

document identifiers. This local identifier is

associated to an extraction data operation.

Source Map: Mapping structure on the mediator

mapping a GDID to a set of LDID composing the

document.

Local document identifier (LDID): Number

allocated by a wrapper allowing retrieving a part of a

document in the source.

At view creation time, the view definition query

is decomposed into atomic queries (queries referring

to a single collection of XML documents). Each

concerned wrapper rewrites the atomic query(ies)

according to its local schema(s). Mapping between

global schema (defined by atomic queries definition)

and local schema can be given by a human or can be

determined semi-automatically by schema matching

algorithms. Mapping techniques used are not

detailed here for lack of place.

Plan Generator

Wrapper

- XQuery Mapping translation

- LDID creation

Atomic XQueries

Source

Local queries

Wrapper

- XQuery Mapping translation

- LDID creation

-XML data

-LDID

View

Documents

View Indexer

ViewGuide

XQuery View

Definition

Execution Plan

Figure 2: Framework for creating an Indexed View.

Each local source wrapper extracts and provides

data to the mediator respecting the target view

schema (viewguide). The view creation framework

is detailed in figure 2. The Plan Generator defines

an Execution Plan, which constructs view

documents from data retrieved by wrappers. Data is

then indexed by the View Indexer, which populates

the Word Index and Source Map.

For each document on a local source, a LDID is

created. An LDID maintains a reference to data on

the local source and a reference to the mapping used

EXTENDING AN XML MEDIATOR WITH TEXT QUERY

to extract data. The LDID mapping depends on the

wrapper. For a file wrapper, the LDID can be simply

the file URI. For an XML database, it is generally a

document identifier, for example a URL in Xyleme.

For relational databases, it can be a reference to an

SQL/XML or XQuery query allowing mapping table

rows to XML. The mapping associated to each

LDID determines the way to query and recompose

data on local sources.

In the view example, two atomic queries

corresponding to book and review are generated

from the view definition. Wrappers corresponding to

these collections of entities are retrieved, queries are

rewritten according to local mapping, and data are

extracted.

From any LDID identifier, wrappers are able to

query the source to retrieve the local part of

document participating in the view. Finally, the

mediator uses the view definition to recompose the

whole document.

3 TEXT QUERY PROCESSING

The query processing algorithm first retrieves the

index entries corresponding to a textual search (e.g.,

search on keyword list with ranking of results).

Then, it uses the retrieved node identifiers to extract

from the sources the relevant elements from the

view. We detail how to search the index and

recompose results after querying the source. We also

propose an efficient way to rank results by relevance

in the context of distributed heterogeneous sources.

A multi-keyword search over semi-structured

data relies on two parameters: the structural search

space and a keyword list (k1, k2… kn). The

structural search space defines the elements to look

into. Typically, it is expressed as an Xpath. We first

concentrate on conjunctive queries in which the

relevant elements shall contain all keywords.

Predicates in queries are of the form A1/A2…/Am[.

Ftcontains k1 && k2 &&… kn], where Ai are labels

(or attribute names prefixed by @) and ki are

keywords. Regular expressions are allowed, i.e., Ai

can be * or empty (wildcards). A typical query is

“find all books in the critics view having a review

dealing with XML and databases”.

Steps performed by a search are:

1. Determine the search space.

2. Compute the set of entries containing each

term of the keyword list.

3. Extract documents from sources and

compose results.

We define the search space using node identifier

pattern(s) as define above. NIPs are derived from the

Xpath expression by a viewguide traversal from the

root. Notice that an Xpath query searches for all

elements when the position is not specified for a

repetitive element.

On the example, we rewrite the predicate Xpath

in pattern critic/review[*]. Then, we encode the

Xpath, which results in VII,(*). Thus, we obtain the

root(s) of the subtree(s) interesting for the query.

However, we need to access the full contents of

these subtrees, to look for other keywords inside

(keywords may be contained in any element among

comments, comment, p, author or rating. To retrieve

efficiently the ancestors or descendants of a node,

we maintain the ancestor-descendant matrix (ad-

matrix): it is a compact Boolean matrix in which

element C

= 1 if element i is an ancestor of element

j in the viewguide. This matrix provides an

immediate method to retrieve the relationship

between two identifiers in the viewguide.

The search space for our example query is an

interval of nodes identified between VI(*) and X(*).

We query the word index to compute the set of

entries containing each term of the keyword list. The

strategy consists in accessing the B-tree entries

(position record) for each keyword k

to k

. We then

intersect these lists of identifiers for conjunctive

queries; the element searched must contain at least

one or more occurrences of each keyword. The

intersect operation determines the first common

ancestor between two or more nodes. Each valid

intersection (i.e., remaining in the search space) will

be kept as a result.

To reduce the number of comparisons, the

intersection algorithm processes intersection on

entries respecting these three conditions:

1. GDID equality, i.e., keywords are in the

same document. It is a trivial condition

avoiding intersecting position contained in

different documents.

2. NID prefixes are descendant of the NIP

search space, i.e., keyword positions out of

the search space are not considered.

3. NIDs suffixes are not already computed.

Figure 3 sketches the intersection algorithm. It

computes every valid intersection between list of

identifiers in a given search space. Condition 1 is

checked on lines 8 and 12-13 to select identifiers of

same documents. Condition 2 is applied when

asking for next element with the search space on

lines 1, 5, 13, 27. Each next element is chosen only

among lines of position record corresponding to the

search space. Last condition is applied on line 21-

WEBIST 2006 - INTERNET TECHNOLOGY

22, by checking if the current intersection remains in

the last result (descendant of that result).

Entries are ordered by NID suffix in position

records, which avoids skipping intersection when

scanning each list. Therefore condition on line 30

always selects the minimum NID of list to intersect.

10.

11.

12.

13.

14.

15.

16.

ALGORITHM INTERSECT

INPUT : List(n)

n list of keyword position

VARIABLES : intersect, current_id, test_id, tmp_res

type:identifier<GDID-NID>

search_space

type:NIP filter

result

type:list of result

current_id = List(0).nextElement(search_space)

WHILE( List(0).hasNextElement(search_space) ) DO

intersect = NULL;

FOR EACH LIST DO

test_id = list().nextElement(search_space)

/* No element intersecting in same document

Repeat process on the next element */

WHILE ( test_id.GDID > current_id.GDID )

current_id = test_id;

FOR EACH LIST DO

/* Find next element with same GDID */

WHILE (test_id.GDID < current_id.GDID) DO

test_id = list().nextElement(search_space);

/* Intersection in a given search space */

tmp_res =

ancestor(test_id,current_id,search_space);

/* Valid intersection ->

search in the next list */

/* Invalid intersection ->

same process on next element */

IF (tmp_res != NULL &&

!tmp_res.descendantOf(result.last))

intersect = tmp_res;

ELSE

intersect = NULL;

IF (current_id.suffix > test_id.suffix)

current_id =

list().nextElement(search_space);

break;

/* Keep the resulting intersection */

if(intersect != NULL)

result.add(intersect);

last_res = intersect;

current

id = list().nextElement(search

space);

17.

18.

19.

20.

21.

22.

23.

24.

25.

26.

27.

28.

29.

30.

31.

32.

33.

34.

35.

Figure 3: Intersection algorithm.

The algorithm can be applied to position record

list of table 1 with the search space of our query as a

NIP: VII(*). For GDID 120 only lines VIII and IX

are selected. We obtain valid intersection for IX(1)

records1 and VIII(1,4) records2, VIII(2,3) records1

and IX(2) records2. These are VII(1) and VII(2).

Other intersections for GDID 120 are not leading to

a valid intersection (in the search space, or an

intersection already found). Entries of other GDID

are processed in the same way.

4 XQUERY TEXT CAPABILITIES

4.1 Full Text Search

A full-text search may use the position of keywords

inside the document. This position can be expressed

in two metrics. The position as element means that

only the path from the root of the document

determines the criteria. The position inside an

element means the path and the position among the

other words in the element determine the criteria.

The index presented before allows answering any

full-text search dealing with position expressed as

element. Other functionalities like term distance,

window, order or result ranking require additional

information, the word offset inside an element. The

offset is added to position records as needed by these

operations. Full functionalities are available in

Buxton and Rys (2003).

4.2 Ranking Results

A ranking method associates a relevance weight to

each result. In a mediator, we have to rank results

coming from different sources and to merge results

for delivery to the user in correctly ranked order.

Our architecture provides a way to pre-rank results,

i.e., virtual results are ranked before source

extraction; we compute the relevance score of each

result when querying the word index. The weighting

formula has to be accurate but simple enough to be

computable with information contained in the word

index.

We determine the weight of a result by adding

the weights of each node containing directly one or

more keywords. Our ranking approach is based on

the specificity of each result. The ranking method

gives more influence to element nodes close to the

root of the search space. Thus, words close to the

root weight more than words deeply hidden in the

result tree. Such an approach is a bit simplistic, as

ranking weights are attributed independently of

keyword position in relation with each other.

We also give more influence to element nodes

containing several keywords of the search. The

percentage of keywords in a node is used as a

polynomial factor to adjust the weight. Finally, the

following formula computes the weight of a node:

()

∑

⎟

⎠

⎞

⎜

⎝

⎛

Wi is the weight of the keyword (tf.idf), N is the

total number of keywords in the query, Ni is the total

number of keywords in the node, and d is the

distance to the root of the sub-graph (number of

edges). The constant α is designed to give more

influence to the distance from the root to the word

position (in edges). β is a polynomial factor used to

increase the weight of an element containing several

EXTENDING AN XML MEDIATOR WITH TEXT QUERY

keywords. The total weight of a result is the sum of

each node weight containing one or more keywords.

An advantage of the implemented ranking

system is the formula modularity. It may be

extended or replaced. The mediator may integrate

any formula relying on information contained in the

word index (tf.idf, distance). The ranking formula is

application adjustable.

Some other systems propose concrete solutions

to rank results of a keyword–based query. XRANK,

Lin and al. (2003), proposes a ranking algorithm

relying on the elemRank of an XML element. This

rank is computed using the number of outgoing and

incoming edges (inter and intra documents). It

contrasts with our approach, which focuses first on

keywords distribution and second on links, by

applying a tf.idf based formula on element contents.

Moreover, the proximity factor proposed in XRANK

is not fully adapted to XML tree structures; XRANK

uses the minimum containing window (containing

all keywords) as proximity metric. It is a global

proximity; our system rather uses a proximity factor

computed at the element granularity, i.e., not

globally on the full result sub-graph. Thus, our

approach is more precise on content. XRANK also

uses a decreasing factor for less specific results (far

from the sub-graph result root).

The XXL system, Theobald and Weikum (2002),

uses relevance ranking focused on a vagueness

operator, which computes a similarity score for

every result. It compares the structure of the result

with the structure of the query, by applying

ontological rules. This kind of ranking is not easily

adaptable to XQuery Text in the context of

mediation, as it is not based on a specific search

space. However, it could be interesting to consider

ontology-based similarity for integrating

heterogeneous sources.

XIRQL, Norbert and Kai (2001), extends IR

functionalities to XML, like relevance-oriented

search, looking for XML objects satisfying a content

search. Weighting formulas are applied to objects

(i.e., sub-graphs) defined in the schema at the type

level and in the index at the instance level.

Composed objects are weighted with the sum of the

composing objects. Content queries are processed by

combining relevance of objects according to the

logical search conditions. Only the relevant objects

are returned ranked by weights. A tf.idf computation

and the specificity of the position of keywords are

used to adjust the weight of the objects. This

approach is difficult to apply in a mediation

architecture where objects are not defined for each

source.

5 EXPERIMENTS

We experiment index search on three different data

sets stored in an XML repository. The data sets are

presented in table 2. The size is the total size of the

view after creation. Each data set is structured as the

critic view definition given above. Collections are

stored on different sources. We measure search time

through the index for three queries:

(q01) critic/review [. ftcontains “k1” && … “kn”]

(q02) critic [. ftcontains “k1” && … “kn” without

content .//title]

(q03) critic [. ftcontains “k1” && … “kn”]

The queries are searching documents containing

a conjunctive set of keywords. The search space

includes different elements of the view. q01 searches

in the review element, q02 in the review element

without title, and q03 in the full document. Queries

are executed on the same Pentium 4 with 512K

memory configuration. The numbers of keywords in

queries vary from 2 to 26. The measures presented

here are an average of ten executions.

Table 2 presents the execution time of q01 searching

for 5 keywords. For each data set, we measure the

execution time with an indexed view, and without

index (the mediator handles the search operation).

The time for the view includes the index search time

(both Word Index and Source Map), the query plan

execution in the mediator, and the result

construction. The time without indexed view

includes the plan execution and the result

construction times. As planned, index speeds up the

execution time as only relevant results are requested

from the sources and complex content search

operations are avoided on huge text data at the

mediator layer. For each data set, the execution

through the indexed view brings out a significant

ratio averaging 3, for a low selectivity of queries

(from 60 % to 68 % of the documents are selected).

Table 2: Data sets.

Docs Byte Size

ds1 100 607 855

ds2 250 1 556 052

ds3 500 3 029 990

Table 3: q01 execution time and index search time.

Execution Time Intersection

View Mediator 5 words 15 words 25 words

1042.1 2917.6 1.6 3.1 4.6

1976.5 5969 6.3 15.6 21.9

5949 18501 17.2 65.6 86

Index search times for query Q01 are presented

in table 3. The index search does not increase

WEBIST 2006 - INTERNET TECHNOLOGY

significantly the execution time; it represents less

than 1% of the overall execution time. These

preliminary results validate our approach.

1234 6789

DS3 Q02

DS3 Q03

DS2 Q02

DS2 Q03

1234 6789

DS3 Q02

DS3 Q03

DS2 Q02

DS2 Q03

words

Time (ms)

Figure 4: q02 and q03 evaluation for DS2 and DS3.

Figure 4 illustrates the index search time

difference between q03 and q02. Due to the

identifiers ordering scheme, q02 always executes

faster than q03 as fewer elements are considered in

the search space.

6 CONCLUSION

In this paper, we have reported on the integration of

XQuery Text in an XML mediator. The main

difficulty is to integrate sources with little

capabilities in full-text search. We propose to use

indexed virtual views to support such sources. The

views are indexed inside the mediator using a sort of

structural dataguide derived from the view

definition, called a viewguide. Nodes identifiers and

path expressions are encoded through the viewguide,

which yields to algorithms to process efficiently the

mediator basic selection operator involving XPaths

and keywords. A parameterized ranking formula

taking into account relevance and deepness of

elements is proposed to integrate result relevance.

Further work remains to be done. Notably, a

better support of source capabilities would be

desirable. When a source can support a subset of

XQuery, we should be able to build limited views at

the wrapper to integrate it in distributed query

processing. Thus, functionalities should be divided

in multiple stages, e.g., concrete local views

combined with global virtual views. Also, local

ranking of results from a view or a capable source

(e.g., Google) seems easy, but global ranking with

pertinent formulas remains to be experienced in

details on real applications. The propagation of

updates must also be studied. Indexing structures

should be automatically updated when inserting and

deleting objects in data sources. A basic approach

could be detecting updates at wrapper level and

propagate them at the different index structures.

REFERENCES

Abiteboul S., S. Cluet, G. Ferran et M.C. Rousset: "The

Xyleme project", Computer Networks 39(3): 225-238

(2002)

Amer-Yahia S., C. Botev, J. Shanmugasundaram :

"TeXQuery: A Full-Text Search Extension to

XQuery", WWW'04

BEA: "Liquid data for WebLogic 1.1, 2004, http://e-

docs.bea.com/liquiddata/docs11/

Bremer J. M., M. Gertz : "XQuery/IR: Integrating XML

Document and Data Retrieval", WebDB 2002.

Buxton S., Rys M. Editors, "XQuery and XPath Full-Text

Requirements", W3C Working Draft 02 May 2003,

http://www.w3.org/TR/xquery-full-text-requirements/

Chen Q., A. Lim and K.W. Ong : D(k)-index: An adaptive

structural summary for graph-structured data. In Proc.

of SIGMOD, 2003.

Chung Chin-Wan, J. Min and K. Shim: "APEX: an

adaptive path index for XML data", SIGMOD

Conference 2002: 121-132

Cooper B., N. Sample, M.J. Franklin, G.R. Hjaltason and

M. Shadmon :" A Fast Index for Semistructured

Data.", VLDB 2001: 341-350

Dang-Ngoc T.-T., G. Gardarin : "Federating

heterogeneous data sources with XML", In Proc. of

IASTED IKS Conference, pages 193-198, Scottsdale,

USA, Nov. 2003.

Fuhr N., K. Großjohann: "XIRQL: A Query Language for

Information Retrieval in XML Documents". SIGIR

2001: 172-180

Gardarin G., L. Yeh: "Treeguide Index: Enabling Efficient

XML Query Processing", Bases de Données Avancées,

Montpellier, Octobre 2005

IBM: "DB2 Information Integrator for Content", 2004,

http://www-306.ibm.com/software/data/eip/

Kaushik R., P. Shenoy, P. Bohannon and E. Gudes :

Exploiting local similarity for indexing paths in graph-

structured data. In Proc. of ICDE, 2002.

Lin G., F. Shao, C. Botev, J. Shanmugasundaram :

XRANK: Ranked Keyword Search over XML

Documents. SIGMOD Conference 2003: 16-27

Milo T., D. Suciu: "Index Structures for Path

Expressions", ICDT 1999: 277-295

Papakonstantinou Y., V. Borkar, M. Orgiyan, K.

Stathatos, L. Suta, V. Vassalos, P. Velikhov : "XML

queries and algebra in the Enosys integration

platform", Data Knowl. Eng. 44(3): 299-322 (2003)

Rahm E., P.A. Bernstein. 2001. A survey of approaches to

automatic schema matching. VLDB journal:334-350.

Theobald A., G. Weikum : "The Index-Based XXL Search

Engine for Querying XML Data with Relevance

Ranking". EDBT 2002: 477-495

Widom J. et. al.: "Lore, a DBMS for XML", http://www-

db.stanford.edu/lore/

XQuare: "The XQuare project: open source information

integration components based on XML and XQuery",

2004, http://xquare.objectweb.org/

EXTENDING AN XML MEDIATOR WITH TEXT QUERY