ITERATIVE XML SEARCH BASED ON DATA AND ASSOCIATED SEMANTICS

Alda Lopes Gançarski

Institut TELECOM, TELECOM et Management SudParis, CNRS SAMOVAR, 9 rue Charles Fourier, 91011 Évry, France

Pedro Rangel Henriques, Flávio Xavier Ferreira

Department of Informatics, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal

Keywords: XQuery, iterative XML search, metadata, ontologies, RDF, SPARQL.

Abstract: In previous work in the context of information retrieval, XQuery was extended with an iterative paradigm. This extension helps the user get the desired results from queries. In related work, XQuery was also extended to allow the inclusion of SPARQL queries; this is useful when XML documents are associated with semantic RDF descriptions. However, integrating SPARQL into XQuery queries makes query construction more complex (although more powerful). To make this integration easier to use, we propose to apply the iterative paradigm to the ‘SPARQL extension to XQuery’. In this paper, the proposal is introduced and justified, and a case study is presented.

1 INTRODUCTION

XML information is accessed using structured query languages such as XPath (Berglund et al., 2007) and XQuery (Boag et al., 2007), the standard proposed by the W3C. To help users get the desired information from their queries, Gançarski et al. (2006) proposed an iterative search over XML documents using an extension to XPath. The iterative paradigm was then included in XQuery (Gançarski and Henriques, 2006).

To improve data processing, the document

collections and Web resources are associated with

semantic descriptions, i.e. metadata. In order to be

able to exchange the semantics of information, one

first needs to agree on how to explicitly model it.

This is usually done using a sophisticated

description in the form of an ontology. An ontology

is a formal explicit specification of a shared

conceptualization. Using an ontology, any kind of

description can be made about a resource.

Ontologies can be used to annotate data with labels

indicating their meaning, thereby making their

semantics explicit and machine-accessible. The W3C has created the Resource Description Framework (RDF) (Manola and Miller, 2004), a language for representing information about resources in the World Wide Web. RDF Schema (RDFS) is an RDF extension that provides the basic elements for ontology descriptions. To find information in RDF descriptions, the SPARQL query language was defined by the W3C (Prud’hommeaux and Seaborn, 2007).

This research is done in the context of the RESPIRE project financed by the French ANR-ARA program.

Gançarski and Henriques (2007) exploit the use of XML documents together with their semantics to access information, arguing that both may interest the user and help find the desired information. To that end, they integrate SPARQL queries into XQuery queries.

In this paper, we propose to apply the iterative paradigm to the SPARQL extension made to XQuery. Searching by data and metadata is more sophisticated than searching by data alone. Thus, the user may take advantage of the iterative paradigm when building queries, not only in the XQuery component but also in the SPARQL one.

The next section introduces the XQuery iterative model. Section 3 introduces search based on data and associated semantics. Section 4 then integrates the iterative paradigm with semantic search. The proposed XQuery extension is formally defined in Section 5. A case study is described in Section 6. The article finishes with a brief conclusion, indicating some future work. For the sake of simplicity, we ignore IRIref definitions in our examples.

Lopes Gançarski A., Rangel Henriques P. and Xavier Ferreira F. (2008). ITERATIVE XML SEARCH BASED ON DATA AND ASSOCIATED SEMANTICS. In Proceedings of the Tenth International Conference on Enterprise Information Systems - DISI, pages 479-484. DOI: 10.5220/0001706504790484. SciTePress.

2 ITERATIVE SEARCH

The iterative paradigm of query construction is based on selection operations, which restrict intermediate results to the subset of elements that satisfies the user. For that, the mf:select function may be used in location path expressions (mf stands for "my function"). The mf:select function selects the subset of elements the user considers interesting. While a filter selects the set of elements by intension, i.e. by stating a condition, mf:select selects it by extension, i.e. by explicitly referring to each element. This can be useful when the criteria are too complicated to specify (the user may not even know how to do it) or when it is faster to refer directly to the desired elements.

Suppose each node is identified by a unique identifier, treated as a string of characters. The input to mf:select is a node and a list of node identifiers (denoted by “(...)”). The output is the input node if it is selected (i.e., if its identifier belongs to the list of identifiers), or an empty sequence of nodes otherwise. For example, suppose the user wants the references made inside interesting articles by the author Kevin. The user can then write the following query:

for $a in /articles/article[author = "Kevin"]
                           [mf:select(., ("a4", "a8"))]
return $a//references

In this query, the mf:select function selects the articles identified by "a4" and "a8". The symbol “.” refers to each context node, i.e., each node resulting from the preceding operation. Thus, mf:select takes each article that is a context node and returns it if it is one of the selected items.
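The selection semantics described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the `mf_select` name, and the sample articles are hypothetical; only the identifiers "a4" and "a8" and the author "Kevin" come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an XML node with a unique string identifier."""
    node_id: str
    author: str


def mf_select(node, selected_ids):
    """Return [node] if its identifier was selected, else an empty sequence."""
    return [node] if node.node_id in selected_ids else []


# Hypothetical article nodes; only "a4" and "a8" appear in the text.
articles = [
    Node("a1", "Kevin"),
    Node("a4", "Kevin"),
    Node("a5", "Maria"),
    Node("a8", "Kevin"),
]

# First predicate: select by condition (author = "Kevin").
by_condition = [a for a in articles if a.author == "Kevin"]
# Second predicate: select by extension (explicit identifiers).
selected = [n for a in by_condition for n in mf_select(a, ("a4", "a8"))]
selected_ids = [n.node_id for n in selected]
```

Here `selected_ids` holds the identifiers of the articles that pass both predicates; article "a1" satisfies the condition on the author but is dropped because the user did not select it.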

3 DATA AND SEMANTICS

When an XML document is associated with semantic descriptions expressed in RDF, SPARQL queries may be integrated into XQuery queries. This can be done by adding a new clause, metadata, to the for clause of XQuery. Let us call XQuery extended with the metadata clause XQuery+SPARQL. As an example, consider the following ontology, which includes as concepts, among other things, elements of an XML document.

Book.xml about “Beings”.

Book.xml#Chapter1 about “Fishes”.

Book.xml#Section11 about “Ocean Fishes”.

Book.xml#Section12 about “River Fishes”.

Book.xml#Chapter2 about “Birds”.

Book.xml#Chapter3 about “Vegetables”.

Book.xml#Chapter4 about “Fruits”.

Men eat “Birds”.

Men eat “Fishes”.

Men eat “Vegetables”.

Men eat “Fruits”.

If the user wants chapters about what men eat, he

can specify the following XQuery+SPARQL query:

1 for $c in
2   doc("http://.../Book.xml")/book//chapter
3 metadata $c in
4   SELECT ?c
5   WHERE { Men eat ?o.
6           ?c about ?o. }
7 return $c

In this query, the for clause binds variable $c to the set of chapters of the document (lines 1 and 2). The metadata clause (line 3) includes a SPARQL query (lines 4 to 6) that selects all the book parts (stored in variable ?c) that are about beings eaten by men (stored in variable ?o). This yields the set {Book.xml#Chapter1, Book.xml#Chapter2, Book.xml#Chapter3, Book.xml#Chapter4}. This result is intersected with the set of elements of the outer XQuery query, stored in variable $c, to get the desired parts of the book.
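The evaluation just described (match the triple patterns, then intersect with the XQuery bindings) can be sketched in plain Python. This is an illustrative sketch only: the tuple encoding of the ontology, the function name `sparql_parts_about_food`, and the list `xquery_chapters` are assumptions made for the example, and no real XQuery or SPARQL engine is involved.

```python
# Ontology from the text, encoded as (subject, predicate, object) triples.
TRIPLES = [
    ("Book.xml", "about", "Beings"),
    ("Book.xml#Chapter1", "about", "Fishes"),
    ("Book.xml#Section11", "about", "Ocean Fishes"),
    ("Book.xml#Section12", "about", "River Fishes"),
    ("Book.xml#Chapter2", "about", "Birds"),
    ("Book.xml#Chapter3", "about", "Vegetables"),
    ("Book.xml#Chapter4", "about", "Fruits"),
    ("Men", "eat", "Birds"),
    ("Men", "eat", "Fishes"),
    ("Men", "eat", "Vegetables"),
    ("Men", "eat", "Fruits"),
]


def sparql_parts_about_food(triples):
    """Mimic: SELECT ?c WHERE { Men eat ?o . ?c about ?o . }"""
    # Bindings of ?o for the first triple pattern.
    eaten = {o for s, p, o in triples if s == "Men" and p == "eat"}
    # Bindings of ?c for the second pattern, joined on ?o.
    return {s for s, p, o in triples if p == "about" and o in eaten}


# Assumption: the path /book//chapter yields these four chapter elements.
xquery_chapters = [
    "Book.xml#Chapter1",
    "Book.xml#Chapter2",
    "Book.xml#Chapter3",
    "Book.xml#Chapter4",
]

metadata_result = sparql_parts_about_food(TRIPLES)
# The metadata clause keeps only the XQuery bindings also returned by SPARQL.
result = [c for c in xquery_chapters if c in metadata_result]
```

In this particular example the intersection removes nothing, because every part returned by the SPARQL query is also a chapter of the book.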

4 ITERATIVE DATA AND SEMANTICS SEARCH

Formulating XQuery+SPARQL queries is more complex than formulating simple XQuery queries. To simplify this task for the user, we propose to extend the iterative paradigm to the SPARQL component of XQuery+SPARQL. For that, a selection function is included in SPARQL, associated with the FILTER clause. Let us see an example query using a filter and then incorporate the selection function into it.

ICEIS 2008 - International Conference on Enterprise Information Systems


4.1 FILTER Clause

SPARQL filters restrict solutions to those for which

the filter expression evaluates to true. Filters use

functions to define conditions. Those functions may

come from XPath, XQuery or may be SPARQL

specific. For example, consider the same ontology as

before in Section 3. Suppose the user wants sections

about fishes. He may, then, specify the following

query:

1 for $s in
2   doc("http://.../Book.xml")/book//section
3 metadata $s in
4   SELECT ?s
5   WHERE { ?s about ?o.
6           FILTER regex(?o, "Fishes") }
7 return $s

The regex function (line 6) is SPARQL specific and matches a string against a pattern. In the example query, the string is stored in variable ?o and the pattern is the simple string “Fishes”. The result of the WHERE clause is the set {Book.xml#Chapter1, Book.xml#Section11, Book.xml#Section12}. After intersection with the result of the outer XQuery query (which gives the sections of Book.xml), the final result becomes {Book.xml#Section11, Book.xml#Section12}.
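The two result sets above can be reproduced with a short Python sketch. An unanchored SPARQL regex pattern matches anywhere in the string, which Python's `re.search` mirrors here; the dictionary `ABOUT` and the list `xquery_sections` are assumptions made for the example, not part of the proposal.

```python
import re

# The "about" statements of the ontology, as a subject -> object mapping.
ABOUT = {
    "Book.xml": "Beings",
    "Book.xml#Chapter1": "Fishes",
    "Book.xml#Section11": "Ocean Fishes",
    "Book.xml#Section12": "River Fishes",
    "Book.xml#Chapter2": "Birds",
    "Book.xml#Chapter3": "Vegetables",
    "Book.xml#Chapter4": "Fruits",
}

# Mimic: WHERE { ?s about ?o . FILTER regex(?o, "Fishes") }
# The pattern is unanchored, so it may match anywhere inside ?o.
where_result = {s for s, o in ABOUT.items() if re.search("Fishes", o)}

# Assumption: the path /book//section yields exactly these two elements.
xquery_sections = ["Book.xml#Section11", "Book.xml#Section12"]

# Intersection with the outer XQuery bindings drops Book.xml#Chapter1.
final_result = [s for s in xquery_sections if s in where_result]
```

The sketch shows why Book.xml#Chapter1 appears in the WHERE result but not in the final answer: it matches the pattern, but it is a chapter rather than a section.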

4.2 Select Function

With the proposed iterative paradigm, once a variable is computed, the user may see its content and select the subset of interesting values. For that, we define the msf:select function (msf stands for "my SPARQL function"). Consider again the ontology of Section 3. If the user now wants chapters about animals eaten by men, he may specify a query in the following steps:

Step 1.

for $s in
  doc("http://.../Book.xml")/book//chapter
metadata $s in
  SELECT ?s
  WHERE { Men eat ?o.

At this point, with the iterative paradigm, the user

may access the intermediate result stored in variable

?o. This result is the set {“Birds”, “Fishes”,

“Vegetables”, “Fruits”}. The user may, then, select

the animals, as specified in the next step.

Step 2.

for $s in
  doc("http://.../Book.xml")/book//chapter
metadata $s in
  SELECT ?s
  WHERE
  { Men eat ?o.
    FILTER msf:select(?o, ("Birds", "Fishes"))
    ?s about ?o.
  }
return $s

With the msf:select operation, the content of variable ?o becomes {“Birds”, “Fishes”}. Thus, the final result of the query is {Book.xml#Chapter1, Book.xml#Chapter2}.
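The two steps can be sketched in Python as follows. This is a hedged illustration of the interaction, assuming a tuple encoding of the ontology; the function names `objects_eaten_by_men`, `msf_select`, and `parts_about` are hypothetical, and the user's choice is hard-coded where a real environment would display the intermediate result and wait for a selection.

```python
# Ontology from Section 3, encoded as (subject, predicate, object) triples.
TRIPLES = [
    ("Book.xml", "about", "Beings"),
    ("Book.xml#Chapter1", "about", "Fishes"),
    ("Book.xml#Section11", "about", "Ocean Fishes"),
    ("Book.xml#Section12", "about", "River Fishes"),
    ("Book.xml#Chapter2", "about", "Birds"),
    ("Book.xml#Chapter3", "about", "Vegetables"),
    ("Book.xml#Chapter4", "about", "Fruits"),
    ("Men", "eat", "Birds"),
    ("Men", "eat", "Fishes"),
    ("Men", "eat", "Vegetables"),
    ("Men", "eat", "Fruits"),
]


def objects_eaten_by_men(triples):
    """Step 1: evaluate the partial pattern { Men eat ?o . } and expose ?o."""
    return sorted(o for s, p, o in triples if s == "Men" and p == "eat")


def msf_select(term, selected):
    """Simplified selection test: is the term among the selected ones?"""
    return term in selected


def parts_about(triples, chosen):
    """Step 2: apply FILTER msf:select(?o, chosen), then { ?s about ?o . }."""
    result = []
    for s, p, o in triples:
        if s == "Men" and p == "eat" and msf_select(o, chosen):
            result.extend(s2 for s2, p2, o2 in triples
                          if p2 == "about" and o2 == o)
    return sorted(result)


intermediate = objects_eaten_by_men(TRIPLES)  # displayed to the user
chosen = ("Birds", "Fishes")                  # the user's selection
final = parts_about(TRIPLES, chosen)
```

The intersection with the chapters returned by the XQuery part is omitted here because, for this data, both selected parts are already chapters.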

5 FORMAL DEFINITION OF SELECT FUNCTION

The FILTER clause may be associated with user-defined functions, as specified in the following productions:

[26] Filter ::= 'FILTER' Constraint

[27] Constraint ::= FunctionCall

The msf:select function is defined using the SPARQL grammar productions corresponding to user-defined functions:

[28] FunctionCall ::= IRIref ArgList

[29] ArgList ::= ( NIL | '(' Expression ( ',' Expression )* ')' )

Here, the name of the function is derived from IRIref. This symbol allows complete IRI references or prefixed names. In our case, we use the prefix msf and the name select.

The arguments of user-defined functions are represented by the symbol Expression. This is a general symbol covering everything from simple literal strings to complex Boolean expressions. In the msf:select function, the first occurrence of Expression derives an RDF term, corresponding to the content of some variable, and the second derives a bracketed sequence of RDF terms. This sequence corresponds to the terms selected from an intermediate result. The msf:select function is defined as follows:

xsd:boolean msf:select(RDF term t, (RDF term)* tSeq)
{
  // Scan the selected terms; stop at the first one identical to t.
  for (i = 0; i < length(tSeq); i++) {
    if (sameTerm(t, tSeq[i])) { return TRUE; }
  }
  return FALSE;
}

The predefined SPARQL function sameTerm returns true if both arguments are the same RDF term.

For each term bound to the variable passed as the first argument of msf:select, the function returns true if the term occurs in the sequence of terms passed as the second argument; otherwise, it returns false. In the example query of Section 4.2, variable ?o is bound to the terms {“Birds”, “Fishes”, “Vegetables”, “Fruits”}. This variable is the first argument of the msf:select function. The second argument is the sequence (“Birds”, “Fishes”) chosen by the user. Checking whether each term bound to ?o occurs in the sequence reduces the content of ?o to {“Birds”, “Fishes”}.
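Because the definition relies on sameTerm, the comparison is made on whole RDF terms rather than on their text alone. The Python sketch below illustrates this under simplifying assumptions: the `Literal` class is a hypothetical, reduced model of an RDF literal (lexical form, optional language tag, optional datatype), and `same_term` approximates sameTerm by structural equality.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Simplified RDF literal: lexical form plus optional tag or datatype."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def same_term(a, b):
    """Approximation of SPARQL sameTerm: true only for identical terms."""
    return a == b


def msf_select(term, selected):
    """Return True if term is the same RDF term as one of the selected ones."""
    return any(same_term(term, s) for s in selected)


# Terms bound to ?o in the example of Section 4.2.
bound = [Literal("Birds"), Literal("Fishes"),
         Literal("Vegetables"), Literal("Fruits")]
# Sequence chosen by the user.
chosen = (Literal("Birds"), Literal("Fishes"))

kept = [t for t in bound if msf_select(t, chosen)]

# A language-tagged literal is a different term from the plain one.
tagged_kept = msf_select(Literal("Birds", lang="en"), chosen)
```

One consequence of this design is that the selected terms should be taken from the displayed intermediate result as they are, since a literal that differs only in its language tag or datatype would not be selected.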

6 CASE STUDY

We intend to test our approach with the resources of the Portuguese Emigration Museum (Museu da Emigração e das Comunidades - MEC). MEC is a web museum that aims to make easily accessible to the general public the rich cultural heritage of the Portuguese emigration phenomenon and the imprint left by the Portuguese people around the world.

MEC assets (resources) are vast and multifaceted because the emigration documents and objects come from very diverse sources, ranging from official government records to old newspapers and photo albums; the types of documents are also heterogeneous (from official travel reports to local stories). It is therefore necessary to organize and categorize all these resources. To that end, the information sources and the resources were classified using an ontology, presented partially in Figure 1 (at the end of the paper). Furthermore, using XML Schema (Fallside and Walmsley, 2004), we defined an XML format for each type of document (the ellipses in Figure 1); documents can be seen as the resources described by the ontology.

The ontology gives a full categorization of the universe of MEC resources; it shows the relations between the location of the resources (e.g. district archives), the information sources (e.g. passport processes, almanacs) and the documents themselves (e.g. birth certificate, passport petition, event record).

The next subsections show different application scenarios using XQuery+SPARQL over the MEC resources.

6.1 Using XQuery+SPARQL

A visitor wishing to explore MEC’s resources may be interested in searching for travel details registered in Fafe and related to the emigrant Antonio Serra; the details considered below are the date of departure and the destination (target country). Travel information is obtained by accessing passport records; a simplified excerpt of such a record is shown below:

<passportRecord ...> ...

<name>Antonio Serra</name> ...

<place of="destination">Brasil</place>

<place of="source">Portugal</place>...

</passportRecord>

The XQuery+SPARQL query that yields the desired information is:

for $d in doc("passRec.xml")/passportRecord
where $d/name = "Antonio Serra"
metadata $d in
  select ?d
  where {
    ?d rdf:type Passport_Record .
    ?d contained_in ?p .
    ?p rdf:type Municipal_passport_record .
    ?p acquired_in Fafe .
  }
return
  <pr> {
    $d/(place[@of="destination"]|date)
  } </pr>

In the metadata clause of this query, the SPARQL

query searches for documents contained in

municipal passport records (stored in variable ?p)

which were acquired in Fafe.

6.2 Using Iterative XQuery+SPARQL

Suppose now that the visitor is interested in searching for images and photos related to some families he knows. Let us assume a user from the city of Braga. Family photos can be found in the respective family album. Instances of family albums are represented by identifiers such as “Freitas_Fam_Album” for the “Freitas” family.

Image and photo documents have a structure and content similar to the following excerpt:

<image>

<name>Family address of José Freitas</name>...

<city>Braga</city>

<img>http://.../jose_freitas.jpg</img> ...

</image>

Then, the query satisfying the current example is:

for $i in doc("image1.xml")/image
where $i/city = "Braga"
metadata $i in
  select ?i
  where {
    ?i rdf:type Images_and_photos .
    ?i contained_in ?f .
    ?f rdf:type Family_album .
    FILTER msf:select(?f, (Freitas_Fam_Album, Silva_Fam_Album)) .
  }
return $i/img

In the metadata clause, the list of identifiers of the family albums found is displayed as the answer computed by the third triple pattern of the SPARQL query (stored in variable ?f). The visitor may then immediately identify those belonging to families he knows and select those albums using the msf:select function associated with the FILTER clause. In this query, the user selected the albums of the “Freitas” and “Silva” families.

7 CONCLUSIONS

In this paper, we propose to integrate the iterative

paradigm for query construction into the

XQuery+SPARQL semantic querying language. We

believe this can help users to get the desired

information.

We intend to create a prototype processing environment for XQuery+SPARQL. We can use existing XQuery and SPARQL query processors, integrating them with a special editor and result visualizer. We will then test this prototype and verify the usefulness of this approach using MEC assets, as described in Section 6.

REFERENCES

Berglund, A., Boag, S., Chamberlin, D., Fernandez, M.,

Kay, M., Robie, J., Siméon, J., 2007. XML Path

Language (XPath) 2.0 W3C Recommendation 23

January 2007, URL: http://www.w3.org/TR/xpath20/.

Boag, S., Chamberlin, D., Fernandez, M., Florescu, D.,

Robie, J., Siméon, J., 2007. XQuery 1.0: An XML

Query Language W3C Recommendation 23 January

2007, http://www.w3.org/TR/xquery/.

Fallside, D. and Walmsley, P., 2004. XML Schema Part

0: Primer Second Edition, W3C Recommendation 28

October 2004. URL: http://www.w3.org/TR/

xmlschema-0/

Gançarski, A., Doucet, A., Henriques, P., 2006. AG-based

interactive system to retrieve information from XML

documents, IEE Proceedings Software Journal,

Volume 153, Issue 2, p. 51-60, April 2006.

Gançarski, A., Henriques, P., 2006. A Formal Definition

of Selection Operations that Extend XQuery with

Interactive Query Construction. International

Conference in Web Information Systems and

Technologies 2006 (Webist06), Setúbal, Portugal,

INSTICC Press.

Gançarski, A., Henriques, P., 2007. Using data together

with metadata to improve XML information access.

International Conference in Web Information Systems

and Technologies 2007 (Webist07), Barcelona, Spain,

INSTICC Press.

Manola, F. and Miller, E., 2004. RDF Primer W3C

Recommendation 10 February 2004. URL:

http://www.w3.org/TR/rdf-primer/.

Prud’hommeaux, E. and Seaborn, A., 2007. SPARQL

Query Language for RDF W3C Proposed

Recommendation 12 November 2007. URL:

http://www.w3.org/TR/rdf-sparql-query/.


Figure 1: MEC Ontology.
