PROGRAMMING THE KDD PROCESS USING XQUERY

Andrea Romei and Franco Turini

Computer Science Department, University of Pisa, Largo B. Pontecorvo 3, 56127 Pisa (PI), Italy

Keywords:

Data mining, Query language, Inductive databases, KDD process, Mining constraints, XML, XQuery.

Abstract:

XQuake is a language and system for programming data mining processes over native XML databases in the

spirit of inductive databases. It extends XQuery to support KDD tasks. This paper focuses on the features

required in the deﬁnition of the steps of the mining process. The main objective is to show the expressiveness

of the language in handling mining operations as an extension of basic XQuery expressions. To this purpose,

the paper offers an extended application in the ﬁeld of analyzing web logs.

1 INTRODUCTION

Since the introduction of XML as a standard for rep-

resenting semistructured data, the amount of infor-

mation coded according to such standard is steadily

growing. Systems for retrieving information out of

such collections of XML data have been developed,

up to the point that a number of implementations for

handling native XML databases have been proposed.

XQuery is probably the most widely accepted lan-

guage in this area (W3C, 2010). Many authors main-

tain that the process of data mining can be seen as a

sophisticated way of querying the database, and, as

a consequence, it is a good idea to extend query lan-

guages with features supporting data mining.

According to this point of view the XQuake sys-

tem has been developed as an extension of XQuery

designed to support data mining tasks (Romei and

Turini, 2010). Besides being designed for mining na-

tive XML databases, XQuake takes advantage of the

XML philosophy also for representing the results of

the mining process, according to the PMML standard

(The Data Mining Group, 2011). The uniformity of

the representation of all the levels of information al-

lows the full compliance with the closure principle of

inductive databases.

XQuake provides a good basis for mining XML

data, but it still offers opportunities for extensions.

Two of them are presented in this paper:

• specification of constraints on the mining process;

• the possibility of programming different data mining processes in an expressive way.

Footnote: See http://www.w3.org/XML/Query/ for an exhaustive list of XML XQuery implementations.

The ﬁrst issue is addressed by specifying constructs

for binding data to mining models and for knowledge

ﬁltering. The second issue is addressed by extending

the language with mining functions, that exploit typ-

ical functional language constructs, including recur-

sion. The paper aims at highlighting the capability of

the language of specifying mining tasks in an elegant

and expressive way. The basic ideas for the system

architecture are coherent with the design of XQuake

and can be found in (Romei and Turini, 2010).

Section 2 contains a presentation of XQuake and

the proposed extensions by providing its syntax and

its semantics, and by exemplifying its use for coding

typical mining subtasks. Section 3 offers the descrip-

tion of a concrete application by discussing the im-

plementation of a mining process in detail. The last

two sections deal with related work, future work, and

some final considerations.

1.1 The xmark Database

Throughout the paper, we adopt an easily accessible source of XML documents, namely xmark (Schmidt et al., 2002). It models an Internet auction site, defining entities such as people, open auctions, closed auctions, items and categories. We report below a brief description (and the XML fragment in fig. 1) for the first three entities.

Specifically, the <people> tag is made up of a sequence of <person> elements encoding profiling information and the history of the visited auctions. The former has a (possibly empty) list of <interest> elements indicating the item categories of interest to the user. It also includes other personal information. The <open_auctions> element contains auctions in progress. Their properties are the initial price, the bid history along with references to the bidders, a reference to the item being sold and a reference to the seller. Each closed auction (tag <closed_auction>) contains the reference to the seller, buyer and item, the price and date of the closed transaction and the type of transaction.

Romei A. and Turini F. PROGRAMMING THE KDD PROCESS USING XQUERY. DOI: 10.5220/0003626501230131. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2011), pages 123-131. ISBN: 978-989-8425-79-9. © 2011 SCITEPRESS (Science and Technology Publications, Lda.)

From now on, all the examples will refer to the xmark data source, which is stored in BaseX (Holupirek et al., 2009), i.e. the native XML database of XQuake. Some highlights on XQuery can be found in (Romei and Turini, 2011b).

<interest category="category11"/>...

<education>High School</education>

</profile>

<watch open_auction="open_auction29"/>...

</watches>

</person>...

</people>

...

<open_auctions>

<open_auction id="open_auction29">

</bidder>...

</open_auction>...

</open_auctions>

...

<closed_auctions>

<closed_auction id="closed_auction9">

<type>Regular</type>...

</closed_auction>...

Figure 1: Three XML fragments of xmark.

2 XQuake EXTENDED

The section is organized as follows. We ﬁrst present

the syntax and the meaning of six clauses, that are

used as a basis to construct mining operators. Then,

such operators are introduced through simple exam-

ples. Finally, we show how to specify special mining

functions.

2.1 Mining Constructs

Each mining operator is made up of a combination of

base constructs. As shown in Figure 2, six operators

have been considered as guidelines for the design of

XQuake. Speciﬁcally, they serve to locate XML data

and PMML models, to bind new data to an extracted

set of patterns and to specify mining constraints or the

format of the output result. After presenting a simple

running example, we describe each construct in turn.

Figure 2: Syntax of the six basic clauses. The for data clause (a). The let clause (b). The for PMML clause (c). The using clause (d). The having clause (e). The return clause (f).

2.1.1 Running Example

To become familiar with the language philosophy, we introduce a “classical” example taken from inductive database theory. Specifically, we aim at “mining association rules from a dataset; on such a result, we find all the given instances that satisfy the rules; finally, we induce a classification tree from those instances”. XQuake offers three operators to solve this task.

Below, a set of association rules is extracted to find frequent correlations among the bidders in all the open auctions. The output model contains patterns like {Mark} → {John}, which states that when Mark appeared as a bidder, John was also a bidder, with a certain support and confidence. A condition requires that the size of the extracted rules be equal to 2 (i.e. exactly one item in the body and one item in the head of the rule).

Footnote: By definition, a transaction satisfies an association rule $I_1 \ldots I_n \rightarrow I_{n+1} \ldots I_m$ if every item $I_i$, for $i \in [1, m]$, occurs in the transaction.

for data $auc in doc("xmark")//open_auction

let group $pers := $auc/bidder//@person

having xquake:rule-size() = 2

return default

In the above query, the set of involved transactions (i.e. the <open_auction> elements) is specified through the for data clause. The items of each transaction (i.e. the person identifiers for each <open_auction> element) are defined in the let group clause. Both the having and return clauses operate on the output result. The former is evaluated for each rule. It uses the rule-size() built-in function, defined in the reserved xquake namespace, to get the size of that rule and to implement the constraint. Notice that the parameter of rule-size() is implicitly an association rule. Once the constraints have been evaluated, the return clause is evaluated once to return a PMML document. Assume that the output rules are stored in my-rules.xml.
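As a rough illustration of what the query above computes, the following Python sketch enumerates the rules of size 2 over a list of transactions. It is not XQuake code and does not reflect the algorithm used by the system; the function name `size_two_rules`, the thresholds and the toy bidders are assumptions made only for this example.

```python
from itertools import permutations

def size_two_rules(transactions, min_support=0.5, min_confidence=0.5):
    """Return rules {body} -> {head} with exactly one item on each side.

    Support is the fraction of transactions containing both items;
    confidence is that count divided by the count of the body item.
    """
    n = len(transactions)
    sets = [set(t) for t in transactions]
    items = sorted(set().union(*sets)) if sets else []
    rules = []
    for body, head in permutations(items, 2):
        body_count = sum(1 for s in sets if body in s)
        both_count = sum(1 for s in sets if body in s and head in s)
        if body_count == 0:
            continue
        support = both_count / n
        confidence = both_count / body_count
        if support >= min_support and confidence >= min_confidence:
            rules.append((body, head, support, confidence))
    return rules

# Hypothetical bidders of three open auctions (assumed data).
auctions = [["Mark", "John"], ["Mark", "John", "Ann"], ["Ann"]]
for body, head, sup, conf in size_two_rules(auctions):
    print(f"{{{body}}} -> {{{head}}} support={sup:.2f} confidence={conf:.2f}")
```

Restricting the enumeration to ordered pairs of items plays the role of the `xquake:rule-size() = 2` constraint in the having clause.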

In the next XQuake fragment, we filter out the instances (i.e. the sequence of <closed_auction> elements) that do not satisfy at least ten association rules in my-rules.xml. To test whether an item (i.e. a person identifier) occurs in a transaction, we check whether that person has been either buyer or seller in the closed auction. Now, the return statement is evaluated for each input data item. The result, stored in my-tuples.xml, is a sequence of <inst> tags, each encoding a <closed_auction> if it satisfies the constraint.

for data $d in doc("xmark")//closed_auction

using model $r in doc("my-rules")/PMML

bind $pers := $d/(seller|buyer)//@person

return <inst>{if (xquake:rule-satisfy($r) > 9)

then $d else () }</inst>
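The selection performed by this query can be sketched in Python as follows. The sketch follows the definition of rule satisfaction given earlier (every item of the rule occurs in the transaction); the function names, the rule encoding as (body, head) pairs and the sample data are assumptions for illustration, not part of XQuake.

```python
def satisfies(transaction, rule):
    """A transaction satisfies a rule when every item of the rule
    (body and head alike) occurs in the transaction."""
    body, head = rule
    return set(body) | set(head) <= set(transaction)

def count_satisfied(transaction, rules):
    """Number of rules satisfied by the transaction."""
    return sum(1 for r in rules if satisfies(transaction, r))

def select_instances(transactions, rules, min_rules):
    """Keep the transactions that satisfy at least `min_rules` rules."""
    return [t for t in transactions if count_satisfied(t, rules) >= min_rules]

# Assumed toy rules and closed auctions (seller/buyer identifiers).
rules = [(["Mark"], ["John"]), (["Ann"], ["Mark"])]
closed = [["Mark", "John"], ["Ann", "Mark", "John"], ["Ann"]]
print(select_instances(closed, rules, min_rules=2))
```

Here `count_satisfied` corresponds loosely to the value compared against 9 in the query, with the threshold lowered to 2 to fit the toy data.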

Finally, a dummy PMML classification tree is induced from my-tuples.xml. It is built on the price and quantity properties of each closed auction to predict the type of the auction (i.e. regular or featured).

for data $auc in doc("my-tuples")

let active $price := $auc//price

let active $qty := $auc//quantity

let predicted $type := $auc//type

return default

As a general comment, in this simple example we have assumed that the result of each step is stored so that it can be used as input to the next step. Moreover, we have specified neither the kind of knowledge to extract nor the mining algorithm to use in the query fragments. Sect. 2.3 covers these aspects and presents a more elegant way to combine mining results.

2.1.2 Constructs for Locating XML Data

The ﬁrst step in specifying a data mining task is the

selection of the relevant data as input of the analysis.

Relevant XML elements and attributes are selected by

means of the clauses depicted in ﬁg. 2 (a),(b).

The syntax of the for data expression (fig. 2 (a)) is similar to the for clause of XQuery. It sets up an iteration over the sequence returned by the expression after the in keyword. Each item of the sequence is bound to a variable that can be used in the rest of the expression. The optional at clause allows for a positional variable, which is bound to an integer representing the iteration number.

The let clause (fig. 2 (b)) is used to bind a variable to a mining field. The keyword after let refers to the role of such an attribute in the mining activity of interest. More specifically, the active keyword specifies that the field is used as input to the mining task, predicted specifies that it is a predicted attribute (e.g. in a classification task), supplementary states that it holds additional descriptive information and, finally, group groups atomic values (e.g. in an association or sequence analysis). Mining fields given as input to the mining tasks are required to be atomic (e.g. string, numeric or date), except for a supplementary field that, in principle, can assume any complex XML type. It can be used to hold background knowledge useful for evaluating constraints. Active fields also admit a special (and optional) syntax to express an atomic sequence of an explicit size in a let active specification. This facility is particularly useful when a large number of XML fields are used in the analysis.

2.1.3 Constructs for Locating PMML Models and Binding New Data

A similar syntax may be used to locate (parts of) a (new or extracted) pattern, represented via PMML. As shown in fig. 2 (c), a variable is bound to each item of the sequence resulting from the evaluation of the expression that follows the in clause. Unlike in a for data clause, each item of the sequence is now a single mining model or a set of homogeneous patterns (i.e. either a set of classification trees or frequent itemsets or association rules) sharing the same mining schema. The kind of knowledge is specified by means of a special keyword following the for pmml expression. Importantly, since the structure of a PMML model is fixed, the user has to specify only the root of the model(s) (i.e. the <PMML> element(s)).

Footnote: The mining schema lists the fields used by the model, specifying their usage type, outlier treatment, missing values replacement policy and so on.

Often, new data has to be used in a model context. Consider, for example, the case in which a confusion matrix is constructed from a predictive model by classifying a test set, or, vice versa, association rules are used to determine which instances violate them. The using clause of fig. 2 (d) accomplishes both tasks. A keyword after using distinguishes between “evaluating a model over a dataset” (we say in this case that the data is bound to the model) and “evaluating a dataset over a model” (i.e. the model is bound to the data). The using data and using model clauses are used in the first and second case, respectively. In the former, the idea is to set up an iteration over a sequence to bind each item to a variable. Such a variable can be used in the following bind expression. Here, each (non-supplementary) field belonging to the mining schema of the given mining models is bound to new data, by evaluating the expressions after the assignment symbol. Such binding is by name and type, i.e. each variable of the bind clause must coincide, in name and type, with a field of the mining schema. The using model clause is similar, but it specifies mining models after the in keyword and it binds such models to the new data given in the bind statement.
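The "binding by name and type" rule can be illustrated with a small Python check. This is only a sketch of the idea: the schema encoding as a dictionary, the use of Python types in place of XML Schema types and the function name `check_binding` are all assumptions, and the actual checks performed by XQuake may differ.

```python
def check_binding(mining_schema, bindings):
    """Check that bound variables match the mining schema by name and type.

    mining_schema: dict mapping field name -> (usage, expected Python type)
    bindings:      dict mapping variable name -> bound value
    Returns a list of problems (empty when the binding is valid).
    """
    problems = []
    # Supplementary fields do not have to be bound.
    required = {name: typ for name, (usage, typ) in mining_schema.items()
                if usage != "supplementary"}
    for name, typ in required.items():
        if name not in bindings:
            problems.append(f"missing field: {name}")
        elif not isinstance(bindings[name], typ):
            problems.append(f"wrong type for field: {name}")
    for name in bindings:
        if name not in mining_schema:
            problems.append(f"unknown field: {name}")
    return problems

# Assumed schema, loosely modelled on the classification example.
schema = {
    "price": ("active", float),
    "qty": ("active", int),
    "type": ("predicted", str),
    "note": ("supplementary", str),
}
print(check_binding(schema, {"price": 12.5, "qty": 1, "type": "Regular"}))
print(check_binding(schema, {"price": "12.5", "qty": 1}))
```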

2.1.4 Constructs for Constraints and Output Specification

We offer a simple and elegant way to express constraints useful for filtering an inferred mining model. As shown in fig. 2 (e), a simple XQuery predicate following the keyword having is used. We define a library of built-in functions to refer to (the main parts of) such output inside the XQuery predicate. This solution has two main advantages. First, it avoids specialized constructs, so constraints are expressed more declaratively. Second, the user has to know only the signature and meaning of the external built-in functions to apply constraints. As an example, to filter out uninteresting itemsets, a built-in library offers special functions to get their size, support and other interestingness measures, the complete list of the items belonging to the itemset, as well as the background knowledge related to these items.

A similar strategy is used to let users define their own output, both for data and for mining models. The basic idea is to use, inside an XQuery expression (fig. 2 (f)), built-in functions that encapsulate the main parts of the result. However, since the output may have a very complex structure (e.g. in the case of mining models), a default output can be specified by means of the return default clause, which yields a PMML document for mining models.

2.2 Mining Operators

In this section we integrate the running example of

sect. 2.1.1 with additional examples of the mining

operators, according to the aforementioned specification. Preprocessing, model extraction, filtering and

deploying tasks are shown.

2.2.1 Preprocessing

Several preprocessing and data preparation tasks for sorting, selecting and filtering XML data can be directly obtained through the use of XQuery constructs. However, since data preprocessing is a time-consuming phase, ad-hoc constructs have to be designed for cleaning, discretization, aggregation, sampling and many other tasks.

The syntax of a preprocessing operator admits a for data clause followed by a combination of let clauses (whose number and order depend on the kind of task), and by a return clause. In the following example, the value of the <price> element in each <closed_auction> is discretized. The result is encoded in a sequence of <p> XML tags.

for data $auc in doc("xmark")//closed_auction

let predicted $price := $auc/price

return <p>{xquake:discr($price)}</p>

Notice the usage of the built-in function discr(.) in the return clause, which returns the discretized value of its numeric argument.
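The text does not state which discretization method discr(.) applies. Purely as an illustration, the Python sketch below uses equal-width binning, which is an assumption of this example and not a description of XQuake; the function name and the sample prices are likewise hypothetical.

```python
def equal_width_discretizer(values, bins=3):
    """Build a function mapping a number to the index of its equal-width bin.

    The bin boundaries are derived from the minimum and maximum of `values`.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins

    def discr(x):
        if width == 0:
            return 0  # all values are identical: a single bin
        idx = int((x - lo) / width)
        return min(idx, bins - 1)  # the maximum falls into the last bin

    return discr

# Assumed prices of four closed auctions.
prices = [10.0, 20.0, 35.0, 70.0]
discr = equal_width_discretizer(prices, bins=3)
print([discr(p) for p in prices])
```

As in the query, the discretizer needs to see the whole input sequence (here to compute the bin boundaries) before it can map a single value.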

2.2.2 Model Extraction

Mining models are directly inferred from XML data. The specification of a model extraction operator includes a for data statement, to specify the input XML nodes, followed by a combination of let clauses, to specify the fields (active, predicted or group) given as input to the algorithm or the background information. The latter can be used, for example, to specify, by means of an optional having clause, domain-based constraints on the output model. A return statement closes the operator.

As an example, we can extend the ﬁrst query of

the running example by introducing a more complex

constraint to reduce the number of generated rules.

Below, the query also speciﬁes that, in each rule, ev-

ery person in the antecedent bought at least two items

in the closed auction history.


for data $auc in doc("xmark")//open_auction

let group $pers := $auc/bidder//@person

let supplementary $count-buy := count(

for $i in doc("xmark")//closed_auction

where $i/buyer/@person eq $xquake:item

return $i)

having every $j in xquake:body-context()

satisfies $j > 1

return default

Here, after selecting the transactions (i.e. the auctions) and the items (i.e. the person identifiers), the count-buy variable holds, for each distinct person, the number of items bought by that person. To this purpose, a join between the person identifier (referred to by the special variable $xquake:item) and the set of <closed_auction> elements has been used. For each mined association rule, the built-in function body-context() in the having clause returns the context information (i.e. a sequence of int values) related to the body items of that rule.
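The role of the supplementary field and of body-context() can be mimicked in Python as follows. This is a hedged sketch rather than the XQuake implementation: the dictionary encoding of closed auctions, the (body, head) encoding of rules and all function names are assumptions.

```python
from collections import Counter

def purchases_per_person(closed_auctions):
    """Background knowledge: number of closed auctions won by each buyer."""
    return Counter(a["buyer"] for a in closed_auctions)

def body_context(rule, context):
    """Context values (purchase counts) of the items in the rule body."""
    body, _head = rule
    return [context.get(item, 0) for item in body]

def keep_rule(rule, context, min_purchases=2):
    """Constraint: every person in the body bought at least `min_purchases` items."""
    return all(c >= min_purchases for c in body_context(rule, context))

# Assumed closed auctions: Mark bought twice, John once.
closed = [{"buyer": "Mark"}, {"buyer": "Mark"}, {"buyer": "John"}]
context = purchases_per_person(closed)
print(keep_rule((("Mark",), ("John",)), context))
print(keep_rule((("John",), ("Mark",)), context))
```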

2.2.3 Model Filtering, Application and Evaluation

The extracted knowledge can be filtered according to a condition that, in principle, can be applied to every model. The general syntax begins with the for pmml clause, in which one has to specify the kind of model, followed by a having and a return clause. Similar operators are used to apply an extracted model to new data, to predict features, to select data according to the knowledge stored in the model, or to evaluate the model itself. In these cases, a using clause is used to bind data to the knowledge.

As an example, consider the third query of the running example and let us suppose that we have induced a set of trees, stored in my-trees.xml. Below, their mining schema is shown:

</MiningSchema>...

Here, the PMML <MiningSchema> lists the fields (i.e. name and usage) which a user has to provide in order to apply the model. The first query below filters out those trees having a training confidence lower than 50% for type = "regular" in the root node, where the path expression in the let clause returns the PMML <ScoreDistribution> element of the root node (see (The Data Mining Group, 2011)).

for pmml tree $t in doc("my-trees")/PMML

having let $d := $t//Node/ScoreDistribution

[@value eq "regular"]

return $d/@confidence > 0.5

return $t

Given new XML data compliant with the mining schema above, the next two queries return the set of PMML confusion matrices (one for each tree) constructed on such data and the predicted values of the target field, respectively.

for pmml tree $t in doc("my-trees")/PMML

using data $d in doc("xmark")//closed_auction

bind $price := $d/price,

$qty := $d/quantity,

$type := $d/type

return <tree>{xquake:conf-matrix()}</tree>

for data $d in doc("xmark")//closed_auction

using model $t in doc("my-trees")/PMML

bind $price := $d/price,

$qty := $d/quantity,

$type := $d/type

return <classes>{xquake:class($t)}</classes>

Observe that the two queries above have a similar syntax, but different semantics. The first one evaluates the expression in the return clause for each input tree. At each iteration, it sets up a cycle over the data sequence to construct the confusion matrix and to compute the evaluation metrics. The second one returns a <classes> element for each item of the input data sequence. Each <classes> tag encapsulates the predicted values, so that its size coincides with the number of input trees. Given a set of association patterns (resp. rules), similar operators can be used to get the contingency tables of each itemset (resp. rule), or to predict the instances that violate/satisfy those itemsets (resp. rules).
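The difference between the two iteration orders can be made concrete with a short Python sketch. Models are represented here as plain functions from a record to a class label, which is an assumption of the example; XQuake works on PMML models and XML nodes instead.

```python
from collections import Counter

def confusion_matrices(models, data, target):
    """One matrix per model: outer loop over models, inner loop over data.

    Each matrix maps (actual class, predicted class) to a count.
    """
    result = []
    for model in models:
        matrix = Counter()
        for row in data:
            matrix[(row[target], model(row))] += 1
        result.append(dict(matrix))
    return result

def predictions(models, data):
    """One entry per data item: outer loop over data, inner loop over models."""
    return [[model(row) for model in models] for row in data]

# Two assumed toy classifiers and three assumed closed auctions.
model_a = lambda r: "Regular" if r["price"] < 50 else "Featured"
model_b = lambda r: "Regular"
data = [
    {"price": 10, "type": "Regular"},
    {"price": 80, "type": "Featured"},
    {"price": 60, "type": "Regular"},
]
print(confusion_matrices([model_a, model_b], data, "type"))
print(predictions([model_a, model_b], data))
```

The first result has one element per model, the second one element per data item, each holding as many predictions as there are models.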

2.3 Putting it all Together

At this point, one should note that in the simple queries above we have defined neither the kind of knowledge mined, nor the mining algorithm used, nor, more importantly, how to deploy a mining operator inside a KDD process. From this latter perspective, two important aspects have to be modelled: iteration and interaction. KDD is an interactive, iterative and multi-step process in the sense that, at any stage, the user should be able to choose different algorithms/parameters, to evaluate a condition that selects a “then” branch or an “else” branch, or to iteratively repeat some step to achieve better results. Also, a language supporting a KDD process should include constructs encouraging the reuse of (parts of) a previously defined process, so that this sub-query (i.e., sub-process) can easily be integrated inside a more complex one.

To make the KDD process modular and reusable, XQuake adds to XQuery the capability of defining special user-defined mining functions whose body is made up of a mining operator. Below, an example of a mining function declaration is shown:

declare mining function

my-nmspace:my-fun($my-param as xs:int) {

< mining operator >

};

As for standard functions and variables, user-defined mining functions can be called either from almost any place in a query or from an external mining module. For example, they can be invoked inside a FLWOR, conditional, switch or quantified expression, as well as in mining functions themselves. The syntax of a mining function call is the same as that of any other function, except for the first argument, which is an algorithm specification with its parameters. For example, to call the function above by using the apriori algorithm with a minimum support and confidence of 10%, one might use:

my-nmspace:my-fun(rules:apriori(0.10,0.10), 1)
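The calling convention, in which the algorithm specification travels as the first argument of the mining function, resembles passing a configuration object to a higher-order function. The Python sketch below only mirrors that convention; `AlgorithmSpec`, `rules_apriori` and `my_fun` are hypothetical names, and the body of `my_fun` merely reports what it would run.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlgorithmSpec:
    kind: str      # kind of knowledge, e.g. "rules"
    name: str      # algorithm name, e.g. "apriori"
    params: tuple  # algorithm parameters

def rules_apriori(min_support, min_confidence):
    """Counterpart of rules:apriori(...): build an algorithm specification."""
    return AlgorithmSpec("rules", "apriori", (min_support, min_confidence))

def my_fun(algorithm, my_param):
    """Stand-in for a mining function: it only describes the requested task."""
    return f"{algorithm.kind}:{algorithm.name}{algorithm.params} with my-param={my_param}"

print(my_fun(rules_apriori(0.10, 0.10), 1))
```

Keeping the algorithm outside the function body is what lets the same mining function be reused with different algorithms or parameters.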

The rules namespace indicates the kind of knowledge to be mined, in this case association rules. Currently, XQuake supports discretization (discr), sampling (sampl), the generation of frequent itemsets (itemsets), rules (rules) and classification trees (trees), as well as their filtering, evaluation and usage.

3 APPLICATION SCENARIO

This section reports a concrete usage of XQuake. The goal is to present a simple KDD process (taken from our real experience in data mining) to show how XQuake is particularly suitable for supporting an inductive database framework. An additional example of a KDD process based on frequent itemsets can be found in (Romei and Turini, 2011b).

In xmark, about 50% of the users provide a profile with personal information (see the <profile> tag in the first XML fragment of fig. 1). Also, they specify their categories of interest (e.g. music or sport auctions) during the registration process. Among the registered users, only a subset (about 45%) provides personal information on the age. The idea is to use the other personal information to predict that missing information. To this purpose, we aim at extracting a classification tree able to discriminate age based both on the other personal information and on the specified interests. Knowing the missing information of a registered user makes it possible to offer, at access time, personalized banners, promotions or news. The overall process is schematized in fig. 3.

Figure 3: A sample KDD process based on classification.

[Discretization]. In order to use a classification algorithm, the age information is discretized into three distinct intervals, young, mid and old. The overall data is then partitioned into two samples for training and testing. At this stage, we do not use a sampling algorithm; rather, the users are split into those having provided a phone number (about 50%) and the others. The use of the phone information offers a reasonably randomized partition.

[Bagging Classification]. Accuracy can be increased via bagging classification. More specifically, a classifier is trained on a sample of instances drawn with replacement from the training set. This task is repeated $k$ times and, at each iteration, the sample size is equal to the size of the original training set. The output is a set of $k$ inducers, $T = \{t_1, \ldots, t_k\}$. Then, $T$ is filtered according to a condition. Specifically, for each $t_i$, $i \in [1, k]$, these conditions must hold: (i) the overall number of nodes of $t_i$ is below a certain threshold, $\alpha$, and (ii) the accuracy of each leaf that classifies as young in $t_i$ is greater than a parameter, $\beta$. The result of this phase is a new set of inducers, $T' = \{t_1, \ldots, t_h\}$, with $h \le k$. Notice that the first condition above tries to reduce the complexity of the trees, avoiding those subject to overfitting. The second one makes it possible to consider only those trees that are more precise in predicting young users. The surviving classifiers are composed to generate a bagged classifier, $t'$, that returns the class that has been predicted most often by means of a voting method among the $t_i \in T'$.

[Evaluation]. Once the composed tree, $t'$, has been constructed, it can be applied to a test set to evaluate its performance in terms of an accuracy error. If the resulting accuracy is greater than a given threshold, $\gamma$, then $t'$ is returned. Otherwise, the filtering task is repeated on $T'$ by using a more stringent value of the $\alpha$ parameter, with the aim of filtering out additional trees with a high number of nodes. The surviving trees $T'' = \{t_1, \ldots, t_j\}$, with $j \le h$, are composed and the procedure is repeated until the condition on the accuracy is fulfilled or $T'' = \emptyset$.
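The two building blocks of the bagging step, sampling with replacement and voting, can be sketched in Python as follows. The function names are assumptions, classifiers are modelled as plain functions, and ties in the vote are broken arbitrarily; none of this is taken from the XQuake implementation.

```python
import random
from collections import Counter

def bootstrap_sample(training_set, rng):
    """Draw len(training_set) instances with replacement."""
    return [rng.choice(training_set) for _ in training_set]

def majority_class(votes):
    """Class predicted most often among the votes."""
    return Counter(votes).most_common(1)[0][0]

def bagged_predict(classifiers, instance):
    """Combine the surviving classifiers by voting on a single instance."""
    return majority_class([clf(instance) for clf in classifiers])

# Assumed toy data: a training set of four instances and three classifiers.
rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4], rng)
classifiers = [lambda x: "young", lambda x: "old", lambda x: "young"]
print(len(sample), bagged_predict(classifiers, None))
```

Training $k$ classifiers then amounts to calling an inducer on $k$ independent bootstrap samples.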

The KDD process just described can be implemented in XQuake as reported in fig. 4. In the registered-users module a set of mining and XQuery functions is defined.

The discretizer() function discretizes the <age> XML element for each person having specified a value for the age. In the result, we append, to each <person> element, an <age-discr> tag containing that discretized value. The sampler(.) function gets as input a sequence of <person> elements and it uses the built-in function num-times-in-sample($i) to get the number of times the current item (i.e. person) of the sequence belongs to the sample of index $i (we recall that we have a single sample with index 1 and a replacement strategy is used). Below, a fragment of the output is shown:

<interest category="category11"/>...

...

</profile>

<age-discr>young</age-discr>

</person>...

The inducer(.) function extracts a classification tree given a sequence of <person> elements (i.e. the training set). Active fields of the task are the sub-elements <education> and <business> of the profile, as well as the top-five interests specified by each user. For the sake of brevity, we suppose a user-defined XQuery function select-interests($p, $n) (not shown in fig. 4) is defined. Given an XML <person> element, $p, and the number of required categories, $n, that function returns a sequence of boolean values of size $n. Each boolean value indicates whether the person $p has an interest in the $i$-th category, with $i \in [1, n]$.
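Since select-interests is not shown in the paper, the following Python function is only one plausible reading of its description. In particular, the paper does not say how the n categories are chosen, so this sketch takes them as an explicit argument; that choice and the names used are assumptions.

```python
def select_interests(person_interests, top_categories):
    """Boolean vector: does the person list each of the chosen categories?

    person_interests: category identifiers found in the person's profile
    top_categories:   the n categories to test, in a fixed order
    """
    interests = set(person_interests)
    return [category in interests for category in top_categories]

# Assumed profile with two interests, tested against three categories.
print(select_interests(["category11", "category3"],
                       ["category1", "category3", "category11"]))
```

A fixed-size boolean vector of this kind is what makes the `let active<5>` declaration of fig. 4 applicable: every person yields the same number of atomic fields.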

The filtering module is implemented in the filter(.) function. It takes as input a sequence of PMML trees and the α and β parameters. It also uses a built-in function, namely xquake:tree-leaves($t), in the having clause, to get the list of leaves as PMML elements.

Finally, the composition, classification and evaluation are performed by means of the function bagging(.). It takes a set of PMML trees and a test set as a sequence of (discretized) <person> elements. Then, it sets up an iteration in which, for each <person>, the predicted values of the <age> element are collected for each classification tree. This is achieved via the classifier(.) mining function, which returns a sequence of predicted classes (containing one value for each tree) for each item of the input sequence. Such single predictions are used to predict the target attribute, according to a majority strategy (XQuery function majority-class(.), not shown in fig. 4). A sequence of misclassified values is returned, as shown in the following XML fragment:

<mis>mid young</mis>

<mis>old young</mis>

<mis>old mid</mis>...

The overall process is driven by an XQuery program. It takes as parameters the number of iterations, k, and the α, β and γ thresholds. Also, it uses the recursive XQuery function tester(.) to filter, compose and evaluate the induced trees until the condition on the accuracy is satisfied or no more trees survive the filter. This is an elegant way to simulate iterations depending on a condition.

4 RELATED WORK

The exploitation of XML as a flexible and extensible

instrument for IDBs has been studied in (Euler et al.,

2006; Romei et al., 2006; Meo and Psaila, 2006).

RapidMiner (Euler et al., 2006) is an environment

for KDD and machine learning in which experiments

are described via XML documents. While the graph-

ical user interface supports interactive design, the un-

derlying XML representation enables automated ap-

plications after the prototyping phase.

KDDML (Romei et al., 2006) and XDM (Meo and

Psaila, 2006) are the most related works. In both,

the KDD process is modeled as an XML document

and the description of an operator application is en-

coded by means of an XML element. They integrate

XQuery expressions into the mining process. For in-

stance, XDM encodes XPath expressions into XML

attributes to select sources for the mining, whilst KDDML uses an XQuery expression to evaluate a condition. XQuake does not use XML for the process

representation, but rather it directly extends XQuery

to achieve a better expressiveness in representing the

KDD process.

XML data mining is used in an instrumental

way in (Baralis et al., 2007), to construct summarized

representations of XML data. The authors propose

to extract association rules from XML databases as

the basis for a pattern based representation of XML

datasets. The idea is to use the patterns to answer

queries on the datasets.

Finally, we mention (Blockeel et al., 2008) for the definition of a relational-based inductive database.


mining module namespace reg = "registered-users";

declare mining function reg:discretizer() {

for data $pers in doc("xmark")/site/people/person[not(empty(profile/age))]

let predicted $age := $pers/profile/age

return <pers> {

($pers, <age-discr>{xquake:discr($age)}</age-discr>)

} </pers>

};

declare mining function reg:sampler($dataset as node()*) {

for data $person in $dataset

return (for $i in (1 to xquake:num-times-in-sample(1)) return $person)

};

declare mining function reg:inducer($training-set as node()*) {

for data $person in $training-set

let active $education := $person//profile/education

let active $is-business := $person//profile/business

let active<5> $interests := reg:select-interests($person, 5)

let supplementary $age := $person//age-discr

return default

};

declare mining function reg:filter($trees as node()*, $alpha, $beta) {

for pmml tree $t in $trees

having (count($t//Node) <= $alpha) and

(every $i in xquake:tree-leaves()

satisfies $i/@value eq "young" and $i/@confidence > $beta)

return $t

};

declare function reg:bagging($trees, $test-set) {

for $pers in $test-set

let $pred := reg:majority-class(reg:classifier(trees:apply(), $trees, $pers))

return if ($pred != $pers/age-discr)

then <mis>{($pers/age-discr,$pred)}</mis> else ()

};

declare mining function reg:classifier($trees, $test-set) {

for data $person in $test-set using model $t in $trees/PMML

bind $education := $person//profile/education,

$is-business := $person//profile/business,

$interests := reg:select-interests($person, 5),

$age := $person//age-discr

return default

};

declare function reg:tester($trees, $test-set, $alpha, $beta, $gamma) {

let $trees := reg:filter(trees:filter(), $trees, $alpha, $beta)

return if ((count(reg:bagging($trees, $test-set)) <= $gamma) or (empty($trees)))

then $trees else reg:tester($trees, $test-set, $alpha - 5, $beta, $gamma)

};

Figure 4: The registered-user mining module implementing the KDD process of Fig. 3.
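Figure 4 calls reg:majority-class without showing its body. As a minimal sketch in plain XQuery (no XQuake extensions), a majority vote over the labels predicted by the individual trees could look as follows. The name local:majority-class and the assumption that predictions arrive as a flat sequence of strings are hypothetical; the actual helper used by the authors may differ.

```xquery
(: Hypothetical helper, not the authors' reg:majority-class.
   Assumes each tree's prediction has already been reduced to a string label.
   Returns the most frequent label, or the empty sequence for empty input.
   When two labels are equally frequent, which one is returned is
   implementation-dependent. :)
declare function local:majority-class($labels as xs:string*) as xs:string? {
  (
    for $label in distinct-values($labels)
    order by count($labels[. = $label]) descending
    return $label
  )[1]
};

(: Example: two trees vote "young", one votes "old". :)
local:majority-class(("young", "old", "young"))
```

In this example the call evaluates to "young", which reg:bagging would then compare against the discretized age of the test instance.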

For a recent and complete review on inductive

databases see (Romei and Turini, 2011a).

5 CONCLUSIONS

XQuake is a new implementation of an inductive database system
over XML data. The scenario presented in this paper gives an idea
of its potential and advantages. First, XML data is mined where it
resides, in a native XML database. Second, great attention has been
paid to the closure principle: the examples highlight the ability to
combine the results of knowledge extraction in order to evaluate
certain indicators, to compose preprocessing, data mining and
post-processing, and to use background knowledge to filter models.
Finally, the KDD process now has an integrated view and can easily
be made modular and parametric. Tab. 1 summarizes the main features
of XQuake, according to the inductive database principles.

KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval

Table 1: Summary of the XQuake language.

Inductive database requirement       XQuake perspective
-----------------------------------  -------------------------------------------------------
Data and model storage               Native XML database (models represented via PMML)
KDD process representation           XQuery program + special mining functions
KDD process parametrization          Parametrization of XQuery functions
Closure principle                    Achieved by means of the XQuery closure
Constraints & interesting measures   XQuery expression + built-in function library
Output specification                 XQuery expression (optional) + built-in function library
Data binding                         Based on the PMML mining schema

Since our project aims at a completely general solution for XML
data mining, some further extensions need in-depth investigation.
Ongoing work concerns the integration of both further kinds of
knowledge (specifically, sequential patterns) and a rich library of
mining algorithms. We are also working on the formal semantics of
XQuake. Future work can go in two (often orthogonal) directions:
(i) the exploitation of ontologies to represent metadata (on the
expressiveness side), and (ii) the study of query rewriting
techniques for optimization purposes (on the architectural side).
The study of more sophisticated high-level GUIs for the design of
the queries is another aspect to be considered in the future.

REFERENCES

Baralis, E., Garza, P., Quintarelli, E., and Tanca, L. (2007).

Answering XML queries by means of data summaries.

ACM Trans. Inf. Syst., 25(3):1–10.

Blockeel, H., Calders, T., Fromont, E., Goethals, B., Prado,

A., and Robardet, C. (2008). An inductive database

prototype based on virtual mining views. In KDD,

pages 1061–1064, New York, NY, USA. ACM.

Euler, T., Klinkenberg, R., Mierswa, I., Scholz, M., and

Wurst, M. (2006). YALE: rapid prototyping for com-

plex data mining tasks. In KDD ’06, pages 935–940,

Philadelphia, PA, USA.

Holupirek, A., Grün, C., and Scholl, M. (2009). BaseX and

DeepFS - Joint Storage for Filesystem and Database.

In EDBT, pages 1108–1111, Saint Petersburg, Russia.

ACM.

Meo, R. and Psaila, G. (2006). An XML-based database for

knowledge discovery. In EDBT ’06, pages 814–828,

Munich, Germany.

Romei, A., Ruggieri, S., and Turini, F. (2006). KDDML: a

middleware language and system for knowledge dis-

covery in databases. Data Knowl. Eng., 57(2):179–

220.

Romei, A. and Turini, F. (2010). XML data mining. Softw.,

Pract. Exper., 40(2):101–130.

Romei, A. and Turini, F. (2011a). Inductive database lan-

guages: requirements and examples. Knowl. Inf. Syst.,

26(3):351–384.

Romei, A. and Turini, F. (2011b). Programming the KDD

process using XQuery. Technical Report (extended

version) TR-11-10, University of Pisa, Department of

Computer Science.

Schmidt, A., Waas, F., Kersten, M., Carey, M. J.,

Manolescu, I., and Busse, R. (2002). XMark: a bench-

mark for XML data management. In VLDB, pages

974–985.

The Data Mining Group (2011). The Predictive

Model Markup Language (PMML). Version 4.0.1.

www.dmg.org/pmml-v4-0-1.html.

W3C (2010). XQuery 3.0: An XML Query Lan-

guage. W3C Working Draft 14 December 2010.

www.w3.org/TR/xquery-30/.
