A TRANSACTIONAL APPROACH TO ASSOCIATIVE XML

CLASSIFICATION BY CONTENT AND STRUCTURE

Gianni Costa, Riccardo Ortale and Ettore Ritacco

ICAR-CNR, Via P. Bucci 41C, 87036 Rende (CS), Italy

Keywords:

XML mining, XML transactional modeling, Associative classiﬁcation.

Abstract:

We propose XCCS, short for XML Classification by Content and Structure, a new approach for the induction of intelligible classification models for XML data, which are a valuable support for more effective and efficient XML search, retrieval and filtering. The idea behind XCCS is to represent each XML document as a transaction in a space of boolean features that are informative of its content and structure. Suitable algorithms are developed to learn associative classifiers from the transactional representation of the XML data. XCCS induces very compact classifiers that are more effective than several established competitors.

1 INTRODUCTION

XML is a popular model for data representation that allows textual content to be organized into (possibly irregular) logical structures.

The supervised classiﬁcation of XML data into

predeﬁned classes consists in learning a model of the

structural and content regularities (observed across a

set of pre-classiﬁed XML documents), that discrim-

inate each individual class. The resulting classiﬁer

can, hence, predict the class of a previously unseen

XML document from the same applicative domain,

by looking at its structure and content.

A wide variety of approaches to XML classiﬁ-

cation have been proposed in the literature, includ-

ing (Theobald et al., 2003; Yi and Sundaresan, 2000;

Garboni et al., 2006; Knijf, 2007; Zaki and Aggar-

wal, 2006). These efforts can be divided into two ma-

jor categories. One family of approaches uses only

the structural information of XML data in classiﬁer

induction and class prediction, such as in (Garboni

et al., 2006; Knijf, 2007; Zaki and Aggarwal, 2006).

An inherent limitation of such approaches is the inability to discriminate the classes when all of the available XML documents share an undifferentiated structure. Another family of approaches, such as

the ones in (Theobald et al., 2003; Yi and Sundare-

san, 2000), performs a more sophisticated separation

of the classes, by considering both the content and

structural information of XML data. Unfortunately,

despite their effectiveness, these approaches do not

provide explicative classiﬁcation models, i.e., concise

and human-intelligible summarizations of the content

and structural regularities that discriminate the indi-

vidual classes. Such classiﬁcation models have the

potential to offer an in-depth and actionable under-

standing of the relevant properties of very large cor-

pora of XML data and, hence, are of great practical

interest in all those settings in which XML classiﬁca-

tion is preliminarily performed to enable more effec-

tive and efﬁcient XML search, retrieval and ﬁltering.

In this paper, we propose XCCS, a new approach

to XML Classiﬁcation by Content and Structure,

that relies on solid and well-understood foundations.

XCCS performs associative classiﬁcation on the avail-

able XML data to induce an easily interpretable and

highly expressive predictive model. The latter is a

compact set of rules, which discriminate the generic

class from the others by means of content and struc-

tural regularities, that frequently occur in the class and

are positively correlated with the class itself.

We identify suitable features of the XML docu-

ments, that are informative of their content and struc-

ture, and represent each XML document as a trans-

action in the resulting feature space. Additionally,

we design algorithms to perform associative classiﬁ-

cation on the transactional representation of the XML

data. The devised algorithms handle skewed class distributions, which are often encountered in the XML domain. To the best of our knowledge, XCCS is the first approach that borrows the advantages of associative classification, i.e., a high degree of both interpretability and expressiveness (that are well known and studied in the literature) coupled with a robust effectiveness. As a further contribution, the latter is investigated by comparatively evaluating XCCS over several XML corpora. Empirical evidence shows that XCCS scales to very large XML corpora, from which it induces very compact classifiers of superior effectiveness.

Costa G., Ortale R. and Ritacco E.
A TRANSACTIONAL APPROACH TO ASSOCIATIVE XML CLASSIFICATION BY CONTENT AND STRUCTURE.
DOI: 10.5220/0003662401040113
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2011), pages 104-113
ISBN: 978-989-8425-79-9
© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

The paper proceeds as follows. Section 2 intro-

duces notation and preliminaries. Section 3 discusses

the XCCS framework. Section 4 presents the results

of a comparative evaluation of several classiﬁers in-

duced within XCCS. Finally, section 5 concludes and

highlights future research.

2 PRELIMINARIES

We introduce the notation used throughout the pa-

per as well as some basic concepts. The structure

of XML documents without references can be mod-

eled in terms of rooted, labeled trees, that represent

the hierarchical relationships among the document el-

ements (i.e., nodes).

Definition 2.1. XML Tree. An XML tree is a rooted, labeled tree, represented as a tuple $t = (r_t, V_t, E_t, \lambda_t)$, whose individual components have the following meaning. $V_t \subseteq \mathbb{N}$ is a set of nodes and $r_t \in V_t$ is the root node of $t$, i.e., the only node with no entering edges. $E_t \subseteq V_t \times V_t$ is a set of edges, capturing the parent-child relationships between nodes of $t$. Finally, $\lambda_t : V_t \mapsto \Sigma$ is a node labeling function and $\Sigma$ is an alphabet of node tags (i.e., labels).

In the above deﬁnition, the elements of XML doc-

uments and their attributes are not distinguished: both

are mapped to nodes in the XML tree representation.

Hereafter, the notions of XML document and XML

tree are used interchangeably.

Let $t$ be a generic XML tree. Nodes in $V_t$ divide into two disjoint subsets: the set $L_t$ of leaves and the set $V_t - L_t$ of inner nodes. An inner node has at least one child and contains no textual information. A leaf is instead a node with no children, that can contain only textual information.

A root-to-leaf path $p_l$ in $t$ is the sequence of nodes encountered in $t$ along the path from the root $r_t$ to a leaf node $l$ in $L_t$, i.e., $p_l = \langle r_t, \ldots, l \rangle$. Notation $\lambda_t(p_l)$ denotes the sequence of labels that are associated in the XML tree $t$ to the nodes of path $p_l$, i.e., $\lambda_t(p_l) = \langle \lambda_t(r_t), \ldots, \lambda_t(l) \rangle$. The set $paths(t) = \{p_l \mid l \in L_t\}$ groups all root-to-leaf paths in $t$.

Let $l$ be a leaf in $L_t$. The set $terms(l) = \{\lambda_t(p_l).w_1, \ldots, \lambda_t(p_l).w_h, \lambda_t(p_l).\varepsilon\}$ is a model of the information provided by $l$. Elements $\lambda_t(p_l).w_i$ (with $i = 1, \ldots, h$) are as many as the distinct term stems in the context of $l$ and seamlessly couple content information with its structural context. Therein, $w_i$ is some term stem (obtained in the first step of the preprocessing in subsection 4.2) and $\lambda_t(p_l)$ acts as a prefix specifying the location of $l$ within the XML tree $t$, which makes it possible to distinguish the occurrences of $w_i$ in the context of distinct leaves. The unique element of the type $\lambda_t(p_l).\varepsilon$ is instead informative only of the location of $l$ within $t$: $\varepsilon$ indicates the null string. $\lambda_t(p_l).\varepsilon$ still provides some (purely structural) information on $l$ when the leaf does not contain textual information.

Leaf terms and their prefixes are chosen as informative features of the XML data, with which to separate the classes of XML trees. Henceforth, for the sake of readability, we will write $p$ instead of $\lambda(p)$ to mean the prefix of a term stem $w$.
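The feature model above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the original approach: the function name `leaf_features` is hypothetical, prefixes are rendered by joining tags with "/", lower-cased whitespace tokens stand in for the term stems of subsection 4.2, and XML attributes are ignored for brevity (whereas Definition 2.1 maps them to nodes).

```python
import xml.etree.ElementTree as ET

EPSILON = ""  # stands for the null string ε


def leaf_features(xml_text):
    """Return the set of prefixed terms (p, w) found in an XML document.

    p is the root-to-leaf label path (tags joined by '/'); w is a lower-cased
    token of the leaf text, used here as a stand-in for a term stem.
    Attributes are ignored in this sketch.
    """
    features = set()

    def emit(path, text):
        prefix = "/".join(path)
        features.add((prefix, EPSILON))        # purely structural feature p.ε
        for token in text.lower().split():
            features.add((prefix, token))      # content-in-context feature p.w

    def visit(node, path):
        path = path + [node.tag]
        children = list(node)
        if not children:                       # a leaf: no child elements
            emit(path, node.text or "")
        for child in children:
            visit(child, path)

    visit(ET.fromstring(xml_text), [])
    return features


doc = "<paper><title>XML mining</title><year/></paper>"
for prefix, term in sorted(leaf_features(doc)):
    print(f"{prefix}.{term or 'ε'}")
```

The empty leaf `year` still contributes its structural feature, which mirrors the role of the ε element in terms(l).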

Deﬁnition 2.2. XML Feature. Let t be an XML tree.

A preﬁxed term (stem) p.w is said to be a feature of t

(or, equivalently, p.w occurs in t), denoted as p.w  t,

if the following two conditions hold. First, there exists a leaf $l \in L_t$ and, hence, a path $p_l \in paths(t)$ such that $\lambda_t(p_l) = p$. Second, $p.w \in terms(l)$.

Assume that S = {p.w|∃t ∈ D such that p.w  t}

is a suitable selection of features from the XML trees

in D . S identiﬁes an expressive feature space, in

which to perform the induction of models for effec-

tive XML classiﬁcation.

Let $\mathcal{D} = \{t_1, \ldots, t_N\}$ be a training database (or, equivalently, a forest) of $N$ XML trees, each of which is associated with one label from the set $\mathcal{L} = \{c_1, \ldots, c_k\}$. Our approach to XML classification can be formalized as learning some suitable model $\mathcal{C} : 2^{\mathcal{S}} \mapsto \mathcal{L}$ of the associations between the occurrence of the chosen features in the XML trees of $\mathcal{D}$ and the class labels of the same XML trees. The resulting classifier $\mathcal{C}$ is useful to predict the unknown class of a previously unseen XML tree $t'$, on the basis of the features occurring in $t'$.

Ideally, the efﬁciency and scalability of XML

classiﬁcation should suffer neither from the dimen-

sionality of S (i.e., the number |S | of features), nor

from the costs of the operations for the manipulation

of tree-like structures. To meet such requirements, the

dimensionality of the feature space corresponding to

S is sensibly reduced in subsection 4.2. Additionally,

XML data is represented in a convenient transactional

form, that avoids the manipulation of XML trees.

The idea behind the transactional representation is that, by looking at the elements in $\mathcal{S}$ as binary features, the available XML data can be projected into a feature space, wherein the occurrence of the individual features within each XML tree is explicitly represented. More precisely, if $\mathcal{S}$ denotes the selected collection of features, the XML trees from $\mathcal{D}$ can be modeled as transactions over a feature space $\mathcal{F} \triangleq \{F_{p.w} \mid p.w \in \mathcal{S}\}$. Here, the generic feature $F_{p.w}$ is a boolean attribute, whose value indicates the presence/absence of the corresponding feature $p.w$ of $\mathcal{S}$ within the individual XML trees.

Let $x^{(t)}$ be the transactional representation of an XML tree $t$. The value of each attribute $F_{p.w}$ within $x^{(t)}$ is true if $p.w$ is a feature of $t$, otherwise it is false. Hence, $x^{(t)}$ can be modeled as a proper subset of $\mathcal{F}$, namely $x^{(t)} \triangleq \{F_{p.w} \in \mathcal{F} \mid p.w \text{ occurs in } t\}$, with the meaning that the features explicitly present in $x^{(t)}$ take value true, whereas the others assume value false. The original database $\mathcal{D}$ can hence be represented in transactional form as $\mathcal{D}^{(\mathcal{F})} = \{x^{(t_1)}, \ldots, x^{(t_N)}\}$, whereas the class associated with the generic transaction is denoted as $class(x^{(t)})$. Hereafter, to keep notation uncluttered, the transactional database and the generic transaction will be denoted, respectively, as $\mathcal{D}$ and $x$.
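As a small illustration of the transactional form, the following Python sketch represents each XML tree by the set of selected features that occur in it. The function names and the toy data are hypothetical; a feature is any hashable value (here, a prefix/term pair), and absent features are implicitly false.

```python
def to_transaction(tree_features, selected):
    """x^(t): the features of the selection S that occur in tree t."""
    return frozenset(tree_features) & frozenset(selected)


def to_transactional_db(labeled_trees, selected):
    """Map (features of t, class of t) pairs to (transaction, class) pairs."""
    return [(to_transaction(f, selected), c) for f, c in labeled_trees]


# Toy selection S and one labeled tree (illustrative values only).
S = {("paper/title", "xml"), ("paper/year", "")}
trees = [({("paper/title", "xml"), ("paper/title", "mining")}, "databases")]
print(to_transactional_db(trees, S))
```

Storing only the features that are true keeps each transaction sparse, which is the usual choice when the feature space is large.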

In this paper, XML classiﬁcation is approached

through associative classiﬁcation (Liu et al., 1998), a

powerful enhancement of conventional rule learning,

that results from the integration of two fundamental

tasks in data mining, namely, association rule min-

ing and classiﬁcation. Associative classiﬁers retain

the advantages of traditional rule-based classiﬁcation

models (i.e., interpretability, expressiveness and high

effectiveness) and, also, tend to achieve a better pre-

dictive performance (Xin and Han, 2003).

The necessary concepts concerning associative

classiﬁcation in the domain of the transactional rep-

resentation of the XML trees are formalized next.

The notion of class association rule is the starting

point.

Deﬁnition 2.3. Class Association Rule. Let F be a

feature space, deriving from the selection of certain

features of the XML data. Also, assume that D (the

so-called training data) is a database of XML trees

represented as transactions over F and that L is a

set of class labels. A class association rule (or, equivalently, a CAR) r : I → c is a pattern that captures the association (i.e., the co-occurrence) in D of some subset I of F with a class label c belonging to L . I and

c are said to be, respectively, the antecedent and con-

sequent of the CAR.

Essentially, a CAR relates the occurrence of a cer-

tain combination of features in a transaction corre-

sponding to an XML tree with one particular class.

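A rule $r : \mathcal{I} \to c$ is said to cover a (labeled or unlabeled) transaction $x \in \mathcal{D}$ (and, dually, $x$ is said to trigger or fire $r$) if the condition $\mathcal{I} \subseteq x$ holds. The set $\mathcal{D}_r$ of transactions covered by $r$ is defined as $\mathcal{D}_r = \{x \in \mathcal{D} \mid \mathcal{I} \subseteq x\}$.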

The notions of support and conﬁdence are em-

ployed to deﬁne the interestingness of a rule r.

Definition 2.4. CAR Support and Confidence. A transaction $x \in \mathcal{D}$ supports a CAR $r : \mathcal{I} \to c$ if it holds that $\mathcal{I} \subseteq x$ and $c = class(x)$. The support of $r$, denoted as $supp(r)$, is the fraction of transactions in $\mathcal{D}$ that support $r$. The confidence or predictive strength of $r$, denoted by $conf(r)$, is defined as $conf(r) = \frac{supp(r)}{supp(\mathcal{I})}$, where $supp(\mathcal{I})$ is the fraction of transactions in $\mathcal{D}$ including the subset $\mathcal{I}$.
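The two measures of Definition 2.4 can be computed directly from a transactional database. The sketch below is only illustrative: the function names are hypothetical and the database is assumed to be a list of (transaction, class) pairs, with transactions stored as frozensets of features.

```python
def supp_itemset(itemset, db):
    """Fraction of transactions that include the itemset."""
    return sum(1 for x, _ in db if itemset <= x) / len(db)


def supp_rule(itemset, label, db):
    """Fraction of transactions that include the itemset and have class label."""
    return sum(1 for x, c in db if itemset <= x and c == label) / len(db)


def conf_rule(itemset, label, db):
    """Confidence of the rule itemset -> label, i.e. supp(r) / supp(I)."""
    covered = supp_itemset(itemset, db)
    return supp_rule(itemset, label, db) / covered if covered else 0.0


db = [
    (frozenset("ab"), "pos"),
    (frozenset("a"), "pos"),
    (frozenset("ac"), "neg"),
    (frozenset("c"), "neg"),
]
rule = (frozenset("a"), "pos")
print(supp_rule(*rule, db), conf_rule(*rule, db))
```

In this toy database the rule {a} → pos is supported by two of the four transactions and covers three of them, so its confidence is two thirds.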

Hereafter, a CAR r is actually interesting if it

meets certain minimum requirements on its support

and conﬁdence and if its antecedent and consequent

are positively correlated (Arunasalam and Chawla,

2006). This avoids misleading CARs (i.e., CARs with

negative correlation despite a high conﬁdence) with

skewed classes, which are often encountered in the

XML domain.

Definition 2.5. Associative Classifier. An associative classifier $\mathcal{C}$ is a disjunction $\mathcal{C} = \{r_1 \vee \ldots \vee r_k\}$ of interesting CARs learnt from a database $\mathcal{D}$ of labeled transactions (representing XML trees with known class labels).

An associative classiﬁer is a set of CARs that as-

sign an unlabeled XML tree (in transactional form) to

a class if certain features occur in the tree. An ap-

proach to induce associative classiﬁers for effective

XML classiﬁcation is proposed next.

3 THE XCCS APPROACH

XCCS is a general framework for the associative clas-

siﬁcation of XML data, that relies on CARs to model

the associations between subsets of co-occurring fea-

tures and the discriminated classes.

XCCS exploits a selection of features of the avail-

able XML data for the discrimination of the individ-

ual classes. XML classiﬁcation in XCCS divides into

model learning and prediction. The former learns

an associative classiﬁer C from a database of labeled

XML trees in transactional form. The latter exploits

C to predict the class of unlabeled XML trees.

3.1 Model Learning

The model learning process in XCCS, sketched in

ﬁg. 1, receives four input parameters: a database D

of XML trees, a set S of discriminatory features, a

set L of class labels in D and one global threshold τ,

from which the minimum support thresholds for the

individual classes in L are derived.


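MODEL-LEARNING($\mathcal{D}$, $\mathcal{L}$, $\mathcal{S}$, $\tau$)
Input: a training dataset $\mathcal{D}$;
       a set $\mathcal{S}$ of substructures of the XML trees in $\mathcal{D}$;
       a set $\mathcal{L}$ of class labels in $\mathcal{D}$;
       and a support threshold $\tau$;
Output: an associative classifier $\mathcal{C} = \{r_1 \vee \ldots \vee r_k\}$;
1: let $\mathcal{F} \leftarrow \{F_s \mid s \in \mathcal{S}\}$ be the feature space;
2: $\mathcal{R} \leftarrow \emptyset$;
3: $\mathcal{D}' \leftarrow \emptyset$;
4: for each $t \in \mathcal{D}$ do
5:     $x \leftarrow \{F_s \mid F_s \in \mathcal{F}, s \text{ occurs in } t\}$;
6:     $\mathcal{D}' \leftarrow \mathcal{D}' \cup \{x\}$;
7: end for
8: $\mathcal{R} \leftarrow$ MINECARS($\mathcal{F}$, $\mathcal{D}'$, $\tau$);
9: $\mathcal{C} \leftarrow$ PRUNE($\mathcal{R}$);
10: RETURN $\mathcal{C}$

Figure 1: Model learning in XCCS.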

Model learning preliminarily involves the deﬁni-

tion (at line 1) of the space F of features related to

the elements of S as well as the mapping of the in-

dividual XML trees in D to as many corresponding

transactions over F (lines 4-7). The MINECARS pro-

cedure is then used to discover a potentially large set

R of CARs targeting the classes in L . The rule set R

is eventually distilled into a compact associative clas-

siﬁer C through the pruning method PRUNE.

A detailed treatment of the MINECARS and

PRUNE steps is provided, respectively, in subsec-

tions 3.1.1 and 3.1.2.

3.1.1 Mining the Class Association Rules

MINECARS is an Apriori-based procedure, that

searches for meaningful CARs in the training data

D . MINECARS enhances the basic Apriori algo-

rithm (Agrawal and Srikant, 1994) by incorporating

two effective mechanisms, i.e., multiple minimum

class support (Liu et al., 2000) and complement class

support (Arunasalam and Chawla, 2006), with which

to distill, within each class in D , an appropriate num-

ber of CARs with a positive correlation between their

antecedents and consequents. The exploitation of these mechanisms is particularly useful when the distribution of classes in D is skewed. In the absence

of suitable expedients, class imbalance typically lim-

its the extraction of a suitable number of CARs from

the less frequently occurring classes and negatively

acts on the correlation of CAR antecedents and con-

sequents, up to the point of identifying misleading

CARs (i.e. with a negative correlation).

Figure 2 shows the scheme of MINECARS, which

divides into frequent itemset discovery (lines P1-

P18) and CAR generation (lines P19- P26). In the

ongoing discussion, an itemset is a subset of struc-

tural features from the space F .

Frequent itemset discovery starts (at line P3) with $C_1$, a set of candidate 1-itemsets, each consisting of one structural feature from $\mathcal{F}$ and a class label from $\mathcal{L}$. At the generic iteration, MINECARS builds a set $L_k$ of frequent $k$-itemsets from $L_{k-1}$. Two steps are performed for this purpose. The join step (at line P14) involves joining $L_{k-1}$ with itself to yield $C'_k$, a collection of candidate $k$-itemsets. Notice that this requires joining pairs of frequent $(k-1)$-itemsets with identical class labels. The well-known Apriori property, according to which an infrequent itemset cannot have frequent supersets, is then used (at line P15) to drop from $C'_k$ those $k$-itemsets with at least one $(k-1)$-subset that is not in $L_{k-1}$. The support counting step (lines P5-P12) involves counting the occurrences of the surveyed candidate itemsets in $C_k$ by scanning the training data $\mathcal{D}$. Those candidates whose support exceeds a class-specific threshold are considered to be frequent and retained within $L_k$.
MINECARS halts when no more frequent itemsets can be discovered.

Multiple minimum class support (Liu et al., 2000)

is employed at line P13 to automatically adjust the

global minimum support threshold τ, supplied by the

user, to the minimum support threshold speciﬁc for

each class. Essentially, the generic candidate itemset

c is frequent if its support is over τ · supp(class(c)),

the adjusted minimum support threshold for class(c).

If the class distribution is skewed, multiple minimum class support implements a first stage of focused pruning, which dynamically assigns a higher minimum support threshold to more frequent classes (which prevents the generation of many overfitting itemsets) and a lower minimum support threshold to less frequent classes (which enforces the generation of an appropriate number of itemsets).
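The adjustment of the global threshold can be made concrete with a short Python sketch. The function name `class_min_supports` and the toy class distribution are hypothetical; the computation is the product τ · supp(class) described above.

```python
from collections import Counter


def class_min_supports(labels, tau):
    """Per-class minimum support threshold: tau times the class frequency."""
    n = len(labels)
    return {c: tau * count / n for c, count in Counter(labels).items()}


# A skewed toy distribution: 90 transactions of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
print(class_min_supports(labels, tau=0.2))
```

With τ = 0.2, the frequent class receives a threshold of roughly 0.18 and the rare class a threshold of roughly 0.02, so itemsets of the rare class need far fewer occurrences to be retained.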

Instead, complement class support (Arunasalam and Chawla, 2006) is used in the CAR generation stage, to avoid the specification of a global minimum confidence threshold. In particular, a specific property of complement class support (shown in (Arunasalam and Chawla, 2006)) is exploited at line P22 to automatically identify a class-specific minimum confidence threshold. According to such a property, the antecedent $\mathcal{I}$ and the consequent $c$ of a rule $r : \mathcal{I} \to c$ are positively correlated if and only if $conf(\mathcal{I} \to c) > \frac{\sigma(c)}{|\mathcal{D}|}$, where $\sigma(c)$ is the overall number of occurrences of class $c$ in $\mathcal{D}$. Therefore, the CARs whose confidence exceeds (at line P22) the minimum threshold corresponding to their targeted class are guaranteed to be positively correlated. Thus, both confidence and positive correlation between rule components can be verified without additional parameters or further correlation analysis.

When classes are skewed, the dynamic setting of a class-specific minimum confidence threshold acts as a second stage of focused pruning, which ensures the discovery of discriminative rules targeting the infrequent classes and still avoids an overwhelming number of rules from the predominant classes.

MINECARS($\mathcal{F}$, $\mathcal{D}$, $\tau$)
Input: a finite set of boolean attributes $\mathcal{F}$;
       a training set $\mathcal{D}$ of transactions representing XML trees;
       and a support threshold $\tau$;
Output: a set $\mathcal{R}$ of class association rules;
/* Frequent itemset discovery */
P1: $I \leftarrow \emptyset$, $k \leftarrow 1$;
P2: Let $\mathcal{L}$ be the set of class labels in $\mathcal{D}$;
P3: Let $C_1 \leftarrow \{c \mid \exists F_s \in \mathcal{F}, l \in \mathcal{L} \text{ such that } c = \{F_s\}, class(c) = l\}$;
P4: while $C_k \neq \emptyset$ do
P5:     for each candidate itemset $c \in C_k$ do
P6:         $supp(c) \leftarrow 0$;
P7:     end for
P8:     for $x \in \mathcal{D}$ do
P9:         for $c \in C_k$ such that $class(c) = class(x)$ and $c \subseteq x$ do
P10:            $supp(c) \leftarrow supp(c) + \frac{1}{|\mathcal{D}|}$;
P11:        end for
P12:    end for
P13:    $L_k \leftarrow \{c \in C_k \mid supp(c) > \tau \cdot supp(class(c))\}$;
P14:    $C'_{k+1} \leftarrow \{c_i \cup c_j \mid c_i, c_j \in L_k \wedge class(c_i) = class(c_j) \wedge |c_i \cup c_j| = k + 1\}$;
P15:    $C_{k+1} \leftarrow \{c \in C'_{k+1} \mid \forall c' \subset c \text{ with } |c'| = k, \text{ it holds that } c' \in L_k\}$;
P16:    $k \leftarrow k + 1$;
P17: end while
P18: $I \leftarrow \cup_k L_k$;
/* CAR generation */
P19: $\mathcal{R} \leftarrow \emptyset$;
P20: for each frequent itemset $\mathcal{I} \in I$ do
P21:     create rule $r : \mathcal{I} \to class(\mathcal{I})$;
P22:     if $conf(r) > \frac{\sigma(class(\mathcal{I}))}{|\mathcal{D}|}$ then
P23:         $\mathcal{R} \leftarrow \mathcal{R} \cup \{r\}$;
P24:     end if
P25: end for
P26: RETURN $\mathcal{R}$;

Figure 2: The MINECARS procedure.
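The following Python sketch mirrors the steps of Figure 2 on an in-memory database. It is not the authors' implementation: the function name `mine_cars`, the representation of a candidate as an (itemset, class) pair and the returned tuple layout are assumptions made for illustration, and no attention is paid to efficiency.

```python
from collections import Counter
from itertools import combinations


def mine_cars(db, tau):
    """db: list of (frozenset of features, class label) pairs.

    Returns rules as (antecedent, label, support, confidence) tuples.
    """
    n = len(db)
    sigma = Counter(label for _, label in db)            # occurrences per class
    features = set().union(*(x for x, _ in db))
    # P3: candidate 1-itemsets, one per (feature, class) pair.
    candidates = {(frozenset([f]), c) for f in features for c in sigma}
    frequent = {}                                        # (itemset, class) -> count
    k = 1
    while candidates:                                    # P4
        count = dict.fromkeys(candidates, 0)             # P5-P7
        for x, label in db:                              # P8-P12
            for itemset, c in candidates:
                if c == label and itemset <= x:
                    count[(itemset, c)] += 1
        # P13: class-specific minimum support tau * supp(class).
        level = {ic: m for ic, m in count.items()
                 if m / n > tau * sigma[ic[1]] / n}
        frequent.update(level)
        # P14: join frequent k-itemsets that share the class label.
        joined = {(a | b, ca) for a, ca in level for b, cb in level
                  if ca == cb and len(a | b) == k + 1}
        # P15: Apriori pruning, every k-subset must itself be frequent.
        candidates = {(s, c) for s, c in joined
                      if all((frozenset(sub), c) in level
                             for sub in combinations(s, k))}
        k += 1
    rules = []
    for (itemset, c), m in frequent.items():             # P20-P25
        covered = sum(1 for x, _ in db if itemset <= x)  # |D| * supp(I)
        conf = m / covered
        if conf > sigma[c] / n:                          # P22: positive correlation
            rules.append((itemset, c, m / n, conf))
    return rules
```

On a toy database with two classes, only itemsets that clear the threshold of their own class survive, and a rule is emitted only if its confidence exceeds the relative frequency of the class it targets.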

3.1.2 Learning an Associative-classiﬁcation

Model

Due to the inherently combinatorial nature of the

associative patterns, MINECARS may yield a large

number of CARs, which are likely to overﬁt the train-

ing data and provide contrasting predictions. To avoid

such issues, a compact and accurate classiﬁer is dis-

tilled from the rule set R through the covering method

PRUNE, illustrated in ﬁg. 3.

PRUNE initially orders (at line M1) the available CARs according to the total order ≪, which is inspired by the one introduced in (Liu et al., 1998). Precisely, given any two CARs $r_i, r_j \in \mathcal{R}$, $r_i$ precedes $r_j$, which is denoted by $r_i \ll r_j$, if (i) $conf(r_i)$ is greater than $conf(r_j)$, or (ii) $conf(r_i)$ is the same as $conf(r_j)$, but $supp(r_i)$ is greater than $supp(r_j)$, or (iii) $conf(r_i)$ is the same as $conf(r_j)$ and $supp(r_i)$ is identical to $supp(r_j)$, but $length(r_i)$ is less than $length(r_j)$. The length of a CAR $r : \mathcal{I} \to c$ is the number of features in the antecedent of $r$, i.e., $length(r) = |\mathcal{I}|$.
If two CARs $r_i, r_j$ have equal confidence, support and length, then $r_i \ll r_j$ if $r_i$ was generated before $r_j$.
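The total order ≪ maps naturally onto a composite sort key. In the sketch below, rules are assumed to be (antecedent, label, support, confidence) tuples listed in generation order, and `order_rules` is a hypothetical name; since Python's `sorted` is stable, rules that tie on all three criteria keep their generation order.

```python
def order_rules(rules):
    """Sort CARs by decreasing confidence, then decreasing support,
    then increasing antecedent length; remaining ties keep input order."""
    return sorted(rules, key=lambda r: (-r[3], -r[2], len(r[0])))


rules = [
    (frozenset("ab"), "pos", 0.2, 0.9),
    (frozenset("a"), "pos", 0.3, 0.9),
    (frozenset("c"), "neg", 0.3, 1.0),
    (frozenset("b"), "pos", 0.3, 0.9),
]
for antecedent, label, supp, conf in order_rules(rules):
    print(sorted(antecedent), "->", label, supp, conf)
```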

A covering process (lines M4- M19) then seeks a

compact classiﬁer C , consisting of a minimal number

of CARs from R, that attain a high predictive accu-

racy over unlabeled transactions (representing unclas-

siﬁed XML trees).

The covering process attempts the maximization of the effectiveness $F(\mathcal{C})$ of the resulting classifier $\mathcal{C}$ across all classes. $F(\mathcal{C})$ is evaluated in terms of the macro-averaged F-measure (Manning et al., 2008) of $\mathcal{C}$, which is defined as follows
$$F(\mathcal{C}) = \frac{1}{|\mathcal{L}|} \sum_{c \in \mathcal{L}} F^{(c)}(\mathcal{C})$$
where $F^{(c)}(\mathcal{C})$ is the effectiveness (or, also, the predictive performance) of $\mathcal{C}$ over the generic class $c$, described below. $F(\mathcal{C})$ assigns the same relevance to the effectiveness of $\mathcal{C}$ over the different classes, regardless of the occurrence frequency of the individual classes in the training data. This is especially useful in the presence of class imbalance, since the macro-average is not dominated by the predictive performance of $\mathcal{C}$ over the most frequent classes across the transactions.

The covering process increases $F(\mathcal{C})$ by separately acting on each $F^{(c)}(\mathcal{C})$, through the selection (at line M6) of CARs from $\mathcal{R}$ that, when appended to $\mathcal{C}$ (at line M10), improve the predictive performance of the resulting classifier over $c$ without a significant loss in compactness.
For each class $c$, covering scans (according to the ≪ order) the different CARs of $\mathcal{R}$ that target $c$. A CAR $r : \mathcal{I} \to c$ is appended to $\mathcal{C}$ if $F^{(c)}(\mathcal{C} \cup \{r\})$ is greater than $F^{(c)}(\mathcal{C})$ (at line M7). In this case, $r$ is removed from $\mathcal{R}$ (at line M9) and all transactions covered by $r$ are dropped from $\mathcal{D}$ (at line M12).
Notice that $F^{(c)}(\mathcal{C})$ is 0 while $\mathcal{C}$ does not include any CARs targeting $c$.

The total order established over $\mathcal{R}$ (at line M1) plays a key role while covering the generic class $c$: it ensures that the CARs predicting $c$ with highest implicative strength are appended to $\mathcal{C}$ from the early stages of the covering process, which is beneficial for $F^{(c)}$. Moreover, as covering of class $c$ proceeds, the consequent removal of transactions from $\mathcal{D}$ operates as a pruning method, which increasingly tends to prevent the retention in $\mathcal{C}$ of those CARs targeting $c$ that have not yet been considered. This positively acts on both the effectiveness and the compactness of $\mathcal{C}$ since, according to the ≪ order, such CARs predict $c$ with either a lower implicative strength or a higher specificity.

In particular, pruning specific CARs is useful to filter redundancy (Li et al., 2001) away from $\mathcal{C}$. Therein, let $r : \mathcal{I} \to c$ and $r' : \mathcal{I}' \to c$ be two redundant CARs of $\mathcal{R}$, such that $\mathcal{I} \subset \mathcal{I}'$ (i.e., $r'$ is more specific than $r$) and the implicative strength of $r'$ is lower than that of $r$. According to the order of the CARs, it holds that $r \ll r'$ (i.e., $r$ precedes $r'$ in $\mathcal{R}$). Therefore, if $r$ is added to $\mathcal{C}$, the consequent removal from $\mathcal{D}$ of $\mathcal{D}_r$ (i.e., the set of transactions covered by $r$) also involves the elimination of $\mathcal{D}_{r'}$, since $\mathcal{D}_{r'}$ is actually a subset of $\mathcal{D}_r$. As a consequence, when covering subsequently considers $r'$, the latter will be unable to improve $F^{(c)}$ and, thus, will be discarded. Redundancy avoidance strongly contributes to the compactness of $\mathcal{C}$ (especially when the size of $\mathcal{R}$ is large), in cooperation with the information-theoretic scheme at the end of this section. In the envisaged cooperation, pruning specific CARs avoids redundancy, whereas the latter scheme is used to evaluate whether the gain in predictive performance, due to the addition of non-redundant CARs to $\mathcal{C}$, is worth the consequent loss in compactness.

Notice that the different classes are separately covered (at line M4) in increasing order of their occurrence frequency, so that transactions belonging to less frequent classes are not removed from D while covering other classes with higher occurrence frequency. Such a removal would have the undesirable effect of preventing an appropriate evaluation of the gain in effectiveness due to the addition to C of CARs targeting the aforesaid less frequent classes.

Covering halts when either there are no more

CARs to consider (which is caught, for each class c,

at line M5), or all training transactions have been cov-

ered (which is caught at line M15), or the predictive

performance of C cannot be further increased (which

is caught, for each class c, at line M5).

The generic $F^{(c)}$ summarizes two further measures of class-specific effectiveness, i.e., the degree of precision $P^{(c)}$ and recall $R^{(c)}$ of classifier $\mathcal{C}$ in class $c$:
$$F^{(c)}(\mathcal{C}) = 2 \left( \frac{1}{P^{(c)}(\mathcal{C})} + \frac{1}{R^{(c)}(\mathcal{C})} \right)^{-1}$$
An increase in $F^{(c)}$, due to the addition of a CAR $r$ to the current classifier $\mathcal{C}$, means that $r$ produces an acceptable improvement of the predictive performance of $\mathcal{C}$, ascribable to a considerable gain in at least one of $P^{(c)}$ and $R^{(c)}$.

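Precision $P^{(c)}(\mathcal{C})$ is the exactness of $\mathcal{C}$ within class $c$, i.e., the proportion of transactions that are actually of class $c$ among the ones assigned by $\mathcal{C}$ to class $c$. Recall $R^{(c)}(\mathcal{C})$ is instead the completeness of $\mathcal{C}$ within $c$, i.e., the fraction of transactions of class $c$ that are correctly predicted by $\mathcal{C}$. Formally, let $\mathcal{D}^{(c)}_{\mathcal{C}} = \{x \in \mathcal{D} \mid \exists r \in \mathcal{C}, r : \mathcal{I} \to c, \mathcal{I} \subseteq x\}$ be the set of transactions covered by the CARs of $\mathcal{C}$ predicting class $c$ and $p^{(c)}_{\mathcal{C}} = \{x \in \mathcal{D} \mid \exists r \in \mathcal{C}, r : \mathcal{I} \to c, \mathcal{I} \subseteq x, class(x) = c\}$ be the set of transactions correctly assigned by $\mathcal{C}$ to class $c$. Also, assume that $\sigma(c)$ is the overall number of transactions of class $c$. Precision $P^{(c)}(\mathcal{C})$ and recall $R^{(c)}(\mathcal{C})$ are defined as reported below
$$P^{(c)}(\mathcal{C}) = \frac{|p^{(c)}_{\mathcal{C}}|}{|\mathcal{D}^{(c)}_{\mathcal{C}}|} \qquad R^{(c)}(\mathcal{C}) = \frac{|p^{(c)}_{\mathcal{C}}|}{\sigma(c)}$$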

Precision $P^{(c)}(\mathcal{C})$ and recall $R^{(c)}(\mathcal{C})$ provide complementary information on the effectiveness of $\mathcal{C}$ over $c$. Indeed, an improvement in precision alone, achieved by appending $r$ to $\mathcal{C}$, would not say anything on the corresponding variation in the recall of $\mathcal{C} \cup \{r\}$. Dually, an improvement in recall alone would not say anything on the corresponding variation in the precision of $\mathcal{C} \cup \{r\}$.

The simultaneous increase of both precision and recall is a challenging issue in the design of algorithms for learning classification models, since it often happens that a gain in recall corresponds to a loss in precision and vice versa. $F^{(c)}(\mathcal{C})$ is the harmonic mean of $P^{(c)}(\mathcal{C})$ and $R^{(c)}(\mathcal{C})$ and, hence, it is always closer to the smaller of the two. Therefore, an improvement of $F^{(c)}(\mathcal{C} \cup \{r\})$ with respect to $F^{(c)}(\mathcal{C})$ ensures that an increase in recall due to the addition of $r$ to $\mathcal{C}$ is not nullified by a serious loss in precision.
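The behavior of the harmonic mean and of the macro-average can be checked numerically with a short Python sketch. The function names are hypothetical, and the convention of returning 0 when either precision or recall is 0 is an assumption made to avoid a division by zero.

```python
def f_measure(precision, recall):
    """Class-specific F: harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


def macro_f(per_class):
    """Macro-averaged F over a {class: (precision, recall)} mapping."""
    return sum(f_measure(p, r) for p, r in per_class.values()) / len(per_class)


# A large gain in recall does not hide poor precision.
print(f_measure(0.5, 1.0))
# Every class weighs the same, whatever its frequency.
print(macro_f({"frequent": (1.0, 1.0), "rare": (0.5, 1.0)}))
```

With precision 0.5 and recall 1.0 the F-measure is two thirds, closer to the smaller of the two values than their arithmetic mean of 0.75.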

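Let $\mathcal{D}_r$ be the set of transactions left in $\mathcal{D}$ that are covered by $r : \mathcal{I} \to c$ (at line M11). Also, assume that $p^{(c)}_r$ is the subset of those transactions in $\mathcal{D}_r$ correctly classified by $r$ into class $c$, i.e., $p^{(c)}_r = \{x \in \mathcal{D}_r \mid class(x) = c\}$. The updated values of precision $P^{(c)}(\mathcal{C} \cup \{r\})$ and recall $R^{(c)}(\mathcal{C} \cup \{r\})$, resulting from the addition of $r$ to $\mathcal{C}$, are incrementally computed from $P^{(c)}(\mathcal{C})$ and $R^{(c)}(\mathcal{C})$ as follows
$$P^{(c)}(\mathcal{C} \cup \{r\}) = \frac{|p^{(c)}_{\mathcal{C}}| + |p^{(c)}_r|}{|\mathcal{D}^{(c)}_{\mathcal{C}}| + |\mathcal{D}_r|} \qquad R^{(c)}(\mathcal{C} \cup \{r\}) = R^{(c)}(\mathcal{C}) + \frac{|p^{(c)}_r|}{\sigma(c)}$$
where $\mathcal{D}^{(c)}_{\mathcal{C}}$ and $p^{(c)}_{\mathcal{C}}$ are, respectively, the transactions covered by the CARs of $\mathcal{C}$ predicting $c$ and those among them that actually belong to $c$.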
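A minimal numerical sketch of the incremental update follows. It assumes, as in the covering process, that the transactions newly covered by r are disjoint from those already covered (covered transactions are removed from the training data), so the counts simply add up. The function and parameter names are hypothetical.

```python
def updated_precision_recall(hits_c, covered_c, hits_r, covered_r, sigma_c):
    """Precision and recall for class c after appending rule r.

    hits_c, covered_c: correctly assigned and covered transactions so far;
    hits_r, covered_r: the same counts for the transactions newly covered by r;
    sigma_c: overall number of transactions of class c.
    """
    precision = (hits_c + hits_r) / (covered_c + covered_r)
    recall = (hits_c + hits_r) / sigma_c   # equals old recall + hits_r / sigma_c
    return precision, recall


# 8 of 10 covered transactions were correct; r covers 5 more, 3 of them correct.
print(updated_precision_recall(8, 10, 3, 5, sigma_c=20))
```

In this toy case precision moves from 8/10 to 11/15 while recall grows from 8/20 to 11/20, the kind of trade-off that the comparison of F-measures at line M7 is meant to arbitrate.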

When covering ends (line M19), the resulting classifier $\mathcal{C}$ is a list of predictive CARs grouped by the targeted class. The individual groups of CARs appear in $\mathcal{C}$ in increasing order of the occurrence frequency of the targeted class. Moreover, within each group, the CARs reflect the total order ≪ established over $\mathcal{R}$.

PRUNE($\mathcal{R}$, $\mathcal{D}$, $\mathcal{L}$)
Input: a set $\mathcal{R}$ of CARs;
       a set $\mathcal{D}$ of transactions;
       a set $\mathcal{L}$ of class labels in $\mathcal{D}$;
Output: a classifier $\mathcal{C}$;
/* Rule ordering according to the devised total order ≪ */
M1: $\mathcal{R} \leftarrow$ ORDER($\mathcal{R}$);
M2: $\mathcal{C} \leftarrow \emptyset$;
M3: $T \leftarrow \mathcal{D}$;
M4: for each $c \in \mathcal{L}$ in increasing order of occurrence frequency do
M5:     while there are still CARs in $\mathcal{R}$ that target $c$ do
M6:         choose the next rule $r : \mathcal{I} \to c$ from $\mathcal{R}$;
M7:         if $F^{(c)}(\mathcal{C} \cup \{r\}) > F^{(c)}(\mathcal{C})$ then
M8:             cur_length $\leftarrow length(\mathcal{C} \cup \{r\})$;
M9:             $\mathcal{R} \leftarrow \mathcal{R} - \{r\}$;
M10:            $\mathcal{C} \leftarrow \mathcal{C} \cup \{r\}$;
M11:            $\mathcal{D}_r \leftarrow \{x \in \mathcal{D} \mid \mathcal{I} \subseteq x\}$;
M12:            $\mathcal{D} \leftarrow \mathcal{D} - \mathcal{D}_r$;
M13:            min_length $\leftarrow$ cur_length;
M14:        end if
M15:        if $|\mathcal{D}| = 0$ then
M16:            continue at line M20;
M17:        end if
M18:    end while
M19: end for
M20: if $|\mathcal{D}| > 0$ then
M21:     $c^* \leftarrow \arg\max_{c \in \mathcal{L}} supp(c, \mathcal{D})$;
M22: else
M23:     $c^* \leftarrow \arg\max_{c \in \mathcal{L}} supp(c)$;
M24: end if
M25: $\mathcal{C} \leftarrow \mathcal{C} \cup \{\emptyset \to c^*\}$;
M26: RETURN $\mathcal{C}$;

Figure 3: The PRUNE procedure.
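The covering process can be sketched in Python as follows. This is a simplified reading of Figure 3 under stated assumptions, not the authors' code: each rule targeting a class is examined exactly once in ≪ order, the length bookkeeping of lines M8 and M13 is omitted, rules are (antecedent, label, support, confidence) tuples, and the name `prune` and its return value (selected rules plus default class) are hypothetical.

```python
from collections import Counter


def prune(rules, db):
    """Select a compact rule list by covering; db holds (transaction, class) pairs."""
    rules = sorted(rules, key=lambda r: (-r[3], -r[2], len(r[0])))   # M1
    sigma = Counter(label for _, label in db)
    remaining = list(db)                     # training data still uncovered
    classifier = []
    for c in sorted(sigma, key=sigma.get):   # M4: least frequent class first
        hits = covered = 0                   # correct / covered counts for class c
        best_f = 0.0                         # F is 0 before any rule targets c
        for rule in (r for r in rules if r[1] == c):                 # M5-M6
            d_r = [(x, y) for x, y in remaining if rule[0] <= x]     # M11
            if not d_r:
                continue
            p_r = sum(1 for _, y in d_r if y == c)
            precision = (hits + p_r) / (covered + len(d_r))
            recall = (hits + p_r) / sigma[c]
            f = 0.0 if hits + p_r == 0 else 2 / (1 / precision + 1 / recall)
            if f > best_f:                                           # M7
                classifier.append(rule)                              # M10
                remaining = [(x, y) for x, y in remaining
                             if not rule[0] <= x]                    # M12
                hits, covered, best_f = hits + p_r, covered + len(d_r), f
            if not remaining:                                        # M15
                break
        if not remaining:
            break
    pool = remaining if remaining else db                            # M20-M24
    default_class = Counter(y for _, y in pool).most_common(1)[0][0]
    return classifier, default_class
```

A rule that only covers transactions of other classes lowers precision without raising recall, so it fails the test at line M7 and is skipped, which is how redundant or misleading CARs are kept out of the classifier in this sketch.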

As noted in (Ning et al., 2006), the class-based ordering of the CARs in $\mathcal{C}$ confers on the classifier a high interpretability, which would not be obtained if the same CARs were sorted in $\mathcal{C}$ according to the ≪ relationship. Indeed, in this latter case, the meaning of each CAR $r$ in $\mathcal{C}$ would involve the negation of any other CAR $r'$ in $\mathcal{C}$ such that $r' \ll r$. This would clearly hamper the comprehension of the CARs located at the bottom of $\mathcal{C}$, especially if the size of $\mathcal{C}$ is large. Class-based ordering has been largely adopted in the design of seminal rule-induction algorithms.

To conclude the discussion on PRUNE, the mutual exclusiveness and the exhaustive coverage of the CARs of any resulting classifier C must be addressed.

A rule-based classifier is mutually exclusive if each input triggers no more than one rule. Generally, such a property does not hold for C . Indeed, it is actually possible that multiple CARs are triggered by the same transaction. This is clearly undesirable because (some of) the triggered CARs may provide contrasting predictions. This problem is overcome in sec. 3.2.

Instead, the addition to C (at line M25) of a default rule ∅ → c* ensures exhaustive coverage, i.e., that every transaction is covered by at least one CAR of C ∪ {∅ → c*}. In particular, the default rule covers all those transactions left uncovered by the CARs of C and assigns them to a suitable class c*. This guarantees maximum recall, with very poor precision, in class c*. To attenuate the loss in precision, c* can be reasonably chosen (at line M23) to be the class with highest occurrence frequency in the training data, which ensures the highest precision for the default rule. Depending on the overall coverage of C, there are two alternative possibilities for the choice of the default class c*. If there are still uncovered transactions after the termination of the covering process (at line M20), c* is selected (at line M21) as the class with maximum occurrence frequency supp(c, D) in the residual training data D. Otherwise, if all transactions have been covered, c* is chosen (at line M23) to be the class with highest occurrence frequency in the whole training data. In both cases, ties can be arbitrarily broken.

3.2 Prediction

Let C be an associative classifier induced by XCCS at the end of the model-learning phase of fig. 1. Also, assume that t is an unlabeled (i.e., unclassified) XML tree, whose transactional representation over the underlying feature space F is x. The class predicted by C for t is C(x). To avoid conflicting predictions from multiple triggered CARs, C(x) is provided by the first CAR of C that covers x.
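The first-match prediction just described can be illustrated with a short Python sketch. The representation is an assumption made for illustration only: a classifier is a list of (antecedent, class) pairs, a transaction is a set of features, and the name predict is hypothetical.

```python
def predict(classifier, x):
    """Return the class of the first CAR of the classifier that covers x.

    classifier: ordered list of (frozenset antecedent, class label);
    x: set of features of the transaction to classify.
    """
    for antecedent, label in classifier:
        if antecedent <= x:   # the CAR covers x: its antecedent is a subset of x
            return label
    # Unreachable if the classifier ends with a default rule (empty antecedent).
    raise ValueError("no rule covers the transaction")
```

Since the empty antecedent of the default rule is a subset of every transaction, a classifier that ends with such a rule always returns a prediction.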

4 EVALUATION

The empirical behavior of XCCS is studied in order to comparatively evaluate its effectiveness across different domains.

All tests are performed on a Linux machine with an Intel Core 2 Duo CPU, 4 GB of memory and a clock speed of 2 GHz.

4.1 Data Sets

The behavior of XCCS is tested over several real XML

data sets. Synthetic XML corpora are not considered

for experimental purposes, since these are generally

unlikely to provide coherent textual information in

natural language.

Macro-averaged effectiveness results are obtained

by performing a stratiﬁed 10-fold cross validation on

the transactional representation of each data set.
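A stratified split of this kind can be sketched in Python as follows. This is only an illustrative sketch: the function name stratified_folds, the round-robin assignment and the fixed seed are assumptions, not details taken from the experimental setup.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split the indices 0..len(labels)-1 into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    pos = 0
    for y in sorted(by_class, key=str):
        indices = by_class[y]
        rng.shuffle(indices)
        # Deal the documents of each class to the folds in round-robin fashion,
        # so that every fold receives a near-equal share of each class.
        for i in indices:
            folds[pos % k].append(i)
            pos += 1
    return folds
```

Each fold is used once as the test set, while the remaining k − 1 folds form the training set.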

We choose four real-world XML data sets that include textual information and are characterized by skewed distributions of the classes of XML documents.

KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval

Wikipedia is an XML corpus proposed in the INEX contest 2007 (Denoyer and Gallinari, 2008) as a major benchmark for XML classification and clustering. The corpus groups 96,000 XML documents representing very long articles from the digital encyclopedia. The XML documents are organized into 21 classes (or thematic categories), each corresponding to a distinct Wikipedia Portal. A challenging aspect of the corpus is the ambiguity of certain pairs of classes such as, e.g., Portal:Pornography and Portal:Sexuality, or Portal:Christianity and Portal:Spirituality (Denoyer and Gallinari, 2008).

IEEE is a reference text-rich corpus, presented in (Denoyer and Gallinari, 2007), that includes 12,107 XML documents representing full articles. These are organized into 18 classes corresponding to as many IEEE Computer Society publications: Transactions and 12 other journals. The same topic may be treated in two distinct journals.

DBLP is a bibliographic archive of scientific publications on computer science (http://dblp.uni-trier.de/xml/). The archive is available as one very large XML file with a diversified structure. The whole file is decomposed into 479,426 XML documents corresponding to as many scientific publications. Each of these belongs to one of 8 classes: article (173,630 documents), proceedings (4,764 documents), mastersThesis (5 documents), incollection (1,379 documents), inproceedings (298,413 documents), book (1,125 documents), www (38 documents) and phdthesis (72 documents). The individual classes exhibit differentiated structures, despite some overlap among certain document tags (such as title, author, year and pages) that occur in (nearly) all of the XML documents.

The Sigmod collection groups 988 XML documents (i.e., articles from Sigmod Record) complying with three different class DTDs: IndexTermsPage, OrdinaryIssue and Proceedings. These classes contain, respectively, 920, 51 and 17 XML documents. Such classes have diversified structures, despite the occurrence of some overlapping tags, such as volume, number, authors, title and year.

4.2 Preprocessing

The high dimensionality (i.e. cardinality) of the fea-

ture space S may be a concern for the time efﬁciency

and the scalability of model induction. In particu-

lar, when the classes of XML trees cannot be dis-

criminated through the structural information alone

and, hence, the content information must necessarily

be taken into account, the number of XML features

likely becomes very large if the XML documents con-

tain huge amounts of textual data.

To reduce the dimensionality of the feature space S, the available XML data is preprocessed in two steps.

The first step addresses the textual information of the XML data and considerably reduces the overall number of distinct terms in the leaves of the XML trees through token extraction, stop-word removal and stemming.
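The first preprocessing step can be pictured with the following Python sketch. It is a deliberately crude illustration: the stop-word list is a tiny hypothetical sample, and crude_stem is a naive suffix stripper standing in for a real stemming algorithm; neither is taken from the actual XCCS pipeline.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the tests).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def crude_stem(token):
    """Naive suffix stripping, used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Token extraction, stop-word removal and stemming of the text of a leaf."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

The resulting terms are the ones that contribute to the XML features of the transactional representation.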

Dimensionality reduction is performed at the second step both to reduce overfitting and to ensure a satisfactory behavior of XCCS in terms of efficiency, scalability and compactness of the induced classifiers. The idea is to partition S into groups of XML features that discriminate the individual classes in a similar fashion. For this purpose, we explicitly represent the discriminatory behavior of the features in S and then group the actually discriminatory features through distributional clustering (Baker and McCallum, 1998). In particular, the discriminatory behavior of each feature p.w in S is represented as an array v_{p.w} with as many entries as the number of classes in L. The generic entry of v_{p.w} is the probability P(c|p.w) of class c given the XML feature p.w. Clearly, P(c|p.w) = P(p.w|c) P(c) / P(p.w) by Bayes' rule, where the probabilities P(p.w|c), P(c) and P(p.w) are estimated from the data. Before clustering features, noisy (i.e., non-discriminatory) features are removed from S. Specifically, a feature p.w is noisy if there is no class c in L that is positively correlated with p.w according to chi-square testing at a significance level of 0.05. The arrays relative to the remaining features of S are then grouped through distributional clustering into a desired number of feature clusters with similar discriminatory behavior.
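The two ingredients that precede clustering, i.e., the class-posterior arrays and the chi-square noise filter, can be sketched in Python as below. The sketch rests on assumptions that are not stated in the text: probabilities are estimated by simple counting (so that P(c|p.w) reduces to the fraction of the transactions containing p.w that belong to c), the chi-square statistic is computed on a 2×2 contingency table per feature and class (one degree of freedom, critical value 3.841 at level 0.05), and the function names are hypothetical. The distributional clustering itself is not shown.

```python
from collections import Counter

CHI2_CRITICAL = 3.841  # chi-square critical value, 1 degree of freedom, level 0.05

def class_posteriors(data, classes):
    """Map each feature to the array [P(c | feature) for c in classes]."""
    feature_count, joint_count = Counter(), Counter()
    for features, label in data:
        for f in features:
            feature_count[f] += 1
            joint_count[f, label] += 1
    return {f: [joint_count[f, c] / n for c in classes]
            for f, n in feature_count.items()}

def is_noisy(feature, data, classes):
    """True if no class is positively and significantly correlated with the feature."""
    n = len(data)
    n_f = sum(1 for features, _ in data if feature in features)
    for c in classes:
        n_c = sum(1 for _, label in data if label == c)
        a = sum(1 for features, label in data if feature in features and label == c)
        b = n_f - a              # feature present, other classes
        g = n_c - a              # feature absent, class c
        d = n - n_f - n_c + a    # feature absent, other classes
        denom = (a + b) * (g + d) * (a + g) * (b + d)
        if denom == 0:
            continue
        chi2 = n * (a * d - b * g) ** 2 / denom
        # Positive correlation: the feature co-occurs with c more than expected.
        if a * d > b * g and chi2 > CHI2_CRITICAL:
            return False
    return True
```

Only the arrays of the features for which is_noisy returns False would be passed on to the clustering step.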

Finally, an aggressive compression of the original feature space S is achieved by replacing each feature p.w of S with a respective synthetic feature, i.e., the label of the cluster to which v_{p.w} belongs. Distributional clustering efficiently compresses the feature space by several orders of magnitude, while still enabling a significantly better classification performance than several other established techniques for dimensionality reduction (Baker and McCallum, 1998).
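The replacement of the original features with synthetic ones amounts to a simple mapping, sketched here in Python. The function name compress, the dictionary cluster_of and the feature strings used in the usage example are hypothetical; features missing from the dictionary are assumed to be the noisy ones removed before clustering.

```python
def compress(transaction, cluster_of):
    """Rewrite a transaction over the synthetic features (cluster labels).

    transaction: set of original features p.w;
    cluster_of: dict mapping each retained feature to the label of its cluster.
    """
    return frozenset(cluster_of[f] for f in transaction if f in cluster_of)
```

For instance, with cluster_of = {"title.xml": "g1", "title.mining": "g1", "author.zaki": "g2"}, the transaction {"title.xml", "title.mining", "year.2006"} is rewritten as {"g1"}: the two title features collapse into one synthetic feature and the unclustered feature is dropped.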

4.3 Classiﬁcation Effectiveness

We compare XCCS against several other established

competitors in terms of classiﬁcation effectiveness.


Both direct and indirect comparisons are performed.

The direct comparisons involve three compet-

ing classiﬁcation approaches, that produce rule-based

classiﬁers, i.e., XRULES (Zaki and Aggarwal, 2006),

CBA (Liu et al., 1998) as well as CPAR (Xin and

Han, 2003). These competitors are publicly available

and, thus, can be compared against XCCS on each

XML data set.

XRULES is a state-of-the-art competitor, that ad-

mits multiple cost-models to evaluate classiﬁcation

effectiveness. For each data set, we repeatedly

train XRULES to suitably tune the minimum support

threshold for the frequent subtrees to ﬁnd in the var-

ious classes and, then, report the results of the cost

model allowing the best classiﬁcation performance.

CBA and CPAR are two seminal techniques for

learning associative classiﬁers. CBA and CPAR are

included among the competitors of XCCS to compare

the effectiveness of the three distinct approaches to

associative classiﬁcation at discriminating classes in

(high-dimensional) transactional domains.

To evaluate CBA and CPAR, we use the imple-

mentations from (Coenen, 2004). In all tests, both

CBA and CPAR are trained on the transactional rep-

resentations of the XML data sets used to feed XCCS.

Again, CBA and CPAR are repeatedly trained on the

transactional representation of each XML data set, in

order to suitably tune their input parameters. For ev-

ery data set, we report the results of the most effective

classiﬁers produced by CBA and CPAR.

Preliminary tests showed that a satisfactory behavior of XCCS can be obtained in all tests by fixing the support threshold τ of fig. 1 to 0.1. This is essentially due to the adoption of the minimum class support (Liu et al., 2000) (discussed in section 3.1.1) in the MINECARS procedure of fig. 2.

Fig. 4 summarizes the effectiveness of the chosen

competitors across the selected data sets.

Columns Size and #C indicate, respectively, the number of XML documents and classes for each corpus. Column Model identifies the competitors. Rules is the average number of rules of a classifier in the stratified 10-fold cross validation, rounded to the nearest integer.

The effectiveness of each classifier is measured in terms of average precision (P), average recall (R) and average F-measure (F). More precisely, the values of P, R and F are averages of precision, recall and F-measure over the folds of the stratified 10-fold cross validation of classifiers on the individual data sets.
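The computation of P, R and F can be illustrated with a small Python sketch. It assumes that, within each fold, precision, recall and F-measure are macro-averaged over the classes observed in that fold (with the macro F-measure taken as the mean of the per-class F-measures) and that the fold-level values are then averaged; the function names are hypothetical.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F-measure over the classes in y_true."""
    classes = sorted(set(y_true))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec)
        rs.append(rec)
        fs.append(f)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

def average_over_folds(fold_results):
    """fold_results: list of (y_true, y_pred) pairs, one pair per test fold."""
    scores = [macro_prf(t, p) for t, p in fold_results]
    k = len(scores)
    return tuple(sum(s[i] for s in scores) / k for i in range(3))
```

Because every class weighs the same in the macro average, rare classes affect the reported values as much as frequent ones, which is relevant for corpora with skewed class distributions.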

The maximum values of P, R and F on each data

set are highlighted in bold.

Notice that, for the indirect comparisons, we tabulate only the (best) results reported for the approaches (de Campos et al., 2008; Murugeshan et al., 2008; Yang and Zhang, 2008; Yong et al., 2007; Xing et al., 2007) in the respective papers.

Some results were not originally measured and, hence, are reported as N.A. (short for not available). Rules is not meaningful for (de Campos et al., 2008; Murugeshan et al., 2008; Yang and Zhang, 2008; Yong et al., 2007; Xing et al., 2007) and, thus, its entry in the corresponding rows is left blank.

The symbol − that appears in three rows of fig. 4 indicates that XRULES did not successfully complete the tests over Wikipedia, IEEE and DBLP. The enumeration of the frequent embedded subtrees within each class and the consequent generation of predictive structural rules (satisfying the specified level of minimum class-specific support) are very time-expensive steps of XRULES, especially when the underlying number of XML documents is (very) large. In the only test that it completed (Sigmod), XRULES is less effective than XCCS. In addition, the huge number of rules produced by XRULES (more than 50,000 on Sigmod) makes the resulting classification models difficult to understand (and, hence, hardly actionable) in practice.

The classification performance of CBA is inferior to that of XCCS on the selected data sets. Moreover, as discussed in sec. 3.1.2, interpreting CBA classifiers may be cumbersome, since their rules are not ordered by the targeted class. CPAR delivers a satisfactory classification performance on the chosen XML corpora. Nonetheless, CPAR is still less effective and compact than XCCS. The approaches (de Campos et al., 2008; Murugeshan et al., 2008; Yang and Zhang, 2008) and (Yong et al., 2007; Xing et al., 2007) exhibit generally inferior classification performance compared to XCCS on the Wikipedia and IEEE corpora, respectively.

To conclude, XCCS consistently induces the most effective classifiers on the chosen corpora. As far as compactness is concerned, such classifiers are generally comparable to the ones induced by CBA and significantly more compact than the ones induced by XRULES and CPAR. The effectiveness of XCCS confirms its general capability of handling XML data with skewed class distributions.

5 CONCLUSIONS AND FURTHER RESEARCH

XCCS is a new approach to XML classification that induces clearly interpretable predictive models, which are of great practical interest for more effective and efficient XML search, retrieval and filtering. XCCS induces very compact classifiers with outperforming effectiveness from very large corpora of XML data.

Ongoing research aims to increase the discriminatory power of XML features by incorporating the textual context of words in the leaves of the XML trees. Also, we are studying enhancements of model learning in XCCS, with which to induce classification rules that also consider the absence of XML features.

Data        Size      #C   Model                        Rules      P      R      F
Wikipedia   96,000    21   XCCS                         87         0.77   0.78   0.78
                           XRULES                       -          -      -      -
                           CBA                          90         0.60   0.61   0.61
                           CPAR                         156        0.73   0.72   0.73
                           (de Campos et al., 2008)                N.A.   N.A.   0.75
                           (Murugeshan et al., 2008)               N.A.   0.76   N.A.
                           (Yang and Zhang, 2008)                  N.A.   0.84   N.A.
IEEE        12,107    18   XCCS                         49         0.73   0.75   0.74
                           XRULES                       -          -      -      -
                           CBA                          68         0.53   0.55   0.54
                           CPAR                         133        0.68   0.68   0.68
                           (Yong et al., 2007)                     N.A.   N.A.   0.72
                           (Xing et al., 2007)                     N.A.   N.A.   0.60
DBLP        479,426   8    XCCS                         10         1      1      1
                           XRULES                       -          -      -      -
                           CBA                          8          0.95   0.95   0.95
                           CPAR                         10         0.96   0.96   0.96
Sigmod      988       3    XCCS                         6          1      1      1
                           XRULES                       > 50000    0.95   0.94   0.94
                           CBA                          3          0.91   0.93   0.92
                           CPAR                         32         0.93   0.94   0.93

Figure 4: Results of the empirical evaluation.

REFERENCES

Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Proc. of Int. Conf. on Very Large Data Bases, pages 487–499.

Arunasalam, B. and Chawla, S. (2006). CCCS: A top-down association classifier for imbalanced class distribution. In Proc. of Int. Conf. on Knowledge Discovery and Data Mining, pages 517–522.

Baker, L. and McCallum, A. (1998). Distributional clustering of words for text classification. In Proc. of ACM Int. Conf. on Research and Development in Information Retrieval, pages 96–103.

Coenen, F. (2004). LUCS KDD implementations of CBA and CMAR. Dpt of Computer Science, University of Liverpool - www.csc.liv.ac.uk/~frans/KDD/Software/.

de Campos, L., Fernández-Luna, J., Huete, J., and Romero, A. (2008). Probabilistic methods for structured document classification at INEX'07. In Proc. of INitiative for the Evaluation of XML Retrieval, pages 195–206.

Denoyer, L. and Gallinari, P. (2007). Report on the XML mining track at INEX 2005 and INEX 2006. ACM SIGIR Forum, 41(1):79–90.

Denoyer, L. and Gallinari, P. (2008). Report on the XML mining track at INEX 2007. ACM SIGIR Forum, 42(1):22–28.

Garboni, C., Masseglia, F., and Trousse, B. (2006). Sequential pattern mining for structure-based XML document classification. In Proc. of the INitiative for the Evaluation of XML Retrieval, pages 458–468.

Knijf, J. D. (2007). FAT-CAT: Frequent attributes tree based classification. In Proc. of the INitiative for the Evaluation of XML Retrieval, pages 485–496.

Li, W., Han, J., and Pei, J. (2001). CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. of Int. Conf. on Data Mining, pages 369–376.

Liu, B., Hsu, W., and Ma, Y. (1998). Integrating classification and association rule mining. In Proc. of Conf. on Knowledge Discovery and Data Mining, pages 80–86.

Liu, B., Ma, Y., and Wong, C. (2000). Improving an association rule based classifier. In Proc. of Int. Conf. on Principles of Data Mining and Knowledge Discovery, pages 504–509.

Manning, C., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Murugeshan, M., Lakshmi, K., and Mukherjee, S. (2008). A categorization approach for Wikipedia collection based on negative category information and initial descriptions. In Proc. of the INitiative for the Evaluation of XML Retrieval.

Ning, P., Steinbach, M., and Kumar, V. (2006). Introduction to Data Mining. Addison Wesley.

Theobald, M., Schenkel, R., and Weikum, G. (2003). Exploiting structure, annotation, and ontological knowledge for automatic classification of XML data. In Proc. of WebDB Workshop, pages 1–6.

Xin, X. and Han, J. (2003). CPAR: Classification based on predictive association rules. In Proc. of SIAM Int. Conf. on Data Mining, pages 331–335.

Xing, G., Guo, J., and Xia, Z. (2007). Classifying XML documents based on structure/content similarity. In Proc. of the INitiative for the Evaluation of XML Retrieval, pages 444–457.

Yang, J. and Zhang, F. (2008). XML document classification using extended VSM. In Proc. of the INitiative for the Evaluation of XML Retrieval, pages 234–244.

Yi, J. and Sundaresan, N. (2000). A classifier for semi-structured documents. In Proc. of Int. Conf. on Knowledge Discovery and Data Mining, pages 340–344.

Yong, S., Hagenbuchner, M., Tsoi, A., Scarselli, F., and Gori, M. (2007). XML document mining using graph neural network. In Proc. of the INitiative for the Evaluation of XML Retrieval, pages 458–472.

Zaki, M. and Aggarwal, C. (2006). XRules: An effective algorithm for structural classification of XML data. Machine Learning, 62(1-2):137–170.
