A TRANSACTIONAL APPROACH TO ASSOCIATIVE XML
CLASSIFICATION BY CONTENT AND STRUCTURE
Gianni Costa, Riccardo Ortale and Ettore Ritacco
ICAR-CNR, Via P. Bucci 41C, 87036 Rende (CS), Italy
Keywords:
XML mining, XML transactional modeling, Associative classification.
Abstract:
We propose XCCS, which is short for XML Classification by Content and Structure, a new approach to the
induction of intelligible classification models for XML data, which are a valuable support for more effective
and efficient XML search, retrieval and filtering. The idea behind XCCS is to represent each XML document
as a transaction in a space of boolean features, that are informative of its content and structure. Suitable
algorithms are developed to learn associative classifiers from the transactional representation of the XML
data. XCCS induces very compact classifiers whose effectiveness outperforms that of several established
competitors.
1 INTRODUCTION
XML is a popular model for data representation that
allows textual content to be organized into (possibly
irregular) logical structures.
The supervised classification of XML data into
predefined classes consists in learning a model of the
structural and content regularities (observed across a
set of pre-classified XML documents), that discrim-
inate each individual class. The resulting classifier
can, hence, predict the class of a previously unseen
XML document from the same applicative domain,
by looking at its structure and content.
A wide variety of approaches to XML classifi-
cation have been proposed in the literature, includ-
ing (Theobald et al., 2003; Yi and Sundaresan, 2000;
Garboni et al., 2006; Knijf, 2007; Zaki and Aggar-
wal, 2006). These efforts can be divided into two ma-
jor categories. One family of approaches uses only
the structural information of XML data in classifier
induction and class prediction, such as in (Garboni
et al., 2006; Knijf, 2007; Zaki and Aggarwal, 2006).
An inherent limitation of such approaches is their
inability to discriminate between the classes when all of the
available XML documents share an undifferentiated
structure. Another family of approaches, such as
the ones in (Theobald et al., 2003; Yi and Sundare-
san, 2000), performs a more sophisticated separation
of the classes, by considering both the content and
structural information of XML data. Unfortunately,
despite their effectiveness, these approaches do not
provide explicative classification models, i.e., concise
and human-intelligible summarizations of the content
and structural regularities that discriminate the indi-
vidual classes. Such classification models have the
potential to offer an in-depth and actionable under-
standing of the relevant properties of very large cor-
pora of XML data and, hence, are of great practical
interest in all those settings in which XML classifica-
tion is preliminarily performed to enable more effec-
tive and efficient XML search, retrieval and filtering.
In this paper, we propose XCCS, a new approach
to XML Classification by Content and Structure,
that relies on solid and well-understood foundations.
XCCS performs associative classification on the avail-
able XML data to induce an easily interpretable and
highly expressive predictive model. The latter is a
compact set of rules, which discriminate the generic
class from the others by means of content and struc-
tural regularities, that frequently occur in the class and
are positively correlated with the class itself.
We identify suitable features of the XML docu-
ments, that are informative of their content and struc-
ture, and represent each XML document as a trans-
action in the resulting feature space. Additionally,
we design algorithms to perform associative classifi-
cation on the transactional representation of the XML
data. The devised algorithms handle skewed class dis-
tributions, that are often encountered in the XML do-
main. To the best of our knowledge, XCCS is the first
approach that borrows the advantages of associative
classification, i.e., a high degree of both interpretabil-
ity and expressiveness (that are well known and stud-
ied in the literature) coupled with a robust effective-
ness. As a further contribution, the latter is investi-
gated by comparatively evaluating XCCS over several
XML corpora. Empirical evidence shows that XCCS
scales to induce very compact classifiers with outper-
forming effectiveness from very large XML corpora.
The paper proceeds as follows. Section 2 intro-
duces notation and preliminaries. Section 3 discusses
the XCCS framework. Section 4 presents the results
of a comparative evaluation of several classifiers in-
duced within XCCS. Finally, Section 5 concludes and
highlights future research.
2 PRELIMINARIES
We introduce the notation used throughout the pa-
per as well as some basic concepts. The structure
of XML documents without references can be mod-
eled in terms of rooted, labeled trees, that represent
the hierarchical relationships among the document el-
ements (i.e., nodes).
Definition 2.1. XML Tree. An XML tree is a rooted,
labeled tree, represented as a tuple t = (r_t, V_t, E_t, λ_t),
whose individual components have the following
meaning. V_t ⊆ ℕ is a set of nodes and r_t ∈ V_t is
the root node of t, i.e., the only node with no entering
edges. E_t ⊆ V_t × V_t is a set of edges, catching
the parent-child relationships between the nodes of t. Finally,
λ_t : V_t → Σ is a node labeling function and Σ is
an alphabet of node tags (i.e., labels).
In the above definition, the elements of XML doc-
uments and their attributes are not distinguished: both
are mapped to nodes in the XML tree representation.
Hereafter, the notions of XML document and XML
tree are used interchangeably.
Let t be a generic XML tree. The nodes in V_t divide
into two disjoint subsets: the set L_t of leaves and the
set V_t − L_t of inner nodes. An inner node has at least
one child and contains no textual information. A leaf
is instead a node with no children, that can contain
only textual information.

A root-to-leaf path p_{r_t,l} in t is the sequence of nodes
encountered in t along the path from the root r_t to
a leaf node l in L_t, i.e., p_{r_t,l} = ⟨r_t, ..., l⟩. The notation
λ_t(p_{r_t,l}) denotes the sequence of labels that are
associated in the XML tree t with the nodes of the path p_{r_t,l},
i.e., λ_t(p_{r_t,l}) = ⟨λ_t(r_t), ..., λ_t(l)⟩. The set paths(t) =
{p_{r_t,l} | l ∈ L_t} groups all root-to-leaf paths in t.
Let l be a leaf in L_t. The set terms(l) =
{λ_t(p_{r_t,l}).w_1, ..., λ_t(p_{r_t,l}).w_h, λ_t(p_{r_t,l}).ε} is a model of
the information provided by l. The elements λ_t(p_{r_t,l}).w_i
(with i = 1...h) are as many as the distinct term stems
in the context of l and seamlessly couple content
information with its structural context. Therein, w_i is
some term stem (obtained in the first step of the
preprocessing in subsection 4.2) and λ_t(p_{r_t,l}) acts as a prefix
specifying the location of l within the XML tree t,
which allows the occurrences of w_i in the
context of distinct leaves to be distinguished. The unique element of the
type λ_t(p_{r_t,l}).ε is instead informative only of the
location of l within t: ε indicates the null string. λ_t(p_{r_t,l}).ε
still provides some (purely structural) information on
l when the leaf does not contain textual information.

Leaf terms and their prefixes are chosen as
informative features of the XML data, with which to
separate the classes of XML trees. Henceforth, for
readability's sake, we will write p instead of λ(p) to mean
the prefix of a term stem w.
Definition 2.2. XML Feature. Let t be an XML tree.
A prefixed term (stem) p.w is said to be a feature of t
(or, equivalently, p.w occurs in t), denoted as p.w ∈ t,
if the following two conditions hold. First, there exists
a leaf l ∈ L_t and, hence, a path p_{r_t,l} ∈ paths(t) such
that λ_t(p_{r_t,l}) = p. Second, p.w ∈ terms(l).
Assume that S = {p.w | ∃t ∈ D such that p.w ∈ t}
is a suitable selection of features from the XML trees
in D. S identifies an expressive feature space, in
which to perform the induction of models for
effective XML classification.

Let D = {t_1, ..., t_N} be a training database (or,
equivalently, a forest) of N XML trees, each of
which is associated with one label from the set L =
{c_1, ..., c_k}. Our approach to XML classification
can be formalized as learning some suitable model
C : 2^S → L of the associations between the occurrence
of the chosen features in the XML trees of D
and the class labels of the same XML trees. The
resulting classifier C is useful to predict the unknown
class of a previously unseen XML tree t′, on the basis
of the features occurring in t′.
Ideally, the efficiency and scalability of XML
classification should suffer neither from the
dimensionality of S (i.e., the number |S| of features), nor
from the costs of the operations for the manipulation
of tree-like structures. To meet such requirements, the
dimensionality of the feature space corresponding to
S is considerably reduced in subsection 4.2. Additionally,
XML data is represented in a convenient transactional
form, that avoids the manipulation of XML trees.
The idea behind the transactional representation
is that, by looking at the elements of S as binary
features, the available XML data can be projected into
a feature space, wherein the occurrence of the
individual features within each XML tree is explicitly
represented. More precisely, if S denotes the selected
collection of features, the XML trees from D
can be modeled as transactions over a feature space
F ≜ {F_{p.w} | p.w ∈ S}. Here, the generic feature F_{p.w}
is a boolean attribute, whose value indicates the
presence/absence of the corresponding feature p.w of S
within the individual XML trees.
Let x^(t) be the transactional representation of an
XML tree t. The value of each attribute F_{p.w} within
x^(t) is true if p.w is a feature of t, otherwise it is false.
Hence, x^(t) can be modeled as a proper subset of F,
namely x^(t) ≜ {F_{p.w} ∈ F | p.w ∈ t}, with the meaning
that the features explicitly present in x^(t) take value
true, whereas the others assume value false. The
original database D can hence be represented in
transactional form as D^(F) = {x^(t_1), ..., x^(t_N)}, whereas the
class associated with the generic transaction is
denoted as class(x^(t)). Hereafter, to keep notation
uncluttered, the transactional database and the generic
transaction will be denoted, respectively, as D and x.
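To make the encoding concrete, the sketch below (an illustrative reconstruction, not the authors' code; the string encoding of features and the tokenizer are assumptions) builds the transaction of an XML document from its prefixed leaf terms: every leaf contributes its content features p.w plus the purely structural feature p.ε.

```python
# Illustrative sketch: encoding an XML document as a transaction of
# prefixed leaf terms. For brevity, attributes are not mapped to nodes
# as in Definition 2.1, and lowercasing stands in for full stemming.
import xml.etree.ElementTree as ET

def transaction(xml_string):
    root = ET.fromstring(xml_string)
    features = set()

    def visit(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:                       # leaf node
            p = "/".join(path)
            features.add(p + ".")              # structural feature p.eps
            for token in (node.text or "").lower().split():
                features.add(p + "." + token)  # content feature p.w
        for child in children:
            visit(child, path)

    visit(root, [])
    return features

doc = "<movie><title>The Matrix</title><year>1999</year></movie>"
print(sorted(transaction(doc)))
# ['movie/title.', 'movie/title.matrix', 'movie/title.the',
#  'movie/year.', 'movie/year.1999']
```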
In this paper, XML classification is approached
through associative classification (Liu et al., 1998), a
powerful enhancement of conventional rule learning,
that results from the integration of two fundamental
tasks in data mining, namely, association rule min-
ing and classification. Associative classifiers retain
the advantages of traditional rule-based classification
models (i.e., interpretability, expressiveness and high
effectiveness) and, also, tend to achieve a better pre-
dictive performance (Xin and Han, 2003).
The necessary concepts concerning associative
classification in the domain of the transactional rep-
resentation of the XML trees are formalized next.
The notion of class association rule is the starting
point.
Definition 2.3. Class Association Rule. Let F be a
feature space, deriving from the selection of certain
features of the XML data. Also, assume that D (the
so-called training data) is a database of XML trees
represented as transactions over F and that L is a
set of class labels. A class association rule (or,
equivalently, a CAR) r : I → c is a pattern that catches the
association (i.e., the co-occurrence) in D of some subset
I of F with a class label c belonging to L. I and
c are said to be, respectively, the antecedent and
consequent of the CAR.
Essentially, a CAR relates the occurrence of a
certain combination of features in a transaction
corresponding to an XML tree with one particular class.
A rule r : I → c is said to cover a (labeled or
unlabeled) transaction x ∈ D (and, dually, x is said to
trigger or fire r) if the condition I ⊆ x holds. The
set D_r of transactions covered by r is defined as
D_r = {x ∈ D | I ⊆ x}.
The notions of support and confidence are
employed to define the interestingness of a rule r.

Definition 2.4. CAR Support and Confidence. A
transaction x ∈ D supports a CAR r : I → c if it holds
that I ⊆ x and c = class(x). The support of r, denoted
as supp(r), is the fraction of transactions in D that
support r. The confidence or predictive strength of r,
denoted by conf(r), is defined as

conf(r) = supp(r) / supp(I)

where supp(I) is the fraction of transactions in D
including the subset I.
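In code, both measures reduce to counting covered transactions. The short sketch below (illustrative; transactions are assumed to be (feature set, class) pairs) mirrors Definition 2.4.

```python
# Support and confidence of a CAR per Definition 2.4 (illustrative
# sketch; D is a list of (feature set, class) pairs).
def supp_itemset(I, D):
    """Fraction of transactions in D that include the itemset I."""
    return sum(1 for x, _ in D if I <= x) / len(D)

def supp_rule(I, c, D):
    """Fraction of transactions that include I and belong to class c."""
    return sum(1 for x, cls in D if I <= x and cls == c) / len(D)

def conf_rule(I, c, D):
    """conf(r) = supp(r) / supp(I)."""
    return supp_rule(I, c, D) / supp_itemset(I, D)

D = [({"a", "b"}, "pos"), ({"a"}, "neg"), ({"a", "b", "c"}, "pos")]
print(conf_rule({"a", "b"}, "pos", D))  # 1.0: every covered transaction is 'pos'
```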
Hereafter, a CAR r is considered actually interesting if it
meets certain minimum requirements on its support
and confidence and if its antecedent and consequent
are positively correlated (Arunasalam and Chawla,
2006). This avoids misleading CARs (i.e., CARs with
negative correlation despite a high confidence) in the
presence of skewed classes, which are often encountered
in the XML domain.
Definition 2.5. Associative Classifier. An associative
classifier C is a disjunction C = {r_1 ∨ ... ∨ r_k}
of interesting CARs learnt from a database D of
labeled transactions (representing XML trees with
known class labels).
An associative classifier is a set of CARs that as-
sign an unlabeled XML tree (in transactional form) to
a class if certain features occur in the tree. An ap-
proach to induce associative classifiers for effective
XML classification is proposed next.
3 THE XCCS APPROACH
XCCS is a general framework for the associative clas-
sification of XML data, that relies on CARs to model
the associations between subsets of co-occurring fea-
tures and the discriminated classes.
XCCS exploits a selection of features of the avail-
able XML data for the discrimination of the individ-
ual classes. XML classification in XCCS divides into
model learning and prediction. The former learns
an associative classifier C from a database of labeled
XML trees in transactional form. The latter exploits
C to predict the class of unlabeled XML trees.
3.1 Model Learning
The model learning process in XCCS, sketched in
fig. 1, receives four input parameters: a database D
of XML trees, a set S of discriminatory features, a
set L of class labels in D and one global threshold τ,
from which the minimum support thresholds for the
individual classes in L are derived.
MODEL-LEARNING(D, L, S, τ)
Input: a training dataset D;
       a set S of substructures of the XML trees in D;
       a set L of class labels in D;
       and a support threshold τ;
Output: an associative classifier C = {r_1 ∨ ... ∨ r_k};
1:  let F ← {F_s | s ∈ S} be the feature space;
2:  R ← ∅;
3:  D′ ← ∅;
4:  for each t ∈ D do
5:    x ← {F_s | F_s ∈ F, s ∈ t};
6:    D′ ← D′ ∪ {x};
7:  end for
8:  R ← MINECARS(F, D′, τ);
9:  C ← PRUNE(R);
10: RETURN C;
Figure 1: Model learning in XCCS.
Model learning preliminarily involves the defini-
tion (at line 1) of the space F of features related to
the elements of S as well as the mapping of the in-
dividual XML trees in D to as many corresponding
transactions over F (lines 4-7). The MINECARS pro-
cedure is then used to discover a potentially large set
R of CARs targeting the classes in L . The rule set R
is eventually distilled into a compact associative clas-
sifier C through the pruning method PRUNE.
A detailed treatment of the MINECARS and
PRUNE steps is provided, respectively, in subsec-
tions 3.1.1 and 3.1.2.
3.1.1 Mining the Class Association Rules
MINECARS is an Apriori-based procedure, that
searches for meaningful CARs in the training data
D . MINECARS enhances the basic Apriori algo-
rithm (Agrawal and Srikant, 1994) by incorporating
two effective mechanisms, i.e., multiple minimum
class support (Liu et al., 2000) and complement class
support (Arunasalam and Chawla, 2006), with which
to distill, within each class in D , an appropriate num-
ber of CARs with a positive correlation between their
antecedents and consequents. The exploitation of
these mechanisms is particularly useful when the
distribution of classes in D is skewed.
of suitable expedients, class imbalance typically lim-
its the extraction of a suitable number of CARs from
the less frequently occurring classes and negatively
acts on the correlation of CAR antecedents and con-
sequents, up to the point of identifying misleading
CARs (i.e. with a negative correlation).
Figure 2 shows the scheme of MINECARS, which
divides into frequent itemset discovery (lines P1-P18)
and CAR generation (lines P19-P26). In the
ongoing discussion, an itemset is a subset of
structural features from the space F.

Frequent itemset discovery starts (at line P3) with
C_1, a set of candidate 1-itemsets, each consisting of one
structural feature from F and a class label from L.
At the generic iteration, MINECARS builds a set L_k
of frequent k-itemsets from L_{k−1}. Two steps are
performed for this purpose. The join step (at line P14)
involves joining L_{k−1} with itself to yield C_k, a
collection of candidate k-itemsets. Notice that this
requires joining pairs of frequent (k−1)-itemsets with
identical class labels. The well-known Apriori property,
according to which an infrequent itemset cannot
have frequent supersets, is then used (at line P15) to
drop from C_k those k-itemsets with at least one
(k−1)-subset that is not in L_{k−1}. The support counting step
(lines P5-P12) involves counting the occurrences of
the surveyed candidate itemsets in C_k by scanning the
training data D. Those candidates whose support
exceeds a class-specific threshold are considered to be
frequent and retained within L_k.

MINECARS halts when no more frequent itemsets
can be discovered.
Multiple minimum class support (Liu et al., 2000)
is employed at line P13 to automatically adjust the
global minimum support threshold τ, supplied by the
user, to the minimum support threshold specific to
each class. Essentially, the generic candidate itemset
c is frequent if its support is over τ · supp(class(c)),
the adjusted minimum support threshold for class(c).
If the class distribution is skewed, multiple minimum
class support implements a first stage of focused
pruning, that dynamically assigns a higher minimum
support threshold to more frequent classes (which
prevents the generation of several overfitting itemsets) and
a lower minimum support threshold to less frequent
classes (which enforces the generation of an
appropriate number of itemsets).
Instead, complement class support (Arunasalam
and Chawla, 2006) is used in the CAR generation
stage, to avoid the specification of a global minimum
confidence threshold. In particular, a specific property
of complement class support (shown in (Arunasalam
and Chawla, 2006)) is exploited at line P22 to
automatically identify a class-specific minimum
confidence threshold. According to such a property, a rule
r : I → c is such that I and c are positively correlated
if and only if conf(I → c) > σ(c)/|D|, where σ(c)
is the overall number of occurrences of class c in D.
Therefore, the CARs whose confidence exceeds (at
line P22) the minimum threshold corresponding to
their targeted class are guaranteed to be positively
correlated. Thus, both confidence and positive
correlation between rule components can be verified without
additional parameters or further correlation analysis.
MINECARS(F, D, τ)
Input: a finite set of boolean attributes F;
       a training set D of transactions representing XML trees;
       and a support threshold τ;
Output: a set R of class association rules;
/* Frequent itemset discovery */
P1:  I ← ∅, k ← 1;
P2:  let L be the set of class labels in D;
P3:  let C_1 ← {c | ∃F_s ∈ F, l ∈ L such that c = {F_s}, class(c) = l};
P4:  while C_k ≠ ∅ do
P5:    for each candidate itemset c ∈ C_k do
P6:      supp(c) ← 0;
P7:    end for
P8:    for each x ∈ D do
P9:      for each c ∈ C_k such that class(c) = class(x) and c ⊆ x do
P10:       supp(c) ← supp(c) + 1/|D|;
P11:     end for
P12:   end for
P13:   L_k ← {c ∈ C_k | supp(c) > τ · supp(class(c))};
P14:   C_{k+1} ← {c_i ∪ c_j | c_i, c_j ∈ L_k ∧ class(c_i) = class(c_j) ∧ |c_i ∪ c_j| = k + 1};
P15:   C_{k+1} ← {c ∈ C_{k+1} | ∀c′ ⊂ c with |c′| = k, it holds that c′ ∈ L_k};
P16:   k ← k + 1;
P17: end while
P18: I ← ∪_k L_k;
/* CAR generation */
P19: R ← ∅;
P20: for each frequent itemset I ∈ I do
P21:   create rule r : I → class(I);
P22:   if conf(r) > σ(class(I))/|D| then
P23:     R ← R ∪ {r};
P24:   end if
P25: end for
P26: RETURN R;
Figure 2: The MINECARS procedure.
When classes are skewed, the dynamic setting of a
class-specific minimum confidence threshold acts as
a second stage of focused pruning, that ensures the
discovery of discriminative rules targeting the
infrequent classes and still avoids an overwhelming
number of rules from the predominant classes.
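The following sketch condenses the mining loop of fig. 2 into Python (an illustrative reconstruction, not the authors' implementation; the (feature set, class) transaction layout is assumed). It combines the per-class support threshold τ · supp(class(c)) with the positive-correlation test conf(r) > σ(c)/|D|.

```python
# Apriori-style CAR mining with multiple minimum class support and the
# complement-class-support correlation test (illustrative sketch).
from itertools import combinations
from collections import Counter

def mine_cars(D, tau):
    n = len(D)
    class_freq = Counter(cls for _, cls in D)            # sigma(c)
    min_supp = {c: tau * class_freq[c] / n for c in class_freq}

    def supp(itemset, c):
        return sum(1 for x, cls in D if cls == c and itemset <= x) / n

    # candidate 1-itemsets, one per (feature, class) pair
    level = {(frozenset([f]), c)
             for x, c in D for f in x if supp(frozenset([f]), c) > min_supp[c]}
    frequent = set(level)
    while level:
        next_level = set()
        for (a, ca), (b, cb) in combinations(level, 2):
            if ca == cb and len(a | b) == len(a) + 1:    # join step (P14)
                cand = a | b
                # Apriori pruning (P15), then class-specific support (P13)
                if all((frozenset(s), ca) in level
                       for s in combinations(cand, len(cand) - 1)) \
                        and supp(cand, ca) > min_supp[ca]:
                    next_level.add((cand, ca))
        frequent |= next_level
        level = next_level

    cars = []
    for itemset, c in frequent:                          # CAR generation
        cover = sum(1 for x, _ in D if itemset <= x)
        conf = supp(itemset, c) * n / cover
        if conf > class_freq[c] / n:                     # positive correlation
            cars.append((itemset, c, conf))
    return cars
```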
3.1.2 Learning an Associative-classification
Model
Due to the inherently combinatorial nature of the
associative patterns, MINECARS may yield a large
number of CARs, which are likely to overfit the train-
ing data and provide contrasting predictions. To avoid
such issues, a compact and accurate classifier is dis-
tilled from the rule set R through the covering method
PRUNE, illustrated in fig. 3.
PRUNE initially orders (at line M1) the available
CARs according to the total order ≺, which is
inspired by the one introduced in (Liu et al., 1998).
Precisely, given any two CARs r_i, r_j ∈ R, r_i precedes r_j,
which is denoted by r_i ≺ r_j, if (i) conf(r_i) is greater
than conf(r_j), or (ii) conf(r_i) is the same as conf(r_j),
but supp(r_i) is greater than supp(r_j), or (iii) conf(r_i)
is the same as conf(r_j) and supp(r_i) is identical to
supp(r_j), but length(r_i) is less than length(r_j). The
length of a CAR r : I → c is the number of features in
the antecedent of r, i.e., length(r) = |I|.

If two CARs r_i, r_j have equal confidence, support
and length, then r_i ≺ r_j if r_i was generated before r_j.
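As a sketch (with an assumed tuple layout for CARs), the total order ≺ is exactly a lexicographic sort key:

```python
# The total order of PRUNE as a Python sort key (illustrative; each CAR
# is assumed to be a tuple (antecedent, cls, conf, supp, gen_id)).
def order(cars):
    # higher confidence first, then higher support, then shorter
    # antecedent, then earlier generation
    return sorted(cars, key=lambda r: (-r[2], -r[3], len(r[0]), r[4]))
```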
A covering process (lines M4-M19) then seeks a
compact classifier C, consisting of a minimal number
of CARs from R, that attains a high predictive
accuracy over unlabeled transactions (representing
unclassified XML trees).

The covering process attempts the maximization
of the effectiveness F(C) of the resulting classifier C
across all classes. F(C) is evaluated in terms of the
macro-averaged F-measure (Manning et al., 2008) of
C, which is defined as follows:

F(C) = (1 / |L|) · Σ_{c∈L} F^(c)(C)

where F^(c)(C) is the effectiveness (or, also, the
predictive performance) of C over the generic class
c, described below. F(C) assigns the same relevance to
the effectiveness of C over the different classes,
regardless of the occurrence frequency of the individual
classes in the training data. This is especially useful in
the presence of class imbalance, since F(C) is not
dominated by the predictive performance of C over
the most frequent classes across the transactions.
The covering process increases F(C) by separately
acting on each F^(c)(C), through the selection
(at line M6) of CARs from R that, when appended to
C (at line M10), improve the predictive performance
of the resulting classifier over c without a significant
loss in compactness.

For each class c, covering scans (according to the ≺
order) the different CARs of R that target c. A CAR
r : I → c is appended to C if F^(c)(C ∪ {r}) is greater
than F^(c)(C) (at line M7). In this case, r is removed
from R (at line M9) and all transactions covered by r
are dropped from D (at line M12).

Notice that F^(c)(C) is 0 while C does not include
any CARs targeting c.

The total order ≺ established over R (at line M1)
plays a key role while covering the generic class c:
it ensures that the CARs predicting c with the highest
implicative strength are appended to C from the early
stages of the covering process, which is beneficial for
F^(c). Moreover, as the covering of class c proceeds, the
consequent removal of transactions from D operates
as a pruning method, which increasingly tends to
prevent the retention in C of those CARs targeting c that
have not yet been considered. This positively acts on
both the effectiveness and the compactness of C since,
according to the ≺ order, such CARs predict c with
either a lower implicative strength or a higher
specificity.
In particular, pruning specific CARs is useful to
filter redundancy (Li et al., 2001) away from C.
Therein, let r : I → c and r′ : I′ → c be two redundant
CARs of R, such that I ⊂ I′ (i.e., r′ is more
specific than r) and the implicative strength of r′ is
lower than that of r. According to the ≺ order of the
CARs, it holds that r ≺ r′ (i.e., r precedes r′ in R).
Therefore, if r is added to C, the consequent removal
from D of D_r (i.e., the set of transactions covered by
r) also involves the elimination of D_{r′}, since D_{r′} is
actually a subset of D_r. As a consequence, when covering
subsequently considers r′, the latter will be unable
to improve F^(c) and, thus, will be discarded.
Redundancy avoidance strongly contributes to the
compactness of C (especially when the size of R is large), in
cooperation with the information-theoretic scheme at
the end of this section. In the envisaged cooperation,
pruning specific CARs avoids redundancy, whereas
the latter scheme is used to evaluate whether the gain
in predictive performance, due to the addition of
non-redundant CARs to C, is worth the consequent loss in
compactness.
Notice that the different classes are separately
covered (at line M4) in increasing order of their
occurrence frequency, to avoid that transactions
belonging to less frequent classes are removed from D while
covering other classes with higher occurrence
frequency. This would have the undesirable effect of
preventing an appropriate evaluation of the gain in
effectiveness due to the addition to C of CARs targeting
the aforesaid less frequent classes.

Covering halts when either there are no more
CARs to consider (which is caught, for each class c,
at line M5), or all training transactions have been
covered (which is caught at line M15), or the predictive
performance of C cannot be further increased (which
is caught, for each class c, at line M5).
The generic F^(c) summarizes two further measures
of class-specific effectiveness, i.e., the degree of
precision P^(c) and recall R^(c) of classifier C in class c:

F^(c)(C) = 2 · ( 1/P^(c)(C) + 1/R^(c)(C) )^(−1)

An increase in F^(c), due to the addition of a CAR
r to the current classifier C, means that r produces
an acceptable improvement of the predictive
performance of C, ascribable to an appreciable gain in at least
one of P^(c) and R^(c).
Precision P^(c)(C) is the exactness of C within
class c, i.e., the proportion of transactions that are
actually of class c among the ones assigned by C to
class c. Recall R^(c)(C) is instead the completeness of
C within c, i.e., the fraction of transactions of class c
that are correctly predicted by C. Formally, let
D^(c)_C = {x ∈ D | ∃r ∈ C, r : I → c, I ⊆ x} be the set of
transactions covered by the CARs of C predicting class c and
p^(c)_C = {x ∈ D | ∃r ∈ C, r : I → c, I ⊆ x, class(x) = c}
be the set of transactions correctly assigned by C to
class c. Also, assume that σ(c) is the overall number
of transactions of class c. Precision P^(c)(C) and recall
R^(c)(C) are defined as reported below:

P^(c)(C) = |p^(c)_C| / |D^(c)_C|        R^(c)(C) = |p^(c)_C| / σ(c)
Precision P^(c)(C) and recall R^(c)(C) provide
complementary information on the effectiveness of C
over c. Indeed, an improvement in precision alone,
achieved by appending r to C, would not say anything
about the corresponding variation in the recall of C ∪ {r}.
Dually, an improvement in recall alone would not say
anything about the corresponding variation in the
precision of C ∪ {r}.

The simultaneous increase of both precision and
recall is a challenging issue in the design of
algorithms for learning classification models, since it
often happens that a gain in recall corresponds to a loss
in precision and vice versa. F^(c)(C) is the harmonic
mean of P^(c)(C) and R^(c)(C) and, hence, it is always
closer to the smaller of the two. Therefore, an
improvement of F^(c)(C ∪ {r}) with respect to F^(c)(C)
ensures that an increase in recall due to the addition
of r to C is not nullified by a serious loss in precision.
Let D_r be the set of transactions left in D that are
covered by r : I → c (at line M11). Also, assume
that p^(c)_r is the subset of those transactions in D_r
correctly classified by r into class c, i.e., p^(c)_r = {x ∈
D_r | class(x) = c}. The updated values of precision
P^(c)(C ∪ {r}) and recall R^(c)(C ∪ {r}), resulting from
the addition of r to C, are incrementally computed
from P^(c)(C) and R^(c)(C) as follows:

P^(c)(C ∪ {r}) = ( |p^(c)_C| + |p^(c)_r| ) / ( |D^(c)_C| + |D_r| )

R^(c)(C ∪ {r}) = R^(c)(C) + |p^(c)_r| / σ(c)
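In code, the incremental update only needs the running counts. The sketch below (illustrative names and data layout) applies the two formulas above and derives the candidate F^(c) via the harmonic mean.

```python
# Incremental class-c precision/recall update when a CAR is appended
# (illustrative sketch; p_C and d_C hold |p^(c)_C| and |D^(c)_C|,
# sigma_c is the number of class-c transactions, and D holds the
# residual uncovered training transactions as (feature set, class) pairs).
def append_car(antecedent, c, D, p_C, d_C, sigma_c):
    D_r = [(x, cls) for x, cls in D if antecedent <= x]  # covered by r
    p_r = sum(1 for _, cls in D_r if cls == c)           # correct in class c
    p_C, d_C = p_C + p_r, d_C + len(D_r)
    precision = p_C / d_C if d_C else 0.0
    recall = p_C / sigma_c
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    D_left = [(x, cls) for x, cls in D if not antecedent <= x]
    return f, precision, recall, D_left, p_C, d_C
```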
When covering ends (line M19), the resulting
classifier C is a list of predictive CARs grouped by
the targeted class. The individual groups of CARs
appear in C in increasing order of the occurrence
frequency of the targeted class.
PRUNE(R, D, L)
Input: a set R of CARs;
       a set D of transactions;
       a set L of class labels in D;
Output: a classifier C;
/* Rule ordering according to the devised total order ≺ */
M1:  R ← ORDER(R);
M2:  C ← ∅;
M3:  T ← D;
M4:  for each c ∈ L in increasing order of occurrence frequency do
M5:    while there are still CARs in R that target c do
M6:      choose the next rule r : I → c from R;
M7:      if F^(c)(C ∪ {r}) > F^(c)(C) then
M8:        cur length ← length(C ∪ {r});
M9:        R ← R − {r};
M10:       C ← C ∪ {r};
M11:       D_r ← {x ∈ D | I ⊆ x};
M12:       D ← D − D_r;
M13:       min length ← cur length;
M14:     end if
M15:     if |D| = 0 then
M16:       continue at line M20;
M17:     end if
M18:   end while
M19: end for
M20: if |D| > 0 then
M21:   c∗ ← argmax_{c∈L} supp(c, D);
M22: else
M23:   c∗ ← argmax_{c∈L} supp(c);
M24: end if
M25: C ← C ∪ {∅ → c∗};
M26: RETURN C;
Figure 3: The PRUNE procedure.
Moreover, within each group, the CARs reflect the total order ≺ established
over R.

As noted in (Tan et al., 2006), the class-based
ordering of the CARs in C confers on the classifier
a high interpretability, which would not be obtained if
the same CARs were sorted in C according to the ≺
relationship. Indeed, in this latter case, the meaning
of each CAR r in C would involve the negation of
any other CAR r′ in C such that r′ ≺ r. This would
clearly reduce the comprehensibility of the CARs sited
at the bottom of C, especially if the size of C is large.
Class-based ordering has been largely adopted in the
design of seminal rule-induction algorithms.
To conclude the discussion on PRUNE, the
mutual exclusiveness and the exhaustive coverage of the
CARs of any resulting classifier C must be touched upon.

A rule-based classifier is mutually exclusive if
each input triggers no more than one rule. Generally,
such a property does not hold for C. Indeed, it is
actually possible that multiple CARs are triggered by the
same transaction. This is clearly undesirable because
(some of) the triggered CARs may provide
contrasting predictions. This problem is overcome in sec. 3.2.

Instead, the addition to C (at line M25) of a default
rule ∅ → c∗ ensures exhaustive coverage, i.e., that
every transaction is covered by at least one CAR of
C ∪ {∅ → c∗}. In particular, the default rule covers all
those transactions uncovered by the CARs of C and
assigns them to a suitable class c∗. This guarantees
maximum recall with very poor precision in class c∗.
To attenuate the loss in precision, c∗ can be
reasonably chosen (at line M23) to be the class with the
highest occurrence frequency in the training data, which
ensures the highest precision for the default rule.
Depending on the overall coverage of C, there are two
alternative possibilities for the choice of the default
class c∗. If there are still uncovered transactions after
the termination of the covering process (at line M20),
c∗ is selected (at line M21) as the class with
maximum occurrence frequency supp(c, D) in the residual
training data D. Otherwise, if all transactions have
been covered, c∗ is chosen (at line M23) to be the
class with the highest occurrence frequency in the whole
training data. In both cases, ties can be arbitrarily
broken.
3.2 Prediction
Let C be an associative classifier induced by XCCS at
the end of model-learning phase of fig. 1. Also, as-
sume that t is an unlabeled (i.e. an unclassified) XML
tree, whose transactional representation over the un-
derlying feature space F is x. The class predicted by
C for t is C (x). To avoid conflicting predictions from
multiple triggered CARs, C (x) is provided by the first
CAR of C that covers x.
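The prediction step thus reduces to a first-match scan of the ordered rule list; a minimal sketch (assumed rule layout) follows.

```python
# First-match prediction (illustrative): C lists (antecedent, class)
# pairs in their final order; the default rule has an empty antecedent
# (frozenset()), which covers every transaction x, so a class is
# always returned.
def predict(C, x):
    for antecedent, c in C:
        if antecedent <= x:
            return c
```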
4 EVALUATION
The empirical behavior of XCCS is studied in order
to comparatively evaluate its effectiveness
across different domains.
All tests are performed on a Linux machine with
an Intel Core 2 Duo CPU, 4 GB of memory and a
2 GHz clock speed.
4.1 Data Sets
The behavior of XCCS is tested over several real XML
data sets. Synthetic XML corpora are not considered
for experimental purposes, since they are generally
unlikely to provide coherent textual information in
natural language.
Macro-averaged effectiveness results are obtained
by performing a stratified 10-fold cross validation on
the transactional representation of each data set.
We choose four real-world XML data sets, that
include textual information and are characterized
by skewed distributions of the classes of XML
documents.
Wikipedia is an XML corpus proposed in the INEX
contest 2007 (Denoyer and Gallinari, 2008) as a
major benchmark for XML classification and
clustering. The corpus groups 96,000 XML documents
representing very long articles from the digital
encyclopedia. The XML documents are organized into 21
classes (or thematic categories), each corresponding
to a distinct Wikipedia Portal. A challenging aspect
of the corpus is the ambiguity of certain pairs of
classes such as, e.g., Portal:Pornography and
Portal:Sexuality, or Portal:Christianity and
Portal:Spirituality (Denoyer and Gallinari,
2008).
IEEE is a reference text-rich corpus, presented
in (Denoyer and Gallinari, 2007), that includes
12,107 XML documents representing full articles.
These are organized into 18 classes corresponding
to as many IEEE Computer Society publications:
6 Transactions and 12 other journals. The same
thematic subject can be treated in two distinct journals.
DBLP is a bibliographic archive of scientific
publications on computer science
(http://dblp.uni-trier.de/xml/). The archive
is available as one very large XML file with a
diversified structure. The whole file is decomposed into
479,426 XML documents corresponding to as many
scientific publications. These individually belong
to one of 8 classes: article (173,630 documents),
proceedings (4,764 documents), mastersThesis
(5 documents), incollection (1,379 documents),
inproceedings (298,413 documents), book (1,125
documents), www (38 documents) and phdthesis (72
documents). The individual classes exhibit
differentiated structures, despite some overlap among certain
document tags (such as title, author, year and pages),
that occur in (nearly) all of the XML documents.
The Sigmod collection groups 988 XML documents
(i.e., articles from SIGMOD Record) complying
with three different class DTDs: IndexTermsPage,
OrdinaryIssue and Proceedings. These classes
contain, respectively, 920, 51 and 17 XML
documents. Such classes have diversified structures,
despite the occurrence of some overlapping tags,
such as volume, number, authors, title and year.
4.2 Preprocessing
The high dimensionality (i.e., cardinality) of the
feature space S may be a concern for the time efficiency
and the scalability of model induction. In particular,
when the classes of XML trees cannot be
discriminated through the structural information alone
and, hence, the content information must necessarily
be taken into account, the number of XML features
likely becomes very large if the XML documents
contain huge amounts of textual data.

To reduce the dimensionality of the feature space
S, the available XML data is preprocessed in two
steps.

The first step addresses the textual information
of the XML data and considerably reduces the overall
number of distinct terms in the leaves of the XML
trees through token extraction, stop-word removal and
stemming.
Dimensionality reduction is performed at the
second step, both to reduce overfitting and to ensure a
satisfactory behavior of XCCS in terms of efficiency,
scalability and compactness of the induced classifiers.
The idea is to partition S into groups of XML
features that discriminate the individual classes in a
similar fashion. For this purpose, we explicitly
represent the discriminatory behavior of the features in S
and then group the actually discriminatory features
through distributional clustering (Baker and McCallum,
1998). In particular, the discriminatory behavior
of each feature p.w in S is represented as an
array v_{p.w} with as many entries as the number of
classes in L. The generic entry of v_{p.w} is the
probability P(c|p.w) of class c given the XML feature
p.w. Clearly, P(c|p.w) = P(p.w|c)P(c) / P(p.w) by Bayes' rule,
where the probabilities P(p.w|c), P(c) and P(p.w) are
estimated from the data. Before clustering features,
noisy (i.e., non-discriminatory) features are removed
from S. Specifically, a feature p.w is noisy if there is
no class c in L that is positively correlated with p.w
according to chi-square testing at a significance level
of 0.05. The arrays relative to the remaining features of
S are then grouped through distributional clustering
into a desired number of feature clusters with similar
discriminatory behavior.

Eventually, an aggressive compression of the
original feature space S is achieved by replacing each
feature p.w of S with a respective synthetic feature,
i.e., the label of the cluster to which v_{p.w} belongs.
Distributional clustering efficiently compresses the
feature space by various orders of magnitude, while
still enabling a significantly better classification
performance than several other established techniques
for dimensionality reduction (Baker and McCallum,
1998).
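A rough sketch of this compression step follows (an illustrative reconstruction: the counting layout is assumed, the chi-square test uses the df = 1 critical value 3.841 at significance level 0.05, and plain k-means from scikit-learn stands in for the agglomerative distributional clustering of (Baker and McCallum, 1998)).

```python
# Feature-space compression (illustrative): drop features not positively
# correlated with any class (chi-square on a 2x2 contingency table),
# then cluster the class-posterior vectors v_{p.w} = [P(c | p.w)].
import numpy as np
from collections import Counter, defaultdict
from sklearn.cluster import KMeans

def compress_features(D, n_clusters=50):
    """D: list of (feature set, class) pairs. Returns {feature: cluster id}."""
    classes = sorted({cls for _, cls in D})
    n, n_c = len(D), Counter(cls for _, cls in D)
    n_fc = defaultdict(Counter)                     # n_fc[f][c] co-occurrences
    for x, cls in D:
        for f in x:
            n_fc[f][cls] += 1

    kept, vectors = [], []
    for f, cnt in n_fc.items():
        n_f = sum(cnt.values())
        for c in classes:
            a = cnt[c]                              # f present, class c
            table = np.array([[a, n_f - a],
                              [n_c[c] - a, n - n_f - n_c[c] + a]], float)
            expected = np.outer(table.sum(1), table.sum(0)) / n
            if (expected > 0).all():
                chi2 = ((table - expected) ** 2 / expected).sum()
                # keep f if positively correlated with some class
                if chi2 > 3.841 and a > expected[0, 0]:
                    kept.append(f)
                    vectors.append([cnt[k] / n_f for k in classes])
                    break

    if not kept:
        return {}
    k = min(n_clusters, len(kept))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.array(vectors))
    return dict(zip(kept, labels))
```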
4.3 Classification Effectiveness
We compare XCCS against several other established
competitors in terms of classification effectiveness.
Both direct and indirect comparisons are performed.
The direct comparisons involve three competing
classification approaches, that produce rule-based
classifiers, i.e., XRULES (Zaki and Aggarwal, 2006),
CBA (Liu et al., 1998) and CPAR (Yin and
Han, 2003). These competitors are publicly available
and, thus, can be compared against XCCS on each
XML data set.
XRULES is a state-of-the-art competitor, that
admits multiple cost models to evaluate classification
effectiveness. For each data set, we repeatedly
train XRULES to suitably tune the minimum support
threshold for the frequent subtrees to find in the
various classes and, then, report the results of the cost
model yielding the best classification performance.
CBA and CPAR are two seminal techniques for
learning associative classifiers. CBA and CPAR are
included among the competitors of XCCS to compare
the effectiveness of the three distinct approaches to
associative classification at discriminating classes in
(high-dimensional) transactional domains.
To evaluate CBA and CPAR, we use the imple-
mentations from (Coenen, 2004). In all tests, both
CBA and CPAR are trained on the transactional rep-
resentations of the XML data sets used to feed XCCS.
Again, CBA and CPAR are repeatedly trained on the
transactional representation of each XML data set, in
order to suitably tune their input parameters. For ev-
ery data set, we report the results of the most effective
classifiers produced by CBA and CPAR.
Through preliminary tests we noticed that, in all
cases, a satisfactory behavior of XCCS can be
obtained by fixing the support threshold τ of fig. 1 to
0.1. This is essentially due to the adoption of multiple
minimum class support (Liu et al., 2000) (discussed in
section 3.1.1) in the MINECARS procedure of fig. 2.
Fig. 4 summarizes the effectiveness of the chosen
competitors across the selected data sets.
Columns Size and #C indicate, respectively, the
number of XML documents and classes for each cor-
pus. Column Model identifies the competitors. Rules
is the rounded average number of rules of
a classifier in the stratified 10-fold cross validation.
The effectiveness of each classifier is measured in
terms of average precision (P), average recall (R), av-
erage F-measure (F). More precisely, the values of
P, R and F are averages of precision, recall and F-
measure over the folds of the stratified 10-fold cross
validation of classifiers on the individual data sets.
The maximum values of P, R and F on each data
set are highlighted in bold.
Notice that we tabulate only the (best) results
achieved by the approaches (de Campos et al., 2008;
Murugeshan et al., 2008; Yang and Zhang, 2008;
Yong et al., 2007; Xing et al., 2007) in the respective
papers. Some results were not originally measured and,
hence, are reported as N.A. (short for not available).
Rules is meaningless for (de Campos et al., 2008;
Murugeshan et al., 2008; Yang and Zhang, 2008; Yong
et al., 2007; Xing et al., 2007) and, thus, its entry
in the corresponding rows is left blank. The dash (-)
that appears in three rows of fig. 4 reveals
that XRULES did not successfully complete the tests
over Wikipedia, IEEE and DBLP. The enumeration
of the frequent embedded subtrees within each class
and the consequent generation of predictive
structural rules (satisfying the specified level of minimum
class-specific support) are very time-expensive steps
of XRULES, especially when the underlying number
of XML documents is (very) large. In all completed
tests, XRULES is less effective than XCCS.
In addition, the huge number of rules produced by
XRULES makes the resulting classification models
difficult to understand (and, hence, hardly actionable)
in practice. The classification performance of CBA
is inferior to that of XCCS on the selected data
sets. Moreover, as discussed in sec. 3.1.2,
interpreting CBA classifiers may be cumbersome, since their
rules are not ordered by the targeted class. CPAR
delivers a satisfactory classification performance on
the chosen XML corpora. Nonetheless, CPAR is
still less effective and compact than XCCS. The
approaches in (de Campos et al., 2008; Murugeshan et al.,
2008; Yang and Zhang, 2008) and (Yong et al., 2007;
Xing et al., 2007) exhibit generally inferior
classification performance compared to XCCS on the Wikipedia and
IEEE corpora, respectively.
To conclude, XCCS consistently induces the most
effective classifiers on the chosen corpora. As far as
compactness is concerned, such classifiers are gen-
erally comparable to the ones induced by CBA and
significantly more compact than the ones induced by
XRULES and CPAR. The effectiveness of XCCS
confirms its general capability of handling XML data with
skewed class distributions.
5 CONCLUSIONS AND FURTHER
RESEARCH
XCCS is a new approach to XML classification that
induces clearly interpretable predictive models, which
are of great practical interest for more effective and
efficient XML search, retrieval and filtering. XCCS
induces very compact classifiers with outperforming
effectiveness from very large corpora of XML data.
Data       Size     #C   Model                       Rules    P     R     F
Wikipedia  96,000   21   XCCS                        87       0.77  0.78  0.78
                         XRULES                      -        -     -     -
                         CBA                         90       0.60  0.61  0.61
                         CPAR                        156      0.73  0.72  0.73
                         (de Campos et al., 2008)             N.A.  N.A.  0.75
                         (Murugeshan et al., 2008)            N.A.  0.76  N.A.
                         (Yang and Zhang, 2008)               N.A.  0.84  N.A.
IEEE       12,107   18   XCCS                        49       0.73  0.75  0.74
                         XRULES                      -        -     -     -
                         CBA                         68       0.53  0.55  0.54
                         CPAR                        133      0.68  0.68  0.68
                         (Yong et al., 2007)                  N.A.  N.A.  0.72
                         (Xing et al., 2007)                  N.A.  N.A.  0.60
DBLP       479,426  8    XCCS                        10       1.00  1.00  1.00
                         XRULES                      -        -     -     -
                         CBA                         8        0.95  0.95  0.95
                         CPAR                        10       0.96  0.96  0.96
Sigmod     988      3    XCCS                        6        1.00  1.00  1.00
                         XRULES                      > 50000  0.95  0.94  0.94
                         CBA                         3        0.91  0.93  0.92
                         CPAR                        32       0.93  0.94  0.93
Figure 4: Results of the empirical evaluation.
Ongoing research aims to increase the discriminatory
power of XML features by incorporating the textual
context of words in the leaves of the XML trees.
Also, we are studying enhancements of model learn-
ing in XCCS, with which to induce classification rules
that also consider the absence of XML features.
REFERENCES
Agrawal, R. and Srikant, R. (1994). Fast algorithms for
mining association rules. In Proc. of Int. Conf. on Very
Large Data Bases, pages 487–499.
Arunasalam, B. and Chawla, S. (2006). CCCS: A top-
down association classifier for imbalanced class distri-
bution. In Proc. of Int. Conf. on Knowledge Discovery
and Data Mining, pages 517–522.
Baker, L. and McCallum, A. (1998). Distributional clustering
of words for text classification. In Proc. of ACM
Int. Conf. on Research and Development in Information
Retrieval, pages 96–103.
Coenen, F. (2004). LUCS KDD implementations of CBA
and CMAR. Dept. of Computer Science, University of
Liverpool - www.csc.liv.ac.uk/~frans/KDD/Software/.
de Campos, L., Fernández-Luna, J., Huete, J., and Romero,
A. (2008). Probabilistic methods for structured document
classification at INEX'07. In Proc. of INitiative
for the Evaluation of XML Retrieval, pages 195–206.
Denoyer, L. and Gallinari, P. (2007). Report on the XML
mining track at INEX 2005 and INEX 2006. ACM SIGIR
Forum, 41(1):79–90.
Denoyer, L. and Gallinari, P. (2008). Report on the XML
mining track at INEX 2007. ACM SIGIR Forum,
42(1):22–28.
Garboni, C., Masseglia, F., and Trousse, B. (2006). Sequential
pattern mining for structure-based XML document
classification. In Proc. of the INitiative for the Evaluation
of XML Retrieval, pages 458–468.
Knijf, J. D. (2007). FAT-CAT: Frequent attributes tree based
classification. In Proc. of the INitiative for the Evaluation
of XML Retrieval, pages 485–496.
Li, W., Han, J., and Pei, J. (2001). CMAR: Accurate
and efficient classification based on multiple
class-association rules. In Proc. of Int. Conf. on Data Mining,
pages 369–376.
Liu, B., Hsu, W., and Ma, Y. (1998). Integrating classification
and association rule mining. In Proc. of Conf. on
Knowledge Discovery and Data Mining, pages 80–86.
Liu, B., Ma, Y., and Wong, C. (2000). Improving an
association rule based classifier. In Proc. of Int. Conf. on
Principles of Data Mining and Knowledge Discovery,
pages 504–509.
Manning, C., Raghavan, P., and Schütze, H. (2008). Introduction
to Information Retrieval. Cambridge University Press.
Murugeshan, M., Lakshmi, K., and Mukherjee, S. (2008).
A categorization approach for Wikipedia collection
based on negative category information and initial
descriptions. In Proc. of the INitiative for the Evaluation
of XML Retrieval.
Tan, P.-N., Steinbach, M., and Kumar, V. (2006). Introduction
to Data Mining. Addison Wesley.
Theobald, M., Schenkel, R., and Weikum, G. (2003). Exploiting
structure, annotation, and ontological knowledge
for automatic classification of XML data. In Proc.
of WebDB Workshop, pages 1–6.
Yin, X. and Han, J. (2003). CPAR: Classification based
on predictive association rules. In Proc. of SIAM Int.
Conf. on Data Mining, pages 331–335.
Xing, G., Guo, J., and Xia, Z. (2007). Classifying XML
documents based on structure/content similarity. In Proc.
of the INitiative for the Evaluation of XML Retrieval,
pages 444–457.
Yang, J. and Zhang, F. (2008). XML document classification
using extended VSM. In Proc. of the INitiative for the
Evaluation of XML Retrieval, pages 234–244.
Yi, J. and Sundaresan, N. (2000). A classifier for
semi-structured documents. In Proc. of Int. Conf. on
Knowledge Discovery and Data Mining, pages 340–344.
Yong, S., Hagenbuchner, M., Tsoi, A., Scarselli, F., and
Gori, M. (2007). XML document mining using graph
neural network. In Proc. of the INitiative for the
Evaluation of XML Retrieval, pages 458–472.
Zaki, M. and Aggarwal, C. (2006). XRules: An effective
algorithm for structural classification of XML data.
Machine Learning, 62(1-2):137–170.