Both direct and indirect comparisons are performed.
The direct comparisons involve three competing
classification approaches that produce rule-based
classifiers, namely XRULES (Zaki and Aggarwal, 2006),
CBA (Liu et al., 1998) and CPAR (Xin and
Han, 2003). These competitors are publicly available
and, thus, can be compared against XCCS on each
XML data set.
XRULES is a state-of-the-art competitor that admits
multiple cost models to evaluate classification
effectiveness. For each data set, we repeatedly
train XRULES to tune the minimum support
threshold for the frequent subtrees to be found in the
various classes and then report the results of the cost
model that yields the best classification performance.
CBA and CPAR are two seminal techniques for
learning associative classifiers. CBA and CPAR are
included among the competitors of XCCS to compare
the effectiveness of the three distinct approaches to
associative classification at discriminating classes in
(high-dimensional) transactional domains.
To evaluate CBA and CPAR, we use the imple-
mentations from (Coenen, 2004). In all tests, both
CBA and CPAR are trained on the transactional rep-
resentations of the XML data sets used to feed XCCS.
Again, CBA and CPAR are repeatedly trained on the
transactional representation of each XML data set, in
order to suitably tune their input parameters. For ev-
ery data set, we report the results of the most effective
classifiers produced by CBA and CPAR.
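
The repeated training used to tune the competitors can be pictured as a plain exhaustive search over candidate parameter values. The sketch below is only illustrative: the parameter names (min_support, min_confidence), the candidate values and the scoring function are hypothetical, since the parameters and ranges actually explored are not listed here.

```python
from itertools import product


def tune(train_and_score, grid):
    """Try every parameter combination in `grid` and keep the best one.

    `train_and_score` is a caller-supplied function that trains a classifier
    with the given parameters and returns a single effectiveness score
    (for instance, an average F-measure).
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for "train the classifier and evaluate it".
    return (1.0
            - abs(params["min_support"] - 0.05)
            - abs(params["min_confidence"] - 0.7))


# Hypothetical candidate values, chosen only for the illustration.
grid = {"min_support": [0.01, 0.05, 0.1], "min_confidence": [0.5, 0.7, 0.9]}
best_params, best_score = tune(toy_score, grid)
print(best_params, best_score)
```

In practice, toy_score would be replaced by a routine that runs the cross validation for the chosen competitor and returns its average effectiveness.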
Preliminary tests showed that, in all tests, XCCS
behaves satisfactorily when the support threshold τ
of fig. 1 is fixed to 0.1. This is essentially due to
the adoption of the minimum class support
(Liu et al., 2000) (discussed in
section 3.1.1) in the MINECARS procedure of fig. 2.
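
As a rough illustration of why a single value of τ can suit corpora with skewed classes, the sketch below assumes that the minimum class support scales the global threshold by the relative frequency of each class. This is our reading of Liu et al. (2000), not a restatement of section 3.1.1, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def class_min_supports(labels, tau):
    """Return a per-class support threshold.

    Assumption: the threshold of class c is tau * (documents in c) / (all
    documents), so that rare classes get proportionally lower thresholds.
    """
    counts = Counter(labels)
    total = len(labels)
    return {c: tau * n / total for c, n in counts.items()}


# Toy corpus with a skewed class distribution: 90 documents in class "a"
# and 10 documents in class "b".
labels = ["a"] * 90 + ["b"] * 10
thresholds = class_min_supports(labels, tau=0.1)
print(thresholds)
```

Under this assumption, the threshold of the rare class is nine times lower than that of the frequent class, so rules predicting the rare class are not pruned merely because the class is small.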
Fig. 4 summarizes the effectiveness of the chosen
competitors across the selected data sets.
Columns Size and #C indicate, respectively, the
number of XML documents and classes for each corpus.
Column Model identifies the competitors. Rules
is the average number of rules of a classifier over
the stratified 10-fold cross validation, rounded off.
The effectiveness of each classifier is measured in
terms of average precision (P), average recall (R) and
average F-measure (F). More precisely, the values of
P, R and F are averages of precision, recall and F-
measure over the folds of the stratified 10-fold cross
validation of classifiers on the individual data sets.
The maximum values of P, R and F on each data
set are highlighted in bold.
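
The way the tabulated values are obtained can be sketched as follows. The sketch assumes that, within each fold, precision, recall and F-measure are macro-averaged over the classes; the averaging scheme within a fold is not stated above, so this choice is an assumption. The function names and the two toy folds are hypothetical.

```python
from statistics import mean


def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F-measure for one fold."""
    precisions, recalls, f_measures = [], [], []
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_measures.append(f)
    return mean(precisions), mean(recalls), mean(f_measures)


def summarize(folds):
    """Average P, R, F over the folds and round the average rule count.

    Each fold is a triple (true labels, predicted labels, number of rules).
    """
    per_fold = [macro_prf(t, p) for t, p, _ in folds]
    P = mean(m[0] for m in per_fold)
    R = mean(m[1] for m in per_fold)
    F = mean(m[2] for m in per_fold)
    rules = round(mean(n for _, _, n in folds))
    return P, R, F, rules


# Two toy folds instead of ten, to keep the example short.
folds = [
    (["a", "a", "b", "b"], ["a", "a", "b", "b"], 10),
    (["a", "a", "b", "b"], ["a", "b", "b", "b"], 12),
]
P, R, F, rules = summarize(folds)
print(P, R, F, rules)
```

With a stratified 10-fold cross validation, folds would contain ten such triples, one per held-out fold.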
Note that, for the approaches (de Campos et al., 2008;
Murugeshan et al., 2008; Yang and Zhang, 2008;
Yong et al., 2007; Xing et al., 2007), we tabulate only
the (best) results reported in the respective
papers.
Some results were not originally measured and,
hence, are reported as N.A. (short for not available).
Rules is not meaningful for (de Campos et al., 2008;
Murugeshan et al., 2008; Yang and Zhang, 2008; Yong
et al., 2007; Xing et al., 2007) and, thus, its entry
in the corresponding rows is left blank. The symbol
− that appears in three rows of fig. 4 indicates
that XRULES did not successfully complete the tests
over Wikipedia, IEEE and DBLP. The enumeration
of the frequent embedded subtrees within each class
and the consequent generation of predictive struc-
tural rules (satisfying the specified level of minimum
class-specific support) are very time-consuming steps
of XRULES, especially when the underlying number
of XML documents is (very) large. In all completed
tests, XRULES is less effective than XCCS.
In addition, the huge number of rules produced by
XRULES makes the resulting classification models
difficult to understand (and, hence, hardly actionable)
in practice. The classification performance of CBA
is inferior to that of XCCS on the selected data
sets. Moreover, as discussed in sec. 3.1.2, interpreting
CBA classifiers may be cumbersome, since their
rules are not ordered by the targeted class. CPAR
delivers a satisfactory classification performance on
the chosen XML corpora. Nonetheless, CPAR is
still less effective and less compact than XCCS. The
approaches (de Campos et al., 2008; Murugeshan et al.,
2008; Yang and Zhang, 2008) and (Yong et al., 2007;
Xing et al., 2007) generally exhibit a classification
performance inferior to that of XCCS on the Wikipedia and
IEEE corpora, respectively.
To conclude, XCCS consistently induces the most
effective classifiers on the chosen corpora. As far as
compactness is concerned, such classifiers are gen-
erally comparable to the ones induced by CBA and
significantly more compact than the ones induced by
XRULES and CPAR. The effectiveness of XCCS confirms
its general capability of handling XML data with
skewed class distributions.
5 CONCLUSIONS AND FURTHER RESEARCH
XCCS is a new approach to XML classification that
induces clearly interpretable predictive models, which
are of great practical interest for more effective and
efficient XML search, retrieval and filtering. XCCS
induces very compact classifiers that outperform the
competitors in effectiveness on very large corpora of XML data.
Ongoing research aims to increase the discrimina-
KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval