relation coefficient tends to be lower than the correlation between MRI and accuracy. Figures 2 and 3 show that there is actually only one outlier dataset, namely the Ring dataset, for which we provide further discussion.
The choice of ρ, the average percentage of samples contained in the largest probe, slightly influences the performance of the MRI. The reported tables show that the optimal choice is ρ = 0.05. Larger values of ρ decrease the performance of the MRI for ranking purposes. This behaviour is largely expected, however, since increasing ρ forces the algorithm to concentrate on larger hyperspheres, gradually eroding its ability to perform a local analysis.
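To make the role of ρ concrete, the following Python fragment is a minimal sketch, not the implementation used in this work: it assumes a per-sample score obtained by averaging a simple local class-imbalance measure over neighbourhoods of increasing size, with the largest neighbourhood containing ρ·N samples. The names mri_scores, local_imbalance and n_probes are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_imbalance(neigh_labels, centre_label):
    # Fraction of neighbours belonging to a class other than the centre's.
    return np.mean(neigh_labels != centre_label)

def mri_scores(X, y, rho=0.05, n_probes=10):
    y = np.asarray(y)
    n = len(X)
    max_k = max(2, int(rho * n))                  # largest probe holds about rho*N samples
    probe_sizes = np.unique(np.linspace(2, max_k, n_probes).astype(int))
    nn = NearestNeighbors(n_neighbors=max_k).fit(X)
    _, idx = nn.kneighbors(X)                     # idx[i, 0] is the point itself
    scores = np.empty(n)
    for i in range(n):
        vals = [local_imbalance(y[idx[i, 1:k]], y[i]) for k in probe_sizes]
        scores[i] = np.mean(vals)                 # average across probe resolutions
    return scores

In such a sketch, increasing rho enlarges max_k, so the averaged neighbourhoods become less local, which mirrors the loss of ranking performance observed for larger ρ.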
4 CONCLUSIONS
In this work, a method for partitioning datasets into regions of different classification complexity has been proposed. The method relies on a specific metric, called MRI, which is typically used to cluster the elements of a dataset into three regions of increasing classification complexity, thus separating the “easy” part of the data from the “hard” part (possibly due to noise). Increasing the number of clusters up to five does not decrease the ranking capacity of the MRI, except for particular datasets and only when compared with the F-Score or Matthews’ correlation coefficient. Moreover, the proposed method proved stable and effective for the majority of experiments and parameter settings.
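As a minimal illustration of the partitioning step described above (not the exact procedure used in the experiments), per-sample MRI values can be grouped into three regions with a one-dimensional clustering, for example k-means; the helper name split_by_complexity is illustrative.

import numpy as np
from sklearn.cluster import KMeans

def split_by_complexity(mri_values, n_regions=3):
    # Cluster the one-dimensional MRI values into n_regions groups.
    mri_values = np.asarray(mri_values).reshape(-1, 1)
    km = KMeans(n_clusters=n_regions, n_init=10, random_state=0)
    labels = km.fit_predict(mri_values)
    # Relabel so that region 0 has the lowest mean MRI ("easy") and
    # region n_regions-1 the highest ("hard", possibly noisy).
    order = np.argsort(km.cluster_centers_.ravel())
    remap = np.empty(n_regions, dtype=int)
    remap[order] = np.arange(n_regions)
    return remap[labels]

Setting n_regions to 5 corresponds to the finer split discussed above.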
Further work on the MRI will be carried out along both theoretical and experimental directions. Studies on the statistical significance of MRI estimates may help to discover a lower bound on the optimal number of clusters to be used for splitting a dataset. We are also planning to replace the imbalance estimation function with a local correlation estimate, aimed at separating linearly separable areas (which are typically easy to classify) from noisy areas, as the two would show the same imbalance but different local correlation indexes.
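One possible, purely assumed, reading of such a local correlation index is sketched below: within a neighbourhood containing both classes, the largest absolute Pearson correlation between the (0/1) labels and any single feature serves as a rough proxy for local linear separability, so a locally separable region scores high while a noisy region with the same imbalance scores low. The function name local_correlation is illustrative, not a definition from this work.

import numpy as np

def local_correlation(X_neigh, y_neigh):
    # Largest absolute Pearson correlation between the 0/1 labels and any
    # single feature of the neighbourhood (illustrative proxy only).
    y_c = y_neigh - y_neigh.mean()
    if np.allclose(y_c, 0):          # single-class neighbourhood: trivially separable
        return 1.0
    best = 0.0
    for j in range(X_neigh.shape[1]):
        x_c = X_neigh[:, j] - X_neigh[:, j].mean()
        denom = np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())
        if denom > 0:
            best = max(best, abs(np.dot(x_c, y_c)) / denom)
    return best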
ACKNOWLEDGEMENTS
Emanuele Tamponi gratefully acknowledges Sardinia
Regional Government for the financial support of his
PhD scholarship (P.O.R. Sardegna F.S.E. Operational
Programme of the Autonomous Region of Sardinia,
European Social Fund 2007-2013 - Axis IV Human
Resources, Objective l.3, Line of Activity l.3.1.).