learning mechanisms to generate several classifiers
which employ voting to provide a classification.
5 CONCLUSIONS
Starting from the observation that, when dealing with IDS, there is no winning strategy for all data sets (neither in terms of sampling nor of algorithm), special attention should be paid to the particularities of the data at hand. In doing so, one should consider a wider context, taking several factors into account simultaneously: the imbalance rate together with other data-related meta-features, the algorithms, and their associated parameters.
Our experiments show that, for an imbalanced problem, the IR can be used in conjunction with the data set dimensionality and the IAR factor to select the classifier that best fits the situation. Moreover, a good metric for assessing the performance of the resulting model is important; again, it should be chosen based on the particularities of the problem and of the goal established for it.
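As a minimal sketch of why the metric matters on imbalanced data, the snippet below contrasts plain accuracy with balanced accuracy (the mean of per-class recalls) on a degenerate classifier that always predicts the majority class. The function names and the toy data are illustrative, not taken from the paper:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR: majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; unlike plain accuracy, it is not
    inflated by always predicting the majority class."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy example: 90 majority samples, 10 minority samples (IR = 9.0),
# and a classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(imbalance_ratio(y_true))            # 9.0
print(plain)                              # 0.9 -- looks good, but misleading
print(balanced_accuracy(y_true, y_pred))  # 0.5 -- exposes the failure
```

Plain accuracy reports 0.9 even though the minority class is never detected, while balanced accuracy drops to chance level (0.5), which is why metric choice should track the imbalance of the data.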
When starting an evaluation, we should begin with the imbalanced data set and the MLP, since it proved to be the best classifier in every category we evaluated on imbalanced data sets. If training the MLP takes too long, the second-best choice is either the C4.5 decision tree (without pruning, which becomes preferable as the IR increases) or NB. In terms of evaluation metrics, the choice should be based on the particularities of the data (i.e. the imbalance), but also on the goal of the classification process (are we dealing with a cost-sensitive classification, or are all errors equally serious?).
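The rule of thumb above can be encoded as a small selection heuristic. This is a hypothetical sketch: the function name and the IR threshold are illustrative choices of ours, not values given by the experiments:

```python
def recommend_classifier(ir, mlp_time_budget_ok):
    """Hypothetical encoding of the recommendation: start with an MLP
    on the imbalanced data; if MLP training is too slow, fall back to
    C4.5 (unpruned at high IR) or Naive Bayes. The threshold of 4 is
    purely illustrative, not a value reported in the paper."""
    if mlp_time_budget_ok:
        return "MLP"
    if ir > 4:  # illustrative cut-off for "high" imbalance
        return "C4.5 (unpruned)"
    return "C4.5 (pruned) or Naive Bayes"

print(recommend_classifier(2.0, mlp_time_budget_ok=True))   # MLP
print(recommend_classifier(9.0, mlp_time_budget_ok=False))  # C4.5 (unpruned)
```

In practice, such a heuristic would be refined with the other meta-features mentioned above (dimensionality, IAR) rather than IR alone.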
ACKNOWLEDGEMENTS
The work for this paper has been supported by research grant no. 12080/2008 – SEArCH, funded by the Ministry of Education, Research and Innovation.
A COMPREHENSIVE STUDY OF THE EFFECT OF CLASS IMBALANCE ON THE PERFORMANCE OF
CLASSIFIERS