We propose intelligent pruning, which prunes branches differently according to the number of instances they cover. At the data level, our main focus is on a combined strategy for extracting the best knowledge from the available data. The first step identifies, for a given data set, the optimal class distribution, in a manner similar to (Weiss and Provost, 2003). The second step generates several folds out of all available data, each having the optimal distribution: every fold contains all the instances of the minority class (all minority instances from the entire data set), while the majority class(es) are partitioned so as to reach the optimal distribution, one partition being assigned per fold, so that each majority instance occurs in exactly one fold. A model is then generated from each fold, and a voting criterion is applied in order to classify a new instance. Another point of interest at the data level is the identification of the appropriate IA which ensures the best performance for a given classifier, and the application of a feature selection strategy, as a preprocessing step, in order to reach the suggested IA.
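The fold-generation and voting steps described above can be sketched as follows; this is a minimal illustration, and the function names, the target-ratio parameter, and the representation of instances and models are our own assumptions, not part of the original method:

```python
import random
from collections import Counter

def make_folds(minority, majority, target_ratio, seed=0):
    """Pair all minority instances with one disjoint majority partition
    per fold, so that each fold approximates the target
    minority:majority ratio and each majority instance occurs in
    exactly one fold."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # Majority instances needed per fold to hit the target distribution.
    per_fold = max(1, round(len(minority) / target_ratio))
    n_folds = max(1, len(shuffled) // per_fold)
    folds = []
    for i in range(n_folds):
        partition = shuffled[i * per_fold:(i + 1) * per_fold]
        folds.append(minority + partition)
    # Leftover majority instances go into the last fold, preserving the
    # property that every majority instance appears exactly once.
    folds[-1].extend(shuffled[n_folds * per_fold:])
    return folds

def vote(models, x):
    """Classify a new instance by majority vote over the per-fold models."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```

In this sketch a model is simply any callable returning a class label; in practice each fold would be passed to the chosen learner, and `vote` would aggregate the resulting classifiers.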
5 CONCLUSIONS
To properly analyze the imbalance problem, the relation between the imbalance and other features of the problem, such as its size and complexity (Japkowicz, 2000), or size and IA (Potolea and Lemnaru, 2010), should first be investigated. Secondly, since performance is not expected to improve significantly with a more sophisticated sampling strategy, more effort should be devoted to algorithm-related improvements than to data improvements. Finally, starting from the observation that there is no winner (in terms of either sampling or algorithm) for all data sets, special attention should be paid to the particularities of the data at hand. That is, various tuning strategies should be applied to find the appropriate combination of learning technique and sampling strategy and, in a following step, the best settings for that combination: parameter values, function selection, threshold, distribution, and many other options specific to the technique.
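The tuning recommended above can be sketched as a search over combinations of sampling strategy, learning technique, and technique-specific settings; this is only an illustrative exhaustive-search skeleton, and the function names and the shape of the `evaluate` callback are assumptions:

```python
from itertools import product

def tune(samplers, learners, param_grids, evaluate):
    """Search combinations of sampling strategy, learning technique and
    parameter settings, keeping the best-scoring triple. `evaluate` is
    assumed to return a score (e.g. a cross-validated performance
    metric suited to imbalanced data) for one configuration."""
    best_config, best_score = None, float('-inf')
    for sampler, learner in product(samplers, learners):
        # Fall back to a single empty setting if no grid is given.
        for params in param_grids.get(learner, [{}]):
            score = evaluate(sampler, learner, params)
            if score > best_score:
                best_config, best_score = (sampler, learner, params), score
    return best_config, best_score
```

In practice the inner evaluation dominates the cost, so a smarter search (e.g. evaluating only promising combinations) can replace the exhaustive loop without changing the interface.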
ACKNOWLEDGEMENTS
This work was supported by the IBM Faculty Award
received in 2009 by Rodica Potolea from the CS
Department of TUCN Romania.
REFERENCES
Batista, G., Prati, R., and Monard, M., 2004, A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explorations, Volume 6, Issue 1, pp. 20-29.
Chan, P., and Stolfo, S., 1998, Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press, pp. 164-168.
Chawla, N. V., Japkowicz, N., and Kolcz, A., 2004, Editorial: special issue on learning from imbalanced data sets, SIGKDD Explorations Special Issue on Learning from Imbalanced Datasets 6 (1), pp. 1–6.
Chawla, N. V., 2006, Data Mining from Imbalanced Data Sets, Data Mining and Knowledge Discovery Handbook, chapter 40, Springer US, pp. 853-867.
Grzymala-Busse, J. W., Stefanowski, J., and Wilk, S., 2005, A comparison of two approaches to data mining from imbalanced data, Journal of Intelligent Manufacturing, 16, pp. 565–573.
Hall, L. O., and Joshi, A., 2005, Building Accurate
Classifiers from Imbalanced Data Sets, IMACS’05.
Hall, M., et al., 2009, The WEKA Data Mining Software, SIGKDD Explorations, Volume 11, Issue 1.
Holte, R. C., Acker, L. E., and Porter, B. W., 1989, Concept learning and the problem of small disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 813-818.
Japkowicz, N., 2000, The Class Imbalance Problem:
Significance and Strategies, in Proceedings of the
2000 International Conference on Artificial
Intelligence (IC-AI'2000), pp. 111-117.
Japkowicz, N., and Stephen, S., 2002, The Class
Imbalance Problem: A Systematic Study, Intelligent
Data Analysis Journal, Volume 6, Number 5,
November 2002, pp. 429 – 449.
Potolea, R., and Lemnaru, C., 2010, The class imbalance problem: experimental study and a solution, paper submitted for ECMLPKDD 2010.
UCI Machine Learning Data Repository, 2010,
http://archive.ics.uci.edu/ml/, last accessed Jan. 2010.
Vidrighin Bratu, C., Muresan T., and Potolea, R., 2008,
Improving Classification Accuracy through Feature
Selection, in Proceedings of the 4th IEEE
International Conference on Intelligent Computer
Communication and Processing, ICCP 2008, pp. 25-
32.
Visa, S., and Ralescu, A., 2005, Issues in mining imbalanced data sets - a review paper, in Proceedings of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference, pp. 67–73.
Weiss, G., and Provost, F., 2003, Learning when Training
Data are Costly: The Effect of Class Distribution on
Tree Induction. Journal of Artificial Intelligence
Research 19, pp. 315-354.
Weiss, G., 2004, Mining with rarity: A unifying
framework, SIGKDD Explorations 6(1), pp. 7–19.
ICEIS 2010 - 12th International Conference on Enterprise Information Systems