Figure 5: Algorithm Performance for Regions: Precision.
examples that were classified as elements of the positive class and those that are actually positive, grows most rapidly.
4 CONCLUSIONS AND FUTURE RESEARCH WORK
In this paper, a resampling technique based on the statistical properties of the data set was proposed. We have tested our technique in terms of its accuracy and four performance measures: Recall, Precision, G-mean and F-measure. As the investigation reveals, the C4.5 algorithm applied to the resampled data sets produced better results. However, despite the promise of such a rather general resampling approach, the algorithm still has to be improved in terms of classification performance. The effect of its application to various data set structures (highly skewed data sets, multimodal data sets, etc.) should be investigated as well, and a comparison with other resampling methods also has to be carried out.
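For reference, the sketch below shows how the four reported measures follow from a binary confusion matrix, with the minority class taken as the positive class; the function name and the example counts are illustrative assumptions, not values from our experiments.

from math import sqrt

def performance_measures(tp, fp, tn, fn):
    # Illustrative helper (not the paper's code): tp, fp, tn, fn are
    # confusion-matrix counts with the minority class as the positive class.
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # positive predictive value
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    g_mean = sqrt(recall * specificity)                  # geometric mean of the two rates
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)       # harmonic mean of precision and recall
    return {"Recall": recall, "Precision": precision,
            "G-mean": g_mean, "F-measure": f_measure}

# Hypothetical counts for an imbalanced test set.
print(performance_measures(tp=40, fp=25, tn=900, fn=10))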
The resampling algorithm can also be carried out on the basis of the likelihood ratio test: by the Neyman-Pearson lemma, the likelihood ratio test is the most powerful test for a fixed sample size.
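As one possible reading of this idea (a sketch, not a specification of the method), the following listing uses a likelihood-ratio criterion to decide which majority-class examples to keep during undersampling; the univariate Gaussian class-conditional densities and the threshold of 1.0 are purely illustrative assumptions.

import numpy as np
from scipy.stats import norm

def likelihood_ratio(x, mu0, sd0, mu1, sd1):
    # Lambda(x) = f(x | minority) / f(x | majority) for univariate Gaussian densities.
    return norm.pdf(x, loc=mu1, scale=sd1) / norm.pdf(x, loc=mu0, scale=sd0)

rng = np.random.default_rng(0)
majority = rng.normal(0.0, 1.0, size=1000)   # hypothetical majority-class feature values
minority = rng.normal(1.5, 1.0, size=50)     # hypothetical minority-class feature values

# Undersampling variant: keep only majority examples that look least like the
# minority class, i.e. whose likelihood ratio falls below an assumed threshold.
lam = likelihood_ratio(majority, majority.mean(), majority.std(),
                       minority.mean(), minority.std())
kept = majority[lam < 1.0]
print("majority examples kept:", kept.size, "of", majority.size)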
Further, for the start-up problem we were interested in, an accurate classification can lead to injury control boundaries analogous to those presented in (Bondarenko, 2006a), (Bondarenko, 2006b), (Pokropp et al., 2006). The trees obtained by classification can be very large (with many nodes and leaves) and are, in this sense, less comprehensible for illustrating control boundaries. However, we can simplify the obtained classification results by transforming every decision tree into a set of "if-then" rules ("Traffic Injuries Rules"), which seem easier to understand and interpret. Using real traffic injury data, it is possible to develop a realistic model for predicting the daily number of injuries depending on temporal factors (year, month, day type). Of course, this research direction is open to other practical applications as well.
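To make the tree-to-rules transformation concrete, the sketch below extracts one "if-then" rule per root-to-leaf path; the Node structure and the tiny two-level tree on temporal factors are invented for illustration and are not the trees obtained in our experiments.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    feature: Optional[str] = None       # split variable; None at a leaf
    threshold: Optional[float] = None   # split value
    left: Optional["Node"] = None       # branch taken when feature <= threshold
    right: Optional["Node"] = None      # branch taken when feature >  threshold
    label: Optional[str] = None         # predicted class at a leaf

def tree_to_rules(node, conditions=None):
    # One "if-then" rule per root-to-leaf path of the decision tree.
    conditions = conditions or []
    if node.label is not None:
        return ["IF " + (" AND ".join(conditions) or "TRUE") + " THEN class = " + node.label]
    return (tree_to_rules(node.left, conditions + [f"{node.feature} <= {node.threshold}"]) +
            tree_to_rules(node.right, conditions + [f"{node.feature} > {node.threshold}"]))

# Hypothetical two-level tree on temporal factors (day_type encoded as 0/1).
tree = Node("month", 6.5,
            left=Node("day_type", 0.5, left=Node(label="low"), right=Node(label="high")),
            right=Node(label="high"))
for rule in tree_to_rules(tree):
    print(rule)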
REFERENCES
Bondarenko, J. (2006a). Analysis of traffic injuries among
children based on generalized linear model with a la-
tent process in the mean. Discussion Paper in Statis-
tics and Quantitative Economics, Helmut-Schmidt
University Hamburg, (116).
Bondarenko, J. (2006b). Children traffic accidents mod-
els: Analysis and comparison. Discussion Paper
in Statistics and Quantitative Economics, Helmut-
Schmidt University Hamburg, (117).
F. Pokropp, W. Seidel, A. B. M. H. and Sever, K. (2006). Control charts for the number of children injured in traffic accidents. In Lenz, H.-J. and Wilrich, P.-T., editors, Frontiers in Statistical Quality Control, pages 151–171. Physica, Heidelberg, 5 edition.
Cohen, G., Hilario, M., Sax, H., Hugonnet, S., and Geissbuhler, A. (2005). Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine, 37:7–18.
Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, pages 878–887.
Hido, S. and Kashima, H. (2008). Roughly balanced bag-
ging for imbalanced data. In Proceedings of the SIAM
International Conference on Data Mining, pages 143–
152.
Kohavi, R. and Provost, F. (1998). Glossary of terms. Editorial for the special issue on applications of machine learning and the knowledge discovery process. Machine Learning, 30:271–274.
Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the 14th International Conference on Machine Learning, pages 179–186.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.
Quinlan, J. (1993). C4.5: Programs for Machine Learning.
Morgan Kaufmann, San Mateo, California.
Ertekin, S., Huang, J., and Giles, C. L. (2007). Active learning for class imbalance problem. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 823–824.
Kotsiantis, S., Kanellopoulos, D., and Pintelas, P. (2006). Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30:25–36.
García, V., Sánchez, J. S., and Mollineda, R. (2008). On the use of surrounding neighbors for synthetic over-sampling of the minority class. In Proceedings of the 8th WSEAS International Conference on Simulation, Modelling and Optimization, pages 389–394.
Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2006). Exploratory undersampling for class-imbalance learning. In Proceedings of the International Conference on Data Mining, pages 965–969.