the training set size while preserving the original
decision boundaries as much as possible.
3 DATA-LEVEL METHODS
Data-driven methods consist of artificially balancing
the original data set, either by over-sampling the mi-
nority class and/or by under-sampling the majority
class, until the problem classes are approximately
equally represented. Both strategies can be applied
in any learning system, since they act as a prepro-
cessing phase, allowing the learning system to receive
the training instances as if they belonged to a well-
balanced data set. Thus, any bias of the system to-
wards the majority class due to the different propor-
tion of examples per class would be expected to be
removed. The simplest approaches are non-heuristic
methods that balance the class distribution by
randomly replicating positive examples (over-sampling)
or by randomly eliminating negative examples
(under-sampling). Nevertheless, these methods have
important drawbacks. Random over-sampling may
increase the likelihood of overfitting, since it makes
exact copies of the minority class instances, while
random under-sampling may discard data that is
potentially important for the classification process.
Despite this risk, random under-sampling has
empirically been shown to be one of the most
effective resampling methods. To overcome these
drawbacks, several authors have developed focused
resampling algorithms that produce balanced data
sets in a more informed way.
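The two random strategies just described can be sketched as follows (a minimal illustration in Python; the function names are ours, not from any cited implementation):

```python
import random

def random_oversample(minority, majority, rng=random.Random(0)):
    """Balance classes by replicating minority instances at random
    (exact copies, which is what makes overfitting more likely)."""
    copies = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + copies, majority

def random_undersample(minority, majority, rng=random.Random(0)):
    """Balance classes by discarding majority instances at random
    (potentially important data may be lost)."""
    return minority, rng.sample(majority, len(minority))

# Toy imbalanced set: 3 positive vs. 9 negative examples
pos = [[0.1], [0.2], [0.3]]
neg = [[float(i)] for i in range(9)]
```

After either call both classes have the same number of instances, which is the only goal of the non-heuristic methods.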
(Chawla et al., 2002) proposed an over-sampling
technique that generates new synthetic minority in-
stances by interpolating between several positive ex-
amples that lie close together. This method, called
SMOTE (Synthetic Minority Over-sampling TEchnique),
allows the classifier to build larger decision
regions that contain nearby instances from the minor-
ity class. Several modifications of the original
SMOTE algorithm have been proposed in the literature,
most of them aiming to determine the region in which
the synthetic positive examples should be generated.
For instance, Borderline-SMOTE (Han et al., 2005)
uses only positive examples close to the decision
boundary, since these are more likely to be
misclassified.
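The interpolation step at the core of SMOTE can be sketched as follows (a simplified sketch assuming Euclidean distance and plain Python lists; parameter names are illustrative):

```python
import math
import random

def smote(minority, n_new, k=2, rng=random.Random(42)):
    """Generate n_new synthetic positives by interpolating between a
    sampled minority point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Four positives at the corners of the unit square
pos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new = smote(pos, 4)
```

Because every synthetic point lies on a segment between two existing positives, the minority region grows inward rather than by exact duplication.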
Unlike the random method, many proposals are
based on a more intelligent selection of negative ex-
amples to be eliminated. For example, (Kubat and
Matwin, 1997) proposed an under-sampling technique,
called one-sided selection, that selectively removes
only those negative instances that are “redundant”
or that “border” the minority class examples (these
bordering cases are assumed to be noise). In
contrast to one-sided selection, the so-called
neighborhood cleaning rule places more emphasis
on data cleaning than on data reduction. To this end,
Wilson’s editing is used to identify and remove noisy
negative instances. Similarly, (Barandela et al., 2003)
introduced a method that eliminates not only noisy
instances of the majority class by means of Wilson’s
editing (WE), but also redundant examples through
the MSS condensing algorithm.
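The cleaning step shared by these methods can be sketched as Wilson's editing restricted to the majority class (a simplified sketch using plain Python and Euclidean distance; the function name and data are ours):

```python
import math
from collections import Counter

def clean_majority(X, y, majority_label, k=3):
    """Wilson's editing applied to the majority class only: drop a
    negative instance when the majority vote of its k nearest
    neighbours disagrees with its own label; positives are kept."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)          # never discard minority examples
            continue
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(xi, X[j]))[:k]
        vote = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if vote == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# A noisy negative at 0.15 sits inside the positive cluster
X = [[0.0], [0.1], [0.2], [0.15], [5.0], [5.1], [5.2]]
y = ['+', '+', '+', '-', '-', '-', '-']
Xc, yc = clean_majority(X, y, majority_label='-')
```

Only the negative instance surrounded by positives is removed; the compact negative cluster and all positives survive, which is the cleaning (rather than reduction) behaviour described above.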
4 EXPERIMENTAL SETUP AND
RESULTS
Experiments were carried out over 13 data sets taken
from the UCI Machine Learning Database Reposi-
tory (Frank and Asuncion, 2010) and a private library
( http://www.vision.uji.es/∼sanchez/Databases/). All
data sets have been transformed into two-class prob-
lems by keeping one original class (the minority class)
and joining the objects of the remaining classes (giv-
ing the majority class). For example, in Segmenta-
tion database the objects of classes 1, 2, 3, 4 and 6
were joined to shape a unique majority class and the
original class 5 was left as the minority class (see a
summary in Table 1).
Table 1: Data sets used in the experiments.

Data Set       Positive Examples  Negative Examples  Classes  Majority Class
Breast                 81                196             2     1
Ecoli                  35                301             8     1,2,3,5,6,7,8
German                300                700             2     1
Glass                  17                197             9     1,2,4,5,6,7,8,9
Haberman               81                225             2     1
Laryngeal 2            53                639             2     1
Phoneme              1586               3818             2     1
Pima                  268                500             2     1
Scrapie               531               2582             2     1
Segmentation          330               1980             6     1,2,3,4,6
Spambase             1813               2788             2     1
Vehicle               212                634             4     2,3,4
Yeast                 429               1055            10     1,3,4,5,6,7,8,9,10
For each data set, we used a stratified 5-fold
cross-validation, obtaining 13 × 5 = 65 new problems.
SMOTE and random under-sampling were applied to
the training data (in the feature space), and four differ-
ent prototype selection techniques were used on im-
balanced and resampled data sets: R50, R100, RCNN
and RMSS. Two learners, Fisher and 1-NN classifiers,
were constructed from the original and transformed
data sets.
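Stratified fold assignment keeps the class proportion of every fold close to that of the full data set, which matters under imbalance; a minimal sketch (our own illustration, not the authors' code):

```python
import random

def stratified_kfold(y, k=5, rng=random.Random(0)):
    """Assign each instance index to one of k folds while
    preserving the per-class proportions of the labels y."""
    folds = [[] for _ in range(k)]
    for label in set(y):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # deal indices round-robin per class
    return folds

# 20 positives and 80 negatives: each fold gets 4 positives, 16 negatives
y = ['+'] * 20 + ['-'] * 80
folds = stratified_kfold(y)
```

With a plain (non-stratified) split, a fold of a highly imbalanced set could easily contain no positives at all, making evaluation on that fold meaningless.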
In total, the 65 training data sets combined with the
two resampling methods plus no resampling yield
65 × 3 = 195 transformed data sets. Since there are four
prototype selection methods and two learning algorithms,
ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods