chine) (Vapnik, 1998), (Scholkopf and Smola, 2002), tree classifiers and more recent algorithms such as Optimum Path Forest (Papa and Falcao, 2010), (Papa et al., 2007) as base classifiers.
Performance evaluation on a test dataset shows very good results for the selection of suspicious profiles. Moreover, field evaluation of fraud detection using our automatic system shows results similar to the manual experts' method.
The paper is organized as follows. Section 2 describes general aspects of the class imbalance problem, Section 3 describes the different strategies proposed, Section 4 presents the results obtained, and, finally, Section 5 concludes the work.
2 THE CLASS IMBALANCE PROBLEM
When working on the fraud detection problem, one cannot assume that the number of people who commit fraud is the same as the number who do not; usually there are fewer examples from the class that commits fraud. This situation is known as the class imbalance problem, and it is particularly important in real-world applications where it is costly to misclassify examples from the minority class. In these cases, standard classifiers tend to be overwhelmed by the majority class and to ignore the minority class, hence obtaining suboptimal classification performance. To confront this type of problem, we decided to use three different strategies at different levels: changing the class distribution by resampling, manipulating the classifiers, and acting on the ensemble of them.
The first consists mainly of resampling techniques such as under-sampling the majority class or over-sampling the minority one. Random under-sampling aims at balancing the data set through random removal of majority class examples. The major problem of this technique is that it can discard potentially important data for the classification process. On the other hand, the simplest over-sampling method is to increase the size of the minority class by random replication of its samples. The main drawback of over-sampling is the likelihood of over-fitting, since it makes exact copies of the minority class instances.
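The two baseline resampling techniques above can be sketched as follows; this is a minimal illustration assuming the examples of each class are held in plain Python lists (the function names are ours, not from the paper):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Balance the data set by randomly discarding majority-class
    examples until both classes have the same size."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept + minority

def random_oversample(majority, minority, seed=0):
    """Balance the data set by randomly replicating minority-class
    examples (exact copies, hence the over-fitting risk)."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra
```

Note that under-sampling shrinks the training set (possibly losing informative majority examples), while over-sampling only adds duplicates of existing minority instances.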
To address these shortcomings of resampling techniques, different proposals tackle the imbalance problem by adapting existing algorithms to the special characteristics of imbalanced data sets. One approach is one-class classifiers, which try to describe one class of objects (the target class) and distinguish it from all other objects (outliers). In this paper, the performance of One-Class SVM, an adaptation of the popular SVM algorithm, will be analyzed. Another technique is cost-sensitive learning, where the cost of a particular kind of error can differ from that of others, for example by assigning a high cost to mislabeling a sample from the minority class.
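The cost-sensitive idea can be illustrated with a simple decision rule: instead of thresholding a classifier's estimated fraud probability at 0.5, label a sample by comparing the expected costs of the two error types. The cost values below are hypothetical, chosen only for illustration:

```python
def cost_sensitive_label(p_fraud, cost_fn=10.0, cost_fp=1.0):
    """Return 1 (fraud) when the expected cost of missing a fraud
    (p_fraud * cost_fn) exceeds the expected cost of a false alarm
    ((1 - p_fraud) * cost_fp); otherwise return 0."""
    return 1 if p_fraud * cost_fn > (1 - p_fraud) * cost_fp else 0
```

With a false-negative cost ten times the false-positive cost, the effective decision threshold drops from 0.5 to roughly 0.09, so even moderately suspicious profiles are flagged for inspection.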
Another problem which arises when working with imbalanced classes is that the most widely used metrics for measuring the performance of learning systems, such as accuracy and error rate, are not appropriate because they do not take misclassification costs into account and are strongly biased in favor of the majority class. In the past few years, several new metrics which measure the classification performance on the majority and minority classes independently, hence taking the class imbalance into account, have been proposed (Manning et al., 2009).
• Recall_p = TP / (TP + FN)
• Recall_n = TN / (TN + FP)
• Precision = TP / (TP + FP)
• F_value = ((1 + β²) · Recall_p · Precision) / (β² · Recall_p + Precision)
Table 1: Confusion matrix.

                  Labeled positive      Labeled negative
Actual positive   TP (True Positive)    FN (False Negative)
Actual negative   FP (False Positive)   TN (True Negative)
Recall_p is the percentage of correctly classified positive instances, in this case, the fraud samples. Precision is defined as the proportion of instances labeled as positive that are actually positive. The combination of these two measurements, the F-value, is their weighted harmonic mean, weighted by the parameter β. Depending on the value of β we can prioritize Recall or Precision. For example, if we have few resources to perform inspections, it can be useful to prioritize Precision, so that the set of samples labeled as positive has a high density of true positives.
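The metrics above follow directly from the counts in Table 1; as a brief sketch (the function name is ours), they can be computed as:

```python
def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute Recall_p, Recall_n, Precision and F_value from the
    confusion-matrix counts of Table 1."""
    recall_p = tp / (tp + fn)          # fraction of frauds detected
    recall_n = tn / (tn + fp)          # fraction of non-frauds kept clean
    precision = tp / (tp + fp)         # density of true frauds among alarms
    f_value = ((1 + beta**2) * recall_p * precision) / (beta**2 * recall_p + precision)
    return recall_p, recall_n, precision, f_value
```

Setting β < 1 weights the F-value toward Precision (useful when inspection resources are scarce), while β > 1 weights it toward Recall_p.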
3 STRATEGY PROPOSED
The system presented consists basically of three modules: Pre-Processing and Normalization, Feature Selection and Extraction and, finally, Classification. Figure 1 shows the system configuration. The system input corresponds to the last three years of the monthly consumption curve of each customer, here
ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods