tion problems when the maximum variance direction
of one class differs from that of another, i.e., when
the classes have unequal covariance matrices (Vaswani
and Chellappa, 2006; Xudong, 2009). Finding the
principal axis directions, i.e., the principal components
(PCs), is one of the key steps in PCA, and it depends
on the spread of the data. For unbalanced datasets,
the spread is dominated by the majority class, whose
prior probability is much higher than that of the
minority class. Moreover, in real-world domains such
as intrusion detection systems, intrusion transactions
occur rarely and generating them is a costly process.
Mispredicting these rare intrusion transactions is
risky and could lead to financial loss for organizations.
Therefore, capturing and validating labeled samples,
particularly non-majority class samples, in the PCA
subspace for the classification task is a challenging
issue.
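As a minimal numerical illustration of this dominance effect (a synthetic sketch with assumed class shapes and sizes, not data from this paper), the first principal component of a pooled unbalanced sample aligns with the majority class's spread direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Majority class: 1000 samples spread mainly along the x-axis.
majority = rng.normal(size=(1000, 2)) * [3.0, 0.5]
# Minority class: 50 samples spread mainly along the y-axis.
minority = rng.normal(size=(50, 2)) * [0.5, 3.0]

# PCA on the pooled, mean-centred data.
pooled = np.vstack([majority, minority])
pooled -= pooled.mean(axis=0)
_, _, vt = np.linalg.svd(pooled, full_matrices=False)
first_pc = vt[0]

# The pooled first PC follows the majority's direction (x-axis),
# largely ignoring the minority's variance direction (y-axis).
print(first_pc)
```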
In this paper, we propose a class specific di-
mensionality reduction and oversampling framework,
named CPC SMOTE, to address the class imbalance
issue in the Principal Component Analysis subspace
when there is a directional difference between the
Principal Components (PCs) of the two classes. The
proposed framework is based on capturing class
specific features in order to retain the major variance
directions of each individual class, while oversampling
compensates for the lack of data in the under-
represented class. The proposed approach is evaluated
with a decision tree classifier, using accuracy and
F-measure as evaluation metrics. Experimental evi-
dence shows that the proposed approach yields superior
performance on simulated and real-world unbalanced
datasets compared with a classifier learned on reduced
dimensions of the whole unbalanced dataset, as well
as one learned on oversampled datasets.
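The framework itself is specified in Section 3; the following is only a hedged sketch of its first ingredient, under the assumption that "class specific features" means the top principal directions computed separately per class and then stacked into one projection matrix (the function name and parameters are our own, for illustration):

```python
import numpy as np

def class_specific_components(X, y, k=1):
    """Stack the top-k principal directions of each class.

    A hedged reading of 'capturing class specific features': PCs are
    computed per class, so neither class's variance directions are
    drowned out by the other class's sample size.
    """
    components = []
    for label in np.unique(y):
        Xc = X[y == label]
        Xc = Xc - Xc.mean(axis=0)           # centre each class separately
        _, _, vt = np.linalg.svd(Xc, full_matrices=False)
        components.append(vt[:k])           # top-k directions of this class
    return np.vstack(components)            # (n_classes * k, n_features)
```

Projecting with `X @ W.T` then retains the dominant directions of both classes; the oversampling step would subsequently compensate for the minority class in this reduced space.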
The rest of the paper is organized as follows.
Section 2 discusses work related to the class imbalance
problem. Section 3 presents the proposed CPC SMOTE
framework. Section 4 presents the experimental eval-
uation, a comparative study against applying PCA to
the whole unbalanced dataset as well as applying
SMOTE to the unbalanced datasets. Finally,
conclusions are given in Section 5.
2 RELATED WORK
There are several ways to handle the class imbalance
problem; among them, cost sensitive learning, one
class classification and resampling the class distri-
bution are frequently used. However, most of the
research addressing the class imbalance problem has
centered on balancing the class distributions.
Resampling the data, either by random oversampling
or by random undersampling, to obtain approximately
balanced class distributions is a popularly adopted
solution. But for discriminative learners such as
decision tree classifiers, oversampling causes
overfitting, whereas undersampling leads to per-
formance degradation due to the loss of informative
instances from the majority class (Drummond and
Holte, 2003). (Weiss and Provost, 2003) concluded
that the natural distribution is not usually the best
distribution for learning. A study of "whether over-
sampling is more effective than under-sampling" and
"which over-sampling or under-sampling rate should
be used" was done by (Estabrooks and Japkowicz,
2004), who concluded that combining different ex-
pressions of the resampling approach is an effective
solution. (Kubat and Matwin, 1997) performed selective
under-sampling of the majority class while keeping the
minority class fixed. They categorized the majority
samples into noise overlapping the positive class
decision region, borderline samples, redundant sam-
ples and safe samples. Using the Tomek links con-
cept, a type of data cleaning procedure, they
deleted the borderline majority samples.
(Chawla and Kegelmeyer, 2002) proposed the Syn-
thetic Minority Over-sampling Technique (SMOTE).
It is an oversampling approach in which the minority
class is over-sampled by creating synthetic (or
artificial) samples rather than by oversampling with
replacement. Each minority class sample is taken in
turn, and synthetic samples are introduced along the
line segments joining it to any/all of its k nearest
minority class neighbors. Depending upon the amount
of oversampling required, neighbors from the k nearest
neighbors are randomly chosen. This approach
effectively forces the decision region of the minority
class to become more general.
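The generation step described above can be sketched as follows (a minimal illustrative implementation, not the authors' code; the function and parameter names are our own):

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Create synthetic minority samples along line segments to
    nearest minority neighbours, in the spirit of SMOTE (sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_synthetic):
        x = minority[rng.integers(len(minority))]
        # Indices of the k nearest minority neighbours (excluding x).
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()                  # uniform in [0, 1)
        # New sample at a random point on the segment x -> neighbour.
        synthetic.append(x + gap * (neighbour - x))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two existing minority points, the generated samples stay inside the minority class's region, broadening its decision region rather than replicating samples exactly.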
(Villalba and Cunningham, 2008) evaluated un-
supervised dimensionality reduction techniques for
one-class classification methods and concluded that
Principal Component Analysis (PCA) damages the
performance on most of the datasets. (Xudong, 2009)
analyzed the role of PCA over unbalanced training
sets and concluded that the PCA subspace is biased by
the majority class eigenvectors. The authors further
proposed Asymmetric Principal Component Analysis
(APCA), a weighted PCA, to combat the bias in the
PCA subspace. In this paper we propose a class
specific dimensionality reduction and oversampling
framework, in the context of two class classification,
to combat this problem. The proposed approach yields
superior performance on those datasets where there is
a directional difference between the two classes'
principal components.
KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval