Nearest Neighbour, and the logical discriminant
algorithms) if new data contains missing data (Duma
et al., 2010). One of the main reasons is over-fitting
of the data. Even though support vector machines
are designed to be less prone to over-fitting, in very
high dimensional space, this problem cannot be
avoided. When training support vector machines on
client data with many attributes, the model learns
excessive detail. This leads to incorrect predictions
on new client data, especially when the new data is
of poor quality owing to missing values. The model
also shows little resilience as the amount of missing
data increases: in comparison with other supervised
learning models, its classification performance
decreases sharply when the quality of the data
deteriorates.
We present a comparative study of genetic
algorithms and autoassociative networks as models
that improve the classification performance of the
support vector machine and increase its resilience to
missing data.
Genetic algorithms have been applied successfully
as methods for optimising classification algorithms
(Chen et al., 2008; Minaei-Bidgoli et al., 2004).
They have also been applied in fault classification of
mechanical systems as a method for estimating
missing values (Marwala et al., 2006).
Autoassociative networks have been applied
successfully in HIV classification (Leke et al.,
2006), missing data imputation (Marivate et al.,
2007) and assisting in image recognition (Pandit et
al., 2011).
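The imputation idea behind these approaches can be sketched as follows: treat the missing entries as unknowns and let a genetic algorithm search for the values that minimise the reconstruction error of an autoassociative model. The NumPy illustration below is a minimal sketch of that idea only; the reconstruction function is a hypothetical stand-in (it assumes the toy relation x2 = x0 + x1) rather than a trained autoassociative network, and all names are for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained autoassociative network: a real
# network would reconstruct its input; here we assume the toy relation
# x2 = x0 + x1, so the "reconstruction error" is the violation of it.
def reconstruction_error(x):
    return (x[2] - x[0] - x[1]) ** 2

observed = np.array([1.2, 0.8, np.nan])  # third attribute is missing

# Genetic algorithm over candidate values for the missing entry.
pop = rng.uniform(-10.0, 10.0, size=50)          # initial population
for generation in range(100):
    # Fitness: reconstruction error with each candidate plugged in.
    fitness = np.array([
        reconstruction_error(np.array([observed[0], observed[1], c]))
        for c in pop
    ])
    parents = pop[np.argsort(fitness)[:25]]       # keep the better half
    # Crossover: average random pairs of parents.
    children = 0.5 * (rng.choice(parents, 25) + rng.choice(parents, 25))
    children += rng.normal(0.0, 0.1, size=25)     # mutation
    pop = np.concatenate([parents, children])     # elitist replacement

best = min(pop, key=lambda c: reconstruction_error(
    np.array([observed[0], observed[1], c])))
print(f"imputed value: {best:.2f}")  # should be close to 2.0
```

Because the best parents are carried over unchanged each generation, the best candidate can only improve, and the population converges on a value that the reconstruction model considers consistent with the observed attributes.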
We also employ principal component analysis
(PCA) as a feature selection technique to reduce
over-fitting and computational cost. Principal
component analysis removes those dimensions that
are not relevant for classification. The reduced
dataset is then passed on to the support vector
machine for learning. Principal component
analysis has been applied successfully in fault
identification and analysis of vibration data
(Marwala, 2001). It has also been used in automatic
classification of ultra-sound liver images
(Balasubramanian et al., 2007) and in identifying
cancer molecular patterns in micro-array data (Han,
2010).
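As a sketch of how such a reduction works (a minimal NumPy illustration, not the exact pipeline of this paper): centre the data, take the eigenvectors of the covariance matrix with the largest eigenvalues, and project the data onto them before training the classifier. The dataset below is synthetic and purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy client data: 200 samples, 5 attributes, where only two latent
# directions carry almost all of the variance (hypothetical data).
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# PCA via eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)                 # centre each attribute
cov = np.cov(Xc, rowvar=False)          # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort descending by variance
components = eigvecs[:, order[:2]]      # keep the top 2 directions

X_reduced = Xc @ components             # reduced dataset for the SVM
print(X_reduced.shape)                  # (200, 2)
```

The discarded components correspond to the smallest eigenvalues, i.e. the directions along which the data barely varies, which is what makes the reduction cheap for the classifier without losing the class-relevant structure.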
The rest of the paper is organised as follows:
Section 2 gives a background discussion on the
support vector machine, the principal component
analysis, genetic algorithms, autoassociative
networks and missing data mechanisms. Section 3 is
a discussion on the datasets and pre-processing. A
discussion on the AN-SVM structure and the GA-
SVM structure is also given. Section 4 is a
discussion of the experimental results. Conclusions
and future work are discussed in Section 5.
2 BACKGROUND
2.1 Support Vector Machine
The support vector machine is a classification
method applied to both linear and non-linear
complex problems (Steeb, 2008). It makes use of a
non-linear mapping to transform data from a lower
to a higher dimension. In the higher dimension, it
searches for an optimal hyper-plane that separates
the attributes of one class from those of another. If
the data set is linearly
separable (i.e. a straight line can be drawn to
separate all attributes of one class from all
attributes of another), the support vector machine
finds the maximal marginal hyper-plane, i.e. the
hyper-plane with the greatest margin. The separation
satisfies the following equation (Steeb, 2008),

$\mathbf{w} \cdot \mathbf{x} + b = 0$    (1)

where $\mathbf{w}$ is the weight vector, $\mathbf{x}$
is the input vector and $b$ is a scalar bias. A larger
margin allows classification of new
data to be more accurate. If the data set is linearly
inseparable, the original data is transformed into a
new higher dimension. In the new dimension, the
support vector machine searches for an optimal
hyper-plane that separates the attributes of the
classes. The maximal marginal hyper-plane found in
the new dimension corresponds to the non-linear
surface in the original space. The mapping of input
data into higher dimensions is performed by kernel
functions expressed in the form (Steeb, 2008),

$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$    (2)

where $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$
are non-linear mapping functions. Three kernel
functions are commonly used to map attributes into
higher dimensions, namely the polynomial, Gaussian radial
basis and sigmoid function (Steeb, 2008). In this
paper, we use the Gaussian radial basis function.
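For instance, the Gaussian radial basis kernel computes $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2)$ without ever forming the mapping $\phi$ explicitly. A minimal NumPy sketch (the bandwidth $\sigma$ and the sample points are arbitrary choices for illustration):

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    # Squared Euclidean distances between all pairs of rows.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
K = rbf_kernel(X)
print(np.round(K, 3))
# Diagonal entries are 1; points at unit distance give exp(-0.5) = 0.607.
```

This symmetric kernel matrix is what the SVM optimiser consumes; the explicit higher-dimensional mapping is never computed, which is what makes the "kernel trick" tractable.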
Support vector machines have been applied
successfully in the insurance industry
and in credit
risk analysis. They have been used to help identify
and manage credit risk (Chen et al, 2009, Yang et al,
2008). They have also been employed to predict
insolvency (Yang et al, 2008).
2.2 Principal Component Analysis
Principal component analysis (PCA) is a popular
IMPROVING THE PERFORMANCE OF THE SUPPORT VECTOR MACHINE IN INSURANCE RISK
CLASSIFICATION - A Comparative Study