machine learning repository. The dataset is used to
predict which customers are likely to have an
interest in buying a caravan insurance policy. In this
paper, we are interested in identifying customers
who are likely to have a car insurance policy
when some of the information is missing.
The training dataset consists of over 5400
instances of which 5000 were used for the
experiment. The testing dataset consists of only
4000 instances. Each set has a total of 86 attributes
with completely observable data, 5 of which are
categorical numeric values and 80 are continuous
numeric values. The class attribute takes only
two values (0 to indicate a customer who is unlikely
to have insurance cover and 1 to indicate a customer
who is likely to have insurance cover).
The second insurance dataset is the State of
Texas insurance dataset, which is used by the Texas
government to draw up a Texas Liability Insurance
Closed Claims Report. The report provides a
summary of claims involving bodily injuries from
insurance companies. These claims were either
settled in court or disposed of, and the insurer
performed all the compensations and expense
payments on the claim. The dataset contains two types
of claims: short form and long form. Short-form
claims cover bodily injuries that are inexpensive to
settle. Long-form claims cover bodily injuries that
are very expensive to settle and can in most cases be
settled through a medical insurance company. As a
risk analysis exercise, we classify the instances in
this dataset according to whether the claimant has
medical insurance cover, in the presence of missing data.
The Texas Insurance dataset consists of over
9000 instances, trimmed manually to 5446 instances
by removing all the short form claims. For
consistency, the dataset was separated into training
and testing datasets, 4000 and 1446 instances
respectively. Both the training and testing sets have
missing values initially. Each set initially had
over 220 attributes, which were trimmed to 185
by manually removing attributes that were clearly
not significant for the experiment, such as the unique
identifiers, dates, and the type-of-claim attributes.
The class attribute also has two values (0 to
indicate no medical insurance and 1 to indicate that
the claimant has medical insurance).
Missing values were generated on the testing
dataset at five proportions (10%, 25%, 30%, 40%,
50%). At each level, the missing values were placed
arbitrarily, first across the entire dataset and then
across only half of the attributes. In total, 12
testing datasets were created to test the strength of
the Ripper algorithm combined with feature selection
techniques.
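The missingness procedure described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the experiment: it assumes that "arbitrarily generated" means cells are chosen uniformly at random, that the proportion applies to cells rather than instances, and that the class attribute is excluded from the rows. The function name `inject_missing` and its parameters are hypothetical.

```python
import random


def inject_missing(rows, proportion, attribute_indices=None, seed=0):
    """Return a copy of `rows` with a share of cells replaced by None.

    rows              -- list of equal-length lists of predictor values
    proportion        -- fraction of eligible cells to blank out (e.g. 0.25)
    attribute_indices -- columns eligible for missingness; None means all
    seed              -- fixed seed so a testing set can be regenerated
    """
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    cols = list(range(n_attrs)) if attribute_indices is None else list(attribute_indices)

    # Every (row, column) pair that is allowed to become missing.
    cells = [(r, c) for r in range(len(rows)) for c in cols]
    n_missing = round(proportion * len(cells))

    masked = [list(row) for row in rows]  # leave the input untouched
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked


if __name__ == "__main__":
    toy = [[float(i + j) for j in range(10)] for i in range(20)]
    # Missingness over the entire dataset, then over half of the attributes.
    whole = inject_missing(toy, 0.25)
    half = inject_missing(toy, 0.25, attribute_indices=range(5))
    print(sum(v is None for row in whole for v in row))
    print(sum(v is None for row in half for v in row))
```

Calling the function once per proportion with `attribute_indices=None`, and once more with half of the column indices, gives the two variants described for each level.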
3.2 PCA-Rip Structure
Figure 1 illustrates the structure used to
improve the Ripper classification performance
with PCA as a feature selection technique. We
refer to this structure as PCA-Rip. As the figure
shows, the original data [A] is passed to the PCA,
which reduces the dimensions of the data to give the
output [T] expressed in equation (2). Transformed
attributes (principal components) with eigenvalues > 1
were selected as a simple and effective way to reduce
the number of attributes. The Ripper algorithm builds
a rule-based system using [T]. Once the Ripper
algorithm has finished learning from the data, the PCA
converts the data back into its “original” form [A’] as
expressed in equation (3). Data classification is then
performed using the testing data.
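The reduction and restoration steps can be sketched in a few lines. Equations (2) and (3) are not reproduced in this section, so the sketch assumes the usual PCA forms: T = ZW for the reduction and A' = TW^T for the restoration, where Z is the standardized data and W holds the eigenvectors of the correlation matrix whose eigenvalues exceed 1. Standardizing the attributes is also an assumption, made so that the eigenvalue > 1 threshold is meaningful. All function names are hypothetical; the experiment itself used Weka's PCA component, not this code.

```python
import math


def standardize(data):
    """Scale each column to zero mean and unit sample variance."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in data]


def correlation_matrix(z):
    """Correlation matrix of already standardized rows z."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def jacobi_eigen(sym, sweeps=50, tol=1e-20):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    p = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(p)] for i in range(p)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (i, j) entry.
                theta = 0.5 * math.atan2(2 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate columns i, j of a and v
                    a[k][i], a[k][j] = (c * a[k][i] + s * a[k][j],
                                        -s * a[k][i] + c * a[k][j])
                    v[k][i], v[k][j] = (c * v[k][i] + s * v[k][j],
                                        -s * v[k][i] + c * v[k][j])
                for k in range(p):  # rotate rows i, j of a
                    a[i][k], a[j][k] = (c * a[i][k] + s * a[j][k],
                                        -s * a[i][k] + c * a[j][k])
    return [a[i][i] for i in range(p)], v


def pca_reduce(data, min_eigenvalue=1.0):
    """Return (Z, eigenvalues, W, T), keeping components above the threshold."""
    z = standardize(data)
    eigvals, vecs = jacobi_eigen(correlation_matrix(z))
    order = sorted(range(len(eigvals)), key=lambda k: eigvals[k], reverse=True)
    keep = [k for k in order if eigvals[k] > min_eigenvalue]
    w = [[vecs[i][k] for k in keep] for i in range(len(eigvals))]
    t = [[sum(zi * wi[q] for zi, wi in zip(row, w)) for q in range(len(keep))]
         for row in z]
    return z, eigvals, w, t


def pca_restore(t, w):
    """Map reduced rows T back to the standardized attribute space (T W^T)."""
    return [[sum(tq * wq for tq, wq in zip(tr, wi)) for wi in w] for tr in t]
```

Note that the restored data lives in the standardized attribute space and is only an approximation of the input whenever some components are discarded, which is consistent with the quotation marks around "original" in the text.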
Figure 1: PCA-Rip structure.
The software tools used for PCA-Rip were the Weka
3.6.2 library, the C# 3.5 programming language and
IKVM. The Weka library has a built-in principal
component analysis component, which is used in
conjunction with a Ranker search component to
return the selected attributes in ranked order, from
the most significant to the least significant. IKVM
is a software tool that converts Java bytecode into
.NET assemblies that can be called from C# code.
The PCA-Rip illustrated in Figure 1 was built and
tested using the C# programming language.
[Figure 1 block labels: Original data; Principal component analysis reduction; Ripper; Principal component analysis original; Data Classification]
IMPROVING THE PERFORMANCE OF THE RIPPER IN INSURANCE RISK CLASSIFICATION - A Comparitive
Study using Feature Selection