and have more power to measure default were selected
for the modelling part of the project.
2 DATASET DESCRIPTION
The dataset used in this study consists of 16000
samples, each represented by a feature vector of 18
variables and an associated class label. The variable
names and types, along with their ranges, are shown
in Table 1. The dataset comes from the individual loan
applications of a financial institution.
The variables fall into 2 main categories:
finance-related information and personal
information. In this section, we briefly introduce
the input variables under these 2 categories, together
with the class variable, to clarify the information
represented by each variable in our dataset.
2.1 Finance-related Information
The variable denoted ‘housingMaturity’ in Table 1
represents the number of months over which the
customer pays the instalments of a housing credit. The
maturity of a housing credit can take a value
between 6 and 240 months in the Turkish financial system.
Similarly, vehicle maturity shows the number of
months of credit instalments for a vehicle loan. This
is also an integer variable, with a range from 0 to 60
months. The number of months of credit
instalments for a consumer loan is stored in the
consumer maturity variable, which has a range from 0
to 120. The variable referred to as ‘ProductNumber’
in Table 1 represents the total number of different
products previously taken by the customer, including the
current active loan. This variable is of integer type
and has a range from 1 to 113. The ‘workingTime’
and ‘workplace’ variables show the term of
employment and the status of the workplace of the
credit customer, respectively.
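
The ranges quoted above can be checked mechanically before modelling. The following is a minimal sketch, not the procedure used in this study. The column names 'vehicleMaturity' and 'consumerMaturity' and the helper out_of_range are hypothetical, and the housing bounds are those quoted for the Turkish financial system, which may differ from the range listed in Table 1.

```python
# Illustrative range check. Only 'housingMaturity' and 'ProductNumber' are
# names taken from the text; the other two column names are assumptions.
RANGES = {
    "housingMaturity": (6, 240),   # months, bounds quoted in the text
    "vehicleMaturity": (0, 60),    # months, assumed column name
    "consumerMaturity": (0, 120),  # months, assumed column name
    "ProductNumber": (1, 113),     # number of products taken so far
}


def out_of_range(record):
    """Return the names of variables in `record` that violate RANGES."""
    return [
        name
        for name, (low, high) in RANGES.items()
        if name in record and not (low <= record[name] <= high)
    ]
```

Variables missing from a record are skipped rather than flagged, so the check can be applied to partial records.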
While the working time is represented
by an integer variable, the workplace is a categorical
variable that takes 3 different values: “Public”,
“Private or Corporate”, and “Other”. The other
variable related to the workplace of the
customer is ‘Ownership’, a categorical
variable that takes 4 different values indicating the
owner of the workplace the customer is working for.
The possible values of this variable are “personal”,
“rental”, “family-owned” and “other”. The
‘insuranceCode’ variable represents the type of social
security of the credit customer. It is a categorical
variable that can take 5 different values.
Loan Type is an indicator for the consumer maturity,
vehicle maturity and housing maturity variables. It is
a factor variable, stored as an integer in the financial
institution’s system. Its values are
“consumer loan”, “housing loan”, and “vehicle loan”,
stored as 1, 2, and 3 in the system. The financial
institution uses this variable to analyze the
relationship between the number of instalments and
whether the credit will end in default or not.
Most of the credits given by the financial institution
are consumer credits rather than housing or vehicle credits.
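
For illustration, the integer coding of Loan Type can be mapped back to readable labels as sketched below. The mapping assumes the codes 1, 2, and 3 follow the order in which the labels are listed in the text; the names LOAN_TYPE_LABELS and decode_loan_type are hypothetical.

```python
# Assumed mapping: codes follow the order in which the text lists the labels.
LOAN_TYPE_LABELS = {
    1: "consumer loan",
    2: "housing loan",
    3: "vehicle loan",
}


def decode_loan_type(code):
    """Translate the stored integer code into its loan type label."""
    try:
        return LOAN_TYPE_LABELS[code]
    except KeyError:
        raise ValueError(f"unknown loan type code: {code!r}") from None
```

Raising an error on unknown codes, rather than returning a default label, keeps unexpected values visible during data preparation.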
Every kind of credit arrangement, such as a credit
card, a credit deposit account or any of the
loan types, has a "due date". If a payment is made 1 or 2
days after its due date, the delay is referred to as 1 term. If
two consecutive loan repayments are made late,
it is a two-term delay. The
“DefaultNumber” variable refers to customers who
have previously gone through the legal default process.
Credits whose repayment is delayed for 3 terms
go into the default process and are closed after the
repayment is completed.
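
The delay rule described above can be sketched in code. This is one possible reading, not the institution's actual procedure: it assumes that a delay of n terms means n consecutive late repayments, and that each repayment is summarised by a boolean flag. The function names and the flag representation are hypothetical.

```python
DEFAULT_THRESHOLD_TERMS = 3  # from the text: 3 delayed terms trigger default


def longest_late_streak(late_flags):
    """Return the length of the longest run of consecutive late repayments."""
    longest = current = 0
    for late in late_flags:
        current = current + 1 if late else 0
        longest = max(longest, current)
    return longest


def enters_default_process(late_flags):
    """True if the repayment history contains a delay of at least 3 terms."""
    return longest_late_streak(late_flags) >= DEFAULT_THRESHOLD_TERMS
```

Under this reading, a history with three late repayments that are not consecutive does not enter the default process.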
There are 2 important credit scores determined by
the Consumer Reporting Agency (CRA) for each
customer. One of these variables, referred to as CRA
in Table 1, is an integer variable with a range from 0
to 1612. The CRA calculates this value according to
its internal rating system and provides it to the
financial institution when required. A value of 0
(zero) means that the score cannot be calculated by
the CRA for that customer. The higher the score, the more
creditworthy the customer is. The other important
credit score included in our dataset is the individual
indebtedness index (III), which is designed to predict
the risks arising from high indebtedness. The main
difference between the CRA score and the III value is that,
while the CRA value aims to determine the risk based
on past or current payment problems, the III value is
used to identify people who have not yet suffered any
difficulties but are likely to do so in the future due to
excessive borrowing.
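
Because a CRA value of 0 marks a score that could not be calculated rather than the lowest score, it may be useful to separate it from genuine scores before modelling. The sketch below shows one way to do so; it is an assumption about preprocessing, not a step reported in this study, and the name clean_cra_score is hypothetical.

```python
CRA_MAX = 1612  # upper bound of the CRA score range given in the text


def clean_cra_score(score):
    """Return None when no score could be calculated (0), else the score."""
    if not 0 <= score <= CRA_MAX:
        raise ValueError(f"CRA score out of range: {score!r}")
    return None if score == 0 else score
```

Keeping the "not calculated" case distinct prevents it from being read as the least creditworthy score on an ordinal scale.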
2.2 Personal Information
In addition to the variables related to the financial
status of the customers, the dataset contains some
personal information that might be important for the
creditworthiness of the customer. These are marital
status, occupation, education status, and age.
The marital status variable specifies the marital
status of the customer as of the date of the credit
application. This is a categorical variable with 5
different values. The occupation information is
represented with 8 different categories each one