
After this analysis, our goal is to explain why an applicant was or was not approved. We did not need categories such as Application Withdrawn or Loan Purchased by Financial Institution. Therefore we dropped everything except Loan Originated, Application Approved but Not Accepted (which we still count as denied), and Application Denied by Financial Institution, using a simple conditional statement. Because we needed a clear-cut answer, we made the label binary.
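The conditional filter described above can be sketched in pandas; the `action_taken` column name and its numeric codes follow the public HMDA schema (1 = originated, 2 = approved but not accepted, 3 = denied), and the toy frame stands in for the real records:

```python
import pandas as pd

# Toy stand-in for one year of HMDA records; in the real data the
# action_taken codes include 1 = Loan originated, 2 = Approved but
# not accepted, 3 = Denied, 4 = Withdrawn, 6 = Purchased.
df = pd.DataFrame({"action_taken": [1, 2, 3, 4, 6, 1]})

# Keep only the three decisive outcomes, then binarize:
# originated (1) -> approved; the other two kept codes -> denied.
df = df[df["action_taken"].isin([1, 2, 3])].copy()
df["approved"] = (df["action_taken"] == 1).astype(int)
```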
3.7 Merging by Year
We take each year's k = 10,000 sampled records and combine them into one CSV file, adding an extra feature that records the year of each record.
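A minimal sketch of this combination step, assuming the yearly samples are already loaded as DataFrames (the years, columns, and values are illustrative):

```python
import pandas as pd

# One sampled DataFrame per year (k records each); values illustrative.
yearly = {2012: pd.DataFrame({"loan_amount": [100, 200]}),
          2013: pd.DataFrame({"loan_amount": [150]})}

frames = []
for year, df in yearly.items():
    df = df.copy()
    df["year"] = year  # the extra feature recording each record's year
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
# combined.to_csv("hmda_combined.csv", index=False)
```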
As we were working on this, we found government data (the Exogenous data) ranging from US Bankruptcies to US Government Payrolls. We decided to add US Bankruptcies, US Consumer Spending, US Disposable Personal Income, US GDP Growth Rate, US New Home Sales, US Personal Income, and US Personal Savings Rate. All of these correlate with one another. For instance, if US New Home Sales are up for the year 2013, we expect the number of loans given out to be up for the same year. For the setup, we took each year of the Exogenous data, computed its average, and attached it to the corresponding year of the HMDA data.
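The setup step can be sketched as a group-by average over each Exogenous series followed by a join on the shared year column; the series name and values below are illustrative:

```python
import pandas as pd

# Illustrative sub-yearly exogenous series, averaged per year and
# joined to the HMDA records on the `year` column.
exog = pd.DataFrame({"year": [2012, 2012, 2013, 2013],
                     "new_home_sales": [30.0, 34.0, 40.0, 44.0]})
hmda = pd.DataFrame({"year": [2012, 2013, 2013],
                     "loan_amount": [100, 150, 200]})

exog_yearly = exog.groupby("year", as_index=False).mean()
merged = hmda.merge(exog_yearly, on="year", how="left")
```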
Being able to combine the Exogenous data helps us understand the algorithm better and gives us a better understanding of why some people were approved and others denied.
For the Exogenous data, we used Bankruptcies, Consumer Confidence, Disposable Income, Personal Income, Personal Savings, Prime Lending Rate, New Home Sales, GDP Growth Rate, and Consumer Spending. After we put it together, our data was 110,000 rows and 88 columns. We then followed the same process and one-hot encoded the categorical features, and the resulting shape was (110,000, 237).
• Bankruptcies were used to see why some people
could be denied loans.
• Consumer Confidence was added to show the business conditions that year.
• Disposable Income was added because the more
income that can be spent the more loans can be
given out.
• Personal Income was added because personal income affects what types of loans are given out.
• Personal Savings was added because the more
savings people have the more they can put the
money towards a house.
• Prime Lending Rate was added because it has a major effect on how many loans are given out.
• New Home Sales was added because people need loans for new homes.
• GDP Growth Rate was added to see how much
our economy has grown and to compare it to our
results.
• Consumer Spending was added because of its important correlation with the GDP Growth Rate.
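The jump from 88 to 237 columns comes from one-hot encoding the categorical features; a minimal sketch with an illustrative categorical column:

```python
import pandas as pd

# Illustrative frame with one categorical column; get_dummies expands
# each category into its own indicator column, which is how the
# feature count grew from 88 to 237 on the full data.
df = pd.DataFrame({"loan_amount": [100, 150, 200],
                   "property_type": ["one-to-four", "manufactured",
                                     "one-to-four"]})
encoded = pd.get_dummies(df, columns=["property_type"])
```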
4 CLASSIFICATION
ALGORITHM DESIGN AND
ANALYSIS
We considered several classical classification al-
gorithms, which include Random Forest, Decision
Tree, Support Vector Machine (SVM), and Naive
Bayes. These classifiers cover a diverse set of ap-
proaches including tree/entropy-based (Random For-
est and Decision Tree), probabilistic (Naive Bayes),
and maximum-margin hyperplane (SVM). We also
consider a more recent Deep Neural Network classi-
fier (DNN).
4.1 Classical Classification Algorithms
We take the processed data and conduct experiments
to evaluate each classifier. We randomly split the
combined file of 110,000 records into training and test sets (70% train, 30% test). The classification task
is to take the processed features and predict whether
a loan is approved or denied (i.e. binary classifica-
tion). For each experiment, we use a confusion ma-
trix to show our findings. This is a table to show the
performance of the algorithm. The bottom of the con-
fusion matrix indicates what the classifier predicted
and the left indicates the actual loan approval. True
means the loan was approved and False means the
loan was denied. The True-True intersection means
that it was predicted accurately by the classifier. The
False-False intersection means that it was also pre-
dicted accurately. Whereas False-True and True-False
mean that it was not predicted accurately. Figure 2 shows the confusion matrices for the Random Forest and Decision Tree classifiers.
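The experimental pipeline above can be sketched with scikit-learn on a small synthetic stand-in for the processed data (the real data has 110,000 rows; the 70/30 split matches the experiments):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed features and binary label
# (1 = approved, 0 = denied).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)

# Random 70/30 train/test split, as described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)

# cm[i, j] counts actual class i predicted as class j; the diagonal
# holds the correctly classified loans.
cm = confusion_matrix(y_te, clf.predict(X_te))
```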
The Decision Tree was chosen for its simplicity. Each node in the tree splits off into more specific subtrees and filters down to a leaf node. The data is then captured, understood, and analyzed. The Decision Tree is notable in that it can provide a more ex-
Deep Learning, Feature Selection and Model Bias with Home Mortgage Loan Classification