are no obvious changes in either systolic blood
pressure (sysBP) or glucose level. Using 20% of the
data as the test set, the model accuracy is 84.79%. To
evaluate the model, a confusion matrix was computed
on the test set (Figure 3). It shows that the model
made 5 (TP) + 714 (TN) = 719 correct predictions and
2 (FP) + 127 (FN) = 129 incorrect ones. The positive
predictive value (5/7 ≈ 71%) is lower than the
negative predictive value (714/841 ≈ 85%). The
sensitivity is 3.79% and the specificity is 99.72%, so
the model identifies negative cases far more reliably
than positive ones. Given these results, the model
might improve with more data.
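The figures in this subsection follow directly from the four confusion-matrix counts. As a minimal sketch (the function name `confusion_metrics` is illustrative and not part of the original analysis), they can be recomputed as follows:

```python
# Counts reported for the logistic regression confusion matrix (Figure 3).
TP, TN, FP, FN = 5, 714, 2, 127


def confusion_metrics(tp, tn, fp, fn):
    """Return standard binary-classification metrics as fractions."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


metrics = confusion_metrics(TP, TN, FP, FN)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these counts the sketch gives an accuracy of 84.79%, a sensitivity of 3.79%, a specificity of 99.72%, a positive predictive value of 71.43%, and a negative predictive value of 84.90%.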
4.2 Fisher’s Linear Discriminant Analysis
Figure 4: Fisher’s Linear Discriminant Confusion Matrix
For this model, the overall accuracy is 83.37%.
The sensitivity is 9.15%, and the specificity is
98.17%. Overall, this algorithm yields high
accuracy and specificity but low sensitivity. The
positive predictive value is 84.42%, greater than the
negative predictive value of 50%.
Figure 5: Neural Network Diagram.
4.3 Neural Network Analysis
For the Neural Network algorithm, the overall
accuracy is 85.7%. The sensitivity is 14.5%, and the
specificity is 97.1%. Overall, the neural network
algorithm shows high accuracy and specificity but
low sensitivity. The positive predictive value is
87.7%, which is far greater than the negative
predictive value of 44.7%.
5 CONCLUSIONS
Based on the results, all three algorithms perform
better on specificity than on sensitivity. Given their
accuracy, all three might improve with more data. In
addition, the accuracies of logistic regression and
the neural network are slightly higher than that of
Fisher’s linear discriminant. Therefore, those two
algorithms can be used to determine the risk factors
of cardiovascular disease.
The neural network performs its computation in a
black box, so further investigation of attribution is
needed (Larry Hardesty 2020). The algorithm was
rerun repeatedly, each time with one distinct
attribute manually dropped, and the test-set accuracy
was recorded. This procedure shows that the most
significant tier of attributes includes male, age,
diabetes, sysBP, and BMI. The second most
significant tier includes cigsPerDay, BPMeds,
prevalentStroke, prevalentHyp, and totChol.
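The drop-one-attribute procedure described above can be sketched generically. This is only an illustration under stated assumptions: `train_and_score` is a hypothetical callable standing in for retraining the network and measuring test-set accuracy, and the toy weights exist solely to make the sketch runnable. They are not results from this study.

```python
def drop_one_attribute_scores(attributes, train_and_score):
    """Retrain once per attribute with that attribute left out.

    `train_and_score` is a hypothetical callable: it takes the list of
    attributes to keep, fits a model, and returns test-set accuracy.
    """
    baseline = train_and_score(list(attributes))
    drops = {}
    for attr in attributes:
        kept = [a for a in attributes if a != attr]
        # Accuracy lost when `attr` is removed from the inputs.
        drops[attr] = baseline - train_and_score(kept)
    # Largest accuracy loss first: the most influential attributes.
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in for real training. The weights are made up for
# illustration and are NOT measurements from the study.
TOY_WEIGHTS = {"age": 0.03, "sysBP": 0.02, "totChol": 0.005}


def toy_train_and_score(kept):
    return 0.80 + sum(TOY_WEIGHTS[a] for a in kept)


ranking = drop_one_attribute_scores(list(TOY_WEIGHTS), toy_train_and_score)
print(ranking)
```

Attributes whose removal costs the most accuracy are placed in the top tier. In practice, each retraining run should use the same train/test split so that the accuracy differences are comparable.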
Logistic regression can evaluate the significance
of each attribute for CHD based on its p-value.
After data cleaning, the attributes with p-values
below 5%, which therefore play a significant role in
cardiovascular disease prediction, are male, age,
cigsPerDay, prevalentStroke, sysBP, and glucose
level.
The most relevant risk factors of coronary heart
disease appear to be male, age, and sysBP, since
these attributes were among the most significant in
both the logistic regression and the neural network.
In addition, since their p-values are below 5% in the
logistic regression and they fall in the second tier
of the neural network ranking, both “cigsPerDay” and
“prevalentStroke” are relatively important risk
factors for coronary heart disease.
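The overlap described above can be checked with simple set operations on the attribute lists reported in this section. The variable names below are illustrative. The attribute names are taken from the text.

```python
# Attribute groups as reported in the text.
nn_first_tier = {"male", "age", "diabetes", "sysBP", "BMI"}
nn_second_tier = {"cigsPerDay", "BPMeds", "prevalentStroke",
                  "prevalentHyp", "totChol"}
logit_significant = {"male", "age", "cigsPerDay", "prevalentStroke",
                     "sysBP", "glucose"}

# Significant in the logistic regression AND top tier in the network.
strongest = logit_significant & nn_first_tier
# Significant in the logistic regression AND second tier in the network.
secondary = logit_significant & nn_second_tier

print(sorted(strongest))  # ['age', 'male', 'sysBP']
print(sorted(secondary))  # ['cigsPerDay', 'prevalentStroke']
```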
All three methods have their advantages.
Although the accuracies of Logistic Regression and
the Neural Network are slightly higher than that of
Fisher’s Linear Discriminant, Fisher’s Linear
Discriminant still provides decent accuracy.
Logistic Regression makes it possible to evaluate
the significance of the attributes directly through
the p-values. Fisher’s Linear Discriminant is easy to
understand and implement, so it suits a beginner
starting a large machine learning project, since it
quickly provides a working model. Neural Networks
generally offer high accuracy and robustness, can
handle large volumes of high-dimensional data, and
have flexible layer sizes. Therefore, under the right
scenarios, each of the algorithms can be optimal.