Machine-Learning-Based Prediction of Obesity
Siyuan Chen
International Business School, Henan University, Zhengzhou, China
https://orcid.org/0009-0006-3168-8559
Keywords: Machine Learning, Obesity Prediction, Neural Network.
Abstract: Obesity is a common phenomenon today. It is a chronic metabolic disease caused by excessive fat accumulation and results from the interaction of multiple factors such as genetics and environment. As machine learning is now widely used in many fields, this paper processes obesity data with machine learning methods to obtain fitted models, predict obesity, and identify its main causes. The main contents of this paper are: (1) using the obesity data provided by the UCI repository as the research object and applying a series of preprocessing steps in Python; (2) establishing machine learning models and importing the data to generate bar charts and other graphs that reflect the relevant results; (3) analyzing the results to determine which factors are most closely related to obesity and to evaluate the performance of the models. A comprehensive analysis of accuracy, recall, precision, and other indicators shows that the GBDT algorithm achieves the best prediction effect and can effectively predict obesity.
1 INTRODUCTION
Nowadays, people's living standards are constantly improving with the development of society, and diets are becoming more and more diversified. While enjoying these conveniences, people also face potential risks: the pace of society is accelerating, people are under pressure from all sides, and health problems keep arising. Obesity is a common phenomenon; it is a chronic metabolic disease caused by genetic and environmental factors. According to the Report on Nutrition and Chronic Diseases of Chinese Residents (2020), more than half of adults are overweight or obese, and the overweight/obesity rates of children and adolescents aged 6-17 and of children under 6 reached 19.0% and 10.4%, respectively (Liu, 2021). Obesity not only affects external appearance but also leads to various other diseases, and preventing and treating it costs people considerable time and money.
Therefore, the prevention and treatment of obesity is of great significance for preventing chronic diseases and reducing the personal financial burden. With artificial intelligence developing rapidly, machine learning has become widely applied in many fields, and medicine is a major area of application. By applying machine learning to existing data, relevant conclusions about obesity can be drawn and its main causes determined.
From the perspective of the current state of development, some machine learning methods have already been applied to disease prediction, but research on obesity prediction remains relatively incomplete. This paper therefore compares a variety of machine learning algorithms to find a model with good predictive power for obesity, so as to enable its timely management and treatment.
2 METHOD
This chapter introduces six machine learning models: logistic regression, decision tree, random forest, GBDT, XGBoost, and deep neural network (DNN). It also presents the experimental dataset and the imbalanced-learning methods, and explains the main content and direction of this experiment.
2.1 Logistic Regression
Among the linear regression models is logistic
regression, which uses data from independent
variables as input to forecast the probability of a desired outcome. Logistic regression is similar in principle to multiple linear regression, which first determines the best-fitting regression line to represent the connection between the independent variable (x) and the dependent variable (y) (Wang, 2023). The model form is based on the linear equation; if the independent variable is a feature vector, the equation can be written in matrix form:
y = ax + b,
z = a^{(1)}x^{(1)} + a^{(2)}x^{(2)} + a^{(3)}x^{(3)} + \cdots + a^{(n)}x^{(n)} + b    (1)
Logistic regression maps the linear equation to a probability p and determines the class of the dependent variable based on the values of p and 1 - p. The dependent variable of logistic regression can be dichotomous or multi-class, and the independent variables can be continuous or discrete.
There are three types of logistic regression: ordinal, multinomial, and binary.
The results of this paper are divided into obesity and non-obesity after data processing, so binomial logistic regression is adopted. The dependent variable of binomial logistic regression is essentially dichotomous, that is, there are only 0 or 1 results, and the probability distribution is as follows:
P(Y = 1 \mid x) = \frac{e^{w^{T}x}}{1 + e^{w^{T}x}}, \qquad P(Y = 0 \mid x) = \frac{1}{1 + e^{w^{T}x}}    (2)
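As a concrete illustration, the following is a minimal scikit-learn sketch of binomial logistic regression on an obesity dataset; the file name obesity.csv and the column name Obese are hypothetical placeholders rather than the actual fields of the UCI dataset used here.

# Minimal sketch: binomial logistic regression for obesity prediction.
# "obesity.csv" and the "Obese" column are hypothetical placeholders;
# features are assumed to be numeric after preprocessing.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("obesity.csv")
X = df.drop(columns=["Obese"])                 # independent variables
y = df["Obese"]                                # 1 = obese, 0 = non-obese

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)        # fits w in Eq. (2)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:5]))           # probabilities p and 1 - p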
2.2 Decision Tree
The decision tree is a predictive technique that builds a target variable or a categorization scheme from several input variables. This algorithm can effectively handle large datasets. Common uses
of the decision tree model include variable selection,
evaluation of variable importance, processing of
missing values, and prediction (Song, 2015). The
duality of the results enables a good application of
decision trees for obesity prediction in this
experimental study.
The main parts of the decision tree model are the
nodes and branches and the construction of the model
includes splitting, stopping and pruning (Song, 2015).
The nodes of a decision tree can be divided into three types: (1) the root node, also called the decision node; (2) internal nodes; and (3) leaf nodes, sometimes referred to as end nodes, which represent the decision tree's final outcomes.
A decision tree is a sequential model that combines a series of tests, comparing a feature value against a threshold in each test (Navada, Ansari, Patil, 2011). Each node in the decision tree corresponds to a test on a data attribute, each branch denotes the outcome of that test, and the model links the dataset's observations to conclusions about the pertinent target values (Sharma, Kumar, 2016).
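As a hedged sketch, reusing the train/test split from the logistic regression example and with illustrative hyperparameter values, a decision tree classifier for the binary obesity label could look like this:

# Sketch: decision tree classifier; depth limit and leaf size act as
# simple stopping/pruning controls (values are illustrative).
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,            # limits splitting (a form of pre-pruning)
    min_samples_leaf=10,    # stopping criterion for leaf (end) nodes
    random_state=42,
)
tree.fit(X_train, y_train)
print(tree.predict(X_test[:5]))    # 0/1 leaf-node outcomes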
2.3 Random Forest
The random forest technique further improves the classification and regression tree model: it is composed of a large number of randomized decision trees and can be used for prediction once constructed. Because the validity of the decision tree for binary classification carries over to the prediction of obesity, the random forest model was adopted in this experiment. A random forest aggregates the outputs of its many decision trees into a single output by averaging (Rigatti, 2017). Formally, a random forest is a model made up of several randomized base regression trees \{r_{n}(x, \Theta_{m}, D_{n}), m \geq 1\}, where \Theta_{1}, \Theta_{2}, \ldots are i.i.d. outputs of a randomizing variable \Theta. Combining these random trees forms an aggregated regression estimate (Biau, 2012):

\bar{r}_{n}(X, D_{n}) = \mathrm{E}_{\Theta}\left[r_{n}(X, \Theta, D_{n})\right]    (3)
Each decision tree produces a result based on the input data, and the final output of the random forest is obtained by integrating these multiple results (Liu, 2014).
The advantage of the random forest is that it can capture interactions between the predictor variables and non-linear relationships, but it is difficult to judge which variables have a greater impact on the prediction results.
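A minimal random forest sketch, again reusing the earlier split and with illustrative settings, might be:

# Sketch: random forest aggregating many randomized decision trees
# into a single prediction, as in Eq. (3).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,       # number of randomized base trees (illustrative)
    random_state=42,
)
forest.fit(X_train, y_train)
print(forest.predict(X_test[:5]))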
2.4 GBDT Model
GBDT, whose full name is gradient boosting decision tree, is an iterative decision tree algorithm that is likewise suited to the binary nature of this study. To obtain the final prediction, the model aggregates the outputs of several decision trees. Each subsequent weak classifier fits the residual of the current prediction, that is, the difference between the predicted value and the true value, and the final prediction is obtained by summing the outputs of all the weak classifiers. The decision tree is the common base learner in the GBDT model, which is an ensemble learning model (Zhang, 2021).
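The residual-fitting idea can be sketched with scikit-learn's gradient boosting implementation; the hyperparameter values below are assumptions, not the paper's settings:

# Sketch: GBDT, where each new tree fits the residual left by the
# previous trees and the weak learners' outputs are summed.
from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(
    n_estimators=100,       # number of sequential weak learners
    learning_rate=0.1,      # shrinkage on each tree's contribution
    random_state=42,
)
gbdt.fit(X_train, y_train)
print(gbdt.predict(X_test[:5]))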
2.5 XGBoost Model
XGBoost is an improved model of the gradient boosting algorithm, a machine learning algorithm implemented under the Gradient Boosting framework. The basic component of the XGBoost model is the decision tree, which performs well on dichotomous data, so it is applied to the prediction of obesity in this experiment. The trees are built in order: each decision tree is combined with the prediction results of the previous ones, that is, it takes the previous trees' errors into account, which increases the weight of previously mispredicted samples in the subsequent rounds and thus improves the accuracy and soundness of the model's predictions. From the model principle, the prediction formula is obtained as follows:
\hat{y}_{a}^{(t)} = \hat{y}_{a}^{(t-1)} + f_{t}(x_{a})    (4)
That is, the predicted value for sample a after t trees equals the predicted value for sample a after the first t - 1 trees plus the output of the t-th tree for sample a. The objective function of the model prediction is as follows:
\mathcal{L}^{(t)} = \sum_{a=1}^{n} l\left(y_{a}, \hat{y}_{a}^{(t)}\right) + \Omega(f_{t})    (5)
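Assuming the xgboost Python package, the additive training of Eqs. (4)-(5) can be sketched as follows (parameter values are illustrative):

# Sketch: XGBoost, where each boosting round adds a tree f_t that
# corrects the prediction accumulated over the previous rounds.
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,       # boosting rounds (trees)
    learning_rate=0.1,
    eval_metric="logloss",
)
xgb.fit(X_train, y_train)
print(xgb.predict(X_test[:5]))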
2.6 Deep Neural Network
A DNN is used to make predictions on the dataset. The network consists of an input layer, an output layer, and three hidden layers. Applying the activation function tanh() in the hidden layers introduces nonlinearity and expands the expressive ability of the neural network (Figure 1). The tanh function is centered on zero, its gradient is steeper and not limited to one direction, and it is overall superior to the sigmoid function (Sharma, 2017).
Figure 1: Deep neural network model (Photo/Picture credit: Original).
Deep neural networks have good nonlinear fitting ability but require large-scale datasets for training; otherwise, overfitting may occur.
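A sketch of the described architecture in Keras is given below; the paper does not specify the widths of the three hidden layers, so the values used here are assumptions:

# Sketch: DNN with three tanh hidden layers and a sigmoid output
# for the binary obesity label (layer widths are assumed).
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(64, activation="tanh"),    # hidden layer 1
    keras.layers.Dense(32, activation="tanh"),    # hidden layer 2
    keras.layers.Dense(16, activation="tanh"),    # hidden layer 3
    keras.layers.Dense(1, activation="sigmoid"),  # P(obese)
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])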
3 RESULTS AND DISCUSSION
3.1 Preprocessing
3.1.1 Distribution of the Data Sets
As shown in the pie chart in Figure 2, the proportion of obese subjects in this experimental dataset is unbalanced compared with the other outcomes, so the dataset is speculated to be imbalanced.
Figure 2: Schematic representation of the obesity ratio among the survey subjects (Photo/Picture credit: Original).
3.1.2 Imbalanced-Data Processing Algorithm
The SMOTE algorithm, namely the Synthetic Minority Oversampling Technique, is an improved random oversampling method that generates new artificial minority instances by interpolating between instances located close together (Yang, 2021). It oversamples the minority classes by connecting random data points (Elreedy, 2019). SMOTE can alleviate the overfitting caused by plain random oversampling by inserting synthetic instances at new positions, but it still has two disadvantages: first, it can spread noise; second, all instances share the same global neighborhood parameter, which ignores the distribution characteristics and results in a poorer classification effect.
The SMOTE algorithm is quite robust, but when the sample data are chaotic and some samples are scattered, it mechanically generates new points that become noise points and degrade the classification performance of the model (Yang, 2021).
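Using the imbalanced-learn package, SMOTE oversampling of the training split can be sketched as follows; k_neighbors=5 is the library default and illustrates the single global neighborhood parameter criticized above:

# Sketch: SMOTE interpolates between a minority instance and its
# k nearest minority neighbors to synthesize new instances.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)   # balanced classes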
3.1.3 Hyperparameter Setting
The experimental code was written in Python and run in the PyCharm IDE. The main hyperparameters are epochs and batch_size.
Epochs is the number of complete passes through the training set when training the neural network, each pass consisting of forward propagation and backpropagation.
Batch_size is the number of samples the neural network uses in a single training step.
In this experiment, the results of different parameter settings were analyzed multiple times; the epochs value was set to 200 and the batch_size value to 70, since the comparison results between models obtained with these values were the most balanced.
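With the Keras model sketched in Section 2.6, these hyperparameters would enter training roughly as follows (the validation split is an added assumption):

# Sketch: training with the reported hyperparameters.
history = model.fit(
    X_res, y_res,          # SMOTE-balanced training data from above
    epochs=200,            # full passes over the training set
    batch_size=70,         # samples per gradient update
    validation_split=0.1,  # assumption; not specified in the paper
)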
3.2 Evaluation Indicators
(1) Accuracy
Model accuracy is one of the main indicators for evaluating model performance; its formula is as follows:
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (6)
(2) Precision
Precision is the proportion of truly positive samples among all samples the model predicts as positive. Its formula is:
\mathrm{Precision} = \frac{TP}{TP + FP}    (7)
(3) Recall
Recall is the proportion of actual positive samples that the model correctly predicts as positive. Its formula is:
\mathrm{Recall} = \frac{TP}{TP + FN}    (8)
Precision and recall generally trade off against each other: when a model's precision is high, its recall tends to be low, and when its precision is low, its recall is relatively high.
(4) F1-score
The F1-score is an indicator that combines precision and recall into a single averaged result. The formula is:
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (9)
Using the F1 score avoids the difficulty of choosing between models whose precision and recall values are close, and helps identify the better-performing model more efficiently.
(5) AUC
AUC is the area under the ROC curve. The larger the area under the curve, the better the model performance can be judged to be.
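The five indicators can be computed with scikit-learn as in the sketch below, where clf stands for any fitted model from Section 2:

# Sketch: computing the evaluation indicators of Eqs. (6)-(9) plus AUC.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # scores for the ROC curve

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))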
3.3 Analysis of the Indicators
(1) Accuracy
First, observe each model's prediction performance on the dataset through accuracy; the comparison is shown below (Figure 3 and Table 1):
Figure 3: The prediction accuracy of each model (Photo/Picture credit: Original).
Table 1: The prediction accuracy of each model.

Method               Accuracy
Logistic Regression  0.87
Decision Tree        0.91
Random Forest        0.915
GBDT                 0.919
XGBoost              0.92
DNN                  0.897
As illustrated in Figure 3, the logistic regression model has the lowest prediction accuracy, only around 87%, followed by the DNN, while the models with the highest accuracy are GBDT and XGBoost.
Because the prediction task is binary, the tree-based models achieve high accuracy. When GBDT predicts a new sample, each tree produces an output value and these outputs are summed to obtain the final prediction; each tree is trained on the residual between the true result and the prediction of the preceding trees. Since the differences within the experimental dataset are large, the GBDT model achieves high prediction accuracy on the test set. The XGBoost model is similar in principle to GBDT, and the result of each tree during training affects the generation of the next tree, so its prediction accuracy on this dataset is also relatively high.
(2) Precision and Recall
The precision and recall of each model on the experimental dataset are shown below (Figure 4, Table 2, and Table 3):

Figure 4: Prediction precision and recall of each model (Photo/Picture credit: Original).
Table 2: Prediction precision of each model.

Method               Precision
Logistic Regression  0.723
Decision Tree        0.822
Random Forest        0.806
GBDT                 0.817
XGBoost              0.818
DNN                  0.84
Table 3: Recall of each model.

Method               Recall
Logistic Regression  0.824
Decision Tree        0.84
Random Forest        0.89
GBDT                 0.905
XGBoost              0.904
DNN                  0.75
Figure 4 demonstrates the relationship between precision and recall. The contrast is most obvious for the DNN model, reflecting the imbalance of the dataset itself. The two indicators differ considerably for models that cannot make gradient adjustments during prediction, while models that can, such as GBDT and XGBoost, are relatively balanced.
(3) F1-Score
Considering that analyzing precision and recall separately is not intuitive enough, the F1 score is used to reflect the performance of each model more directly. The comparison between models is shown in Figure 5 and Table 4:

Figure 5: The predicted F1 scores for each model (Photo/Picture credit: Original).
Table 4: The predicted F1 scores for each model.

Method               F1-score
Logistic Regression  0.909
Decision Tree        0.939
Random Forest        0.941
GBDT                 0.946
XGBoost              0.95
DNN                  0.93
As can be seen intuitively from Figure 5 and Table 4, the GBDT and XGBoost models perform better than the other models. Two reasons can be conjectured: first, the training mode of these two models is superior, as they continually adjust along the gradient; second, since the experimental dataset itself is imbalanced, these two models more effectively mitigate the poor results caused by the imbalance. Therefore, the GBDT and XGBoost models show better performance.
4 CONCLUSION
This study builds prediction models on the UCI obesity dataset, visualizes and analyzes the data through the models, and evaluates model performance using accuracy, precision, recall, and F1 score to find a more accurate obesity prediction model. The results show that the GBDT and XGBoost models score highly on the relevant prediction indicators and fit the data well, so these two machine learning algorithms can be used in the future to predict obesity and related diseases. At the same time, four features, namely age, whether a family member has obesity, whether high-calorie food is often eaten, and the frequency of eating between meals, were found to be associated with obesity; when determining whether a patient is obese, doctors can use these features to make a further judgment.
However, this study has certain limitations and areas for improvement. The relatively small amount of data in this experiment leads to insufficient analysis of other features; if a large amount of relevant data could be obtained, it would be more favorable for model prediction.
The experimental data were collected over a small range, so the prediction results are less widely applicable. Because the survey results in the dataset come from a narrow collection scope, their universality is questionable; the scope of data collection should be expanded to give the model better generality.
REFERENCES
Biau G. (2012). Analysis of a random forests model. The
Journal of Machine Learning Research, 13: 1063-1095.
Elreedy D, Atiya A F. (2019). A comprehensive analysis of
synthetic minority oversampling technique (SMOTE)
for handling class imbalance. Information Sciences, 505:
32-64.
Liu Y. (2021). The Report on Nutrition and Chronic Diseases of Chinese Residents (2020) was released. Agricultural Products Market Weekly, (2): 58-59.
Liu Y. (2014). Random forest algorithm in big data
environment. Computer modelling & new technologies,
18(12A): 147-151.
Navada A, Ansari A N, Patil S, et al. (2011). Overview of use of decision tree algorithms in machine learning. 2011 IEEE Control and System Graduate Research Colloquium: 37-42.
Rigatti S J. (2017). Random forest. Journal of Insurance
Medicine, 47(1): 31-39.
Sharma S, Sharma S, Athaiya A. (2017). Activation
functions in neural networks. Towards Data Sci, 6(12):
310-316.
Sharma H, Kumar S. (2016). A survey on decision tree
algorithms of classification in data mining.
International Journal of Science and Research (IJSR),
5(4): 2094-2097.
Song YY, Lu Y. (2015). Decision tree methods:
applications for classification and prediction. Shanghai
Arch Psychiatry, 27(2):130-5.
Wang X. (2023). Machine learning based prediction model
for heart disease. Southwestern University.
Yang D. (2021). Research and application of unbalanced
data processing algorithms. Jiangsu: Jiangsu University
of Science and Technology.
Yu C S, Lin Y J, Lin C H, et al. (2020). Predicting metabolic
syndrome with machine learning models using a
decision tree algorithm: Retrospective cohort study.
JMIR medical informatics, 8(3): e17110.
Zhang W, Yu J, Zhao A, et al. (2021). Predictive model of
cooling load for ice storage air-conditioning system by
using GBDT. Energy Reports, 7: 1588-1597.