Then we construct and train several machine learning models, including the radial basis function (RBF) network, the naive Bayes classifier (NBC), and the random forest model, to obtain corresponding results for further analysis.
RBF (Radial Basis Function): The RBF neural network is an algorithm based on radial basis functions, proposed by J. Moody and C. Darken in 1988. It is a local approximation network that can approximate any continuous or discrete function with arbitrary accuracy (Bi et al., 2016), and can handle rules that are difficult to analyze within the system. It is quite effective in handling nonlinear classification and prediction problems.
The neural network consists of three layers: an input layer, a hidden layer, and an output layer. The input layer is the same as in other neural networks; in this article it represents the physicochemical test attributes of the red wine, and the datasets score these attributes, leading to the final quality prediction. Its structural diagram is shown in Figure 1. As shown in the figure, the input layer is (x1, x2, ..., xp), the hidden layer is (c1, c2, ..., ch), and the output layer is y; (w1, w2, ..., wm) represents the connection weights from the hidden layer to the classification of red wine quality grades (Bi et al., 2016). Each node in the hidden layer applies a nonlinear function, h(x), known as a radial basis function, which determines the connection weights to the output layer. The primary role of the hidden layer is to transform the low-dimensional input vector of p statistical features into a high-dimensional representation of h activations, which ultimately influences the quality assessment. This transformation enables the network to address cases that are linearly inseparable in low dimensions by making them separable in higher dimensions.
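As a hedged illustration of this lifting (the centers, width, and XOR-style points below are assumptions for the example, not values from the paper), the following sketch maps four points that no line can separate in two dimensions through two Gaussian hidden units; in the lifted space a simple linear threshold separates the classes:

```python
import numpy as np

# XOR points: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Two Gaussian hidden units, centered (arbitrarily, for illustration)
# on the two class-1 points; sigma is a hypothetical width.
centers = np.array([[0, 1], [1, 0]], dtype=float)
sigma = 0.5

# h(x) = exp(-||x - c||^2 / (2 sigma^2)) for each center c.
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
H = np.exp(-dists**2 / (2 * sigma**2))

# In the lifted space, the sum of the two activations separates
# the classes with a simple linear threshold.
print(H.sum(axis=1) > 0.5)  # [False True True False], matching y
```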
The central concept driving this process is the
kernel function, which ensures that the mapping from
input to output within the network is nonlinear, while
maintaining linearity in the network’s output with
adjustable parameters. By solving the network’s
weights directly through linear equations, the learning
process is significantly accelerated, and the risk of
getting stuck in local minima is minimized. The
activation function of a radial basis function neural
network is typically represented by a Gaussian
function.
$$R(x_p - c_i) = \exp\!\left(-\frac{\lVert x_p - c_i \rVert^2}{2\sigma^2}\right) \tag{1}$$
The structure of the radial basis function neural network can then be expressed as follows:
$$y_j = \sum_{i=1}^{h} \omega_{ij}\, \exp\!\left(-\frac{\lVert x_p - c_i \rVert^2}{2\sigma^2}\right) + b_j, \qquad j = 1, 2, \ldots, n \tag{2}$$
Among them, $x_p$ is the p-th input sample, $c_i$ is the i-th center point, $h$ is the number of nodes in the hidden layer, $n$ is the number of samples or classification outputs, and $b_j$ is the threshold of the j-th output neuron.
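A minimal Python sketch of equations (1) and (2) follows; it also illustrates the direct linear solve of the output weights described above. The choice of k-means for the centers, the width sigma, the number of hidden nodes, and the random stand-in data are all illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_design(X, centers, sigma):
    """Hidden-layer matrix: R(x_p - c_i) = exp(-||x_p - c_i||^2 / (2 sigma^2)), eq. (1)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-d**2 / (2 * sigma**2))

def fit_rbf(X, y, h=10, sigma=1.0):
    """Pick h centers by k-means, then solve weights and bias by linear least squares."""
    centers = KMeans(n_clusters=h, n_init=10, random_state=0).fit(X).cluster_centers_
    Phi = rbf_design(X, centers, sigma)
    Phi = np.hstack([Phi, np.ones((len(X), 1))])   # last column carries the threshold b_j
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # direct linear solve: no local minima
    return centers, W

def predict_rbf(X, centers, W, sigma=1.0):
    """Eq. (2): weighted sum of Gaussian activations plus threshold."""
    Phi = rbf_design(X, centers, sigma)
    Phi = np.hstack([Phi, np.ones((len(X), 1))])
    return Phi @ W

# Hypothetical usage on random stand-in data (11 physicochemical features):
X = np.random.rand(200, 11)
y = np.random.randint(3, 9, size=200).astype(float)  # stand-in wine quality scores
centers, W = fit_rbf(X, y, h=10, sigma=1.0)
print(predict_rbf(X[:5], centers, W))
```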
NBC: The naive Bayes classifier (NBC) is a very simple classification algorithm. For a red wine sample to be classified, the category with the highest probability of occurrence, given the observed attributes, is the category the sample is considered to belong to (Liang, 2019). The NBC model assumes that attributes are independent of each other; in real data, however, the attributes are correlated, and it is precisely this assumption that limits the use of the NBC model.
The Bayesian method computes the posterior probability from an assumed prior probability and the conditional probability obtained from the observed data under a given hypothesis:

$$P(C \mid Y) = \frac{P(Y \mid C)\, P(C)}{P(Y)} \tag{3}$$
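As a hedged numeric illustration (the values are made up, not from the paper): if P(C) = 0.3, P(Y | C) = 0.2, and P(Y) = 0.15, then P(C | Y) = (0.2 × 0.3) / 0.15 = 0.4, so observing Y raises the probability of class C from 0.3 to 0.4.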
Assume that each data sample Y = {y1, y2, ..., yn} is an n-dimensional vector, and that there are class labels C1, C2, ..., Cn.
Obtaining:
$$\max\{P(C_1 \mid Y),\, P(C_2 \mid Y),\, \ldots,\, P(C_n \mid Y)\} \tag{4}$$
This transforms the classification problem into a conditional probability problem. Since P(Y) is constant for all classes, the probability of each class C occurring under the condition Y is P(C | Y); the class C with the maximum probability is taken as the answer and determines the class label Ci.
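A minimal sketch of this decision rule follows (the class names and posterior values are hypothetical, not from the paper); dividing by the constant P(Y) cannot change which class attains the maximum, so it is skipped:

```python
import numpy as np

# Hypothetical unnormalized posteriors P(Y|C_i) * P(C_i) for three quality classes;
# the constant denominator P(Y) does not affect the ranking and is omitted.
classes = ["low", "medium", "high"]
scores = np.array([0.02, 0.11, 0.05])

label = classes[int(np.argmax(scores))]
print(label)  # "medium": the class with the maximum posterior
```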
Because a sample X often has a high dimension, the probability of an arbitrary combination of features is usually difficult to estimate, which is where the word "naive" in naive Bayes comes in. By assuming that the features are conditionally independent of each other, the number of parameters to be estimated is greatly reduced: each P(x_k | Ci) only needs to be estimated separately, and the results are then multiplied to obtain:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) \tag{5}$$
The prior probability can be obtained from the
training samples:
$$P(C_i) = \frac{s_i}{s} \tag{6}$$

where $s_i$ is the number of training samples belonging to class $C_i$ and $s$ is the total number of training samples.
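Putting equations (3) through (6) together, the following is a minimal counting-based sketch of the classifier; the discretization of the continuous wine attributes into bins and the random stand-in data are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def fit_nbc(X, y):
    """Estimate P(C_i) = s_i / s (eq. 6) and each P(x_k | C_i) (eq. 5) by counting."""
    s = len(y)
    priors, cond = {}, defaultdict(dict)
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / s                              # eq. (6)
        for k in range(X.shape[1]):
            vals, counts = np.unique(Xc[:, k], return_counts=True)
            cond[c][k] = dict(zip(vals, counts / len(Xc)))   # P(x_k = v | C_i)
    return priors, cond

def predict_nbc(x, priors, cond):
    """Return argmax_i P(C_i) * prod_k P(x_k | C_i), eqs. (3)-(5)."""
    best, best_p = None, -1.0
    for c, p in priors.items():
        for k, v in enumerate(x):
            p *= cond[c][k].get(v, 1e-9)  # tiny floor for unseen feature values
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical usage: wine attributes discretized into 3 bins per feature.
X = np.random.randint(0, 3, size=(100, 4))
y = np.random.randint(0, 2, size=100)
priors, cond = fit_nbc(X, y)
print(predict_nbc(X[0], priors, cond))
```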