WineQT Dataset from Kaggle. This is a real dataset
that records the quality of Portuguese Vinho Verde
wines alongside their chemical properties. The
prediction target is quality. The input features are
chemical properties such as fixed acidity, residual
sugar, pH, and alcohol. This study uses these
features to build prediction models and
determines the best model by comparing
their performance.
The remainder of this paper is structured as follows.
Section 2 reviews related work. Section 3
discusses the proposed methods, including
their principles and the reasons for their
selection. Section 4 examines the
experimental results, compares the models, and
selects the best one. Section 5 summarizes the
main findings and conclusions.
2 RELATED WORK
The quality of wine is affected by many factors,
including alcohol content and acidity. Researchers
choose different features to predict
wine quality. Natalie Harris et al. judge the quality of
wine through its aroma (Harris et al., 2023).
Dragana B. Radosavljevic et al. judge the quality of
wine through its physical and chemical properties,
such as alcohol, pH, and density
(Radosavljevic et al., 2019). Each of the two
approaches has its own advantages. Considering more
factors and selecting highly relevant
features can make it easier to predict
the quality of wine.
Many studies have explored approaches to wine
quality prediction, including machine learning and
deep learning. The machine learning methods include
Extreme Gradient Boosting (XGB), Adaptive
Boosting (AdaBoost), Gradient Boosting (GB), RF,
Decision Tree (DT), etc., and the deep learning methods
include ANN, Convolutional Neural Networks (CNN), etc.
Among them, Piyush Bhardwaj et al. used RF and
AdaBoost classifiers, demonstrating their superiority
in predicting wine quality, and evaluated the models
in terms of Precision, Recall, F1, ROC_AUC,
and MCC (Bhardwaj et al., 2022). Feature selection
is an important step that is frequently performed
during model evaluation. RF, XGB, GB, and Extra
Trees classifiers were used, together with the
Pearson correlation coefficient, to select the top
ten features for training.
Keshab R. Dahal et al. conducted a comparative study
of the ensemble learning method Gradient Boosting
and the deep learning method ANN (Dahal et al.,
2021). To limit the influence of differing feature
ranges on the models, they applied feature scaling
to reduce the scale differences between
features. They then evaluated the performance of the
models using three metrics: R, MSE, and MAPE.
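As a minimal sketch of the preprocessing and evaluation described above, the following Python code standardizes feature columns and computes R, MSE, and MAPE. Several points are assumptions rather than details from the cited study: z-score standardization is used as one common form of feature scaling, R is taken to be the Pearson correlation coefficient between actual and predicted values, and the function names and toy numbers are purely illustrative.

```python
import math


def standardize(columns):
    """Scale each feature column to zero mean and unit variance (z-score)."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        # A constant column has zero spread; map it to zeros to avoid dividing by zero.
        scaled.append([(v - mean) / std if std > 0 else 0.0 for v in col])
    return scaled


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between actual and predicted values."""
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy values for illustration only (not taken from any dataset).
alcohol = [9.4, 9.8, 10.5, 11.2]
residual_sugar = [1.9, 2.6, 2.3, 6.1]
scaled_alcohol, scaled_sugar = standardize([alcohol, residual_sugar])

quality_true = [5.0, 5.0, 6.0, 7.0]
quality_pred = [5.0, 6.0, 6.0, 6.0]
print("MSE :", mse(quality_true, quality_pred))
print("MAPE:", mape(quality_true, quality_pred))
print("R   :", pearson_r(quality_true, quality_pred))
```

After scaling, both columns have mean 0 and variance 1, so neither dominates a distance-based or gradient-based model merely because of its units.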
Beyond evaluating the performance of
different models on training datasets, Khushboo Jain
et al. evaluated the importance of data processing
techniques and feature selection for predicting wine
quality, rather than focusing on the modeling methods. In
machine learning and data mining, feature selection
is a research topic that has attracted much attention
because different features affect
model performance differently. Agarwal et al.
considered the application of different feature
selection techniques, such as principal component
analysis and recursive feature elimination, in their
research. Piyush Bhardwaj et al. considered 54
features when predicting wine quality and extracted
the 10 most important features through feature
selection. Six of these features were highly
important in all models used in their experiments.
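One simple way to rank features, sketched below under stated assumptions, is to score each feature by the absolute value of its Pearson correlation with the quality label and keep the k highest. This illustrates the idea only: the cited studies also rely on model-based importances, and the helper names, the "constant" column, and the toy values here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant column carries no linear signal; treat its correlation as 0.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def top_k_features(features, target, k):
    """Return the k feature names with the largest |correlation| to the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy values for illustration only (not taken from any dataset).
features = {
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "pH": [3.2, 3.5, 3.1, 3.4, 3.3],
    "density": [0.999, 0.998, 0.997, 0.996, 0.995],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
quality = [4, 5, 6, 6, 8]

selected = top_k_features(features, quality, k=2)
print(selected)
```

Taking the absolute value matters here: a feature that is strongly negatively correlated with quality is as informative to a model as one that is strongly positively correlated.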
Current research approaches wine quality
prediction from three angles. First, researchers draw on
datasets from different sources and of different
dimensions to predict wine quality from various
characteristics. Second, they apply
feature engineering, for example balancing the differences
between features through feature
scaling and identifying the most predictive features
through feature selection. Finally, they compare the
models' evaluation metrics and shortlist the
highest-scoring models to determine the best
model.
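The final step, comparing models on their evaluation metrics, can be sketched as follows. The example assumes a binary label (1 for a good wine, 0 otherwise), which is a simplification, and it selects the best model by F1 alone, which is only one possible criterion. The model names and predictions are placeholders, not results from any study.

```python
import math


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and MCC for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Placeholder labels and predictions for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 1, 0, 0, 0, 0, 1],
    "model_b": [1, 1, 0, 0, 0, 0, 1, 1],
}

results = {name: binary_metrics(y_true, y_pred) for name, y_pred in predictions.items()}
best_model = max(results, key=lambda name: results[name]["f1"])
for name, scores in results.items():
    print(name, scores)
print("best by F1:", best_model)
```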
3 METHODOLOGIES
To better understand the factors that affect
wine quality, this study first conducts an
exploratory analysis of the data and then
preprocesses it. The resulting dataset is
used to build and train four
models: KNN, RF, SVM, and ANN.
Finally, the results are compared and
discussed in depth. The workflow is shown in
Figure 1.
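The workflow above can be sketched in a few lines of Python. The example is a simplified stand-in, not this study's implementation: it uses a hand-written k-nearest-neighbours classifier as the only model, min-max scaling as the preprocessing step, and synthetic two-feature data in place of the wine dataset. All function names, the 75/25 split, and k = 3 are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def min_max_fit(rows):
    """Learn per-feature minimum and maximum from the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def min_max_apply(rows, mins, maxs):
    """Scale rows with previously learned bounds; constant features map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Synthetic data: two well-separated clusters standing in for two quality classes.
rng = random.Random(0)
low = [[rng.gauss(9.0, 0.3), rng.gauss(3.5, 0.05)] for _ in range(30)]
high = [[rng.gauss(12.0, 0.3), rng.gauss(3.2, 0.05)] for _ in range(30)]
data = [(x, 0) for x in low] + [(x, 1) for x in high]
rng.shuffle(data)

# Hold out 25% of the rows for testing.
split = int(0.75 * len(data))
train, test = data[:split], data[split:]
train_x, train_y = [x for x, _ in train], [y for _, y in train]
test_x, test_y = [x for x, _ in test], [y for _, y in test]

# Fit the scaler on the training data only, then apply it to both splits.
mins, maxs = min_max_fit(train_x)
train_x = min_max_apply(train_x, mins, maxs)
test_x = min_max_apply(test_x, mins, maxs)

predicted = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
print(f"test accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step, so the reported accuracy is not inflated by leakage. The other models would slot into the same loop in place of `knn_predict`.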