Research on Optimization of Random Forest Algorithm Based on
Feature Engineering
Lei Yang, Yong Fan and Yungui Chen
Guangdong University of Science and Technology, Dongguan, China
Keywords: Random Forest, Feature Engineering, Machine Learning.
Abstract: Random forest is an ensemble learning method that builds a strong classifier by combining multiple weak
classifiers. Feature engineering refers to the process of improving model performance by selecting and
manipulating features. This paper selects three A-share stocks listed on the Shanghai Stock Exchange as the
research objects, proposes a prediction model based on a random forest algorithm optimized through feature
engineering, and uses this model to predict the closing prices of the stocks. Our results show that the
optimized model can improve prediction performance.
1 INTRODUCTION
In machine learning, feature engineering refers to the
process of improving model performance by selecting
and manipulating features. Traditional feature
engineering methods include feature selection,
feature extraction, and feature transformation 0.
Feature selection refers to the selection of features
that have a significant impact on model performance
through statistical methods or machine learning
algorithms. Feature extraction refers to deriving
valuable information from features through clustering,
embedding, transformation and other methods 0.
Feature transformation refers to converting features
into a form that is more suitable for model learning.
This paper uses the filter method for feature
selection: it analyzes the features of the data set
with the help of Spearman correlation, extracts the
features with higher correlation into a separate data
set 0, and then uses the same random forest algorithm
parameters on the two data sets for learning and
prediction 0. Spearman
correlation is a nonparametric method for measuring
the correlation between two variables. It measures the
monotonic relationship between variables, i.e.
whether they follow the same trend. Unlike Pearson
correlation, Spearman correlation does not require the
relationship between variables to be linear, so
Spearman correlation is more suitable when there is a
nonlinear relationship between variables 0.
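As a minimal sketch of this filter-style selection, the Python below computes Spearman correlation as the Pearson correlation of ranks and keeps the features whose absolute coefficient against a reference variable reaches a cutoff. The column names, the toy values, the choice of the closing price as the reference variable, and the 0.8 cutoff are all illustrative assumptions, not values taken from this paper. The last lines contrast Spearman and Pearson correlation on a monotonic but nonlinear relationship.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def select_features(columns, target, threshold):
    """Keep the columns whose |Spearman rho| against the target >= threshold."""
    scores = {name: spearman(vals, target) for name, vals in columns.items()}
    kept = [name for name, rho in scores.items() if abs(rho) >= threshold]
    return kept, scores


# Hypothetical toy data: eight trading days of made-up values.
close = [10.0, 10.4, 10.1, 10.9, 11.3, 11.0, 11.8, 12.1]
features = {
    "open": [9.9, 10.2, 10.3, 10.6, 11.1, 11.2, 11.5, 12.0],
    "high": [10.2, 10.5, 10.4, 11.0, 11.5, 11.3, 11.9, 12.3],
    "volume": [5, 3, 8, 1, 7, 2, 6, 4],
}
kept, scores = select_features(features, close, threshold=0.8)  # 0.8 is an assumed cutoff
print(kept, scores)

# Monotonic but nonlinear relationship: y = x^3.
x = [1, 2, 3, 4, 5, 6, 7, 8]
cubed = [v ** 3 for v in x]
rho_cubic = spearman(x, cubed)  # ranks agree exactly
r_cubic = pearson(x, cubed)     # penalised by the curvature
print(rho_cubic, r_cubic)
```

In this toy data the "volume" column is only weakly related to the closing price and is filtered out, while the cubic example shows Spearman correlation staying at its maximum where Pearson correlation does not.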
The value of the Spearman correlation coefficient lies
between -1 and 1. A value of 0 means that there is no
monotonic relationship between the two variables; -1
means that there is a perfectly decreasing monotonic
relationship, that is, when one variable increases,
the other always decreases; and 1 means that there is
a perfectly increasing monotonic relationship, that
is, when one variable increases, the other variable
also increases 0.
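For reference, when no two observations share the same value (no ties), the coefficient can be written in terms of rank differences, which makes the two endpoints of the range easy to check. The symbols $\rho$, $d_i$ and $n$ are notation introduced here for illustration and do not appear in the original text.

```latex
\[
  \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)},
  \qquad d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)
\]
% Identical rankings: d_i = 0 for every i, so \rho = 1.
% Fully reversed rankings: \sum_i d_i^2 = n(n^2-1)/3, so \rho = 1 - 2 = -1.
```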
Random forest is an ensemble learning method
that builds a strong classifier by combining multiple
weak classifiers. It is a probabilistic prediction model
whose basic principle is as follows: each weak
classifier is built from a large number of training
samples and a random selection of features, each weak
classifier classifies the samples, and the prediction
results of all weak classifiers are then aggregated to
obtain the final
prediction result 0. Random forest has been widely
used in many fields because of its good generalization
ability, robustness and interpretability.
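The principle described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in this paper: it builds small regression trees (averaging their outputs, since the target in this paper is a closing price; for classification the average would be replaced by a majority vote), draws a bootstrap sample per tree, and considers a random subset of features at each split. The toy data, tree depth, number of trees, and all function names are hypothetical.

```python
import random
from statistics import mean


def sse(ys):
    """Sum of squared deviations from the mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(rows, targets, feature_ids):
    """Return (error, feature, threshold) with the lowest total SSE, or None."""
    best = None
    for f in feature_ids:
        # Skip the smallest value so that both sides of the split are non-empty.
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, targets) if r[f] < t]
            right = [y for r, y in zip(rows, targets) if r[f] >= t]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, t)
    return best


def build_tree(rows, targets, depth, n_sub, rng):
    """Grow a tree; each split only sees a random subset of n_sub features."""
    feature_ids = rng.sample(range(len(rows[0])), n_sub)
    split = best_split(rows, targets, feature_ids) if depth > 0 else None
    if split is None:
        return mean(targets)  # leaf: predict the mean target
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] >= t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth - 1, n_sub, rng),
            build_tree([r for r, _ in right], [y for _, y in right], depth - 1, n_sub, rng))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] < t else right
    return node


def fit_forest(rows, targets, n_trees=25, depth=3, n_sub=2, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], depth, n_sub, rng))
    return forest


def predict_forest(forest, row):
    """Aggregate the trees by averaging their predictions."""
    return mean(predict_tree(tree, row) for tree in forest)


# Hypothetical toy data: the target depends only on the first feature.
rows = [[i, (i * 7) % 5, (i * 3) % 4] for i in range(20)]
targets = [2.0 * i for i in range(20)]
forest = fit_forest(rows, targets)
low = predict_forest(forest, [2, 0, 0])
high = predict_forest(forest, [17, 0, 0])
print(low, high)
```

Fixing the seed makes the bootstrap samples and feature subsets reproducible, which matters when the same parameters are meant to be reused across two data sets for comparison.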
This paper proposes a random forest algorithm
optimization method based on feature engineering.
This method improves on the random forest algorithm
by combining techniques such as feature selection,
feature extraction, and feature transformation.
Specifically, this method first selects features that
have a significant impact on model performance
through feature selection methods, then extracts
valuable information about features through feature
extraction methods, and finally converts features into