2 DATASET AND METHOD
The data used in this research come from China's Shanghai Composite Index in the first half of 2022. The historical transaction data of four Shanghai Stock Exchange A-shares were selected as the research object: Guangshen Railway, Hainan Airport, Jihua Group, and Shandong Hi-Speed, with stock codes 601333, 600515, 601718, and 600350, respectively. The common feature of these four stocks is that their prices are relatively low, no higher than 10 yuan.
This research uses the decision tree machine learning algorithm to predict the stock price trend by analyzing and processing the opening date, opening price, daily high, daily low, closing price, and trading volume of these four stocks. The data set, 468 records in total, is divided into a training set and a test set at a ratio of 85% to 15%. For a single stock, the 117 samples are split into a training set of 100 samples and a test set of 17 samples.
Divide the input space into M regions R_1, R_2, ..., R_M and generate the decision tree:

f̂(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m)    (1)
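Equation (1) says a regression tree predicts with a piecewise-constant function: each input falls into exactly one region R_m and receives that region's constant c_m. A minimal sketch, with hypothetical one-dimensional regions and constants not taken from the paper:

```python
def tree_predict(x, regions, constants):
    """Evaluate Eq. (1): regions is a list of (low, high) intervals
    partitioning the input space; constants holds the value c_m for
    each region R_m."""
    for (low, high), c_m in zip(regions, constants):
        if low <= x < high:  # indicator I(x in R_m) is 1 for exactly one m
            return c_m
    raise ValueError("x falls outside the partitioned input space")

# Three hypothetical regions; each c_m would be fitted as the mean
# target value of the training samples in that region.
regions = [(0.0, 3.0), (3.0, 7.0), (7.0, 10.0)]
constants = [5.2, 6.8, 9.1]
print(tree_predict(4.5, regions, constants))  # -> 6.8
```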
The loss function of a subtree generated with any
internal node t of the tree as the root node is:
C_a(T_t) = C(T_t) + a|T_t|    (2)
T represents any subtree, C(T) represents the prediction error of the subtree on the training data, |T| represents the number of leaf nodes of the subtree, a is the pruning parameter (a ≥ 0), and C_a(T) is the overall loss of the subtree T under the specified parameter a. The loss function of replacing the subtree with node t to obtain a single-node tree is:
C_a(t) = C(t) + a    (3)
When a = 0 and when a is sufficiently small, the following relationship holds:

C_a(T_t) < C_a(t)    (4)
As a gradually increases, the following relationship holds once a reaches a certain value:

C_a(T_t) = C_a(t)    (5)
When a continues to increase, inequality (4) is reversed. When equation (5) holds:

a = (C(t) − C(T_t)) / (|T_t| − 1)    (6)
At this time, the loss function is unchanged after the subtree is cut, but the overall tree has fewer nodes. From bottom to top, compute the a value of each internal node of the tree according to formula (6), select the node with the minimum a value, and prune the subtree rooted at that node to complete the current round of pruning.
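One round of the bottom-up pruning step above can be sketched as follows: compute the critical value from Eq. (6) for each internal node and pick the smallest. The node statistics here are hypothetical, not from the paper:

```python
def critical_alpha(c_single, c_subtree, n_leaves):
    """Eq. (6): the value of a at which collapsing subtree T_t to the
    single node t leaves the loss function C_a unchanged."""
    return (c_single - c_subtree) / (n_leaves - 1)

# Hypothetical (C(t), C(T_t), |T_t|) triples for three internal nodes.
nodes = {
    "t1": (0.40, 0.25, 4),
    "t2": (0.30, 0.28, 3),
    "t3": (0.50, 0.20, 6),
}
alphas = {name: critical_alpha(*stats) for name, stats in nodes.items()}
weakest = min(alphas, key=alphas.get)  # smallest a: prune this subtree first
```

Repeating this step until only the root remains yields a nested sequence of subtrees, from which the best one is chosen by validation.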
We set the maximum depth of the decision tree to 100 layers, use binary splits, and set the minimum number of samples on a leaf node to 2. A node with fewer than 5 samples is not split further. A regression model is used for prediction, and the analysis results are visualized after the results are output. Meanwhile, we also conduct performance analysis and experiments on our predictive model using different evaluation metrics.
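The configuration above can be sketched with scikit-learn's DecisionTreeRegressor (an assumption; the paper does not name a library). The features below are random placeholders standing in for the open/high/low/close/volume columns, since the actual stock data are not reproduced here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((117, 5))  # placeholder for one stock's 117 daily records
y = rng.random(117)       # placeholder target prices

model = DecisionTreeRegressor(
    max_depth=100,        # maximum tree depth used in the paper
    min_samples_leaf=2,   # minimum samples on a leaf node
    min_samples_split=5,  # nodes with fewer than 5 samples are not split
)
model.fit(X[:100], y[:100])           # 100 training samples per stock
predictions = model.predict(X[100:])  # 17 test samples per stock
```

CART trees are binary by construction, so the binary-split setting needs no extra parameter here.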
3 EXPERIMENT AND ANALYSIS
Decision trees work by dividing a data set into
subsets, each of which has a decision node, and each
decision node has a series of features and values that
are used to calculate the next decision node. The
structure of the decision tree is composed of decision
nodes and leaf nodes, where the decision node
represents the decision to be made at the node, and
the leaf node represents the result of the decision. The
goal of a decision tree is to find the best path from the
root node to the leaf nodes to get the best result.
The advantage of a decision tree is that it can
analyze a problem from multiple perspectives and can
calculate the results quickly. However, decision trees
also have some disadvantages, such as overfitting,
which can lead to inaccurate predictions. Therefore,
when using decision trees for prediction, the data
should be properly processed and adjusted to avoid
overfitting. Figure 1 shows the decision tree model for stock price prediction of Shandong Hi-Speed; the decision tree models for the other three stocks are similar.
This research divides the historical transaction
data of 4 stocks into training set and test set. There are
468 pieces of data in total. For a single stock, the total
amount of data is 117, and the proportions of training
set and test set are 85% and 15%, respectively. There
are 100 samples in the training set and 17 samples in
the test set. The prediction results are shown in Table
1.
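The per-stock split described above can be sketched as a chronological cut, keeping the test period after the training period (an assumption consistent with time-series data; the paper does not state whether the split was shuffled):

```python
n_total = 117
n_train = 100               # ~85% of 117, as in the paper
n_test = n_total - n_train  # 17 samples

records = list(range(n_total))  # placeholder for one stock's daily rows
train, test = records[:n_train], records[n_train:]
```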
Research on Decision Tree in Price Prediction of Low Priced Stocks