stability of model training and the accuracy of
prediction (Lv et al. 2015). LSTM is a variant of
the Recurrent Neural Network (RNN) that is more
effective than conventional RNN
models. The LSTM model incorporates forget gates,
input gates, and output gates to effectively capture the
long-term dependencies in time series data, resulting
in enhanced prediction accuracy. Furthermore, due to
the LSTM model's superior ability to handle high-
dimensional data and extract valuable information,
numerous researchers employ deep learning
techniques to analyse high-dimensional
spatiotemporal data. Shi et al. conducted a
comparative analysis of the efficacy of random
forests, Back Propagation (BP) neural networks, and
LSTM models in predicting rail transit flow. They
concluded that the LSTM model exhibited superior
fitting results and demonstrated more comprehensive
predictive capabilities and higher average prediction
accuracy compared to the other models (Shi et al.
2020).
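The gate mechanism described above can be sketched as a single LSTM time step in plain NumPy. This is a minimal illustration of how the input, forget, and output gates update the cell state; the dimensions, random weights, and the `lstm_step` helper are assumptions for illustration, not the configuration used in this paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the four gates (i, f, o, g)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # pre-activations for all gates, shape (4n,)
    i = sigmoid(z[0:n])               # input gate: how much new info to write
    f = sigmoid(z[n:2*n])             # forget gate: how much old state to keep
    o = sigmoid(z[2*n:3*n])           # output gate: how much state to expose
    g = np.tanh(z[3*n:4*n])           # candidate cell state
    c = f * c_prev + i * g            # cell state carries long-term dependencies
    h = o * np.tanh(c)                # hidden state passed to the next step
    return h, c

# Toy dimensions and weights, purely for demonstration.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in)) * 0.1
U = rng.standard_normal((4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                    # unroll over a short input sequence
    x_t = rng.standard_normal(n_in)
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the forget gate multiplies the previous cell state rather than repeatedly squashing it through an activation, gradients can flow across many time steps, which is why the cell state is the carrier of long-term dependencies.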
This paper mainly applies the LSTM model to
predict urban subway passenger flow, reaffirming the
model's accuracy and irreplaceability in the field of
short-term passenger flow prediction, and providing
subway operation managers with more accurate
passenger flow information as a scientific basis for
decision-making.
2 METHOD AND DATA
2.1 Data Source and Preprocessing
The dataset employed in this study consists of
passenger flow data from the Seoul Metropolitan
Subway system, spanning from 2015 to 2018. This
dataset is publicly accessible on the Kaggle website.
It records the number of passengers entering and
exiting each of the 275 stations of the Seoul subway
from 5 AM to midnight daily, with a time resolution
of one hour. The format of the original data is
presented as shown in Table 1.
Data preprocessing is a crucial initial step in data
analysis, aiming to ensure the quality and consistency
of the data for effective analysis and modeling. The
Jupyter Notebook development tool and the Python
programming language were used in this paper to
preprocess the data. The pandas library was used for
data cleaning, while the glob library was used to
locate data file directories and paths. The dataset's
accuracy and integrity were verified by removing
duplicate entries, confirming data types, checking for
missing values, and performing statistical analysis.
No missing or duplicate records were discovered, and
the passenger counts contained no outliers for any
period.
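The pipeline described above might look roughly like the following sketch. The file layout, file names, and sample rows are hypothetical stand-ins for the Kaggle files; the column names follow Table 1:

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the downloaded yearly CSV files.
tmp = tempfile.mkdtemp()
sample = pd.DataFrame({
    "USE_DT": ["2018/11/1 0:00", "2018/3/26 0:00", "2018/3/26 0:00"],
    "station_code": [2530, 309, 309],
    "station_name": ["Gongdeok", "Jichuk", "Jichuk"],
    "division": ["in", "out", "out"],
    "05~06": [74, 19, 19],
    "23~24": [342, 2, 2],
})
sample.to_csv(os.path.join(tmp, "2018.csv"), index=False)

# glob collects every data file path; pandas concatenates them.
paths = sorted(glob.glob(os.path.join(tmp, "*.csv")))
df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

df = df.drop_duplicates()               # remove duplicate entries
n_missing = int(df.isna().sum().sum())  # check for missing values
stats = df.describe()                   # statistical sanity check
```

On the real dataset the same three calls (`drop_duplicates`, `isna`, `describe`) suffice to verify the integrity properties reported above.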
Because the scales of different features differ
significantly from one another, features with small
values are easily overlooked during training. Thus, for
analysis to be effective, all data must be normalized:
x_norm = (x - x_min) / (x_max - x_min)    (1)
where x_norm is the normalized value, x is the
sample value, x_min is the minimum value, and
x_max is the maximum value. By unifying the data
scale, normalization enhances the learning efficiency
of the algorithm and reduces the risk of vanishing or
exploding gradients, thereby improving the
performance and stability of short-term prediction
models.
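Eq. (1) can be applied column-wise in a few lines of NumPy. The `min_max_normalize` helper and the sample flow values below are illustrative, not taken from the dataset:

```python
import numpy as np

def min_max_normalize(x):
    """Eq. (1): scale each feature column to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)   # per-column minimum
    x_max = x.max(axis=0)   # per-column maximum
    return (x - x_min) / (x_max - x_min)

# Toy passenger-flow matrix: rows are time slots, columns are features.
flows = np.array([[74.0, 342.0],
                  [19.0,   2.0],
                  [50.0, 100.0]])
norm = min_max_normalize(flows)
```

Note that the per-column minima and maxima computed on the training data should be reused when normalizing test data, so that the prediction model sees both splits on the same scale.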
2.2 Cluster Analysis
By evaluating the correlation between the silhouette
coefficient of the dataset and the number of clusters,
it can be deduced that the ideal number of clusters is
either 2 or 4, as depicted in Fig. 1. From a research
perspective, two clusters often lack significant
research value; therefore, four clusters were selected
for subsequent clustering.
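The silhouette-based choice of cluster count can be sketched with scikit-learn as follows. The synthetic data below is a stand-in for the real station-level features, so the exact scores differ from Fig. 1; the idea is only that the k maximizing the silhouette coefficient is kept:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: four well-separated 3-D blobs of 40 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 3))
               for c in (0.0, 2.0, 4.0, 6.0)])

# Evaluate the silhouette coefficient for each candidate cluster count.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # k with the highest silhouette score
```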
The 3D visualization following clustering clearly
indicates that the clustering effect on the original data
is not pronounced. Therefore, the data was subjected
to PCA (Principal Component Analysis) for
dimensionality reduction before clustering, as shown
in Fig. 2.
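The PCA-then-cluster step might be sketched as follows. The number of hourly flow columns (19) and the synthetic data are assumptions for illustration; three principal components are retained to match the 3D visualization:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic high-dimensional stand-in for the hourly flow columns (05~06 ... 23~24).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 19))
               for c in (0.0, 3.0, 6.0, 9.0)])

# Reduce to 3 components first, then cluster in the reduced space.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
explained = pca.explained_variance_ratio_.sum()   # variance kept after reduction
```

Clustering in the reduced space removes noisy, redundant directions from the hourly columns, which is why the cluster structure becomes more pronounced than on the raw data.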
Table 1: Format of the original data.
USE_DT           station_code  station_name  division  05~06  ……  23~24
2018/11/1 0:00   2530          Gongdeok      in        74     ……  342
2018/3/26 0:00   309           Jichuk        Out       19     ……  2