clustering algorithms include the K-means algorithm,
the K-medoids algorithm, the Canopy algorithm, etc.
The K-means algorithm was proposed by MacQueen in
1967. Its advantages include low time complexity,
good scalability, and suitability for distributed
computing, so it has been widely applied in different
fields. This paper adopts the K-means algorithm.
The core idea of the k-means algorithm is as
follows. For a given data set containing n data
objects, the algorithm first randomly selects k data
objects as the initial cluster centers. It then
assigns every data object in the data set to the
cluster whose center is most similar to it under the
given similarity measure. Next, the center of each
cluster is updated to the mean of the data objects in
that cluster, and the data objects are reassigned.
This process is iterated until the cluster
assignments no longer change or another given
termination condition is met (Gao, 2020).
Input: a data set containing n data objects,
G = {X1, X2, X3, ..., Xn}; the number of clusters k.
Output: k independent clusters,
C = {C1, C2, ..., Ck} (Cheng, 2021).
Steps of the k-means clustering algorithm:
1) Randomly select k data objects from data set G
as the initial cluster centers;
2) Calculate the similarity measure between each
data object in data set G and the k cluster centers,
and assign each data object to the cluster
represented by the most similar center;
3) For each cluster, compute the mean of the data
objects assigned to it and take this mean as the new
center of the cluster;
4) Repeat steps 2 and 3 until the cluster centers
no longer change.
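The four steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in this study: it assumes Euclidean distance as the similarity measure (the text does not fix one), and the function name kmeans, the parameters max_iter and seed, and the toy points are all hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data objects as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest center
        # (Euclidean distance is an assumption of this sketch).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
          (100.0, 100.0), (101.0, 101.0), (102.0, 102.0)]
centers, labels = kmeans(points, k=2)
```

The max_iter cap plays the role of the "other given termination condition" mentioned earlier; it only matters if the centers have not stabilized within that many passes.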
3.3 Data Analysis Process Design
This study takes the learning behavior data of
students on the teaching platform as the research
object, applies a clustering algorithm to analyze the
online learning behavior data, and establishes a
prediction model to classify students and issue
accurate early warnings. The process is as follows:
1) Online learning behavior analysis and
feature selection. First, the online learning
behavior data of students are collected through the
learning platform. Four salient indicators are used:
the number of chapters studied, the number of
completed check-ins, the total number of live
sessions viewed, and the number of homework
assignments completed. When extracting the learning
behavior data, student ID numbers and other
identifying data are retained in addition to these
four indicators, to facilitate the classification and
early warning of students at a later stage (Zhou,
2020).
2) Data cleaning. Because the epidemic has become
a long-term situation, students often have to study
online at home or in dormitories where the network
connection is unreliable. The learning behavior data
were therefore cleaned: records of students whose
login count is zero were removed. After extraction
and cleaning, a new data set was obtained. The
records of 12 students were removed, and the learning
behavior records of 402 students were kept.
3) Data standardization. Because the extracted
learning behavior data differ in order of magnitude,
they are standardized in this paper using standard
deviation standardization (Z-score). The core Python
code for the standardization is
data = (data - data.mean(axis=0)) / data.std(axis=0).
Here data represents the data object,
data.mean(axis=0) the mean of each column, and
data.std(axis=0) the standard deviation of each
column [7].
4) Model training and cluster analysis with the
K-means algorithm. The K-means clustering algorithm
was used to conduct cluster analysis on the four
learning behavior indicators of the 402 students
remaining after data cleaning.
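Steps 2 and 3 above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the toy records, the column order, and the variable names raw, kept, features and standardized are hypothetical, and the study's actual data are not reproduced here.

```python
import numpy as np

# Hypothetical toy records, one row per student.
# Columns: login count, chapters studied, check-ins completed,
#          live sessions viewed, homework assignments completed.
raw = np.array([
    [12, 30, 10, 8, 9],
    [0, 0, 0, 0, 0],      # never logged in: removed during cleaning
    [5, 12, 4, 2, 3],
    [20, 45, 15, 12, 10],
], dtype=float)

# Step 2: drop students whose login count is zero.
kept = raw[raw[:, 0] > 0]

# Keep only the four behavior indicators used for clustering.
features = kept[:, 1:]

# Step 3: Z-score standardization, applied column by column (axis=0).
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
```

Two details are worth noting when applying the same expression elsewhere. On a NumPy array, std defaults to the population standard deviation (ddof=0), whereas on a pandas DataFrame it defaults to the sample standard deviation (ddof=1), so the two give slightly different scaled values. A column with zero variance would also cause a division by zero and needs separate handling.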
4 DATA ANALYSIS RESULTS
The k-means model provided by Scikit-learn, a
third-party Python library, is used for unsupervised
training on the standardized learning behavior data.
In this paper, training and clustering were run
several times, and the online learning students were
finally divided into four types, which completed the
classification of online learning students (Yang,
2021).
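A clustering run of this kind might look like the following sketch with Scikit-learn's KMeans class. The input here is synthetic random data with the same shape as the study's data set (402 students, four indicators), not the real records, so the resulting clusters carry no meaning. The n_init and random_state values are illustrative assumptions and are not taken from this paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the standardized data: 402 students x 4 indicators.
rng = np.random.default_rng(0)
data = rng.normal(size=(402, 4))
data = (data - data.mean(axis=0)) / data.std(axis=0)

# Four clusters, matching the four student types reported above.
# n_init and random_state are illustrative choices, not values from the paper.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(data)

centers = model.cluster_centers_            # one center per cluster, shape (4, 4)
sizes = np.bincount(labels, minlength=4)    # number of students in each cluster
```

Setting n_init above 1 makes the library repeat the random initialization and keep the best run, which is one way to carry out the repeated training described in the text.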