Punch Type Classification and Hit Judgement
Using Estimated Skeletal Model in Boxing Match Videos
Soma Watanabe¹ and Yoshinari Kameda²
¹College of Engineering Systems, University of Tsukuba, Japan
²Center for Computational Sciences, University of Tsukuba, Japan (https://orcid.org/0000-0001-6776-1267)
Keywords: Skeletal Description, Action Classification, Action Recognition, Boxing, Sports Video.
Abstract: In boxing, the choice of punch type and the way a punch is landed on the opponent are important issues, so computer-vision support for punch type classification and hit judgement on boxing match videos is in demand. There are currently two challenges to this purpose. The first is the preparation of an appropriate video dataset of boxing matches. The second is the design of a suitable method for punch type classification and punch hit judgement. We propose to create a video dataset of boxing matches from a 3DCG boxing simulation. The simulation can automatically annotate attributes in the dataset, which is useful for hit / no-hit judgement because it is not easy to identify which punches actually land in real boxing match videos. Based on the dataset we prepared, we propose a new method that uses a time-series skeletal representation to classify the type of punch and to judge the hit. The experimental results show that the proposed method is able to classify punch types and judge hits.
1 INTRODUCTION
Boxing is a combat sport in which victory or defeat is determined by two fighters exchanging punches. In boxing, the key to victory is to use different types of punches to hit the opponent. Automating the classification of punch types and the judgement of hits from boxing match videos would contribute to the development of tactical analysis in boxing.
There are two challenges in analyzing punches using computer vision. The first is to prepare a video dataset of boxing matches with correct annotations. Due to the nature of boxing, the number of fights per fighter is not large. In addition, in order to use match videos as training data, it is necessary to assign not only the punch type as a supervision signal but also an attribute indicating whether the punch was a hit or not. It is not easy to assign such attributes by visual inspection. To the best of our knowledge, no boxing match video dataset that describes these attributes is publicly available. Second, partly because of the first issue, no method has been proposed that automatically classifies the punch type and judges the hit from boxing match videos.
In this study, we propose a method to automatically recognize punches based on a boxing match video dataset created from a CG simulation of a boxing match. The automatic punch recognition performs both type classification and hit judgement.
To generate punch videos in the boxing CG simulation, a motion clip for each punch type is used. By performing the punch hit judgement inside the simulation, the true value of the hit attribute is automatically assigned to each punch in the dataset.
We propose an automatic recognition method that uses a time-series skeletal representation to classify the punch type and to judge the hit. The framework is the same for both classification and judgement: a learning model suited to recognizing the time-series skeletal representations of multiple human bodies, including occlusions, is used for both tasks. We formulate punch type classification as a multi-class classification problem and hit judgement as a two-class classification problem.
The contribution of this research is twofold. The first is the creation of a boxing match video dataset using CG simulation. We created a video dataset suitable for skeletal estimation by using a 3DCG simulation of a boxing match scene in which one boxer is
punching. The created dataset is used for both training and validation. Second, we show that the time-series skeletal representation of the two fighters can be used to classify the punch type and judge the hit.
The general flow of the proposed method is as follows. First, a 3DCG animation is created. Next, each punch motion is rendered from various camera angles to create punch videos, and a punch type label and a hit label are assigned to each punch video. In the training phase, we first apply skeletal estimation to the punch videos to obtain a time-series skeletal representation for each video. Then, using this set of time-series skeletal representations, we train a model that classifies the punch type, and similarly train a model that judges the punch hit. ST-GCN++ (Duan et al., 2022) is used as the machine learning model.
2 RELATED WORKS
Cizmic et al. proposed a punch classification method using wearable sensors (Cizmic et al., 2023). In the field of image recognition, punch classification methods for shadow boxing have been proposed (Kasiri et al., 2017; Kasiri et al., 2015), and Broilvskiy et al. proposed an action recognition method on a video dataset of a single fighter shadow boxing (Broilvskiy et al., 2021). These studies address only punch type determination and do not discuss punch hit determination.
So far, recognition methods using skeletal representations have been presented for single-person actions in daily activities (Xu et al., 2023; Lee et al., 2023). Similarly, action recognition methods have been presented for the daily actions of multiple persons (Li et al., 2020; Tu et al., 2021; Chen et al., 2023). In these studies, the interaction between persons is cited as a difficulty in multi-person action classification tasks, and methods that are highly robust against such interactions have been reported.
Action classification methods using skeletal estimation have also been reported for individual sports such as figure skating (Li et al., 2021), taekwondo (Luo et al., 2023), and karate (Guo et al., 2021). These studies show that action classification methods using skeletal representations can provide superior recognition performance.
CG has been used to build datasets for supervised learning in the past. Wood et al. trained face analysis tasks, such as landmark localization and face parsing, on CG-generated face images alone, and showed that training on synthetic face data alone can achieve accuracy comparable to training on real face images (Wood et al., 2023).
3 BOXING MATCH VIDEOS
In this study, a boxing match video refers to video footage of two fighters punching each other with the goal of landing punches. The video may be shot from various positions around the ring. The camera is assumed to be stationary while filming from each location, so that the generality of the shooting conditions is not lost.
The punch type classification task in this study focuses on four types of punches used in boxing: hooks, jabs, straights, and uppercuts. The punch hit judgement task considers two outcomes: hit and miss.
A single punch video is defined as a video of one punch thrown by one of the two players in the match video. For each punch video, we judge which of the four punch types it belongs to and whether it is a "hit" or a "miss". In this study, we do not classify which player threw the punch.
A CG simulation is used to generate a large number of punch videos. In each punch video, one of the two players performs the punching motion and the other performs the motion of receiving a punch. Two labels are assigned to each punch video: the punch type and the punch hit. The punch type label is determined from the motion clip used, and the hit label is set automatically based on the contents of the CG simulation.
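As an illustration of the resulting annotation, each generated clip can be described by a small record holding the two labels. The following Python sketch is ours, not part of the simulation; all field names and values are hypothetical.

from dataclasses import dataclass
from enum import Enum

class PunchType(Enum):
    HOOK = 0
    JAB = 1
    STRAIGHT = 2
    UPPERCUT = 3

@dataclass
class PunchClip:
    # One single-punch video rendered by the CG simulation (30 fps).
    video_path: str        # rendered clip
    camera_id: int         # which of the 96 shooting points was used
    punch_type: PunchType  # label derived from the motion clip
    hit: bool              # label set automatically by the collision check

# Example annotation for one generated clip (values are illustrative).
clip = PunchClip(video_path="clips/000123.mp4", camera_id=42,
                 punch_type=PunchType.JAB, hit=True)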
4 DETERMINATION OF PUNCH
TYPE AND HIT DETECTION
4.1 Video Processing
Punch videos are acquired from the CG simulation at 30 fps. For each frame of a punch video, we estimate the skeletons of the two boxers using the 25-keypoint model of OpenPose (Cao et al., 2017). A time-series skeletal representation is then obtained for the 17 keypoints (Figure 1) needed for boxing action recognition. This time-series skeletal representation is used as the input to the learning network models that judge the punch type and the hit. Figure 2 shows the processing steps. ST-GCN++ (Duan et al., 2022) is used for both learning network models.
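A minimal sketch of this preprocessing step is given below. It assumes that OpenPose returns (x, y, confidence) for 25 keypoints per boxer per frame; the particular 17-keypoint subset in KEEP is only an assumption for illustration, as the exact subset is the one shown in Figure 1.

import numpy as np

# Assumed subset of BODY_25 indices kept for recognition (17 keypoints).
KEEP = [0, 15, 16, 17, 18, 2, 5, 3, 6, 4, 7, 9, 12, 10, 13, 11, 14]

def to_time_series(openpose_frames):
    # openpose_frames: list of arrays of shape (2, 25, 3) holding
    # (x, y, confidence) for both boxers in every frame of one clip.
    seq = np.stack(openpose_frames, axis=1)  # (2, T, 25, 3)
    return seq[:, :, KEEP, :]                # (2, T, 17, 3)

# Usage with dummy data: a 40-frame clip of two boxers.
frames = [np.random.rand(2, 25, 3) for _ in range(40)]
skeleton_seq = to_time_series(frames)
print(skeleton_seq.shape)  # (2, 40, 17, 3)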
Figure 1: Estimated skeleton models of the two boxers. Each skeleton model is composed of 17 keypoints.
4.2 Model
The punch type classification task and the punch hit judgement task in this study use ST-GCN++ (Duan et al., 2022) as the learning network model, which is an effective model for skeleton-based action recognition tasks using graph convolution. ST-GCN++ is robust both to person-to-person interaction and to occlusion caused by person regions being hidden by objects or other persons. The model achieved an accuracy of 0.83 on the skeleton-based action recognition task of the NTU RGB+D 120 dataset (Liu et al., 2019), a video dataset of basic everyday human actions.
We use an ST-GCN++ model pretrained on the NTU RGB+D 120 skeletal time-series dataset (Liu et al., 2019) and fine-tune it with the dataset created in this study.
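The fine-tuning can be sketched as a standard supervised training loop over the skeletal sequences. The PyTorch sketch below is illustrative only: SkeletonBackbone merely stands in for the pretrained ST-GCN++ network, and the layer sizes and hyperparameters are assumptions, not the settings used in this paper.

import torch
import torch.nn as nn

class SkeletonBackbone(nn.Module):
    # Stand-in for the ST-GCN++ backbone loaded from the NTU RGB+D 120 checkpoint.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(start_dim=1),
                                 nn.LazyLinear(feat_dim), nn.ReLU())

    def forward(self, x):  # x: (batch, persons, frames, keypoints, channels)
        return self.net(x)

def build_classifier(num_classes):
    # New classification head: 4 classes for punch type, 2 for hit judgement.
    return nn.Sequential(SkeletonBackbone(), nn.Linear(256, num_classes))

def finetune(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for seq, label in loader:  # seq: (B, 2, T, 17, 3), clips padded to equal length
            opt.zero_grad()
            loss_fn(model(seq), label).backward()
            opt.step()
    return model

punch_type_model = build_classifier(num_classes=4)  # the hit model uses num_classes=2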
5 VIDEO DATASET
GENERATION
5.1 Camera Layout
The camera is placed at one of the points on a sphere centered on the ring. Camera positions are set every 15 degrees horizontally and every 15 degrees vertically, giving 96 shooting points in total. At every point, the camera is oriented so that it faces the center of the ring. Figure 3 shows a schematic of the camera arrangement.
To obtain a single punch video, one of the 96 locations is selected at random. To ensure the reproducibility of the experiment, the random number sequence is fixed.
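The camera layout can be reproduced with a few lines of code, as sketched below in Python. The sphere radius and the range of elevation angles are our assumptions; the paper only fixes the 15-degree spacing and the total of 96 points.

import math
import random

RADIUS = 5.0                    # assumed distance from the ring centre (metres)
AZIMUTHS = range(0, 360, 15)    # every 15 degrees horizontally -> 24 angles
ELEVATIONS = (15, 30, 45, 60)   # every 15 degrees vertically -> 4 angles (assumed range)

def camera_positions(radius=RADIUS):
    # 24 x 4 = 96 positions on a sphere centred on the ring; each camera is
    # later oriented to look at the ring centre (the origin).
    points = []
    for az in AZIMUTHS:
        for el in ELEVATIONS:
            a, e = math.radians(az), math.radians(el)
            points.append((radius * math.cos(e) * math.cos(a),
                           radius * math.sin(e),          # height above the ring
                           radius * math.cos(e) * math.sin(a)))
    return points

cams = camera_positions()
random.seed(0)                         # fixed seed for reproducibility
print(len(cams), random.choice(cams))  # 96 positions, one picked at random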
Figure 2: Process flow of the proposed punch type classification and hit judgement.
5.2 Punch Video Synthesis
Figure 4 shows the setup of the stadium model and the boxer models used in this study. The two players are assigned an offensive motion and a defensive motion, respectively. The offensive player randomly selects one of nine punching motions, and the defensive player randomly selects one of 12 defensive motions. All the motions were obtained from a
Figure 3: Layout of the 96 camera locations placed around the boxing ring.
Figure 4: A view from one camera location, showing the boxer CG models and the stadium model.
commercial motion dataset performed by experienced boxing athletes. The punch motions consist of three jabs and two each of hooks, uppercuts, and straights. The offensive/defensive role assignment, the punch motion selection, and the defensive motion selection are all made randomly, and the seed of the random number generator is fixed during the experiment.
Since there are 9 punching motions, 12 defensive motions, and 96 camera positions, there are 10,368 possible punch videos. Using a random number sequence, 6,000 punch videos are used for training and 900 for validation. Table 1 and Table 2 show the distributions of punch types and of hit / no-hit labels in the training dataset created in this study. The average number of frames indicates the length of each motion, as the videos are recorded at 30 fps. Figure 5 shows a part of the created punch videos, and Figure 6 shows the corresponding skeletal representations.
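The enumeration of all motion/camera combinations and the reproducible split can be written compactly, as sketched below; the seed value is an arbitrary assumption.

import itertools
import random

PUNCH_MOTIONS = 9     # 3 jabs, 2 hooks, 2 uppercuts, 2 straights
DEFENSE_MOTIONS = 12
CAMERAS = 96

combos = list(itertools.product(range(PUNCH_MOTIONS),
                                range(DEFENSE_MOTIONS),
                                range(CAMERAS)))
assert len(combos) == 10_368

random.seed(42)       # fixed seed so the selection is reproducible (value assumed)
random.shuffle(combos)
train, val = combos[:6000], combos[6000:6900]  # 6,000 training / 900 validation clips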
5.3 Hit Label Annotation
Colliders in our 3D simulation are used to assign the punch hit label. In this research, a sphere-shaped Sphere Collider is attached to each glove and a capsule-shaped Capsule Collider is attached to each player's body. The placement is shown in Figure 7. When the Sphere Collider on the hand of the offensive player
Figure 5: Four examples of generated punch motions.
Figure 6: Skeletal model description of the punch motions
of Figure 5.
Figure 7: Layout of the sphere colliders at the hands and the capsule colliders at the bodies. A punch hit is detected by the collision of these colliders.
Table 1: Distribution of punch motions by type.
                hook     jab      straight   uppercut
#data           1376     2046     1273       1305
ratio           0.229    0.341    0.212      0.218
avg. #frames    40.91    44.34    41.18      49.34
Table 2: Distribution of punch motions by hit / no-hit.
                hit      no hit
#data           2389     3611
ratio           0.398    0.602
avg. #frames    43.61    44.41
collides with the Capsule Collider on the body of the defender, the label "hit" is assigned. When no such collision occurs, the label "no hit" is assigned.
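Geometrically, this hit test reduces to a sphere-versus-capsule overlap check, which the engine's colliders resolve automatically. The Python sketch below illustrates the same test; all radii and positions are illustrative values, not those of the simulation.

import numpy as np

def point_segment_distance(p, a, b):
    # Shortest distance from point p to the segment a-b (3-D numpy vectors).
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def punch_hits(glove_center, glove_radius, body_a, body_b, body_radius):
    # Sphere (glove) vs. capsule (body) overlap test.
    return point_segment_distance(glove_center, body_a, body_b) <= glove_radius + body_radius

# Example: a 0.12 m glove sphere near a torso capsule of radius 0.18 m.
hit = punch_hits(np.array([0.2, 1.3, 0.0]), 0.12,
                 np.array([0.0, 0.9, 0.0]), np.array([0.0, 1.6, 0.0]), 0.18)
label = "hit" if hit else "no hit"   # -> "hit"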
6 EXPERIMENTS
For the punch type classification, 1,200 randomly generated punch videos were used for verification, yielding an accuracy of 0.94. Figure 8 shows the confusion matrix. The accuracy for each label exceeded 0.90, and a particularly high accuracy of 0.98 was recorded for straight punches.
For the punch hit judgement, 1,200 randomly generated punch videos were also used for verification, yielding an accuracy of 0.88. Figure 9 shows the confusion matrix. The accuracy for each label was 0.87 or higher.
Figure 8: Confusion matrix of punch type classification.
Figure 9: Confusion matrix of punch hit judgement.
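For reference, the accuracies and confusion matrices reported above can be computed as in the following sketch; it is ours and is not tied to any specific toolkit.

import numpy as np

def evaluate(y_true, y_pred, num_classes):
    # Overall accuracy and confusion matrix (rows: true label, columns: prediction).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = float((y_true == y_pred).mean())
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return acc, cm

# Toy example with the two-class hit / no-hit task.
acc, cm = evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 0], num_classes=2)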
7 CONCLUSIONS
In this paper, a boxing match video dataset was created from a boxing CG simulation, and an automatic punch recognition method was proposed based on this dataset. Automatic punch recognition is decomposed into two tasks: classifying the punch type and judging the hit. For each task, we trained ST-GCN++. Experimental results showed that the recognition rate for the punch type classification task was 0.94, and that for the punch hit judgement task was 0.88.
ACKNOWLEDGEMENTS
This work was partially supported by JSPS KAKENHI 22K19803, and it originated from our earlier work (Watanabe, 2024).
REFERENCES
Duan, H., Wang, J., Chen, K., & Lin, D. (2022). “Towards good practices for skeleton action recognition,” Proc. the 30th ACM International Conference on Multimedia, 7351-7354.
Cizmic, D., Hoelbling, D., Baranyi, R., Breiteneder, R., & Grechenig, T. (2023). “Smart boxing glove “RD α”: IMU combined with force sensor for highly accurate technique and target recognition using machine learning,” J. Applied Sciences, 13(16), 9073-9088.
Kasiri, S., Fookes, C., Sridharan, S., & Morgan, S. (2017).
“Fine-grained action recognition of boxing punches
from depth imagery,” J. Computer Vision and Image
Understanding, 159, 143–153.
Kasiri, S., Fookes, C., Sridharan, S., Morgan, S., & Martin, T. (2015). “Combat sports analytics: Boxing punch classification using overhead depth imagery,” Proc. IEEE International Conference on Image Processing (ICIP), 4545-4549.
Broilvskiy, A., & Makarov, I. (2021). “Human action recognition for boxing training simulator,” Proc. 9th Analysis of Images, Social Networks and Texts (AIST), LNCS 12602, 331–343.
Xu, H., Gao, Y., Hui, Z., Li, J., & Gao, X. (2023). “Language knowledge-assisted representation learning for skeleton-based action recognition,” https://arxiv.org/abs/2305.12398.
Lee, J., Lee, M., Lee, D., & Lee, S. (2023). “Hierarchically decomposed graph convolutional networks for skeleton-based action recognition,” Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 10410-10419.
Li, T., Sun, Z., & Chen, X. (2020). “Group-skeleton-based
human action recognition in complex events,” Proc. the
28th ACM International Conference on Multimedia,
4703-4707.
Tu, H., Xu, R., Chi, R., & Peng, Y. (2021). “Multiperson interactive activity recognition based on interaction relation model,” Hindawi Journal of Mathematics, 1-12.
Chen, Z., Wang, H., & Gui, J. (2023). “Occluded skeleton-based human action recognition with dual inhibition training,” Proc. the 31st ACM International Conference on Multimedia, 2625–2634.
Li, H., Lei, Q., Zhang, H., Du, J., & Gao, S. (2021). “Skeleton-based deep pose feature learning for action quality assessment on figure skating videos,” Proc. the 11th International Conference on Information Technology in Medicine and Education (ITME), 196-200.
Luo, C., Kim, S., Park, H., Lim, K., & Jung, H. (2023). “Viewpoint-agnostic taekwondo action recognition using synthesized two-dimensional skeletal datasets,” J. Sensors, 23(19), 8049-8063.
Guo, J., Liu, H., Li, X., Xu, D., & Zhang, Y. (2021). “An attention enhanced spatial–temporal graph convolutional LSTM network for action recognition in karate,” J. Applied Sciences, 11(18), 8641-8653.
Wood, E., Baltrusaitis, T., Hewitt, C., Dziadzio, S., Cashman, T. J., & Shotton, J. (2023). “Fake it till you make it: Face analysis in the wild using synthetic data alone,” Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 3681-3691.
Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). “Realtime multi-person 2D pose estimation using part affinity fields,” Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1302-1310.
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L., & Kot, A. (2019). “NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10), 2684-2701.
Watanabe, S., Kameda, Y. (2024). “Automatic punch
classification using skeletal estimation in boxing match
videos,” IEICE Tech. Rep., vol. 123, no. 433,
MVE2023-83, pp. 224-228.