punching. The created dataset is used for both training
and validation. Second, we show that the time-series
skeletal representation of two fighters can be used to
classify the type of punch and judge the hit.
The general flow of the proposed method is de-
scribed below. First, a 3DCG animation is created.
Next, punch motion is captured from various angles
to create punch videos. A punch type label and a hit
label are assigned to each punch video. In the training
phase, we first apply skeletal estimation to the punch
videos to obtain a skeletal time-series representation
for each punch video. Then, using this set of skeletal
time-series representations, we train a model that
classifies the type of punch. Similarly, we train a
model that judges the punch hit. ST-GCN++ (Duan et
al. 2022) is used as the machine learning model for both tasks.
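As a minimal sketch of this flow, the pseudocode below outlines the data-generation and training stages; every function name (render_punch_video, estimate_skeletons, fit_stgcnpp) is a hypothetical placeholder and not part of the proposed implementation.

# Illustrative sketch of the overall flow; all function names are
# hypothetical placeholders for the stages described above.

def build_dataset(punch_motions, camera_angles):
    """Render CG punch videos from motion data and attach the two labels."""
    dataset = []
    for motion in punch_motions:
        for angle in camera_angles:
            clip = render_punch_video(motion, angle)   # 3DCG rendering
            labels = {"type": motion.punch_type,       # hook / jab / straight / uppercut
                      "hit": motion.hit}                # hit or miss
            dataset.append((clip, labels))
    return dataset

def train_models(dataset):
    """Estimate skeletons, then train two ST-GCN++ classifiers."""
    skeleton_seqs = [estimate_skeletons(clip) for clip, _ in dataset]
    type_labels = [labels["type"] for _, labels in dataset]
    hit_labels = [labels["hit"] for _, labels in dataset]
    punch_type_model = fit_stgcnpp(skeleton_seqs, type_labels)
    punch_hit_model = fit_stgcnpp(skeleton_seqs, hit_labels)
    return punch_type_model, punch_hit_model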
2 RELATED WORKS
Cizmic et al. proposed a punch classification method
using wearable sensors (Cizmic et al. 2023). In the
field of image recognition, punch classification meth-
ods for shadow boxing (Kasiri et al. 2017) (Kasiri et
al. 2015) have been proposed, while Broilvskiy et al.
proposed an action recognition method on a video da-
taset of a single fighter shadow boxing (Broilvskiy et
al. 2021). The above studies are aimed only at punch
type determination and do not discuss punch hit de-
termination.
So far, recognition methods using skeletal repre-
sentations have been presented for single-person ac-
tions in daily activities (Xu et al. 2023) (Lee et al.
2023). Similarly, action recognition methods have
been presented for the daily actions of multiple per-
sons (Li et al. 2020) (Tu et al. 2021) (Chen et al. 2023).
In these studies, the interaction between persons is
cited as a challenge in multi-person action classifica-
tion tasks, and methods that are highly robust to such
interactions have also been reported.
Action classification methods using skeletal estima-
tion have also been reported for individual sports such
as skating (Li et al. 2021), taekwondo (Luo et al.
2023), and karate (Guo et al. 2021). These studies
have shown that action classification methods using
skeletal representations can provide superior recogni-
tion performance.
CG has been used to build datasets for supervised
learning in the past. Wood et al. trained models on
CG-generated face images alone for tasks such as
facial landmark detection and face region analysis,
and showed that the resulting accuracy is comparable
to that of models trained on real face images (Wood
et al. 2023).
3 BOXING MATCH VIDEOS
In this study, a boxing match video refers to video
footage of two fighters punching each other with the
aim of landing punches. The video may be shot from
various positions around the ring. The camera is
assumed to be stationary at each filming position, an
assumption that does not lose the generality of the
footage.
The punch type judgment task in this study fo-
cuses on four types of punches used in boxing: hooks,
jabs, straights, and uppercuts. In the punch hit judg-
ment task, two classes are considered: hit and miss.
A single punch video is defined as a video of one
punch thrown by one of the two fighters in the match
video. For each punch video, we judge which of the
four punch types it contains and whether it is a "hit"
or a "miss". In this study, we do not identify which
fighter threw the punch.
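For illustration only, the two label sets could be encoded as follows; the paper does not prescribe a specific label format, so the numeric values are arbitrary.

from enum import Enum

class PunchType(Enum):      # the four punch types considered
    HOOK = 0
    JAB = 1
    STRAIGHT = 2
    UPPERCUT = 3

class HitLabel(Enum):       # the two hit-judgment classes
    MISS = 0
    HIT = 1

# Each punch video carries one PunchType and one HitLabel;
# which of the two fighters threw the punch is not labeled.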
A CG simulation is used to generate a number of
punch videos. In each punch video, one of the two
fighters performs a punching motion and the other
performs the motion of receiving the punch. For
each punch video, two labels are assigned: punch type
and punch hit. The punch type label is determined
from the motion used. The label for the hit is automat-
ically set based on the contents of the CG simulation.
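As one way such automatic hit labeling could be realized, the sketch below checks for contact between the puncher's glove and the opponent during the punch; the collision query glove_contacts is a hypothetical interface to the CG simulation, not the API of any particular tool.

def assign_hit_label(simulation, punch_interval):
    """Return "hit" if the glove contacts the opponent at any frame
    of the punch, otherwise "miss".

    simulation.glove_contacts(frame) is a hypothetical query that
    reports whether the puncher's glove touches the opponent's body
    in the given frame of the CG simulation.
    """
    start_frame, end_frame = punch_interval
    for frame in range(start_frame, end_frame + 1):
        if simulation.glove_contacts(frame):
            return "hit"
    return "miss"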
4 DETERMINATION OF PUNCH
TYPE AND HIT DETECTION
4.1 Video Processing
The punch videos are acquired from the CG simulation
at 30 fps. For each frame in a punch video, we esti-
mate the skeletons of the two boxers using the 25-key-
point model of OpenPose (Cao et al. 2017). A time-
series skeletal representation is obtained for the 17
keypoints (Figure 1) necessary for boxing action
recognition. This time-series skeletal representation is
used as input to the learning network models for judg-
ing the punch type and the punch hit. Figure 2 shows
the processing steps. ST-GCN++ (Duan et al. 2022) is
used for both models.
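The sketch below illustrates this preprocessing step, assuming OpenPose returns a (2, 25, 3) array of (x, y, confidence) values for the two boxers in each frame; the list of 17 retained keypoint indices is illustrative only, since the exact subset is the one defined in Figure 1.

import numpy as np

# Illustrative indices of the 17 keypoints kept from OpenPose's
# 25-keypoint (BODY_25) output; the actual subset follows Figure 1.
KEPT_KEYPOINTS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 19, 22]

def to_skeleton_sequence(openpose_frames):
    """Convert per-frame OpenPose output into a tensor for ST-GCN++.

    openpose_frames: list of (2, 25, 3) arrays, one per 30 fps frame,
    holding (x, y, confidence) for the two boxers.
    Returns an array of shape (C=3, T, V=17, M=2).
    """
    seq = np.stack(openpose_frames)        # (T, 2, 25, 3)
    seq = seq[:, :, KEPT_KEYPOINTS, :]     # keep the 17 keypoints
    return seq.transpose(3, 0, 2, 1)       # (channels, frames, joints, persons)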