operating. To deal with this problem setting, the
proposed method (1) constructs spatial-temporal
graphs that represent not only the visual features of
the participants but also their positions and number,
and (2) uses these graphs to classify the observed
activity into a surgical-process category. Our model
utilizes Point Transformer Layers as the building
blocks to handle graphs that carry geometric
information in spatial-temporal space.
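To make this pipeline concrete, the following is a minimal sketch (not our actual implementation) of how node features and (x, y, z, t) positions of a spatial-temporal person graph could be processed by a simplified, fully connected variant of a Point Transformer-style layer and pooled into a process category. The class names, feature dimensions, and the use of full attention instead of k-nearest-neighbour attention are illustrative assumptions.

import torch
import torch.nn as nn

class SimplifiedPointTransformerLayer(nn.Module):
    """Vector self-attention over graph nodes with relative positional encoding.

    A simplified, full-attention variant of a Point Transformer-style layer:
    every node attends to every other node instead of a k-nearest-neighbour subset.
    """

    def __init__(self, dim: int, pos_dim: int = 4):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # MLP that encodes relative (x, y, z, t) offsets between nodes.
        self.pos_mlp = nn.Sequential(nn.Linear(pos_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # MLP producing per-channel (vector) attention weights.
        self.attn_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) node features, pos: (N, pos_dim) node coordinates.
        q, k, v = self.to_q(feats), self.to_k(feats), self.to_v(feats)
        rel_pos = pos.unsqueeze(1) - pos.unsqueeze(0)          # (N, N, pos_dim)
        pos_enc = self.pos_mlp(rel_pos)                        # (N, N, dim)
        attn = self.attn_mlp(q.unsqueeze(1) - k.unsqueeze(0) + pos_enc)
        attn = attn.softmax(dim=1)                             # normalise over neighbours
        return (attn * (v.unsqueeze(0) + pos_enc)).sum(dim=1)  # (N, dim)


class SpatioTemporalGraphClassifier(nn.Module):
    """Classifies a spatio-temporal person graph into a surgical-process category."""

    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [SimplifiedPointTransformerLayer(hidden_dim) for _ in range(num_layers)]
        )
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        h = self.embed(feats)
        for layer in self.layers:
            h = h + layer(h, pos)        # residual connection around each layer
        return self.head(h.mean(dim=0))  # pool over all nodes, then classify


if __name__ == "__main__":
    # Toy graph: 5 frames x 3 participants = 15 nodes; each node has a
    # 128-d visual feature and an (x, y, z, t) position (all values illustrative).
    feats = torch.randn(15, 128)
    pos = torch.rand(15, 4)
    model = SpatioTemporalGraphClassifier(in_dim=128, hidden_dim=64, num_classes=8)
    logits = model(feats, pos)           # class scores for the clip seen so far
    print(logits.shape)

This sketch only illustrates the data flow (graph nodes with visual and geometric attributes, vector attention, pooling, classification); the actual layer configuration and neighbourhood construction follow the Point Transformer Layer design.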
By using this model, early recognition of the
surgical processes is performed on a public dataset of
mock knee replacement surgeries (Özsoy et al.,
2022). Experimental results show that our method can
recognize each surgical process from the early part of
the input video: the F1 scores on the public dataset
(Özsoy et al., 2022) range from 68.2% to 90.0%
within 30 seconds from the beginning. These results
mean that our method can recognize each surgical
process from only the early 17.1% to 34.1% of the
entire video in (Özsoy et al., 2022). Compared with
the state-of-the-art method for early recognition of
group activities (Zhai et al., 2023), our method
outperforms (Zhai et al., 2023) on the 4D-OR dataset
(Özsoy et al., 2022).
REFERENCES
Y. Li, J. Ohya, T. Chiba, R. Xu, H. Yamashita (2016). “Subaction Based Early Recognition of Surgeons' Hand Actions from Continuous Surgery Videos”, IIEEJ Transactions on Image Electronics and Visual Computing, Vol. 4, No. 2.
H. Zhao, R. Wildes (2021). “Review of Video Predictive Understanding: Early Action Recognition and Future Action Prediction”, arXiv:2107.05140v2.
B. Zhou, A. Andonian, A. Oliva, A. Torralba (2018). “Temporal Relational Reasoning in Videos”, arXiv:1711.08496.
L. Chen, J. Lu, Z. Song, J. Zhou (2018). “Part-Activated Deep Reinforcement Learning for Action Prediction”, Proc. of ECCV, pp. 435–451.
G. Singh, S. Saha, M. Sapienza, P. Torr, F. Cuzzolin (2017). “Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction”, Proc. of ICCV, pp. 3657–3666.
C. Sun, A. Shrivastava, C. Vondrick, R. Sukthankar, K. Murphy, C. Schmid (2019). “Relational Action Forecasting”, Proc. of CVPR, pp. 273–283.
S. Ma, L. Sigal, S. Sclaroff (2016). “Learning Activity Progression in LSTMs for Activity Detection and Early Detection”, Proc. of CVPR, pp. 1942–1950.
M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, L. Andersson (2017). “Encouraging LSTMs to Anticipate Actions Very Early”, arXiv:1703.07023.
Y. Kong, Z. Tao, Y. Fu (2017). “Deep Sequential Context Networks for Action Prediction”, Proc. of CVPR, pp. 3662–3670.
Y. Kong, D. Kit, Y. Fu (2014). “A Discriminative Model with Multiple Temporal Scales for Action Prediction”, Proc. of ECCV, pp. 596–611.
X. Wang, J. Hu, J. Lai, J. Zhang, W. Zheng (2019). “Progressive Teacher-Student Learning for Early Action Prediction”, Proc. of CVPR, pp. 3551–3560.
J. Bradbury, S. Merity, C. Xiong, R. Socher (2016). “Quasi-Recurrent Neural Networks”, arXiv:1611.01576v2.
H. Zhao, R. Wildes (2019). “Spatiotemporal Feature Residual Propagation for Action Prediction”, Proc. of ICCV, pp. 7002–7011.
K. Soomro, A. R. Zamir, M. Shah (2012). “UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, arXiv:1212.0402.
H. Jhuang, J. Gall, S. Zuffi, C. Schmid, M. J. Black (2013). “Towards Understanding Action Recognition”, Proc. of ICCV, pp. 3192–3199.
Y. Kong, Y. Jia, Y. Fu (2012). “Learning Human Interaction by Interactive Phrases”, Proc. of ECCV, pp. 300–313.
A. Patron-Perez, M. Marszalek, A. Zisserman, I. Reid (2010). “High Five: Recognising human interactions in TV shows”, Proc. of BMVC, pp. 50.1–50.11, doi:10.5244/C.24.50.
R. Goyal et al. (2017). “The “Something Something” Video Database for Learning and Evaluating Visual Common Sense”, Proc. of ICCV, pp. 5843–5851.
J. Hu, W. Zheng, J. Lai, J. Zhang (2017). “Jointly Learning Heterogeneous Features for RGB-D Activity Recognition”, TPAMI, Vol. 39, No. 11, pp. 2186–2200.
Y. Li, C. Lan, J. Xing, W. Zeng, C. Yuan, J. Liu (2016). “Online Human Action Detection Using Joint Classification-Regression Recurrent Neural Networks”, Proc. of ECCV, pp. 203–220.
C. Liu, Y. Hu, Y. Li, S. Song, J. Liu (2017). “PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding”, arXiv:1703.07475.
K. Simonyan, A. Zisserman (2015). “Very Deep Convolutional Networks for Large-Scale Image Recognition”, arXiv:1409.1556v6.
J. Chen, W. Bao, Y. Kong (2020). “Group Activity Prediction with Sequential Relational Anticipation Model”, arXiv:2008.02441v1.
The Japan Institute for Labour Policy and Training (2022). https://www.jil.go.jp/kokunai/blt/backnumber/2022/11/s_02.html, accessed 2023/04.
M. Ibrahim et al. (2016). “A Hierarchical Deep Temporal Model for Group Activity Recognition”, arXiv:1511.06040.
W. Choi, K. Shahid, S. Savarese (2009). “What are they doing?: Collective Activity Classification Using Spatio-Temporal Relationship Among People”, 9th International Workshop on Visual Surveillance (VSWS09) in conjunction with ICCV, https://cvgl.stanford.edu/projects/collective/collectiveActivity.html.