
These experimental results demonstrate the effectiveness of the proposed method in detecting hands moving objects. However, F1 remained low for some activity categories. For “picking objects,” large optical flows were often observed in areas other than the object being moved by a hand; the resulting increase in FP lowered P and, consequently, F1. For “taking food” and “cleaning objects,” forearm movements were often slow or small; the resulting decrease in TP lowered R, which in turn lowered F1. Improving the proposed method to handle such situations is therefore left for future work.
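For reference, the effects on F1 described above follow directly from the standard definitions of the metrics, which we assume match those used in the experiments:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}. \]

An increase in FP lowers P, a decrease in TP lowers R, and a decrease in either P or R lowers F1.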
5 CONCLUSION
In this paper, we focused on the action of people moving objects with their hands and proposed a method for detecting hands moving objects from video.
The proposed method uses prior knowledge about the detection target to integrate the skeleton and motion information obtained from video into a single type of feature, and performs detection based on these features. Because detection relies on a single feature representation, this approach is expected to improve the efficiency of the required processing, including training of the detection model.
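As an illustration of this single-representation design, the following minimal sketch (not the implementation used in this paper) concatenates skeleton-derived and motion-derived quantities into one feature vector per hand region and trains a single classifier on the result; the feature dimensions, the synthetic data, and the use of scikit-learn's SVC are assumptions made purely for illustration.

```python
# Minimal illustrative sketch (not the authors' implementation): skeleton- and
# motion-derived features are fused into a single feature vector per hand region,
# and one classifier is trained on that single representation.
import numpy as np
from sklearn.svm import SVC  # an SVM-based classifier, as in the current method

rng = np.random.default_rng(0)

def build_feature_vector(skeleton_feats, motion_feats):
    # Concatenate the two information sources into one feature vector.
    return np.concatenate([skeleton_feats, motion_feats])

# Synthetic stand-ins for per-hand-region features (dimensions are arbitrary):
# 8 skeleton-derived values and 16 motion-derived values per sample.
n_samples = 200
skeleton = rng.normal(size=(n_samples, 8))
motion = rng.normal(size=(n_samples, 16))
labels = rng.integers(0, 2, size=n_samples)  # 1 = hand moving an object, 0 = not

X = np.stack([build_feature_vector(s, m) for s, m in zip(skeleton, motion)])

# A single classifier operates on the fused representation.
clf = SVC(kernel="rbf")
clf.fit(X, labels)
print(clf.predict(X[:5]))
```

The point of the sketch is that only one representation, and hence only one detection model, needs to be trained, which is where the expected efficiency gain comes from.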
Compared to the existing method based on a similar approach, our method handles a wider variety of hand movements by introducing the affine transformation forearm motion model, and discriminates hand states in more detail by introducing the feature-vector-based classifier. Through experiments on a video dataset of human daily activities, we demonstrated that the proposed method improves the accuracy of detecting hands moving objects from video (F1 improved from 0.51 for the existing method to 0.71 for the proposed method).
As future work, we plan to:
• implement the proposed method with a more powerful classifier, such as a deep-learning-based classifier, instead of the current SVM-based classifier;
• conduct comparative experiments with methods that process different types of information, such as skeleton and motion information, in separate streams.