Authors:
Zihao Guo 1; Fei Li 1; Rujie Liu 1; Ryo Ishida 2 and Genta Suzuki 2
Affiliations:
1 Fujitsu Research & Development Center Co., Ltd., Beijing, China; 2 Fujitsu Research, Fujitsu Limited, Kawasaki, Japan
Keyword(s):
Human Object Interaction Detection, Transformer, Multi-decoder, Body Part Information, Channel Attention.
Abstract:
Human-Object Interaction (HOI) detection is an essential branch of video understanding. Complex scenes, such as a human interacting with multiple objects, remain challenging: when the whole human body is treated as the subject of the interaction, the interaction can be attributed to the wrong object. In this paper, we propose a Transformer-based structure with a body part additional module to address this problem. The Transformer structure provides a powerful information mining capability. A multi-decoder structure is adopted to solve the different sub-problems separately, which lets the model attend to different regions for each sub-problem and improves performance. The main contribution of our work is the body part additional module. It introduces body part information into HOI detection, which refines the subject of the HOI triplet and assists interaction detection. The module also includes a Channel Attention module that balances the two sources of information, preventing the model from paying too much attention to either the body part or the human-object pair. Our method outperforms the state-of-the-art model.
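The abstract does not specify how the Channel Attention module is built. As an illustration only, the following minimal sketch assumes a squeeze-and-excitation style gate applied to the concatenation of human-object pair features and body part features. The function names `channel_attention` and `fuse`, the weight matrices `w1` and `w2`, and the feature layout (a list of channels, each a flat list of values) are all hypothetical and are not taken from the paper.

```python
import math


def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style channel gate (an assumed design).

    features: list of C channels, each a list of floats (flattened positions).
    w1: reduction weights with shape (C_reduced, C).
    w2: expansion weights with shape (C, C_reduced).
    Returns (reweighted features, per-channel gates in (0, 1)).
    """
    # Squeeze: global average pooling, one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: linear layer followed by a sigmoid gate.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[g * x for x in ch] for g, ch in zip(gates, features)]
    return scaled, gates


def fuse(pair_features, part_features, w1, w2):
    """Concatenate both feature groups along the channel axis, then gate them.

    Hypothetical fusion step: the gate decides, per channel, how much of the
    human-object pair features and the body part features is passed on.
    """
    return channel_attention(pair_features + part_features, w1, w2)


if __name__ == "__main__":
    pair = [[1.0, 3.0], [2.0, 2.0]]   # two channels of pair features
    part = [[0.0, 4.0]]               # one channel of body part features
    w1 = [[1.0, 1.0, 1.0]]            # 3 channels reduced to 1 hidden unit
    w2 = [[1.0], [0.0], [-1.0]]       # 1 hidden unit expanded to 3 gates
    fused, gates = fuse(pair, part, w1, w2)
    print("gates:", gates)
    print("fused:", fused)
```

In this sketch a gate near 1 keeps a channel almost unchanged and a gate near 0 suppresses it, so learned weights could in principle keep either feature group from dominating. In a trained model `w1` and `w2` would be learned parameters rather than fixed values.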