Body Part Information Additional in Multi-decoder Transformer-Based Network for Human Object Interaction Detection