Damen, D., Doughty, H., Farinella, G. M., Furnari, A., Kazakos, E., Ma, J., Moltisanti, D., Munro, J., Perrett, T., Price, W., and Wray, M. (2022). Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23.
Darkhalil, A., Shan, D., Zhu, B., Ma, J., Kar, A., Higgins, R., Fidler, S., Fouhey, D., and Damen, D. (2022). Epic-kitchens visor benchmark: Video segmentations and object relations. Advances in Neural Information Processing Systems, 35:13745–13758.
De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., and Tuytelaars, T. (2016). Online action detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 269–284. Springer.
Gao, M., Xu, M., Davis, L. S., Socher, R., and Xiong, C. (2019). Startnet: Online detection of action start in untrimmed videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5542–5551.
Gao, M., Zhou, Y., Xu, R., Socher, R., and Xiong, C. (2021). Woad: Weakly supervised online action detection in untrimmed videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1915–1923.
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al. (2022). Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012.
Gu, A. and Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
Hu, X., Wang, S., Li, M., Li, Y., and Du, S. (2024). Time-attentive fusion network: An efficient model for online detection of action start. IET Image Processing, 18(7):1892–1902.
Idrees, H., Zamir, A. R., Jiang, Y.-G., Gorban, A., Laptev, I., Sukthankar, R., and Shah, M. (2017). The THUMOS challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23.
Li, Y., Liu, M., and Rehg, J. M. (2018). In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 619–635.
Liu, S., Zhang, C.-L., Zhao, C., and Ghanem, B. (2024). End-to-end temporal action detection with 1b parameters across 1000 frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18591–18601.
Plizzari, C., Goletto, G., Furnari, A., Bansal, S., Ragusa, F., Farinella, G. M., Damen, D., and Tommasi, T. (2024). An outlook into the future of egocentric vision. International Journal of Computer Vision, pages 1–57.
Scavo, R., Ragusa, F., Farinella, G. M., and Furnari, A. (2023). Quasi-online detection of take and release actions from egocentric videos. In International Conference on Image Analysis and Processing, pages 13–24. Springer.
Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R., and Yao, A. (2022). Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21096–21106.
Shan, D., Geng, J., Shu, M., and Fouhey, D. F. (2020). Understanding human hands in contact at internet scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9869–9878.
Shou, Z., Pan, J., Chan, J., Miyazawa, K., Mansour, H., Vetro, A., Giro-i Nieto, X., and Chang, S.-F. (2018). Online detection of action start in untrimmed, streaming videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 534–551.
Souček, T., Alayrac, J.-B., Miech, A., Laptev, I., and Sivic, J. (2022). Look for the change: Learning object states and state-modifying actions from untrimmed web videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13956–13966.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, X., Qing, Z., Huang, Z., Feng, Y., Zhang, S., Jiang, J., Tang, M., Gao, C., and Sang, N. (2021a). Proposal relation network for temporal action detection. arXiv preprint arXiv:2106.11812.
Wang, X., Zhang, S., Qing, Z., Shao, Y., Gao, C., and Sang, N. (2021b). Self-supervised learning for semi-supervised temporal action proposal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Wang, X., Zhang, S., Qing, Z., Shao, Y., Zuo, Z., Gao, C., and Sang, N. (2021c). Oadtr: Online action detection with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7565–7575.
Xue, Z., Ashutosh, K., and Grauman, K. (2024). Learning object state changes in videos: An open-world perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18493–18503.
Zhang, C.-L., Wu, J., and Li, Y. (2022). Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision, pages 492–510. Springer.
Zhao, Y. and Krähenbühl, P. (2022). Real-time online video detection with temporal smoothing transformers. In European Conference on Computer Vision, pages 485–502. Springer.