Our study indicates that both spatial and temporal
down-sampling generally lead to reduced performance,
though spatial down-sampling shows some
dataset-dependent improvements. Additionally, the
best buffer size depends on the duration of the action,
and no single value is optimal across all actions.
For pre-training, we did not observe improvements
from conventional self-supervision methods in
construction contexts. Finally, our results highlight the
potential of frame-based approaches as a direction for
future work on action recognition.
This research is part of the BoB project, an ICON
project cofunded by Flanders Innovation &
Entrepreneurship (VLAIO) and imec, with project no.
HBC.2021.0658. The construction site data was
provided by Willemen Groep, a Belgian construction
group, with support from AICON inc.
