
6 CONCLUSION
This work presents an effective automated system for creating interactive real estate video tours by addressing room classification and transition detection. The ResNet-Transformer network demonstrated strong capabilities in capturing both spatial and temporal features for accurate room classification, with the Transformer-based model improving performance by more than 20% over a more traditional LSTM-based sequence-processing approach. Integrating door transition detection as a post-processing step further enhanced performance across all models: detected door transitions contribute essential structural information, particularly aiding the accurate delineation of room boundaries when doors are present. This approach improved overall precision and ensured more consistent room layout predictions, paving the way for more sophisticated applications in automated video editing systems, particularly in the real estate domain. In future work, we plan to conduct user satisfaction studies.
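As a rough illustration of the classification pipeline summarized above, the following PyTorch sketch shows a ResNet backbone extracting per-frame spatial features that a Transformer encoder then models temporally. The module names, feature dimensions, and number of room classes are illustrative assumptions, not the exact configuration used in this work.

    # Minimal sketch of a ResNet-Transformer room classifier (assumed
    # configuration; positional encoding omitted for brevity).
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    class ResNetTransformerClassifier(nn.Module):
        def __init__(self, num_classes=8, d_model=512, nhead=8, num_layers=4):
            super().__init__()
            backbone = resnet50(weights=None)
            # Drop the final fully connected layer; keep the 2048-d pooled features.
            self.backbone = nn.Sequential(*list(backbone.children())[:-1])
            self.proj = nn.Linear(2048, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, batch_first=True)
            self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.head = nn.Linear(d_model, num_classes)

        def forward(self, frames):
            # frames: (batch, time, 3, H, W)
            b, t = frames.shape[:2]
            feats = self.backbone(frames.flatten(0, 1)).flatten(1)  # (b*t, 2048)
            feats = self.proj(feats).view(b, t, -1)                 # (b, t, d_model)
            feats = self.temporal(feats)                            # temporal self-attention
            return self.head(feats)                                 # per-frame room logits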
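The door-transition post-processing step can likewise be sketched as a simple majority-vote refinement between consecutive detected transitions. The interface below (per-frame labels plus sorted transition frame indices) is an assumed simplification for illustration, not the paper's exact procedure.

    # Hedged sketch: frames between consecutive detected door transitions
    # are assigned the majority room label of that segment.
    from collections import Counter

    def refine_with_door_transitions(frame_labels, transition_frames):
        """frame_labels: per-frame room predictions;
        transition_frames: sorted frame indices of detected door transitions."""
        refined = list(frame_labels)
        boundaries = [0] + list(transition_frames) + [len(frame_labels)]
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            if end <= start:
                continue
            majority_label = Counter(refined[start:end]).most_common(1)[0][0]
            refined[start:end] = [majority_label] * (end - start)
        return refined

    # Example: a noisy prediction inside a segment is smoothed away.
    labels = ["hall", "hall", "kitchen", "hall", "kitchen", "kitchen"]
    print(refine_with_door_transitions(labels, [3]))
    # -> ['hall', 'hall', 'hall', 'kitchen', 'kitchen', 'kitchen']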
ACKNOWLEDGEMENTS
This work was partially funded by the VLAIO project WAIVE and the real estate video company.