Girshick, R. B., Donahue, J., Darrell, T., and Malik, J.
(2014). Rich feature hierarchies for accurate object
detection and semantic segmentation. 2014 IEEE
Conference on Computer Vision and Pattern Recognition,
pages 580–587.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. B. (2020).
Mask r-cnn. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 42:386–397.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual
learning for image recognition. 2016 IEEE Conference
on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Computation, 9:1735–1780.
Irani, M. and Anandan, P. (1999). About direct methods. In
Workshop on Vision Algorithms.
Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3d convolutional
neural networks for human action recognition.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35:221–231.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar,
R., and Fei-Fei, L. (2014). Large-scale
video classification with convolutional neural networks.
2014 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1725–1732.
Khan, A. U. and Borji, A. (2018). Analysis of hand segmentation
in the wild. 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
4710–4719.
Kläser, A., Marszalek, M., and Schmid, C. (2008). A spatio-temporal
descriptor based on 3d-gradients. In BMVC.
Kondratyuk, D., Yuan, L., Li, Y., Zhang, L., Tan, M.,
Brown, M., and Gong, B. (2021). Movinets: Mobile
video networks for efficient video recognition. ArXiv,
abs/2103.11511.
Laptev, I., Marszalek, M., Schmid, C., and Rozenfeld,
B. (2008). Learning realistic human actions from
movies. 2008 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1–8.
Li, X., Hou, Y., Wang, P., Gao, Z., Xu, M., and Li, W.
(2021). Trear: Transformer-based rgb-d egocentric
action recognition. ArXiv, abs/2101.03904.
Liu, S. and Deng, W. (2015). Very deep convolutional
neural network based image classification using small
training sample size. 2015 3rd IAPR Asian Conference
on Pattern Recognition (ACPR), pages 730–734.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu,
C.-Y., and Berg, A. (2016). Ssd: Single shot multibox
detector. In ECCV.
Liu, Y., Jiang, X., Sun, T., and Xu, K. (2019). 3d gait
recognition based on a cnn-lstm network with the fusion
of skegei and da features. 2019 16th IEEE International
Conference on Advanced Video and Signal
Based Surveillance (AVSS), pages 1–8.
Ma, M., Fan, H., and Kitani, K. M. (2016). Going deeper
into first-person activity recognition. 2016 IEEE Conference
on Computer Vision and Pattern Recognition
(CVPR), pages 1894–1903.
Min, K. and Corso, J. J. (2020). Integrating human gaze into
attention for egocentric activity recognition. ArXiv,
abs/2011.03920.
Nguyen, X. S., Brun, L., Lézoray, O., and Bougleux, S.
(2019). A neural network based on spd manifold
learning for skeleton-based hand gesture recognition.
2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 12028–12037.
Ohn-Bar, E. and Trivedi, M. M. (2014). Hand gesture recognition
in real time for automotive interfaces: A multimodal
vision-based approach and evaluations. IEEE
Transactions on Intelligent Transportation Systems,
15:2368–2377.
Oreifej, O. and Liu, Z. (2013). Hon4d: Histogram of oriented
4d normals for activity recognition from depth
sequences. 2013 IEEE Conference on Computer Vision
and Pattern Recognition, pages 716–723.
Poleg, Y., Arora, C., and Peleg, S. (2014). Temporal segmentation
of egocentric videos. 2014 IEEE Conference
on Computer Vision and Pattern Recognition,
pages 2537–2544.
Rahmani, H. and Mian, A. S. (2016). 3d action recognition
from novel viewpoints. 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 1506–1515.
Ramirez-Amaro, K., Beetz, M., and Cheng, G. (2017).
Transferring skills to humanoid robots by extracting
semantic representations from observations of human
activities. Artif. Intell., 247:95–118.
Rastgoo, R., Kiani, K., and Escalera, S. (2020). Hand sign
language recognition using multi-view hand skeleton.
Expert Syst. Appl., 150:113336.
Rawat, W. and Wang, Z. (2017). Deep convolutional neural
networks for image classification: A comprehensive
review. Neural Computation, 29:2352–2449.
Ren, S., He, K., Girshick, R. B., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 39:1137–1149.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
M. S., Berg, A., and Fei-Fei, L. (2015). Imagenet
large scale visual recognition challenge. International
Journal of Computer Vision, 115:211–252.
Ryoo, M., Rothrock, B., and Matthies, L. (2015). Pooled
motion features for first-person videos. 2015 IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 896–904.
Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2018). Mobilenetv2: Inverted residuals
and linear bottlenecks. 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
4510–4520.
Scovanner, P., Ali, S., and Shah, M. (2007). A 3-dimensional
sift descriptor and its application to action
recognition. Proceedings of the 15th ACM international
conference on Multimedia.
Shan, D., Geng, J., Shu, M., and Fouhey, D. F. (2020). Understanding
human hands in contact at internet scale.