computer vision and pattern recognition, pages 1125–
1134.
Iverson, J. M. and Goldin-Meadow, S. (1998). Why people
gesture when they speak. Nature, 396(6708):228.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar,
R., and Fei-Fei, L. (2014). Large-scale video classifica-
tion with convolutional neural networks. In 2014 IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 1725–1732.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2017). Progressive
growing of GANs for improved quality, stability, and variation.
arXiv preprint arXiv:1710.10196.
Kingma, D. P. and Welling, M. (2013). Auto-encoding
variational Bayes. arXiv preprint arXiv:1312.6114.
Köpüklü, O., Köse, N., and Rigoll, G. (2018). Motion fused
frames: Data level fusion strategy for hand gesture recognition.
In 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), pages 2184–21848.
Liu, J., Liu, Y., Wang, Y., Prinet, V., Xiang, S., and Pan,
C. (2020). Decoupled representation learning for
skeleton-based gesture recognition. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 5751–5760.
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E.,
Hays, M., Zhang, F., Chang, C.-L., Yong, M. G., Lee, J., et al.
(2019). MediaPipe: A framework for building perception
pipelines. arXiv preprint arXiv:1906.08172.
Materzynska, J., Berger, G., Bax, I., and Memisevic, R. (2019).
The Jester dataset: A large-scale video dataset of human gestures.
In Proceedings of the IEEE/CVF International Conference on
Computer Vision Workshops.
Memo, A. and Zanuttigh, P. (2018). Head-mounted gesture
controlled interface for human-computer interaction.
Multimedia Tools and Applications, 77(1):27–53.
Microsoft. Teams. https://teams.microsoft.com.
Min, Y., Zhang, Y., Chai, X., and Chen, X. (2020). An efficient
PointLSTM for point clouds based gesture recognition. In
Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 5761–5770.
Mirza, M. and Osindero, S. (2014). Conditional generative
adversarial nets. arXiv preprint arXiv:1411.1784.
Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Sridhar, S.,
Casas, D., and Theobalt, C. (2018). GANerated hands for
real-time 3D hand tracking from monocular RGB. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 49–59.
Narayana, P., Beveridge, R., and Draper, B. A. (2018). Gesture
recognition: Focus on the hands. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 5235–5244.
Poon, G., Kwan, K. C., and Pang, W.-M. (2019). Occlusion-
robust bimanual gesture recognition by fusing multi-
views. Multimedia Tools and Applications, pages 1–20.
Pudipeddi, B., Mesmakhosroshahi, M., Xi, J., and Bharad-
waj, S. (2020). Training large neural networks with
constant memory using a new execution algorithm.
arXiv preprint arXiv:2002.05645.
Radford, A., Metz, L., and Chintala, S. (2015). Unsu-
pervised representation learning with deep convolu-
tional generative adversarial networks. arXiv preprint
arXiv:1511.06434.
Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. (2020).
ZeRO: Memory optimizations toward training trillion parameter
models. In SC20: International Conference for High Performance
Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
Ren, Z., Yuan, J., Meng, J., and Zhang, Z. (2013). Robust
part-based hand gesture recognition using Kinect sensor.
IEEE Transactions on Multimedia, 15(5):1110–1120.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014).
Stochastic backpropagation and approximate inference in deep
generative models. In International Conference on Machine
Learning, pages 1278–1286. PMLR.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet
Large Scale Visual Recognition Challenge. Interna-
tional Journal of Computer Vision (IJCV), 115(3):211–
252.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen,
L.-C. (2018). MobileNetV2: Inverted residuals and linear
bottlenecks. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 4510–4520.
Simonyan, K. and Zisserman, A. (2014). Two-stream con-
volutional networks for action recognition in videos.
In Ghahramani, Z., Welling, M., Cortes, C., Lawrence,
N., and Weinberger, K. Q., editors, Advances in Neural
Information Processing Systems, volume 27. Curran
Associates, Inc.
Tang, H., Liu, H., and Sebe, N. (2020). Unified generative
adversarial networks for controllable image-to-image
translation. IEEE Transactions on Image Processing,
29:8916–8929.
Tang, H., Liu, H., Xiao, W., and Sebe, N. (2019). Fast
and robust dynamic hand gesture recognition via key
frames extraction and feature fusion. Neurocomputing,
331:424–433.
Tay, Y., Bahri, D., Yang, L., Metzler, D., and Juan, D.-C. (2020).
Sparse Sinkhorn attention. In International Conference on
Machine Learning, pages 9438–9447. PMLR.
Wang, L., Qiao, Y., and Tang, X. (2015). Action recognition
with trajectory-pooled deep-convolutional descriptors.
In 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 4305–4314.
Zafrir, O., Boudoukh, G., Izsak, P., and Wasserblat, M. (2019).
Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188.
Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C.,
Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al.
(2020). Big Bird: Transformers for longer sequences. In Advances
in Neural Information Processing Systems.
Zhang, B., Wang, L., Wang, Z., Qiao, Y., and Wang, H.
(2016). Real-time action recognition with enhanced