hardware is the large number of unsupported operations, which limits the applicability of many state-of-the-art networks (Tab. 3).
Our solution employs a lightweight, two-stage algorithm that first transforms the spatio-temporal 4D data of each sensor modality into a 3D tensor of fixed shape. The gesture, now in the form of 3D data, is then classified by a GCN with one input branch per sensor modality. The gesture recognition networks are designed for deployment on the EdgeTPU, NCS2, and Jetson Nano, avoiding unsupported operations such as recurrent layers or 3D convolutions.
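As an illustration of the second stage, the following minimal sketch (in Keras; the helper `make_branch`, the layer widths, and the input shapes are illustrative assumptions, not the exact architecture) builds one 2D-convolutional input branch per modality and fuses the branches before classification, using only operations supported on all three accelerators:

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_branch(input_shape, name):
    # One input branch per sensor modality. Stage 1 (not shown) is assumed
    # to have reduced the 4D spatio-temporal recording to a fixed-shape 3D
    # tensor, e.g. (H, W, T) with the T time steps stacked as channels.
    inp = layers.Input(shape=input_shape, name=name)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)  # no recurrent layers, no 3D convs
    return inp, x

# Two example modalities; shapes and class count are placeholders.
depth_in, depth_feat = make_branch((64, 64, 16), "depth")
radar_in, radar_feat = make_branch((32, 32, 16), "radar")

fused = layers.Concatenate()([depth_feat, radar_feat])
out = layers.Dense(10, activation="softmax")(fused)
model = tf.keras.Model([depth_in, radar_in], out)
```

Keeping every layer in this supported subset is what allows the same model to be compiled for the EdgeTPU, NCS2, and Jetson Nano without operator fallbacks.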
Our largest proposed network matches the classification performance of 3DCNN (Molchanov et al., 2015) with only 8.9% of the model size. At the low end of the model-size range, we propose a gesture classification network that uses only 149 KB of memory while still performing robustly (92.3% accuracy). Our network models can therefore be deployed on resource-constrained embedded accelerators in the performance range of the Google EdgeTPU, Intel NCS2, and NVIDIA Jetson Nano.
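For the EdgeTPU path in particular, deployment could look like the following sketch: full-integer post-training quantization with the TFLite converter, followed by the EdgeTPU compiler. The calibration generator `representative_data` and the reuse of `model` from the sketch above are assumptions for illustration:

```python
import tensorflow as tf

def representative_data():
    # Placeholder calibration generator: yield a few batches shaped like
    # the model's inputs; in practice these come from the training set.
    for _ in range(100):
        yield [tf.random.uniform((1, 64, 64, 16)),
               tf.random.uniform((1, 32, 32, 16))]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("gesture_model.tflite", "wb") as f:
    f.write(converter.convert())
# Afterwards: edgetpu_compiler gesture_model.tflite
```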
In the future, we plan to further optimize the system for deployment on automotive microcontrollers such as Infineon's AURIX.
REFERENCES
Abavisani, M., Joze, H. R. V., and Patel, V. M. (2018). Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training. CoRR, abs/1812.06145.

Alom, M. Z., Taha, T. M., Yakopcic, C., Westberg, S., Hasan, M., Esesn, B. C. V., Awwal, A. A. S., and Asari, V. K. (2018). The history began from AlexNet: A comprehensive survey on deep learning approaches. CoRR, abs/1803.01164.

Ceolini, E., Taverni, G., Khacef, L., Payvand, M., and Donati, E. (2019). Sensor fusion using EMG and vision for hand gesture classification in mobile applications. In 2019 IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 1–4. IEEE.

Chai, X., Liu, Z., Yin, F., Liu, Z., and Chen, X. (2016). Two streams recurrent neural networks for large-scale continuous gesture recognition. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 31–36.

Chen, Y., Chen, T., Xu, Z., Sun, N., and Temam, O. (2016a). DianNao family: Energy-efficient hardware accelerators for machine learning. Commun. ACM, 59(11):105–112.

Chen, Y.-H., Krishna, T., Emer, J., and Sze, V. (2016b). Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In IEEE International Solid-State Circuits Conference (ISSCC 2016), Digest of Technical Papers, pages 262–263.

Concha, D. T., Maia, H. D. A., Pedrini, H., Tacon, H., Brito, A. D. S., Chaves, H. D. L., and Vieira, M. B. (2018). Multi-stream convolutional neural networks for action recognition in video sequences based on adaptive visual rhythms. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 473–480.

Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389.

Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016). Convolutional two-stream network fusion for video action recognition. CoRR, abs/1604.06573.

Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524.

Google (2020). Google EdgeTPU Documentation.

Hazra, S. and Santra, A. (2019). Radar gesture recognition system in presence of interference using self-attention neural network. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1409–1414.

Intel (2020). Intel Neural Compute Stick 2 Documentation.
Köpüklü, O., Köse, N., and Rigoll, G. (2018). Motion fused frames: Data level fusion strategy for hand gesture recognition. CoRR, abs/1804.07187.
Köpüklü, O., Gunduz, A., Köse, N., and Rigoll, G. (2020). Online dynamic hand gesture recognition including efficiency analysis. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2(2):85–97.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.
Molchanov, P., Gupta, S., Kim, K., and Pulli, K. (2015). Multi-sensor system for driver's hand-gesture recognition. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8.
Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., and Kautz, J. (2016). Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4207–4215.
Naguri, C. R. and Bunescu, R. C. (2017). Recognition of dynamic hand gestures from 3D motion data using LSTM and CNN architectures. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1130–1133.

NVIDIA (2020). NVIDIA Jetson Nano Documentation.

Reuther, A., Michaleas, P., Jones, M., Gadepally, V., Samsi, S., and Kepner, J. (2019). Survey and benchmarking of machine learning accelerators. In 2019 IEEE High Performance Extreme Computing Conference (HPEC).

Rockchip (2020). Rockchip Documentation.

Tran, D., Bourdev, L. D., Fergus, R., Torresani, L., and Paluri, M. (2014). C3D: Generic features for video analysis. CoRR, abs/1412.0767.