Authors:
João Ribeiro Pinto 1,2; Pedro Carvalho 1,3; Carolina Pinto 4; Afonso Sousa 1,2; Leonardo Capozzi 1,2 and Jaime S. Cardoso 1,2
Affiliations:
1 Centre for Telecommunications and Multimedia, INESC TEC, Porto, Portugal
2 Faculty of Engineering (FEUP), University of Porto, Porto, Portugal
3 School of Engineering (ISEP), Polytechnic of Porto, Porto, Portugal
4 Bosch Car Multimedia, Braga, Portugal
Keyword(s):
Action Recognition, Audio, Cascading, Convolutional Networks, Recurrent Networks, Video.
Abstract:
With the advent of self-driving cars, and big companies such as Waymo or Bosch pushing towards fully driverless transportation services, the in-vehicle behaviour of passengers must be monitored to ensure safety and comfort. Audio-visual information is attractive for this task due to its spatio-temporal richness and non-invasive nature, but its use is constrained by the available hardware and energy consumption. Hence, new strategies are required to make better use of these scarce resources. We propose processing audio and visual data in a cascade pipeline for in-vehicle action recognition. The data is processed by modality-specific sub-modules, with subsequent ones invoked only when a confident classification has not yet been reached. Experiments show an interesting accuracy-acceleration trade-off compared with a parallel pipeline using late fusion, indicating potential for industrial applications on embedded devices.
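The confidence-gated cascade described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the stage classifiers and the 0.9 threshold are hypothetical stand-ins for the modality-specific sub-modules.

```python
def cascade_predict(stages, sample, threshold=0.9):
    """Run classifiers in order; stop at the first confident prediction.

    stages: list of callables mapping a sample to (label, confidence),
            ordered from cheapest to most expensive.
    """
    label, confidence = None, 0.0
    for classify in stages:
        label, confidence = classify(sample)
        if confidence >= threshold:
            break  # confident enough: skip the remaining (costlier) stages
    return label, confidence

# Toy stand-ins for the audio- and video-based sub-modules (illustrative only).
audio_stage = lambda s: ("talking", 0.6)   # cheap but uncertain
video_stage = lambda s: ("talking", 0.95)  # costlier, more confident

print(cascade_predict([audio_stage, video_stage], sample=None))
```

The accuracy-acceleration trade-off arises because easy samples exit at the cheap early stage, while only ambiguous ones pay for the later stages.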