REFERENCES
A. Aubrey, B. Rivet, Y. Hicks, L. Girin, J. Chambers, and C. Jutten (2007). Two novel visual voice activity detectors based on appearance models and retinal filtering. In Proc. 2007 15th European Signal Processing Conference (EUSIPCO), pages 2409–2413.
A. Stergiou, G. Kapidis, G. Kalliatakis, C. Chrysoulas, R. Veltkamp, and R. Poppe (2019). Saliency tubes: Visual explanations for spatio-temporal convolutions. In Proc. 2019 IEEE International Conference on Image Processing (ICIP), pages 1830–1834.
B. G. Gebre, P. Wittenburg, S. Drude, M. Huijbregts, and T. Heskes (2014a). Speaker diarization using gesture and speech. In Proc. INTERSPEECH 2014, pages 582–586.
B. G. Gebre, P. Wittenburg, T. Heskes, and S. Drude (2014b). Motion history images for online speaker/signer diarization. In Proc. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1537–1541.
B. Joosten, E. Postma, and E. Krahmer (2015). Voice activity detection based on facial movement. Journal on Multimodal User Interfaces, 9:183–193.
D. E. King (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research (JMLR), 10:1755–1758.
G. Bradski (2000). The OpenCV library. Dr. Dobb's Journal of Software Tools.
H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi (2016). Dynamic image networks for action recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3034–3042.
H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida (2020). MMTM: Multimodal transfer module for CNN fusion. In Proc. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13286–13296.
J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. L. Masson, I. McCowan, W. Post, D. Reidsma, and P. Wellner (2005). The AMI meeting corpus: A pre-announcement. In Proc. Machine Learning for Multimodal Interaction (MLMI), pages 28–39.
K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
K. Hoover, S. Chaudhuri, C. Pantofaru, M. Slaney, and I. Sturdy (2017). Putting a face to the voice: Fusing audio and visual signals across a video to determine speakers. arXiv:1706.00079.
K. Stefanov, J. Beskow, and G. Salvi (2017). Vision-based active speaker detection in multiparty interactions. In Proc. Grounding Language Understanding (GLU), pages 47–51.
M. Cristani, A. Pesarin, A. Vinciarelli, M. Crocco, and V. Murino (2011). Look at who’s talking: Voice activity detection by automated gesture analysis. In Proc. AmI Workshops 2011, pages 72–80.
M. Lin, Q. Chen, and S. Yan (2014). Network in network. In Proc. International Conference on Learning Representations (ICLR).
M. Shahid, C. Beyan, and V. Murino (2019a). Comparisons of visual activity primitives for voice activity detection. In Proc. Image Analysis and Processing – ICIAP 2019, pages 48–59.
M. Shahid, C. Beyan, and V. Murino (2019b). Voice activity detection by upper body motion analysis and unsupervised domain adaptation. In Proc. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 1260–1269.
M. Shahid, C. Beyan, and V. Murino (2021). S-VVAD: Visual voice activity detection by motion segmentation. In Proc. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2331–2340.
N. Latif, A. V. Barbosa, E. Vatikiotis-Bateson, M. S. Castelhano, and K. G. Munhall (2014). Movement coordination during conversation. PLOS ONE, pages 1–10.