Furthermore, we were able to publish some of the trained models on PyPI (PSF, 2022) at https://pypi.org/project/vvadlrs3/ to make it easier to develop applications on top of them.
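As a minimal sketch of how the published package could be consumed in an application, the snippet below installs it from PyPI and loads a pretrained model; the module name pretrained_models and the loader getFaceImageModel are assumptions for illustration and may not match the actual vvadlrs3 API.

```python
# Illustrative sketch only: the module and function names below are assumptions
# and may differ from the actual vvadlrs3 API.
#
# Install the published package from PyPI first:
#   pip install vvadlrs3

from vvadlrs3 import pretrained_models  # hypothetical module exposing the published models

# Hypothetical loader that returns one of the trained Keras models.
model = pretrained_models.getFaceImageModel()

# The returned Keras model can then be used as usual, e.g.
# predictions = model.predict(preprocessed_face_image_sequences)
```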
REFERENCES
Aung, Z. H. and Ritthipravat, P. (2016). Robust visual voice activity detection using long short-term memory recurrent neural network. In Revised Selected Papers of the 7th Pacific-Rim Symposium on Image and Video Technology - Volume 9431, PSIVT 2015, pages 380–391, New York, NY, USA. Springer-Verlag New York, Inc.
Chollet, F. et al. (2015). Keras. https://keras.io.
Luthon, F. and Liévin, M. (1998). Lip motion automatic detection.
Choi, J. and Kim, M. (2009). The usage and evaluation of anthropomorphic form in robot design. In Undisciplined! Design Research Society Conference 2008.
Guy, S., Lathuilière, S., Mesejo, P., and Horaud, R. (2020). Learning visual voice activity detection with an automatically annotated dataset.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Computation, 9(8):1735–1780.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications.
Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. (2018). Densely connected convolutional networks.
Kanda, T. and Ishiguro, H. (2017). Human-Robot Interaction
in Social Robotics. CRC Press.
King, D. E. (2009). Dlib-ml: A machine learning toolkit.
Journal of Machine Learning Research, 10:1755–1758.
Matin, M. and Valdenegro-Toro, M. (2020). Hey Human, If
your Facial Emotions are Uncertain, You Should Use
BNNs! In Women in Computer Vision @ ECCV.
Bendris, M., Charlet, D., and Chollet, G. (2010). Lip activity detection for talking faces classification in TV content. In 3rd International Conference on Machine Vision (ICMV), pages 187–190.
Miwa, H., Okuchi, T., Itoh, K., Takanobu, H., and Takanishi, A. (2003). A new mental model for humanoid robots for human friendly communication - introduction of learning system, mood vector and second order equations of emotion. In 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422), volume 3, pages 3588–3593.
Nikitina, A. (2011). Successful Public Speaking. Bookboon.
Oztop, E., Franklin, D. W., Chaminade, T., and Cheng, G. (2005). Human–humanoid interaction: Is a humanoid robot perceived as a human? International Journal of Humanoid Robotics, 2(4):537–559.
Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep face recognition. In British Machine Vision Conference.
Patrona, F., Iosifidis, A., Tefas, A., Nikolaidis, N., and Pitas,
I. (2016). Visual voice activity detection in the wild.
IEEE Transactions on Multimedia, 18(6):967–977.
Patterson, E. K., Gurbuz, S., Tufekci, Z., and Gowdy, J. N. (2002). CUAVE: A new audio-visual database for multimodal human-computer interface research. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages II-2017–II-2020.
PSF, Python Software Foundation (2022). The Python Package Index (PyPI). https://pypi.org/.
Siatras, S., Nikolaidis, N., and Pitas, I. (2006). Visual speech detection using mouth region intensities. In 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4–8, 2006.
Afouras, T., Chung, J. S., and Zisserman, A. (2018). LRS3-TED: A large-scale dataset for visual speech recognition. arXiv:1809.00496.
Zellner, B. (1994). Pauses and the temporal structure of
speech.
del Pobil Ferré, Á. P., Bou, M. D., Stenzel, A., Chinellato, E., Lappe, M., and Liepelt, R. (2013). When humanoid robots become human-like interaction partners: Corepresentation of robotic actions. page 18.