Ma, P., Martinez, B., Petridis, S., and Pantic, M. (2021). To-
wards practical lipreading with distilled and efficient
models. In ICASSP 2021-2021 IEEE International
Conference on Acoustics, Speech and Signal Process-
ing (ICASSP), pages 7608–7612. IEEE.
Ma, P., Wang, Y., Petridis, S., Shen, J., and Pantic, M.
(2022). Training strategies for improved lip-reading.
In ICASSP 2022-2022 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), pages 8472–8476. IEEE.
Martinez, B., Ma, P., Petridis, S., and Pantic, M. (2020).
Lipreading using temporal convolutional networks.
In ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), pages 6319–6323. IEEE.
McCool, C., Marcel, S., Hadid, A., Pietik
¨
ainen, M., Mate-
jka, P., Cernock
`
y, J., Poh, N., Kittler, J., Larcher, A.,
Levy, C., et al. (2012). Bi-modal person recognition
on a mobile phone: using mobile phone data. In Mul-
timedia and Expo Workshops (ICMEW), 2012 IEEE
International Conference on, pages 635–640. IEEE.
Mroueh, Y., Marcheret, E., and Goel, V. (2015). Deep mul-
timodal learning for audio-visual speech recognition.
In Acoustics, Speech and Signal Processing (ICASSP),
2015 IEEE International Conference on, pages 2130–
2134. IEEE.
Neti, C., Potamianos, G., Luettin, J., Matthews, I., Glotin,
H., Vergyri, D., Sison, J., and Mashari, A. (2000).
Audio visual speech recognition. Technical report,
IDIAP.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,
Antiga, L., et al. (2019). Pytorch: An imperative style,
high-performance deep learning library. Advances
in neural information processing systems, 32:8026–
8037.
Petridis, S., Shen, J., Cetin, D., and Pantic, M. (2018).
Visual-only recognition of normal, whispered and
silent speech. arXiv preprint arXiv:1802.06399.
Petridis, S., Wang, Y., Ma, P., Li, Z., and Pantic, M. (2020).
End-to-end visual speech recognition for small-scale
datasets. Pattern Recognition Letters, 131:421–427.
Rekik, A., Ben-Hamadou, A., and Mahdi, W. (2014). A new
visual speech recognition approach for rgb-d cameras.
In International conference image analysis and recog-
nition, pages 21–28. Springer.
Rekik, A., Ben-Hamadou, A., and Mahdi, W. (2015a). Hu-
man machine interaction via visual speech spotting.
In Advanced Concepts for Intelligent Vision Systems,
pages 566–574. Springer.
Rekik, A., Ben-Hamadou, A., and Mahdi, W. (2015b). Uni-
fied system for visual speech recognition and speaker
identification. In International Conference on Ad-
vanced Concepts for Intelligent Vision Systems, pages
381–390. Springer.
Rekik, A., Ben-Hamadou, A., and Mahdi, W. (2016).
An adaptive approach for lip-reading using image
and depth data. Multimedia Tools and Applications,
75(14):8609–8636.
Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., and Pantic,
M. (2013). 300 faces in-the-wild challenge: The first
facial landmark localization challenge. In 2013 IEEE
International Conference on Computer Vision Work-
shops, pages 397–403. IEEE.
Sanderson, C. (2002). The vidtimit database. Technical
report, IDIAP.
Sheng, C., Kuang, G., Bai, L., Hou, C., Guo, Y., Xu, X.,
Pietik
¨
ainen, M., and Liu, L. (2022). Deep learning
for visual speech analysis: A survey. arXiv preprint
arXiv:2205.10839.
Stafylakis, T. and Tzimiropoulos, G. (2017). Combining
residual networks with lstms for lipreading. arXiv
preprint arXiv:1703.04105.
Sun, K., Yu, C., Shi, W., Liu, L., and Shi, Y. (2018). Lip-
interact: Improving mobile device interaction with
silent speech commands. In Proceedings of the 31st
Annual ACM Symposium on User Interface Software
and Technology, pages 581–593.
Wong, Y. W., Ch’ng, S. I., Seng, K. P., Ang, L.-M., Chin,
S. W., Chew, W. J., and Lim, K. H. (2011). A new
multi-purpose audio-visual unmc-vier database with
multiple variabilities. Pattern Recognition Letters,
32(13):1503–1510.
Wu, Y. and Ji, Q. (2019). Facial landmark detection: A
literature survey. International Journal of Computer
Vision, 127(2):115–142.
Zhao, G., Barnard, M., and Pietikainen, M. (2009). Lipread-
ing with local spatiotemporal descriptors. IEEE
Transactions on Multimedia, 11(7):1254–1265.
VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications
638