Visual-only Voice Activity Detection using Human Motion in Conference Video

Keisuke Yamazaki, Satoshi Tamura, Yuuto Gotoh, Masaki Nose

2022

Abstract

In this paper, we propose a visual-only Voice Activity Detection (VAD) method that uses human movements. Although audio VAD is commonly used in many applications, it is not robust in noisy environments. In such cases, multi-modal VAD using speech and mouth information is effective. However, due to the current pandemic situation, people wear masks, so their mouths cannot be observed. On the other hand, a video capturing the entire speaker is useful for visual VAD, because gestures and motions may help identify speech segments. In our scheme, we first obtain dynamic images, which represent the motion of a person. Second, we fuse the dynamic and original images using the Multi-Modal Transfer Module (MMTM). To evaluate the effectiveness of our scheme, we conducted experiments using conference videos. The results show that the proposed model performs better than the baseline. Furthermore, through model visualization, we confirmed that the proposed model focused much more on speakers.
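
The abstract does not state how the dynamic images are computed. As a minimal sketch, assuming the commonly used approximate rank pooling formulation, a clip of T frames can be collapsed into one image by a weighted sum whose weights are negative for early frames and positive for late ones, so that static regions cancel out and moving regions remain. The function names below and the flattened grayscale frame layout are hypothetical choices made for illustration, not details taken from the paper.

```python
from typing import List, Sequence


def rank_pooling_coefficients(num_frames: int) -> List[float]:
    """Approximate rank pooling weights alpha_t for t = 1..T (assumed formulation)."""
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    # harmonic[t] = 1 + 1/2 + ... + 1/t, with harmonic[0] = 0
    harmonic = [0.0]
    for t in range(1, num_frames + 1):
        harmonic.append(harmonic[-1] + 1.0 / t)
    T = num_frames
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1})
    return [
        2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
        for t in range(1, T + 1)
    ]


def dynamic_image(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a clip of flattened frames into one weighted-sum frame."""
    alphas = rank_pooling_coefficients(len(frames))
    num_pixels = len(frames[0])
    return [
        sum(a * frame[i] for a, frame in zip(alphas, frames))
        for i in range(num_pixels)
    ]
```

Because the weights sum to zero, a clip in which nothing moves maps to an all-zero image, which is one way to see why such an image emphasizes motion rather than appearance.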


Paper Citation


in Harvard Style

Yamazaki K., Tamura S., Gotoh Y. and Nose M. (2022). Visual-only Voice Activity Detection using Human Motion in Conference Video. In Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-549-4, pages 570-577. DOI: 10.5220/0010829200003122


in Bibtex Style

@conference{icpram22,
author={Keisuke Yamazaki and Satoshi Tamura and Yuuto Gotoh and Masaki Nose},
title={Visual-only Voice Activity Detection using Human Motion in Conference Video},
booktitle={Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2022},
pages={570-577},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010829200003122},
isbn={978-989-758-549-4},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Visual-only Voice Activity Detection using Human Motion in Conference Video
SN - 978-989-758-549-4
AU - Yamazaki K.
AU - Tamura S.
AU - Gotoh Y.
AU - Nose M.
PY - 2022
SP - 570
EP - 577
DO - 10.5220/0010829200003122