
Table 3: Detection accuracy of the Multi-view CNN on short videos from FakeAVCeleb (trained and tested on 1,000 videos (500 FvFa, 500 RvRa) cut into 1 s segments with 50% overlap; the number of classes for classification is set to 2).
Modality        View         Accuracy (train/test)  Precision (train/test)  Recall (train/test)
Visual          XY           85.76%/80.22%          88.12%/82.17%           89.64%/79.96%
Visual          XT           85.47%/83.04%          88.45%/87.68%           87.22%/84.93%
Visual          TY           88.15%/88.21%          91.31%/88.93%           88.72%/92.76%
Visual          XYT          94.88%/90.05%          96.21%/93.52%           95.23%/90.41%
Audio           −            100%/100%              100%/100%               100%/100%
Audio + Visual  XYT + Audio  98.77%/97.91%          99.06%/97.96%           98.90%/98.72%
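As a concrete illustration of the preprocessing protocol behind Table 3, the sketch below cuts a frame sequence into 1 s windows with 50% overlap. This is a minimal sketch assuming a 25 fps input and a (T, H, W, C) frame array; the function name and defaults are illustrative, not taken from the paper's code.

```python
import numpy as np

def segment_clip(frames: np.ndarray, fps: int = 25,
                 window_s: float = 1.0, overlap: float = 0.5):
    """Cut a (T, H, W, C) frame sequence into fixed-length windows.

    Hypothetical helper mirroring the evaluation protocol: 1 s windows
    with 50% overlap; fps and array shape are assumptions.
    """
    win = int(window_s * fps)                  # frames per 1 s window
    hop = max(1, int(win * (1.0 - overlap)))   # 50% overlap -> hop of win/2
    return [frames[s:s + win] for s in range(0, len(frames) - win + 1, hop)]

# Example: a 4 s clip at 25 fps yields 7 overlapping 1 s segments.
clip = np.zeros((100, 64, 64, 3), dtype=np.uint8)
print(len(segment_clip(clip)))  # 7
```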
that could further integrate the audiovisual features at multiple levels of abstraction. Such a multimodal system could benefit from the inherent strengths of each modality, potentially leading to a detection mechanism that is more resilient to sophisticated deepfake manipulations. These efforts will contribute to the overarching goal of ensuring the authenticity and trustworthiness of digital media.
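To make this direction concrete, the following is a minimal sketch, assuming pre-extracted fixed-size audio and visual embeddings, of fusing the two modalities at several abstraction levels. The module name, layer sizes, and concatenation-based fusion are hypothetical illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiLevelAVFusion(nn.Module):
    """Hypothetical sketch: fuse audio and visual features at three
    abstraction levels (early, mid, late); all sizes are assumptions."""

    def __init__(self, dim: int = 128, n_classes: int = 2):
        super().__init__()
        self.v_blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.a_blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        # One fusion layer per abstraction level.
        self.fusions = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(3)])
        self.head = nn.Linear(3 * dim, n_classes)

    def forward(self, v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        fused = []
        for vb, ab, fb in zip(self.v_blocks, self.a_blocks, self.fusions):
            v, a = torch.relu(vb(v)), torch.relu(ab(a))
            fused.append(torch.relu(fb(torch.cat([v, a], dim=-1))))
        # Classify from the concatenation of all per-level fused features.
        return self.head(torch.cat(fused, dim=-1))

# Example: a batch of 4 visual and audio embeddings of dimension 128.
logits = MultiLevelAVFusion()(torch.randn(4, 128), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 2])
```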
ACKNOWLEDGEMENTS
The authors thank Angers Loire Métropole (ALM) for the Ph.D. grant of Abderrazzaq Moufidi.
REFERENCES
Afchar, D., Nozick, V., Yamagishi, J., and Echizen, I. (2018). MesoNet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–7. IEEE.
Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258.
Chung, J. S., Nagrani, A., and Zisserman, A. (2018). VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.
Cozzolino, D., Pianese, A., Nießner, M., and Verdoliva, L. (2023). Audio-visual person-of-interest deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 943–952.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
Desplanques, B., Thienpondt, J., and Demuynck, K. (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.
Huang, T.-h., Lin, J.-h., and Lee, H.-y. (2021). How far are we from robust voice conversion: A survey. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 514–521. IEEE.
Ilyas, H., Javed, A., and Malik, K. M. (2023). AVFakeNet: A unified end-to-end dense Swin transformer deep learning model for audio-visual deepfakes detection. Applied Soft Computing, 136:110124.
Jia, Y., Zhang, Y., Weiss, R., Wang, Q., Shen, J., Ren, F., Nguyen, P., Pang, R., Lopez Moreno, I., Wu, Y., et al. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in Neural Information Processing Systems, 31.
Jiang, Z., Liu, J., Ren, Y., He, J., Zhang, C., Ye, Z., Wei, P., Wang, C., Yin, X., Ma, Z., et al. (2023). Mega-TTS 2: Zero-shot text-to-speech with arbitrary length speech prompts. arXiv preprint arXiv:2307.07218.
Khalid, H., Kim, M., Tariq, S., and Woo, S. S. (2021a). Evaluation of an audio-video multimodal deepfake dataset using unimodal and multimodal detectors. In Proceedings of the 1st Workshop on Synthetic Multimedia-Audiovisual Deepfake Generation and Detection, pages 7–15.
Khalid, H., Tariq, S., Kim, M., and Woo, S. S. (2021b). FakeAVCeleb: A novel audio-video multimodal deepfake dataset. arXiv preprint arXiv:2108.05080.
Korshunova, I., Shi, W., Dambre, J., and Theis, L. (2017). Fast face-swap using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3677–3685.
Ling, J., Tan, X., Chen, L., Li, R., Zhang, Y., Zhao, S., and Song, L. (2022). StableFace: Analyzing and improving motion stability for talking face generation.
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.-L., Yong, M., Lee, J., Chang, W.-T., Hua, W., Georg, M., and Grundmann, M. (2019). MediaPipe: A framework for perceiving and processing reality. In Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR).
Masood, M., Nawaz, M., Malik, K. M., Javed, A., Irtaza, A., and Malik, H. (2023). Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence, 53(4):3974–4026.
Moufidi, A., Rousseau, D., and Rasti, P. (2023). Attention-based fusion of ultrashort voice utterances and depth videos for multimodal person identification. Sensors, 23(13):5890.
Nirkin, Y., Keller, Y., and Hassner, T. (2019). FSGAN: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7184–7193.