Authors:
Abderrazzaq Moufidi 1,2; David Rousseau 2 and Pejman Rasti 1,2
Affiliations:
1 Centre d’Études et de Recherche pour l’Aide à la Décision (CERADE), ESAIP, 18 Rue du 8 Mai 1945, Saint-Barthélemy-d’Anjou 49124, France
2 Laboratoire Angevin de Recherche en Ingénierie des Systèmes (LARIS), UMR INRAe-IRHS, Université d’Angers, 62 Avenue Notre Dame du Lac, Angers 49000, France
Keyword(s):
Deepfake, Multimodality, Multi-View, Audio-Lips Correlation, Late Fusion, Spatiotemporal.
Abstract:
This study addresses the growing challenge posed by AI-generated multimedia content that is persuasive yet often misleading, and difficult for both humans and machine learning systems to interpret. Building on our prior research, we analyze the visual and auditory streams of multimedia to detect multimodal deepfakes, focusing specifically on the lower facial area of video clips; this targeted approach distinguishes our work within the complex field of deepfake detection. The technique is particularly effective on short clips, from 200 milliseconds to one second, a duration on which many current deep learning methods struggle. In our previous work, we used late fusion to correlate audio with lip movements and developed a video feature extraction method that requires less computational power, making it a practical solution for real-world applications with limited computing resources. By adopting a multi-view strategy, the proposed network can exploit the various weaknesses common in deepfake generation, from visual anomalies and motion inconsistencies to issues with jaw positioning.
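To illustrate the late-fusion idea mentioned in the abstract, the following is a minimal PyTorch sketch of a detector with separate audio and lip-movement branches fused only at the decision stage. It is a toy illustration under assumed placeholder dimensions and class names (LateFusionDetector, audio_dim, visual_dim, embed_dim), not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class LateFusionDetector(nn.Module):
    """Toy late-fusion model: independent audio and lip encoders,
    combined only at the classification stage (hypothetical sketch)."""
    def __init__(self, audio_dim=128, visual_dim=128, embed_dim=64):
        super().__init__()
        # Audio branch: encodes features from a short speech segment
        # (e.g., 200 ms to 1 s, the duration range targeted in the paper).
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU()
        )
        # Visual branch: encodes spatiotemporal features of the lower face.
        self.visual_encoder = nn.Sequential(
            nn.Linear(visual_dim, embed_dim), nn.ReLU()
        )
        # Late fusion: concatenate both embeddings, then classify real vs. fake.
        self.classifier = nn.Linear(2 * embed_dim, 2)

    def forward(self, audio_feat, lip_feat):
        a = self.audio_encoder(audio_feat)
        v = self.visual_encoder(lip_feat)
        return self.classifier(torch.cat([a, v], dim=-1))

# Usage on random placeholder features for a batch of 4 clips.
model = LateFusionDetector()
logits = model(torch.randn(4, 128), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 2])
```

A multi-view extension, as described in the abstract, would add one such branch per view (e.g., appearance, motion, jaw geometry) and concatenate all embeddings before the classifier, so that each branch can specialize in a different deepfake artifact.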