15, corresponding to a recall of 73.3%. Failures to
find the correct claim can be traced to two causes:
1. The claim author's face is not correctly recog-
nized (for instance, in videos 5 or 9, where
no faces or the wrong faces are recognized, respec-
tively). This could be addressed by using more
training images when fine-tuning the face recog-
nition model.
2. The quality of the transcription did not allow
a correct matching (for instance, in videos 3 or
8, where the correct claim author's face is recog-
nized). This could be addressed either by deter-
mining optimal values of the parameters N and
thresh_sim or by considering more recent matching
algorithms (for instance, transformer models).
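The role of the parameters N and thresh_sim can be illustrated with a minimal sketch of keyword-based claim matching. The keyword extraction and the Jaccard similarity used here are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch of keyword-based claim matching (hypothetical
# implementation; N and thresh_sim are the parameters named in the text).
from collections import Counter


def top_n_keywords(text, n):
    """Naive keyword extraction: the n most frequent words longer than 3 chars."""
    words = [w.lower() for w in text.split() if len(w) > 3]
    return {w for w, _ in Counter(words).most_common(n)}


def match_claim(transcript, claims, n=10, thresh_sim=0.3):
    """Return the fact-checked claim whose keyword set best overlaps the
    transcript's keywords, if the best overlap reaches thresh_sim (Jaccard)."""
    t_kw = top_n_keywords(transcript, n)
    best, best_sim = None, 0.0
    for claim in claims:
        c_kw = top_n_keywords(claim, n)
        union = t_kw | c_kw
        sim = len(t_kw & c_kw) / len(union) if union else 0.0
        if sim > best_sim:
            best, best_sim = claim, sim
    return best if best_sim >= thresh_sim else None
```

A transcription error that corrupts even a few keywords lowers the Jaccard score below thresh_sim, which is exactly the failure mode described above.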
Since a wrong claim is also found in 5 cases, the
proposed method has a precision of 68.75%. Similar
explanations apply to this score, with corresponding
solutions:
1. The face recognition module finds faces other than
the correct claim author's, either by mistake or
because those people do actually appear in the
video. This leads our system to consider more el-
ements in the candidate claims (i.e., more claims
in the dataset), and thus increases the possibility of
matching a wrong claim.
2. The overlap between claim keyword sets is sig-
nificant, meaning that some claims of the dataset
share several identical keywords with other
claims. This could be addressed by refining the
quality or the diversity of the keywords describ-
ing fact-checked claims (either manually curated
or automatically extracted/inferred).
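As a sanity check, the reported recall and precision are consistent with simple counts. Assuming 11 correct matches out of 15 videos (the only integer count consistent with the 73.3% recall) and the 5 wrong matches stated above:

```python
# Cross-check of the reported recall and precision figures.
# Assumption: 11 of the 15 videos yield the correct claim (11/15 = 73.3%),
# and a wrong claim is returned in 5 additional cases.

correct = 11   # videos where the correct fact-checked claim is retrieved
total = 15     # videos in the toy dataset
wrong = 5      # videos where a wrong claim is retrieved

recall = correct / total                 # 11/15 = 0.7333 -> 73.3%
precision = correct / (correct + wrong)  # 11/16 = 0.6875 -> 68.75%

print(f"recall    = {recall:.1%}")
print(f"precision = {precision:.2%}")
```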
Regarding our experiments with the video resolution
reduced to 240p, the impact appears limited. Indeed,
for all the videos considered in the toy dataset, when-
ever the correct claim was found at the original resolu-
tion, it was also found on the 240p version. The influ-
ence of this reduced resolution can however be observed
at the facial recognition step: for many of the videos
(7 out of 15, nearly 50%), the face recognition module
finds several wrong persons. This argues that our
multimodal approach, which considers both visual and
textual features, remains relevant when dealing with
reduced video resolution.
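For reference, a 240p reduction only fixes the frame height; the width follows from the source aspect ratio. A minimal sketch of that computation (a hypothetical helper, not part of the paper's pipeline):

```python
def dims_for_240p(src_w, src_h, target_h=240):
    """Compute the frame size after a 240p downscale, preserving the
    aspect ratio and rounding the width down to an even value (a common
    requirement of video codecs)."""
    scale = target_h / src_h
    w = int(round(src_w * scale))
    return w - (w % 2), target_h


# Both 720p (1280x720) and 1080p (1920x1080) sources become 426x240.
print(dims_for_240p(1280, 720))
```

At 426x240, each face occupies only a handful of pixels, which is consistent with the degraded face recognition observed above.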
5 CONCLUSION AND
PERSPECTIVES
In this paper, we have introduced a multimodal ap-
proach for detecting claims that have already been
fact-checked in input videos. Given the recurring na-
ture of false information propagated across different
media, and the time-consuming task of assessing the
veracity of information for fact-checkers, we believe
that such a system could be provided as an asset to
experts such as journalists, but also to the general
public. Focusing on political discourse in French, we
demonstrate the feasibility of a complete system that
is offline and explainable. The results obtained are
promising for future real-time applications, and the
system's robustness could easily be improved using
more recent, better-performing state-of-the-art meth-
ods such as transformer models. In future work, we
also plan to stress-test our workflow with a larger
fact-checked claim dataset that is currently being cu-
rated and with the larger STVD-FC video dataset.
Fact-Checked Claim Detection in Videos Using a Multimodal Approach