4 EXPERIMENTS
4.1 Data Set
We carried out experiments on the DFDC dataset
(Dolhansky et al., 2020), constructed by AWS,
Facebook, Microsoft, and the Partnership on AI,
along with other academic partners. The dataset is
part of the Deepfake Detection Challenge, which aims
to develop machine learning models that can help
distinguish real from manipulated media content.
The dataset contains over 470GB of videos
(19,154 real videos and 100,000 fake videos),
recorded using 486 actors. Each video has a duration
of about 10 seconds; the fake videos are generated
using one of 4 different deepfake generation
techniques, namely Deepfake Autoencoder, MM/NN face
swap, Neural Talking Heads, and Face Swapping GAN.
No data augmentation is performed. Additionally, a
test set is available that is used for performance
comparison on Kaggle, namely for the Public
Leaderboard. The Public Test Set is collected in the
same way as DFDC and contains 4,000 videos (2,000
real videos and 2,000 fake videos) from 214 actors
who do not appear in the DFDC dataset. The major
difference with respect to the DFDC dataset is that
it includes videos generated with one additional
deepfake generation technique, namely StyleGAN,
combined with heavy data augmentation.
We pre-process the dataset and prepare the video
samples for our experiments as follows. First, we
perform subsampling: as the dataset is imbalanced,
with 100,000 fake videos and 19,154 real videos, we
randomly subsample the fake videos so that the final
dataset is balanced, with 19,154 fake and 19,154
real videos. Subsequently, we split the dataset into
training and test sets. As the DFDC dataset is
provided in 50 parts, we perform a folder-wise split
to avoid mixing videos of the same actors across
splits, thus ensuring that the test videos do not
contain actors that appear in the training videos.
We use folders 0-39 for training, 40-44 for
validation, and 45-49 for testing, as sketched below.
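For illustration, a minimal sketch of the balanced subsampling and folder-wise split described above. The folder naming (dfdc_train_part_*) and the metadata.json label field follow the public DFDC release; the random seed is an assumption, as the paper does not specify one.

    import json, random
    from pathlib import Path

    random.seed(42)          # assumed seed; not specified in the paper
    ROOT = Path("dfdc")      # assumed local path holding the 50 DFDC parts

    real, fake = [], []
    for part in range(50):
        folder = ROOT / f"dfdc_train_part_{part}"
        meta = json.loads((folder / "metadata.json").read_text())
        for name, info in meta.items():
            (real if info["label"] == "REAL" else fake).append((part, folder / name))

    # Balance the classes by randomly subsampling the fake videos.
    fake = random.sample(fake, len(real))
    videos = real + fake

    # Folder-wise split: actors never cross split boundaries.
    train = [v for p, v in videos if p <= 39]
    val   = [v for p, v in videos if 40 <= p <= 44]
    test  = [v for p, v in videos if p >= 45]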
4.2 Performance Metrics
To evaluate the performance of our models, we
compute two metrics, namely accuracy and the area
under the ROC curve (AUC). Accuracy is the ratio of
correctly classified observations to the total
number of classified observations. AUC is a measure
used for comparing the performance of classifiers:
it is equal to the probability that the classifier
ranks a randomly sampled positive example higher
than a randomly sampled negative example.
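Both metrics can be computed directly with scikit-learn. A minimal sketch with toy values (y_true and y_score are illustrative names, not from the paper):

    from sklearn.metrics import accuracy_score, roc_auc_score

    y_true  = [0, 0, 1, 1]          # ground truth: 0 = real, 1 = fake (toy example)
    y_score = [0.1, 0.6, 0.8, 0.3]  # predicted probability of being fake

    acc = accuracy_score(y_true, [int(s >= 0.5) for s in y_score])
    auc = roc_auc_score(y_true, y_score)
    print(f"accuracy={acc:.2f}  AUC={auc:.2f}")   # accuracy=0.50  AUC=0.75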
4.3 Baseline Models
We compare the performance of the proposed archi-
tecture with that of the following approaches:
CapsuleNet: we use the Capsule Forensics approach
(Nguyen et al., 2018), which we refer to as
CapsuleNet. We configure the number of capsules as
in the backbone of our model, i.e. 10 capsules. The
input is a single frame and the output is the
probability of the frame being real or fake.
XceptionNet: the XceptionNet network (Rössler
et al., 2019) achieved the highest performance on
the benchmark datasets for deepfake detection. We
use the pre-trained model and replace the last layer
with a set of custom layers: a fully connected layer
(2048 to 512 units), followed by a final fully
connected layer that transforms the 512-dimensional
output into a scalar.
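A minimal sketch of this classification head in PyTorch; the timm backbone and the ReLU/Sigmoid choices are assumptions, as the paper does not name the implementation or the activations used.

    import torch.nn as nn
    import timm

    # Pretrained Xception backbone with the original classifier removed.
    backbone = timm.create_model("xception", pretrained=True, num_classes=0)

    model = nn.Sequential(
        backbone,               # 2048-dimensional feature vector
        nn.Linear(2048, 512),   # custom fully connected layer
        nn.ReLU(),              # assumed activation
        nn.Linear(512, 1),      # final fully connected layer: scalar output
        nn.Sigmoid(),           # assumed: probability of the frame being fake
    )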
4.4 Experiments
We carried out different experiments to evaluate the
impact of different frame selection strategies on
the performance of the proposed method, in
comparison with the methods described in Section
4.3, and to assess the influence that these
strategies have on the quality and generalization
capabilities of the trained models.
We compare the XceptionNet model with the Capsule
Network when they are trained using a single frame
selected from each video, to isolate the
contribution of spatial features alone to the
detection of fake videos. We also compare the
performance of XceptionNet and CapsuleNet when using
the frame-by-frame selection strategy (i.e.
Average): the models are trained on multiple frames
taken from each video to learn the spatial features,
and, in the test phase, the predictions on multiple
frames of a video are averaged to classify the test
video as real or fake. To train the spatio-temporal
model, i.e. CapsuleNet+LSTM, we deploy a
multiple-frame strategy: the model is trained on
sequences of frames taken from the training videos
to learn spatio-temporal features of the deepfake
inconsistencies. In the test phase, test sequences
are classified at once by the LSTM part of the
network. Both test-time strategies are sketched
below.
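A minimal sketch of the two test-time strategies in PyTorch; frame_model, capsule_encoder, and lstm_head (with its lstm and fc submodules) are hypothetical stand-ins for the actual models.

    import torch

    def predict_average(frame_model, frames):
        # Frame-by-frame strategy: average the per-frame fake probabilities.
        with torch.no_grad():
            probs = torch.stack([frame_model(f.unsqueeze(0)) for f in frames])
        return probs.mean().item()

    def predict_sequence(capsule_encoder, lstm_head, frames):
        # Spatio-temporal strategy: encode each frame with the capsule
        # backbone, then classify the whole sequence with the LSTM at once.
        with torch.no_grad():
            feats = torch.stack([capsule_encoder(f.unsqueeze(0)).squeeze(0)
                                 for f in frames])          # (T, D)
            out, _ = lstm_head.lstm(feats.unsqueeze(0))     # (1, T, H)
            prob = torch.sigmoid(lstm_head.fc(out[:, -1]))  # last hidden state
        return prob.item()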
We performed further experiments to provide insights
into the regions of the frames that the networks
focus on to perform the classification. We fed the
CapsuleNet+LSTM model sequences extracted from real
and deepfake videos and visualized the activation
maps of the capsule units using the open-source
Grad-CAM tool (Ozbulak, 2019).
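A minimal Grad-CAM sketch using PyTorch hooks; target_layer is a hypothetical convolutional layer inside the capsule backbone, and the details may differ from the tool of (Ozbulak, 2019).

    import torch
    import torch.nn.functional as F

    def grad_cam(model, target_layer, x):
        # Capture activations and gradients of the target layer via hooks.
        acts, grads = {}, {}
        h1 = target_layer.register_forward_hook(
            lambda m, i, o: acts.update(v=o))
        h2 = target_layer.register_full_backward_hook(
            lambda m, gi, go: grads.update(v=go[0]))

        score = model(x)         # scalar fake probability for the input batch
        model.zero_grad()
        score.sum().backward()   # gradients w.r.t. the target activations
        h1.remove(); h2.remove()

        w = grads["v"].mean(dim=(2, 3), keepdim=True)  # per-channel weights
        cam = F.relu((w * acts["v"]).sum(dim=1))       # weighted sum + ReLU
        cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
        return cam               # (N, H, W) map; upsample to overlay on frames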