Al Aghbari, Z. (2020). Simultaneous prediction of
valence/arousal and emotions on affectnet, aff-wild and
afew-va. Procedia Computer Science, 170:634–641.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Huang, R., Pedoeem, J., and Chen, C. (2018). Yolo-lite:
a real-time object detection algorithm optimized for
non-gpu computers. In 2018 IEEE International Con-
ference on Big Data (Big Data), pages 2503–2510.
IEEE.
Kim, C., Li, F., and Rehg, J. M. (2018). Multi-object tracking
with neural gating using bilinear lstm. In The European
Conference on Computer Vision (ECCV).
Kollias, D., Schulc, A., Hajiyev, E., and Zafeiriou, S. (2020).
Analysing affective behavior in the first abaw 2020
competition. arXiv preprint arXiv:2001.11409.
Kollias, D., Tzirakis, P., Nicolaou, M. A., Papaioannou,
A., Zhao, G., Schuller, B., Kotsia, I., and Zafeiriou,
S. (2019). Deep affect prediction in-the-wild: Aff-
wild database and challenge, deep architectures, and
beyond. International Journal of Computer Vision,
127(6-7):907–929.
Kollias, D. and Zafeiriou, S. (2019). Expression, affect,
action unit recognition: Aff-wild2, multi-task learning
and arcface. arXiv preprint arXiv:1910.04855.
Kossaifi, J., Tzimiropoulos, G., Todorovic, S., and Pantic,
M. (2017). Afew-va database for valence and arousal
estimation in-the-wild. Image and Vision Computing,
65:23–36.
Kossaifi, J., Walecki, R., Panagakis, Y., Shen, J., Schmitt, M.,
Ringeval, F., Han, J., Pandit, V., Schuller, B., Star, K.,
et al. (2019). Sewa db: A rich database for audio-visual
emotion and sentiment research in the wild. arXiv
preprint arXiv:1901.02839.
Li, C., Bao, Z., Li, L., and Zhao, Z. (2020). Explor-
ing temporal representations by leveraging attention-
based bidirectional lstm-rnns for multi-modal emotion
recognition. Information Processing & Management,
57(3):102185.
Liu, C., Conn, K., Sarkar, N., and Stone, W. (2008). Online
affect detection and robot behavior adaptation for inter-
vention of children with autism. IEEE Transactions on Robotics, 24:883–896.
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Ef-
fective approaches to attention-based neural machine
translation. arXiv preprint arXiv:1508.04025.
Lv, J.-J., Shao, X., Xing, J., Cheng, C., and Zhou, X.
(2017). A deep regression architecture with two-stage
re-initialization for high performance facial landmark
detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3691–3700.
Ma, J., Tang, H., Zheng, W.-L., and Lu, B.-L. (2019). Emo-
tion recognition using multimodal residual lstm net-
work. In Proceedings of the 27th ACM International
Conference on Multimedia, pages 176–183.
McKeown, G., Valstar, M. F., Cowie, R., and Pantic, M.
(2010). The semaine corpus of emotionally coloured
character interactions. In 2010 IEEE International Conference on Multimedia and Expo (ICME), pages 1079–1084. IEEE.
Mitenkova, A., Kossaifi, J., Panagakis, Y., and Pantic, M.
(2019). Valence and arousal estimation in-the-wild
with tensor methods. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–7. IEEE.
Mollahosseini, A., Hasani, B., and Mahoor, M. H. (2017). Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31.
Nicolaou, M. A., Gunes, H., and Pantic, M. (2011). Con-
tinuous prediction of spontaneous affect from multiple
cues and modalities in valence-arousal space. IEEE Transactions on Affective Computing, 2(2):92–105.
Povolny, F., Matejka, P., Hradis, M., Popková, A., Otrusina,
L., Smrz, P., Wood, I., Robin, C., and Lamel, L. (2016).
Multimodal emotion recognition for avec 2016 chal-
lenge. In Proceedings of the 6th International Work-
shop on Audio/Visual Emotion Challenge, AVEC ’16,
page 75–82, New York, NY, USA. Association for
Computing Machinery.
Ringeval, F., Sonderegger, A., Sauer, J., and Lalanne, D.
(2013). Introducing the recola multimodal corpus of
remote collaborative and affective interactions. In 2013
10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8.
Russell, J. A. (1980). A circumplex model of affect. Journal
of Personality and Social Psychology, 39(6):1161.
Schmitt, M., Cummins, N., and Schuller, B. (2019). Continuous emotion recognition in speech – do we need recurrence? In Proc. Interspeech 2019.
Tellamekala, M. K. and Valstar, M. (2019). Temporally
coherent visual representations for dimensional affect
recognition. In 2019 8th International Conference on
Affective Computing and Intelligent Interaction (ACII),
pages 1–7. IEEE.
Triantafyllidou, D. and Tefas, A. (2016). Face detection
based on deep convolutional neural networks exploit-
ing incremental facial part learning. In 2016 23rd In-
ternational Conference on Pattern Recognition (ICPR),
pages 3560–3565.
Xia, Y., Braun, S., Reddy, C. K. A., Dubey, H., Cutler,
R., and Tashev, I. (2020). Weighted speech distortion
losses for neural-network-based real-time speech en-
hancement. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 871–875.
Xiaohua, W., Muzi, P., Lijuan, P., Min, H., Chunhua, J.,
and Fuji, R. (2019). Two-level attention with two-
stage multi-task learning for facial emotion recogni-
tion. Journal of Visual Communication and Image
Representation, 62:217–225.
Xie, J., Girshick, R. B., and Farhadi, A. (2016). Deep3d:
Fully automatic 2d-to-3d video conversion with deep
convolutional neural networks. In ECCV 2016, pages
842–857.
Ye, H., Li, G. Y., Juang, B.-H. F., and Sivanesan, K. (2018).
Channel agnostic end-to-end learning based commu-
nication systems with conditional gan. In 2018 IEEE
Globecom Workshops (GC Wkshps), pages 1–5. IEEE.
Zafeiriou, S., Kollias, D., Nicolaou, M. A., Papaioannou, A.,
Zhao, G., and Kotsia, I. (2017). Aff-wild: Valence and
arousal ‘in-the-wild’ challenge. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1980–1987. IEEE.