Facial Expression Recognition for Traumatic Brain Injured Patients
Chaudhary Muhammad Aqdus Ilyas, Mohammad A. Haque, Matthias Rehm,
Kamal Nasrollahi and Thomas B. Moeslund
Visual Analysis of People Laboratory and Interaction Laboratory, Aalborg University (AAU), Aalborg, Denmark
Keywords:
Computer Vision, Face Detection, Facial Landmarks, Facial Expressions, Convolutional Neural Networks,
Long Short-Term Memory, Traumatic Brain Injured Patients.
Abstract:
In this paper, we investigate the issues associated with facial expression recognition of Traumatic Brain Injured
(TBI) patients in a realistic scenario. These patients have restricted or limited muscle movements and reduced
facial expressions, along with non-cooperative behavior, impaired reasoning and inappropriate responses. All
these factors make automatic understanding of their expressions more complex. While existing facial
expression recognition systems have shown high accuracy on data from healthy subjects, their performance
is yet to be proved on real TBI patient data under the aforementioned challenges. To deal with this,
we devised scenarios for data collection from real TBI patients, collected data that is very challenging
to process, devised an effective way of data preprocessing so that good-quality faces can be extracted from the
patients' facial videos for expression analysis, and finally employed a state-of-the-art deep learning framework
to exploit the spatio-temporal information of facial video frames in expression analysis. The experimental results
confirm the difficulty of processing real TBI patient data, while showing that better face quality ensures
better performance in this case.
1 INTRODUCTION
Facial expression is one of the main sources of communication for human emotions, as approximately
55 percent of human communication happens through facial expressions (Mehrabian, 1968). Computer
vision techniques have been developed to extract facial features and use them for different purposes
(Mathias et al., 2014) (Klonovs et al., 2016), for example for assessing mental states (Hyett et al., 2016)
(Chen, 2011), health indicators (Li et al., 2012), and various physiological parameters such as heartbeat
rate, fatigue, blood pressure and respiratory rate (Haque et al., 2016). Among these, automatic detection
of facial expressions is a subject of high importance due to its applications in many fields such as biometrics,
forensics, medical diagnosis, monitoring, defence and surveillance (Ekman and Friesen, 1971)
(Mathias et al., 2014) (Du and Martinez, 2015) (Hyett et al., 2016) (Chen, 2011) (Li et al., 2012)
(Haque et al., 2015a) (Li and Jain, 2011). Therefore, researchers are putting great emphasis on the development
of accurate and robust Facial Expression Recognition (FER) systems, and a vast body of literature has been
produced on this topic in the past decade.
The existing FER systems can be broadly categorized according to their feature extraction methods
(Tian et al., 2001) and the classification techniques used. The most widely used methods for facial feature
extraction are geometric feature-based methods, appearance-based methods and hybrid ones (Pantic and
Patras, 2006) (Jiang et al., 2014). Geometry-based feature extraction methods use the geometric shape and
position of facial parts such as the lips, nose, eyebrows and mouth, together with temporal information such
as the movement of facial feature points from the previous frame to the current frame (Ghimire and Lee, 2013)
(Haque et al., 2014). Geometric features are resistant to illumination variation, and non-frontal head postures
can be handled by normalizing the face to a frontal pose and extracting the features from distances between
fiducial points (Poursaberi et al., 2012) (Anwar Saeed and Elzobi, 2014). Appearance-based methods use the
texture information of facial images (Lyons et al., 1999) (li Tian, 2004). Hybrid feature extraction methods
deploy both geometric and appearance-based approaches for facial image representation (Poursaberi et al., 2012).
FER systems can be further divided on the basis of the classification approaches applied to the extracted
facial features. For example, Ghimire et al. proposed
an approach in which both appearance and geometric features are used for facial expression recognition
and a Support Vector Machine (SVM) for classification (Ghimire et al., 2017). Researchers have used
Local Binary Patterns (LBP) in (Uddin et al., 2017) (Zhao and Zhang, 2011), Histograms of Oriented
Gradients (HoG) in (Ghimire et al., 2017), Linear Discriminant Analysis (LDA) in (Uddin and Hassan, 2015)
(Uddin et al., 2017) (Zhao and Zhang, 2011), wavelet-based approaches in (Palestra et al., 2015)
(Yan et al., 2014) (Poursaberi et al., 2012), and Non-Negative Matrix Factorization (NMF) and Discriminant NMF
in (de Vries et al., 2015) (Ravichander et al., 2016). Lajevardi and Hussain proposed an investigative
analysis of feature extraction and selection models for automatic FER based on the AdaBoost algorithm
followed by Gabor filters, log-Gabor filters, LBP and Higher-order Local Autocorrelation (HLAC), which is
then further modified by applying HLAC-like features (HLACLF) (Lajevardi and Hussain, 2010). Similarly,
(Ghimire and Lee, 2013) proposed a temporal FER method by tracking the facial feature points and classifying
them using multi-class AdaBoost and SVM. In (Poursaberi et al., 2012), geometric distances between specific
fiducial points are determined for FER. Researchers in (Palestra et al., 2015) (Li et al., 2012) (Pantic
and Patras, 2006) (Kotsia and Pitas, 2007) used SVM for accurate classification, whereas the authors in
(Uddin and Hassan, 2015) used Hidden Markov Models (HMM). SVM shows better results when facial expressions
are recognized from a single frame, whereas HMM produces better results for sequences of images. This is
not a set rule, however, as some authors have combined different techniques and produced results comparable
to state-of-the-art methods.
In recent years, more and more researchers have moved towards deep learning techniques for fast,
accurate and robust FER. The authors in (Farfade et al., 2015) (Bellantonio et al., 2017) (Triantafyllidou and
Tefas, 2016) applied Deep Convolutional Neural Networks (DCNN) for the classification of features into
expressions and achieved appreciable results. Yoshihara et al. proposed a feature point detection method for
quantitative analysis of facial paralysis using a DCNN (Yoshihara et al., 2016), where an Active Appearance
Model (AAM) provides the initial feature points as input to the DCNN for fine tuning. The Deep Belief Network
(DBN) is another widely used method for robust FER. Kharghanian et al. (Kharghanian et al., 2016)
used a DBN for pain assessment from facial expressions, where features were extracted with the help of
Convolutional Deep Belief Networks (CDBN) to identify pain. Like (Haque et al., 2017), it was tested
on the publicly available UNBC-McMaster Shoulder Pain database with 95 percent accuracy. However,
these existing FER methods developed on healthy people, as used in (Uddin et al., 2017) (Pantic and
Patras, 2006) (Kharghanian et al., 2016), are not suitable when applied to real patients in a real scenario.
Recently, (Rodriguez et al., 2017) proposed a pain assessment system with FER, where a CNN is used to
learn facial features from VGG-Faces and is then linked to a Long Short-Term Memory (LSTM) network to take
advantage of the temporal relations between video frames. This method was further improved by (Bellantonio
et al., 2017) by feeding super-resolved facial frames to the CNN+LSTM architecture. The systems of (Rodriguez
et al., 2017) and (Bellantonio et al., 2017) work well for extracting facial expressions and interpreting
them as social signals for healthy people. However, the performance of those systems is yet to be tested on
datasets collected in real patient scenarios, such as Traumatic Brain Injured (TBI) patients in a care-giving
center. This is mainly because these patients' behavior might be very non-cooperative and non-compliant,
and they can exhibit agitation, confusion, loud verbalization, physical aggression, disinhibition,
impaired reasoning, poor concentration and judgment, and mental inflexibility (Lauterbach et al., 2015).
Brain injured patients may also have reduced expressions such as smiling, laughing, crying, anger or sadness,
or their responses may be inappropriate. On the contrary, some TBI patients exhibit extreme responses such
as sudden tears, anger outbursts or laughter, all due to a partial loss of the ability to control their
emotions. This raises the question of whether state-of-the-art FER systems, such as (Rodriguez et al., 2017)
and (Bellantonio et al., 2017), will be reliable when working with these patients' data.
The main issue is that these systems require facial images that are of good quality and well posed towards the
camera. However, due to the mentioned issues, TBI patients cannot always face the camera and their
facial images are not of good quality with certainty, for example due to rapid changes in head pose. To
deal with these difficulties, we equip the state-of-the-art FER system of (Rodriguez et al., 2017) with a Face
Quality Assessment (FQA) step that discards most of the faces that are not useful for the FER system and
feeds it only with faces that are of better quality than the other facial images. We
have tested the proposed system on real data of TBI patients, collected in a neurocenter in
which these patients are taken care of. To the best of our knowledge, no previous work has used computer
vision techniques to understand the facial expressions of TBI patients. Therefore, this
work presents a novel experience in this regard and
opens up possibilities for enhancing social communication between patients and care-givers.
The rest of this paper is organized as follows. Section 2 describes the proposed methodology for facial
feature extraction and recognition of expressions. Section 3 presents the results obtained from the
experiments. Finally, Section 4 concludes the paper.
2 THE PROPOSED METHOD
This section describes the architecture of the proposed method for FER analysis in a real patient scenario.
The block diagram of the proposed method is illustrated in Figure 1. Following (Rodriguez et al., 2017),
in the first step the face is detected from an input video. In order to reduce erroneous face detections,
we employ a face alignment approach by detecting facial landmarks. The detected landmarks are tracked
and faces are cropped according to the landmark positions. In the next step, face quality is assessed
following (Haque et al., 2015b) and only good-quality faces are stored in the face log. The faces are then fed to
a CNN that was pre-trained on VGG-16 faces, as used by (Bellantonio et al., 2017) and (Rodriguez et al., 2017).
These steps of the system are further explained in the following subsections.
2.1 Data Acquisition and Preprocessing
The subjects are filmed by an Axis RGB-Q16 camera at resolutions from 1280 × 960 down to 160 × 90 pixels
at 30 fps (frames per second). These images are fed to a facial image acquisition system which consists
of three steps: face detection, face quality assessment and face logging. The first step is face detection
in the video frames, for which we used the well-known VJ (Viola and Jones) face detector (Viola and
Jones, 2001). We selected this method due to its speed and moderately high accuracy based on Haar-like
features. The method constructs a classifier with the help of an AdaBoost-based learning algorithm, which
effectively classifies image regions on the basis of a few critical features from a large set and discards
background regions by cascading. However, it is prone to erroneous detections when the face quality is low
in terms of occlusion or pose variation. Moreover, while most FER databases contain near-frontal head poses
in good-quality images with very few occlusions (no spectacles, hand gestures covering the mouth, etc.),
our subjects are TBI patients who are not cooperative enough to ensure good facial data capture. There is
therefore a high possibility of non-frontal views and continuous pose variations, resulting in low-quality
images and, consequently, a large number of misdetected faces, as shown in Figure 2. In addition, due to an
inability to recognize and respond appropriately to non-verbal cues, TBI patients give feeble responses
(Bird and Parente, 2014), which in turn increases the complexity of data collection. Thus, instead of
detecting the face in every single frame of a video, we employ a face alignment method on a properly
detected face frame and then track the facial landmarks in the subsequent frames. This reduces the
possibility of erroneous VJ detections in subsequent video frames, as the face is tracked instead of being
detected again and again in the video sequence.
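As a concrete illustration of this detect-once-then-track strategy, the sketch below uses OpenCV's Haar-cascade (VJ) detector for the initial detection and pyramidal Lucas-Kanade optical flow as an accessible stand-in for the landmark tracking; the cascade file and the tracking routine are illustrative choices, not the exact implementation used in the paper.

```python
import cv2
import numpy as np

# Haar-cascade (Viola-Jones) face detector shipped with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(gray):
    """Run the VJ detector once; return the first face box (x, y, w, h) or None."""
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return boxes[0] if len(boxes) else None

def track_points(prev_gray, gray, points):
    """Track facial points between consecutive frames instead of re-detecting."""
    pts = points.reshape(-1, 1, 2).astype(np.float32)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    return new_pts[status.ravel() == 1].reshape(-1, 2)
```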
Face alignment is the process of localizing inner facial structures, such as the apex of the nose or the
curve of the eye, by using predefined landmarks that help in better enrolment of the face. Such landmarking
also helps in the speedy extraction of geometric structures as well as additional strong local characteristics.
Regression-based facial landmarking methods have contributed towards automatic face alignment. One of the
most effective approaches is the Supervised Descent Method (SDM) (Xiong and la Torre, 2013). In SDM, 49
facial landmarks are placed around the eye corners, nose line, lips and eyebrows. In addition, SDM uses
small optical flow vectors and a pixel-by-pixel neighbourhood measure, avoiding window-based point tracing.
This provides high computational efficiency, and more stable and precise tracking over long sequences of
video frames, as demonstrated by (Haque et al., 2015b). Thus, we employ SDM-based face alignment in the
proposed FER method. The steps of face alignment in a video are shown in Figure 3. The face is first detected
in a video frame by an off-the-shelf face detector (VJ in this case) and the facial landmarks are then
identified in that frame. Instead of detecting the face in the subsequent video frames, those landmarks are
tracked. The performance of the SDM-based approach over mere VJ detection is evaluated in the experimental
results section. Using the landmarks, we find the face boundary and then crop the faces. The faces are
then forwarded to the next step.
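A minimal sketch of the landmark-based cropping step follows; the margin added around the landmark bounding box is an assumed value for illustration, not a parameter reported in the paper.

```python
import numpy as np

def crop_face_from_landmarks(frame, landmarks, margin=0.25):
    """Crop the face region defined by the tracked landmarks plus a margin."""
    x_min, y_min = landmarks.min(axis=0)
    x_max, y_max = landmarks.max(axis=0)
    pad_x = margin * (x_max - x_min)      # widen the box so the whole face fits
    pad_y = margin * (y_max - y_min)
    h, w = frame.shape[:2]
    x0, y0 = max(int(x_min - pad_x), 0), max(int(y_min - pad_y), 0)
    x1, y1 = min(int(x_max + pad_x), w), min(int(y_max + pad_y), h)
    return frame[y0:y1, x0:x1]
```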
2.2 Face Quality Assessment
System performance for FER is highly dependent on the quality of the facial images. In practice, for the TBI
patient dataset there is a high possibility of non-frontal face views and continuous pose variations,
resulting in low-quality images even though the faces are tracked by the SDM.

Figure 1: Block diagram of the Facial Expression Recognition system based on the CNN+LSTM model to exploit
spatio-temporal information.

Figure 2: Misdetection of faces by the VJ face detector due to occlusion or high pose variation.

Figure 4 shows the case of an occluded face (which of course means low quality) for a video sequence where
the average pixel intensities vary due to the presence and absence of occlusion
over time. To avoid such problems, we employ an FQA technique on the faces cropped after SDM. This is
accomplished by measuring face quality metrics such as image resolution, sharpness, and face rotation, as
shown in (Haque et al., 2013). Before logging facial frames into the final face log for FER, low-quality face
frames are identified by setting the first frame as a reference frame and comparing the similarity of the
remaining frames in a particular video event, following (Irani et al., 2016). The similarity of frames is
calculated by the following equation:
$$S_{Clr} = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N}\left(A_{mn}-\bar{A}\right)\left(B_{mn}-\bar{B}\right)}{\sqrt{\left(\sum_{m=1}^{M}\sum_{n=1}^{N}\left(A_{mn}-\bar{A}\right)^{2}\right)\left(\sum_{m=1}^{M}\sum_{n=1}^{N}\left(B_{mn}-\bar{B}\right)^{2}\right)}} \quad (1)$$
In the above equation, $A$ is the reference face frame and $B$ is the current frame, whereas $\bar{A}$ and
$\bar{B}$ are their average pixel levels. $M$ and $N$ are the numbers of rows and columns in the image matrix.
The degree of dissimilarity derived from this similarity measure forms the basis of the face quality score:
the greater the dissimilarity, the higher the possibility of a low-quality face.
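A minimal NumPy sketch of Eq. (1) is given below, assuming grayscale face frames of equal size; how the resulting score is thresholded is described in the next subsection.

```python
import numpy as np

def frame_similarity(A, B):
    """2D correlation between reference frame A and current frame B (Eq. 1)."""
    A = A.astype(np.float64)
    B = B.astype(np.float64)
    dA = A - A.mean()                      # (A_mn - A_bar)
    dB = B - B.mean()                      # (B_mn - B_bar)
    den = np.sqrt((dA ** 2).sum() * (dB ** 2).sum())
    return (dA * dB).sum() / den if den > 0 else 0.0
```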
2.3 Face Logging
In this step, the faces obtained after SDM tracking are considered along with their associated quality scores.
If the score is lower than a predefined threshold, the face is discarded before logging. Once the quality
of the face is ensured, the images are cropped to the common input size of the neural network (224 × 224
pixels in our experiment) and are then ready to be fed to the deep learning architecture.
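The sketch below illustrates the logging step under the assumption of a hypothetical quality threshold; the actual threshold used in the experiments is not specified here.

```python
import cv2

QUALITY_THRESHOLD = 0.8    # hypothetical value, tuned per dataset
INPUT_SIZE = (224, 224)    # CNN input size used in our experiments

def build_face_log(faces, scores):
    """Keep only faces whose quality score passes the threshold, resized for the CNN."""
    face_log = []
    for face, score in zip(faces, scores):
        if score < QUALITY_THRESHOLD:
            continue                        # low-quality face: discard before logging
        face_log.append(cv2.resize(face, INPUT_SIZE))
    return face_log
```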
Figure 3: Facial landmark identification and tracking in Supervised Descent Method (SDM).
Figure 4: Depiction of varying pixel intensities due to the presence and absence of occlusion over time.
a) shows the example face frames and b) shows the variation in pixel intensities over time.
2.4 The CNN+LSTM based Deep
Learning Architecture for FER
Convolutional neural networks are a specialized class of neural networks with multiple input and output
layers that exploit the local features in an image to obtain visual information. A CNN contains multiple
layers for convolution and padding. A typical 2-Dimensional (2D) CNN takes 2D images as input and treats
each image as an n × n matrix. Generally, the parameters of the CNN are randomly initialized and learned by
performing gradient descent using a back-propagation algorithm. The network uses a convolution operator to
implement a filter vector. The output of the first convolution is a new image, which is passed through
another convolution with a new filter. This procedure continues until the most suitable feature vector
elements $\{V_1, V_2, \ldots, V_n\}$ are found. Convolutional layers are normally alternated with another
type of layer, called a pooling layer, whose function is to reduce the size of the input in order to reduce
the spatial dimensions, gain computational performance and obtain translation invariance (Noroozi et al.,
2017). CNNs have performed remarkably well in face recognition (Ji et al., 2013) as well as automatic face
detection (Farfade et al., 2015). In order to take advantage of these good results for FER, we apply this
method to the TBI patient data to extract facial features relevant to FER.
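To make the convolution and pooling operations described above concrete, the toy NumPy sketch below applies a single filter and a 2 × 2 max-pooling step; it is purely illustrative and not part of the actual VGG-16 network used later.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution of a grayscale image with a single filter."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling that reduces the spatial dimensions."""
    H = fmap.shape[0] - fmap.shape[0] % size
    W = fmap.shape[1] - fmap.shape[1] % size
    return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))
```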
In general, CNN deals with images that are isola-
ted. However, in our case we use sequences of images ordered in time and thus have the notion of temporal
information as well. To exploit the temporal information associated with facial expressions in video, we use
an implementation of a Recurrent Neural Network (RNN) capable of absorbing sequential information, namely
the LSTM model from (Rodriguez et al., 2017). The LSTM state is controlled by three gates associated with
the forget ($f$), input ($i$), and output ($o$) states. These gates control the flow of information through
the model by using point-wise multiplications and sigmoid functions $\sigma$, which bound the information
flow between zero and one, as follows:
$$i(t) = \sigma\left(W^{(xi)} x(t) + W^{(hi)} h(t-1) + b^{(1i)}\right) \quad (2)$$
$$f(t) = \sigma\left(W^{(xf)} x(t) + W^{(hf)} h(t-1) + b^{(1f)}\right) \quad (3)$$
$$z(t) = \tanh\left(W^{(xc)} x(t) + W^{(hc)} h(t-1) + b^{(1c)}\right) \quad (4)$$
$$c(t) = f(t)\, c(t-1) + i(t)\, z(t) \quad (5)$$
$$o(t) = \sigma\left(W^{(xo)} x(t) + W^{(ho)} h(t-1) + b^{(1o)}\right) \quad (6)$$
$$h(t) = o(t) \tanh(c(t)) \quad (7)$$
where $z(t)$ is the input to the cell at time $t$, $c$ is the cell state, and $h$ is the output; $W^{(xy)}$
denotes the weights from $x$ to $y$.
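For clarity, a NumPy sketch of one LSTM time step implementing Eqs. (2)-(7) is shown below; the weight and bias dictionaries are assumed to be learned elsewhere and are only placeholders here.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps gate names ('xi', 'hi', ...) to weight matrices."""
    i = sigmoid(W["xi"] @ x + W["hi"] @ h_prev + b["i"])   # Eq. (2), input gate
    f = sigmoid(W["xf"] @ x + W["hf"] @ h_prev + b["f"])   # Eq. (3), forget gate
    z = np.tanh(W["xc"] @ x + W["hc"] @ h_prev + b["c"])   # Eq. (4), cell input
    c = f * c_prev + i * z                                  # Eq. (5), cell state
    o = sigmoid(W["xo"] @ x + W["ho"] @ h_prev + b["o"])   # Eq. (6), output gate
    h = o * np.tanh(c)                                      # Eq. (7), hidden output
    return h, c
```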
In this paper, we use a combination of a CNN and an LSTM, where the CNN extracts facial features from the
faces logged from the TBI patients' videos and the LSTM finds temporal correlations based on those features.
A schematic diagram of the CNN+LSTM is shown on the right-hand side of Figure 1 and more details can be
found in (Rodriguez et al., 2017). We used an off-the-shelf fine-tuned version of the VGG-16 CNN model
(Parkhi et al., 2015), pre-trained on faces, for spatial feature extraction. We obtained the features of the
fc7 layer of the CNN (VGG-16) and then used them as input to the LSTM to obtain hybrid deep learning
performance with CNN+LSTM. The implementation of the CNN+LSTM is available online through (Rodriguez
et al., 2017).
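A hedged PyTorch sketch of this CNN+LSTM combination is given below. It uses the torchvision VGG-16 architecture as a stand-in (the paper relies on VGG-Face weights and the implementation released by (Rodriguez et al., 2017)); the fc7 slicing point, hidden size and number of expression classes are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CnnLstmFER(nn.Module):
    def __init__(self, hidden_size=256, num_classes=3):
        super().__init__()
        backbone = vgg16()                     # VGG-Face weights would be loaded here
        self.features = backbone.features
        self.avgpool = backbone.avgpool
        # Keep the classifier up to the second ReLU, i.e. the 4096-dim fc7 output.
        self.fc7 = nn.Sequential(*list(backbone.classifier.children())[:5])
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                  # clips: (batch, time, 3, 224, 224)
        b, t = clips.shape[:2]
        x = clips.flatten(0, 1)                # merge batch and time for the CNN
        x = self.avgpool(self.features(x)).flatten(1)
        x = self.fc7(x)                        # 4096-dim fc7 features per frame
        x, _ = self.lstm(x.view(b, t, -1))     # temporal modelling over the clip
        return self.classifier(x[:, -1])       # predict from the last time step
```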
3 EXPERIMENTAL RESULTS
In this section, we first describe the database captured and used during our investigation. We then
demonstrate and comment on the results.
3.1 The Database
In order to perform FER experiments on TBI patient data, we require a database. However, to the best of
our knowledge, there is no publicly available facial video database of real TBI patients. In establishing
a database, the first task was the identification of data collection methods. Most TBI patients have a
varying ability to identify and respond to non-verbal expressions of emotion (Bird and Parente, 2014).
After visiting different neurocenters and care homes around Denmark where TBI patients are provided with
rehabilitation facilities, and consulting with experts and care-givers who are in direct contact with TBI
patients, we finalized three uniform scenarios for data collection from all the patients under observation.
The uniformity in data collection is maintained in order to have reliable data for future use. The scenarios
are: a) cognitive rehabilitation therapy, b) physiotherapy, and c) social communication with other residents
of the neurocenter. In cognitive therapy, a TBI patient plays a game or mind quiz in order to judge how much
thinking or cognitive ability the particular subject possesses; further data elicitation is organized on the
basis of this activity. In the second activity, physiotherapy, the subject's stress level and fatigue are
determined. The last activity, where TBI patients interact with other patients and care-givers, provides
insight into the patient's ability to give and perceive communication signals.
In contrast to healthy people, TBI patients have intolerance, rapid mood swings accompanied by anger or
tear bursts, low concentration and impaired facial emotion recognition. Considering these challenges, the
collection of data, particularly facial videos, is not a trivial task, as most patients do not keep their
face position still. Even when they do, it is still not easy to understand their emotions for other reasons:
they mostly show sad or depressed emotions in their post-traumatic life. However, experts who have dealt
with TBI patients over a certain period of time are able to annotate the patients' emotional status as
neutral or normal expression. Another problem is that the patients become agitated very quickly, so it was a
big task to involve them in the aforementioned three activities. For this purpose, and to obtain clear and
precise emotion recognition, we devised a game in such a way that we intentionally let the patients win in
order to see their happy expressions. Similarly, to have their heads posed towards the camera, a tablet
displaying emotional scenes was placed parallel to the camera recording their facial expressions, and similar
adjustments were made in the other activities during recording. One interesting observation is that all the
TBI patients took a deep interest in the mind game and in the movie or picture illustrations, regardless of
the nature of their disability. This allowed us to collect more neutral, happy and angry expressions.
However, we could not collect many expressions of sadness, surprise and fatigue due to non-cooperation,
traumatic disabilities and other social and technical issues.
We collected data in multiple phases over a number of sessions. In total, we obtained 539 video sequences
(one sequence corresponds to one expression event) with variable lengths (1-5 seconds). However, we observe
that the data is highly imbalanced, as 463 of the 539 events show a neutral expression. In other words, out
of approximately 20,000 frames, almost 14,000 represent neutral expressions. Among the others, 108 events
(approx. 3,300 frames) are happy, 72 events (approx. 2,200 frames) are angry, and very few show other
expressions. Furthermore, most of the sequences contain considerable head motion, making the data even more
challenging to process.
3.2 Performance Evaluation
In this section, we first demonstrate the impact of employing the SDM-based face aligner and tracker over
the VJ face detector. Table 1 shows the amount of erroneous face detections in the video frames. From the
results, we observe that FQA removed 2429 erroneous faces out of 27689 when using VJ, meaning that 8.67
percent of the VJ detections were not correct. On the other hand, when the FQA technique is employed on faces
detected by SDM, 4.46 percent of the facial frames were not detected correctly, as FQA discarded 1128 out of
25289 frames. Comparing both results, SDM-based detection using alignment and tracking provided better
accuracy in finding the right faces.
Table 1: The performance of SDM-based face alignment and tracking to extract faces from the video frames in
comparison to the basic VJ face detector.
Number of Frames VJ SDM
Total no. of frames 27689 25289
Training frames 22082 20403
Testing frames 5607 4886
Total mis-detection 2429 1128
Percentage Error 8.67 % 4.46 %
Table 2 shows the accuracy of FER in terms of AUC for the two scenarios while varying the number of epochs
in the LSTM. The number of epochs of the CNN-LSTM system is gradually increased in steps of 5, from 5 to 50,
keeping other parameters such as RHO, recurrent depth and drop-out probability constant. From the results,
we observe that the accuracy of the VJ-based CNN-LSTM system increases with the gradual increase in epochs
up to 25 epochs, reaching 76.94 percent at the 25th epoch. At the 30th epoch its value drops to 67.31
percent, but it strangely jumps to 75.26 percent at higher numbers of epochs.
Table 3 shows the effect of changing the RHO value for three scenarios. From the results, we observe that
the SDM-based approach reaches a maximum AUC value of 75.26 percent. The RHO value is gradually changed in
steps of 2, from 1 to 11, which means providing more temporal information for FER, while keeping the number
of epochs constant. The AUC values show that the VJ-based approach exhibits slightly higher accuracy as the
temporal information increases. In contrast, the SDM-based approach keeps its accuracy above 70 percent in
all steps, with a maximum value of 73.38 percent and a minimum value of 70.17 percent. A similar uphill
trend is observed up to RHO 5, after which a slight decline is observed.
It is evident from the experimental results on the TBI patient data that, despite the challenging dataset,
the accuracy of the system is increased to a certain extent.
Table 2: AUC results for FER of TBI patient data with a gradual increase in epoch values.
Area Under Curve (AUC)
Epochs    Viola Jones    SDM
10 66.37 69.49
15 69.55 72.03
20 75.42 63.21
25 76.94 72.96
30 67.31 75.26
35 75.76 72.35
40 63.03 73.38
45 67.08 71.81
50 68.63 74.85
Table 3: AUC results for FER of TBI patient data with a gradual increase in RHO values.
Area Under Curve (AUC)
RHO Value    Full Frames    Viola Jones    SDM
1 51.43 61.18 70.17
3 54.26 62.08 72.09
5 53.21 63.03 73.38
7 59.27 63.57 72.27
9 57.12 64.5 72.83
11 59.17 63.29 71.09
4 CONCLUSIONS
In this paper, we pointed out the rationale for investigating a facial expression analysis system using data
obtained from real TBI patients. The study reveals the challenges associated with real-world scenarios
involving patients, instead of the healthy volunteers used in previous works. We captured data from TBI
patients in a neurocenter and extracted faces from the video frames by employing different methods in order
to find the most effective one. We then fed the cropped faces into a CNN+LSTM-based deep learning framework
to exploit spatio-temporal information and detect the patients' mental status in terms of facial expressions.
The results were demonstrated with different spatio-temporal parameters of the system. They showed that the
facial information obtained from the patients varies in such a way that it is hard to predict the expression
with high accuracy. Moreover, we observed a strong effect of employing an effective face detection method
with face quality assessment for FER. As a note for future work, further processing such as face
frontalization, a larger dataset for training and the incorporation of subject-specific knowledge bases
might be useful in improving the performance.
REFERENCES
Anwar Saeed, Ayoub Al-Hamadi, R. N. and Elzobi, M.
(2014). Frame-based facial expression recognition
using geometrical features. Advances in Human-
Computer Interaction, 2014:1–13.
Bellantonio, M., Haque, M. A., Rodriguez, P., Nasrollahi,
K., Telve, T., Escalera, S., Gonzalez, J., Moeslund,
T. B., Rasti, P., and Anbarjafari, G. (2017). Spatio-
temporal Pain Recognition in CNN-Based Super-
Resolved Facial Images, pages 151–162. Springer In-
ternational Publishing, Cham.
Bird, J. and Parente, R. (2014). Recognition of nonverbal
communication of emotion after traumatic brain in-
jury.
Chen, Y. (2011). Face Perception in Schizophrenia
Spectrum Disorders: Interface Between Cognitive and
Social Cognitive Functioning, pages 111–120. Sprin-
ger Netherlands, Dordrecht.
de Vries, G.-J., Pauws, S., and Biehl, M. (2015). Facial Ex-
pression Recognition Using Learning Vector Quanti-
zation, pages 760–771. Springer International Publis-
hing, Cham.
Du, S. and Martinez, A. M. (2015). Compound facial ex-
pressions of emotion: from basic research to clinical
applications. Dialogues Clin Neurosci, 17(4):443–
455.
Ekman, P. and Friesen, W. V. (1971). Constants across cul-
tures in the face and emotion. J Pers Soc Psychol,
17(2):124–129.
Farfade, S. S., Saberian, M. J., and Li, L.-J. (2015). Multi-
view face detection using deep convolutional neural
networks. In Proceedings of the 5th ACM on Inter-
national Conference on Multimedia Retrieval, ICMR
’15, pages 643–650. ACM.
Ghimire, D. and Lee, J. (2013). Geometric feature-based fa-
cial expression recognition in image sequences using
multi-class adaboost and support vector machines.
Sensors, 13(6):7714–7734.
Ghimire, D., Lee, J., Li, Z.-N., and Jeong, S. (2017). Re-
cognition of facial expressions based on salient geo-
metric features and support vector machines. Multi-
media Tools and Applications, 76(6):7921–7946.
Haque, M. A., Irani, R., Nasrollahi, K., and Moeslund, T. B.
(2016). Facial video-based detection of physical fati-
gue for maximal muscle activity. IET Computer Vi-
sion, 10(4):323–329.
Haque, M. A., Nasrollahi, K., and Moeslund, T. B. (2013).
Real-time acquisition of high quality face sequences
from an active pan-tilt-zoom camera. In 2013 10th
IEEE International Conference on Advanced Video
and Signal Based Surveillance, pages 443–448.
Haque, M. A., Nasrollahi, K., and Moeslund, T. B. (2014).
Constructing facial expression log from video sequen-
ces using face quality assessment. In 2014 Internati-
onal Conference on Computer Vision Theory and Ap-
plications (VISAPP), volume 2, pages 517–525.
Haque, M. A., Nasrollahi, K., and Moeslund, T. B. (2015a).
Heartbeat Signal from Facial Video for Biometric Re-
cognition, pages 165–174. Springer International Pu-
blishing, Cham.
Haque, M. A., Nasrollahi, K., and Moeslund, T. B. (2015b).
Quality-aware estimation of facial landmarks in video
sequences. In 2015 IEEE Winter Conference on Ap-
plications of Computer Vision, pages 678–685.
Haque, M. A., Nasrollahi, K., and Moeslund, T. B. (2017).
Pain expression as a biometric: Why patients’ self-
reported pain doesn’t match with the objectively me-
asured pain? In 2017 IEEE International Conference
on Identity, Security and Behavior Analysis (ISBA),
pages 1–8.
Hyett, M. P., Parker, G. B., and Dhall, A. (2016). The Utility
of Facial Analysis Algorithms in Detecting Melancho-
lia, pages 359–375. Springer International Publishing,
Cham.
Irani, R., Nasrollahi, K., Dhall, A., Moeslund, T. B., and
Gedeon, T. (2016). Thermal super-pixels for bimodal
stress recognition. In 2016 Sixth International Confe-
rence on Image Processing Theory, Tools and Appli-
cations (IPTA), pages 1–6.
Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3d convolu-
tional neural networks for human action recognition.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(1):221–231.
Jiang, B., Valstar, M., Martinez, B., and Pantic, M. (2014).
A dynamic appearance descriptor approach to facial
actions temporal modeling. IEEE Transactions on Cy-
bernetics, 44(2):161–174.
Kharghanian, R., Peiravi, A., and Moradi, F. (2016). Pain
detection from facial images using unsupervised fea-
ture learning approach. In 2016 38th Annual Internati-
onal Conference of the IEEE Engineering in Medicine
and Biology Society (EMBC), pages 419–422.
Klonovs, J., Haque, M. A., Krueger, V., Nasrollahi, K.,
Andersen-Ranberg, K., Moeslund, T. B., and Spaich,
E. G. (2016). Monitoring Technology, pages 49–84.
Springer International Publishing, Cham.
Kotsia, I. and Pitas, I. (2007). Facial expression recognition
in image sequences using geometric deformation fea-
tures and support vector machines. IEEE Transactions
on Image Processing, 16(1):172–187.
Lajevardi, S. and Hussain, Z. (2010). Novel higher-order
local autocorrelation-like feature extraction methodo-
logy for facial expression recognition. IET Image Pro-
cessing, 4:114–119(5).
Lauterbach, M. D., Notarangelo, P. L., Nichols, S. J.,
Lane, K. S., and Koliatsos, V. E. (2015). Diagnos-
tic and treatment challenges in traumatic brain injury
patients with severe neuropsychiatric symptoms: in-
sights into psychiatric practice. Neuropsychiatr Dis
Treat, 11:1601–1607.
Li, F., Zhao, C., Xia, Z., Wang, Y., Zhou, X., and Li, G.-Z.
(2012). Computer-assisted lip diagnosis on traditio-
nal chinese medicine using multi-class support vector
machines. BMC Complementary and Alternative Me-
dicine, 12(1):127.
Li, S. Z. and Jain, A. K. (2011). Handbook of Face Recog-
nition. Springer Publishing Company, Incorporated,
2nd edition.
li Tian, Y. (2004). Evaluation of face resolution for expres-
sion analysis. In 2004 Conference on Computer Vision
and Pattern Recognition Workshop, pages 82–82.
Lyons, M. J., Budynek, J., and Akamatsu, S. (1999). Au-
tomatic classification of single facial images. IEEE
Trans. Pattern Anal. Mach. Intell., 21(12):1357–1362.
Mathias, M., Benenson, R., Pedersoli, M., and Van Gool,
L. (2014). Face Detection without Bells and Whistles,
pages 720–735. Springer International Publishing.
Mehrabian, A. (1968). Communication without words. Psy-
chology Today, 1.2(4):53–56.
Noroozi, F., Marjanovic, M., Njegus, A., Escalera, S., and
Anbarjafari, G. (2017). Audio-visual emotion recog-
nition in video clips. IEEE Transactions on Affective
Computing, (99):1–1.
Palestra, G., Pettinicchio, A., Del Coco, M., Carcagnì, P.,
Leo, M., and Distante, C. (2015). Improved Perfor-
mance in Facial Expression Recognition Using 32 Ge-
ometric Features, pages 518–528. Springer Internati-
onal Publishing, Cham.
Pantic, M. and Patras, I. (2006). Dynamics of facial expres-
sion: recognition of facial actions and their temporal
segments from face profile image sequences. IEEE
Transactions on Systems, Man, and Cybernetics, Part
B (Cybernetics), 36(2):433–449.
Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep
face recognition. In British Machine Vision Confe-
rence.
Poursaberi, A., Noubari, H. A., Gavrilova, M., and Ya-
nushkevich, S. N. (2012). Gauss–laguerre wavelet tex-
tural feature fusion with geometrical information for
facial expression identification. EURASIP Journal on
Image and Video Processing, 2012(1):17.
Ravichander, A., Vijay, S., Ramaseshan, V., and Natarajan,
S. (2016). Automated Human Facial Expression Re-
cognition Using Extreme Learning Machines, pages
209–222. Springer International Publishing, Cham.
Rodriguez, P., Cucurull, G., Gonzlez, J., Gonfaus, J. M.,
Nasrollahi, K., Moeslund, T. B., and Roca, F. X.
(2017). Deep pain: Exploiting long short-term me-
mory networks for facial expression classification.
IEEE Transactions on Cybernetics, PP(99):1–11.
Tian, Y. I., Kanade, T., and Cohn, J. F. (2001). Recogni-
zing action units for facial expression analysis. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 23(2):97–115.
Triantafyllidou, D. and Tefas, A. (2016). Face detection ba-
sed on deep convolutional neural networks exploiting
incremental facial part learning. In 2016 23rd Inter-
national Conference on Pattern Recognition (ICPR),
pages 3560–3565.
Uddin, M. Z. and Hassan, M. M. (2015). A depth video-
based facial expression recognition system using ra-
don transform, generalized discriminant analysis, and
hidden markov model. Multimedia Tools and Appli-
cations, 74(11):3675–3690.
Uddin, M. Z., Hassan, M. M., Almogren, A., Alamri, A.,
Alrubaian, M., and Fortino, G. (2017). Facial ex-
pression recognition utilizing local direction-based ro-
bust features and deep belief network. IEEE Access,
5:4525–4536.
Viola, P. and Jones, M. (2001). Rapid object detection
using a boosted cascade of simple features. In Procee-
dings of the 2001 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition. CVPR
2001, volume 1, pages I–511–I–518 vol.1.
Xiong, X. and la Torre, F. D. (2013). Supervised descent
method and its applications to face alignment. In 2013
IEEE Conference on Computer Vision and Pattern Re-
cognition, pages 532–539.
Yan, J., Zhang, X., Lei, Z., and Li, S. Z. (2014). Face de-
tection by structural models. Image and Vision Com-
puting, 32(10):790–799. Best of Automatic Face and
Gesture Recognition 2013.
Yoshihara, H., Seo, M., Ngo, T. H., Matsushiro, N., and
Chen, Y. W. (2016). Automatic feature point de-
tection using deep convolutional networks for quan-
titative evaluation of facial paralysis. In 2016 9th
International Congress on Image and Signal Proces-
sing, BioMedical Engineering and Informatics (CISP-
BMEI), pages 811–814.
Zhao, X. and Zhang, S. (2011). Facial expression recogni-
tion based on local binary patterns and kernel discri-
minant isomap. Sensors, 11(10):9573–9588.