Expression Detector System based on Facial Images
José G. Hernández-Travieso, Carlos M. Travieso, Marcos del Pozo-Baños and Jesús B. Alonso
Signal and Communications Department, The Institute for Technological Development and Innovation on Communications,
University of Las Palmas de Gran Canaria, Campus Universitario de Tafira, sn, Ed. de Telecomunicación,
Pabellón B, Despacho 111, E35017, Las Palmas de Gran Canaria, Spain
Keywords: Expression Detection, Soft-biometrics, Facial Segmentation, Pattern Recognition.
Abstract: This paper proposes an emotion detector, applied to facial images, based on the analysis of facial
segmentation. The parameterizations have been developed in the spatial and transform domains, and the
classification has been done by Support Vector Machines. A public database, The Radboud Faces Database
(RAFD), has been used in the experiments, with eight possible emotions: anger, disgust, fear, happiness,
sadness, surprise, neutral and contempt. Our best approach has been reached with decision fusion, using
transform domains, reaching an accuracy of up to 96.62%.
1 INTRODUCTION
In today's society, the use of Information and
Communication Technologies (ICT) is increasing
(Chin et al., 2008); (Eshete et al., 2010); (Siriak and
Islam, 2010). Technological advances have made
possible the proliferation of equipment and new
technologies, enabling progress that previously could
only be imagined. One of these new applications is
emotion detection, which is the goal of this work. It
can be used for various purposes, such as the detection
of possible symptoms of neurological diseases in
humans (Wang et al., 2008); (Wang et al., 2007);
(Ekman and Friesen, 1978).
Emotional Intelligence is also gaining importance,
together with other sets of values and behaviours
aimed at achieving better welfare for the individual in
their work, emotional and affective environments.
The field of emotion detection is developing into
multiple applications and research lines, which gives
an idea of the importance it has acquired and of the
multitude of its applications. In this regard, many
authors build on the guidelines set by Ekman and
Friesen, who developed the Facial Action Coding
System (FACS) (Ekman and Friesen, 1978), which
takes parameters of the muscles of the face according
to a particular emotion, classifying them into Action
Units (AU) specific to each emotion.
The use of FACS is not limited to the field of
technological research, as it is also of great
importance in psychology as an aid to the study of
human behaviour. Only when an emotion is genuine
is the correct AU produced, something that does not
happen when a person lies.
When transmitting a message, an important part
of the communication is the facial expression, that is,
the gestures shown.
The state of the art in this field is quite broad, so
only a few works are highlighted here.
As mentioned before, the introduction of FACS
has influenced works such as (Pantic and Patras,
2006), who marked key points in the images input to
the system to detect the presence of emotion. In this
work, they found that the left half of the face
expresses emotion better than the right half. In
addition, it was found that the expression of
authentic emotions was symmetrical, unlike
feigned expressions. They used Hidden Markov
Models (HMM), reaching recognition rates of 87%.
(Arima et al., 2004), using Fourier descriptors
and discriminant analysis, studied the human
response to low-frequency oscillations using a
ship-motion simulator and its passengers, in order to
study the effect of boat-trip oscillations. They
tried to establish a method for quantifying facial
expression and to clarify the relationship between
facial expression and the individual's mental status,
managing to reach an average recognition rate of
82.2%.
In (Wong and Cho, 2006), Gabor features were used
to develop a representation of facial emotion in a Face
Emotion Tree Structure (FEETS) to detect emotions
in faces partially covered by sunglasses, veils, or any
element that hides an area from view, achieving
facial expression recognition results close to 90%.
(Fu et al., 2009) conducted a study using the
Java Agent Development Framework (JADE),
which linked the viewer's remote-control activity,
combined with facial recognition, to the emotions of
the human being. This work can be used to support
the research of the Massachusetts Institute of
Technology (MIT) on the home of the future, in
which changes in blood pressure, weight or abnormal
sleep are monitored as precursors of heart-failure
symptoms.
In the study of (An and Chung, 2009), Principal
Component Analysis (PCA) was used to study facial
expression in order to offer interactive TV and, on
demand, personalized services to viewers. In this
study, they achieved a success rate of 92.1%.
In (Petrantonakis and Hadjileontiadis, 2010),
High Order Crossing (HOC) analysis was used to
implement an emotion detector system based on the
electroencephalogram (EEG), observing the signals
recorded while showing facial expressions of certain
emotions. They achieved success rates of up to 100%.
(Dahmane and Meunier, 2011) developed an
emotion detector system, using histograms of
oriented gradients and Support Vector Machines
(SVM) for the classification of the images used,
achieving a success rate of 70%.
(Gouizi et al., 2011) also developed an emotion
detector system from biological signals such as the
electromyogram, respiration, skin temperature, skin
conductance, blood pressure and heart rate. SVM
was used as the classification technique.
Recognition rates reached 85%.
This area has produced a number of works during
the last years, and this work contributes to extending
this line, showing our innovation. In particular, our
work proposes the creation of an emotion detector
system for facial images. For that, facial features are
extracted using the spatial domain and transformed
domains for subsequent classification using SVM.
The distinctive part of this system is the
segmentation of the image, performing an in-depth
study that leads to obtaining the significant value of
each segment when an emotion is detected. This has
not been observed in previous studies.
2 PREPROCESSING
This section comprises the different steps that
facilitate our face segmentation. Those steps are,
firstly, face detection; then brightness adjustment
and high-pass filtering; and finally, a binarization
process.
Figure 1: Block diagram of the system.
2.1 Extraction of the Facial Area of the
Input Image
Due to the high resolution of the input images
(681x1024 pixels), and the fact that, in addition to
the facial area of interest, other body parts such as the
upper trunk and the top of the head are present in
them, the facial area is extracted first. An algorithm
based on the face detector from (Viola and Jones,
2004) has been used (see figure 2).
Figure 2: Extraction of facial area.
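As a minimal sketch of this step (not the authors' exact implementation), OpenCV's Haar-cascade detector, which implements the Viola and Jones approach, can be used; the detector parameters here are assumptions:

```python
import cv2

def extract_facial_area(image_path):
    """Detect the largest frontal face and crop it (Viola-Jones style detector)."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # OpenCV ships a pre-trained Haar cascade based on Viola and Jones (2004)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detection, assumed to be the model's face
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    return img[y:y + h, x:x + w]
```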
BIOSIGNALS2013-InternationalConferenceonBio-inspiredSystemsandSignalProcessing
412
2.2 Adjusting the Brightness of the
Image
The first step is to transform the incoming facial
images into luminance and chrominance components,
to highlight the eye and mouth areas of the face.
Subsequently, the image brightness is modified by
multiplying the luminance component by a scale
factor called ESCALA (see figure 3).
Figure 3: Adjusting the brightness.
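A minimal sketch of this adjustment, assuming a YCrCb decomposition and a hypothetical value for ESCALA (the paper does not report the value used):

```python
import cv2
import numpy as np

ESCALA = 1.2  # hypothetical scale factor; the paper does not give its value

def adjust_brightness(img_bgr, escala=ESCALA):
    """Scale the luminance (Y) channel, leaving chrominance untouched."""
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    ycrcb[:, :, 0] = np.clip(ycrcb[:, :, 0] * escala, 0, 255)
    return cv2.cvtColor(ycrcb.astype(np.uint8), cv2.COLOR_YCrCb2BGR)
```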
2.3 Filtering of the Images
In order to obtain better information on the areas of
interest, and to correctly detect the emotion present in
the facial image, a high-pass filter is required for a
better differentiation of the edges of the image (see
figure 4). We have applied a heuristic filter, finally
defined in equation 1:

$M_{FPA} = \begin{pmatrix} -1 & -1 & -1 \\ -1 & 9 & -1 \\ -1 & -1 & -1 \end{pmatrix}$   (1)
Figure 4: Filtered image.
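A sketch of applying the mask of equation 1 with OpenCV (an illustrative choice of library; the mask signs follow the sharpening-type reconstruction in equation 1):

```python
import cv2
import numpy as np

# High-pass (sharpening-type) mask of equation 1
MFPA = np.array([[-1, -1, -1],
                 [-1,  9, -1],
                 [-1, -1, -1]], dtype=np.float32)

def high_pass(gray):
    """Convolve the luminance image with the heuristic high-pass mask."""
    return cv2.filter2D(gray, -1, MFPA)
```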
2.4 Image Binarization
Binarization consists in converting a grey-scale
image into a binary image, i.e., a black-and-white
image. To binarize the image, a histogram of the
luminance scale of the incoming picture is built. It
shows how many times each value of the grey scale
is present.
Using Otsu's Method (Otsu, 1979) is not feasible
in this case, since the detection of the valleys of the
histogram is not optimal, shifting the threshold
towards lower values and losing information from
important areas, such as the mouth.
A threshold is therefore chosen manually, using
the histogram, due to the need to find an optimal
value for the parts involved in this study.
Figure 5: Binarized image.
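A minimal sketch of this manual thresholding, with a hypothetical threshold value (the paper does not report the one chosen):

```python
import numpy as np

THRESHOLD = 100  # hypothetical manually chosen luminance threshold

def binarize(lum, threshold=THRESHOLD):
    """Return a black-and-white image: 255 where luminance exceeds the threshold."""
    # The histogram is what a human inspects to pick the threshold by hand
    hist, _ = np.histogram(lum, bins=256, range=(0, 256))
    return (lum > threshold).astype(np.uint8) * 255
```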
3 FACIAL SEGMENTATION
Figure 6: Segmented facial image.
Once the facial area has been pre-processed, it is
segmented into seven parts in order to discern the
information provided by each one in our emotion
detection process. The segmented parts are:
forehead, both eyes together, right eye, left eye, right
cheek, left cheek and mouth. In particular, the
segment combinations used are defined as follows:
TP: indicates that all segments of the facial
image (forehead, both eyes together, right eye, left
eye, right cheek, left cheek and mouth) are used.
DOLOBO: indicates that both eyes together, the
right eye, the left eye and the mouth are used.
DOBO: indicates that both eyes together and the
mouth are used.
LOBO: indicates that the right eye, left eye and
mouth are used.
FR: indicates that the forehead is used.
DO: indicates that both eyes together are used.
LO: indicates that the right eye and left eye are used.
4 FEATURE EXTRACTION
4.1 Facial Feature Extraction in the
Spatial Domain
Facial feature extraction in the spatial domain
consists of taking Euclidean distances between
various points of the face on the binarized images,
to try to detect the emotion present in it. These
distances are normalized with respect to the distance
between the inner corners of the eyes, given the
variety of faces in the database (men, women and
children), in order to standardize the measures taken.
Figure 7: Euclidean distances.
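A sketch of this measurement follows; the landmark pairs are hypothetical placeholders standing in for the points marked in figure 7:

```python
import numpy as np

def normalized_distances(pairs, inner_eye_left, inner_eye_right):
    """Euclidean distances between landmark pairs, normalized by the
    distance between the inner corners of the eyes."""
    norm = np.linalg.norm(np.subtract(inner_eye_right, inner_eye_left))
    return np.array([np.linalg.norm(np.subtract(p, q)) / norm for p, q in pairs])

# Hypothetical landmark pairs (e.g. mouth corners, eyebrow-to-eye points)
pairs = [((120, 340), (210, 345)), ((100, 200), (100, 260))]
features = normalized_distances(pairs, (130, 200), (200, 200))
```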
4.2 Facial Feature Extraction in
Transformed Domains
For this work, the transformed domains used are the
2-Dimensional Discrete Cosine Transform (2D-DCT)
(Gonzalez and Woods, 2002) and the 2-Dimensional
Discrete Wavelet Transform (2D-DWT). They were
chosen due to their good behaviour in facial
identification and other biometric applications
(Vargas et al., 2010); (Fuertes et al., 2012).
On the one hand, the input image fed to the
2D-DCT is high-pass filtered. This process is
performed to obtain a better definition of the edges
of the image, better highlighting the areas with
facial expression characteristics, such as the eyes and
mouth, for later extraction. The detail information
obtained by the filtering is captured through
spectral windows; since we work in a space-
frequency resolution, that information must be
transformed into the spatial domain. The 2D-DCT
performs a low-pass filtering that provides general
information from the details of the incoming image.
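As an illustration, a sketch of 2D-DCT feature extraction from a filtered facial segment, assuming SciPy as the toolchain (the paper does not name one):

```python
import numpy as np
from scipy.fft import dctn

def dct_features(segment):
    """2D-DCT of a high-pass-filtered facial segment, flattened for classification."""
    return dctn(segment.astype(np.float64), norm='ortho').flatten()
```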
On the other hand, the 2D-DWT (Gonzalez and
Woods, 2002) carries out a high-pass filtering which
provides fine-grained information on the details of
the incoming image; the image used in this case is the
original image in colour (RGB). Since the image is in
the visible domain, spatial information is provided.
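Similarly, a sketch of the 2D-DWT step with PyWavelets (an assumed toolchain), keeping only the high-frequency sub-bands as described in section 6.2.2; applying it per colour channel is an assumption:

```python
import numpy as np
import pywt

def dwt_features(channel, wavelet='bior4.4'):
    """Single-level 2D-DWT of one colour channel; keep only the detail
    (high-frequency) sub-bands, which carry the image details."""
    _, (cH, cV, cD) = pywt.dwt2(channel.astype(np.float64), wavelet)
    return np.concatenate([cH.flatten(), cV.flatten(), cD.flatten()])
```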
5 CLASSIFICATION SYSTEM
5.1 Support Vector Machine (SVM)
The SVM is a well-known classifier, used in a variety
of problems with large amounts of data (Yu et al.,
2003). An SVM can only distinguish between two
different classes (Vapnik, 1998); (Burges, 1998).
The technique is directly related to classification and
regression models (Vapnik, 1998). Given a set of
training examples (samples, called vectors) with
labelled classes, an SVM can be trained to build a
model that predicts the class of a new sample. The
idea underlying the SVM is the hyperplane, or
decision boundary, which can be defined as the plane
of separation between sets of samples from different
classes. There can be infinitely many hyperplanes,
but only one of them is optimal: the one that
maximizes the separation between the samples
(Vapnik, 1998); (Jakkula, 2002), and hence the
margin. We have used a supervised classification
system, with two different kernels, Linear and Radial
Basis Function (RBF) kernels (Vapnik, 1998), under
a one-versus-all multi-class strategy. In particular,
we have used SVM-Light (Joachims, 1999).
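The paper uses SVM-Light; purely as an illustrative sketch, an equivalent one-versus-all setup with the two kernels could be written with scikit-learn (an assumption, not the authors' toolchain):

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def make_classifier(kernel='rbf'):
    """One-versus-all SVM over the 8 emotion classes ('linear' or 'rbf' kernel)."""
    return OneVsRestClassifier(SVC(kernel=kernel))

# Usage: clf = make_classifier('linear'); clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```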
5.2 Fusion Results
The last stage is the fusion of the classification
results. This fusion is performed at the decision level,
from the outputs of the SVM decisions. Its mission
is to correct certain errors that may have occurred
in the recognition phase, provided those errors are
uncorrelated. The objective is to give more
robustness to the final results of our approach.
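A minimal sketch of decision-level fusion by majority vote follows; the exact fusion rule is not detailed in the paper, so the vote is an assumption:

```python
import numpy as np

def fuse_decisions(decisions):
    """Majority vote across classifiers.

    decisions: integer label array of shape (n_classifiers, n_samples),
    one row per transform (2D-DCT, Haar 2D-DWT, Bior4.4 2D-DWT).
    """
    decisions = np.asarray(decisions)
    return np.array([np.bincount(col).argmax() for col in decisions.T])
```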
6 EXPERIMENTAL
METHODOLOGY
6.1 Database
We used a public database, The Radboud Faces
BIOSIGNALS2013-InternationalConferenceonBio-inspiredSystemsandSignalProcessing
414
Database (RAFD) (Langner et al., 2010). This
database was chosen over other available ones for
several reasons, among which are its brightness and
image resolution.
The RAFD database is a set of 8040 images of
67 models (20 Caucasian male adults, 19 Caucasian
female adults, 4 Caucasian boys, 6 Caucasian girls,
18 Moroccan male adults), with 23-24 pictures per
model for each camera position, expressing 8
emotional expressions: anger, disgust, fear, happiness,
sadness, surprise, neutral and contempt. The emotions
are expressed according to FACS. The positions of
the models with respect to the camera range from
-90° to 90° (the frontal position being 0°). The
database is an initiative of the Behavioural Science
Institute of the Radboud University Nijmegen,
located in Nijmegen (The Netherlands).
The file format is .jpg in colour, with dimensions
of 681x1024 pixels. The clothing of all models is
identical (a black shirt), and the background is plain
and unchanging. For this system, the 1600 images
corresponding to the frontal position of the model
with respect to the camera were used. This database
is public, and is granted free of charge for use in
research.
Figure 8: Samples of the database RAFD.
6.2 Experiments
In the experiments, SVM with RBF and linear
kernels were used, following a 50% hold-out
validation method, repeating each experiment three
times and varying the percentages of training and
test samples.
Since the SVM is originally a bi-class classifier
and this system works with more than two classes of
emotions (8 in total), a multi-class scheme is
required. Among the several existing techniques, the
one-versus-all technique was used. Three experiments
have been developed:
6.2.1 Experiment 1: Feature Extraction in
the Spatial Domain
In the case of facial feature extraction in the spatial
domain, the distance measurements are concatenated
into a column vector and subsequently introduced
into a data structure, which is the input to the
classification stage.
6.2.2 Experiment 2: Feature Extraction in
Transformed Domains
The 2D-DCT is applied to the segmented images of
the high-pass-filtered facial area. This transform has
the property that the images do not undergo any
variation in size when it is performed.
Each time a segment has been transformed, the
resulting data matrix is reshaped into a column
vector; the column vectors of all the selected facial
segments are then concatenated to form a new
column vector, which is introduced into a data
structure that serves as the input to the classification
stage.
In applying the 2D-DWT, a pattern similar to
that of the 2D-DCT is followed. In this case, however,
the input image is not high-pass filtered: the original
image is used, because the 2D-DWT works with
images in colour (RGB).
Among the different existing wavelet types, we
chose the Haar family for its simplicity and the
Bior4.4 family for its good results in previous works
(Mallat, 2009).
In applying the 2D-DWT, we have worked with
the high frequencies, in order to capture the details
of each image. This output image is reshaped into a
column vector and, as with the 2D-DCT, all the
column vectors corresponding to the selected facial
segments are concatenated to form a new column
vector, which is introduced into a data structure and
serves as the input to the classification stage, as
sketched below.
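A sketch of this per-segment vectorization and concatenation, reusing the illustrative dct_features and dwt_features helpers from section 4.2:

```python
import numpy as np

def build_feature_vector(segments, extract):
    """Concatenate per-segment feature vectors into a single column vector.

    segments: list of 2-D arrays (e.g. forehead, both eyes, mouth).
    extract:  per-segment extractor such as dct_features or dwt_features.
    """
    return np.concatenate([extract(s) for s in segments])
```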
6.2.3 Experiment 3: Fusion
Once the simulation results for facial feature
extraction in the transformed domains (2D-DCT,
Bior4.4 2D-DWT and Haar 2D-DWT) have been
obtained, a fusion of the best results from each is
performed. This corrects uncorrelated errors and
improves the emotion recognition.
6.3 Results and Discussion
The results are shown as mean and variance for each
ExpressionDetectorSystembasedonFacialImages
415
of the experiments performed.
6.3.1 Experiment 1: Feature Extraction in
the Spatial Domain
The best result obtained using the spatial domain
was 32.58% ± 1.00, with 50% of training samples
and using the linear SVM.
In view of these results, it is shown that this
method is not decisive for detecting the emotion
present in the human being using the facial image,
because the information is not sufficient to achieve a
recognition rate that would allow the emotion present
in the facial image to be determined with certainty.
6.3.2 Experiment 2: Feature Extraction in
Transformed Domains
Employing transformed domains, the following
results were obtained for 50% of training samples:
For the 2D-DCT, 96.16% ± 0.69 with the RBF
SVM using TP, and 86.41% ± 1.34 with the linear
SVM using DOLOBO.
For the Haar 2D-DWT, the best results obtained
were 95.37% ± 1.82 for the RBF SVM using TP,
and 92.90% ± 0.33 for the linear SVM using TP.
For the Bior4.4 2D-DWT, 96.33% ± 1.34 for the
RBF SVM using TP, and 92.95% ± 1.97 for the
linear SVM using TP.
If 40% of training samples (60% of test samples)
were used, the results were:
For the 2D-DCT, 90.52% ± 0.09 for the RBF
SVM using TP, and 87.77% ± 0.22 for the linear
SVM using TP.
With the Haar 2D-DWT, 94.70% ± 0.62 for the
RBF SVM using DOBO, and 91.18% ± 0.06 for the
linear SVM using TP.
For the Bior4.4 2D-DWT, 96.59% ± 0.32 for the
RBF SVM using LOBO, and 92.46% ± 2.73 for the
linear SVM using TP.
In view of these results, it can be concluded that
the extraction of facial features in transformed
domains is more effective for detecting the emotion
present in the facial image of the human being. The
most effective one is the Bior4.4 2D-DWT.
6.3.3 Experiment 3: Fusion
The best linear SVM result for each training
percentage is chosen for fusion. The result obtained
for 50% of test samples was a 96.62% success rate,
with a processing time of 28.86 milliseconds.
For 60% of test samples, the result obtained was
a 95.72% success rate, with a time of 23.80
milliseconds.
With these values, the improvement achieved by
applying fusion to detect the emotion present in the
human being is clear.
Compared to previous systems, in which no
segmentation was performed for detecting emotion,
this study achieved higher success rates, which
proves the advantage of segmentation for correctly
detecting the emotion present. To date, the RAFD
database had not been used to detect emotions.
Table 1: Spatial domain results.

                Linear SVM        RBF SVM
50% training    32.58% ± 1.00     25.41% ± 0.41
40% training    32.11% ± 0.02     22.77% ± 2.13
Table 2: 2D-DCT results.

                          Linear SVM                RBF SVM
50% training (segment)    86.41% ± 1.34 (DOLOBO)    96.16% ± 0.69 (TP)
40% training (segment)    87.77% ± 0.22 (TP)        90.52% ± 0.09 (TP)
Table 3: Haar 2D-DWT results.

                          Linear SVM            RBF SVM
50% training (segment)    92.90% ± 0.33 (TP)    95.37% ± 1.82 (TP)
40% training (segment)    91.18% ± 0.06 (TP)    94.70% ± 0.62 (DOBO)
Table 4: Bior4.4 2D-DWT results.

                          Linear SVM            RBF SVM
50% training (segment)    92.95% ± 1.97 (TP)    96.33% ± 1.34 (TP)
40% training (segment)    92.46% ± 2.73 (TP)    96.59% ± 0.32 (LOBO)
7 CONCLUSIONS
This study has shown that the segmentation of the
face, its parameterization with transformed domains
and the use of the SVM classifier give a much higher
recognition percentage in simulations with
transformed domains where the eye segments
(together or separately) and the mouth are present
together, reaching an accuracy of 96.59% using the
RBF SVM and the Bior4.4 2D-DWT.
In contrast, the zones with less influence on the
detection of emotion are the cheeks and forehead,
BIOSIGNALS2013-InternationalConferenceonBio-inspiredSystemsandSignalProcessing
416
due to the limited amount of information they
provide. In particular, for the forehead, the results
were not higher than 33.33%, using in this case the
Haar wavelet family.
The importance of the information provided by
the eyes and mouth is also checked empirically,
because when a person shows emotions such as
surprise, the parts of the face that most quickly and
clearly serve as indicators are the eyes and mouth.
When the eyes and mouth are shown wide open, the
emotion can be detected without any doubt. This
does not occur with the cheeks and forehead
considered separately, because the movements of the
muscles associated with these areas are inconclusive
in this study.
ACKNOWLEDGEMENTS
This work is partially supported by funds from
"Cátedra Telefónica 2009/10-ULPGC" and by the
Spanish Government, under Grant MCINN
TEC2012-38630-C04-02.
REFERENCES
An, K. H., Chung, M. J., 2009. Cognitive face analysis
system for future interactive TV. In IEEE
Transactions on Consumer Electronics. Vol. 55, no. 4,
pp. 2271-2279.
Arima, M., Ikeda, K., Hosoda, R., 2004. Analyses of
Facial Expressions for the Evaluation of Seasickness.
In Oceans ’04. MTTS/ IEEE Techno-Ocean ’04.
Vol.2, pp. 1129-1132.
Burges, C. J. C., 1998. A tutorial on Support Vector
Machines for Pattern Recognition. In Data Mining and
Knowledge Discovery, Vol. 2, pp.121-167.
Chin, K. L., Chang, E., Atkinson, D., 2008. A Digital
Ecosystem for ICT Educators, ICT Industry and ICT
Students. In Second IEEE International Conference on
Digital Ecosystems and Technologies. pp. 660-673.
Dahmane, M., Meunier, J., 2011. Emotion Recognition
using Dynamic Grid-based HoG Features. In IEEE
International Conference on Automatic Face &
Gesture Recognition and Workshops, pp. 884-888.
Ekman, P., Friesen, W., 1978. Facial Action Coding
System: A Technique for the Measurement of Facial
Movements. Consulting Psychologist Press, Palo Alto,
CA.
Eshete, B., Mattioli, A., Villafiorita, A., Weldemariam, K.,
2010. ICT for Good: Opportunities, Challenges and
the Way Forward. In Fourth International Conference
on Digital Society. pp. 14-19.
Fu, M. H., Kuo, Y. H., Lee, K. R., 2009. Fusing Remote
Control Usage and Facial Expression for Emotion
Recognition. In Fourth International Conference on
Innovative Computing, Information and Control. pp.
132-135.
Fuertes, J. J., Travieso, C. M., Naranjo, V., 2012. 2-D
Discrete Wavelet Transform for Hand Palm Texture
Biometric Identification and Verification. Wavelet
Transforms and Their Recent Applications in Biology
and Geoscience, Ed. InTech.
González, R. C., Woods, R. E., 2002. Digital Image
Processing. Prentice Hall, Upper Saddle River, New
Jersey.
Gouizi, K., Reguig, F. B., Maaoui, C., 2011. Analysis
Physiological Signals for Emotion Recognition. In 7th
International Workshop on Systems, Signal Processing
and their Applications (WOSSPA). pp. 147-150.
Jakkula, V., 2002. Tutorial on Support Vector Machine
(SVM). School of EECS, Washington State
University, Pullman, WA 99164.
Joachims, T., 1999. Making large-Scale SVM Learning
Practical. Advances in Kernel Methods - Support
Vector Learning. B. Schölkopf and C. Burges and A.
Smola (ed.), MIT-Press.
Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D. H. J.,
Hawk, S. T., & van Knippenberg, A., 2010.
Presentation and validation of the Radboud Faces
Database. In Cognition & Emotion, Vol. 24, nº 8, pp.
1377-1388.
Mallat, S., 2009. A Wavelet Tour of Signal Processing,
Third Edition: The Sparse Way.
Otsu, N., 1979. A Threshold Selection Method from
Gray-Level Histograms. In IEEE Transactions on
Systems, Man and Cybernetics. Vol. 9, no. 1, pp. 62-66.
Pantic, M., Patras, I., 2006. Dynamics of Facial
Expression: Recognition of Facial Actions and Their
Temporal Segments From Face Profile Image
Sequences. In IEEE Transactions on Systems, Man, and
Cybernetics, Part B: Cybernetics. Vol. 36, no. 2, pp.
443-449.
Petrantonakis, P. C., Hadjileontiadis, L. J., 2010. Emotion
Recognition from EEG Using High Order Crossing. In
IEEE Transactions on Information Technology in
Biomedicine. Vol. 14, no. 2, pp. 186-197.
Siriak, S., Islam, N., 2010. Relationship between
Information and Communication Technology (ICT)
Adoption and Hotel Productivity: An Empirical Study
of the Hotels in Phuket, Thailand. In Proceedings of
PICMET’10: Technology Management for Global
Economic Growth, pp. 1-9.
Vapnik, V. N., 1998. Statistical Learning Theory. Wiley-
Interscience Publication, John Wiley & Sons Inc.
Vargas, J. F., Travieso, C. M., Alonso, J. B., Ferrer. M. A.,
2010. Off-line signature Verification Based on Gray
Level Information Using Wavelet Transform and
Texture Features. In 12th International Conference on
Frontiers in Handwriting Recognition (ICFHR). pp.
587-592.
Viola, P., Jones, M. J., 2004. Robust Real-Time Face
Detection. In International Journal of Computer
Vision. Vol. 57, nº 2, pp. 137-154.
ExpressionDetectorSystembasedonFacialImages
417
Wang, P., Kohler, C., Barrett, F., Gur, R., Verma, R.,
2007. Quantifying Facial Expression Abnormality in
Schizophrenia by Combining 2D and 3D Features. In
IEEE Conference on Computer Vision and Pattern
Recognition. pp. 1-8.
Wang, P., Kohler, C., Martin E., Stolar, N., Verma, R.,
2008. Learning-based Analysis of Emotional
Impairments in Schizophrenia. In IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition Workshops. pp. 1-8.
Wong, J. J., Cho, S. Y., 2006. Recognizing Human
Emotion from Partial Facial Features. In International
Joint Conference on Neural Networks. Vol. 1, pp.
166-173.
Yu, H., Yang, J., Han, J., 2003. Classifying Large Data
Sets Using SVMs with Hierarchical Clusters. In The
Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 306-315.
BIOSIGNALS2013-InternationalConferenceonBio-inspiredSystemsandSignalProcessing
418