Facial Emotion Recognition in Presence of Speech using a Default
ARTMAP Classifier
Sheir Afgen Zaheer¹,² and Jong-Hwan Kim¹
¹School of Electrical Engineering, KAIST, Daejeon, Republic of Korea
²Innovative Play Lab, Goyang, Republic of Korea
Keywords:
Emotion Recognition, Fuzzy Adaptive Resonance Theory, Default ARTMAP.
Abstract:
This paper proposes a scheme for facial emotion recognition in the presence of speech, i.e. the interacting
subjects are also speaking. We propose the usage of default ARTMAP, a variant of fuzzy ARTMAP, as
a classifier for facial emotions using feature vectors derived from facial animation parameters (FAP). The
proposed scheme is tested on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. The results show the effectiveness of the approach as a standalone facial emotion classifier, as well as its relatively superior performance on IEMOCAP in comparison to existing similar approaches.
1 INTRODUCTION
To realize emotional intelligence in robots and artificial intelligence, the ability to process emotional information and recognize emotions is essential. People communicate their emotions through various modes of communication, and facial expressions are the most dominant indicators of emotion among those communication cues. Therefore, analyzing facial information for emotion recognition has attracted a lot of interest as a research issue in various fields, such as affective computing, social robotics, and human-robot interaction (Liu et al., 2013; Hirota and Dong, 2008; Rozgić et al., 2012).
In recent years, machine learning techniques for facial emotion recognition have become very popular (Liu et al., 2014; Li et al., 2015b). Among these, convolutional neural networks (CNNs) have been the most successful and popular on benchmark problems (Li et al., 2015a). These approaches use the images, or sections of the images, directly as training inputs. Though such approaches have been very successful on popular facial expression databases, such as MMI and CKP, they have practical limitations with audiovisual data consisting of multi-modal interactions. They work well for still images or for video data containing facial expressions only. However, this changes when the incoming data is audiovisual and the subject is speaking: the variations in a speaking face are a compound effect of both the facial expression (emotion) and the facial movement needed to utter the words (lexicon).
Figure 1: Dependency of various factors on different areas
of the face. Darker color represents higher dependency.
To extract the right features for facial emotion from a speaking face, we first need to understand how emotion and lexicon affect different regions of the face. Mariooryad and Busso (2016) studied this variation across different regions of the face; their findings are shown in Fig. 1.
In this paper, we propose a facial emotion recog-
nition scheme using supervised adaptive resonance
theory (ARTMAP). The proposed scheme formulates
a feature vector based on the facial animation pa-
rameters (FAP) corresponding to the emotional re-
gion of the face, as shown in Fig. 1, and utilizes a
default ARTMAP as a classifier for emotion recog-
nition. The database used in this paper is Interac-
tive Emotional Dyadic Motion Capture (IEMOCAP)
(Busso et al., 2008). The IEMOCAP database is an acted, multimodal, multi-speaker database developed by the Signal Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). It contains approximately 12 hours of audiovisual data, including video, facial motion capture, and text transcriptions. The motion capture information, the interactive setting used to elicit authentic emotions, and the diversity of the actors in the database (five males and five females) make it a valuable, realistic and challenging emotion corpus.
This paper is organised as follows: Section 2 de-
scribes the facial feature extraction. Section 3 ex-
plains the facial emotion classification using default
ARTMAP, and the classification results follow in Sec-
tion 4. Finally, the concluding remarks are presented
in Section 5.
2 FACIAL FEATURE
EXTRACTION
Recent studies using audiovisual data similar to ours
have shown that Face Animation Parameters (FAP)
can be an effective feature set choice for extraction of
emotional information even when the user is speaking
(Kim et al., 2013; Mower et al., 2011).
”A Face Animation Parameter (FAP) is a com-
ponent of the MPEG-4 International Standard
developed by the Moving Pictures Experts
Group. FAP represent displacements and rota-
tions of the feature points from the neutral face
position, which is defined as: mouth closed,
eyelids tangent to the iris, gaze and head ori-
entation straight ahead, teeth touching, and
tongue touching teeth” (Petajan, 2005).
2.1 Motion Capture and FAP
As FAP are distances between two points on a face,
a prerequisite to calculating FAP is the availability of
the motion capture data for the corresponding points
on the face. Fig. 2 demonstrates the motion capture
points available in the database. Combining the infor-
mation from Fig. 1 and Fig. 2, desirable FAP can be
calculated. Our set of 30 FAP is similar to the one used by (Kim et al., 2013; Mower et al., 2011), with the exception of the FAP corresponding to the mouth openings. These FAP are shown in Fig. 3.
Figure 2: Visual representation of the motion capture points on the face.
Figure 3: Visual representation of the FAP used for facial
emotion recognition.
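To make the FAP computation concrete, the following minimal sketch derives the three components of one FAP as the per-frame displacement between a pair of motion capture markers. The marker names and the layout of the motion capture data are hypothetical placeholders, not the identifiers used in the IEMOCAP release.

```python
import numpy as np

def fap_components(mocap, marker_a, marker_b):
    """Per-frame x, y, z displacement between two facial markers.

    `mocap` is assumed to map marker names to arrays of shape
    (n_frames, 3); the names and layout are illustrative only.
    Each marker pair yields one FAP with three components per frame,
    matching the 30 FAP x 3 components = 90 values used in Section 2.2.
    """
    return mocap[marker_a] - mocap[marker_b]   # shape (n_frames, 3)

# Example with hypothetical marker names:
# fap_brow = fap_components(mocap, "RBRO1", "RNSTRL")
```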
2.2 Facial Features based on FAP
After obtaining the required FAP, the next step is to
generate the feature vector for facial emotion classifi-
cation. Ninety FAP values (the x-, y-, and z-components of each of the 30 FAP) are extracted from each frame of the audiovisual segment. The selected features for the audiovisual segment consist of the means, standard deviations, ranges, upper and lower quartiles, and quartile ranges of all 90 values over the entire segment. Consequently, the net feature vector consists of 540 features for each audiovisual segment.
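A minimal sketch of this feature vector construction, assuming the per-frame FAP components have already been stacked into a (frames x 90) array; the function and variable names are ours, not from the paper.

```python
import numpy as np

def segment_features(fap_values):
    """Statistical features over one audiovisual segment.

    `fap_values` has shape (n_frames, 90): the x, y, z components of
    the 30 FAP for every frame.  For each of the 90 values we take the
    mean, standard deviation, range, upper and lower quartiles, and
    quartile range, giving a 6 x 90 = 540-dimensional feature vector.
    """
    q1 = np.percentile(fap_values, 25, axis=0)
    q3 = np.percentile(fap_values, 75, axis=0)
    stats = [
        fap_values.mean(axis=0),
        fap_values.std(axis=0),
        fap_values.max(axis=0) - fap_values.min(axis=0),
        q3,
        q1,
        q3 - q1,
    ]
    return np.concatenate(stats)       # shape (540,)
```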
2.3 Facial Feature Normalization
The database has multiple actors and they all have dis-
tinct facial features and sizes, which means that the
base values for their FAP are different. Therefore, the
FAP features need to be normalized to minimize the
effect of base value variation among different faces.
We use z-normalization for this purpose. Mean and
standard deviation for each face were calculated over
the entire spectrum of emotions expressed by the cor-
responding actor. These mean and standard deviation
values for each face are used to calculate feature val-
ues in terms of z-scores using:
FAPFeat_zscores = (FAPFeat − μ_FAPFeat) / σ_FAPFeat,   (1)

where FAPFeat denotes the FAP-based features, and μ_FAPFeat and σ_FAPFeat are their means and standard deviations, respectively, across the entire spectrum of emotions.
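The per-actor z-normalization of (1) can be sketched as follows, assuming the segment-level feature vectors are stacked row-wise and each row is tagged with its actor; the guard against zero standard deviation is our addition.

```python
import numpy as np

def z_normalize_per_actor(features, actor_ids):
    """Z-normalize each feature per actor, as in Eq. (1).

    `features` has shape (n_segments, 540) and `actor_ids` labels each
    segment with its actor.  Means and standard deviations are computed
    over all of an actor's segments, i.e. across every emotion that
    actor expressed.
    """
    z = np.empty_like(features, dtype=float)
    for actor in np.unique(actor_ids):
        rows = actor_ids == actor
        mu = features[rows].mean(axis=0)
        sigma = features[rows].std(axis=0)
        sigma[sigma == 0] = 1.0        # guard against constant features
        z[rows] = (features[rows] - mu) / sigma
    return z
```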
2.4 Facial Feature Scaling
The classifier for facial emotion recognition is a
default ARTMAP neural network. Since default
ARTMAP is a variant of fuzzy ARTMAP, the inputs
to the network need to be scaled to a zero-to-one
range. Scaling is performed using (2):

FAPFeat_scld = (FAPFeat_zscores − FAPFeat_min) / (FAPFeat_max − FAPFeat_min),   (2)

where FAPFeat_max and FAPFeat_min are the maximum and minimum z-score values, respectively.
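A small sketch of the scaling in (2), assuming the per-feature minima and maxima are taken over the z-scored training data; the zero-range guard is our addition.

```python
import numpy as np

def scale_to_unit_range(z_features):
    """Scale z-scored features to the [0, 1] range per feature, as in Eq. (2)."""
    lo = z_features.min(axis=0)
    hi = z_features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (z_features - lo) / span
```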
3 FACIAL EMOTION
CLASSIFICATION USING
ARTMAP
Even though FAP-based features have been shown to be quite effective for facial emotion recognition, there are some hindering issues in the choice of classifier. These issues stem from the way in which the feature vectors are formulated. A common practice is to accumulate FAP over a segment or an utterance and then formulate a feature vector by applying statistical operations over the accumulated FAP. The statistical operations applied in this case are: mean, standard deviation, range (max − min), upper quartile, lower quartile, and quartile range. This results in relatively large feature vectors with few training instances, because each instance is sampled over utterances/segments containing hundreds of frames.
Figure 4: The architecture of Fuzzy ARTMAP.
Therefore, fewer than 3000 training instances with a feature vector of over 500 dimensions present a particular case of the 'curse of dimensionality'. This issue has been tackled in the literature by coupling a classifier (neural network or SVM) with feature dimensionality reduction implemented through Information Gain (IG), Principal Feature Analysis (PFA), Deep Belief Networks (DBN), etc. (Kim et al., 2013).
In this research, however, we opted for a fuzzy ARTMAP classifier, specifically a default ARTMAP classifier. We chose ARTMAP because it enables fast learning by simultaneously clustering/categorizing and classifying. Additionally, ARTMAP is plastic while maintaining stability, i.e., it can learn new information without forgetting what it has already learnt. Before explaining the default ARTMAP classifier employed in this research, the following subsection provides some background on Fuzzy ARTMAP in particular and Fuzzy Adaptive Resonance Theory (ART) in general.
3.1 Fuzzy ARTMAP
Fuzzy ARTMAP, or supervised ART, is a combina-
tion of two ART neural networks that are connected
through a MAP field (shown in Fig. 4) (Carpenter
et al., 1991b). The first Fuzzy ART neural network, ART_a, categorizes the inputs, while the second one, ART_b, categorizes the output class labels. The association between the two categorizations is mapped via a MAP field, hence the name ARTMAP.
Fuzzy ART incorporates fuzzy logic into ART's pattern recognition, thus enhancing generalizability (Carpenter et al., 1992). The first step in Fuzzy ART learning is complement coding, which concatenates the fuzzy complement of the input vector a to the input itself:

A = (a | a^c).   (3)
After complement coding the inputs, Fuzzy ART is initialized by categorizing the first input and initializing the weights and the vigilance parameter, ρ. The vigilance parameter controls the level of fuzzy similarity required for inputs to be categorized into the same category node: a higher ρ means stricter categorization and hence more category nodes. Once Fuzzy ART has been initialized, the next input is selected and the activation signals to the committed nodes are calculated:

T_j = |A ∧ w_j| + (1 − α)(M − |w_j|).   (4)
Then, the activated nodes are checked for template matching, or resonance, using the criterion

ρ|A| − |x| ≤ 0,   (5)

where x = A ∧ w_j. If there is a match, the weights are updated using

w_j = (1 − β) w_j + β (A ∧ w_j),   (6)

where β is the learning rate. On the other hand, if there is no match, a new node j associated with the input is created:

w_j = A.   (7)
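The operations in (3)-(7) can be sketched as a single unsupervised Fuzzy ART learning step as follows. This is a simplified reading, not the authors' implementation: weights are kept as a Python list, |·| is the L1 norm of the non-negative vectors, and the default parameter values are placeholders (they mirror the configuration reported in Section 3.3).

```python
import numpy as np

def complement_code(a):
    """Eq. (3): concatenate the input with its fuzzy complement."""
    return np.concatenate([a, 1.0 - a])

def fuzzy_art_step(A, weights, rho, alpha=0.27, beta=0.7):
    """One Fuzzy ART presentation (Eqs. (4)-(7)).

    `A` is a complement-coded input of length 2M and `weights` is a
    list of category weight vectors.  Returns the index of the chosen
    (or newly created) category node.
    """
    M = A.size / 2.0
    if not weights:                                   # first input: Eq. (7)
        weights.append(A.copy())
        return 0
    # Choice-by-difference activation, Eq. (4).
    T = [np.minimum(A, w).sum() + (1 - alpha) * (M - w.sum()) for w in weights]
    for j in np.argsort(T)[::-1]:                     # search in order of T_j
        x = np.minimum(A, weights[j])                 # x = A ^ w_j
        if rho * A.sum() - x.sum() <= 0:              # match criterion, Eq. (5)
            weights[j] = (1 - beta) * weights[j] + beta * x   # Eq. (6)
            return j
    weights.append(A.copy())                          # no match: new node, Eq. (7)
    return len(weights) - 1
```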
In a Fuzzy ARTMAP, as well as a Default ARTMAP, the same ART categorization and learning scheme is used. However, in the supervised case, the vigilance parameter for the categorization is controlled via the labels coming through ART_b and the MAP field. Further explanation follows in the next subsection.
3.2 Default ARTMAP
Default ARTMAP was used as the facial emotion
classifier. The default ARTMAP (Amis and Carpen-
ter, 2007; Carpenter and Gaddam, 2010) is a fuzzy
ARTMAP with distributed coding for testing. Instead
of winner-takes-all (WTA) testing in the typical fuzzy
ARTMAP, the default ARTMAP employs the cod-
ing field activation method (CAM) (Carpenter et al.,
1991a) for distributed testing. The default ARTMAP is trained as follows (Fig. 5) (Amis and Carpenter, 2007):

Figure 5: Default ARTMAP notation.

1. Complement code the M-dimensional training feature vectors, a, to produce 2M-dimensional input vectors, A.
2. Select the first input vector, A, with its associated actual output class, K.
3. Set the initial weights.
4. Set the vigilance, ρ, to its baseline value and reset the code: y = 0.
5. Select the next input vector, A, with its associated actual output class, K.
6. Calculate the signals to the committed coding nodes:
   T_j = |A ∧ w_j| + (1 − α)(M − |w_j|).   (8)
7. Sort the committed coding nodes in descending order of their T_j values.
8. Search for a coding node, J, that meets the matching criterion and predicts the correct output class, K.
9. For the first sorted node that meets the matching criterion, set y_J = 1 (winner-takes-all).
10. If the active code, J, predicts the actual output class, K, keep it; otherwise, increase ρ to trigger a new search (adding a new coding node if necessary) and redo the matching.
11. Update the coding weights and go to step 4.
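A simplified sketch of one winner-takes-all training presentation following steps 4-11 above, assuming a plain match-tracking rule (raise ρ just past the match value of a node that predicts the wrong class). The map field is reduced to a per-node class label, which is a simplification of the full architecture; the parameter defaults mirror Section 3.3 but are placeholders here.

```python
import numpy as np

def artmap_train_step(A, K, weights, labels, base_rho=0.2,
                      alpha=0.27, beta=0.7, eps=1e-3):
    """One WTA Default ARTMAP training presentation (simplified).

    Nodes are visited in descending order of T_j; the first node that
    passes the match criterion *and* predicts class K is updated.
    `labels[j]` stores the class associated with coding node j.
    """
    M = A.size / 2.0
    rho = base_rho
    if weights:
        T = [np.minimum(A, w).sum() + (1 - alpha) * (M - w.sum())
             for w in weights]
        for j in np.argsort(T)[::-1]:
            x = np.minimum(A, weights[j])
            if rho * A.sum() - x.sum() > 0:
                continue                              # fails the match criterion
            if labels[j] == K:                        # correct class prediction
                weights[j] = (1 - beta) * weights[j] + beta * x
                return j
            rho = x.sum() / A.sum() + eps             # match tracking: raise rho
    weights.append(A.copy())                          # commit a new coding node
    labels.append(K)
    return len(weights) - 1
```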
After the default ARTMAP is trained, the testing is
performed in the following steps:
1. Complement code the M-dimensional test feature vectors, a, to produce 2M-dimensional input vectors, A.
2. Select the first input vector.
3. Reset the code: y = 0.
4. Calculate the signals to the committed coding nodes:
   T_j = |A ∧ w_j| + (1 − α)(M − |w_j|).   (9)
5. Let Λ ≡ {λ = 1, ..., C : T_λ > αM} and Λ′ ≡ {λ = 1, ..., C : T_λ = M} = {λ = 1, ..., C : w_λ = A}.
6. Apply the Increased Gradient (IG) CAM rule to calculate y_j (Fig. 6):
   y_j = [1 / (M − T_j)] / Σ_{λ∈Λ} [1 / (M − T_λ)] for j ∈ Λ, and y_j = 0 otherwise.

Figure 6: Default ARTMAP testing using the Increased Gradient CAM rule.

7. Calculate the distributed output predictions: σ_k = Σ_{j=1}^{C} W_jk y_j.
8. Predict the output class from σ_k.
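A sketch of the distributed testing steps 4-8 under our reading of the IG CAM rule, including the point-box case (Λ′ nonempty) mentioned in step 5. The map-field weight matrix `W_map` and the power parameter `p` are our notation, with `p` corresponding to the CAM rule parameter reported in Section 3.3; this is an illustrative reading, not the reference implementation.

```python
import numpy as np

def default_artmap_predict(A, weights, W_map, alpha=0.27, p=1.0):
    """Distributed Default ARTMAP prediction with the IG CAM rule (sketch).

    `A` is a complement-coded input, `weights` the committed coding-node
    weights, and `W_map` the map-field weights with shape
    (n_coding_nodes, n_classes).
    """
    M = A.size / 2.0
    T = np.array([np.minimum(A, w).sum() + (1 - alpha) * (M - w.sum())
                  for w in weights])
    y = np.zeros(len(weights))
    point_boxes = np.isclose(T, M)                    # Lambda': nodes with w_j == A
    if point_boxes.any():
        y[point_boxes] = 1.0 / point_boxes.sum()      # share activation equally
    else:
        Lam = T > alpha * M                           # Lambda
        if Lam.any():
            contrib = 1.0 / (M - T[Lam]) ** p         # IG CAM rule
            y[Lam] = contrib / contrib.sum()
    sigma = W_map.T @ y                               # sigma_k = sum_j W_jk * y_j
    return int(np.argmax(sigma)), sigma
```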
3.3 Facial Emotion Recognition using
default ARTMAP
A default ARTMAP network was used as the classifier for facial emotion recognition. The configuration of the trained ARTMAP network is shown in Fig. 7.
The default ARTMAP was trained with four-fold cross-validation on 2442 training instances from eight actors/speakers (four males and four females). The following configuration parameters were used:
Learning rate: 0.7
Choice parameter, α: 0.27
Baseline vigilance: 0.2
CAM rule parameter: 1
After the default ARTMAP for the facial parameters was trained, we obtained a facial emotion ARTMAP classifier with the following configuration: 1080 input nodes, 1112 category nodes, and five class nodes corresponding to the five emotion classes: neutral, happiness, anger, frustration, and sadness.

Figure 7: Configuration of the trained ARTMAP classifier for facial emotion recognition.

Table 1: Confusion matrix for the FAP-based facial emotion classifier.
          Neutral  Happy  Angry  Frust.  Sad
Neutral   0.529    0.059  0.059  0.235   0.118
Happy     0.019    0.830  0.057  0.075   0.019
Angry     0.036    0.071  0.500  0.393   0.000
Frust.    0.024    0.072  0.241  0.590   0.072
Sad       0.027    0.108  0.054  0.243   0.568
4 TESTS AND RESULTS
After training, the classifier was tested using the data from the remaining two actors (one male, one female) in the IEMOCAP database. In other words, the ARTMAP was trained using eight of the ten actors in IEMOCAP and tested using the other two. The classification results showed a five-class classification accuracy of over 68%. The confusion matrix for the default ARTMAP facial emotion classifier is shown in Table 1. As evident from the confusion matrix, the most frequent misclassifications occurred between anger and frustration. This confusion is understandable, as these two emotions are often not easily distinguishable even for humans.
We also compared our results with existing similar studies on IEMOCAP. These approaches used support vector machines (SVM) preceded by feature dimensionality reduction. Table 2 shows comparative results of the default ARTMAP against the following:
SVM with a Restricted Boltzmann Machine (RBM-SVM) (Shah et al., 2014)
SVM with Principal Feature Analysis (PFA-SVM) (Kim et al., 2013)
SVM with Deep Belief Networks (DBN-SVM) (Kim et al., 2013)
Emotion profiled SVM (EP-SVM), where each one-vs-all emotion classifier used a feature vector profiled for that particular emotion (Mower et al., 2011)

Table 2: Comparative results on IEMOCAP using FAP-based features.
Classification approach           Accuracy
RBM-SVM (Shah et al., 2014)       60.71%
PFA-SVM (Kim et al., 2013)        65%
DBN-SVM (Kim et al., 2013)        68%
EP-SVM (Mower et al., 2011)       71%
Default ARTMAP                    72.2%
These results are for four-class (neutral, happy, angry, and sad) classification, as those studies used four-class classification. It is evident from the table that our approach gives the best results for an FAP-based classifier on the IEMOCAP data set. Furthermore, (Mower et al., 2011) and (Kim et al., 2013) used both facial and vocal features; however, since they used a similar set of facial features and tested their approaches on IEMOCAP, we included their results for comparison as well.
5 CONCLUSION AND FUTURE
WORK
In this paper, we proposed facial emotion recognition using a default ARTMAP classifier. The proposed classification scheme, along with the FAP-based features, was shown to be an effective facial emotion classifier in the presence of speech. The results show that our approach also yielded better results than the existing state of the art on the IEMOCAP database.
In the future, we plan to integrate our emotion recognition with real-time perception. Furthermore, we also intend to investigate other configurations of ARTMAP involving distributed training along with the distributed testing used in this paper.
ACKNOWLEDGEMENT
This work was supported by the ICT R&D program
of MSIP/IITP. [2016-0-00563, Research on Adaptive
Machine Learning Technology Development for In-
telligent Autonomous Digital Companion]
REFERENCES
Amis, G. P. and Carpenter, G. A. (2007). Default ARTMAP
2. In 2007 International Joint Conference on Neural
Networks, pages 777–782.
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower,
E., Kim, S., Chang, J. N., Lee, S., and Narayanan,
S. S. (2008). IEMOCAP: Interactive emotional dyadic
motion capture database. Language Resources and
Evaluation, 42(4):335.
Carpenter, G. A. and Gaddam, S. C. (2010). Biased ART: A
neural architecture that shifts attention toward previ-
ously disregarded features following an incorrect pre-
diction. Neural Networks, 23(3):435 – 451.
Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds,
J. H., and Rosen, D. B. (1992). Fuzzy ARTMAP: A neu-
ral network architecture for incremental supervised
learning of analog multidimensional maps. Trans.
Neur. Netw., 3(5):698–713.
Carpenter, G. A., Grossberg, S., and Reynolds, J. H.
(1991a). ARTMAP: Supervised real-time learn-
ing and classification of nonstationary data by a
self-organizing neural network. Neural Networks,
4(5):565 – 588.
Carpenter, G. A., Grossberg, S., and Rosen, D. B. (1991b).
Fuzzy ART: Fast stable learning and categorization of
analog patterns by an adaptive resonance system. Neu-
ral Networks, 4(6):759 – 771.
Hirota, K. and Dong, F. (2008). Development of mascot
robot system in NEDO project. In Intelligent Systems,
2008. IS ’08. 4th International IEEE Conference, vol-
ume 1, pages 1–38–1–44.
Kim, Y., Lee, H., and Provost, E. M. (2013). Deep learn-
ing for robust feature generation in audio-visual emo-
tion recognition. In IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
Li, H., Lin, Z., Shen, X., Brandt, J., and Hua, G. (2015a).
A convolutional neural network cascade for face de-
tection. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Li, W., Li, M., Su, Z., and Zhu, Z. (2015b). A deep-learning
approach to facial expression recognition with candid
images. In Machine Vision Applications (MVA), 2015
14th IAPR International Conference on, pages 279–
282.
Liu, P., Han, S., Meng, Z., and Tong, Y. (2014). Facial
expression recognition via a boosted deep belief net-
work. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Liu, Z.-T., Min, W., Dan-Yun, L., Lue-Feng, C., Fang-Yan,
D., Yoichi, Y., and Kaoru, H. (2013). Communi-
cation atmosphere in humans and robots interaction
based on the concept of fuzzy atmosfield generated
by emotional states of humans and robots. Journal of
Automation, Mobile Robotics and Intelligent Systems,
7(2):52–63.
Mariooryad, S. and Busso, C. (2016). Facial expression
recognition in the presence of speech using blind lex-
ical compensation. IEEE Transactions on Affective
Computing, 7(4):346–359.
Mower, E., Mataric, M. J., and Narayanan, S. (2011). A
framework for automatic human emotion classifica-
tion using emotion profiles. IEEE Transactions on Au-
dio, Speech, and Language Processing, 19(5):1057–
1070.
Petajan, E. (2005). MPEG-4 Face and Body Animation
Coding Applied to HCI, pages 249–268. Springer US,
Boston, MA.
Rozgić, V., Ananthakrishnan, S., Saleem, S., Kumar, R., and
Prasad, R. (2012). Ensemble of SVM trees for mul-
timodal emotion recognition. In Signal Information
Processing Association Annual Summit and Confer-
ence (APSIPA ASC), 2012 Asia-Pacific, pages 1–4.
Shah, M., Chakrabarti, C., and Spanias, A. (2014). A multi-
modal approach to emotion recognition using undi-
rected topic models. In 2014 IEEE International Sym-
posium on Circuits and Systems (ISCAS), pages 754–
757.