Improving Bag-of-Visual-Words Towards Effective Facial Expressive
Image Classification
Dawood Al Chanti and Alice Caplier
Univ. Grenoble Alpes, CNRS, Grenoble INP (Institute of Engineering Univ. Grenoble Alpes), GIPSA-lab, 38000 Grenoble, France
Keywords:
BoVW, k-means++, Relative Conjunction Matrix, SIFT, Spatial Pyramids, TF.IDF.
Abstract:
The Bag-of-Visual-Words (BoVW) approach has been widely used in recent years for image classification
purposes. However, it has crucial limitations regarding optimal feature selection, the clustering technique,
the lack of spatial organization of the data and the weighting of visual words. These factors affect the stability
of the model and reduce performance. We propose an algorithm based on BoVW for facial expression analysis
which goes beyond these limitations. The visual codebook is built using the k-means++ method to avoid
poor clustering. To exploit reliable low level features, we search for the best feature detector, one that avoids
locating a large number of keypoints which do not contribute to the classification process. Then, we propose
to compute the relative conjunction matrix in order to preserve the spatial order of the data by coding the
relationships among visual words. In addition, a weighting scheme that reflects how important a visual word
is with respect to a given image is introduced. We speed up the learning process by using the histogram
intersection kernel with a Support Vector Machine to learn a discriminative classifier. The efficiency of the
proposed algorithm is compared with the standard bag of visual words method and with the bag of visual
words method with spatial pyramid. Extensive experiments on the CK+, the MMI and the JAFFE databases
show good average recognition rates. Likewise, the ability to recognize spontaneous and non-basic expressive
states is investigated using the DynEmo database.
1 INTRODUCTION & PRIOR ART
The Bag of Visual Words (BoVW) model with distinctive local features generated around keypoints has
become the most popular method for image classification tasks. It was first introduced by (Sivic and
Zisserman, 2003) for object matching in videos. Sivic and Zisserman described the BoVW method by
analogy with text retrieval and analysis, wherein a document is represented by its word frequencies without
regard to their order. The word frequencies are considered as the signature of the document and are then
used to perform classification.
From 2003 until now, it has obtained state-of-the-art performance on several computer vision applications
such as human action recognition (Peng et al., 2016), scene classification (Zhu et al., 2016) and face
recognition (Hariri et al., 2017). To the best of our knowledge, this approach has not been investigated for
the task of facial expression recognition since around 2013 (Ionescu et al., 2013). In this paper, we aim at
introducing some improvements over
the standard BoVW method to tackle facial expres-
sion recognition.
The standard steps for deriving the signature of a facial expression image using BoVW are represented in
figure 1 (a): (1) keypoint localization in the image; (2) keypoint description using local descriptors; (3)
vector quantization of the descriptors by clustering them into k clusters with a clustering method, resulting
in a visual word vocabulary which forms the codebook; (4) establishing the signature of each image by
accumulating the visual words into a histogram; (5) normalizing the histogram by dividing the frequency of
each visual word by the total number of visual words; and (6) training a classifier on the obtained image
signatures for the classification task. A minimal sketch of this standard pipeline is given below.
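As an illustration of these six steps, the following minimal sketch (not the authors' original code) uses OpenCV's SIFT detector/descriptor, scikit-learn's k-means and a linear SVM; the vocabulary size and all function names are assumptions made for the example.

```python
# Minimal standard-BoVW sketch (illustrative; assumes OpenCV >= 4.4 and scikit-learn).
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

sift = cv2.SIFT_create()

def describe(image_gray):
    """Steps (1)-(2): locate keypoints and compute SIFT descriptors."""
    _, desc = sift.detectAndCompute(image_gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_codebook(train_images, k=200):
    """Step (3): quantize all training descriptors into k visual words."""
    all_desc = np.vstack([describe(img) for img in train_images])
    return KMeans(n_clusters=k, random_state=0).fit(all_desc)

def signature(image_gray, codebook):
    """Steps (4)-(5): accumulate visual words into a normalized histogram."""
    words = codebook.predict(describe(image_gray))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_classifier(train_images, labels, codebook):
    """Step (6): train a classifier on the obtained image signatures."""
    X = np.array([signature(img, codebook) for img in train_images])
    return LinearSVC().fit(X, labels)
```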
The BoVW model has been the most frequently used and dominant technique for visual content description.
However, this approach has some drawbacks that affect performance. First, during the feature detection
process, a large number of keypoints are located. This increases the computational cost, and most of these
keypoints arise in the background regions.
Figure 1: (a): Standard BoVW representation for facial expressive image. (b): Three level spatial pyramid example.
The second problem is poor clustering when the usual clustering method is Lloyd's algorithm, referred to as
the k-means algorithm, because several local features are then encoded with the same visual word. The third
limitation is that standard BoVW represents an image as an unordered collection of local descriptors, so the
spatial organization of the data is lost. The final drawback is the weighting scheme: standard BoVW considers
all visual words equally, while some visual words may be of greater importance.
In the literature, many attempts have been conducted to improve the standard BoVW model. In (Lazebnik
et al., 2006), spatial pyramid matching is proposed as an extension to exploit the spatial information. This
technique works by partitioning the image into increasingly fine sub-regions and computing histograms of
local features found inside each sub-region. The technique shows significantly improved performance on
scene categorization tasks. In (Zhang et al., 2011), descriptive visual words and descriptive visual phrases
are proposed as the visual correspondences of text words and phrases, where visual phrases refer to
frequently co-occurring visual word pairs. In (Xie et al., 2013), the descriptive ability of the visual
vocabulary is investigated by proposing a weighting based method. In (Altintakan and Yazici, 2015), the
k-means clustering method is replaced by self-organizing maps (SOM) as an alternative codebook generation
method. Notably, the state-of-the-art methods proposed to improve the standard BoVW method deal with one
limitation at a time. In this paper, we perform several improvements, almost at each step of the standard
method.
Our main contributions are as follows. First, we investigate the use of BoVW for acted and spontaneous
facial expression recognition. More specifically, we search for the best feature detector that suits our
application. We integrate the k-means++ method (Arthur and Vassilvitskii, 2007) into the BoVW method
instead of the k-means algorithm, in an attempt to obtain a more distinctive codebook. We introduce the
relative conjunction matrix in order to preserve the spatial organization of the data. We introduce an
efficient weighting scheme based on term-frequency inverse-document-frequency (TF.IDF), in order to scale
up the rare visual words while damping the effect of the frequent visual words. Finally, for learning
discriminative classifiers, we use the histogram intersection kernel, since we experience much faster training
than with the RBF kernel used by the Support Vector Machine (SVM). Our choice of the BoVW method as a
technique to tackle facial expression classification is motivated by the fact that the differences between
facial expressions are contained in the changes of location, shape and texture of facial cues (eyes, nose,
eyebrows, etc.). We also want to explore the generalization power of geometrical based methods in case of
strong geometrical deformations on faces (which is the case of acted emotions) and in case of more subtle
deformations (which is the case of spontaneous facial expressions).
2 BEYOND STANDARD BAG OF
VISUAL WORDS MODEL
2.1 Feature Selection and Description
In image classification, low level visual features are used to represent different geometrical properties.
Selecting these features plays a key role in developing effective classification. In order to recognize
facial expressions, low level visual features can be extracted from the facial deformations in the geometry
of the facial shape. For example, anger on a face can be characterized by: eyebrows pulled down, upper lids
pulled up, lower lids pulled up, lips possibly tightened. Thereby, facial visual cues such as the eyes, nose,
mouth, cheeks, eyebrows, forehead, etc. (Regions of Interest: RoI) provide observable changes when an
emotion occurs. Therefore, a feature detector
that locates keypoints around those RoI would limit the risk of generating a huge number of redundant
keypoints. Returning to figure 1 (a), we can see that the 2D-Harris detector focuses on locating keypoints
over the RoI. Although the DoG detector also focuses on the RoI, the keypoints it finds are not as numerous
as required. And though dense feature extraction is known to be good for many classification problems
(Furuya and Ohbuchi, 2009), for facial expression recognition the located keypoints are too numerous and
largely redundant.
Then, the extracted keypoints are described using local descriptors, in our case SIFT descriptors, because
they are invariant to image transformations, lighting variations and occlusions, while being rich enough to
carry discriminative information.
2.2 k-means++ Clustering Algorithm
The next step is to perform vector quantization, in order to quantize the feature space into a discrete
number of visual words. This step is important to map the image from a set of high-dimensional descriptors
to a list of visual word indices and thus to provide a distinctive codebook. The usual method is k-means.
The simple k-means algorithm can generate arbitrarily bad clusterings, as its approximation guarantee is
unbounded even for fixed numbers of data points n and clusters k (Arthur and Vassilvitskii, 2007). The
simplicity of the k-means algorithm comes at the price of accuracy. To tackle this problem, we propose the
use of the k-means++ algorithm, which is the result of augmenting the k-means algorithm with a randomized
seeding technique. The augmentation improves both the speed and the accuracy of k-means. The main steps of
k-means++ clustering are:
1. Choose an initial center uniformly at random from
the data points.
2. For each data point x, compute D(x), the distance
between x and the nearest center that has already
been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a
point x is chosen with probability proportional to D(x)^2.
4. Repeat steps 2 and 3 until a total of k centers has
been selected.
5. Proceed as with standard k-means algorithm.
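The seeding procedure above can be sketched in a few lines of NumPy (illustrative only; scikit-learn's KMeans already implements it via init='k-means++'):

```python
import numpy as np

def kmeanspp_seeds(X, k, rng=np.random.default_rng(0)):
    """k-means++ seeding, following steps 1-4 above. X has shape (n_points, dim)."""
    # 1. Choose the first center uniformly at random from the data points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # 2. D(x): squared distance from each point to its nearest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        # 3. Sample a new center with probability proportional to D(x)^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    # 4. Stop once k centers are chosen; step 5 then runs standard k-means, e.g.
    #    sklearn.cluster.KMeans(n_clusters=k, init=np.array(centers), n_init=1).
    return np.array(centers)
```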
2.3 Relative Conjunction Matrix
The BoVW approach describes an image as a bag of
discrete visual words. The frequency distributions of
these words are used for image categorization. The
standard BoVW approach yields an incomplete representation of the data, because image features are modeled
as independent and orderless visual words. Thus there is no explicit use of visual word
positions within the image. Traditional visual words
based methods suffer when faced with similar appea-
rances but distinct semantic concepts (Aldavert et al.,
2015). In this study, we assume that establishing spa-
tial dependencies might be useful for preserving the
spatial organization of data. Thus we develop a novel
facial image representation which uses the concept of
the relative conjunction matrix to take into account
links between the visual words.
A relative conjunction matrix of visual words de-
fines the spatial order by quantifying the relationships
of each visual word with other visual words. To es-
tablish the correlation between the visual words, the
neighborhood of each visual word feature is used.
Thereby, a facial image is described by a histogram
of pair-wise visual words. It provides a more discri-
minative representation since it contains the spatial ar-
rangement of the visual words.
We define a relative conjunction matrix to esta-
blish pairs of visual words by looking for all possible
pairs of visual words. This can be considered as a
representation of the contextual distribution of each
visual word with respect to other visual words of the
vocabulary. The relative conjunction matrix C has a
size N × N, where N is the vocabulary size. Each element C_{i,j} represents the pair of one independent
feature to another. The obtained C has all possible pairs. Each row vector of C stores how many times a
particular visual word (for example W_1) occurs with any other visual words (for example W_2, W_3, W_4,
..., W_N).
For a particular facial expression, if any two visual words have a similar contextual distribution, that
means they are capturing something similar and are thus related to each other. The diagonal and the upper
part of C are considered for quantification. For quantifying this new representation, we adopt the method
used in (Scovanner et al., 2007), in which the correlation between the distribution vectors of any two
visual words is computed. If the correlation is above a certain threshold (experimentally we found that 0.6
is a good threshold), we group the two words together and merge their corresponding frequency counts from
the initial histogram into a new grouped histogram.
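A possible sketch of this step is given below, assuming each keypoint has already been assigned a visual word label and (x, y) image coordinates; the neighborhood radius and helper names are illustrative, while the 0.6 correlation threshold is the value reported above.

```python
import numpy as np
from scipy.spatial import cKDTree

def conjunction_matrix(word_ids, positions, vocab_size, radius=16.0):
    """Count co-occurring visual-word pairs among spatially neighboring keypoints."""
    C = np.zeros((vocab_size, vocab_size), dtype=np.float32)
    tree = cKDTree(positions)
    for i, wi in enumerate(word_ids):
        for j in tree.query_ball_point(positions[i], r=radius):
            if j != i:
                C[wi, word_ids[j]] += 1
    return C

def group_correlated_words(C, threshold=0.6):
    """Join visual words whose row (contextual) distributions correlate above the threshold."""
    corr = np.corrcoef(C)                          # correlation between row vectors of C
    return np.argwhere(np.triu(corr > threshold, k=1))  # candidate word pairs to merge
```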
2.4 TF.IDF Weighting Scheme
Weighting of visual words is crucial for classification performance, but standard BoVW simply normalizes the
visual words by dividing them by the total number of visual words in the image. In (Van Gemert et al.,
2010), the authors
investigate several types of soft-assignment of visual words to image features. They show that choosing the
right weighting scheme can improve the recognition performance. Each weight has to take into account the
importance of each visual word in the image. For facial expression recognition, we are interested in scaling
up the weights corresponding to visual words extracted from the nose, the eyebrows, the mouth, etc., while
damping the effect of frequent visual words that describe the hair and non-deformable regions like the
background. Therefore, it is possible to leverage the term-frequency inverse-document-frequency (TF.IDF)
weighting scheme (Leskovec et al., 2014) to scale up the rare visual words while scaling down the frequent
ones.
The standard weighting method used in the traditional BoVW approach is equivalent to the term frequency,
referred to as TF(vw), where vw is the visual word. It measures how frequently a visual word occurs in an
image and is normalized by dividing it by the total number of visual words in the image. The utilization of
TF(vw) alone in classification is rather straightforward and usually results in decreased accuracy, due to
the fact that all visual words are considered equally important.
However, the inverse-document-frequency, referred to as IDF(vw), assigns different weights to features. It
provides information about the general distribution of visual word vw amongst the facial images of all
classes. The utilization of IDF(vw) is based on its ability to distinguish between visual words with some
semantic meaning and simple visual words. IDF(vw) measures how unique a visual word vw is and how
infrequently it occurs across all training facial expression images:

IDF(vw) = \log\left(\frac{T}{n_{vw}}\right)

where T is the total number of training images and n_{vw} is the number of occurrences of vw in the whole
training database T.
However, if we assume that certain visual words may appear many times but have little importance, then we
need to weight down the most frequent visual words while scaling up the rare ones, by computing the
term-frequency inverse-document-frequency, referred to as TF.IDF, where:

TF.IDF(vw) = TF(vw) \cdot \log\left(\frac{T}{n_{vw}}\right)    (1)
TF.IDF_{vw,I} assigns to visual word vw a weight in image I, where I ∈ T, such that: a high weight is given
when vw occurs frequently within a small number of images, thus lending high discriminating power to those
images; a low weight is given when vw occurs less frequently in an image, or occurs in many images (for
example, a background visual word), thus offering a less pronounced relevance signal.
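Equation 1 can be sketched as follows for a whole training set, assuming the per-image visual word counts are stacked in a matrix (one row per image); here n_vw is read as the number of training images containing the word, one common interpretation of the definition above, and all names are illustrative.

```python
import numpy as np

def tfidf_weights(counts):
    """counts: (n_images, vocab_size) raw visual-word counts per training image."""
    # TF(vw): word frequency normalized by the total number of words in the image.
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    # n_vw: number of training images in which each visual word occurs (assumed reading).
    n_vw = np.maximum((counts > 0).sum(axis=0), 1)
    # IDF(vw) = log(T / n_vw), with T the total number of training images.
    idf = np.log(counts.shape[0] / n_vw)
    # Equation 1: TF.IDF(vw) = TF(vw) * log(T / n_vw).
    return tf * idf
```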
3 FACIAL EXPRESSION
CLASSIFICATION
The proposed Improved Bag-of-Visual-Word model
(ImpBoVW) for facial expression classification is
summarized as follows:
1. Locate and extract salient features (keypoints)
from facial images either based on a feature detec-
tor such as: Difference of Gaussian (DoG), 2D-
Harris detector, or by defining a grid with pre-
specified spatial step (for example 5 pixels) to ex-
tract local feature descriptors from.
2. Describe the local features over the selected salient keypoints; the SIFT descriptor is used.
3. Quantize the descriptors gathered from all the keypoints by clustering them into k clusters using
k-means++. This quantizes the space into a pre-specified number (the vocabulary size) of visual words. The
cluster centers represent the visual words, and the resulting visual word vocabulary forms the codebook.
4. Map the set of high dimensional descriptors of an image into a list of visual words by assigning the
nearest visual word to each of its features in the feature space. This results in the histogram of visual
words, which summarizes the entire facial image and is considered as the signature of the image.
5. Build feature groupings among the words. A co-occurrence based criterion is used for learning
discriminative word groupings using the Relative Conjunction Matrix.
6. Introduce the proper TF.IDF weighting scheme based on equation 1.
7. Train an SVM classifier over the diagonal and the upper part of the weighted conjunction matrix for
facial expression recognition. The histogram intersection kernel (equation 2) is used by the SVM to learn a
discriminative classifier.
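Step 7 can be sketched with a precomputed histogram intersection kernel (equation 2) and scikit-learn's SVC; this is an illustrative sketch under that assumption, not the authors' implementation, and C = 10.0 is the value reported later in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """Histogram intersection kernel: K[i, j] = sum_d min(A[i, d], B[j, d])."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def train_svm(X_train, y_train, C=10.0):
    """X_train: weighted conjunction-matrix features (one row per image); y_train: labels."""
    K_train = intersection_kernel(X_train, X_train)
    return SVC(kernel="precomputed", C=C).fit(K_train, y_train)

def predict(clf, X_train, X_test):
    """Kernel between test and training signatures is needed at prediction time."""
    return clf.predict(intersection_kernel(X_test, X_train))
```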
4 BoVW MODEL WITH SPATIAL
PYRAMID REPRESENTATION
In order to evaluate the effectiveness of the proposed
method and for fair comparison, we have also im-
plemented the spatial pyramid BoVW model presen-
ted in (Lazebnik et al., 2006) in addition to the stan-
dard BoVW method. BoVW method with spatial py-
ramid has shown significantly improved performance
on scene categorization tasks (Zhu et al., 2016). Spa-
tial Pyramid BoVW (SP BoVW) representation is an
extension of an orderless BoVW image representa-
tion. It aims at subdividing the image into increa-
singly fine resolutions and at computing histograms
of local features. Thus, it aggregates statistics of lo-
cal features over fixed sub-regions. A match between
two keypoints occurs if they fall into the same cell of
the grid. Suppose X and Y are two sets of vectors in a d-dimensional feature space. Let us construct a
sequence of grids at resolutions 0, ..., L, such that the grid at level l has 2^l cells along each dimension,
for a total of D = 2^{dl} cells. Let H_X^l and H_Y^l denote the histograms of X and Y at this resolution,
such that H_X^l(i) and H_Y^l(i) are the numbers of points from X and Y that fall into the i-th cell of the
grid. Then the number of matches at level l is given by the histogram intersection function:

I(H_X^l, H_Y^l) = \sum_{i=1}^{D} \min(H_X^l(i), H_Y^l(i))    (2)

The weight associated with level l is proportional to the cell width at that level: 1/2^{L-l}.
However, the pyramid match kernel (PMK) penalizes matches found in larger cells, since they involve
increasingly dissimilar features, thus:

PMK^L(X, Y) = I(H_X^L, H_Y^L) + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left( I^l - I^{l+1} \right)
            = \frac{1}{2^L} I^0 + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}} I^l    (3)

Equation 3 is known as a Mercer kernel that combines both the histogram intersection and the pyramid match
kernel (Grauman and Darrell, 2007).
For the spatial pyramid representation, the pyramid matching is performed in the 2D image space and the
k-means clustering algorithm is used to quantize all feature vectors into M discrete channels. Each channel
m gives two sets of two-dimensional vectors, X_m and Y_m, corresponding to the coordinates of the features
of channel m found in the respective images. The final kernel (FK) is the sum of the separate channel
kernels:

FK^L(X, Y) = \sum_{m=1}^{M} PMK^L(X_m, Y_m)    (4)
Figure 1 (b) represents the construction of a three le-
vel spatial pyramid. The image has different feature
types, indicated by different colors. At the top, the
facial image is sliced at two different levels of resolu-
tion. Then, for each resolution and each channel, the
features that fall in each spatial bin are counted. Fi-
nally, each spatial histogram is weighted according to
equation 3.
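A compact sketch of the weighted spatial pyramid histograms is given below, assuming keypoints already carry visual word labels and normalized (x, y) coordinates in [0, 1); the level count and names are illustrative. Because min(w·a, w·b) = w·min(a, b), the histogram intersection of two such concatenated vectors reproduces the per-level weighting of equation 3, summed over channels as in equation 4.

```python
import numpy as np

def spatial_pyramid_histogram(word_ids, xy, vocab_size, L=2):
    """Concatenate per-cell histograms for levels 0..L, weighted as in equation 3."""
    feats = []
    for level in range(L + 1):
        cells = 2 ** level                        # 2^l cells along each image dimension
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        hist = np.zeros((cells, cells, vocab_size), dtype=np.float32)
        cols = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
        rows = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
        for r, c, w in zip(rows, cols, word_ids):
            hist[r, c, w] += 1                    # count words falling into each spatial bin
        feats.append(weight * hist.ravel())
    return np.concatenate(feats)
```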
5 EXPERIMENTAL RESULTS
In this section, we present the experimental design
used to evaluate the proposed algorithm and compare
it to other approaches. First, we present the datasets
and protocols. Then, we describe the evaluation pro-
cedure and finally we present the results.
5.1 Data Exploration
For effective and fair comparison, four different data-
bases are used: three with Ekman’s caricatured facial
expressions (the JAFFE database (Lyons et al., 1998),
the extended Cohn Kanade database (CK+) (Kanade
et al., 2000) and the MMI facial expression databases
(Pantic et al., 2005)) and one with spontaneous ex-
pressions (the DynEmo database (Tcherkassof et al.,
2013)).
The JAFFE Database: it is a well-known data-
base made of acted facial expressions. It contains 213
facial expression images with 10 different identities.
It includes: “happy”, “anger”, “sadness”, “surprise”,
“disgust”, “fear” and “neutral”. The head is in frontal
pose. The number of images corresponding to each expression is roughly the same (around 21). Seven
identities are used during the training phase, while the other three identities are used during the test
phase.
The CK+ Database: it is a widely used database
containing acted Ekman’s expressions. It is composed
of 123 different identities. In our study, we pick out
the last frame which represents the peak of emotion.
Each subject performed different sessions correspon-
ding to different emotions. In total, we collected 306 images, each associated with its ground truth label.
The MMI Database: it is composed of more than
1500 samples of both static images and image sequen-
ces of faces in frontal and in profile views displaying
various facial expressions. It is performed by 19 different people of both genders, ranging in age from 19
to 62 and having different ethnic backgrounds. Images are given a single label that belongs to one of the
six Ekman emotions. In this study, we collect 600 static frontal images and we exploit the 900 sequences to
extract additional static expressive images from each sequence. From this database we create a collection of
1900 static images. This database is used to check
the scalability of the proposed approach on a large
database.
The DynEmo Database: it is a database con-
taining elicited facial expressions. It is made of six
spontaneous expressions which are: “irritation”, “cu-
riosity”, “happiness”, “worried”, “astonishment”, and
“fear”. The database contains a set of 125 recordings
of facial expressions of ordinary Caucasian people
(ages 25 to 65, 182 females and 176 males) filmed in
natural but standardized conditions. 480 expressive
images that correspond to 65 different identities are
extracted from the database. The head pose is not strictly frontal. The number of images corresponding
to each of the six categories of expressions is roughly
the same (80 images per class). The dataset is chal-
lenging since it is closer to natural human behaviour
and each person has a different way to react to a given
emotion.
Training Protocol: Identities that appear in the
training sets do not appear in the test sets.
Train set: 70% of the randomly shuffled images per class are picked out as the training set. Therefore, for the
JAFFE dataset we have 143 training images (20 ima-
ges per class), for the CK+ dataset we have 216 trai-
ning images (36 images per class), for the MMI da-
taset we have 1330 images (around 221 images per
class) and finally for the DynEmo we have 360 trai-
ning images (60 images per class).
Development Set: Leave-one-out cross validation is
considered over the training set to tune the algorithm
hyper-parameters.
Test Set: 30% of the images per class are picked out as the test set, that is 70 test images from the JAFFE set (10 ima-
ges per class), 90 test images (15 images per class) for
the CK+ set, 570 test images (95 images per class) for
the MMI set and 120 test images from the DynEmo
set (20 images per class), randomly shuffled, to test
the performance of the proposed method.
5.2 Experimental Setup and Results
We focus the experimental evaluation of the proposed method on the following four questions: What is the
best feature detector for locating salient and reliable feature points for facial expression recognition?
Does each of the proposed novelties improve the performance of SBoVW? What is the influence of using
k-means++? Is the proposed approach efficient for facial expression recognition and scalable to larger
databases?
The proposed model has several parameters that influence its classification performance: the usage of the
TF.IDF weighting scheme, the usage of the Relative Conjunction Matrix (RCM), the combination of the TF.IDF
weighting scheme with the RCM, the usage of k-means or k-means++ as the clustering method, and the choice of
the feature detector and descriptor. Thereby, in order to answer the first three questions, we report the
facial expression recognition performance of the SBoVW representation combined with each novelty. The JAFFE
(caricatured facial expressions) and the DynEmo (non-caricatured facial expressions) databases are used for
the method evaluation and setting, and to figure out the best feature detector that suits facial expression
recognition. The final question is addressed using the CK+ and the MMI databases, to establish the
performance of the ImpBoVW model and to compare it with SBoVW and SP BoVW.
Multi-class classification is done using an SVM trained with the one-versus-all rule. The histogram
intersection kernel presented in equation 2 is used. Compared to the RBF kernel, we experience faster
computation while the accuracy rate has a smaller variance. Leave-one-out cross-validation is used over the
training set in order to tune the algorithm hyper-parameters, such as the regularization parameter C (the
optimal value is 10.0) and the gamma parameter of the Gaussian (RBF) kernel when non-linear classification
is considered. We fix the vocabulary size to 2000 visual words, since experimentally it shows the best
classification performance. For the spatial pyramid representation, we notice that level-2 and level-3
achieve the same performance. Thus, to reduce the complexity of feature computation, we only consider two
levels.
Figure 2 represents the performance of each novelty and its contribution to the final ImpBoVW over the JAFFE
database. ImpBoVW represents the best final model, which is a combination of the SBoVW representation with
the RCM and the TF.IDF weighting scheme (represented as a star in figures 2 and 3). Its quantization is
based on k-means++. Its features are located using the 2D-Harris detector and described using the SIFT
descriptor. The final average recognition rate obtained over the JAFFE database is 92%. If we build the
ImpBoVW model with the DoG feature detector, we get 84% accuracy, while 89.5% is achieved using dense
features. 2D-Harris is less computationally expensive than dense features because it produces fewer
redundant features. In addition, figure 2 shows that when each novelty is used alone along with SBoVW, a
noticeable increase in the recognition rate is achieved. More importantly, the figure shows that the usage of
k-means++ increases the classification rate significantly.
In order to fairly estimate the performance of the ImpBoVW model and to get a fair judgment about the choice
of the best feature detector and clustering method, we evaluate the same steps over the DynEmo database.
Figure 2: Classification accuracy obtained for SBoVW associated with improved novelties over acted
expressions (JAFFE) combined with different feature detection methods.
Figure 3: Classification accuracy obtained for SBoVW associated with improved novelties over spontaneous
expressions (DynEmo) combined with different feature detection methods.
DynEmo represents naturalistic static images where subtle deformations occur. The ImpBoVW model using the
2D-Harris detector achieves a 64% recognition rate (see figure 3), versus 55% using dense features and 49%
using DoG features, all with k-means++. However, for the same settings with k-means, the following results
are achieved: 53% (2D-Harris), 49% (dense) and 42% (DoG). Figures 2 and 3 confirm that the 2D-Harris
detector is a good detector for localizing salient features, that the k-means++ clustering method makes a
significant contribution to the final recognition rate, and that the combination of SBoVW with the RCM and
TF.IDF improves the representational quality of the image signature.
Computational Time Performance: In order to compare the time complexity of the proposed method with
k-means++ and the histogram intersection kernel, we report in figure 5 the time taken, in minutes, for the
whole training phase of SBoVW+RCM+TF.IDF with either k-means or k-means++ and with either the RBF kernel or
the intersection kernel. Method number 4 represents ImpBoVW. The contribution of k-means++ helps speed up
the process of vector quantization. In addition, the histogram intersection kernel also contributes to
decreasing the complexity of the learning part by the SVM. Figure 5 shows that our method achieves a good
computational time compared to the original algorithm.

Figure 4: Performance over four databases compared with SBoVW and SP BoVW.

Figure 5: Computational time for generating the BoVW features.
6 CONCLUSION
In this paper, we introduced an improved BoVW approach for automatic emotion recognition. We examined
several aspects of the SBoVW approach that are directly linked to gains in classification performance, speed
and scalability. It has been shown that the Harris detector is suitable for emotion recognition, as it
selects adaptable and reliable salient keypoints. We improved the codebook generation process by employing
k-means++ as the clustering method, thereby gaining speed and accuracy. The importance of the spatial
organization of the data has been examined, and we optimized the feature representation by introducing a
relative conjunction matrix to preserve the spatial order. We then properly weighted the visual words using
TF.IDF based on their
occurrences. The histogram intersection kernel has been used to decrease the complexity of the algorithm. We
implemented SP BoVW for comparison purposes, and different feature detection methods were evaluated. We
noticed that the geometrical based method is robust if strong geometrical deformations are present on the
face, which is the case with acted expressions. However, for spontaneous expressions, where the facial
deformations are more subtle, it appears that geometrical based methods alone are not efficient enough to
achieve good performance, because each person has a different way to react to a given emotion. For future
work, the idea would be to combine the proposed approach with the appearance based facial expression
recognition method we developed in (Chanti and Caplier, 2017), in order to benefit from the advantages of
both approaches.
REFERENCES
Aldavert, D., Rusiñol, M., Toledo, R., and Lladós, J. (2015).
A study of bag-of-visual-words representations for
handwritten keyword spotting. International Jour-
nal on Document Analysis and Recognition (IJDAR),
18(3):223–234.
Altintakan, U. L. and Yazici, A. (2015). Towards ef-
fective image classification using class-specific code-
books and distinctive local features. IEEE Transacti-
ons on Multimedia, 17(3):323–332.
Arthur, D. and Vassilvitskii, S. (2007). k-means++: The
advantages of careful seeding. In Proceedings of the
eighteenth annual ACM-SIAM symposium on Discrete
algorithms, pages 1027–1035. Society for Industrial
and Applied Mathematics.
Chanti, D. A. and Caplier, A. (2017). Spontaneous fa-
cial expression recognition using sparse representa-
tion. In Proceedings of the 12th International Joint
Conference on Computer Vision, Imaging and Com-
puter Graphics Theory and Applications (VISIGRAPP
2017), pages 64–74.
Furuya, T. and Ohbuchi, R. (2009). Dense sampling and fast
encoding for 3d model retrieval using bag-of-visual
features. In Proceedings of the ACM international
conference on image and video retrieval, page 26.
ACM.
Grauman, K. and Darrell, T. (2007). The pyramid match
kernel: Efficient learning with sets of features. Jour-
nal of Machine Learning Research, 8(Apr):725–760.
Hariri, W., Tabia, H., Farah, N., Declercq, D., and Benoua-
reth, A. (2017). Geometrical and visual feature quan-
tization for 3d face recognition. In VISAPP 2017 12th
Joint Conference on Computer Vision, Imaging and
Computer Graphics Theory and Applications.
Ionescu, R. T., Popescu, M., and Grozea, C. (2013). Lo-
cal learning to improve bag of visual words model for
facial expression recognition. In Workshop on chal-
lenges in representation learning, ICML.
Kanade, T., Cohn, J. F., and Tian, Y. (2000). Comprehensive
database for facial expression analysis. In Automa-
tic Face and Gesture Recognition, 2000. Proceedings.
Fourth IEEE International Conference on, pages 46–
53. IEEE.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond
bags of features: Spatial pyramid matching for recog-
nizing natural scene categories. In 2006 IEEE Compu-
ter Society Conference on Computer Vision and Pat-
tern Recognition (CVPR’06), volume 2, pages 2169–
2178.
Leskovec, J., Rajaraman, A., and Ullman, J. D. (2014). Mi-
ning of massive datasets. Cambridge University Press.
Lyons, M., Akamatsu, S., Kamachi, M., and Gyoba, J.
(1998). Coding facial expressions with gabor wa-
velets. In Automatic Face and Gesture Recognition,
1998. Proceedings. Third IEEE International Confe-
rence on, pages 200–205. IEEE.
Pantic, M., Valstar, M., Rademaker, R., and Maat, L.
(2005). Web-based database for facial expression ana-
lysis. In Multimedia and Expo, 2005. ICME 2005.
IEEE International Conference on, pages 5–pp. IEEE.
Peng, X., Wang, L., Wang, X., and Qiao, Y. (2016). Bag of
visual words and fusion methods for action recogni-
tion: Comprehensive study and good practice. Com-
puter Vision and Image Understanding, 150:109–125.
Scovanner, P., Ali, S., and Shah, M. (2007). A 3-
dimensional sift descriptor and its application to
action recognition. In Proceedings of the 15th ACM
international conference on Multimedia, pages 357–
360. ACM.
Sivic, J. and Zisserman, A. (2003). Video google: A text
retrieval approach to object matching in videos. In
Proceedings of the IEEE International Conference on Computer Vision (ICCV), page 1470. IEEE.
Tcherkassof, A., Dupré, D., Meillon, B., Mandran, N., Du-
bois, M., and Adam, J.-M. (2013). Dynemo: A vi-
deo database of natural facial expressions of emoti-
ons. The International Journal of Multimedia & Its
Applications, 5(5):61–80.
Van Gemert, J. C., Veenman, C. J., Smeulders, A. W., and
Geusebroek, J.-M. (2010). Visual word ambiguity.
IEEE transactions on pattern analysis and machine
intelligence, 32(7):1271–1283.
Xie, Y., Jiang, S., and Huang, Q. (2013). Weighted visual
vocabulary to balance the descriptive ability on gene-
ral dataset. Neurocomputing, 119:478–488.
Zhang, S., Tian, Q., Hua, G., Huang, Q., and Gao, W.
(2011). Generating descriptive visual words and vi-
sual phrases for large-scale image applications. IEEE
Transactions on Image Processing, 20(9):2664–2677.
Zhu, Q., Zhong, Y., Zhao, B., Xia, G.-S., and Zhang, L.
(2016). Bag-of-visual-words scene classifier with lo-
cal and global features for high spatial resolution re-
mote sensing imagery. IEEE Geoscience and Remote
Sensing Letters, 13(6):747–751.