images per person. This is called the small sample
problem (SSP). Many research efforts have targeted the SSP
(Yan et al., 2014; Lu et al., 2013; Su et al., 2010), and
in this paper we also propose a new algorithm that deals
with few training examples for face identification.
Related Work. The first successful face recognition
algorithm, Eigenfaces (Turk and Pentland, 1991), was
based on the now well-known subspace method of principal
component analysis. Another widely used method is
Fisherfaces (Belhumeur et al., 1997), which uses linear
discriminant analysis. These methods can perform well if
a large amount of correctly aligned and normalized face
data is available. However, since they use pixel
intensities directly as input data, pose variations and
alignment errors can easily deteriorate their performance.
To cope with the noise caused by illumination and pose
variations, methods based on edge and local feature
extraction have been proposed. Among the best known of
these are Gabor filters (Jemaa and Khanfir, 2009), the
histogram of oriented gradients (HOG) (Dalal and Triggs,
2005), the scale-invariant feature transform (SIFT)
(Lowe, 2004), and local binary patterns (LBP) (Ahonen
et al., 2004). These methods have been shown to yield
better performance than Eigenfaces or Fisherfaces.
However, without additional preprocessing of the input
data and a sufficient number of training images, they
cannot handle pose differences or alignment errors very well.
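To illustrate the core idea behind such gradient-based local features, the following is a minimal NumPy sketch of a HOG-style descriptor: each image cell votes into an orientation histogram weighted by gradient magnitude. It is a simplified illustration only, not the full Dalal and Triggs pipeline (block normalization and overlapping blocks are omitted), and all function names are ours.

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient orientations for one image cell.

    Captures the core idea of HOG; the full Dalal-Triggs descriptor
    additionally normalizes histograms over overlapping blocks.
    """
    gy, gx = np.gradient(cell.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as in standard HOG.
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((orientation / (180.0 / n_bins)).astype(int),
                      n_bins - 1)
    hist = np.zeros(n_bins)
    # Each pixel votes for its orientation bin, weighted by magnitude.
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist

def hog_like_descriptor(image, cell_size=8, n_bins=9):
    """Concatenate per-cell histograms into one feature vector."""
    h, w = image.shape
    feats = [cell_orientation_histogram(
                 image[r:r + cell_size, c:c + cell_size], n_bins)
             for r in range(0, h - cell_size + 1, cell_size)
             for c in range(0, w - cell_size + 1, cell_size)]
    return np.concatenate(feats)
```

Because the descriptor depends on gradient orientations rather than raw intensities, it is less sensitive to illumination changes than pixel-based methods.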
To cope with pose differences and alignment problems, the
bag of words (BOW) method (Csurka et al., 2004), which has
been successfully applied to different computer vision
problems (Shekhar and Jawahar, 2012; Montazer et al.,
2015), was proposed for the face recognition problem
(Li et al., 2010; Wu et al., 2012). In this method, input
images are treated non-holistically through many
sub-images. These sub-images are processed by a clustering
algorithm to create a codebook (the bag of words), and
this codebook is then used to extract feature vectors from
images, which are finally given to the classifier.
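The codebook construction and encoding steps above can be sketched as follows. This is a minimal NumPy-only illustration under our own naming, using plain K-means over local descriptors and a normalized histogram of nearest visual words as the image-level feature; concrete BOW systems differ in the local descriptor and clustering details.

```python
import numpy as np

def build_codebook(descriptors, k=16, n_iter=20, seed=0):
    """Plain K-means over local descriptors; each row of the result
    is one 'visual word' of the codebook."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(n_iter):
        # Assign each descriptor to its nearest center.
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned descriptors.
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_encode(descriptors, codebook):
    """Image feature: normalized histogram of nearest visual words."""
    dists = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
    hist = np.bincount(dists.argmin(axis=1),
                       minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Because the encoding counts visual-word occurrences regardless of where sub-images were cropped, the resulting feature vector is comparatively tolerant to misalignment.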
Similarly to the BOW approach, in (Simonyan et al., 2013),
many sub-images processed by the SIFT descriptor are used
to train Gaussian mixture models to compute improved
Fisher vectors (Perronnin et al., 2010) for face
verification. The reported results are comparable with
those of state-of-the-art face verification papers.
As for classifiers used for face recognition, k-nearest
neighbours (K-NN), support vector machines (SVM) (Vapnik,
1998), and artificial neural networks (ANN) have been
shown to be successful. If classifier speed is important
and features from face images are selected robustly, then
K-NN can be a good choice. Since the K-NN classifier
requires no training, it is practical for fast face
recognition applications in which new people are
continuously added to the dataset. However, if accuracy is
more important than speed, then an SVM (Wei et al., 2011)
or an ANN can be preferable, even though they need
retraining when the dataset is augmented with new people
and images.
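The training-free property of K-NN can be made concrete with a short sketch (NumPy only, names ours): enrolling a new person amounts to appending feature vectors and labels to the gallery, with no model to refit.

```python
import numpy as np

def knn_predict(gallery, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest gallery
    vectors (Euclidean distance). Enrolling a new person only requires
    appending rows to `gallery` and entries to `labels`."""
    dists = np.linalg.norm(gallery - query, axis=1)
    nearest = np.argsort(dists)[:k]
    vals, counts = np.unique(labels[nearest], return_counts=True)
    return vals[counts.argmax()]
```

The price of skipping training is paid at query time: each prediction scans the whole gallery, so an SVM or ANN amortizes better when the gallery is large and static.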
Convolutional neural networks (CNNs), which act as
powerful feature extractors and classifiers, are currently
considered one of the state-of-the-art machine learning
algorithms. A CNN is a special kind of multi-layer
perceptron with many specialized layers for feature
extraction and classification. In a recent CNN-based face
verification study (Parkhi et al., 2015), a novel database
construction and a CNN architecture are presented. The
authors construct a face database of 2.6K subjects
comprising a total of 2.6M images from the Internet,
removing duplicate images with a state-of-the-art face
recognition application as well as a group of human
annotators. After the database construction, they optimize
a relatively simple new CNN that combines the most
effective features of recently proposed state-of-the-art
CNNs for face recognition.
The SVM also has several varieties. Although it was first
proposed as a linear classifier, non-linear models have
been developed to classify datasets that are not separable
by the standard linear SVM. Another popular SVM algorithm
is the L2-norm regularized SVM (L2-SVM) (Koshiba and Abe,
2003; Deng et al., 2012). It tackles the problem that
occurs when the feature vectors are very long (e.g., more
than 2,000 elements), which cannot be handled very
efficiently by the standard SVM.
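An L2-regularized linear SVM can be trained with a stochastic subgradient method whose per-update cost is linear in the feature length, which is why long feature vectors remain cheap. The sketch below is a minimal Pegasos-style illustration in NumPy under our own naming, not the specific L2-SVM formulations of the cited works.

```python
import numpy as np

def train_l2_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Stochastic subgradient descent (Pegasos-style) for the
    L2-regularized hinge-loss objective
        lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (x_i . w))).
    Labels y must be in {-1, +1}. Each update touches one sample,
    so the cost per step is linear in the feature length."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y[i] * (X[i] @ w)
            w *= (1.0 - eta * lam)         # shrink from the L2 penalty
            if margin < 1:                 # hinge subgradient is active
                w += eta * y[i] * X[i]
    return w

def svm_predict(w, X):
    """Predict labels in {-1, +1} from the sign of the score."""
    return np.sign(X @ w)
```

With feature vectors of a few thousand elements, each update is a handful of vector operations, which keeps training tractable where a kernelized solver would struggle.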
2 FACE RECOGNITION BY THE
HOG-BOW METHOD
Contributions. In this paper, as our main contribution, a
bag of words (BOW) algorithm is proposed that uses feature
vectors extracted with the histogram of oriented gradients
(HOG) to recognize faces under small sample per person
(SSPP) conditions. Although HOG and BOW are well-known
algorithms, to the best of our knowledge their combination
has not been evaluated for face recognition, especially in
the SSPP case.
In our method, a K-means clustering algorithm
is used to compute the visual codebook from fea-
ture vectors extracted by HOG from many randomly
cropped sub-images. Then this codebook is used to