Deep Learning for Facial Keypoints Detection

Mikko Haavisto, Arto Kaarna and Lasse Lensu

Machine Vision and Pattern Recognition Laboratory, Department of Mathematics and Physics,

Lappeenranta University of Technology, Lappeenranta, Finland

Keywords:

Deep Learning, Deep Belief Net, Restricted Boltzmann Machine, Denoising Score Matching.

Abstract:

A new area of machine learning research called deep learning has moved machine learning closer to one of its

original goals: artiﬁcial intelligence and feature learning. Originally the key idea of training deep networks

was to pretrain models in completely unsupervised way and then ﬁne-tune the parameters for the task at hand

using supervised learning. In this study, deep learning is applied to a facial keypoints detection. The task is to

predict the positions of 15 keypoints on grayscale face images. Each predicted keypoint is speciﬁed by a real-

valued pair in the space of pixel coordinates. In the experiments, we pretrained a Deep Belief Network (DBN)

and ﬁnally performed discriminative ﬁne-tuning. We varied the depth and size of the network. We tested both

deterministic and sampled hidden activations, and the effect of additional unlabeled data on pretraining. The

experimental results show that our model provides better results than the publicly available benchmarks for

the dataset.

1 INTRODUCTION

A feature learning algorithm could be described as a

way of automatically learn features from data at mul-

tiple levels of abstraction. The learned features might

be different depending on the task, even for the same

data. For example, a classiﬁcation task requires dis-

tinctive features between classes whereas more gener-

ative features are needed in an image inpainting task.

In the case of images, the ﬁrst abstraction level could

contain simple ﬁlters like straight lines with different

orientations. The level above the ﬁrst one forms more

complex shapes, such as polygons, based on the ab-

stractions from the layer below.

There are at least three reasons why a feature

learning is of interest to researchers. First, higher-

level abstractions often become hard for humans to

specify in terms of raw data. Second, learned features

are adapted for the task at hand. To obtain features

for a speciﬁc task normally requires a huge amount

of domain knowledge, for example features used in

image processing are quite different from features in

speech recognition. Third, a feature learning scales

and adapts to the data compared to the traditional fea-

ture extractor methods. The last one is a fascinating

property: More data will provide better features. For

smaller problems, the lack of sufﬁcient data might

also be a crucial limitation.

The depth of the architecture refers to the num-

ber of levels of composition of non-linear operations

in the learned function. Most current supervised

learning algorithms correspond to shallow architec-

tures (1, 2 or 3 levels), such as neural networks with

one hidden layer, support vector machines and ran-

dom forests. In addition to the limitations caused by

the shallow architectures (Bengio, 2009), these algo-

rithms are trained using only labeled data and they are

unable to beneﬁt from unlabeled data. Even though

researchers have put huge effort on labeling data sets,

this underlying problem still remains in every ma-

chine learning application using supervised learning.

On the contrary, for example, object recognition

in the visual cortex, uses many layers of nonlinear

processing and requires very little labeled input (Lee

et al., 1998). Object recognition, which is a common

example of artiﬁcial intelligence (AI) related task, is

still a difﬁcult task in computer vision research. This

is because the image of an object vary in different

view points, scale and sizes or when an object is trans-

lated or rotated into a different pose. Objects can

also be partially occluded from the view, which is ex-

tremely common in video sequences and real world

images.

The goal of this study is to ﬁnd out how well a

deep learning model called Deep Belief Net (DBN)

can learn features from small grayscale images to de-

tect facial keypoints. The facial keypoints dataset

available through the Kaggle competition, was used

289

Haavisto M., Kaarna A. and Lensu L..

Deep Learning for Facial Keypoints Detection.

DOI: 10.5220/0005272202890296

In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP-2015), pages 289-296

ISBN: 978-989-758-090-1

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

in the experiments.

Section 2 gives a brief overview of deep learning

and its success on different application areas. Sec-

tion 3 gives a detailed description of methods used in

this study. The experiments that were done are ex-

plained in Section 4. The results are discussed in Sec-

tion 5 and a short conclusion of this study is given in

Section 6.

2 DEEP LEARNING

Increased computing power and massive data sets

have made the implementation of larger and more

complex machine learning systems feasible. These

factors have had a huge impact on the development

of artiﬁcial neural networks research, which is now

commonly referred as deep learning.

In 2006, Hinton et al. (Hinton et al., 2006) intro-

duced unsupervised learning algorithm for deep gen-

erative models called Deep Belief Net. The building

block for DBN is an energy-based graphical model

called restricted Boltzmann machine (RBM) that can

learn a probability distribution over its set of inputs.

Recently in 2012, Hinton et al. (Hinton et al.,

2012) developed an interesting biology-inspired

method to prevent overﬁtting by randomly omitting

half of the feature detectors on each training case.

This prevents complex co-adaptations in which a fea-

ture detector is only helpful in the context of several

other speciﬁc feature detectors. This method, referred

as “dropout”, can be seen as an efﬁcient way of per-

forming model averaging.

Experiments in speech recognition show that pre-

training the model with DBNs can signiﬁcantly out-

perform the conventional context-dependent Gaus-

sian mixture model (GMM) (Dahl et al., 2012)(Jaitly

et al., 2012). In the ﬁeld of object recognition,

large, deep Convolutional Neural Network (CNN)

was considerably better than the previous state-of-the-

art methods in classiﬁcation and localization tasks in

ImageNet Large Scale Visual Recognition Challenge

2012 (Krizhevsky et al., 2012). The network con-

tained 60 million parameters and 650,000 neurons,

and was trained on raw RGB pixel intensities.

High-dimensional data can be converted to low-

dimensional codes by training a multilayer neu-

ral network with a small central layer to recon-

struct high-dimensional input vectors. This deep

autoencoder network signiﬁcantly outperforms com-

mon methods, such as Principal Component Analy-

sis (PCA) and Latent Semantics Analysis (LSA), and

autoencoder networks can be used for data visual-

ization purposes (Hinton and Salakhutdinov, 2006).

Also, a technique called “semantic hashing”, pro-

vides very fast method for document and image re-

trieval (Salakhutdinov and Hinton, 2009).

An important notion is that deep learning methods

do not necessarily require domain speciﬁc knowledge

of the particular task at hand. Deep learning moves

machine learning closer to one of its original goals:

artiﬁcial intelligence and automatic feature learning.

Deep learning models have been applied successfully

to a wide variety of machine learning competitions in

Kaggle (Kaggle, 2010), for example, molecular ac-

tivity prediction (Dahl, 2012) and job salary predic-

tion (Mnih, 2013).

Hierarchical face parsing (Luo et al., 2012) is an

interesting method on how to combine prior knowl-

edge on facial parts and components with deep learn-

ing strategy for several face-related tasks. Another in-

teresting work in the domain is the convolutional net-

work cascade model (Sun et al., 2013), which fuses

CNNs with multi-level regression. While the results

of both studies are very good, one drawback is that

the architectures used are carefully designed requir-

ing a lot of tedious manual engineering.

3 MODELING OF DEEP BELIEF

NETS

3.1 Binary RBM

A Restricted Boltzmann machine is a particular type

of Markov random ﬁeld that has a two-layer architec-

ture (Smolensky, 1986). In the machine the visible,

binary stochastic units v ∈ {0,1}

are connected to

hidden binary stochastic units h ∈ {0, 1}

. The en-

ergy of the model with V visible units and H hidden

units is deﬁned as follows:

E(v,h) = −

∑

i=1

∑

j=1

i j

−

∑

i=1

−

∑

j=1

, (1)

where W

V ×H

represents the symmetric weights

and b

and a

are bias terms. The conditional dis-

tribution over the hidden units is as follows:

P(h

= 1|v) = g

∑

i=1

i j

+ a

, (2)

where g(x) = 1/(1+exp(−x)) is the logistic func-

tion. Since v and h play a symmetric role in the energy

function,

P(v

= 1|h) = g

∑

j=1

i j

+ b

. (3)

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

290

3.1.1 Contrastive Divergence

Because the maximum likelihood learning for energy

based models is often infeasible (Lecun et al., 2005),

an approximation is needed. Contrastive Divergence

(CD) (Hinton, 2002) is an efﬁcient learning algorithm

for such models and in the case of binary RBMs, this

equals to the following learning rule:

∆W = α



data





− E





(4)

∆a = α



data

[h] − E

[h]



(5)

∆b = α



data

[v] − E

[v]



, (6)

where α is the learning rate, P

data

is the training set

empirical distribution and P

represents a distribution

deﬁned by running a Gibbs sampling, initialized at the

data, for T full steps. Setting T = ∞ recovers max-

imum likelihood learning. However, the CD learn-

ing with T = 1 has been shown to work well (Hinton,

2002).

3.2 Gaussian RBM

With the usual parameterization of the energy func-

tion for Gaussian RBM, it is difﬁcult to learn vari-

ances from natural images using Contrastive Di-

vergence. Therefore, variances are often ﬁxed to

unity (Cho et al., 2011) (Wang et al., 2012). To over-

come this limitation, we introduce a modiﬁed energy

function for Gaussian RBM that can be trained with

Denoising Score Matching (DSM) algorithm.

3.2.1 Denoising Score Matching

Score matching (SM) is another approach to esti-

mate parameters of probability models whose parti-

tion functions are intractable. SM minimizes the ex-

pected squared distance between the gradient of the

log-density given by the model, referred as score ψ,

and the gradient of the log-density of the observed

data (Hyv

arinen, 2005).

Pascal Vincent showed that there is an equivalence

between Denoising Autoencoder (DAE), a successful

alternative to RBM that reconstructs the input from

a corrupted version of it (Vincent et al., 2008), and

Gaussian RBM trained using denoising score match-

ing (Vincent, 2011):

DSM

= E

( ˜v,v)



ψ( ˜v) −

∂logq

( ˜v|v)

∂ ˜v



. (7)

( ˜v,v)

[·] is the expectation with respect to a noise

model and using additive Gaussian noise model, the

latter term in the squared distance corresponds to

∂logq

( ˜v|v)

∂ ˜v

(v − ˜v). (8)

By introducing the following energy function of

Gaussian RBM, we achieve the equivalence:

E(v,h) = −

b + v

W h + h

a −

v) (9)

P(h|v) = g



W v + a



(10)

P(v|h) = N



W h + b,σ



(11)

ψ(v) = −

(v − b − g



W v + a



), (12)

where {W , b,a,σ} are the model parameters similar

to binary RBM with an added variance term σ

for

visible units. Substituting Eq. 8 and Eq. 12 in Eq. 7,

leads to the ﬁnal objective function, which can then

be minimized by using gradient descent.

3.3 Deep Belief Nets

To extract higher-level features, a Deep Belief Net-

work combines a set of RBMs that are learned se-

quentially (Hinton et al., 2006). The greedy layer-

wise algorithm trains one RBM at a time, using the

previous layer’s activations as inputs.

Finally, a normal feed-forward neural network can

be initialized using the weights learned by the pre-

trained RBMs. For output, a softmax regression layer

is usually chosen for a classiﬁcation task. For con-

tinuous values, a linear Gaussian layer can be used to

model conditionally Gaussian data. The full neural

network can be then ﬁne-tuned with Stochastic Gra-

dient Descend (SGD) algorithm. The gradients for

the parameter updates are obtained using the standard

backpropagation algorithm (Rumelhart et al., 1988).

3.4 Momentum

Nesterov’s Accelerated Gradient (NAG) is a

momentum-based acceleration method that is capable

of accelerating directions of low-curvature, which

is particularly useful in training deep networks. It

explores new regions of the parameter space that

are local minima of better quality (Sutskever et al.,

2013). The NAG update rule for parameters θ is

given by:

t+1

= µv

− α∇ f (θ

+ µv

) (13)

t+1

= θ

+ v

t+1

, (14)

This is not the same energy function as presented in

Eq. 4.3. in (Vincent, 2011) because in that energy function

there is no hidden variables and therefore is not an RBM.

Parameterization in Eq. 9 is proposed by Ian Goodfellow

in Pylearn2 (Goodfellow et al., 2013) implementation of

Gaussian RBM.

DeepLearningforFacialKeypointsDetection

291

where α > 0 is the learning rate, µ ∈ [0, 1] is the mo-

mentum coefﬁcient, and ∇ f (x) is the gradient at x.

Intuitively, this tells us ﬁrst to take a step in the di-

rection of the velocity v

, which is the direction of

the previous parameter update, and then calculate the

gradient update. By setting µ = 0, this becomes the

normal gradient descent update rule.

3.5 Regularization

While layer-wise generative pretraining itself is a

powerful regularizer, we introduce three regulariza-

tion methods that improve generalization of large

models even further.

3.5.1 L1 Weight-decay

Sparse features effectively lower the dimensions and

mitigates the curse of dimensionality. An easy way to

increase sparsity in a model is to use the L1 weight-

decay (Bach et al., 2012). By penalizing weights with

the L1-norm, the weight-decay induces sparsity to the

model in the sense that it forces many of the weights

to become exactly zero while allowing a few of the

weights to grow large.

3.5.2 Dropout

Dropout is a technique that prevents overﬁtting and

provides an efﬁcient way of doing model averaging.

The ﬁrst phase corresponds to dropping out units dur-

ing training: Each unit in a layer, with all its incom-

ing and outgoing connections, is dropped out with a

ﬁxed probability p independent of other units. With

p = 0.5, a neural net with n units can be seen as

a collection of 2

possible thinned networks that all

share weights. At test time, a very simple approxi-

mate model averaging method is used: Using network

without dropout and multiplying weights by p. This

ensures that the expected output of any hidden unit is

the same as the actual output at test time (Srivastava

et al., 2014).

3.5.3 Data Augmentation

A very fundamental approach to improve generaliza-

tion is getting more training data. If a training set is

scarce and has some transformation-invariance prop-

erties, generating additional data may be useful. In

this particular case, we assume that the facial images

are extracted using standard face extractor. Thus, we

assume frontal faces with some rotational variance

and for example images in much smaller scale are er-

rors of the extractor and they are considered as out-

liers. This analysis is also supported by the distortion

intervals that produced the best generalization error

on the validation set.

We randomly added small variations to the train-

ing set, corresponding to rotation and scale. The

amount of variation for each distortion were uni-

formly sampled from the interval, which were deter-

mined using the validation set. In supervised ﬁne-

tuning, the keypoints were also transformed accord-

ingly. Small distortions provided additional training

data that is consistent with the original data distribu-

tion. For images, even small random rotation could be

enough to prevent deep network to memoize its input.

4 EXPERIMENTS

In this section, we experimentally evaluated different

DBN architectures to ﬁnd out a suitable model for fa-

cial keypoints detection. We varied the depth and the

size of an architecture. In unsupervised pretraining,

we tested sampled and expected activations and the

effect of additional unlabeled data.

4.1 Facial Keypoints Dataset

The facial keypoints dataset available through Kag-

gle competition (Kaggle, 2013) was used in exper-

iments. The task was to predict keypoint positions

on 96×96 pixel grayscale facial images. The prob-

lem is challenging because facial features vary from

one individual to another. Environmental conditions

such as illumination, viewing angle and pose express

a large amount of variation as well. Features learned

to predict the keypoints can also be suitable in several

other applications, such as tracking faces, analyzing

facial expressions, detecting dysmorphic facial signs

for medical diagnosis and face recognition.

The dataset was divided into a training set and a

public test set. Each predicted keypoint is speciﬁed

by an (x,y) real-valued pair in the space of pixel co-

ordinates. There are 15 keypoints

, which represent

locations of eyes, eyebrows, nose and mouth. 1783

images in the public test set do not have keypoint val-

ues annotated and the set is used to test performance

against other competitors through the Kaggle plat-

form. There are also two benchmarks in the public

leaderboard: Averages benchmark and patch search

Keypoint ids: left eye center, right eye center, left

eye inner corner, left eye outer corner, right eye inner

corner, right eye outer corner, left eyebrow inner end,

left eyebrow outer end, right eyebrow inner end, right

eyebrow outer end, nose tip, mouth left corner, mouth

right corner, mouth center top lip and mouth center

bottom lip.

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

292

benchmark. Even though the training set size is 7049

images, only 2140 samples have all the 15 keypoints

and the rest have only 4 keypoints labeled. To make

use of all the training data, algorithms need to some-

how handle this missing label information.

4.2 Setup for Experiments

We used Pylearn2 (Goodfellow et al., 2013) machine

learning research library to implement the models.

It is built on top of Theano (Bergstra et al., 2010)

which can compile code for both CPU and GPU

backends. Many deep learning algorithms, like Con-

trastive Divergence for RBM, can be parallelized for

GPU which signiﬁcantly speeds up the learning.

In data preprocessing, we ﬁrst downsampled the

images by a factor of 3 which decreased the input di-

mensions to 32×32 = 1024. This common practise in

image processing is usually applied to achieve reason-

able memory usage and faster training times. Then

we applied global contrast normalization so that the

images became to have zero mean and unit variance.

The set of 2140 fully labeled images was split into

a training, validation and a test set. The test and the

validation sets consisted of 300 fully labeled exam-

ples. Images with imperfect label information gave

a convenient way to test the effect of additional un-

labeled data. The additional data were used in unsu-

pervised pretraining. Training epochs were ﬁxed and

the best model was selected based on results on the

validation set.

4.3 Architectures

We varied the size of the network, more speciﬁcally

the number of hidden units in each layer. These three

hidden layer conﬁgurations were tested: 1000–1000–

4000 (small), 1500–1500–6000 (medium) and 2000–

2000–8000 (large). The linear Gaussian output layer

has 30 units which corresponds to (x,y) pixel coordi-

nates of 15 keypoints. The medium size architecture

used in experiments is outlined in Figure 1.

The surrounding feed-forward neural network

with a linear Gaussian output layer was the same for

each pretrained DBN. To test the effect of increasing

the depth, we trained a new RBM above the previ-

ous DBN. We ﬁne-tuned each DBN separately, which

led to the total of three DBNs per architecture. We

transformed the data from the lower layer by calcu-

lating the expectations of the hidden units. These de-

terministic activations can be considered as features

that the layer can extract. Because the hidden units

are binary stochastic units, it is natural to interpret the

activations to include a sampling from the Bernoulli

distribution. We also ran tests by using sampled ac-

tivations from deterministic expectations each time a

training example is shown to the model.

Figure 1: The architecture of medium size Deep Belief Net

used in experiments.

4.4 Pretraining

We trained the ﬁrst layer Gaussian RBM with 1024

visible units (32 × 32 pixels) using Denoising Score

Matching with 0.4 Gaussian corruptor and 0.01 learn-

ing rate. The second and third layers are normal

binary RBMs which we trained using CD-1 algo-

rithm with 0.001 learning rate. The coefﬁcient for L1

weight-decay was set to 0.0001. Momentum constant

for NAG was initialized at 0.5 and after 300 epochs

it reached its maximum value of 0.7. The data distor-

tion intervals of scale [1.1, 0.9] and rotation [−8

◦

]

in data augmentation were used. The values for hy-

perparameters were determined using the validation

set. We trained the models for 300 epochs. We used

the objective function on validation set to select the

best model with Denoising Score Matching and re-

construction error with Contrastive Divergence.

4.5 Fine-Tuning

The same values for the momentum constant, data

distortion and L1 weight-decay coefﬁcient were used

as in pretraining. We used the dropout probability of

0.8 for the input layer and 0.5 for the other layers.

0.001 was used as the learning rate. We trained the

DeepLearningforFacialKeypointsDetection

293

models for 3000 epochs. In supervised ﬁne-tuning,

the best model was selected on Mean Squared Error

of predictions on validation set.

4.6 Results

We followed the convention from the Kaggle platform

and used Root Mean Squared Error (RMSE) between

the ground truth and the predicted (x,y) points as a

performance metric.

The experiments showed that there were a little

difference on performance whether deterministic or

sampled activations were used on pretraining. Also,

the results did not provide decisive evidence that ad-

ditional unlabeled data on pretraining improves the

performance. However, the largest model was able

to beneﬁt from the additional unlabeled data. There is

also a trend that increasing the size and depth of the

network improved the performance.

Our best model was the large size DBN-3, trained

using extended training set and deterministic activa-

tions between layers. The best DBN-3 in each size

category overcame the reference benchmarks in Kag-

gle competition (Table 1). Test results are summa-

rized in Table 2.

Table 1: Comparison to Kaggle benchmarks. “(e)” denotes

the usage of extended training set and “(s)” denotes sampled

hidden activations between layers on pretraining.

Model RMSE

Averages Benchmark 3.96244

Patch Search Benchmark 3.80685

DBN-3 (Small) (s) 3.25079

DBN-3 (Medium) (s) 3.23646

DBN-3 (Large) (e) 3.21211

Table 2: Prediction metrics (in RMSE) for various DBNs.

“(e)” denotes the usage of extended training set and “(s)”

denotes sampled hidden activations between layers on pre-

training.

Model Small Medium Large

DBN-1 1.71733 1.62732 1.60542

DBN-2 1.58002 1.54632 1.51221

DBN-3 1.53438 1.50519 1.50385

DBN-2 (s) 1.56710 1.56018 1.49564

DBN-3 (s) 1.53192 1.49432 1.49945

DBN-1 (e) 1.67990 1.63921 1.63162

DBN-2 (e) 1.59267 1.52828 1.51318

DBN-3 (e) 1.53839 1.50037 1.47267

4.7 Visualizations

Figure 2 shows some of the ﬁlters learned by the ﬁrst-

level Gaussian RBM. Example keypoint predictions

by the best model are shown in Figure 3. Keypoints

from faces, which match the dominant scale in the

dataset, were predicted well. The mouth keypoints

were causing some error on these images when the

face was not aligned. When an image was in no-

tably smaller scale, predictions were clearly wrong

and somewhat corresponded to an average guess. The

model tolerated quite well occlusions like glasses,

hair and beard.

Figure 2: Some of the weights learned by Gaussian RBM

in the ﬁrst layer.

5 DISCUSSION

The ﬁlters in Figure 2 show that modeling the real-

valued input data using Gaussian RBM and Denois-

ing Score Matching gave meaningful results. The ﬁrst

layer plays an important role in the whole network

because the upper layers are trained based on its hid-

den activations. Filters clearly corresponds to recog-

nizable face images. This is due the fact that images

in the dataset have roughly the same scale and align-

ment. These weights are clearly better than random

values to initialize a feed-forward network. We want

to emphasize that these useful ﬁlters were learned in

completely unsupervised way. Experiments showed

that using the extended training set improved the re-

sults on the largest model and we are intrigued by the

possibility to exploit large databases such as Google

image search. There were no signiﬁcant difference

between the deterministic and sampled activations,

when transforming the data from the lower layer dur-

ing pretraining.

Analysis of keypoint predictions tells us that our

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

294

Figure 3: Keypoint predictions. Green color indicates point

positions from the left side and red color from the right side

of the face relative to subject.

model can predict meaningful keypoint positions on

facial images which have the same scale. Faces with

small rotations got some error typically on one side,

but the prediction is probably still good enough for

tracking, for example. The largest errors were on

images that are at a completely different scale. Pre-

dictions for those cases are useless and can be inter-

preted that the model just made an educated guess.

We found that the model tolerated well occlusions that

were caused by glasses, hair and beard. That is a use-

ful feature to have because real-world use cases most

certainly contain noisy data.

There is a relatively large bias between the test re-

sults and Kaggle benchmarks. Some of that is prob-

ably caused by the fact that for the 4-keypoint im-

ages, the nose tip is systematically labeled below the

nose. These partially labeled images were only used

in the pretraining when testing the effect of additional

data and 4-keypoint images were not used in the more

powerful supervised training stage. The public test set

contained more examples with a smaller scale. Simi-

lar scale variance were present in particular in the 4-

keypoints dataset. This imbalance did not cause error

on our local tests because we only used the fully la-

beled examples in the supervised learning and tests.

The fully connected neural network architectures

used in our experiments does not scale linearly with

respect to the number of parameters. For example,

the small architecture had about 6 million parameters

while the large architecture had over 22 million pa-

rameters. If we increased the size further, for exam-

ple to 3000–3000–12000 architecture, it would cor-

respond to 48 million parameters. This suggests that

a different network architecture would have to be ex-

ploited to train even larger models.

6 CONCLUSIONS

The goal of this study was to ﬁnd out how well Deep

Belief Nets can learn features from small grayscale

images to detect facial keypoints. We introduced

the main building block of deep learning, an energy-

based generative model called Restricted Boltzmann

Machine and an unsupervised learning algorithm

called Contrastive Divergence. A novel approach,

Denoising Score Matching, was introduced to train

Gaussian RBM to model real-valued data. We de-

scribed how Deep Belief Nets can be constructed by

using the greedy layer-wise training.

The facial keypoints dataset available through the

Kaggle competition, was used in the experiments.

The analysis of predicted keypoints showed that the

proposed deep model was able to predict meaning-

ful keypoints on facial images which have a similar

scale. It also overcame the reference benchmarks in

the Kaggle competition. The model tolerated occlu-

sions that were caused by glasses, hair and beard,

well. On pretraining, the largest model was able to

beneﬁt from the extended unlabeled training set.

The results of this work imply that a Gaussian

RBM trained with DSM can successfully model real-

valued pixel intensities, and increasing the depth and

the size of the network improves the performance.

To further improve the model, a small amount of

translation invariance could be added to the archi-

tecture by using a combination of convolutional and

max-pooling layers. The weight sharing principle of

the convolutional layer will signiﬁcantly decrease the

number of parameters, which makes it feasible to train

larger networks using images with realistic sizes. The

scale invariance is harder to achieve and it would re-

quire larger changes in the presented network archi-

tecture.

REFERENCES

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2012).

Optimization with sparsity-inducing penalties. Foun-

dations and Trends in Machine Learning, 4(1):1–106.

Bengio, Y. (2009). Learning deep architectures for AI.

DeepLearningforFacialKeypointsDetection

295

Foundations and Trends in Machine Learning, 2(1):1–

127.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu,

R., Desjardins, G., Turian, J., Warde-Farley, D., and

Bengio, Y. (2010). Theano: a CPU and GPU math

expression compiler. In Proceedings of the Python for

Scientiﬁc Computing Conference (SciPy).

Cho, K., Ilin, A., and Raiko, T. (2011). Improved learning

of gaussian-bernoulli restricted boltzmann machines.

In Proceedings of the 21th international conference

on Artiﬁcial neural networks - Volume Part I, pages

10–17, Berlin, Heidelberg. Springer-Verlag.

Dahl, G. (2012). Deep learning how I did it:

Merck 1st place interview. http://blog.kaggle.

com/2012/11/01/deep-learning-how-i-did-it-merck -

1st-place-interview/. Accessed September 29, 2014.

Dahl, G., Yu, D., Deng, L., and Acero, A. (2012).

Context-dependent pre-trained deep neural networks

for large-vocabulary speech recognition. Audio,

Speech, and Language Processing, IEEE Transactions

on, 20(1):30–42.

Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin,

V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F.,

and Bengio, Y. (2013). Pylearn2: a machine learning

research library. arXiv preprint arXiv:1308.4214.

Hinton, G. and Salakhutdinov, R. (2006). Reducing the di-

mensionality of data with neural networks. Science,

313(5786):504–507.

Hinton, G. E. (2002). Training products of experts by min-

imizing contrastive divergence. Neural Computation,

14(8):1771–1800.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast

learning algorithm for deep belief nets. Neural Com-

putation, 18(7):1527–1554.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I.,

and Salakhutdinov, R. R. (2012). Improving neural

networks by preventing co-adaptation of feature de-

tectors. arXiv preprint arXiv:1207.0580.

Hyv

arinen, A. (2005). Estimation of non-normalized statis-

tical models by score matching. In Journal of Machine

Learning Research, pages 695–709.

Jaitly, N., Nguyen, P., Senior, A. W., and Vanhoucke, V.

(2012). Application of pretrained deep neural net-

works to large vocabulary speech recognition. In IN-

TERSPEECH.

Kaggle (2010). Machine learning competitions.

http://www.kaggle.com. Accessed September

29, 2014.

Kaggle (2013). Facial keypoints detection.

http://www.kaggle.com/c/facial-keypoints-detection.

Accessed September 29, 2014.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. In Advances in Neural Information Pro-

cessing Systems 25, pages 1097–1105. Curran Asso-

ciates, Inc.

Lecun, Y., Jie, F., and Jhuangfu (2005). Loss functions

for discriminative training of energy-based models. In

Proc. of the 10-th International Workshop on Artiﬁcial

Intelligence and Statistics.

Lee, T. S., Mumford, D., Romero, R., and Lamme, V. A.

(1998). The role of the primary visual cortex in higher

level vision. Vision Research, 38(15/16):2429–2454.

Luo, P., Wang, X., and Tang, X. (2012). Hierarchical face

parsing via deep learning. In Conference on Com-

puter Vision and Pattern Recognition, pages 2480–

2487. IEEE.

Mnih, V. (2013). Q&A with job salary pre-

diction ﬁrst prize winner Vlad Mnih.

http://blog.kaggle.com/2013/05/06/qa-with-job-

salary-prediction-ﬁrst-prize-winner-vlad-mnih/.

Accessed September 29, 2014.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988).

Learning representations by back-propagating errors.

In Neurocomputing: Foundations of Research, pages

696–699. MIT Press, Cambridge, MA, USA.

Salakhutdinov, R. and Hinton, G. (2009). Semantic hash-

ing. International Journal of Approximate Reasoning,

50(7):969–978.

Smolensky, P. (1986). Information processing in dynami-

cal systems: foundations of harmony theory. In Par-

allel distributed processing: explorations in the mi-

crostructure of cognition, vol. 1, pages 194–281. MIT

Press, Cambridge, MA, USA.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,

and Salakhutdinov, R. (2014). Dropout: A Simple

Way to Prevent Neural Networks from Overﬁtting.

Journal of Machine Learning, 15:1929–1958.

Sun, Y., Wang, X., and Tang, X. (2013). Deep convolutional

network cascade for facial point detection. In Com-

puter Vision and Pattern Recognition (CVPR), 2013

IEEE Conference on, pages 3476–3483. IEEE.

Sutskever, I., Martens, J., Dahl, G. E., and Hinton, G. E.

(2013). On the importance of initialization and mo-

mentum in deep learning. In Proceedings of the

30th International Conference on Machine Learning,

pages 1139–1147.

Vincent, P. (2011). A connection between score match-

ing and denoising autoencoders. Neural Computation,

23(7):1661–1674.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-

A. (2008). Extracting and composing robust features

with denoising autoencoders. In Proceedings of the

25th International Conference on Machine Learning,

pages 1096–1103, New York, NY, USA. ACM.

Wang, N., Melchior, J., and Wiskott, L. (2012). An analy-

sis of gaussian-binary restricted boltzmann machines

for natural images. In Proceedings of the 20th Euro-

pean Symposium on Artiﬁcial Neural Networks, Com-

putational Intelligence and Machine Learning, pages

287–292.

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

296