the SIFT feature descriptor at the feature extraction stage. In the recognition stage, we adopt an improved sparse representation classification method (Liao et al., 2013) to obtain the matching result, and then apply the Bayesian fusion method (Chen et al., 2014) to fuse the matching results from the SDM and the STM.
The paper is organized as follows: Section 2 describes the pose estimation using deep learning and the generation of the partial MARS maps. Section 3 presents the feature extraction based on 3DLBP feature maps and the recognition process. The experimental results are provided in Section 4. Finally, concluding remarks are given in Section 5.
2 DEEP LEARNING AND MARS MAP
2.1 Facial Pose Estimation using Deep Learning
In this paper, we use a deep learning method (Xu et al., 2015) with a convolutional neural network (CNN) for facial pose estimation, classifying the images from the different databases into 5 poses. The network structure is designed as follows: there are three convolutional layers, two max-pooling layers, a fully connected layer, and a soft-max output layer whose five classes stand for the five poses. The first two convolutional layers are weight-sharing and each have 64 convolution kernels of size 5×5. The two max-pooling layers that follow them operate on the 64 feature maps with 2×2 pooling windows. The third convolutional layer is fully connected to the second max-pooling layer, and the last hidden layer is fully connected to the third convolutional layer. In the soft-max output layer with 5 classes, the class with the maximum probability is taken as the estimated pose.
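A minimal PyTorch sketch of this architecture is given below. The text does not specify the kernel count of the third convolutional layer, the width of the last hidden layer, or the number of input channels, so the values used here (64 kernels, 256 hidden units, single-channel grayscale input) are assumptions for illustration only.

import torch
import torch.nn as nn

class PoseCNN(nn.Module):
    """Sketch of the pose-estimation CNN; unspecified sizes are assumed."""
    def __init__(self, num_poses=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5),   # conv1: 64 kernels, 5x5; 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                   # pool1: 2x2 window; 28x28 -> 14x14
            nn.Conv2d(64, 64, kernel_size=5),  # conv2: 64 kernels, 5x5; 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # pool2: 2x2 window; 10x10 -> 5x5
            nn.Conv2d(64, 64, kernel_size=5),  # conv3: consumes the whole 5x5 map,
            nn.ReLU(),                         # so it is fully connected to pool2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 256),        # last hidden layer (width assumed)
            nn.ReLU(),
            nn.Linear(256, num_poses)  # logits for the soft-max output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

The estimated pose is the class with the maximum soft-max probability, e.g. pose = model(batch).softmax(dim=1).argmax(dim=1).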
In the image preprocessing stage, we take two steps: first, we extract the face region from the original images; then we crop each processed face image into five patches and resize each patch to 32×32 as the input of the network, as sketched below. The patches correspond to the four corners and the central part of the image, each covering eighty percent of the whole image. In the training stage, we use all five patches as training data, and in the testing stage, we only use the central patch for estimation.
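The following PIL-based helper illustrates this five-patch cropping, assuming that "eighty percent" refers to the side length of each patch relative to the cropped face image; it is a sketch, not the authors' implementation.

from PIL import Image

def five_patches(face_img, out_size=32, ratio=0.8):
    """Crop four corner patches and one central patch, each covering
    eighty percent of the face image, and resize them to 32x32."""
    w, h = face_img.size
    pw, ph = int(w * ratio), int(h * ratio)
    boxes = [
        (0, 0, pw, ph),                  # top-left corner
        (w - pw, 0, w, ph),              # top-right corner
        (0, h - ph, pw, h),              # bottom-left corner
        (w - pw, h - ph, w, h),          # bottom-right corner
        ((w - pw) // 2, (h - ph) // 2,
         (w + pw) // 2, (h + ph) // 2),  # central patch (the only one used at test time)
    ]
    return [face_img.crop(b).resize((out_size, out_size)) for b in boxes]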
Our algorithm is tested on the CMU PIE database and the CAS-PEAL face database. The CMU PIE database contains 68 subjects with 13 poses; we choose 19584 images with the 5 poses mentioned above. The CAS-PEAL face database contains 1040 subjects with 21 poses, and we choose 4160 images from it, including the images with the 5 poses in the Pose subset and all the images in the Normal subset. Of the total data, the first 50 subjects from the CMU PIE database and the data from the Normal subset of the CAS-PEAL database are selected to train the network, and the rest of the data is used for testing. The network is trained for 300 epochs with a learning rate of 10^-3. In our experiment, we achieve an accuracy of 98.7% in the training stage and 98.4% in the testing stage.
2.2 Generation of Partial MARS Map
We select the effective area of the entire MARS map according to each of the 5 views and generate 5 partial MARS maps. First, 3D face images are collected with the 3D scanning optical system, and for each subject three point clouds are captured from the left, right, and front views. After merging the three point clouds, we get a more complete 3D point cloud called the fusion