3D Head Model Fitting Evaluation Protocol on Synthetic Databases for Acquisition System Comparison

Catherine Herold (1,2,3,4), Vincent Despiegel (1,2), Stéphane Gentric (1,2), Séverine Dubuisson (4) and Isabelle Bloch (1,3)

(1) Identity & Security Alliance (The Morpho and Telecom ParisTech Research Center), Paris, France
(2) Morpho, Safran Group, 11 boulevard Galliéni, Issy-les-Moulineaux, France
(3) Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, Paris, France
(4) ISIR, UPMC Sorbonne Universités, Paris, France
Keywords: 3D Face Reconstruction, Metrics, Evaluation Protocol, Comparison of Acquisition System Configurations.
Abstract: Automatic face recognition has been integrated in many systems thanks to the improvement of face comparison algorithms. One of the main applications of facial biometry is identity authentication at border control, which has already been adopted by many airports. To perform fast identity checks, gates have been developed to extract the ID document information on the one hand, and to acquire the facial information of the user on the other hand. The design of such gates, and in particular their camera configuration, has a high impact on the acquired images and therefore on the quality of the extracted facial features. Since it is very difficult to validate such gates by testing different configurations on real data under exactly the same conditions, we propose a validation protocol based on simulated passages. This method relies on synthetic sequences, which can be generated for any camera configuration with fixed identity and pose parameters, and can also integrate different lighting conditions. We detail this methodology and present results in terms of geometrical error obtained with different camera configurations, illustrating the impact of the gate design on the 3D head fitting accuracy, and hence on facial authentication performance.
1 INTRODUCTION
With the recent improvements of face recognition algorithms, facial biometry now offers very high performance in terms of recognition rate when acquisitions are performed in good conditions. Thanks to these advances, many automatic face recognition systems have emerged, implying different levels of cooperation from the user. Among them, automatic border control gates have already been validated and deployed in several airports. Nevertheless, the main systems evaluated until now require the passengers to position themselves in front of a sensor in order to acquire a frontal view, which is constraining from the user's point of view. A new challenge today is to provide a simpler system for users, while ensuring high biometric performance. In such unconstrained scenarios, an important criterion impacting the face recognition quality is the pose of the face in the images (frontal or not), besides other factors such as resolution or illumination conditions.
In this paper, we consider on-the-fly systems which do not require any specific behavior of users with respect to the cameras. To optimize the system performance, it is therefore necessary to carefully position the sensors in order to cope as well as possible with the various face poses occurring in the system. As 3D face fitting is an important step for face recognition against the frontal image of a passport, we compare different acquisition systems, in terms of camera number and positions, by their 3D fitting accuracy. We therefore propose a complete methodology to validate the 3D head model estimated from the corresponding acquisitions using a geometric evaluation. Further studies on biometric evaluation and on the impact of ageing and expression are not part of this paper.
A crucial point when comparing different systems with respect to a given parameter is to fix all the remaining ones. However, when performing real acquisitions, it is impossible to reproduce exactly the same illumination conditions and to ask users to have
identical behaviors and face positions. To completely control the parameters which should remain stable when evaluating the camera configuration, we propose to perform the evaluation on synthetic data, fixing all other acquisition parameters (identity, pose, illumination). Hence, no noise is introduced by variations of parameters that are not under study.
We first present our global face recognition work-
flow, and detail the 3D model we use in Section 2. In
Section 3, we propose a methodology to evaluate dif-
ferent acquisition systems for face recognition gates
without any real acquisition. This includes a synthetic
database generation step and the metrics characteriz-
ing the quality of a configuration on these simulated
sequences. We briefly present two algorithms we use
for the evaluation in Section 4. The corresponding re-
sults with our methodology are detailed in Section 5,
and show the impact of the gate design on the 3D head
fitting quality.
2 CONTEXT
2.1 Face Recognition Workflow
Face recognition systems can be based on different
types of sensors, such as range scanners, infrared or
visible cameras. As visible range cameras are the most commonly used, we limit our study to acquisition systems based on this type of sensor. Thus, the input of the face recognition algorithms is a set of video
sequences, and the final output is a binary decision
corresponding to the face authentication result.
The different steps of the algorithm are as follows.
While the person walks in the gate, a first step of
face and fiducial point detection is performed on each
available view. After the initialization step, tracking
and/or detections are performed in the next frames, to
obtain the features needed in each frame to estimate
the specificities of the face seen in the videos. As the
pose is unconstrained in gate scenarios, this is done
using a 3D model which offers robustness to pose
variations (Blanz et al., 2005). This model is fitted to
the observations, to extract the specificities of the per-
son to authenticate. As illustrated in Figure 1(a), this
fitting can be performed in a recursive way, by making
a first estimation at the beginning, and then updating
the model with the new observations, or globally, by
using all observations together. Once the model has
been fitted to the observations (Figure 1(b)), a frontal
view can be generated (Figure 1(c)) to proceed to the
face comparison. For our study, we focus specifically
on the quality of the model fitting (Figure 1(b)). In
the next part, we briefly present this shape model and
the associated parameters to be estimated.
Figure 1: Global workflow, from detection to authentication. (a) Acquisition and authentication workflow: face and feature point detection and model fitting initialization on frame 1; head detection and/or tracking and model improvement on frames 2 to N; then frontal view synthesis and face comparison with the reference picture. In orange: steps specific to the global approach (global model fitting); in magenta: recursive approach. (b) Model fitting on the observations. (c) Frontal view.
2.2 3D Head Reconstruction
Among the different face models which have been
proposed in the past, we choose a 3D deformable
shape model constructed in a similar way as the 3D
Morphable Model (3DMM) introduced in (Blanz and
Vetter, 1999). As the final aim is to establish a com-
parison score between the frontal view of the esti-
mated face and its corresponding ID picture, it is nec-
essary to adapt the model such that it fits as well as
possible the observed identity. The 3DMM describes
the face space along the two following aspects:

- The shape space, characterized by a mean shape $\bar{S}$ and a set of eigenvectors $s_i$, $i = 1, \ldots, M$, computed by principal component analysis over a database of aligned head scans. These vectors correspond to deformations describing shape variations within the face class. Each instance of this model can then be written as:

  $S = \kappa \left( \bar{S} + \sum_{i=1}^{M} \alpha_i s_i \right)$,   (1)

  where the $\alpha_i$ are the weighting parameters which characterize the similarity with the mean shape, and $\kappa$ is a scaling factor. The mean shape $\bar{S}$ is defined by a set of $n_v$ 3D vertices, and each vector $s_i$ corresponds to deformations associated with this set of points. An equivalent equation can be written for a vertex $v$, as $s_v = \kappa ( \bar{s}_v + \sum_{i=1}^{M} \alpha_i s_i^v )$, where $s_v$ and $\bar{s}_v$ are positions and $s_i^v$ a deformation relative to the vertex $v$. A mesh is then defined from these vertices, by adding facet definitions to describe the entire head surface.

- The texture, which associates a color with each vertex of the mesh.
The shape and the texture of each instance can be
adapted in order to fit to the observations.
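To make Equation (1) concrete, the following Python sketch instantiates a shape from the mean shape and the PCA deformation vectors. The dimensions and random inputs are purely illustrative, not those of the actual model used in this work.

```python
import numpy as np

def shape_instance(mean_shape, eig_vecs, alphas, kappa=1.0):
    """Instantiate a 3DMM shape following Eq. (1):
    S = kappa * (S_bar + sum_i alpha_i * s_i).

    mean_shape: (n_v, 3) array of mean vertex positions (S_bar).
    eig_vecs:   (M, n_v, 3) array of PCA deformation vectors (s_i).
    alphas:     (M,) array of shape weights (alpha_i).
    kappa:      global scale factor.
    """
    deformation = np.tensordot(alphas, eig_vecs, axes=1)  # (n_v, 3)
    return kappa * (mean_shape + deformation)

# Illustrative example: 1000 vertices, M = 60 deformation modes.
rng = np.random.default_rng(0)
S_bar = rng.normal(size=(1000, 3))
s_i = rng.normal(scale=0.1, size=(60, 1000, 3))
alpha = rng.normal(scale=0.5, size=60)
S = shape_instance(S_bar, s_i, alpha, kappa=1.02)  # (1000, 3) vertex positions
```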
In this article, we will only evaluate the quality of the estimation for the geometrical part of the model, given various system configurations. Some instances of the morphable shape model are given in Figure 2, illustrating its variations depending on the different sets of parameter values $\{\alpha_i, i = 1, \ldots, M\}$.
Figure 2: Some instances of the deformable shape model
(all faces are generated at the same pose and scale, and with
identical lighting conditions). The global shape changes for
each instance, and more specifically the nose shape, the ear
orientation or the chin.
The first algorithms to estimate the shape and tex-
ture parameters of the 3DMM used stochastic gradi-
ent descent (Blanz and Vetter, 1999) or Levenberg-Marquardt optimization (Romdhani and Vetter, 2005)
on a single image only. Nevertheless, the informa-
tion is not complete when only single images are used
to perform the fitting, especially in the case of low-
resolution images. Moreover, due to the projection
from the 3D world into the image plane and the oc-
clusions of some parts of a face in an image, some
information is missing and the estimation might be
erroneous. This is why new algorithms based on multiple-image fitting have been proposed, taking multiple views or video sequences into account and thus increasing the estimation accuracy. In (Amberg et al., 2007),
the fitting algorithm proposed in (Romdhani and Vet-
ter, 2005) was adapted to a set of images acquired
simultaneously, which improves the results of algo-
rithms using only a single image. In (Van Rootseler
et al., 2011), two experiments were proposed to ex-
ploit video sequences: the first one consists in esti-
mating independently the parameters at each instant
before linearly combining these estimations. The sec-
ond one uses all the input images together to opti-
mize the parameters, leading to a single estimation
based on the whole sequence. The offline method we
chose in this paper is close to the latter, as it estimates
the set of shape parameters using all images together.
Besides, we also use the recursive method proposed
in (Herold et al., 2012) and based on a particle fil-
ter. Thus, temporal constraints can also be used to im-
prove the pose and shape fitting. These two methods
are summarized in Section 4 and used for our evalua-
tion.
3 DATABASE GENERATION AND
QUALITY MEASURES
3.1 Methodology
The validation of real systems raises several issues.
First, wide acquisition campaigns have to be per-
formed to collect video sequences with different per-
sons. Moreover, to compare the different acquisition
systems, any parameter that could impact the perfor-
mances should be fixed, in order to evaluate prop-
erly each system's characteristics. Unless the different systems simultaneously acquire sequences of users passing through the gate, there is no way to reproduce exactly the same trajectory of a person, thus making a comparison on identical inputs impossible. Finally, each of the systems has to be physically built, which is costly and time-consuming.
We propose a methodology based on evaluations over
different sets of synthetic databases to evaluate the ac-
curacy of pose and shape estimation algorithms with
respect to different gate configurations, thus provid-
ing a way of comparing different system configura-
tions.
In the remainder of this section, we describe the
type of video sequences which have been generated
for the evaluation. The process of synthetic
sequence generation is summarized in Figure 3.
3.2 Identities
Each identity definition is composed of shape and tex-
ture information. The generation of synthetic views is
possible using these two aspects together.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
298
Figure 3: Synthetic sequence generation. Real faces provide 3D face scans (shape) and sets of 2D images (texture), which together define the identities; real sequences provide the acquisition system calibration and the set of poses; POV-Ray then renders the synthetic images used for model fitting.
Shape. Real head scans of 40 men and 35 women have
been acquired to obtain a raw 3D representation of
each face. A 3D fitting procedure has then been ap-
plied to represent these shapes with the same mesh
structure as the model introduced in Section 2.2.
Thus, we obtain the 3D position of each vertex of the
model for the given scan. This step is necessary to
compare the estimated face to the real one. The shape
of each synthetic face $S_{id}^j$ is then created by a combination of four 3D real head scans chosen from those available:

$S_{id}^j = \sum_{i=1}^{4} c_i^j S_{\sigma(i,j)}$,   (2)

with the constraints $0 \leq c_i^j \leq 1$ and $\sum_{i=1}^{4} c_i^j = 1$. $\{\sigma(i,j), i \in \{1, \ldots, 4\}\}$ defines which shapes have been used to generate the resulting one. The parameters $c_i^j$ have been sampled randomly, so the proportions of the corresponding shapes $S_{\sigma(i,j)}$ are all different. Synthetic faces of men and women have been created with the corresponding real scans to respect the morphology differences.
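As an illustration of Equation (2), the sketch below builds one synthetic identity as a convex combination of four scans. The Dirichlet draw is our assumption for "sampled randomly" under the simplex constraints; all names are illustrative.

```python
import numpy as np

def synthetic_identity(scans, rng):
    """Combine four real head scans into one synthetic shape (Eq. (2)).

    scans: list of (n_v, 3) vertex arrays sharing the same mesh topology
           (here, scans of a single gender, to respect morphology).
    Returns the synthetic shape, the chosen scan indices sigma(i, j)
    and the convex weights c_i.
    """
    idx = rng.choice(len(scans), size=4, replace=False)  # sigma(i, j)
    c = rng.dirichlet(np.ones(4))  # 0 <= c_i <= 1 and sum_i c_i = 1
    shape = sum(w * scans[k] for w, k in zip(c, idx))
    return shape, idx, c
```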
Texture. The texture associated with the shape
gives a color definition for each facet of the model. To
obtain the texture for each ID, the following process
is applied. Our model is fitted on images of a real face
seen under various poses to extract the visible part of
the texture in each of them. The extracted textures are
then merged together to obtain the complete texture.
For each synthetic ID, images of a different person
have been used to diversify the generated textures.
Both shape and texture components of faces generated in this way come from real faces and therefore characterize realistic identities. A total of 47 identities (36 men and 11 women) has been created following this process. We consider here that the combination of independent shape and texture does not alter the validity of the resulting faces. Nevertheless, other acquisition systems generating depth maps and corresponding 2D color images simultaneously (the Microsoft Kinect™, for instance) could be used to create synthesized sequences corresponding entirely to real faces.
3.3 Associated Sequences
Once the identities are defined, we have to specify
scenarios to generate synthetic sequences of people
walking through a gate. To this aim, we have to simu-
late an acquisition system and its possible configura-
tions. In our experiments, we used the configurations
illustrated in Figure 4. One of them is equivalent to a real system we have already built in our laboratory; the others are simulated variants which have not been constructed yet. In the first one, four cameras are considered, two on each side of the outdoor frame; the second one has only one camera on each side, in addition to one camera above the door. All cameras point towards the center of the gate, located about two meters in front of the door.

Figure 4: Configuration of the acquisition system in the 4- and 3-camera gates.

The following configurations are considered, using some or all cameras of one of the systems (numbers refer to Figure 4):

- 2A: 2 cameras aligned vertically (0, 2)
- 2B: 2 cameras aligned horizontally (0, 1)
- 2C: 2 crossed cameras (0, 3)
- 3A, 3B, 3C: 3 cameras (0, 1, 3), (1, 2, 3) and (4, 5, 6)
- 4A: 4 cameras (0, 1, 2, 3)
3DHeadModelFittingEvaluationProtocolonSyntheticDatabasesforAcquisitionSystemComparison
299
In addition to the extrinsic parameters of each camera, the impact of the image resolution can also
be evaluated by generating images of different sizes.
Indeed, the face and feature point detection quality
depends on the resolution of the face, and this param-
eter should then be taken into account when evaluat-
ing an acquisition system. For further studies, light-
ing systems can also be added in the scene definition
to evaluate their impact.
Finally, a pose has to be defined for each timestamp of the sequence. We define this set of poses from the real head poses observed in sequences acquired with persons using our real 4-camera system. Thus, we reproduce the typical trajectories of users in real systems. The poses defined in this way characterize the motion of a person walking at a regular pace from the entrance of the gate to the limit of the area visible by the cameras. Ten poses cover this motion, which corresponds to camera acquisitions at 5 to 8 frames per second for a medium walking speed.
Figure 5 gives some examples of images generated for different poses and identities. The software POV-Ray (PovRay, 2012) has been used to generate these sequences. The lighting or the image resolution can easily be modified to generate other sequences, in order to evaluate the various parameters of the acquisition system outlined above. The use of real data to generate the sequences, in terms of faces, cameras and trajectories, ensures that the generated synthetic data are close to real ones.

Figure 5: Examples of synthetic images generated with a 4-camera configuration.
3.4 Quality Measures
Different metrics have been proposed to evaluate the
quality of a shape fitting or reconstruction (Park et al.,
2002), and their significance depends on the purpose
of this estimation. In our case, with the aim of com-
paring face information with an ID picture, we per-
form the evaluation via geometrical measures com-
puted over a subset of vertices corresponding to the
frontal part of the face $V_f$, as shown in Figure 6.

Figure 6: In blue: $V_f$, the set of vertices selected to compute the mean error between a 3D head scan and the estimation of the same face in the video. The red vertices are not taken into account, as they are not used to compare the faces in the subsequent biometric step and are not fitted to the observations.

First, the following 3D point-to-point error can be computed in the gate coordinate system $G$:

$Err_{3D} = \frac{1}{N_{V_f}} \sum_{v=1}^{N_{V_f}} \| s_v^s - s_v^e \|_2$,   (3)

where $\| \cdot \|_2$ is the Euclidean norm, $N_{V_f}$ is the number of vertices belonging to $V_f$, and $s_v^s$ is the true position of the $v$-th vertex of the head scan in $G$, computed as:

$s_v^s = R_{GT} s_{v,0}^s + T_{GT}$,   (4)

where $s_{v,0}^s$ is the same vertex of the scan at frontal pose, and $R_{GT}$ and $T_{GT}$ are the rotation and translation used to generate the images. The position $s_v^e$ is given by the estimated pose $(R_e, T_e)$ and shape $(\kappa_e, \{\alpha_i^e, i = 1, \ldots, M\})$ as follows:

$s_v^e = \kappa_e R_e \left( \bar{s}_v + \sum_{i=1}^{M} \alpha_i^e s_i^v \right) + T_e$.   (5)
It is necessary to take the shape and the pose estimations into account together when assessing the fitting quality. Indeed, as they are estimated jointly, several combinations of pose and shape can fit the observations equally well; this is why we compare the solutions on the vertex positions computed in $G$. This measure, illustrated in Figure 7, is the closest to the error minimized in the shape and pose fitting procedure. It characterizes how close the estimated vertices are to their real positions in the gate coordinate system.

Figure 7: Error $Err_{3D}$ for vertex $s_0$, computed from the real shape and the estimated one ((a) front view, (b) side view).
Other measures can also be computed, such as
the following point-to-surface error, which compares
more specifically the shape estimation to the ground
truth shape:
$Err_{3D}^{CP} = \frac{1}{N_{V_f}} \sum_{v=1}^{N_{V_f}} d(s_v^s, S_e)$,   (6)

where $d$ characterizes the distance between a ground-truth vertex $s_v^s$ and the closest point of the surface described by the estimated mesh $S_e$. This tolerates local misalignment (which can happen due to missing textures in some face areas) as long as the surfaces are close to each other (Figure 8).
Finally, for the comparison of 2D frontal views, the following 2D point-to-point error can also be used:

$Err_{2D} = \frac{1}{N_{V_f}} \sum_{v=1}^{N_{V_f}} \| s_{f,v}^s.xy - s_{f,v}^e.xy \|_2$,   (7)
where $X.xy$ denotes the two-dimensional vector composed of the $x$ and $y$ coordinates, which are the image coordinates under orthographic projection, and $s_{f,v}^s$ is the vertex position in the frontal head scan. The estimated vertices are computed as:

$s_{f,v}^e = R_{GT}^{-1} ( s_v^e - T_{GT} )$.   (8)

Figure 8: Error $Err_{3D}^{CP}$ for vertex $s_0$, computed from the real shape and the estimated one ((a) front view, (b) side view); the error is zero when the ground-truth vertex lies on the estimated mesh surface.
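For illustration, the three measures of Equations (3), (6) and (7)-(8) could be computed as in the sketch below. The point-to-surface distance $d$ of Equation (6) is approximated here by the distance to a dense point sampling of the estimated mesh, and all function names are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def err_3d(s_scan, s_est):
    """Eq. (3): mean point-to-point distance over the frontal vertices V_f,
    with both (N, 3) vertex sets expressed in the gate coordinate system G."""
    return np.linalg.norm(s_scan - s_est, axis=1).mean()

def err_cp_3d(s_scan, est_surface_samples):
    """Eq. (6), approximated: distance from each ground-truth vertex to the
    closest point of a dense sampling of the estimated mesh surface."""
    dists, _ = cKDTree(est_surface_samples).query(s_scan)
    return dists.mean()

def err_2d(s_scan_frontal, s_est, R_gt, T_gt):
    """Eqs. (7)-(8): bring the estimate back to the frontal pose using the
    ground-truth pose, then compare the x/y (orthographic) coordinates."""
    s_est_frontal = (s_est - T_gt) @ R_gt  # row-vector form of R_gt^{-1}(s - T_gt)
    diff = s_scan_frontal[:, :2] - s_est_frontal[:, :2]
    return np.linalg.norm(diff, axis=1).mean()
```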
4 FACE RECONSTRUCTION
ALGORITHMS
The proposed evaluation protocol can be tested with any algorithm which estimates the pose and the shape in video sequences. Similar conclusions can be drawn concerning the different gate configurations which are evaluated, regardless of the method tested. This is illustrated in this paper by providing results obtained with one method performing the estimation globally over the whole sequence and one sequential method. Both of them take as input the images and a set of facial fiducial points which have been automatically detected. This information is called the observations and is denoted generically $y_t$ at time $t$. The output is the 3D shape estimation (the scale parameter $\kappa$ and the shape deformation parameters $\{\alpha_i, i = 1, \ldots, M\}$), which is denoted by $\theta$, and the pose estimation at each instant, $\{T_1, R_1, \ldots, T_T, R_T\}$. The pose and shape estimations must be handled together, as both parameters impact the observations, and as different (shape, pose) combinations can explain the sparse set of observations used for the fitting. We briefly detail below the two standard methods used to estimate the pose and shape from video sequences.
4.1 Levenberg-Marquardt
Optimization
The Levenberg-Marquardt (LM) method (Marquardt, 1963), which combines gradient descent and Gauss-Newton steps, iteratively minimizes an energy E. In our case, we applied it in an offline manner (Figure 1(a)), jointly estimating the poses for all frames and the shape parameters (the same for the whole sequence) given the video.
This algorithm starts from an initial guess $u_0 = (T_1^0, R_1^0, \ldots, T_T^0, R_T^0, \theta^0)$ of all unknown values to be estimated. The 3D pose $(R_t^0, T_t^0)$ of the face at each time $t$ is estimated from a set of 3D points reconstructed, using the calibration parameters, from the corresponding detections in the different images acquired at this instant. Following the method in (Umeyama, 1991), the pose parameters are adapted by fitting the mean model to these points. The initial shape deformation parameters are set to zero, which corresponds to the mean model used for the pose fitting. Given the function $f$ which associates the state $u$ with the corresponding observations, an error can be computed between the real observations and the ones generated from $u$. Considering only the feature point criterion, the aim of the algorithm is to minimize the associated energy:

$E = \sum_{t=1}^{T} \frac{1}{D(t)} \sum_{p=1}^{D(t)} \| m(p,t,u) - o(p,t) \|_2^2$,

where $D(t)$ is the number of detected feature points at time $t$, $o(p,t)$ their 2D positions, and $m(p,t,u)$ the projection of the corresponding model points onto the images given the current pose and shape estimates. We minimize this error by iteratively applying correction steps to $u$, given the current error and the Jacobian of $f$.
This method uses all frames together to perform the optimization. Thus, a single value $\theta$ is estimated, common to all frames. Indeed, as the shape parameters characterize the identity, they are supposed to be constant (assuming that the person does not change their facial expression).
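As an illustration of this joint pose and shape optimization, the toy sketch below minimizes a feature-point energy of the same form with a generic least-squares solver. It assumes a single orthographic camera and a yaw-only pose, unlike the calibrated multi-camera setup used in this work, and omits the per-frame 1/D(t) weighting; all dimensions are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
n_pts, M, T = 8, 2, 5                                  # fiducials, modes, frames
s_bar = rng.normal(size=(n_pts, 3))                    # mean model points
s_modes = rng.normal(scale=0.3, size=(M, n_pts, 3))    # deformation modes

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward(u):
    """Generate the model observations m(p, t, u) for a state u packing
    [alpha_1..alpha_M, then (yaw, tx, ty) for each frame]."""
    alpha, poses = u[:M], u[M:].reshape(T, 3)
    shape = s_bar + np.tensordot(alpha, s_modes, axes=1)
    frames = [(shape @ rot_z(yaw).T)[:, :2] + [tx, ty] for yaw, tx, ty in poses]
    return np.stack(frames)                            # (T, n_pts, 2)

# Simulated detections o(p, t): a ground-truth state plus detection noise.
u_true = np.concatenate([[0.5, -0.3], rng.normal(scale=0.2, size=3 * T)])
obs = forward(u_true) + rng.normal(scale=0.01, size=(T, n_pts, 2))

def residuals(u):
    # Stacked feature-point residuals; least_squares minimizes their squared sum.
    return (forward(u) - obs).ravel()

u0 = np.zeros(M + 3 * T)       # mean shape and identity pose as initial guess
fit = least_squares(residuals, u0, method="lm")
print(fit.x[:M])               # recovered shape parameters alpha
```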
4.2 Particle Filter Optimization
The particle filter (PF) method used to estimate the pose and shape throughout a sequence is inspired by (Herold et al., 2012). The idea of this algorithm
is to integrate the shape parameters θ to be estimated
in the particle state, and to update the density p(θ)
with each new observation. The particle weights are
computed by comparing the projection of the land-
marks given the particle state (a pose and a set of pa-
rameters) to the ones detected in the images. This
method is applied recursively (Figure 1(a)), meaning
that the shape estimation is updated at each instant
given the new observations.
At each time $t$, the following procedure is applied given the set of $N$ particles and the new observations $y_t$:

- for each particle $i$: (i) move the static shape parameters to obtain a new hypothesis $\theta_t^{(i)}$; (ii) estimate the pose $(R_t^{(i)}, T_t^{(i)})$ given a subset of the feature point detections and the particle shape parameters $\theta_t^{(i)}$; (iii) update its weight by computing the likelihood of the state given the observations;

- compute the current output state $(R_t, T_t, \theta_t)$. This is done by choosing the particle with the highest weight, or by computing the weighted mean over the set of particles.
Unlike the LM method, only the observations up to time $t$ are used when computing the estimation at this instant. As only a few features are used in each view to evaluate the pose and the shape of the face, this method allows us to maintain several shape parameter hypotheses and to validate them when new discriminant observations become available.
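A minimal sketch of one such update step is given below, assuming Gaussian jitter for step (i); `estimate_pose` and `likelihood` are hypothetical placeholders for the pose fit on a detection subset and the observation model, and the resampling step is a standard particle filtering ingredient not detailed above.

```python
import numpy as np

def pf_step(particles, weights, y_t, rng, estimate_pose, likelihood, jitter=0.01):
    """One recursive update of the particle filter over the static shape
    parameters theta, following steps (i)-(iii) above.

    particles: (N, M) array, one shape hypothesis theta per particle.
    weights:   (N,) normalized particle weights.
    y_t:       new observations at time t.
    """
    N = len(particles)
    # (i) move the static shape parameters (artificial dynamics).
    particles = particles + rng.normal(scale=jitter, size=particles.shape)
    # (ii) estimate one pose per particle from a subset of the detections.
    poses = [estimate_pose(theta, y_t) for theta in particles]
    # (iii) reweight each (pose, shape) hypothesis by its likelihood.
    weights = weights * np.array([likelihood(th, po, y_t)
                                  for th, po in zip(particles, poses)])
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses (standard step).
    if 1.0 / np.sum(weights ** 2) < N / 2:
        keep = rng.choice(N, size=N, p=weights)
        particles, weights = particles[keep], np.full(N, 1.0 / N)
    # Output state: weighted mean over the particle set (one of the two options).
    theta_t = weights @ particles
    return particles, weights, theta_t
```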
5 EVALUATION
In this section, we apply the proposed methodology
to evaluate the head model fitting quality depending
on the number and the positions of the cameras used
to acquire the images in the gate. This evaluation is
done considering the results obtained on the synthetic
sequences with the two fitting algorithms presented
in Section 4. The LM implementation is based on the
levmar library available online (Lourakis, 2004). We
do not use the known feature point positions as inputs
for the two fitting algorithms. Instead, we launch the
feature point detectors used for real sequences, in order to have the same noise and the possible bad or missing detections associated with these detectors.
Errors presented below are not given in pixels but as a percentage of the distance between the two eyes, to obtain a scale-independent measure. Figure 9 illustrates
the error distribution for a subset of sequences of
the database using some of the configurations listed
above. The PF method has been used to generate
these results, which are given as a percentage of the
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
302
inter-eye distance (ied). We can see that in most cases, using 4 cameras (green bars) outperforms all other configurations, as no sequence has an error above 11% of the ied. The 3-camera configuration (blue bars) comes next, with almost all errors below 11% of the ied. Finally, with only 2 cameras, some errors reach 14% of the ied. The errors are however smaller with the 2C configuration (in purple), where more points can be seen thanks to the viewpoint change.
Figure 10 illustrates how consistent the estimated model is with the observations. Only a small set of feature points, related to the edges plotted on the left, are used to evaluate the pose and shape parameters.
Figure 9: Distribution of the error ($Err_{3D}$, in % of the inter-eye distance, against the number of sequences) for different camera configurations on some sequences: two cameras (2A, 2B, 2C), three cameras (3B) and four cameras (4A).
Table 1 shows the mean errors computed over all the sequences of the database with respect to the camera configurations, for the PF and the LM methods. We can see that the error and the number of cameras are correlated: the general trend is that the fitting improves when more views are used. Nevertheless, for a given number of cameras, their positions can also impact the quality, as shown for instance for 2A (vertical alignment) and 2C (crossed cameras). The improvement with the PF method is due to the sampled subset of feature points, which leads to more robustness to outliers. The stability of this sampling-based method has been verified by running it five times for the whole set of sequences. The standard deviation of the means of $Err_{3D}$ computed on configuration 2A (resp. 4A) is 0.12% (resp. 0.05%) of the inter-eye distance, which is small relative to the largest error variations observed between the different configurations.

Figure 10: Example of edge and mesh fitting with the particle filter method (zoom on the face area). The inter-eye distance on the input image is 36 pixels.
Table 1: Mean errors with the PF and the LM methods given different camera configurations. For the PF method, the mean is computed over 5 runs. Errors are given as a percentage of the inter-eye distance.

System          2A     2B     2C     3A     3B     3C     4A
LM  Err_3D      23.3   22.8   22.9   22.7   22.9   26.2   22.2
PF  Err_3D      12.2   11.8   11.0   10.8   10.5   10.4   10.5
LM  Err_CP_3D   16.7   16.0   16.1   16.0   16.2   18.5   15.4
PF  Err_CP_3D    6.6    5.8    5.6    4.6    4.5    5.0    4.7
LM  Err_2D       9.2    9.5    8.1    7.9    8.3    8.9    7.6
PF  Err_2D       8.4    8.2    8.2    7.6    7.2    6.5    6.9
The error magnitudes in this table should be put in perspective with the resolution of the images (600 × 800), the distance of the person to the sensors (between 1.5 and 2 meters), and the sparse distribution of the features used for the fitting. Nevertheless, the relative gain between the worst and the best configuration reaches 14.7% (resp. 4.7%) for the PF method (resp. the LM method) considering the error $Err_{3D}$.
Figure 11 illustrates the error distribution over the face for three faces of our database, using the 4-camera configuration. The 3D errors are not distributed uniformly over the mesh, because only a few fiducial points are used to perform the fitting. In some areas of the mesh, such as the neck, above the ears or on the cheeks, there are therefore no cues to constrain the fitting. This explains the higher errors in these areas, in comparison with the eye areas, where the error is less than 10% of the inter-eye distance.

Figure 11: 3D error ($Err_{3D}$) distribution over the face for three faces of the synthetic database. The particle filter method with 4 cameras has been used to estimate the shape. Errors are given as a percentage of the inter-eye distance.
3DHeadModelFittingEvaluationProtocolonSyntheticDatabasesforAcquisitionSystemComparison
303
Influence of the shape and texture on the accuracy.
We now verify the influence of texture or shape on the pose and shape estimation quality. To this end, we used two new sets of synthetic data:

- base $B_{shape}$: ten sequences, changing only the shape from one sequence to another, all other parameters remaining fixed;

- base $B_{tex}$: ten sequences, changing only the texture from one sequence to another.
We evaluated the pose and shape estimation using the Levenberg-Marquardt algorithm. The accuracy variation for each of these bases is given in Table 2. We report only the 2D errors ($Err_{2D}$), using the 3-camera configuration 3A.
Table 2: Error variations depending on shape or texture variations only. The Levenberg-Marquardt optimization has been used on configuration 3A.

Variation           Mean   Sigma   Min    Max
Shape (B_shape)     7.65   1.8     5.72   11.87
Texture (B_tex)     7.57   0.45    7.03   8.46
The results are significantly more stable with the base $B_{tex}$ than with the base $B_{shape}$. This can be explained by the fact that texture variations only slightly alter the detector quality at fixed pose and shape. For instance, the appearance of an eye corner does not change considerably across different facial textures. The detected points are therefore almost the same for all sequences of $B_{tex}$, leading to very similar estimations. For the base $B_{shape}$, the texture and the poses are fixed for all sequences, so we can assume that the quality of the detections is equivalent for all of them. Nevertheless, the errors obtained for this base vary more than for $B_{tex}$, which is due to the shape variability in the sequences. Indeed, some real shapes cannot be generated exactly because of the model constraints. Some faces will therefore be easy to represent and lead to low errors, while for others it will not be possible to fit the model correctly to the data. This explains why it is important to use real head scans when generating the synthetic sequences, in order to reproduce this problem when evaluating the pose and shape estimation algorithms.
6 CONCLUSIONS AND FUTURE
WORK
We have presented a complete workflow to evaluate configurations of face recognition gates in terms of 3D fitting quality. The methodology we propose is based on synthetic data, which can be generated with any number and configuration of cameras, lighting conditions and resolutions, while keeping the other conditions fixed (identities, face poses). This allows us to test an unlimited number of alternatives, without the bias introduced by variations in people's behavior and trajectories, or the constraints related to real acquisition campaigns and hardware construction. The evaluation is based on the accuracy measure of the 3D head fitting, which is easily computable as we benefit from the ground truth used to generate the sequences. The general trend shows that increasing the number of cameras improves the accuracy of the estimation. Moreover, for a fixed number of cameras, their positions also impact the accuracy: diversifying the points of view increases the estimation quality (two crossed cameras are better than two vertically aligned cameras, for instance). This factor can be optimized through simulations, thus limiting the number of real systems to build for the real-data evaluation (for instance, evaluation of configuration 3C is not possible with the initial 4-camera system). In the future, such studies could be extended to other factors, such as lighting and expression.
We limited our evaluation to geometrical results on synthetic data. Another extension of this work would be to develop the following aspects. First, it would be interesting to compute geometrical measures on real data. The difficulty here is to obtain the real position of each face vertex during a sequence: additional depth sensors should be used to this aim, or, at least, the ground truth of the face should be known (using a 3D scanner, for instance). Besides, the relation between biometric performances and estimation errors (3D pose and shape) should be investigated further, with respect to different face comparison algorithms.
REFERENCES
Amberg, B., Blake, A., Fitzgibbon, A., Romdhani, S., and
Vetter, T. (2007). Reconstructing High Quality Face-
Surfaces using Model-Based Stereo. In International
Conference on Computer Vision, pages 1–8.
Blanz, V., Grother, P., Phillips, P., and Vetter, T. (2005).
Face Recognition Based on Frontal Views Generated
from Non-Frontal Images. In Conference on Com-
puter Vision and Pattern Recognition, pages 454–461.
Blanz, V. and Vetter, T. (1999). A Morphable Model for the
Synthesis of 3D Faces. In SIGGRAPH, pages 187–
194.
Herold, C., Despiegel, V., Gentric, S., Dubuisson, S., and
Bloch, I. (2012). Head Shape Estimation using a Par-
ticle Filter including Unknown Static Parameters. In
International Conference on Computer Vision Theory
and Applications, pages 284–293.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
304
Lourakis, M. (2004). levmar: Levenberg-Marquardt Nonlinear Least Squares Algorithms in C/C++. http://www.ics.forth.gr/~lourakis/levmar/.
Marquardt, D. (1963). An Algorithm for Least-Squares
Estimation of Nonlinear Parameters. Journal of
the Society for Industrial and Applied Mathematics,
11(2):431–441.
Park, I. K., Lee, K. M., and Lee, S. U. (2002). Efficient Measurement of Shape Dissimilarity between 3D Models Using Z-Buffer and Surface Roving Method. EURASIP Journal on Applied Signal Processing, 2002(10):1127–1134.
PovRay (2012). Persistence of Vision Raytracer (version
3.6). http://www.povray.org/download/.
Romdhani, S. and Vetter, T. (2005). Estimating 3D Shape
and Texture using Pixel Intensity, Edges, Specular
Highlights, Texture Constraints and a Prior. In Con-
ference on Computer Vision and Pattern Recognition,
pages 986–993.
Umeyama, S. (1991). Least-Squares Estimation of Trans-
formation Parameters Between Two Point Patterns.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 13(4):376–380.
Van Rootseler, R. T. A., Spreeuwers, L. J., and Veldhuis, R.
N. J. (2011). Application of 3D Morphable Models
to Faces in Video Images. In Symp. on Information
Theory in the Benelux, pages 34–41.
3DHeadModelFittingEvaluationProtocolonSyntheticDatabasesforAcquisitionSystemComparison
305