3D Face Reconstruction from RGB-D Data by Morphable Model to Point
Cloud Dense Fitting
Claudio Ferrari, Stefano Berretti, Pietro Pala and Alberto Del Bimbo
Media Integration and Communication Center, University of Florence, Florence, Italy
Keywords:
3DMM Construction, 3DMM Fitting, 3D Face Analysis.
Abstract:
3D cameras for face capturing are quite common today thanks to their ease of use and affordable cost.
The depth information they provide is mainly used to enhance face pose estimation and tracking, and face-
background segmentation, while applications that require finer face details are usually not possible due to the
low-resolution data acquired by such devices. In this paper, we propose a framework that allows us to derive high-quality 3D models of the face starting from low-resolution depth sequences acquired with a depth camera. To this end, we start by defining a solution that exploits temporal redundancy in a short sequence of adjacent depth frames to remove most of the acquisition noise and produce an aggregated point cloud with intermediate-level details. Then, using a 3DMM specifically designed to support local and expression-related deformations of the face, we propose a two-step 3DMM fitting solution: initially, the model is deformed under the effect of landmark correspondences; subsequently, it is iteratively refined by updating point correspondences based on closeness, guided by a mean-square-error optimization. Preliminary results show that the proposed solution is able to derive 3D models of the face with high visual quality; quantitative results also show the superiority of our approach with respect to methods that use a one-step fitting based on landmarks.
1 INTRODUCTION
Recent advances in 3D scanning technologies make it
possible to acquire registered RGB and depth frames
at affordable cost. The availability of depth and RGB
data greatly simplifies the design of video analysis
modules for object detection and segmentation. In
fact, discontinuities in the depth domain highlight
the presence of object boundaries that can be diffi-
cult to detect in the RGB domain, especially in the case of background clutter and/or uneven illumination conditions. However, a common trait of these
low-cost RGB-D cameras—including the Microsoft
Kinect, the Asus Xtion and the Intel RealSense depth
cameras—is that depth data of individual frames are
badly affected by noise that prevents the accurate
reconstruction of the 3D geometry of the observed
scene. This is particularly true if the task of recon-
structing the 3D geometry of non-planar object sur-
faces is targeted, such as the case of reconstructing
the 3D geometry of faces. In the last few years, sev-
eral 3D face reconstruction approaches have been pro-
posed, also in truly uncooperative, in-the-wild conditions (Booth et al., 2018). However, a common trait of these approaches is that the reconstruction of the 3D face model from data observed in a generic image or video frame is aimed at reproducing realistic renderings of the observed face in a different pose (e.g., frontal pose) for the purpose of boosting the accuracy of person or facial expression recognition. In
these solutions, smoothness of the reconstructed 3D
face model is privileged with respect to fitting to the
actual 3D geometry of the face. Indeed, this yields
pleasant and realistic face renderings, but may prove
inadequate to provide a precise reconstruction of the
3D geometry of the face and its deformations in the
presence of voluntary and involuntary expressions.
Motivated by these premises, in this paper we
propose a novel face modeling approach that, starting from a low-resolution RGB-D sequence of the face, is capable of reconstructing an accurate 3D face model over time, also in the presence of facial expressions and generic facial deformations. This opens the
way to application scenarios that aim to analyse the
face deformations at a fine scale so as to understand
the 3D local face variations with respect to a neutral
model. For example, this can have applications in face
rehabilitation, where a subject at home (e.g., a patient
recovering after a stroke or a face surgery) can be in-
structed to perform some facial deformations in front
of a PC. In such a scenario, high-resolution 3D scanners cannot be employed due to their size and cost, and RGB data alone cannot provide the required level of accuracy. In our proposed solution, we start from the idea of using a 3DMM that also includes modes of deformation associated with facial expression variations and is specifically designed to account for local changes of the face. This is possible thanks to a
dictionary learning framework that gives the atoms of
the dictionary the capability of producing local de-
formations of the average model of the face. The
model is then fit to a point cloud as captured by a low-
resolution scanner (e.g., Kinect) through a coarse-to-
fine solution: the initial fitting is driven by the cor-
respondence of a set of landmarks; the coarsely de-
formed model is then refined by an iterative closest-point reassignment that minimizes the mean square error between corresponding points, in an ICP-like manner. Preliminary results have been obtained by
measuring the reconstruction error between the 3D
models derived by fitting the 3DMM to the target
scans of two large face datasets and the target scans
themselves used as ground truth. Results unequivocally indicate superior accuracy for the proposed solution with respect to using a one-step 3DMM fitting based on landmarks.
The rest of the paper is organized as follows. In Section 2, we summarize the previous works on 3DMM construction and fitting that are closest to our proposal. In Section 3, we first report on the proposed method for 3DMM construction; then, the way the 3DMM is fit to depth scans is detailed. The denoising operation applied to low-resolution sequences of face scans before 3DMM fitting is introduced in Section 4. Experimental results are reported in Section 5. Finally, Section 6 concludes the paper with a discussion and directions for future work.
2 RELATED WORK
In general, methods capable of reconstructing a 3D
model of the face from low-resolution depth data can
be categorized as either driven by the data or based
on fitting a 3D (morphable) model.
Methods in the first category build a 3D face
model by integrating tracked live depth images into
a common final 3D model. For example, a method
to produce laser scan quality 3D face models from
a freely moving user with a low-cost, low-resolution
depth camera was proposed in (Hernandez et al.,
2012). The model is initialized with the first depth
image, and then each subsequent cloud of 3D points
is registered to the reference using a GPU implemen-
tation of the ICP algorithm. This registration is robust
in that it rejects poor alignment due to facial expres-
sions, occlusions, or a poor estimation of the transfor-
mation. Temporal and spatial smoothing of the suc-
cessively incremented model is performed thanks to
the introduction of the Bump Images framework to
parameterize the 3D facial surface in cylindrical co-
ordinates. One evident limitation of this approach
is that it is not capable of reconstructing expressive
models of the face. In (Newcombe et al., 2011), the
Kinect Fusion approach was proposed to fuse all of
the depth data streamed from a Kinect sensor into a
single global implicit surface model of the observed
scene in real-time. To this end, the current pose of
the sensor is obtained by tracking the live depth frame
relative to the global model using a coarse-to-fine ICP
algorithm, which uses all of the observed depth data
available. The Kinect Fusion method was developed
for a fixed scene; in (Izadi et al., 2011) the method
was extended by considering dynamic actions of the
foreground. Though the Kinect Fusion approach is
general, its application to 3D face reconstruction results in models that reduce the noise with respect to individual frames, but still show a quite visible gap with respect to high-quality scans. In (Anasosalu
et al., 2013), a method is presented for producing an
accurate and compact 3D face model in real-time us-
ing an RGB-D sensor like the Kinect camera. To this
end, after initialization, Bump Images are updated in
real time by using every RGB-D frame with respect to
the current viewing direction and head pose; these lat-
ter are estimated using a frame-to-global-model reg-
istration strategy. Though this method takes a live sequence of RGB-D images streamed from a fixed consumer RGB-D sensor with unknown head pose, the relative movement of the head between two successive frames is assumed to be small, and the facial expression is assumed not to change during reconstruction. The work in (Zollhöfer et al., 2014) presents a
combined hardware/software solution for marker-less
reconstruction of non-rigidly deforming objects with
arbitrary shape. First, a smooth template model of the subject is scanned while he/she moves rigidly. This
geometric surface prior is used to avoid strong scene
assumptions, such as a kinematic human skeleton or a
parametric shape model. Next, a GPU pipeline per-
forms non-rigid registration of live RGB-D data to
the smooth template using an extended non-linear as-
rigid-as-possible framework. High-frequency details
are fused onto the final mesh using a linear deforma-
tion model. However, this solution relies on a spe-
cific stereo matching algorithm to estimate real-time
RGB-D data. Other solutions were specifically de-
signed for faces with no expressions in constrained
scenarios (Berretti et al., 2014; Bondi et al., 2016).
Methods in the second category, i.e., methods that
use a 3DMM to reconstruct a face model from depth
data, are few. In fact, in the most common solutions
a 3DMM is fit to a 2D RGB target image (Blanz and
Vetter, 2003; Ferrari et al., 2017), with applications
that span from face rendering and relighting, to face
and facial expression recognition. In this work, instead, we are interested in the particular case where the target is a low-resolution 3D face scan as can be acquired by a Kinect-like camera. The method described in (Zollhöfer et al., 2011) employed a 3DMM
that is fit to the depth images obtained from an RGB-
D camera. The template mesh and the incoming
frame are aligned using features detected in the RGB
image as a coarse alignment step. The template is
then aligned non-rigidly to the incoming frame, and
the 3DMM is fit to the template. Unfortunately, this
approach produces results that are biased towards the
template. The work of (Kazemi et al., 2014) contributes a real-time method for recovering facial shape
and expression from a single depth image. The output
is the result of minimizing the error in reconstructing
the depth image, achieved by applying a set of iden-
tity and expression blend shapes to the model. A dis-
criminatively trained prediction pipeline is used that
employs random forests to generate an initial dense,
but noisy correspondence field. Then, a fast ICP-
like approximation is exploited to update these cor-
respondences, allowing a quick and robust initial fit
of the model. The model parameters are then fine-tuned to minimize the true reconstruction error using
a stochastic optimization technique.
However, none of these solutions can reconstruct
fine details of expressive faces using a 3DMM.
3 3D MORPHABLE FACE MODEL
The work in (Blanz and Vetter, 1999) first presented
a complete solution to derive a 3DMM by transform-
ing the shape and texture from a training set of 3D
face scans into a vector space representation based on
PCA. The 3DMM was further refined into the Basel
Face Model in (Paysan et al., 2009) with several other
subsequent evolutions (Patel and Smith, 2009; Booth
et al., 2016). Expressive scans were not part of the training in any of the solutions above. Indeed, two aspects have a major relevance in characterizing the different methods for 3DMM construction: (1) the human face variability captured by the model, which directly depends on the number and heterogeneity of the training scans; (2) the capability of the model to account for facial expressions, which in turn derives from the presence of expressive scans in the training. One of the few 3DMMs in the literature that exposes both these features is the Dictionary Learning based 3DMM (DL-3DMM) proposed in (Ferrari et al., 2017). Since we mainly build on this model, below we first describe the peculiar features that make the DL-3DMM suitable for our purposes, then we focus on the proposed fitting procedure.
DL-3DMM Construction. The first problem to be
solved in the construction of a 3DMM is the selection
of an appropriate set of training data. This should include sufficient variability in terms of ethnicity, gender, and age of the subjects so as to capture a large variance in the data. Apart from this, the most difficult aspect in preparing the training data is the need to provide a dense, i.e., vertex-by-vertex, alignment between the 3D scans. Differently from works in the literature that either use optical flow (Blanz and Vetter, 1999) or the non-rigid ICP algorithm (Paysan et al., 2009), the dense alignment of the training data for the DL-3DMM was obtained with a solution based on face landmark detection. These landmarks are used for partitioning the face into a set of non-overlapping regions, each one identifying the same part of the face across all the scans. By re-sampling the interior of each region based on its contour, a dense correspondence is derived region by region, and thus for the whole face. This method proved to be robust also to large expression variations, such as those occurring in the Binghamton University 3D Facial Expression (BU-3DFE) database (Yin et al., 2006). This dataset was used in the construction of the DL-3DMM.
Once a dense correspondence is established across the training data, these are used to estimate a set of M deformation components C_i, usually derived by PCA, that are linearly combined to generate novel shapes S starting from an average model m:

    S = m + Σ_{i=1}^{M} C_i α_i .    (1)
In the DL-3DMM, a dictionary of deformation com-
ponents is learned by exploiting the Online Dictio-
nary Learning for Sparse Coding technique (Mairal
et al., 2009). Learning is performed in an unsuper-
vised way, without exploiting any knowledge about
the data (e.g., identity or expression labels). The average model is deformed using the dictionary atoms D_i in place of C_i in Eq. (1). More details on the dictionary learning procedure can be found in (Ferrari et al., 2017). The average model m, the dictionary D, and the weights w constitute the DL-3DMM.
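For concreteness, the generative model of Eq. (1) amounts to a few lines of code. The following sketch (in Python with NumPy; array names and shapes are our illustrative assumptions, since no reference implementation is given in the paper) deforms the average model with the dictionary atoms:

    import numpy as np

    def deform_model(m, D, alpha):
        """Generate a novel shape from the average model (Eq. (1)).

        m:     (p, 3) average model vertices
        D:     (M, p, 3) deformation components (dictionary atoms D_i)
        alpha: (M,) deformation coefficients
        """
        # S = m + sum_i alpha_i * D_i, with atoms D_i in place of C_i
        return m + np.tensordot(alpha, D, axes=1)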
3DMM Fitting. The 3DMM was originally de-
signed with the goal of reconstructing the 3D shape
of a face from single images (Blanz and Vetter, 2003);
the large number of different techniques to fit the 3DMM to a face image, developed later on, can be divided into two main categories: analysis-by-synthesis, and geometric-based. Methods in the former cat-
egory perform a complex iterative procedure aimed
at generating a synthetic image as similar as possi-
ble to the input one, optimizing with respect to the
3DMM (shape and texture) and rendering (e.g., cam-
era or illumination) parameters. Despite their com-
plexity, the resulting reconstructions are rather ac-
curate. Nonetheless, given a textured rendering, it is hard to discern whether the retrieved shape resembles the real geometry of the face; this is because the same rendering might be the result of different combinations of the many involved parameters. Alternatively,
methods in the geometric-based category try to de-
form the 3DMM so as to match some geometrical fea-
tures detected on the image, like facial landmarks or
edges (Bas et al., 2016). These approaches exploit
the fact that human faces are composed of muscles—
hence facial movements involve an extended surface
rather than a single point—that are constrained to
fixed anthropometric proportions and limited variabil-
ity. Thus, when trying to deform the 3DMM to fit
a set of sparse landmarks, the surrounding surfaces
will smoothly follow the deformation in a statistically
plausible way. This motivates the coarse reconstruction of the shape of the whole face based only on a few control points. Obviously, the resulting reconstruction will be a coarse, but smooth, approximation of the real surface.
The proposed method attempts to fill this gap; it
builds upon the geometric approaches and extends
the fitting based on facial landmarks to a whole point
cloud. This extension implies a 3D scan correspond-
ing to the face image to be available, which changes
the context and the objective. In this configuration,
the goal becomes deforming a generic face shape to
match a target one, both represented as point clouds;
indeed, the problem can be seen as the non-rigid reg-
istration of point clouds, which is a well-known prob-
lem in computer vision for which many solutions have
been proposed throughout the years (Chui and Ran-
garajan, 2000; Amberg et al., 2007; Myronenko and
Song, 2010). All the approaches addressing the prob-
lem, however, are intended to work with generic point
clouds representing arbitrary objects, while in this case the problem is restricted to human faces. The main difficulty is that faces are highly deformable objects, which often makes such approaches fail in matching the two shapes. Conversely, we can exploit this prior knowledge and leverage a statistical tool such as the 3DMM to bound the deformation.
The proposed approach first performs a similarity transformation to map the target shape into the average model space, accounting for 3D rotation, translation and scale (SimilarityTransform in Algorithm 1). This is achieved by means of a set of 49 landmarks L_t ∈ R^{49×3}, which are detected on the face image and back-projected to the mesh. Depending on the data, the association method to be applied might change.
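As an illustration of this alignment step (a minimal sketch under our own assumptions, not the authors' code), the similarity transformation between the two landmark sets can be estimated in closed form with the Umeyama method:

    import numpy as np

    def similarity_transform(src, dst):
        """Closed-form s, R, t such that dst ≈ s * src @ R.T + t."""
        mu_s, mu_d = src.mean(0), dst.mean(0)
        src_c, dst_c = src - mu_s, dst - mu_d
        cov = dst_c.T @ src_c / len(src)         # 3x3 cross-covariance
        U, S, Vt = np.linalg.svd(cov)
        # Guard against reflections in the recovered rotation
        E = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
        R = U @ E @ Vt
        var_src = (src_c ** 2).sum() / len(src)  # variance of source points
        s = np.trace(np.diag(S) @ E) / var_src
        t = mu_d - s * R @ mu_s
        return s, R, t

Here src would be the 49 target landmarks L_t and dst the corresponding model landmarks m(I_l); the recovered transform is then applied to the whole target point cloud.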
To initialize the approach and account for large shape differences that might impair the subsequent steps, we apply the DL-3DMM fitting using the landmarks, similarly to (Ferrari et al., 2015). The average model m ∈ R^{p×3} is deformed onto the target shape t̃ ∈ R^{k×3} by minimizing the Euclidean distance of the landmarks, whose indices on m are indicated as I_l ∈ N^{49} (LandmarkFitting in Algorithm 1). Differently from (Ferrari et al., 2015), the fitting is performed directly in the 3D space and projection on the image plane is avoided. The deformation coefficients α are retrieved using the dictionary atoms d_i as:

    min_α ‖ L_t − m(I_l) − Σ_{i=1}^{|D|} d_i(I_l) α_i ‖²₂ + λ ‖ α ⊘ ŵ ‖²₁ ,    (2)

where ⊘ denotes element-wise division. In the equation, d_i(I_l) indicates that, for each dictionary atom, only the elements associated with the vertices corresponding to the landmarks are involved in the minimization. The solution is found in closed form, and the average model m is deformed using Eq. (1) to obtain an initial estimate m̂.
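A minimal sketch of this landmark-driven fit follows; we express the weighted regularizer as a Tikhonov (ridge) term so that the solution stays in closed form, which is an illustrative assumption of ours rather than the exact solver of (Ferrari et al., 2015):

    import numpy as np

    def landmark_fitting(L_t, m, D, w_hat, lmk_idx, lam=0.01):
        """Fit the coefficients alpha on the 49 landmark correspondences.

        L_t:     (49, 3) target landmarks
        m:       (p, 3) average model
        D:       (M, p, 3) dictionary atoms
        w_hat:   (M,) learned atom weights
        lmk_idx: (49,) landmark vertex indices I_l on m
        """
        r = (L_t - m[lmk_idx]).reshape(-1)          # landmark residual, (147,)
        A = D[:, lmk_idx, :].reshape(len(D), -1).T  # (147, M) design matrix
        # Ridge solution of min ||r - A a||^2 + lam * ||a / w_hat||^2
        G = np.diag(1.0 / w_hat ** 2)
        return np.linalg.solve(A.T @ A + lam * G, A.T @ r)

The initial estimate m̂ is then obtained by plugging the recovered α into Eq. (1).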
Then, we perform a rigid ICP registration between m̂ and t̃ to refine the alignment, and compute the per-vertex distance between the two meshes. We subsequently associate each vertex of m̂ to its nearest neighbor in t̃, obtaining a re-parametrization (VertexAssociation in Algorithm 1) of p indices of t̃. Note that k ≠ p in general, thus a vertex of m̂ might be associated with multiple vertices of t̃; even if p = k, this can still happen because of points that share the same nearest neighbor. Once the association is done, the DL-3DMM is fit by minimizing the Euclidean distance between each pair of associated points, using a least squares solution. The fitting method reported in (Ferrari et al., 2015) uses a regularized formulation (Eq. (2)), which is necessary to avoid uncontrolled deformations. In our case, we use all the vertices to fit the target shape; thus the usefulness of the regularization becomes marginal. The minimization of Eq. (2) becomes:
    min_α ‖ t̃ − m̂ − Σ_{i=1}^{|D|} d_i α_i ‖²₂ .    (3)
The procedure is repeated until the error between subsequent iterations falls below a threshold τ or the maximum number of iterations is reached. Algorithm 1 reports the pseudo-code of the proposed Point Cloud Fitting (PCF) procedure.

Algorithm 1: Point Cloud Fitting (PCF).
Input: Average Model m, Dictionary D, Weights w, Target Shape t, Landmarks L_t, Model Landmarks m(I_l), Error Threshold τ, Iterations Limit MaxIter
Output: Deformed Model m̂
  t̃ = SimilarityTransform(L_t, t, m(I_l));
  m̂ = LandmarkFitting(L_t, m(I_l), D, w);
  i = 0;
  while i < MaxIter and err > τ do
    ICP(t̃, m̂);
    t̃ = VertexAssociation(t̃, m̂);
    m̂ = ShapeFitting(t̃, m̂, D, w);
    err = ComputeEuclideanError(t̃, m̂);
    i = i + 1;
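The body of the while loop can be sketched as follows, using a k-d tree for the nearest-neighbor association; the rigid ICP step is omitted for brevity, and all names and shapes are our assumptions for illustration:

    import numpy as np
    from scipy.spatial import cKDTree

    def dense_fit_step(m_hat, D, target):
        """One VertexAssociation + ShapeFitting iteration.

        m_hat:  (p, 3) current model estimate
        D:      (M, p, 3) dictionary atoms
        target: (k, 3) target point cloud t
        """
        # Associate each model vertex with its nearest target point
        dist, nn = cKDTree(target).query(m_hat)
        t_assoc = target[nn]                   # re-parametrized target, (p, 3)
        # Unregularized least squares of Eq. (3) over all p vertices
        r = (t_assoc - m_hat).reshape(-1)      # (p*3,)
        A = D.reshape(len(D), -1).T            # (p*3, M)
        alpha, *_ = np.linalg.lstsq(A, r, rcond=None)
        return m_hat + np.tensordot(alpha, D, axes=1), float(dist.mean())

Iterating this step, interleaved with a rigid ICP alignment, until the error variation drops below τ or MaxIter is reached reproduces the loop of Algorithm 1.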
4 DEPTH DATA DENOISING
Low cost RGB-D scanners, such as the Kinect, can
acquire multimodal video streams consisting of reg-
istered RGB and depth data at approximately 30fps.
However, since the depth data are badly affected by
noise, a pre-processing is needed for noise reduction
and reconstruction of a regularized surface that ap-
proximates the 3D geometry of the observed object
(i.e., a face in our domain of interest).

Figure 1: Effect of local regression filtering for depth noise removal: (a) raw depth scan affected by noise as provided by the Kinect; (b) depth data after smoothing through a Laplacian operator; (c) depth scan after local regression filtering (points closer than 9 cm to the nose tip are retained).

Depth data are
regarded as z values of a function on the (x, y) plane, perpendicular to the line of sight of the scanner. Then, regularization is performed through the local regression scheme described in (Cleveland, 1979): estimation of the true depth value at a point p_i = (x_i, y_i, z_i) is accomplished by fitting a low-degree polynomial to the nearest neighbors N(p_i) of p_i. Operatively, the cardinality of N(p_i) is controlled through a smoothing parameter γ ∈ (0, 1). Large values of γ produce smooth regression functions that wiggle the least in response to fluctuations in the data. The smaller γ is, the closer the regression function conforms to the data, thus yielding poor robustness to noise. The noise-free estimate of point p_i is obtained by computing a least squares bilinear fit on the pairs (x_j, y_j) ↦ (x_j, y_j, z_j), with (x_j, y_j, z_j) ∈ N(p_i). Figure 1 shows the effect of applying the local regression module for noise reduction.
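A minimal sketch of the per-point local regression follows, assuming the depth map has already been converted to an (n, 3) point array; the fraction gamma of points used as neighborhood plays the role of the smoothing parameter γ, and a first-order (planar) fit in x and y stands in for the bilinear fit:

    import numpy as np
    from scipy.spatial import cKDTree

    def local_regression_denoise(points, gamma=0.3):
        """Replace each z value with a local least-squares fit over N(p_i)."""
        n = len(points)
        k = max(3, int(gamma * n))                # |N(p_i)| grows with gamma
        _, nn = cKDTree(points[:, :2]).query(points[:, :2], k=k)
        z_hat = np.empty(n)
        for i in range(n):
            P = points[nn[i]]                     # neighborhood N(p_i)
            # Fit z = a*x + b*y + c to the neighbors, then evaluate at p_i
            A = np.c_[P[:, 0], P[:, 1], np.ones(k)]
            coef, *_ = np.linalg.lstsq(A, P[:, 2], rcond=None)
            z_hat[i] = coef @ np.array([points[i, 0], points[i, 1], 1.0])
        return np.c_[points[:, :2], z_hat]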
5 EXPERIMENTAL RESULTS
In the following, we report on the evaluation of the
proposed approach for accurate 3D face reconstruc-
tion using the DL-3DMM and the PCF procedure.
The experiments are conceived to demonstrate that multimodal RGB and depth data, even if affected by noise and provided by low-resolution scanners, can be processed through the PCF procedure to reconstruct the 3D face shape. Furthermore, fitting the morphable model through PCF yields more accurate 3D reconstructions compared to fitting the model to the landmarks only.
In order to quantitatively evaluate the reconstruc-
tion accuracy of the proposed approach, we iden-
tified two publicly available datasets of 3D facial
scans, namely, the Binghamton University 3D Facial
Expression database (BU-3DFE) (Yin et al., 2006)
and the Face Recognition Grand Challenge database
(FRGC) (Phillips et al., 2005).
The BU-3DFE dataset has been largely employed
for 3D expression/face recognition; it contains scans
of 44 females and 56 males, with age ranging from 18
to 70 years old, acquired in a neutral plus six different
expressions: anger, disgust, fear, happiness, sadness,
and surprise (2500 scans in total). The subjects are
distributed across different ethnic groups or racial an-
cestries. This dataset has been used to train the DL-3DMM; a fully registered version of 1779 out of the 2500 scans is available.
The FRGC dataset is composed of 466 individu-
als, for a total of 4007 scans collected in two sepa-
rate sessions. Approximately 60% of these are in neutral expression, while the others show spontaneous expressions. For the experiments, we used the
“fall2003” session, comprising 1729 scans. In the fol-
lowing, we first present and discuss results on BU-
3DFE and FRGC, then for some Kinect scans. For
all the reported experiments, the regularization term λ of Eq. (2) has been fixed to 0.01; the error threshold τ and the maximum number of iterations in Algorithm 1
Figure 2: Per-model reconstruction error (in mm) comparing landmark fitting (FL), in blue, and full point cloud fitting (FL+PCF), in red. BU-3DFE (left) and FRGC (right).
have been fixed to 0.001 and 50 respectively.
Reconstruction Accuracy by Nearest Neighbor As-
sociation. In this experiment, we evaluate the accuracy of the vertex association by computing, for
each vertex of the deformable model, its distance to
the closest vertex of the 3D point cloud. It should
be noticed that such nearest neighbor point associa-
tion might not fully reflect the correctness of the re-
construction. Consider the case in which the 3DMM
fails to fit an open mouth; the distance between the two models should ideally be large, while it is likely that the error resulting from a nearest neighbor search will be small (or at least, smaller). To highlight the
potential of the proposed solution, values of the mean
accuracy are computed with reference to two distinct
cases: i) DL-3DMM fitting based on facial landmarks
(FL), and ii) DL-3DMM fitting based on facial land-
marks and PCF (FL+PCF). The mean error and stan-
dard deviation across all the models for the BU-3DFE
and FRGC datasets are reported in Table 1.
Table 1: Reconstruction error (in mm) for the landmark
fitting (FL) and full point cloud fitting (FL+PCF).
Dataset FL FL+PCF
BU-3DFE 4.507 ± 2.809 0.802 ± 0.235
FRGC 8.272 ± 4.939 0.455 ± 0.282
In Fig. 2 accuracy values are reported at a higher
level of detail, showing for each model of the dataset
(both BU-3DFE and FRGC are considered) the error
obtained by using FL (in blue color), and the error
obtained by using FL+PCF (in red color). Results demonstrate that a noticeable improvement is obtained with the proposed procedure. To provide a quali-
tative yet representative description of the accuracy
of 3D face reconstruction, Fig. 4 reports some heat-maps obtained by encoding with a chromatic value the error associated with each vertex of the reconstructed model; it can be appreciated how the real surface is accurately reconstructed and how fine details of regions where landmarks are missing are accurately replicated, while maintaining a smooth surface. The higher accuracy of the proposed solution is demonstrated by the presence of large regions with blue/cyan colors (low error values) for models reconstructed using FL+PCF, compared to regions with red/yellow colors (high error values) for models reconstructed using FL.

Figure 3: Per-model reconstruction error (in mm) with and without landmark initialization. BU-3DFE (left) and FRGC (right).
Landmarks Initialization. Detecting facial land-
marks on face images (or 3D data) is itself a challeng-
ing task, which is not exempt from failures. In order
to assess the robustness of our approach to the initial-
ization procedure, we compared the final accuracy in
case the method is whether initialized or not. From
Fig. 3, we can observe that the approach is rather
robust and still can accurately reconstruct the shape
without landmarks initialization. For some models,
the error increases significantly; this is the case of
models with large topological differences, e.g., open
mouth, which are more difficult to handle if a seman-
tic association is not available. Note that this happens
mostly in the BU-3DFE, while for the FRGC dataset,
such behavior happens in both the cases. This is mo-
tivated by the fact that, as above mentioned, the land-
mark detection can fail; if this happens, the initialized
shape
ˆ
m might be farther from the ground truth with
respect to the average model m.
Kinect Data Reconstruction. Our approach has also been tested on Kinect data, processed as described in Sect. 4. Because of the lack of datasets containing RGB-depth pairs, we collected a few sequences on which to qualitatively test our method, while quantitative and statistically meaningful results were presented in the previous sections. The average errors obtained on such sequences are 35.89 mm and 1.70 mm, respectively, for FL and FL+PCF, demonstrating the effectiveness of the approach even for low-resolution data.

Figure 4: Qualitative comparison between DL-3DMM fitting based on facial landmarks (FL) and DL-3DMM fitting based on facial landmarks and PCF (FL+PCF). Three leftmost columns: reconstructions and ground truth (GT) models; two rightmost columns: error heat-maps. From top to bottom: BU-3DFE, FRGC, and Kinect models (last two rows).

From Fig. 4, we can appreciate that the landmarks alone were not sufficient to reproduce the real shape, e.g., the nose, and could only coarsely capture the expression.
A critical aspect of dealing with low-resolution data is the nearest neighbor association; each collected Kinect scan k has about 3K vertices, while the 3DMM has 6704, almost double. If the association is performed with respect to the average model m, then each vertex of k will be associated, on average, with 2 vertices of m. The other way around, if we perform the association with respect to k, not all the vertices of m will have a mated point. In the former case, the resulting reconstruction will be more accurate, but some noise due to the redundancy in the point association will be introduced (Fig. 4, bottom row); in the latter, the reconstruction would be smoother, but less accurate (Fig. 4, third row). Moreover, since we do not have control over the whole point cloud, the fitting needs to be performed using the regularization term, i.e., Eq. (2), to avoid excessive noise. A feasible way to solve this issue is to modify the 3DMM formulation so as to make the deformation components D sparse.
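The two association directions can be made concrete with a short sketch (our illustration, not the paper's code):

    import numpy as np
    from scipy.spatial import cKDTree

    def associate(model, scan, model_to_scan=True):
        """Pair up points for the fitting, in either direction.

        model: (p, 3) 3DMM vertices (p = 6704 in the paper)
        scan:  (k, 3) Kinect point cloud (k ≈ 3000)
        """
        if model_to_scan:
            # Every model vertex gets a mate: more accurate fit, but each
            # scan point is reused ~p/k times, introducing redundancy noise
            nn = cKDTree(scan).query(model)[1]
            return model, scan[nn]
        # Every scan point gets a mate: smoother result, but some model
        # vertices remain unconstrained by the data
        nn = cKDTree(model).query(scan)[1]
        return model[nn], scan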
6 CONCLUSIONS
In this paper, we have proposed a 3DMM-based solution to reconstruct a highly-detailed 3D model of the face starting from a low-resolution RGB-D face sequence. This is obtained by the combined effect of two specific algorithmic solutions for 3DMM construction and fitting: on the one hand, we used a dictionary learning based 3DMM implementation that makes it possible to model local deformations of the face; on the other, the model is fit to a target point cloud by a two-step approach, where the 3DMM is first deformed under the effect of the correspondence between a limited set of landmarks, and subsequently refined by an iterative local adjustment of point correspondences. Further, a robust denoising
ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods
734
solution is applied to the RGB-D sequences in order
to preprocess sets of consecutive frames before apply-
ing the 3DMM fitting. Preliminary results have been
reported that show the reconstruction errors between
the 3D models derived after 3DMM fitting and the
corresponding ground truth scans. It clearly emerges that the proposed framework provides superior results with respect to a landmark-based solution.
As a further step in the direction of proposing our
system for face rehabilitation purposes, we are col-
lecting a face dataset that includes RGB-D sequences
of the face captured by a Kinect camera, and the
corresponding high-resolution scans acquired with a
3dMD scanner.
REFERENCES
Amberg, B., Romdhani, S., and Vetter, T. (2007). Optimal
step nonrigid ICP algorithms for surface registration.
In IEEE Conf. on Computer Vision and Pattern Recog-
nition (CVPR), pages 1–8.
Anasosalu, P. K., Thomas, D., and Sugimoto, A. (2013).
Compact and accurate 3-D face modeling using an
RGB-D camera: Let’s open the door to 3-D video con-
ference. In IEEE Int. Conf. on Computer Vision Work-
shops, pages 67–74.
Bas, A., Smith, W. A., Bolkart, T., and Wuhrer, S. (2016).
Fitting a 3D morphable model to edges: A comparison
between hard and soft correspondences. In Asian Con-
ference on Computer Vision, pages 377–391. Springer.
Berretti, S., Pala, P., and Del Bimbo, A. (2014). Face recog-
nition by super-resolved 3D models from consumer
depth cameras. IEEE Trans. on Information Forensics
and Security, 9(9):1436–1449.
Blanz, V. and Vetter, T. (1999). A morphable model for the
synthesis of 3D faces. In ACM Conf. on Computer
Graphics and Interactive Techniques, pages 187–194.
Blanz, V. and Vetter, T. (2003). Face recognition based on
fitting a 3D morphable model. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 25(9):1063–1074.
Bondi, E., Pala, P., Berretti, S., and Del Bimbo, A.
(2016). Reconstructing high-resolution face models
from kinect depth sequences. IEEE Trans. on Infor-
mation Forensics and Security, 11(12):2843–2853.
Booth, J., Roussos, A., Ververas, E., Antonakos, E.,
Ploumpis, S., Panagakis, Y., and Zafeiriou, S. (2018).
3D reconstruction of ’in-the-wild’ faces in images and
videos. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 40(11):2638 – 2652.
Booth, J., Roussos, A., Zafeiriou, S., Ponniah, A., and
Dunaway, D. (2016). A 3D morphable model learnt
from 10,000 faces. In IEEE Conf. on Computer Vision
and Pattern Recognition, pages 5543–5552.
Chui, H. and Rangarajan, A. (2000). A feature registration
framework using mixture models. In IEEE Workshop
on Mathematical Methods in Biomedical Image Anal-
ysis, pages 190–197.
Cleveland, W. S. (1979). Robust locally weighted regres-
sion and smoothing scatterplots. Journal of the Amer-
ican Statistical Association, 74(368):829–836.
Ferrari, C., Lisanti, G., Berretti, S., and Del Bimbo, A.
(2015). Dictionary learning based 3D morphable
model construction for face recognition with varying
expression and pose. In Int. Conf. on 3D Vision.
Ferrari, C., Lisanti, G., Berretti, S., and Del Bimbo,
A. (2017). A dictionary learning-based 3D mor-
phable shape model. IEEE Trans. on Multimedia,
19(12):2666–2679.
Hernandez, M., Choi, J., and Medioni, G. (2012). Laser
scan quality 3-D face modeling using a low-cost depth
camera. In European Signal Processing Conf. (EU-
SIPCO), pages 1995–1999. IEEE.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe,
R., Kohli, P., Shotton, J., Hodges, S., Freeman, D.,
Davison, A., and Fitzgibbon, A. (2011). Kinectfusion:
Real-time 3D reconstruction and interaction using a
moving depth camera. In ACM Symposium on User
Interface Software and Technology, pages 559–568.
Kazemi, V., Keskin, C., Taylor, J., Kohli, P., and Izadi, S.
(2014). Real-time face reconstruction from a single
depth image. In IEEE Int. Conf. on 3D Vision, vol-
ume 1, pages 369–376.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2009). Online
dictionary learning for sparse coding. In Int. Conf. on
Machine Learning, pages 689–696.
Myronenko, A. and Song, X. (2010). Point set registration:
Coherent point drift. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 32(12):2262–2275.
Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D.,
Kim, D., Davison, A. J., Kohi, P., Shotton, J., Hodges,
S., and Fitzgibbon, A. (2011). Kinectfusion: Real-
time dense surface mapping and tracking. In IEEE
Int. Symposium on Mixed and Augmented Reality.
Patel, A. and Smith, W. A. P. (2009). 3D morphable face
models revisited. In IEEE Conf. on Computer Vision
and Pattern Recognition, pages 1327–1334.
Paysan, P., Knothe, R., Amberg, B., Romdhani, S., and
Vetter, T. (2009). A 3D face model for pose and il-
lumination invariant face recognition. In IEEE Int.
Conf. on Advanced Video and Signal Based Surveil-
lance (AVSS), pages 296–301.
Phillips, P. J., Flynn, P. J., Scruggs, T., Bowyer, K. W.,
Chang, J., Hoffman, K., Marques, J., Min, J., and
Worek, W. (2005). Overview of the face recognition
grand challenge. In IEEE Work. on Face Recognition
Grand Challenge Experiments, pages 947–954.
Yin, L., Wei, X., Sun, Y., Wang, J., and Rosato, M. J.
(2006). A 3D facial expression database for facial be-
havior research. In IEEE Int. Conf. on Automatic Face
and Gesture Recognition, pages 211–216.
Zollhöfer, M., Martinek, M., Greiner, G., Stamminger, M., and Süßmuth, J. (2011). Automatic reconstruction of personalized avatars from 3D face scans. Computer Animation and Virtual Worlds, 22(2-3):195–202.
Zollhöfer, M., Nießner, M., Izadi, S., Rehmann, C., Zach, C., Fisher, M., Wu, C., Fitzgibbon, A., Loop, C., Theobalt, C., and Stamminger, M. (2014). Real-time non-rigid reconstruction using an RGB-D camera. ACM Trans. on Graphics, 33(4):156:1–156:12.