(Hartley and Zisserman, 2004), which iteratively refines the 3D placement of these keypoints. In a second phase, the paper uses a variant of Delaunay triangulation to obtain the final mesh of the object.
Beyond the works described above, current approaches mainly make use of so-called range cameras, which provide a depth image in addition to the usual visible image. One of the first integral systems using this kind of camera was the work presented in (Rusinkiewicz et al., 2002), covering acquisition, registration, surface triangulation and final rendering. Its main drawback was the need for user intervention in several steps of the process, which in addition increased the overall processing time.
A more recent approach using range cameras is presented in (Weise et al., 2011). This paper also covers the whole 3D modeling process, from acquisition to final rendering, and solves most of the problems present in (Rusinkiewicz et al., 2002), partly thanks to the advances in computing power between the two papers.
It must be taken into account that all the approaches available in the literature require rotating the object around its own axis or, equivalently, rotating the camera. Neither possibility is suitable for our application: a rotating person may make small involuntary body movements, and rotating a camera around the person requires a large amount of space.
3 PRELIMINARY WORK
3.1 Camera Used: Microsoft Kinect
The range camera used during our experiments is the recently released Microsoft Kinect. The Kinect device has an RGB camera, an IR camera and a laser-based IR projector. To obtain the range image, this camera does not use the Time of Flight method (Gokturk et al., 2004), but triangulation between the captured image and a known pattern emitted by the sensor. While the laser-based IR projector emits a mesh of IR light onto the scene, the IR camera captures the distance of every point of the IR mesh emitted by the projector.
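As a rough illustration of the triangulation principle (not the Kinect's actual proprietary algorithm), depth can be recovered from the shift, or disparity, of each projected dot relative to a stored reference pattern, given a focal length and a projector-camera baseline. The numeric values below are illustrative assumptions, not the device's real calibration:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth from the shift (disparity) between the observed
    IR dot pattern and a stored reference pattern, using the pinhole
    relation depth = focal * baseline / disparity."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    with np.errstate(divide="ignore"):
        depth = focal_px * baseline_m / disparity_px
    depth[~np.isfinite(depth)] = 0.0  # zero disparity -> unknown depth
    return depth

# Illustrative values (assumed, not the Kinect's calibration):
# a 10 px shift with focal = 580 px and baseline = 7.5 cm gives ~4.35 m.
d = depth_from_disparity(np.array([10.0]), 580.0, 0.075)
```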
For a typical Kinect, the resulting RGB image has
a resolution of 640 x 480 pixels, and the depth im-
age has a resolution of 320 x 240 pixels. The IR and
RGB cameras are separated by a small baseline, so they must be calibrated with respect to each other. However, since this is a commonly used range camera, its calibration values are well known by the community.
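A minimal sketch of how such a calibration is used in practice: each depth pixel is back-projected to a 3D point, transformed into the RGB camera frame, and re-projected to find its color. The intrinsics and extrinsics below are illustrative assumptions, not the Kinect's real calibration values:

```python
import numpy as np

# Illustrative intrinsics/extrinsics, NOT the true Kinect calibration.
K_IR  = np.array([[580., 0., 160.], [0., 580., 120.], [0., 0., 1.]])
K_RGB = np.array([[525., 0., 320.], [0., 525., 240.], [0., 0., 1.]])
R = np.eye(3)                      # rotation IR -> RGB (assumed ~identity)
t = np.array([0.025, 0.0, 0.0])    # assumed ~2.5 cm baseline along x

def depth_pixel_to_rgb(u, v, z):
    """Back-project an IR/depth pixel (u, v) at depth z (metres) to 3D,
    transform it into the RGB camera frame, and re-project it."""
    p_ir = z * np.linalg.inv(K_IR) @ np.array([u, v, 1.0])  # 3D in IR frame
    p_rgb = R @ p_ir + t                                    # 3D in RGB frame
    uvw = K_RGB @ p_rgb                                     # pinhole projection
    return uvw[0] / uvw[2], uvw[1] / uvw[2]
```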
Figure 1: Sequence of the scanning process using a turntable example. Only 4 scans are shown, but the sequence can be composed of a large number of scans. For each scan, the RGB image and the depth image are shown.

Images obtained with the Microsoft Kinect are noisy, and static objects tend to be detected at different ranges in consecutive captures. In addition, the device has problems detecting object contours, and small structures often cannot be detected at all. For these reasons the depth image should usually be filtered to mitigate these effects.
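One simple filtering option, sketched below, is a median filter, which suppresses impulse-like depth noise while preserving edges better than averaging. This is a minimal illustration; a production filter would also treat zero (invalid) depth pixels specially:

```python
import numpy as np

def median_filter_depth(depth, ksize=3):
    """Replace each depth pixel with the median of its ksize x ksize
    neighbourhood (edge-padded). Removes isolated noisy readings while
    keeping depth discontinuities sharper than a mean filter would."""
    pad = ksize // 2
    padded = np.pad(depth, pad, mode="edge")
    out = np.empty_like(depth, dtype=float)
    h, w = depth.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + ksize, j:j + ksize])
    return out
```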
The resulting 3D image after using the Kinect is a set of 3D points, without any surface information. However, thanks to the known IR pattern emitted by the sensor, it is simple to directly connect neighboring 3D points and perform a fast triangulation.
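Such a fast triangulation over an organized depth image can be sketched as follows: every 2x2 cell of valid neighboring pixels yields two triangles, and edges with a large depth jump (an assumed threshold, not a value from the paper) are skipped since they likely cross an object contour:

```python
import numpy as np

def triangulate_organized(depth, max_jump=0.05):
    """Connect neighbouring pixels of an organized depth image into
    triangles. Each 2x2 cell of valid (> 0) pixels yields two triangles;
    cells spanning a depth jump larger than max_jump metres are skipped,
    as they likely straddle an object contour."""
    h, w = depth.shape
    idx = lambda i, j: i * w + j  # flat vertex index of pixel (i, j)
    tris = []
    for i in range(h - 1):
        for j in range(w - 1):
            quad = depth[i:i + 2, j:j + 2]
            if (quad > 0).all() and quad.max() - quad.min() < max_jump:
                tris.append((idx(i, j), idx(i + 1, j), idx(i, j + 1)))
                tris.append((idx(i + 1, j), idx(i + 1, j + 1), idx(i, j + 1)))
    return tris
```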
3.2 Problems with Existing Approaches
Different approaches exist in the literature for the proposed 3D modeling problem. In this work we study only the two most relevant ones: the turntable approach and the multiple-cameras approach.
3.2.1 Turntable Approach
The most common method used in 3D modeling consists of placing the object on a turntable, allowing the object to be captured from several viewpoints. The
3D sensor can be fixed in an appropriate place and
successive 3D captures of the object are obtained dur-
ing its rotation. The result of this scanning process is
a set of partial scans of the object, including both the
depth and the RGB information. An example using a
model person is shown in Figure 1.
Once the different partial scans have been obtained, the multiple views are registered together in order to obtain a complete representation of the object. For this purpose a two-step method is usually used, starting with the so-called pairwise registration between pairs of partial scans (Besl and McKay, 1992), followed by the multiview registration, which makes use of the local information of multiple pairwise registrations and minimizes the global registration error (Sharp et al., 2004; Shih et al., 2008).
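The core of each iteration of pairwise registration in (Besl and McKay, 1992) is finding the rigid transform that best aligns two sets of corresponding points, which has a closed-form solution via SVD. A minimal sketch of that alignment step (correspondence search is omitted):

```python
import numpy as np

def best_rigid_transform(src, dst):
    """One ICP alignment step (Besl & McKay, 1992): given corresponding
    point sets src and dst (both N x 3), find the rotation R and
    translation t minimizing ||R @ src_i + t - dst_i||^2, via the SVD of
    the cross-covariance of the centred point sets."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In a full pairwise registration this step alternates with a nearest-neighbor correspondence search until the residual error converges.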
However, this kind of acquisition method is not
suitable for human modeling. Even minimal movements of the subject during the rotation can produce errors in the final registration. Also, many people are reluctant to be rotated, which can be a problem for a possible commercial product.
VISAPP 2012 - International Conference on Computer Vision Theory and Applications