3 3D SHAPE RECONSTRUCTION
In a first step the relevant information, in this case the
captured human, has to be extracted from the individual
camera images by background subtraction. This
results in a binary silhouette image with each pixel
marked as either belonging to the human or to the
irrelevant background. In this work we employ the
technique presented in (Cheung et al., 2000).
Once the silhouette images for the current frame
have been computed successfully, they are used to
reconstruct a 3-dimensional representation of the
human. This is achieved by the theoretical concept of
the visual hull, the largest volume whose projection
into the cameras’ image planes exactly matches the
corresponding silhouettes.
Because the visual hull can have an arbitrary shape,
it is usually only approximated by discretizing the
3-dimensional space into a finite grid of voxels and
finding the subset of voxels that best represents the
actual visual hull. This is achieved by projecting
each voxel into the image planes of the individual
cameras and classifying it as part of the visual hull
if its projection intersects the corresponding
silhouette image. To check this intersection, the
projected voxel area is sampled at a small number of
pixels and the whole region is classified simply by
the ratio of silhouette sample pixels to non-silhouette
sample pixels, as in (Cheung et al., 2000).
We further simplify the computation of the projected
voxel region by representing a voxel as a disk
parallel to the image plane, which yields an
easy-to-sample circular shape instead of the
hexagonal projection resulting from a cubic voxel.
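The per-voxel test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the pinhole camera parameters (f, cx, cy), the ring-shaped sampling pattern, the threshold min_ratio and all function names are hypothetical. Voxel centers are assumed to be given in each camera's coordinate frame with z > 0, silhouettes are nested lists of 0/1 values, and a voxel is kept only if it passes the test in every camera, which follows from the definition of the visual hull.

```python
import math

def project_disk(center, radius, f, cx, cy):
    """Project a voxel, modelled as a disk parallel to the image plane,
    to a circle (u, v, r) in pixel coordinates (assumed pinhole camera)."""
    x, y, z = center  # camera coordinates, z > 0 assumed
    return (f * x / z + cx, f * y / z + cy, f * radius / z)

def disk_samples(u, v, r, n=8):
    """Sample the circle at its centre and at n points on a ring at half
    the radius (the sampling pattern is an assumption)."""
    pts = [(u, v)]
    for k in range(n):
        a = 2.0 * math.pi * k / n
        pts.append((u + 0.5 * r * math.cos(a), v + 0.5 * r * math.sin(a)))
    return pts

def in_silhouette(sil, u, v):
    """True if pixel (u, v) lies inside the image and is marked foreground."""
    col, row = int(round(u)), int(round(v))
    return 0 <= row < len(sil) and 0 <= col < len(sil[0]) and sil[row][col] == 1

def voxel_in_hull(centers, radius, cameras, silhouettes, min_ratio=0.5):
    """Keep the voxel only if, in every camera, the fraction of samples that
    hit the silhouette reaches min_ratio (hypothetical threshold)."""
    for center, (f, cx, cy), sil in zip(centers, cameras, silhouettes):
        u, v, r = project_disk(center, radius, f, cx, cy)
        samples = disk_samples(u, v, r)
        hits = sum(in_silhouette(sil, su, sv) for su, sv in samples)
        if hits / len(samples) < min_ratio:
            return False
    return True
```

Here centers holds one camera-space position of the same voxel per camera. Rejecting a voxel at the first failing camera avoids projecting it into the remaining views.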
Since we ultimately want to match the reconstructed
object to a surface model of the human body
(see Section 4), we additionally remove all internal
voxels, identified by having all of their
6-connected neighbors in the foreground. This is
an obvious optimization that reduces the complexity
of the following steps.
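The removal of internal voxels can be sketched as follows, assuming the foreground voxels are stored as a set of integer grid coordinates (this representation and the function name are assumptions made for illustration). Voxels outside the grid are treated as background.

```python
# Offsets of the six face-adjacent (6-connected) neighbours.
NEIGHBOR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                    (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def surface_voxels(occupied):
    """Return the foreground voxels that have at least one background
    neighbour, i.e. drop every voxel whose six neighbours are all foreground."""
    occupied = set(occupied)
    return {
        (x, y, z)
        for (x, y, z) in occupied
        if any((x + dx, y + dy, z + dz) not in occupied
               for dx, dy, dz in NEIGHBOR_OFFSETS)
    }
```

For a solid 3x3x3 block, for instance, only the single center voxel is removed and the 26 shell voxels remain.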
4 POSE ESTIMATION
Based on the 3-dimensional reconstruction of the
human, the current pose is estimated, represented
by the joint angles of a kinematic skeleton. The use of
an underlying kinematic skeleton as an abstraction of
the human motion system guarantees a valid pose
inside the constraints of the human body in each frame.
Unfortunately, the visual hull does not carry any
topological or semantic information; it need not even
be connected, owing to errors in the background
subtraction or the voxel reconstruction. The usual
approach is therefore to match the visual hull to an
idealized geometric model of the human body
(Cheung et al., 2000)(Caillette and Howard, 2004)
(Kehl and Gool, 2006)(Corazza et al., 2010). Because
the human body consists mainly of tubular parts, the
simplest geometric model of its shape is a set of
ellipsoids (Cheung et al., 2000)(Luck et al., 2001)
(Caillette and Howard, 2004), with each skeleton
segment assigned a corresponding ellipsoid that
describes the geometry of the surrounding body part.
In a first step, the ellipsoid model therefore has to
adopt the current pose of the reconstructed human by
being fitted to the current voxel reconstruction. For
this classic cluster analysis problem an
Expectation-Maximization algorithm is a viable
approach (Cheung et al., 2000)(Caillette and Howard, 2004):
1. Each voxel is assigned to the ellipsoid with the
shortest distance to it, resulting in the classification
of the voxels into body parts (fig. 1 (a)).
2. The ellipsoids’ parameters are recomputed using
a principal component analysis of the assigned
voxels, resulting in the ellipsoids adapting to the
voxel hull’s pose (fig. 1 (b)).
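One iteration of the two steps above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the distance to an ellipsoid is approximated by the norm of the voxel position in the ellipsoid's scaled local frame, the radii are set to two standard deviations along each principal axis, and ellipsoids with fewer than min_points voxels keep their previous parameters. None of these choices, nor the function names, are taken from the cited works.

```python
import numpy as np

def assign_voxels(voxels, centers, axes, radii):
    """Step 1: label each voxel with the index of the closest ellipsoid.
    The distance is the norm in the scaled ellipsoid frame (assumption)."""
    dists = []
    for c, R, r in zip(centers, axes, radii):
        local = (voxels - c) @ R          # coordinates along the columns of R
        dists.append(np.linalg.norm(local / r, axis=1))
    return np.argmin(np.stack(dists, axis=1), axis=1)

def refit_ellipsoid(points):
    """Step 2: principal component analysis of the assigned voxels."""
    center = points.mean(axis=0)
    cov = np.cov((points - center).T)       # np.cov expects variables in rows
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    radii = 2.0 * np.sqrt(np.maximum(eigvals, 1e-12))  # 2 sigma (assumption)
    return center, eigvecs, radii

def em_iteration(voxels, centers, axes, radii, min_points=4):
    """Run one assignment step followed by one refitting step."""
    labels = assign_voxels(voxels, centers, axes, radii)
    fitted = []
    for k in range(len(centers)):
        pts = voxels[labels == k]
        if len(pts) >= min_points:
            fitted.append(refit_ellipsoid(pts))
        else:
            fitted.append((centers[k], axes[k], radii[k]))  # keep old fit
    centers, axes, radii = (list(t) for t in zip(*fitted))
    return labels, centers, axes, radii
```

In this sketch voxels is an (N, 3) array of voxel centers, and each entry of axes is a 3x3 matrix whose columns are the ellipsoid axes, with the last column pointing along the longest extent. The iteration would be repeated until the labels no longer change.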
Once the pose of the geometry and the locations
of the individual body parts are known, the
corresponding kinematic pose can be extracted from them.
The joint angles are computed using inverse
kinematics (IK), with the ellipsoids’ center points
defining the goals of their corresponding skeleton
segments’ centers (fig. 1 (c)). The iterative numeric
IK methods benefit from the previous frame’s skeleton
pose, which is already a good initial value for the
estimation of the current frame’s pose. Occlusions or
errors in the background subtraction may result in
incorrectly fitted ellipsoids, which should not be used
to drive the IK. Therefore, whenever the number of
voxels assigned to an ellipsoid does not reach half the
average number of assigned voxels over the whole motion,
this ellipsoid does not define a goal for its
corresponding segment, similar to the measure used in
(Caillette and Howard, 2004).
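The selection of reliable IK goals can be sketched as follows. The sketch assumes that the average is kept per ellipsoid (for instance as a mean over the frames processed so far), which is one possible reading of the criterion above. The function and parameter names are hypothetical.

```python
def active_goals(counts_now, mean_counts, fraction=0.5):
    """Return the indices of the ellipsoids that may drive the IK.

    counts_now[i]  -- voxels assigned to ellipsoid i in the current frame
    mean_counts[i] -- average voxel count of ellipsoid i over the motion
    An ellipsoid is skipped if its count does not reach fraction * mean.
    """
    return [
        i
        for i, (n, mean) in enumerate(zip(counts_now, mean_counts))
        if n >= fraction * mean
    ]
```

With counts of [100, 20, 50] and averages of [120, 80, 100], the second ellipsoid is skipped, while the third just reaches half of its average and still defines a goal.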
In practice, solving a complete IK over the whole
skeleton turns out to be not very robust due to the high
maneuverability of the root joint. Therefore, the
position and orientation of the root are precomputed
explicitly based on a few assumptions:
• The line connecting both hip joints always lies in
a horizontal plane.
• The horizontal orientation of the root is equal to
the horizontal orientation of the upper torso.
• The height of the root above the ground does not
change significantly over the whole motion.
GRAPP 2013 - International Conference on Computer Graphics Theory and Applications