GPU-accelerated Real-time Markerless Human Motion Capture

Christian Rau and Guido Brunnett

Computer Graphics Group, Chemnitz University of Technology, Strasse der Nationen 62, Chemnitz, Germany

Keywords:

Motion Capturing, Shape-from-Silhouette, Voxelization, Pose Estimation, GPGPU.

Abstract:

We present a system for capturing human motion based on video data from multiple cameras. It computes a 3-dimensional voxel-based reconstruction of the human together with an estimate of the full-body pose in each frame. The use of an underlying kinematic skeleton together with an idealized geometric model can guarantee a valid pose even in the face of occlusions and the resulting incomplete spatial information gained from the cameras. The data-parallel nature of the algorithms used makes them well suited for implementation on modern graphics hardware. In this way the motion can be captured in real time on a single PC, even though the reconstruction is computed accurately enough for a high-quality pose estimation.

1 INTRODUCTION

Most systems for accurately measuring human motion require the human to wear special markers or sensors, which may restrict movement and can be time-consuming to attach. Furthermore, the complex technology of those systems makes them quite expensive. For these reasons there has been much research on markerless motion capture, which tries to recreate the motion based solely on the image data from one or more cameras without requiring the human to wear special equipment. However, extracting a 3-dimensional motion from a set of 2-dimensional videos is a complex task, which often prevents these methods from running in real time.

Recently, however, the programmability of graphics processors has reached a flexibility that enables them to be used for various tasks beyond the generation of images. For highly data-parallel computations, like those arising in markerless human motion capture, they often outperform CPUs by orders of magnitude. In this work we present such a system, which performs markerless motion capture on modern graphics hardware in real time.

2 RELATED WORK

The markerless capture of human motion has been studied extensively; a thorough overview is given in (Moeslund et al., 2006). The usual approach is to first extract the relevant information (the human) from the cameras’ images by a so-called background subtraction (Toyama et al., 1999). In the next step a 3-dimensional representation of the human needs to be computed from this information. Whereas there exist approaches for the extraction of surface-based polyhedral reconstructions (Matusik et al., 2001), the usual approach is the approximation by voxelization (Cheung et al., 2000)(Caillette and Howard, 2004)(Kehl and Gool, 2006)(Corazza et al., 2010). From this reconstruction the pose of the human can be estimated. This is often done using an underlying kinematic model of the human motion system (Luck et al., 2001)(Caillette and Howard, 2004)(Kehl and Gool, 2006)(Corazza et al., 2010). The estimation can be based on various cues, be it image space information (Kehl and Gool, 2006), cluster analysis (Cheung et al., 2000)(Caillette and Howard, 2004), or anatomical knowledge about the human body (Luck et al., 2001).

In (Caillette and Howard, 2004) a hierarchical voxel grid is used to accelerate the capturing and achieve real-time performance, but a hierarchical grid can introduce a severe discretization error depending on the coarseness of the base grid. The highly data-parallel nature of graphics hardware has already been used in (Hasenfratz et al., 2003) for accelerating the voxel reconstruction of a human actor from images. Hardware was not as flexible at that time, however, which required mapping the computational problems into the restricted domain of graphics processing.


Rau C. and Brunnett G.

GPU-accelerated Real-time Markerless Human Motion Capture.

DOI: 10.5220/0004282203970401

In Proceedings of the International Conference on Computer Graphics Theory and Applications and International Conference on Information

Visualization Theory and Applications (GRAPP-2013), pages 397-401

ISBN: 978-989-8565-46-4

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

3 3D SHAPE RECONSTRUCTION

In a first step the relevant information, in this case the captured human, has to be extracted from the individual camera images by background subtraction. This results in a binary silhouette image in which each pixel is marked as belonging either to the human or to the irrelevant background. In this work we employ the technique presented in (Cheung et al., 2000).
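To make the input and output of this step concrete, the following Python sketch computes a binary silhouette by thresholding the per-pixel difference to a static background image. This is only a simple stand-in and not the technique of (Cheung et al., 2000); the grayscale image layout, the function name `subtract_background` and the fixed `threshold` are assumptions made for illustration.

```python
def subtract_background(frame, background, threshold=30):
    """Return a binary silhouette image (1 = human, 0 = background).

    Assumption: grayscale images given as nested lists and a fixed
    threshold; a stand-in, not the method of Cheung et al. (2000).
    """
    return [
        [1 if abs(pixel - reference) > threshold else 0
         for pixel, reference in zip(frame_row, background_row)]
        for frame_row, background_row in zip(frame, background)
    ]


# Tiny 2x3 example: only the middle column differs from the background.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[12, 200, 11], [9, 180, 10]]
silhouette = subtract_background(frame, background)
```

Each output pixel depends only on the corresponding input pixels, which is the property that later makes this step easy to parallelize.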

Once the silhouette images for the current frame have been computed, they are used to reconstruct a 3-dimensional representation of the human. This is based on the theoretical concept of the visual hull, the largest volume whose projection into the cameras’ image planes exactly matches the corresponding silhouettes.

Since the visual hull can have an arbitrary shape, it is usually only approximated by discretizing the 3-dimensional space into a finite grid of voxels and finding the subset of voxels that best represents the actual visual hull. To this end, each voxel is projected into the image planes of the individual cameras and classified as part of the visual hull if its projection intersects the corresponding silhouette in each image. To check this intersection, the projected voxel area is sampled at a small number of pixels and the whole region is classified based simply on the ratio of silhouette sample pixels to non-silhouette sample pixels, as in (Cheung et al., 2000). We further simplify the computation of the projected voxel region by representing a voxel with a disk parallel to the image plane, which results in an easy-to-sample circular shape instead of the hexagonal projection of a cubic voxel.
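The per-voxel test described above can be sketched in Python as follows. The sketch projects the voxel center into each camera, samples a disk around the projection and compares the fraction of samples inside the silhouette against a threshold. The pinhole camera looking along +z, the dictionary keys `position`, `focal` and `center`, the sampling pattern and the threshold `min_ratio` are assumptions made for this illustration and are not taken from the described system.

```python
import math


def project(point, camera):
    """Project a 3D point with a simple pinhole camera looking along +z."""
    x, y, z = (p - c for p, c in zip(point, camera["position"]))
    if z <= 0.0:
        return None  # point lies behind the camera
    f = camera["focal"]  # focal length in pixels (assumed parameter)
    return f * x / z + camera["center"][0], f * y / z + camera["center"][1], z


def disk_samples(u, v, radius, count=8):
    """Sample the disk at its center and at `count` points on an inner circle."""
    samples = [(u, v)]
    for i in range(count):
        angle = 2.0 * math.pi * i / count
        samples.append((u + 0.5 * radius * math.cos(angle),
                        v + 0.5 * radius * math.sin(angle)))
    return samples


def voxel_is_foreground(center, size, cameras, silhouettes, min_ratio=0.5):
    """A voxel is kept only if it passes the silhouette test in every camera."""
    for camera, silhouette in zip(cameras, silhouettes):
        projected = project(center, camera)
        if projected is None:
            return False
        u, v, depth = projected
        # Radius of the image-plane-parallel disk that stands in for the voxel.
        radius = camera["focal"] * 0.5 * size / depth
        samples = disk_samples(u, v, radius)
        hits = 0
        for su, sv in samples:
            col, row = round(su), round(sv)
            in_image = 0 <= row < len(silhouette) and 0 <= col < len(silhouette[0])
            if in_image and silhouette[row][col]:
                hits += 1
        if hits / len(samples) < min_ratio:
            return False
    return True


# One camera at the origin and a 9x9 silhouette with a filled 5x5 center block.
camera = {"position": (0.0, 0.0, 0.0), "focal": 100.0, "center": (4.0, 4.0)}
silhouette = [[1 if 2 <= r <= 6 and 2 <= c <= 6 else 0 for c in range(9)]
              for r in range(9)]
kept = voxel_is_foreground((0.0, 0.0, 10.0), 0.4, [camera], [silhouette])
rejected = voxel_is_foreground((1.0, 0.0, 10.0), 0.4, [camera], [silhouette])
```

Samples that fall outside the image are counted as background here, which is one possible convention among several.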

Since we ultimately want to match the reconstructed object to a surface model of the human body (see Section 4), removing all internal voxels, identified by having all of their 6-connected neighbors belong to the foreground, is an obvious optimization that reduces the complexity of the following steps.
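A minimal Python sketch of this internal voxel removal is given below. The nested-list grid layout `grid[z][y][x]` and the treatment of positions outside the grid as background are assumptions of the sketch. Note that the test reads only the original grid and writes to a separate result, so that clearing one voxel cannot influence the test of its neighbors.

```python
def remove_internal_voxels(grid):
    """Return a copy of grid[z][y][x] in which interior voxels are cleared."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                 (0, -1, 0), (0, 0, 1), (0, 0, -1)]

    def filled(z, y, x):
        # Assumption: positions outside the grid count as background.
        return 0 <= z < nz and 0 <= y < ny and 0 <= x < nx and grid[z][y][x]

    surface = [[[False] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if grid[z][y][x]:
                    internal = all(filled(z + dz, y + dy, x + dx)
                                   for dz, dy, dx in neighbors)
                    surface[z][y][x] = not internal
    return surface


# A solid 3x3x3 cube: only the center voxel has six filled neighbors.
cube = [[[True] * 3 for _ in range(3)] for _ in range(3)]
shell = remove_internal_voxels(cube)
```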

4 POSE ESTIMATION

Based on the 3-dimensional reconstruction of the human, the current pose is to be estimated, represented by the joint angles of a kinematic skeleton. The use of an underlying kinematic skeleton as an abstraction of the human motion system guarantees a valid pose inside the constraints of the human body in each frame.

Unfortunately the visual hull does not carry any topological or semantic information; it need not even be connected, due to errors in the background subtraction or the voxel reconstruction. The usual approach is therefore to match the visual hull to an idealized geometric model of the human body (Cheung et al., 2000)(Caillette and Howard, 2004)(Kehl and Gool, 2006)(Corazza et al., 2010). Since the human body mainly consists of tubular parts, the simplest geometric model to represent its shape is a set of ellipsoids (Cheung et al., 2000)(Luck et al., 2001)(Caillette and Howard, 2004), with each skeleton segment assigned to a corresponding ellipsoid that describes the geometry of the surrounding body part.

So in a first step the ellipsoid model has to be brought into the current pose of the reconstructed human by adapting it to the current voxel reconstruction. For this classic cluster analysis problem an Expectation-Maximization algorithm is a viable approach (Cheung et al., 2000)(Caillette and Howard, 2004):

1. Each voxel is assigned to the ellipsoid with the shortest distance to it, resulting in the classification of the voxels into body parts (fig. 1 (a)).

2. The ellipsoids’ parameters are recomputed using a principal component analysis of the assigned voxels, resulting in the ellipsoids adapting to the voxel hull’s pose (fig. 1 (b)).
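The two steps above can be illustrated with the following simplified Python sketch. To stay short it uses axis-aligned ellipsoids, given by a center and three radii, and measures the distance of a voxel as its offset from the center scaled by the radii. The maximization step therefore only recomputes the mean and the per-axis variances; a full principal component analysis would additionally take the eigenvectors of the covariance matrix as the ellipsoid's orientation. The dictionary keys, the factor 2 between standard deviation and radius, and `min_radius` are assumptions of this sketch.

```python
import math


def scaled_distance(voxel, ellipsoid):
    """Distance to the ellipsoid center, scaled by the radii per axis."""
    return math.sqrt(sum(((voxel[i] - ellipsoid["center"][i])
                          / ellipsoid["radii"][i]) ** 2 for i in range(3)))


def assign_voxels(voxels, ellipsoids):
    """Expectation: label each voxel with the index of its closest ellipsoid."""
    return [min(range(len(ellipsoids)),
                key=lambda k: scaled_distance(voxel, ellipsoids[k]))
            for voxel in voxels]


def refit_ellipsoids(voxels, labels, ellipsoids, min_radius=0.5):
    """Maximization: recompute center and radii from the assigned voxels."""
    updated = []
    for k, old in enumerate(ellipsoids):
        members = [v for v, label in zip(voxels, labels) if label == k]
        if not members:
            updated.append(old)  # keep the previous estimate
            continue
        n = len(members)
        center = [sum(v[i] for v in members) / n for i in range(3)]
        variance = [sum((v[i] - center[i]) ** 2 for v in members) / n
                    for i in range(3)]
        # Assumed relation between spread and radius: two standard deviations.
        radii = [max(min_radius, 2.0 * math.sqrt(s)) for s in variance]
        updated.append({"center": center, "radii": radii})
    return updated


# Two small clusters and two roughly placed ellipsoids.
voxels = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (10, 0, 0), (11, 0, 0), (9, 0, 0)]
ellipsoids = [{"center": [1.0, 0.0, 0.0], "radii": [1.0, 1.0, 1.0]},
              {"center": [9.0, 0.0, 0.0], "radii": [1.0, 1.0, 1.0]}]
labels = assign_voxels(voxels, ellipsoids)
ellipsoids = refit_ellipsoids(voxels, labels, ellipsoids)
```

In a tracking setting the two functions would be called once or a few times per frame, starting from the ellipsoids of the previous frame.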

Once the pose of the geometry and the locations of the individual body parts are known, the corresponding kinematic pose can be extracted from them. The joint angles are computed using inverse kinematics (IK), with the ellipsoids’ center points defining the goals of their corresponding skeleton segments’ centers (fig. 1 (c)). The iterative nature of numeric IK methods profits from the previous frame’s skeleton pose already being a good initial value for the estimation of the current frame’s pose. Occlusions or errors in the background subtraction may result in incorrectly fitted ellipsoids, which should not be used to drive the IK. Therefore, whenever the number of assigned voxels of an ellipsoid does not reach half the average number of assigned voxels over the whole motion, this ellipsoid does not define a goal for its corresponding segment, similar to the measure used in (Caillette and Howard, 2004).
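The goal rejection rule from the preceding paragraph could be realized as in the following Python sketch. It reads the "average over the whole motion" as a running average of the voxel count of each ellipsoid over all frames seen so far, which is an assumption, as are the class name `GoalFilter` and the choice to accept all goals in the first frame.

```python
class GoalFilter:
    """Decide per frame which ellipsoids may define an IK goal."""

    def __init__(self, ellipsoid_count):
        self.totals = [0] * ellipsoid_count  # summed voxel counts so far
        self.frames = 0

    def valid_goals(self, counts):
        """counts[k] is the number of voxels assigned to ellipsoid k."""
        if self.frames == 0:
            valid = [True] * len(counts)  # no history yet (assumed behavior)
        else:
            valid = [count >= 0.5 * total / self.frames
                     for count, total in zip(counts, self.totals)]
        # Update the running statistics after the decision.
        self.totals = [total + count
                       for total, count in zip(self.totals, counts)]
        self.frames += 1
        return valid


goal_filter = GoalFilter(2)
first = goal_filter.valid_goals([100, 50])
second = goal_filter.valid_goals([40, 50])  # ellipsoid 0 lost most voxels
```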

In practice, performing a complete IK over the whole skeleton turns out to be not very robust due to the high mobility of the root joint. Therefore, the position and orientation of the root are precomputed explicitly based on a few assumptions:

• The line connecting both hip joints always lies in

a horizontal plane.

• The horizontal orientation of the root is equal to

the horizontal orientation of the upper torso.

• The height of the root above the ground does not

change signiﬁcantly over the whole motion.

GRAPP2013-InternationalConferenceonComputerGraphicsTheoryandApplications

398

Figure 1: Ellipsoid fitting and IK-based pose estimation: (a) Expectation: assign voxels to closest ellipsoids. (b) Maximization: recompute ellipsoid parameters based on assigned voxels. (c) Ellipsoid centers as IK targets for corresponding segments.

Based on those assumptions the root orientation can be derived from two directions: the line connecting the shoulder joints, as a measure for the horizontal orientation of the upper torso, and the up-axis of the torso, found by a principal component analysis of all the torso’s voxels. Since we do not know the shoulder joints’ positions yet, the values from the previous frame are used, relying on the frame-to-frame coherence. The position of the root joint can be derived from the directions of the upper legs, given by the corresponding ellipsoids, intersected with a horizontal plane at the hips’ height.
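The geometric part of this root estimation can be sketched in Python as follows. The sketch assumes a coordinate system with y as the vertical axis, takes the heading of the root as the angle of the shoulder line in the horizontal plane, and places the root at the midpoint of the two points where the upper leg axes meet the hip plane. The axis convention, the angle convention and the use of the midpoint are assumptions made for illustration; the up-axis from the torso PCA is omitted.

```python
import math


def root_yaw(left_shoulder, right_shoulder):
    """Heading (radians) about the vertical y axis from the shoulder line."""
    dx = right_shoulder[0] - left_shoulder[0]
    dz = right_shoulder[2] - left_shoulder[2]
    return math.atan2(dz, dx)


def intersect_horizontal_plane(point, direction, height):
    """Point where the line point + t * direction reaches y == height."""
    if abs(direction[1]) < 1e-9:
        raise ValueError("leg axis is parallel to the hip plane")
    t = (height - point[1]) / direction[1]
    return [point[i] + t * direction[i] for i in range(3)]


def root_position(left_leg, right_leg, hip_height):
    """Each leg is (ellipsoid center, main axis); returns the hip midpoint."""
    hips = [intersect_horizontal_plane(center, axis, hip_height)
            for center, axis in (left_leg, right_leg)]
    return [(a + b) / 2.0 for a, b in zip(*hips)]


# Upright legs 0.2 m apart, hips 0.9 m above the ground.
left_leg = ((-0.1, 0.5, 0.0), (0.0, 1.0, 0.0))
right_leg = ((0.1, 0.5, 0.0), (0.0, 1.0, 0.0))
position = root_position(left_leg, right_leg, 0.9)
yaw = root_yaw((-0.2, 1.4, 0.0), (0.2, 1.4, 0.0))
```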

During the whole pose estimation process we make heavy use of the frame-to-frame coherence by requiring the ellipsoid model’s and the skeleton’s pose to match the human’s pose of the previous frame. Of course this is not possible for the first frame. Therefore, the skeleton pose and ellipsoid parameters for the first frame have to be determined manually at the beginning. This can be simplified by requiring the human to take a certain reference pose. In addition to a proper initialization we also need to correct the ellipsoid model each frame by readapting it to the computed kinematic pose. Otherwise the ellipsoids would tend to degenerate over the course of the motion, especially in the presence of incorrectly assigned ellipsoids.

5 GPU IMPLEMENTATION

The highly data-parallel nature of the algorithms used makes them well suited for implementation on modern many-core architectures, in particular modern programmable graphics processors (GPUs). In this way they can be accelerated up to real-time performance even for very detailed voxel grids.

The background subtraction, which transforms the camera images into the binary silhouette images, is a standard image processing task that is done independently for each pixel and requires no synchronization between individual pixels, which makes it well suited for the GPU. One might think about incorporating the background subtraction directly into the voxels’ foreground test and thus performing it for each sample pixel of each voxel instead of each image pixel. In practice, however, this approach performs less efficiently, especially for larger voxel grids where the number of sample pixels is much larger than the number of image pixels: the increased complexity of the voxel foreground test then outweighs any possible gain from omitting the separate background subtraction step.

The voxels’ foreground test can also be invoked independently for each individual voxel. In this case we assign one thread to each voxel of the grid. This utilizes the GPU’s resources sufficiently, and since each thread performs a silhouette test for each individual sample pixel and each camera, its computational complexity is still high enough to hide memory latencies. The silhouette test of the sample pixels in turn profits from the GPU’s texturing hardware, which is optimized for 2-dimensional memory access. After the internal voxels have been marked as background in a following step, we finally have the boolean foreground flags of each individual voxel of the whole grid. In order for the next steps to concentrate on the relevant data only, the voxel grid needs to be compacted into a list of only the foreground voxels. This is a standard stream compaction problem that often arises as part of data-parallel algorithms. It can be solved efficiently by a parallel prefix sum that computes the foreground voxels’ list indices (Sengupta et al., 2008), followed by a scattering step that performs the actual compaction. Since the position of a voxel is uniquely encoded by its index inside the grid and the grid resolution in each dimension does not need to be larger than 256 in practice, each foreground voxel can be compactly represented by a single 32-bit value, leaving one byte for additional state.
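The compaction and the 32-bit voxel encoding can be illustrated with the following sequential Python sketch. On the GPU the prefix sum is computed in parallel and each loop iteration of the scatter step corresponds to one thread; the sketch only shows the data flow. The byte layout (x in the lowest byte, then y, z and the state byte) and the flat grid ordering with x varying fastest are assumptions.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] is the number of set flags before i."""
    out, total = [], 0
    for flag in flags:
        out.append(total)
        total += flag
    return out, total


def pack_voxel(x, y, z, state=0):
    """Pack three 8-bit grid coordinates and one state byte into 32 bits."""
    return x | (y << 8) | (z << 16) | (state << 24)


def unpack_voxel(word):
    """Inverse of pack_voxel: returns (x, y, z, state)."""
    return (word & 0xFF, (word >> 8) & 0xFF,
            (word >> 16) & 0xFF, (word >> 24) & 0xFF)


def compact(flags, resolution):
    """Turn a flat flag array (x fastest) into a list of packed voxels."""
    indices, count = exclusive_scan(flags)
    voxels = [0] * count
    for i, flag in enumerate(flags):  # one GPU thread per grid voxel
        if flag:
            x = i % resolution
            y = (i // resolution) % resolution
            z = i // (resolution * resolution)
            voxels[indices[i]] = pack_voxel(x, y, z)  # scatter
    return voxels


# A 2x2x2 grid with three foreground voxels.
flags = [0, 1, 0, 0, 1, 0, 0, 1]
voxel_list = compact(flags, 2)
coordinates = [unpack_voxel(word)[:3] for word in voxel_list]
```

Because every foreground voxel writes to its own precomputed index, the scatter step needs no synchronization between threads.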

The expectation step of the ellipsoid fitting can again be parallelized easily without the need for synchronization by assigning a single thread to each individual foreground voxel. Since this thread computes the distance of the voxel to each ellipsoid, it again has a high computational complexity compared to a small number of memory accesses, requiring only a single load and store from/to global off-chip memory, which can be coalesced between multiple threads for maximum memory throughput. The ID of the nearest ellipsoid can then be stored compactly in the vacant byte of the 32-bit voxel state.

Figure 2: Workflow of the motion capturing process for a single frame. Panels: (a) reconstruction, (b) expectation, (d) pose estimation.

In order for the ellipsoids’ means and covariance matrices needed for the maximization step to be accumulated efficiently, we first sort the voxels by the IDs of their nearest ellipsoids. This can be achieved efficiently using a parallel bucket sort (Satish et al., 2009), since the number of ellipsoids is very small (15 in our case). The actual accumulation of the means and covariances can then be implemented as a segmented reduction, a modification of the segmented prefix sum, which computes the data for each ellipsoid (segment). The only problem with this approach is the quite large amount of shared on-chip memory required by the individual means and covariance matrices of the voxels, which slightly reduces the number of resident threads and therefore the utilization of the GPU’s resources.
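The sort-then-reduce scheme of the maximization step is illustrated by the following sequential Python sketch. A counting sort stands in for the parallel bucket sort, and the per-segment sums of coordinates and coordinate products stand in for the segmented reduction; both sums are associative, which is what allows them to be reduced in parallel. Function names and the data layout are assumptions of the sketch.

```python
def bucket_sort_by_label(voxels, labels, bucket_count):
    """Order voxels by ellipsoid ID; returns the list and segment offsets."""
    counts = [0] * bucket_count
    for label in labels:
        counts[label] += 1
    starts, total = [], 0
    for count in counts:  # exclusive prefix sum over the bucket sizes
        starts.append(total)
        total += count
    cursor = list(starts)
    ordered = [None] * len(voxels)
    for voxel, label in zip(voxels, labels):
        ordered[cursor[label]] = voxel
        cursor[label] += 1
    return ordered, starts + [total]


def segment_statistics(ordered, offsets):
    """Mean and 3x3 covariance per segment (None for empty segments)."""
    stats = []
    for k in range(len(offsets) - 1):
        segment = ordered[offsets[k]:offsets[k + 1]]
        n = len(segment)
        if n == 0:
            stats.append(None)
            continue
        # Reduction of first and second order sums over the segment.
        s1 = [sum(v[i] for v in segment) for i in range(3)]
        s2 = [[sum(v[i] * v[j] for v in segment) for j in range(3)]
              for i in range(3)]
        mean = [s / n for s in s1]
        cov = [[s2[i][j] / n - mean[i] * mean[j] for j in range(3)]
               for i in range(3)]
        stats.append((mean, cov))
    return stats


voxels = [(0, 0, 0), (10, 0, 0), (2, 0, 0), (12, 0, 0)]
labels = [0, 1, 0, 1]
ordered, offsets = bucket_sort_by_label(voxels, labels, 3)
stats = segment_statistics(ordered, offsets)
```

The eigenvectors and eigenvalues of each covariance matrix would then give the orientation and extent of the corresponding ellipsoid.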

Since the complexity of the final IK-based pose estimation does not depend on the number of voxels but only on the very small number of kinematic joints, it would not profit from a GPU implementation and is therefore still done on the CPU. Since it only needs information about the ellipsoids computed in the previous steps, it is not necessary to retrieve the voxel data from the GPU. And since the voxel data is computed anew each frame from the silhouette images, the only large data that needs to be copied between CPU and GPU each frame are the cameras’ images. In practice these are usually needed on the GPU anyway for visualization.

6 RESULTS

The workflow of the system for a single frame is depicted in fig. 2. To evaluate the presented motion capturing system we tested it in an artificial scenario, generated by capturing the animation of a virtual human from 4 different viewpoints and emulating a standard camera setup with an image resolution of 640×480 pixels at 30 Hz per camera. The advantage of using artificial input data is that we know the exact motion from which it was generated and can therefore objectively evaluate the quality of the captured motion. The motion used in this case is a 25 second dancing motion which exhibits a large range of different sub-motions. To measure the error of the capturing, we simply take the distance between the joint positions in the captured pose and in the reference pose for each joint in each frame. This error can further be averaged over all joints and over all frames to gain an overall measure of the capturing quality, which lies between 3 and 4 cm for the tested scenario.

Table 1: Performance of the CPU- and GPU-based implementations for different voxel grid resolutions.

grid      32³   64³   128³   256³
CPU FPS   25    12    4      1
GPU FPS   124   121   106    49
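The average joint error used in this evaluation can be computed as in the following Python sketch. The data layout, a list of frames that each hold one position per joint in meters, and the function name `mean_joint_error` are assumptions; the numbers in the example are made up and are not measurements.

```python
import math


def mean_joint_error(captured, reference):
    """Average Euclidean joint distance over all joints and all frames."""
    total, count = 0.0, 0
    for captured_frame, reference_frame in zip(captured, reference):
        for captured_joint, reference_joint in zip(captured_frame,
                                                   reference_frame):
            total += math.dist(captured_joint, reference_joint)
            count += 1
    return total / count


# Two frames with two joints each (illustrative values only).
reference = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
captured = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
            [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]]
error = mean_joint_error(captured, reference)
```

Averaging only over the joints of a single frame instead gives the per-frame error that can be plotted over the course of the motion.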

To evaluate the influence of the voxel grid resolution on the capturing quality, fig. 3 shows the average joint errors for different voxel grids plotted over the course of the whole motion. It can be seen that an increase of the voxel grid resolution results both in a smaller overall error and in a much smoother error curve, eliminating high-frequency errors and thus resulting in a smoother motion. This is a natural consequence of the reconstruction quality’s direct dependence on the voxelization’s discretization error.

The performance is shown in tab. 1 for both a CPU-based solution tested on an Intel Core i7 with 3.4 GHz and the proposed GPU-based solution tested on an NVIDIA GeForce GTX 580 (implemented with OpenCL). It can be seen that the GPU implementation achieves real-time motion capture up to a maximum voxel grid resolution of 256³ voxels.

GRAPP2013-InternationalConferenceonComputerGraphicsTheoryandApplications

400

Figure 3: Average joint errors in meters over the whole motion for different voxel grid resolutions.

7 CONCLUSIONS AND FUTURE WORK

As we have seen, the markerless capture of human motion is well suited for implementation on GPUs. This allows the capturing to be realized in real time while providing a highly accurate voxel reconstruction, which is necessary for a smooth and accurate motion capture.

There is still much room for improvement of the presented system. First of all, we have not yet tested it with real input data, which requires additional research on the background subtraction method. This was a trivial problem for the tested artificial scenario, but in reality one has to cope with illumination changes through shadows and similar problems. Furthermore, the initialization of the model for the initial pose has to be done manually, leaving room for automating this process. Last but not least, the pose estimation makes some slightly restrictive assumptions about the captured motion and still has problems with very fast motions.

REFERENCES

Caillette, F. and Howard, T. (2004). Real-time markerless human body tracking with multi-view 3-d voxel reconstruction. In Proceedings of BMVC, pages 597–606.

Cheung, K.-M., Kanade, T., Bouguet, J.-Y., and Holler, M. (2000). A real time system for robust 3d voxel reconstruction of human motions. In Proceedings of the 2000 IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’00), pages 714–720.

Corazza, S., Mündermann, L., Gambaretto, E., Ferrigno, G., and Andriacchi, T. P. (2010). Markerless motion capture through visual hull, articulated icp and subject specific model generation. International Journal of Computer Vision, 87(1):156–169.

Hasenfratz, J.-M., Lapierre, M., Gascuel, J.-D., and Boyer, E. (2003). Real-time capture, reconstruction and insertion into virtual world of human actors. In Vision, Video and Graphics, pages 49–56. Elsevier.

Kehl, R. and Gool, L. V. (2006). Markerless tracking of complex human motions from multiple views. Computer Vision and Image Understanding, 104(2):190–209.

Luck, J., Small, D., and Little, C. (2001). Real-time tracking of articulated human models using a 3d shape-from-silhouette method. In Robot Vision, Lecture Notes in Computer Science, pages 19–26. Springer Verlag.

Matusik, W., Buehler, C., and McMillan, L. (2001). Polyhedral visual hulls for real-time rendering. In Proceedings of the 12th Eurographics Workshop on Rendering Techniques, pages 115–126. Springer Verlag.

Moeslund, T. B., Hilton, A., and Krüger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2):90–126.

Satish, N., Harris, M., and Garland, M. (2009). Designing efficient sorting algorithms for manycore gpus. In Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, pages 1–10. IEEE Computer Society.

Sengupta, S., Harris, M., and Garland, M. (2008). Efficient parallel scan algorithms for gpus. Technical report, NVIDIA.

Toyama, K., Krumm, J., Brumitt, B., and Meyers, B. (1999). Wallflower: Principles and practice of background maintenance. In Proceedings of the Seventh IEEE International Conference on Computer Vision, pages 255–261.

GPU-acceleratedReal-timeMarkerlessHumanMotionCapture

401