RGB-D Tracking and Reconstruction for TV Broadcasts

Tommi Tykkälä¹, Hannu Hartikainen², Andrew I. Comport³ and Joni-Kristian Kämäräinen⁴

¹Machine Vision and Pattern Recognition Laboratory, Lappeenranta University of Technology (LUT Kouvola), Lappeenranta, Finland
²Department of Media Technology, Aalto University, Aalto, Finland
³CNRS-I3S/University of Nice Sophia-Antipolis, Nice, France
⁴Department of Signal Processing, Tampere University of Technology, Tampere, Finland
Keywords:
Dense Tracking, Dense 3D Reconstruction, Real-time Tracking, RGB-D, Kinect.
Abstract:
In this work, a real-time image-based camera tracking solution is developed for television broadcasting studio
environments. An affordable vision-based system is proposed which can compete with expensive matchmov-
ing systems. The system requires merely commodity hardware: a low cost RGB-D sensor and a standard
laptop. The main contribution is avoiding time-evolving drift by tracking relative to a pre-recorded keyframe
model. Camera tracking is defined as a registration problem between the current RGB-D measurement and
the nearest keyframe. The keyframe poses contain only a small error and therefore the proposed method is
virtually driftless. Camera tracking precision is compared to KinectFusion, which is a recent method for si-
multaneous camera tracking and 3D reconstruction. The proposed method is tested in a television broadcasting
studio, where it demonstrates driftless and precise camera tracking in real-time.
1 INTRODUCTION
Rendering virtual elements, props and characters to
live television broadcast combines augmented reality
(AR) and video production. A robust camera pose estimation method is required for rendering the graphics from the camera's viewing direction. In the film industry the process is known as matchmoving and it is traditionally done in post-processing (Dobbert, 2005). Matchmoving typically requires manual effort because the available tools are not fully automatic. In online broadcasting, special hardware-based solutions exist for automatic camera tracking, but their value is degraded by limited operating volume, a weaker sense of realism and a high price.
The goal of this work is to develop an affordable,
portable and easy-to-use solution for television pro-
duction studios. Marker-based AR techniques, such
as ARToolKit (Kato and Billinghurst, 1999), are triv-
ial to use but visible markers are irritating in the studio
scene. Thus, we seek a solution among more recent techniques which are able to track the camera without any markers. Visual simultaneous localisation and mapping (vSLAM) techniques are a viable option, because they track the camera using a 3D model which is concurrently built from visual measurements (Davison et al., 2007). Recently, low-cost RGB-D sensors have developed to a level where they can provide a real-time stream of dense RGB-D measurements which
are directly usable for camera tracking and scene re-
construction purposes.
In this work, we introduce an image registra-
tion based dense tracking and reconstruction method
particularly for TV studios. Tracking precision is
good, but over long sequences, time-evolving drift
will eventually displace virtual props. We avoid drift
by generating an RGB-D keyframe model and track-
ing the camera relative to the nearest keyframe. The
results are demonstrated with real studio shots. We
thank Heikki Ortamo and Jori Pölkki for their professional support in a TV production studio.
1.1 Related Work
Traditionally vSLAM methods detect and track a
sparse set of feature points which are matched in sev-
eral images. The first feature-based visual SLAM
methods used the extended Kalman filter (EKF) to up-
date pose and structure (Davison et al., 2007). How-
ever, bundle adjustment has replaced EKF, because
it is more accurate. PTAM (Klein and Murray,
2007) separated tracking and mapping into two par-
allel modules, where the mapping part was essen-
tially bundle adjustment. To avoid feature extraction
and matching problems, the raw pixel measurements
can be used directly. DTAM was introduced as the
dense version of PTAM, which allows every pixel
to contribute to pose and structure estimation (New-
combe et al., 2011a). Finally, with the KinectFusion system, which replaces the monocular camera with an RGB-D sensor (Newcombe et al., 2011b), pose tracking and structure estimation have become mature enough to be considered for live TV broadcasts.
In this work, we adopt the dense RGB-D ap-
proach (Tykkala et al., 2011; Audras et al., 2011;
Comport et al., 2007). Our method differs rather strongly from KinectFusion, because we estimate the camera pose using dense RGB-D measurements instead of depth maps only. This is necessary because studio settings often contain planar surfaces for which KinectFusion fails to track the camera. We do not estimate the 3D structure concurrently because it can be solved prior to broadcasting. This simplification lightens computation and enables running the system on low-end hardware. Robustness to outlier points is increased by selecting photometrically stable pixels and by omitting unreliable regions (occlusions and moving objects) with an M-estimator.
2 DENSE TRACKING METHOD
2.1 Cost Function
Our dense pose estimation is defined as a direct color image registration task between the current image $\mathcal{I}: \mathbb{R}^2 \rightarrow \mathbb{R}$ and a keyframe $\mathcal{K}^* = \{\mathcal{P}^*, c^*\}$, where $\mathcal{P}^* = \{P_1^*, P_2^*, \ldots, P_n^*\}$ is a set of 3D points and $c^* = \{c_1^*, c_2^*, \ldots, c_n^*\}$ are the corresponding color intensities. Our goal is to find the correct pose increment $T(\omega,\upsilon)$ which minimizes the following residual:

$e = \mathcal{I}\big(w(\mathcal{P}^*; T(\omega,\upsilon))\big) - c^*$,   (1)

where $w(\mathcal{P}; T)$ is a projective warping function which transforms and projects $\mathcal{P}$ into a new view using the $4 \times 4$ transformation matrix $T$ and the intrinsic matrix $K$ (constant, omitted in the notation). The exact formula is

$w(\mathcal{P}; T, K) = w\!\left(\mathcal{P}; \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}, K\right) = N(K(R\mathcal{P} + t))$,   (2)

where $N(p) = (p_1/p_3,\ p_2/p_3)$ dehomogenizes a point. $T(\omega,\upsilon)$ is defined as the exponential mapping which forms a Lie group $T(\omega,\upsilon) \in SE(3)$ (Ma et al., 2004):

$T(\omega,\upsilon) = e^{A(\omega,\upsilon)}, \quad A(\omega,\upsilon) = \begin{bmatrix} [\omega]_\times & \upsilon \\ 0 & 0 \end{bmatrix}$,   (3)

where $\omega$ and $\upsilon$ are 3-vectors defining rotation and translation. From a practical point of view it is convenient to re-define (3) as

$T(\omega,\upsilon) = \hat{T} e^{A(\omega,\upsilon)}$,   (4)

which generates smooth increments around the base transform $\hat{T}$. This allows using iterative optimization and, in particular, the inverse compositional trick for estimating the transformation efficiently (Baker and Matthews, 2004).
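To make the notation of Eqs. (1)-(2) concrete, the following NumPy sketch evaluates the warp and the photometric residual for a keyframe point cloud. The helper names (warp, bilinear, residual) and the bilinear sampling of the current image are our own illustrative choices, not code from the described system.

import numpy as np

def warp(P, T, K):
    # Eq. (2): transform 3D points P (N,3) by T (4x4) and project through K (3x3).
    Pc = (T[:3, :3] @ P.T).T + T[:3, 3]        # R*P + t
    p = (K @ Pc.T).T                           # homogeneous pixel coordinates
    return p[:, :2] / p[:, 2:3]                # N(p): dehomogenization

def bilinear(I, uv):
    # Sample a grayscale image I at sub-pixel positions uv (N,2) bilinearly.
    u, v = uv[:, 0], uv[:, 1]
    u0 = np.clip(np.floor(u).astype(int), 0, I.shape[1] - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, I.shape[0] - 2)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * I[v0, u0] + du * (1 - dv) * I[v0, u0 + 1]
            + (1 - du) * dv * I[v0 + 1, u0] + du * dv * I[v0 + 1, u0 + 1])

def residual(I, P_star, c_star, T, K):
    # Eq. (1): photometric residual e = I(w(P*; T)) - c*.
    return bilinear(I, warp(P_star, T, K)) - c_star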
2.2 Minimization
We adopt the inverse compositional approach for ef-
ficient minimization of the cost function (Baker and
Matthews, 2004). The cost is reformulated as
$c^*(\omega,\upsilon) = \mathcal{I}^*\big(w(\mathcal{P}^*; e^{A(\omega,\upsilon)})\big), \qquad e = c^*(\omega,\upsilon) - \mathcal{I}\big(w(\mathcal{P}^*; \hat{T})\big)$,

where the reference colors $c^*$ are now a function of the inverse motion increment. The benefit is gained when computing the Jacobian

$J_{ij} = \nabla\mathcal{I}^*\big(w(P_i^*; I)\big)\, \dfrac{\partial w(P_i^*; I)}{\partial x_j}$,   (5)

where $\nabla\mathcal{I}(p) = \big[\tfrac{\partial \mathcal{I}(p)}{\partial x}\ \ \tfrac{\partial \mathcal{I}(p)}{\partial y}\big]$. By this trick $J$ no longer depends on the current transformation and it is sufficient to compute it only once for each $\mathcal{K}^*$.
Gauss-Newton iteration is well-suited for minimization when the initial guess is near the minimum and the cost function is smooth. The cost function is locally approximated by the first-order Taylor expansion $e(x) = e(0) + Jx$, where $e(0)$ is the current residual. Now the scalar error function becomes

$\tfrac{1}{2} e(x)^T e(x) = \tfrac{1}{2} e(0)^T e(0) + x^T J^T e(0) + \tfrac{1}{2} x^T J^T J x$,   (6)

where the derivative is zero when

$J^T J x = -J^T e(0)$.   (7)

Because $J^T J$ is a $6 \times 6$ positive definite matrix, the inversion can be done efficiently using the Cholesky method. The associativity of matrix multiplication in $SE(3)$ enables collecting the previous increments into the base transform of (4) by $\hat{T}_k \cdots \hat{T}_1 e^{A(\omega,\upsilon)} \equiv \hat{T} e^{A(\omega,\upsilon)}$.
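A minimal sketch of one iteration of this minimization: the normal equations (7) are solved with a Cholesky factorization and the increment is composed into the base transform as in Eq. (4). The se3_exp helper implements Eq. (3) via a generic matrix exponential; the exact composition convention of the inverse compositional update is an assumption and is noted in the code.

import numpy as np
from scipy.linalg import cho_factor, cho_solve, expm

def se3_exp(x):
    # Eq. (3): exponential map of x = (omega, upsilon) onto SE(3).
    omega, ups = x[:3], x[3:]
    A = np.zeros((4, 4))
    A[:3, :3] = np.array([[0.0, -omega[2], omega[1]],
                          [omega[2], 0.0, -omega[0]],
                          [-omega[1], omega[0], 0.0]])   # [omega]_x
    A[:3, 3] = ups
    return expm(A)

def gauss_newton_step(J, e0, T_base):
    # Solve J^T J x = -J^T e(0) (Eq. 7) via Cholesky and compose as in Eq. (4).
    # Note: in an inverse compositional formulation the increment is typically
    # inverted before composition; the sign convention here is an assumption.
    x = cho_solve(cho_factor(J.T @ J), -J.T @ e0)
    return T_base @ se3_exp(x)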
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
248
2.3 Keyframe-based Reference K*
Minimisation of the cost function (1) with the Gauss-Newton method of Sec. 2.2 provides real-time dense RGB-D camera tracking (and scene reconstruction), which finds, frame by frame, the optimal transformation $T(\omega,\upsilon)$ between a previous frame $\mathcal{K}^*$ and the current frame (image $\mathcal{I}$). This approach has one disadvantage: the small per-frame estimation error accumulates into global drift which gradually displaces virtual elements. Fortunately, the studio scene can be pre-recorded into a set of keyframes prior to broadcasting (see Fig. 1). Tracking is then defined relative to the nearest keyframe, $\mathcal{K}^* \leftarrow \mathrm{select\_closest}(\{\mathcal{K}_j\})$. For small studios, a single dense tracking sweep is sufficient, but methods also exist for building larger models (Henry et al., 2012).
Figure 1: Stored scene keyframes and their pixels mapped
to a common 3D world model.
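The keyframe lookup select_closest is not specified in detail in the text; one plausible implementation (our assumption) picks the keyframe whose stored pose is nearest to the current pose estimate under a combined translation/rotation distance:

import numpy as np

def pose_distance(T_a, T_b, rot_weight=0.5):
    # Distance between two 4x4 camera poses: translation norm + weighted rotation angle.
    dt = np.linalg.norm(T_a[:3, 3] - T_b[:3, 3])
    R = T_a[:3, :3].T @ T_b[:3, :3]
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    return dt + rot_weight * angle

def select_closest(keyframes, T_current):
    # Return the keyframe (dict with a "pose" entry) nearest to the current pose estimate.
    return min(keyframes, key=lambda kf: pose_distance(kf["pose"], T_current))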
2.4 Robust Estimation
Since our reference keyframes contain only static geometry with Lambertian reflectance, it is possible to increase the robustness of the estimation by emphasising stronger color and depth correlation between the static reference points and the current 3D points matched in each iteration. We do this by employing M-estimation, in which uncertainty-based weights $w_k \in [0,1] \subset \mathbb{R}$ are given to the residual elements $e_k$. Color correlation is enforced by the Tukey weighting function

$u_k = \dfrac{|e_k|}{c \cdot \mathrm{median}(e_a)}$,   (8)

$w_k^c = \begin{cases} \big(1 - (u_k/b)^2\big)^2 & \text{if } |u_k| \le b \\ 0 & \text{if } |u_k| > b \end{cases}$,   (9)

where $e_a = \{\|e_1\|, \ldots, \|e_n\|\}$, $c = 1.4826$ for a robust standard deviation, and $b = 4.6851$ is the Tukey-specific constant.
The depth correlation weights are computed simultaneously by a depth lookup

$e_z = \mathcal{Z}\big(w(\mathcal{P}; T(\omega,\upsilon))\big) - e_3^T\, T(\omega,\upsilon) \begin{bmatrix} \mathcal{P} \\ 1 \end{bmatrix}$,   (10)

where $\mathcal{Z}: \mathbb{R}^2 \rightarrow \mathbb{R}$ is the depth map function of the current RGB view and $e_3^T = (0, 0, 1, 0)$ selects the depth coordinate. When the standard deviation of the depth measurements is $\tau$, the warped points whose depth differs by more than $\tau$ from the current depth map value can be interpreted as foreground actors/outliers. Thus we define the weights

$w_k^z = \max\big(1 - e_z^2(k)/\tau^2,\ 0\big)^2$.   (11)
The weighted Gauss-Newton step is obtained by re-writing (7) as

$J^T W J x = -J^T W e$,   (12)

where $W$ is a diagonal matrix with $\mathrm{diag}(W)_k = w_k^c\, w_k^z$. The robust step is obtained from $\sqrt{W} J$ and $\sqrt{W} e$. Both weight components are quadratic and therefore the square roots are not evaluated in practice.
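The weighting of Eqs. (8)-(12) can be sketched as follows. The constants b and c follow the text; the nearest-neighbour lookup into the current depth map and the broadcasting used to avoid forming W explicitly are implementation choices of this illustration.

import numpy as np

def tukey_weights(e, b=4.6851, c=1.4826):
    # Eqs. (8)-(9): Tukey bi-weights from the color residual e.
    u = np.abs(e) / (c * np.median(np.abs(e)) + 1e-12)
    w = (1.0 - (u / b) ** 2) ** 2
    w[np.abs(u) > b] = 0.0
    return w

def depth_weights(Z, uv, P_warped, tau):
    # Eqs. (10)-(11): down-weight points whose warped depth disagrees with the depth map.
    # P_warped holds the points already transformed by T (camera coordinates), so its
    # z-column is e_3^T T [P; 1]; uv are the projected pixel coordinates w(P; T).
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, Z.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, Z.shape[0] - 1)
    e_z = Z[v, u] - P_warped[:, 2]
    return np.maximum(1.0 - e_z ** 2 / tau ** 2, 0.0) ** 2

def weighted_normal_equations(J, e, w_c, w_z):
    # Eq. (12): build J^T W J and -J^T W e with diag(W)_k = w_c_k * w_z_k.
    w = w_c * w_z
    JtW = J.T * w                      # broadcasting avoids forming W explicitly
    return JtW @ J, -JtW @ e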
2.5 Selecting Keyframe Points
The reference points $\mathcal{P}^*$ can be freely selected from the keyframe. When considering the minimization of the photometric error, only 3D points with non-zero image gradients constrain the pose parameters. Therefore the majority of the points can be neglected based on the magnitude of the image gradient. The gradient vector $\nabla c$ is evaluated at the projection of $P^*$ using bilinear interpolation. Instead of sorting $|\nabla c|$, we generate a histogram of the magnitudes. The value range is bounded to $[0, 255]$. We seek the bin $B_t$ for which $\sum_{k=B_t+1}^{255} h(k) < n \le \sum_{k=B_t}^{255} h(k)$, where $n$ is the number of points to be selected.
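A sketch of this histogram-based selection: gradient magnitudes are binned over [0, 255] and the threshold bin B_t is found from the cumulative counts so that roughly n of the strongest-gradient points are kept.

import numpy as np

def select_strong_gradient_points(grad_mag, n):
    # Keep roughly the n points with the largest gradient magnitude (range [0, 255]).
    mags = np.clip(grad_mag, 0, 255)
    hist, _ = np.histogram(mags, bins=256, range=(0, 256))
    cum_from_top = np.cumsum(hist[::-1])[::-1]      # cum_from_top[k] = #points with |grad| >= k
    candidates = np.nonzero(cum_from_top >= n)[0]   # bins whose tail count reaches n
    B_t = candidates[-1] if candidates.size else 0  # smallest tail that still contains n points
    return np.nonzero(mags >= B_t)[0]               # indices of the selected points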
3 IMPLEMENTATION DETAILS
Our method was implemented in an Ubuntu Linux environment using open software tools, a Kinect RGB-D sensor and commodity PC laptop hardware, on which the method runs in real-time. The main limitations are the operating range (approximately 1 m to 5 m) and the reliance on controlled IR lighting, which may not work outdoors. Below, we discuss Kinect-related implementation issues.
RGB-DTrackingandReconstructionforTVBroadcasts
249
3.1 Kinect Calibration
Since the Microsoft Kinect can be set into a special mode where the raw IR images can be stored, it is possible to use a standard stereo calibration procedure for obtaining the calibration parameters of the IR and RGB views (Bouguet, 2010) (Fig. 2). The IR view and the depth view are trivially associated by an image offset of (4, 3) pixels.
A single-camera calibration is used to initialize the IR and RGB camera parameters. The calibration is then followed by a stereo procedure which re-estimates all free parameters ($K_{IR}$, $K_{RGB}$, $kc_{RGB}$, $T_b$). The lens distortion parameters of the IR camera are forced to zero, because the data has already been used to generate the raw disparity map. This means that the IR lens distortion is compensated by tweaking the other parameters. The RGB camera lens distortion parameters $kc_{RGB}$ are estimated without any special concerns. In practice, the distortion seems to be minor and the first two radial coefficients are sufficient: $kc_{RGB} = (0.2370, 0.4508, 0, 0, 0)$, and the stereo baseline is $b = 25.005$ mm. $T_b$ stores the baseline transform as a $4 \times 4$ matrix. The conversion from raw disparities into depth values can be done by

$z = \dfrac{8 p f}{B - d}$,

where $p$ is the baseline between the projector and the IR camera, $B$ is a device-specific constant and $f$ is the IR camera focal length in pixel units. $p$ and $B$ are estimated by solving the linear equation

$\begin{bmatrix} -\mathbf{1} & Z \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} = D$,

where $A = 8 p f$, $Z$ is an $n \times 1$ matrix of reference depth values $z_k$ from the chessboard pattern (Caltech calibration), and $D$ is an $n \times 1$ matrix whose elements are $d_k z_k$. The estimated parameters are $p \approx 75$ mm and $B \approx 1090$. Note that this reconstruction method is merely an approximation which precludes measurements at long ranges. There are also dedicated calibration toolboxes for RGB-D sensors, which model the disparity distortion accurately (Herrera et al., 2012).
Figure 2: RGB and IR images of the calibration pattern.
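The disparity-to-depth model and the least-squares fit of p and B can be written compactly as below; the use of a generic least-squares solver is our choice, and d, z and f denote the raw disparities, the reference chessboard depths and the IR focal length as in the text.

import numpy as np

def fit_disparity_model(d, z, f):
    # Fit A = 8*p*f and the offset B from reference depths z and raw disparities d,
    # solving [-1  Z][A  B]^T = D in the least-squares sense (D_k = d_k * z_k).
    M = np.column_stack([-np.ones_like(z), z])
    D = d * z
    A, B = np.linalg.lstsq(M, D, rcond=None)[0]
    return A / (8.0 * f), B                       # baseline p and offset B

def disparity_to_depth(d, p, B, f):
    # z = 8*p*f / (B - d): map raw Kinect disparities to metric depth.
    return 8.0 * p * f / (B - d)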
3.2 Dense Tracking with Kinect
The reference point clouds $\{\mathcal{P}^*, c^*\}$ in (1) were generated from the Kinect RGB image and disparity map in the following way. The Kinect Bayer images were converted into RGB format, downsampled to 320×240 and undistorted. Downsampling is almost lossless due to the sparsity of RGB values in the Bayer pattern. The raw disparity map was first converted into a depth map, downsampled to 320×240 using a max filter and then transformed into a point cloud $\mathcal{P}_{IR}$. Maximum filtering is chosen because it does not produce artificial geometry. $\mathcal{P}^*$ was then generated as $T_b \mathcal{P}_{IR}$, where $T_b$ is the baseline transformation between the IR and RGB cameras. Points $p \in \mathcal{P}^*$ do not project exactly to the pixel centers of the RGB image grid, and thus bilinear interpolation is used for generating the corresponding intensities $c^*$.
The cost function is minimized in a coarse-to-fine manner using an image pyramid with 80×60, 160×120 and 320×240 layers for each RGB-D input frame.
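The coarse-to-fine scheme can be organized as in the following sketch. The per-level iteration counts, the simple 2x2-average pyramid and the intrinsics scaling are assumptions for illustration; step_fn stands for one Gauss-Newton iteration such as the Sec. 2.2 sketch.

import numpy as np

def downsample(I):
    # Average 2x2 blocks; a simple stand-in for proper pyramid filtering.
    h, w = (I.shape[0] // 2) * 2, (I.shape[1] // 2) * 2
    return 0.25 * (I[0:h:2, 0:w:2] + I[1:h:2, 0:w:2]
                   + I[0:h:2, 1:w:2] + I[1:h:2, 1:w:2])

def coarse_to_fine_track(I, K, T_init, step_fn, iters=(10, 5, 3)):
    # Run step_fn(T, image, K_level) on 80x60, 160x120 and 320x240 layers, coarsest
    # first. iters[level] gives the iteration count per level (level 0 = finest);
    # the counts are an assumption, not the authors' settings.
    images, intrinsics = [I], [np.asarray(K, dtype=float).copy()]
    for _ in range(2):
        images.append(downsample(images[-1]))
        Ks = intrinsics[-1].copy()
        Ks[:2, :] *= 0.5              # halving the resolution roughly halves fx, fy, cx, cy
        intrinsics.append(Ks)
    T = np.asarray(T_init, dtype=float).copy()
    for level in (2, 1, 0):           # coarse (80x60) to fine (320x240)
        for _ in range(iters[level]):
            T = step_fn(T, images[level], intrinsics[level])
    return T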
4 EXAMPLES
4.1 Dense Tracking vs Kinfu
Kinfu is the open source implementation of Kinect-
Fusion (Rusu and Cousins, 2011). We compare our
dense tracking accuracy with Kinfu using the RGB-
D SLAM benchmark provided by the Technical University of Munich (Sturm et al., 2011) (Figure 3). Dense
tracking is executed incrementally by using a recent
view as the reference. Thus, both methods aim at
tracking the camera pose without a prior model and
small drift will be present. The major difference be-
tween the systems is that Kinfu optimizes a voxel
based 3D structure online and uses the iterative clos-
est point (ICP) approach for pose estimation. Kinfu
has small drift when the voxel size is small and geom-
etry is versatile. Kinfu fails in bigger operating vol-
umes, because the voxel discretization becomes coarse and, for these sequences especially, the volume also contains a planar floor which breaks ICP down (video: http://youtu.be/tNz1p1sdTrE). In broadcasting studios, scenes are often larger than $(3\,\mathrm{m})^3$ and geometrical variation cannot be guaranteed. Our dense tracking demonstrates robust track-
teed. Our dense tracking demonstrates robust track-
ing even when planar surfaces are present, because
the cost function matches also scene texturing. Mem-
ory consumption is low even in larger operating vol-
umes, because RGB-D keyframes can be memory-
optimized based on the viewing zone.
Table 1 shows the comparison between our
method and Kinfu numerically. The dense tracking drifts 1.08 cm/s with the slower freiburg2/desk sequence and 2.60 cm/s with the faster freiburg1/desk sequence. Our dense tracking has smaller error when the camera is moving faster.
Table 1: Drift is evaluated using standard sequences with known ground-truth trajectories. The proposed approach has smaller drift when the camera is moving faster, and the keyframe model can be built memory-efficiently. Kinfu has smaller drift when the camera is moving slower, but bigger scenes are not possible due to the poor scalability of the voxel grid, and problems exist with planar surfaces. The computational requirements of Kinfu are higher even though it is executed on powerful desktop hardware. Kinfu(3) and Kinfu(8) denote $(3\,\mathrm{m})^3$ and $(8\,\mathrm{m})^3$ voxel volumes.
Dataset           Our drift    Kinfu(3)     Kinfu(8)     Camera speed
freiburg1/desk    2.60 cm/s    8.40 cm/s    3.97 cm/s    41.3 cm/s
  time/frame      52.2 ms      135 ms       135 ms
freiburg2/desk    1.08 cm/s    0.64 cm/s    1.30 cm/s    19.3 cm/s
  time/frame      35.5 ms      135 ms       135 ms
Kinfu has lower drift when using a $(3\,\mathrm{m})^3$ voxel grid, but fails to operate in the bigger $(8\,\mathrm{m})^3$ volumes that match
broadcasting studios. Also, the computational requirements of Kinfu are significantly higher compared to our approach, even though Kinfu is executed on powerful desktop hardware. Drift was measured by dividing the input frames into subsegments of several seconds (10 s and 2 s, respectively) whose median error was measured against the ground truth; a 1-second average error was then computed from the median subsegment. The error values are computed over larger windows to average out random perturbations and to neglect gross tracking failures, which occur with Kinfu in all cases except on freiburg2/desk using the $(3\,\mathrm{m})^3$ volume.
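The drift evaluation described above can be sketched as follows; est and gt are lists of 4x4 poses at common timestamps, and measuring the per-segment translation error of the relative motion against the ground truth is our reading of the protocol, not the authors' exact evaluation code.

import numpy as np

def segment_drift(est, gt, times, seg_len_s=10.0):
    # Median drift rate (m/s) over fixed-length subsegments of a trajectory.
    times = np.asarray(times)
    rates, start = [], 0
    while times[start] + seg_len_s <= times[-1]:
        end = int(np.searchsorted(times, times[start] + seg_len_s))
        rel_est = np.linalg.inv(est[start]) @ est[end]   # estimated motion over the segment
        rel_gt = np.linalg.inv(gt[start]) @ gt[end]      # ground-truth motion over the segment
        err = np.linalg.inv(rel_gt) @ rel_est
        rates.append(np.linalg.norm(err[:3, 3]) / (times[end] - times[start]))
        start = end
    return float(np.median(rates)) if rates else float("nan")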
4.2 Driftless Keyframe Tracking
The relative poses between the keyframes could, in theory, be obtained by bundle adjustment techniques (Triggs et al., 2000) if feature point extraction and matching succeeded and a good initial guess existed. With the studio sequences recorded, texturing was so limited that popular tools, such as Bundler (Snavely et al., 2006), failed without manually annotated feature points. As an alternative solution, we utilised the proposed dense RGB-D tracker to incrementally build a keyframe model. It is noteworthy that a single quick sweep of the scene produces an accurate keyframe model (video: http://youtu.be/wALQB3eDbUg). Figure 4 illustrates how dense tracking is used to sweep a keyframe model of the studio scene rapidly. With pre-recorded keyframes, the online broadcasts are guaranteed to operate with the correct 3D geometry. The keyframes $\mathcal{K}_j$ were selected by picking RGB-D frames evenly from the recorded sequence. A keyframe database is illustrated in Figure 1.
In Figure 5, we show how the drift increases in a studio environment when using dense tracking. On the right-hand side of the figure, the drift problem is solved by tracking relative to keyframes (video: http://youtu.be/zfKdZSkG4LU).
Figure 3: On the left, Kinfu results for the freiburg2/desk sequence using 3×3×3, 5×5×5 and 8×8×8 meter voxel volumes. Kinfu attains lower drift due to structural optimization, but planar surfaces cause tracking failures. The 3×3×3 volume does not contain the floor and therefore Kinfu works well; the limited operating volume is, however, a problem for practical use in the studio. On the right, our dense tracking drift is illustrated when fixed keyframes are not used. Problems with planar surfaces or a limited operating volume do not exist. Green dots show the points selected for the given RGB-D measurement.
Figure 4: Camera trajectory solved from Kinect input by
minimizing the cost function (1). A small shark is rendered
into the studio scene.
Note that in long-term use, even a small number of keyframes eventually outperforms dense tracking due to drift. However, a small number of keyframes produces drift jumps between the keyframes, which can be visually disturbing. With a sufficient number of keyframes, the error remains small.

Figure 5: Camera moving back and forth along a fixed 3.30 m studio rail. On the left, three images taken from the beginning, middle and end of the rail; green regions illustrate the selected points. On the right, a comparison between dense tracking and keyframe tracking: in dense tracking the drift increases over time, but keyframe tracking maintains a small bounded error. A person is moving in the scene during the last four cycles.
5 CONCLUSIONS
In this work, camera pose tracking was defined as a
photometric registration problem between a refer-
ence frame and the current frame. To remove the
global drift in incremental tracking, the closest pre-
recorded keyframe was chosen to be the motion refer-
ence. The system was designed to be an affordable so-
lution for TV broadcasting studios relying only on the
Kinect sensor and a commodity laptop. The proposed
approach performs robustly in a standard benchmark,
where KinectFusion has problems with planar sur-
faces and limited voxel grid resolution. Our future work will address the practical issues of how studio staff and camera operators can use our computer vision system in live broadcasts. Moreover, a combination of the best properties of our approach and KinectFusion will be investigated.
REFERENCES
Audras, C., Comport, A. I., Meilland, M., and Rives, P.
(2011). Real-time dense rgb-d localisation and map-
ping. In Australian Conference on Robotics and Au-
tomation. Monash University, Australia, 2011.
Baker, S. and Matthews, I. (2004). Lucas-kanade 20 years
on: A unifying framework. Int. J. Comput. Vision,
56(3):221–255.
Bouguet, J.-Y. (2010). Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc.
Comport, A., Malis, E., and Rives, P. (2007). Accu-
rate quadri-focal tracking for robust 3d visual odome-
try. In IEEE Int. Conf. on Robotics and Automation,
ICRA’07, Rome, Italy.
Davison, A., Reid, I., Molton, N., and Stasse, O. (2007).
MonoSLAM: Real-time single camera SLAM. PAMI,
29:1052–1067.
Dobbert, T. (2005). Matchmoving: The Invisible Art of
Camera Tracking. Sybex.
Henry, P., Krainin, M., Herbst, E., Ren, X., and Fox, D.
(2012). RGB-D mapping: Using Kinect-style depth
cameras for dense 3D modeling of indoor environ-
ments. The International Journal of Robotics Re-
search, 31(5):647–663.
Herrera, C., Kannala, J., and Heikkila, J. (2012). Joint depth
and color camera calibration with distortion correc-
tion. IEEE PAMI, 34(10).
Kato, H. and Billinghurst, M. (1999). Marker tracking and
hmd calibration for a video-based augmented reality
conferencing system. In Proceedings of the 2nd Inter-
national Workshop on Augmented Reality (IWAR 99),
San Francisco, USA.
Klein, G. and Murray, D. (2007). Parallel tracking and mapping for small AR workspaces. In Proceedings of the International Symposium on Mixed and Augmented Reality (ISMAR), pages 225–234.
Ma, Y., Soatto, S., Kosecka, J., and Sastry, S. (2004). An in-
vitation to 3-D vision: from images to geometric mod-
els, volume 26 of Interdisciplinary applied mathemat-
ics. Springer, New York.
Newcombe, R., Lovegrove, S., and Davison, A. (2011a).
Dtam: Dense tracking and mapping in real-time. In
ICCV, volume 1.
Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D.,
Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges,
S., and Fitzgibbon, A. (2011b). Kinectfusion: Real-
time dense surface mapping and tracking. ISMAR,
pages 127–136.
Rusu, R. B. and Cousins, S. (2011). 3D is here: Point Cloud
Library (PCL). In IEEE International Conference on
Robotics and Automation (ICRA), Shanghai, China.
Snavely, N., Seitz, S. M., and Szeliski, R. (2006). Photo tourism: Exploring photo collections in 3D. In ACM Transactions on Graphics, pages 835–846. ACM Press.
Sturm, J., Magnenat, S., Engelhard, N., Pomerleau, F., Co-
las, F., Burgard, W., Cremers, D., and Siegwart, R.
(2011). Towards a benchmark for rgb-d slam evalua-
tion. In Proc. of the RGB-D Workshop on Advanced
Reasoning with Depth Cameras at Robotics: Science
and Systems Conf. (RSS), Los Angeles, USA.
Triggs, B., McLauchlan, P., Hartley, R., and Fitzgibbon, A.
(2000). Bundle adjustment - a modern synthesis. In
Triggs, B., Zisserman, A., and Szeliski, R., editors, Vi-
sion Algorithms: Theory and Practice, volume 1883
of Lecture Notes in Computer Science, pages 298–
372. Springer-Verlag.
Tykkala, T. M., Audras, C., and Comport, A. (2011). Direct
iterative closest point for real-time visual odometry. In
ICCV Workshop CVVT.
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
252