ITERATIVE RIGID BODY TRANSFORMATION ESTIMATION FOR

VISUAL 3-D OBJECT TRACKING

Micha Hersch, Thomas Reichert and Aude Billard

LASA Laboratory, EPFL, 1015 Lausanne, Switzerland

Keywords:

Stereo vision tracking, Rigid body transformation estimation.

Abstract:

We present a novel yet simple 3D stereo vision tracking algorithm which computes the position and orientation

of an object from the location of markers attached to the object. The novelty of this algorithm is that it does

not assume that the markers are tracked synchronously. This provides a higher robustness to the noise in the

data, missing points and outliers. The principle of the algorithm is to perform a simple gradient descent on

the rigid body transformation describing the object position and orientation. This is proved to converge to the

correct solution and is illustrated in a simple experimental setup involving two USB cameras.

1 INTRODUCTION

Estimating the 3-D rigid body transformation align-

ing two noisy sets of identiﬁable points is considered

a solved problem in computer vision. Indeed, various

closed form solutions have been suggested in the last

two decades (Arun et al., 1987; Horn, 1987; Walker

et al., 1991), and those solutions have been widely

used and compared (Eggert et al., 1997). However,

in spite of those existing solutions, we address once

again this problem and suggest an iterative solution to

the rigid body estimation problem. Our belief is that

in many applications, an iterative solution is prefer-

able to a closed-form solution, especially if the rigid

body transformation changes in time, for example

when tracking a moving object. The main reasons are that an iterative solution would be more robust to noise in the data and would not assume synchronicity of the set of points.

2 SETTING AND NOTATIONS

We consider a rigid body transformation $T$ transforming a set of $n$ vectors $\{x_i\}$ into another set of $n$ vectors $\{y_i\}$. This transformation is described by a rotation $R$ around an axis passing through the origin and a translation $V$ by a vector $v$:

$$y_i = T(x_i) = R(x_i) + v. \tag{1}$$

When considering a 3-D tracking application, the

rigid body transformation T can be used to describe

the position and orientation of the tracked object, relative to a reference position and orientation. The reference positions of the $n$ markers on the object make up the set $\{x_i\}$. The positions of those markers when tracked by a stereo vision system constitute the set $\{y_i\}$. It is assumed that the markers can be

distinguished one from another, for example by using

different colors. If the object is moving, the evolution

of T yields the trajectory of the object.

3 ROTATIONS

In this paper, we use the spinor representation of ro-

tations which is brieﬂy recalled here, adopting the ap-

proach described in (Hestenes, 1999). This represen-

tation is very similar to the quaternion representation.

The spinor $\bar{q}$ representing the rotation $R$ is given by a scalar $\alpha$ and an imaginary vector $bi$. The direction of $b$ yields the rotation axis (passing through the origin) and its norm is equal to $\sin(\theta/2)$, where $\theta$ is the rotation angle. The scalar $\alpha$ is given by $\cos(\theta/2)$. The rotation of a vector $x$ by a spinor $\bar{q}$ is given by the following equation:

$$R_b(x) = (1 - 2\,b^T b)\,x + 2\sqrt{1 - b^T b}\;\, b \times x + 2\,(b^T x)\,b, \tag{2}$$

where $R_b$ denotes the rotation represented by $b$.
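As an illustration of Eq. (2), the following minimal Python sketch rotates a vector using only the vector part $b$ of the spinor. The helper names (`cross`, `dot`, `rotate`) are ours and do not come from the paper, and the sketch assumes $|\theta| \le \pi$, so that the scalar part $\alpha = \cos(\theta/2) = \sqrt{1 - b^T b}$ is non-negative.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as sequences."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    """Scalar product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def rotate(b, x):
    """Rotate x by the rotation whose spinor vector part is b (Eq. 2)."""
    bb = dot(b, b)
    alpha = math.sqrt(1.0 - bb)      # scalar part, cos(theta / 2)
    bx = cross(b, x)
    bdx = dot(b, x)
    return [(1.0 - 2.0 * bb) * x[k] + 2.0 * alpha * bx[k] + 2.0 * bdx * b[k]
            for k in range(3)]

# Rotation of 90 degrees about the z axis: b = sin(45 deg) * (0, 0, 1).
b = [0.0, 0.0, math.sin(math.pi / 4)]
y = rotate(b, [1.0, 0.0, 0.0])       # numerically close to (0, 1, 0)
w = rotate(b, [1.0, 2.0, 3.0])       # a rotation preserves the norm
```

Because only $b$ is stored, the parameterization has three free parameters, which is what allows the unconstrained gradient step of Section 4, as long as $\|b\| < 1$.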


Hersch M., Reichert T. and Billard A. (2008).

ITERATIVE RIGID BODY TRANSFORMATION ESTIMATION FOR VISUAL 3-D OBJECT TRACKING.

In Proceedings of the Third International Conference on Computer Vision Theory and Applications (VISAPP 2008), pages 674-677

DOI: 10.5220/0001087106740677

 SciTePress

4 ITERATIVE ESTIMATION OF A RIGID BODY TRANSFORMATION

We now present the algorithm for iteratively estimating a rigid body transformation given a set of $n$ points $\{x_i\}$ and its noisy transform $\{y_i\}$. The principle of the algorithm is simple. Starting from an initial guess for the parameters $b$ and $v$ of the transformation, it consists of a gradient descent on the squared distance between the measurement $y_i$ and the transformed point $T_{b,v}(x_i)$:

$$\Delta b = -\varepsilon\, \frac{\partial}{\partial b}\, \frac{1}{2} \left\| y_i - T_{b,v}(x_i) \right\|^2 \tag{3}$$

$$\Delta v = -\varepsilon\, \frac{\partial}{\partial v}\, \frac{1}{2} \left\| y_i - T_{b,v}(x_i) \right\|^2, \tag{4}$$

where $\varepsilon$ is the learning rate. One assumes that $i$ takes values from 1 to $n$ in a uniformly distributed manner. So at each time step, an index $i$ is selected among the available points and $b$ and $v$ are updated according to (3) and (4).

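The actual development of those two equations yields:

$$\Delta b = 2\varepsilon \left( y_i - T_{b,v}(x_i) \right)^T \left( -2\, x_i b^T - \frac{1}{\sqrt{1 - b^T b}}\, (b \times x_i)\, b^T + \sqrt{1 - b^T b}\;\, x_i^{\uparrow} + (b^T x_i)\, I + b\, x_i^T \right) \tag{5}$$

$$\Delta v = \varepsilon \left( y_i - T_{b,v}(x_i) \right), \tag{6}$$

where $I$ is the $3 \times 3$ identity matrix and the unary operator $\uparrow$ is defined as

$$x^{\uparrow} = \frac{\partial}{\partial b} (b \times x) = \begin{pmatrix} 0 & x^{(3)} & -x^{(2)} \\ -x^{(3)} & 0 & x^{(1)} \\ x^{(2)} & -x^{(1)} & 0 \end{pmatrix}, \tag{7}$$

with $x = [x^{(1)}\ x^{(2)}\ x^{(3)}]^T$.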

This concludes the description of the algorithm. For efficiency purposes, it is preferable to choose reference positions so that the $x_i$ are centered on the origin. This reduces the influence of $b$ on the computation of $v$.
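The update rules (5) and (6) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the four centered reference points, the target transformation, the learning rate of 0.02 and the number of steps are all hypothetical choices, and the rescaling that keeps $\|b\| < 1$ is a safeguard added here so that $\sqrt{1 - b^T b}$ stays real.

```python
import math
import random

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def transform(b, v, x):
    """T_{b,v}(x) = R_b(x) + v, with R_b given by Eq. (2)."""
    bb = dot(b, b)
    alpha = math.sqrt(1.0 - bb)
    bx, bdx = cross(b, x), dot(b, x)
    return [(1.0 - 2.0 * bb) * x[k] + 2.0 * alpha * bx[k]
            + 2.0 * bdx * b[k] + v[k] for k in range(3)]

def update(b, v, x, y, eps):
    """One gradient step on 0.5 * ||y - T_{b,v}(x)||^2 (Eqs. 5 and 6)."""
    t = transform(b, v, x)
    r = [y[k] - t[k] for k in range(3)]            # residual y - T(x)
    alpha = math.sqrt(1.0 - dot(b, b))
    bx, bdx = cross(b, x), dot(b, x)
    up = [[0.0, x[2], -x[1]],                      # operator of Eq. (7)
          [-x[2], 0.0, x[0]],
          [x[1], -x[0], 0.0]]
    # Bracket of Eq. (5), i.e. half of the Jacobian dT/db, entry [k][l].
    m = [[-2.0 * x[k] * b[l] - bx[k] * b[l] / alpha + alpha * up[k][l]
          + (bdx if k == l else 0.0) + b[k] * x[l]
          for l in range(3)] for k in range(3)]
    new_b = [b[l] + 2.0 * eps * sum(r[k] * m[k][l] for k in range(3))
             for l in range(3)]
    new_v = [v[k] + eps * r[k] for k in range(3)]
    # Assumption (not in the paper): keep ||b|| < 1 so the square root is real.
    norm = math.sqrt(dot(new_b, new_b))
    if norm > 0.99:
        new_b = [0.99 * c / norm for c in new_b]
    return new_b, new_v

random.seed(0)
# Hypothetical centered reference points and target transformation.
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
b_true, v_true = [0.0, 0.0, 0.5], [0.3, -0.2, 0.5]   # 60 degrees about z
ys = [transform(b_true, v_true, x) for x in xs]

b, v = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]              # start from the identity
for _ in range(5000):
    i = random.randrange(len(xs))                    # one marker per step
    b, v = update(b, v, xs[i], ys[i], eps=0.02)
```

Each step uses a single marker, so the markers never need to be observed at the same instant. With noise-free data as above, the estimate should approach `b_true` and `v_true`; with noisy data the learning rate sets the trade-off between responsiveness and smoothness.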

5 CONVERGENCE

In this section, we prove that if there exists a rigid body transformation matching the two sets of points $\{x_i\}$ to $\{y_i\}$, then the iterative algorithm described above will converge to it.

Let $T^*$ be the true transformation mapping a finite set of points $\{x_i\} = V$ into their corresponding image. If $V$ contains at least three unaligned points, there is only one such transformation. Let $T \neq T^*$ be the current estimate of this transformation.

We then define the following function $E(T)$:

$$E(T) = \sum_{i=1}^{n} E_i(T), \quad \text{with} \quad E_i(T) = \| T x_i - T^* x_i \|^2. \tag{8}$$

Here and in the rest of this paper, the parentheses around $x_i$ are omitted to lighten the notation. We also define the vector $p = [b^T\ v^T]^T$ to be the vector parameterizing the transformation.

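We first show that the algorithm always converges to a solution. If $\varepsilon$ tends to zero and $t$ is the time, then $\varepsilon^{-1} \Delta b$ and $\varepsilon^{-1} \Delta v$ tend respectively to $\frac{\partial}{\partial t} b$ and $\frac{\partial}{\partial t} v$. So the gradient descent of the algorithm means that $\frac{\partial}{\partial t} p = -\frac{\partial}{\partial p} E_i(T)$. We thus have

$$\frac{\partial}{\partial t} E = \frac{\partial}{\partial t} \sum_{i=1}^{n} E_i = \sum_{i=1}^{n} \frac{\partial}{\partial t} E_i = \sum_{i=1}^{n} \frac{\partial E_i}{\partial p} \frac{\partial}{\partial t} p = \sum_{i=1}^{n} \frac{\partial E_i}{\partial p} \left( -\frac{\partial E_i}{\partial p} \right) = \sum_{i=1}^{n} -\left( \frac{\partial E_i}{\partial p} \right)^{2} \le 0.$$

Since the function $E(T)$ is bounded below by zero and never increases, the algorithm always converges to a solution. It remains to be shown that this solution is correct.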

In order to show that the algorithm converges to the right solution $T^*$, we show that for any $T$, $T^*$, $V$ satisfying the conditions mentioned above, there is a transformation $T^\dagger$ belonging to a neighborhood of $T$ such that

$$E(T^\dagger) < E(T). \tag{9}$$

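This amounts to saying that there is no local minimum for $E(T)$. We assume, without loss of generality, that the $x_i$ are centered. Let us consider the transformation $T^\dagger$ defined by translation vector $v^\dagger$ and rotation $R^\dagger$:

$$v^\dagger = v + \varepsilon (v^* - v) \tag{10}$$

$$R^\dagger = \varepsilon R_{b^\dagger} \circ R \quad \text{with } \varepsilon > 0. \tag{11}$$

In the above expression $\varepsilon R_{b^\dagger}$ is an infinitesimal rotation of unit rotation axis given by

$$b^\dagger = z \sum_i R x_i \times R^* x_i, \tag{12}$$

where $z = \left\| \sum_i R x_i \times R^* x_i \right\|^{-1}$. This means that $R^\dagger$ is in the neighborhood of $R$. If $\varepsilon$ is small enough, we have, see (Altmann, 1986),

$$R^\dagger x = R x + \varepsilon (b^\dagger \times R x). \tag{13}$$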


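Thus the variation in $E$ when moving from $T$ to $T^\dagger$ is given by

$$\Delta E = E(T^\dagger) - E(T) \tag{14}$$

$$\begin{aligned}
\Delta E &= \sum_i \| T^\dagger x_i - T^* x_i \|^2 - \sum_i \| T x_i - T^* x_i \|^2 \\
&= \sum_i \Big[ \| T^\dagger x_i \|^2 - \| T x_i \|^2 - 2 (T^* x_i)^T (T^\dagger x_i - T x_i) \Big] \\
&= \sum_i \Big[ \| R^\dagger x_i + v^\dagger \|^2 - \| R x_i + v \|^2 - 2 (R^* x_i + v^*)^T (R^\dagger x_i + v^\dagger - R x_i - v) \Big] \\
&= \sum_i \Big[ \| R x_i + v + \varepsilon (v^* - v + b^\dagger \times R x_i) \|^2 - \| R x_i + v \|^2 \\
&\qquad\quad - 2 (R^* x_i + v^*)^T \big( R x_i + \varepsilon\, b^\dagger \times R x_i + v + \varepsilon (v^* - v) - R x_i - v \big) \Big] \\
&= \sum_i 2\varepsilon \Big[ (v^* - v + b^\dagger \times R x_i)^T (R x_i + v) - (R^* x_i + v^*)^T (b^\dagger \times R x_i + v^* - v) \Big] + O(\varepsilon^2)
\end{aligned} \tag{15}$$

If $\varepsilon$ is small enough, we can discard terms in $O(\varepsilon^2)$:

$$\begin{aligned}
\Delta E &\simeq \sum_i 2\varepsilon\, (v^* - v + b^\dagger \times R x_i)^T (R x_i - R^* x_i + v - v^*) \\
&= 2\varepsilon \Big[ -n \| v^* - v \|^2 + (v^* - v)^T \Big( \sum_i R x_i - \sum_i R^* x_i \Big) + \sum_i (b^\dagger \times R x_i)^T (R x_i - R^* x_i) \Big] \\
&= 2\varepsilon \Big[ -n \| v^* - v \|^2 + \sum_i (b^\dagger \times R x_i)^T (R x_i - R^* x_i) \Big] \\
&= -2\varepsilon \Big[ n \| v^* - v \|^2 + \sum_i (b^\dagger \times R x_i)^T R^* x_i \Big]
\end{aligned} \tag{16}$$

Here the terms coupling translation and rotation vanish because the $x_i$ are centered ($\sum_i x_i = 0$), and $(b^\dagger \times R x_i)^T R x_i = 0$ because a cross product is orthogonal to each of its factors.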

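We now show that the sum in (16) is also positive. Using the matrix representation of rotation,

$$\begin{aligned}
\sum_i (b^\dagger \times R x_i)^T R^* x_i
&= \sum_i \Big( \Big( z \sum_j R x_j \times R^* x_j \Big) \times R x_i \Big)^T R^* x_i \\
&= z \sum_{i,j} \big( (R x_j \times R^* x_j) \times R x_i \big)^T R^* x_i \\
&= z \sum_{i,j} \big( R^* x_j\, (R x_j)^T R x_i - R x_j\, (R^* x_j)^T R x_i \big)^T R^* x_i \\
&= z \sum_{i,j} \big( x_j^T x_i\, (R^* x_j)^T R^* x_i - x_i^T R^T R^* x_j\; x_j^T R^T R^* x_i \big) \\
&= z n \sum_i \big( x_i^T C x_i - x_i^T R^T R^* C R^T R^* x_i \big) > 0 \quad \forall R \neq R^*.
\end{aligned} \tag{17}$$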

In the last equation $C$ is the covariance matrix of the $x_i$. The last inequality is justified by the fact that the rotation matrix $R^*$ breaks the alignment between the principal component of $C$ and the direction of maximum variance in $V$. Putting (17) and (16) together shows that $E$ decreases when moving from $T$ to $T^\dagger$. There is thus no local minimum in $E$, so $E$ is a Lyapunov function of the system, which proves the convergence.

[Figure 1: two panels, "rotation vector" and "translation vector"; horizontal axis: iterations.]

Figure 1: Convergence of the algorithm. The three vector components are indicated by x, y and z. The dotted horizontal lines are the true parameter values of the rigid body transformation and the solid lines show the evolution of the estimated values using the learning algorithm.
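The descent step used in this proof can be checked numerically. The sketch below is only a sanity check under assumptions of ours (hypothetical centered points, current estimate, true transformation and step size; helper names are not from the paper). It builds $T^\dagger$ from (10) and (11), taking as rotation axis the normalized sum of $R x_i \times R^* x_i$ as in (12), and compares the actual change in $E$ with the first-order prediction of (16). By the scalar triple product identity, the sum in (16) equals $b^\dagger \cdot \sum_i (R x_i \times R^* x_i) = \| \sum_i R x_i \times R^* x_i \|$, which the code uses for the prediction.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def add(a, b):
    return [a[k] + b[k] for k in range(3)]

def rotate(b, x):
    """Rotation of Eq. (2), parameterized by the spinor vector part b."""
    bb = dot(b, b)
    alpha = math.sqrt(1.0 - bb)
    bx, bdx = cross(b, x), dot(b, x)
    return [(1.0 - 2.0 * bb) * x[k] + 2.0 * alpha * bx[k] + 2.0 * bdx * b[k]
            for k in range(3)]

def axis_angle_rotate(axis, angle, u):
    """Rodrigues' formula: rotate u by `angle` about the unit vector `axis`."""
    c, s = math.cos(angle), math.sin(angle)
    au, adu = cross(axis, u), dot(axis, u)
    return [u[k] * c + au[k] * s + axis[k] * adu * (1.0 - c) for k in range(3)]

def cost(points, targets):
    """E of Eq. (8): sum of squared distances between matched points."""
    return sum((p[k] - q[k]) ** 2 for p, q in zip(points, targets)
               for k in range(3))

# Hypothetical data: centered points, current estimate T and true T*.
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
b, v = [0.1, 0.0, 0.0], [0.0, 0.0, 0.0]
b_star, v_star = [0.0, 0.0, 0.5], [0.3, -0.2, 0.5]
eps = 1e-3

rx = [rotate(b, x) for x in xs]                       # R x_i
rsx = [rotate(b_star, x) for x in xs]                 # R* x_i
targets = [add(q, v_star) for q in rsx]               # T* x_i

# Rotation axis of Eq. (12): normalized sum of R x_i x R* x_i.
s = [sum(cross(p, q)[k] for p, q in zip(rx, rsx)) for k in range(3)]
s_norm = math.sqrt(dot(s, s))
axis = [c / s_norm for c in s]

v_dag = [v[k] + eps * (v_star[k] - v[k]) for k in range(3)]        # Eq. (10)
e_t = cost([add(p, v) for p in rx], targets)
e_dag = cost([add(axis_angle_rotate(axis, eps, p), v_dag) for p in rx], targets)

delta_e = e_dag - e_t
dv2 = sum((v_star[k] - v[k]) ** 2 for k in range(3))
predicted = -2.0 * eps * (len(xs) * dv2 + s_norm)     # first-order term of (16)
```

For a small step, `delta_e` should be negative and agree with `predicted` up to terms of order `eps ** 2`.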

6 EXPERIMENTS

The first experiment aims at illustrating the convergence properties of the algorithm described above and is performed in simulation. A rotation vector $b^*$ and a translation vector $v^*$ were randomly generated. The estimated rigid body transformation was initialized to the identity $b = v = 0$ and the algorithm was run on randomly generated points $x_i$. The results can be seen in Figure 1. One sees that both $b$ and $v$ converge to $b^*$ and $v^*$ respectively, as is expected from the convergence properties studied above.

The next experiment involves a tracking task in

a real stereo vision setting made of two low qual-

ity USB cameras mounted on a ﬁxed support. Three

color patches were taped on the object to be tracked.

Software based on the OpenCV library tracks the color blobs and locates them in three dimensions. The object was moved by hand, so the only information about the position of the object is given by the stereo vision system. The real position of the object is therefore unknown, i.e., there is no ground truth.

Using the data recorded from the stereo vision system, the position and orientation of the end-effector were computed using two different algorithms: the iterative one described in this paper, (5) and (6), and the closed-form solution described in (Horn, 1987). The latter finds the rigid body transformation by optimizing a least-squares criterion similar to $E(T)$ defined in (8).
In both cases, the data was taken as is, without any preprocessing. The iterative algorithm was initialized using the closed-form algorithm on the initial patch positions. In the absence of ground truth, the precision of the tracking algorithm is not investigated. Rather, we compare the behaviors of the iterative and closed-form algorithms.

[Figure 2: two panels, "iterative solution" (left) and "closed-form solution" (right); vertical axis: rotation angle (in degrees); horizontal axis: frames.]

Figure 2: The behaviors of the tested algorithms in case of noise in the data. The iterative algorithm (left graphs) is less noisy than the closed-form algorithm. The object is static, and the same data was used for both algorithms.

The ﬁrst experiment was made with a static object.

Using the same marker position data coming from the

stereo vision software, we ran both algorithms to es-

timate the position and the orientation of the object.

The results can be seen in Figure 2. One sees that the

iterative solution is much less sensitive to noise in the

data. This is because the closed-form solution has no

memory, whereas the iterative solution can only up-

date its current estimate up to a certain amount, which

produces a smoothing effect.
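The smoothing effect can be made explicit in a simplified setting, assumed here purely for illustration and not analysed in the paper: the rotation estimate is held fixed and a single marker located at the reference origin ($x_i = 0$) is measured as $y_t = \bar{y} + n_t$, where $n_t$ is i.i.d. zero-mean noise with variance $\sigma^2$ per component. Update (6) then reduces to an exponential moving average of the measurements:

$$v_{t+1} = v_t + \varepsilon\,(y_t - v_t) = (1 - \varepsilon)\,v_t + \varepsilon\,y_t,$$

$$\operatorname{Var}[v_t] \;\xrightarrow[t \to \infty]{}\; \frac{\varepsilon^2}{1 - (1 - \varepsilon)^2}\,\sigma^2 = \frac{\varepsilon}{2 - \varepsilon}\,\sigma^2 .$$

Under these assumptions, a learning rate of $\varepsilon = 0.1$ would leave about $1/19$ of the measurement variance in the estimate, whereas a memoryless estimate based on that single measurement would keep the full variance $\sigma^2$.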

The second experiment was made with a moving

object. In this experiment, the effect of missing points

is investigated. Two different scenarios were tested.

In the ﬁrst scenario (periodic occlusion), a randomly

selected point was removed in each frame. In the sec-

ond scenario (lasting occlusion) a given point was re-

moved from the data for 10 consecutive frames. The

results can be seen in Figure 3. One sees that for both

scenarios, the closed-form algorithm (dotted lines)

cannot deal with the missing points as it requires at

least three concomitant points. In contrast, the iterative algorithm (dashed-dotted line) can deal with the missing points as it has no such requirement. It closely follows the position given by the baseline (solid line). This baseline was obtained by using the closed-form algorithm and smoothing the result.
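The way the iterative update tolerates missing markers can be sketched as follows. This is a hedged illustration, not the experimental code: the frame format (a dictionary from marker index to measured position), the function `track`, the four reference points, the static noise-free object and the occlusion pattern are all assumptions of ours, and the estimate starts from the identity rather than from a closed-form initialization.

```python
import math
import random

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def transform(b, v, x):
    """T_{b,v}(x) = R_b(x) + v, with R_b given by Eq. (2)."""
    bb = dot(b, b)
    alpha = math.sqrt(1.0 - bb)
    bx, bdx = cross(b, x), dot(b, x)
    return [(1.0 - 2.0 * bb) * x[k] + 2.0 * alpha * bx[k]
            + 2.0 * bdx * b[k] + v[k] for k in range(3)]

def update(b, v, x, y, eps):
    """One gradient step for a single marker (Eqs. 5 and 6)."""
    t = transform(b, v, x)
    r = [y[k] - t[k] for k in range(3)]
    alpha = math.sqrt(1.0 - dot(b, b))
    bx, bdx = cross(b, x), dot(b, x)
    up = [[0.0, x[2], -x[1]], [-x[2], 0.0, x[0]], [x[1], -x[0], 0.0]]
    m = [[-2.0 * x[k] * b[l] - bx[k] * b[l] / alpha + alpha * up[k][l]
          + (bdx if k == l else 0.0) + b[k] * x[l]
          for l in range(3)] for k in range(3)]
    new_b = [b[l] + 2.0 * eps * sum(r[k] * m[k][l] for k in range(3))
             for l in range(3)]
    new_v = [v[k] + eps * r[k] for k in range(3)]
    norm = math.sqrt(dot(new_b, new_b))
    if norm > 0.99:                       # safeguard added here: keep ||b|| < 1
        new_b = [0.99 * c / norm for c in new_b]
    return new_b, new_v

def track(frames, ref, b, v, eps):
    """Apply one update per visible marker; absent markers are skipped."""
    for frame in frames:
        for i, y in frame.items():
            b, v = update(b, v, ref[i], y, eps)
    return b, v

random.seed(1)
ref = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
b_true, v_true = [0.0, 0.0, 0.5], [0.3, -0.2, 0.5]
full = {i: transform(b_true, v_true, x) for i, x in enumerate(ref)}

frames = []
for t in range(2000):
    frame = dict(full)
    del frame[random.randrange(len(ref))]     # one random marker missing
    if 100 <= t < 110:
        frame.pop(0, None)                    # marker 0 hidden for 10 frames
    frames.append(frame)

b, v = track(frames, ref, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], eps=0.02)
```

No frame ever needs a complete set of markers: the estimate is simply not updated by markers that are absent, which is the property the occlusion experiment relies on.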

7 DISCUSSION

[Figure 3: two panels, "periodic occlusion" and "lasting occlusion"; vertical axis: rotation angle (in degrees); horizontal axis: frames; curves: baseline, iterative solution, closed-form solution.]

Figure 3: The behaviors of the tested algorithms in case of occlusions. The iterative algorithm deals well with points periodically missing and points missing for a number of consecutive frames (lasting occlusions).

The results presented above show that an iterative solution to the rigid body transformation estimation problem can be advantageous in a tracking application. The main advantages come from the fact that the iterative solution does not assume that the points are concomitant. Moreover, it ensures continuity in the estimates, which is not guaranteed by the memoryless closed-form solution. When using the iterative solution suggested here, the learning rate must be carefully chosen to be large enough to avoid losing track of the object, while remaining small enough to ensure a smooth estimate of the transformation.

Although it was not investigated in this paper, we be-

lieve that the suggested algorithm could be useful in

other applications, especially in iterative algorithms

like ICP. This algorithm could also most probably be

easily extended to include uniform scaling of rigid

body transformations.

REFERENCES

Altmann, S. (1986). Rotations, Quaternions and Double

Groups, chapter 4, page 80. Oxford University Press.

Arun, K., Huang, T., and Blostein, S. (1987). Least-squares

ﬁtting of two 3-d point sets. IEEE Transactions on

Pattern Analysis and Machine Intelligence.

Eggert, D., Lorusso, A., and Fisher, R. (1997). Estimating

3-d rigid body transformation: a comparison of four

major algorithms. Machine Vision and Applications.

Hestenes, D. (1999). New Foundations for Classical Mechanics, pages 277–305. Fundamental Theories of Physics. Kluwer Academic Publishers, 2nd edition.

Horn, B. (1987). Closed-form solution of absolute orien-

tation using unit quaternions. Journal of the Optical

Society of America A, 4(4):629–641.

Walker, M., Shao, L., and Volz, R. (1991). Estimating 3-

d location parameters using dual number quaternions.

CVGIP: Image Understanding.
