RELIABLE DETECTION OF CAMERA MOTION BASED ON

WEIGHTED OPTICAL FLOW FITTING

Rodrigo Minetto, Neucimar J. Leite

Institute of Computing, State University of Campinas, Campinas, SP, Brazil

Jorge Stolﬁ

Institute of Computing, State University of Campinas, Campinas, SP, Brazil

Keywords:

Camera motion, video segmentation, optical ﬂow, least-squares ﬁtting, KLT algorithm.

Abstract:

Our goal in this paper is the reliable detection of camera motion (pan/zoom/tilt) in video records. We propose

an algorithm based on weighted optical ﬂow least-square ﬁtting, where an iterative procedure is used to im-

prove the corresponding weights. To the optical ﬂow computation we used the Kanade-Lucas-Tomasi feature

tracker. Besides detecting camera motion, our algorithm provides a precise and reliable quantitative analysis

of the movements. It also provides a rough segmentation of each frame into “foreground” and “background”

regions, corresponding to the moving and stationary parts of the scene, respectively. Tests with two real videos

show that the algorithm is fast and efﬁcient, even in the presence of large objects movements.

1 INTRODUCTION

Our goal in this work is the reliable detection of

camera motion in video images. This detection is

closely related, for example, to the topic of content-

based video retrieval in which the camera motion in-

formation is used for structural segmentation and in-

dexing of video databases. Usually, a video con-

sists of many continuous parts, called shots, separated

by abrupt cuts or gradual transitions (like fades, dis-

solves, wipes, etc). The problem of detecting such

editing transitions and segmenting a video into shots

has received much attention in recent years. How-

ever, due to the various camera motion inside a shot,

this level of segmentation is often insufﬁcient, e.g. for

choosing key-frames for each video segment. Our mo-

tion detection algorithm can be used to provide this

second level of segmentation and improve the video

abstraction in an indexing process.

This work deals with a camera in a ﬁxed location

exhibiting three degrees of freedom — pan (left or

right), tilt (up or down), and zoom (in or out).

In tests with real videos, our approach reported

about 95% of the frame pairs where camera motion

occurred, and about 98% of these frames did have

camera motion. Besides detecting camera motion, our

method also provides a precise and reliable quantita-

tive analysis of the movements — namely, the amount

of pan, tilt and zoom between any two consecutive

frames. It also provides a rough segmentation of each

frame into “foreground” and “background” regions,

corresponding to the moving and stationary parts of

the scene.

Our method consists of two main steps applied to

each pair of consecutive frames. The ﬁrst step deter-

mines the optical ﬂow ﬁeld which can be interpreted

as the velocity of the scene “ﬂowing” throught each

pixel. Then, the second step estimates the camera mo-

tion from the obtained optical ﬂow ﬁeld.

For the ﬁrst step, we propose to use the

Kanade-Lucas-Tomasi feature tracking algorithm

(KLT) (Tomasi and Kanade, 1991), which derives

from the work of Lucas and Kanade (Lucas and

Kanade, 1981).

The second step considers a weighted least-

squares matching in the decomposition of the optical

ﬂow ﬁeld into the sum of four component ﬁelds. The

ﬁrst three are the ﬁelds expected from camera move-

ments (pan, tilt, and zoom). The fourth is a residual

“error” ﬁeld that is assumed to be due to events such

as scene motion, video quantization, etc.

This paper is organized as follows. The next sec-

tion describes previous works on camera motion de-

tection. Section 3 and 4 describe our algorithm in de-

435

Minetto R., J. Leite N. and Stolﬁ J. (2007).

RELIABLE DETECTION OF CAMERA MOTION BASED ON WEIGHTED OPTICAL FLOW FITTING.

In Proceedings of the Second International Conference on Computer Vision Theory and Applications - IU/MTSV, pages 435-440

 SciTePress

tails. Section 5 discusses the method used for the opti-

cal ﬂow computing, and Section 6 shows some results

obtained with real videos. In section 7 we present

some conclusions and directions for future works.

2 RELATED WORK

Most existing techniques to determining camera mo-

tion can be classiﬁed according to the way they ob-

tain the raw motion data, e.g., from MPEG motion

vectors, by analysis of spatio-temporal textures, or by

analysis of the optical ﬂow.

2.1 Analysis of MPEG Motion Vectors

Some algorithms try to derive the camera motion pa-

rameters directly from the MPEG compressed video

stream (Ewerth et al., 2004). They consider the fact

that the MPEG encoding divides the frame into an ar-

ray of ﬁxed-size blocks, and try to estimate an average

velocity vector within each block. These vectors are

then compared with the velocities that are predicted

from the camera motion parameters. This approach

can be very efﬁcient because it works directly on the

MPEG stream without decompression, and does not

have to compute the motion vectors. However, the

MPEG velocity vectors are chosen on a local basis for

efﬁcient compression, and may diverge considerably

from the actual image motion (Huhn, 2000).

Figure 1: Optical ﬂow between a pair of consecutive video

frames sampled at a regular grid. The area of each dot is

proportional to the weight w

of the vector. The camera is

doing a combination of pan to the left and tilt up.

2.2 Analysis of 2D Spatio-Temporal

Slices

A video recording can be seen as a 3D image, with

two space axis x,y and a time axis t. A spatio-

temporal slice is a 2D image of a video, with t as

one axis and an arbitrary line on the x, y plane as the

other axis. Ngo et al (Ngo et al., 2003) proposed an

method to detect camera movement by analysis of 2D

spatio-temporal slices — such as horizontal (x,t) and

and vertical (y,t). The key idea is that camera mo-

tions tend to produce characteristic textures in such

slices. Standard image analysis techniques, such as

histograms of the structural tensor, can then be used

to identify such textures and estimate the motion pa-

rameters from their orientation. The main problem

with this approach is distinguishing camera move-

ment from object motion. Moreover, combinations

of pan/tilt/zoom often causes misclassiﬁcation.

2.3 Methods Based on Optical Flow

The most reliable methods proposed so far are based

on explicit computation of the optical ﬂow be-

tween consecutive frames. A well-known contribu-

tion by Srinivasan et al. (Srinivasan et al., 1997)

uses the Nelder-Meade non-linear minimization pro-

cedure (Press et al., 1986) to compute the best-ﬁtting

camera-motion parameters. They use a detailed cam-

era motion model, which includes rolling and track-

ing, and also accurately models the pincushion-like

distortion on the ﬂow that is observed when panning

or tilting with a wide angle of view. On the other

hand, their algorithm allows only limited amounts of

scene motion. Moreover, the Nelder-Meade optimiza-

tion algorithm is known to be expensive, and gets eas-

ily trapped in local minima.

3 METODOLOGY

3.1 Weighted Optical Flow

Conceptually, the optical ﬂow from an image I to a

later image J is a function f that, to each point u of the

image domain D, associates a velocity vector f(u),

such that

I(u) ≈ J(u+ f(u)) (1)

whenever u and u + f(u) lie both within D. The

similarity criterion ≈ in equation (1) depends on the

context, but usually includes similarity of image val-

ues and possibly of other local information, such as

derivatives or texture.

Our algorithm assumes that the optical ﬂow f is

sampled at a ﬁxed set of points u

,.. .,u

, yield-

ing a list of vectors f

, f

,.. ., f

. See ﬁgure 1.The

algorithm also requires, for each vector f

, a corre-

sponding weight w

, which expresses its reliability.

The weight is usually higher for points u

where the

local texture of image I is easier to track, and was lo-

cated in image J with high conﬁdence.

3.2 The Canonical Motions

The central step in our algorithm is the approxima-

tion of the optical ﬂow ﬁeld by a combination of three

canonical camera motion ﬂows p, t and z. These

ﬁelds are the distinctive optical ﬂows that ideally

would result from a static scene being images which

represent pan, tilt, or zoom motion, with a speciﬁc

speed.

The canonical pan ﬂow p, by deﬁnition, has

p(u) = (1, 0) at every point u of the image’s do-

main. Similarly, the canonical tilt ﬂow t and zoom

ﬂow z are deﬁned, respectivelly, as t(u) = (0,1) and

z(u) = 2u for all u (see ﬁgure 2). We use an im-

age coordinate system whose origin is at the center

of each frame, with the x axis pointing left and y

pointing up. The unit of measurement is such that the

image domain D is the rectangle [−0.500, +0.500] ×

[−0.375,+0.375].

Figure 2: The canonical camera motion ﬂows: pan p(u), tilt

t(u), zoom z(u), sampled at a 7×5 grid of points.

The canonical pan ﬂow corresponds to a rotation

of the camera around the local vertical axis that causes

the aim point to sweep horizontally from right to left,

just fast enough to completely replace the ﬁeld of

view from one frame to the other. Assuming an an-

gular ﬁeld of view fairly small.

3.3 Analyzing the Optical Flow

The next step in our algorithm is to approximate the

optical ﬂow f between the two given frames by a lin-

ear combination

f of the canonical ﬂows, namely

f(u) = P∗p(u) + T ∗t(u) + Z ∗z(u) (2)

for every u ∈ D.

The coefﬁcients P, T and Z, to be determined, will

indicate the amount of pan, tilt and zoom, respec-

tively, that seem to have occurred between two con-

secutive frames. Note that a negative value for a coef-

ﬁcient means that the apparent motion is opposite to

the corresponding canonical movement (that is, a pan

to the right, a tilt-up, or a zoom-out, respectively).

We compute the coefﬁcients P, T,Z by a straight-

forward weighted least squares procedure. For that

purpose, we deﬁne the scalar product of two ﬂows a

and b, with a weight function w, as

a|b

w(u)a(u)b(u)du

w(u)

(3)

The discrete version of this formula, assuming that

the images are sampled at points u

,.. .,u

a|b

∑

i=1

∑

i=1

(4)

As usual, we also deﬁne the norm of a (sampled) ﬂow

f as

f|f

. Formulas (3) and (4) obviously

satisfy the deﬁnitions of scalar product and norm, as

long as the weights w

are all positive.

We seek P, T and Z that minimize the discrep-

ancy between the given ﬂow f and the ideal ﬂow

f of

equation ( 2). The discrepancy is the ﬂow d = f −

and its overall magnitude can be measured by the

square error Q(P, T, Z) =



f −

f|f −



. As

in standard least-squares ﬁtting, the values of P,T,Z

that minimize Q are found by solving the system of

linear equations





p|p

ii hh

p|t

ii hh

p|z

t|p

ii hh

t|t

ii hh

t|z

z|p

ii hh

z|t

ii hh

z|z

















f|p

f|t

f|z





(5)

3.4 Weight Adjustment for Vectors

The least-squares method (5) works ﬁne if the scene

is stationary. Moving objects change the optical ﬂow,

and therefore introduce errors in the ﬁtted parame-

ters P, T, and Z, as the least-squares procedure yields

some average of the two ﬂows. This is not a signiﬁ-

cant problem if the moving objects cover a small frac-

tion of the image and/or their speed is small compared

to the camera motion ﬂow. However, if the scene con-

tains fast moving objects, their ﬂow may easily dom-

inate the ﬁtted ﬂow

In order to alleviate this problem, we deﬁne the

weights w

as being the reliability weights ω

pro-

vided by the optical tracking procedure, divided by

the length of the corresponding ﬂow vectors f

, that is

+ ε

(6)

where ε is a small constant bias, introduced to avoid

division by zero or very small numbers.

Note that this formula increases the relative

weight of small ﬂow vectors, while reducing that of

large vectors. The justiﬁcation for this correction is

that small ﬂow vectors are indeed more signiﬁcant,

statistically, than large ones. If the sampled optical

ﬂow f contains a signiﬁcant number of very small

vectors mixed with some large ones, the explanation

is that the camera is stationary, and the set K of points

with small vectors is part of the background.

4 ITERATIVE WEIGHT

ADJUSTMENT

The least-squares method implicitly assumes that the

deviations between the observed ﬂow f and the ideal

ﬂow

f are due to independent additive noise with

Gaussian distribution. With this assumption, the least

squares method can be derived from the maximum

likelihood principle.

This assumption may be adequate for static scenes

with rich texture, where the main sources of errors in

the ﬂow f are expected to be due to camera and video

encoding noise. For scenes with moving objects, this

assumption is grossly incorrect: it often happens that

all the ﬂow vectors in a large region of the domain

are completely replaced by large “noise” vectors, all

pointing in the same general direction. In such cases,

the least squares method will produce an average of

the camera and scene motion ﬂows.

To solve this problem, we perform several itera-

tions of the least squares ﬁtting procedure, while ad-

justing the weights so as to exclude from considera-

tion those data vectors that seem to be due to moving

objects.

More precisely, we ﬁrst compute the mean dis-

crepancy vector µ between the given ﬂow f and the

previously ﬁtted camera motion ﬂow

µ =

∑

i=1

( f

−

)

∑

i=1

(7)

and its standard deviation

σ =



−

−µ



(8)

Points p whose discrepancy f(p) −

f(p) has length

greater than 3σ are eliminated by setting their weights

to zero. The ﬁtting procedure is then applied again

with the adjusted weights to compute P,T and Z. This

process is repeated until there are no points with dis-

crepancy greater than 3σ.

5 COMPUTING THE OPTICAL

FLOW

To compute the optical ﬂow vectors f

, we use the

Kanade-Lucas-Tomasi (KLT) algorithm, modiﬁed to

yield also the corresponding reliability weights ω

The core of the KLT algorithm is a procedure that,

given two images I, J and a point u in the image’s do-

main D, locates a point v = u+ f such that the neigh-

borhood of u in I is most similar to that of v in J. We

use the multi-scale algorithm that computes a ﬁrst es-

timate of v using low-resolution versions of I and J,

and then improves that estimate by several stages of

local search, increasing the image resolution at each

step. This approach makes the algorithm more robust,

and reduces the overall search time.

At each stage, the displacement vector f is ob-

tained by minimization of the sum of square differ-

ences S(u,v) of the I and J pixel values, within two

comparison windows centered at points u and v, re-

spectively. The radius r of the windows is a parameter

of the algorithm.

As it is described in the literature, the KLT algo-

rithm provides this estimate only indirectly. The stan-

dard implementation performs the multiscale search

in parallel for a list u

,.. .,u

of sample points,

but returns displacement vectors only for those points

with reliable matches. Internally, the algorithm ﬁrst

evaluates the “matchability” of each sample point u

by computing the matrix

G(u) =

∑

(

∂I

∂x

)

∑

(

∂I

∂x

)(

∂I

∂y

)

∑

(

∂I

∂x

)(

∂I

∂y

)

∑

(

∂I

∂y

)

(9)

The summations range over all pixels within the com-

parison window centered at the pixel u. The “matcha-

bility” score λ(u) of u is taken to be the smallest sin-

gular value of the matrix G(u) (Press et al., 1986). It

is claimed that λ is small if pixel values are roughly

constant inside the window, and large for windows

that contain distinctive features such as noise, dots,

corners, strong textures, etc..

The KLT implementation discards any sample

point u

whose matchability λ(u

) lies below an arbi-

trary threshold. The remaining points are then given

to the multiscale matching procedure, which com-

putes the best displacement vector for each point. The

algorithm then computes the average absolute pixel

difference A(u

) between the two windows, dis-

cards any point for which this quantity is above a sec-

ond threshold, and returns the surviving pairs (u

, f

We modiﬁed the KLT implementation so that it re-

turns the best displacement f

and the reliability score

of every input point u

. We deﬁned the score ω as

the ratio λ(u

)/A(u

) so that a point u

gets a low

weight ω if its neighborhood in I is featureless (hence

poorly trackable), or if the algorithm fails to ﬁnd a

good match in the J image.

This change makes better use of the information

provided by the KLT in the sense that, in any statis-

tical estimation, it is better to use all data, weighted

according to its reliability, than to discard some data

and treat the rest as having the same importance.

Table 1: Data for the videos used in our experiments.

Video Length Scene M-type

(minutes) (frames) transitions frame pairs

1 39:25 59140 80 1360

2 11:04 16600 23 688

6 TESTS

We tested our algorithm with two continuous video

recordings of our University Council meetings. See

table 1. Both videos were recorded in color, with a

single camera at 320×240 resolution and converted

to grayscale images (PGM) at half the original reso-

lution (to reduce the camera noise, recoding artifacts,

computational costs, etc).

-0.3

-0.2

-0.1

0.1

0.2

0.3

0 5 10 15 20 25 30 35 40 45 50

motion magnitude

number of frames

0 5 10 15 20 25 30 35 40 45 50

number of frames

discrepancy

Figure 3: Plots of estimated camera motion coefﬁcients

P, T,Z and residual error Q for 50 consecutive frame pairs

in test video 1. Selected pairs marked L1 through L4 are

shown below. In pair L1 (frames 5 and 6) the camera is sta-

tionary and there is a small amount of scene motion. Frames

14 to 49 comprise a single transition between two speakers.

Pairs L2, L3 and L4 are part of a fast right pan, a slower left

pan, and a zoom-in, respectively.

The comparison window radius r was set to 3 pix-

els (meaning a 7×7 window). We used a regular grid

of 16×11 sample points u

, spanning the whole image

minus a safety margin of 34 pixels around the edges.

To evaluate the program’s performance, all con-

secutive frame pairs in the video sequence were clas-

siﬁed by hand into two classes, moving camera (M)

pairs and stationary camera (S) pairs. The scene tran-

sitions in table 1 corresponds to the number of times

one or more camera motion occurred between two pe-

riods of stationary camera.

For every pair of consecutive frames, our program

was used to compute four relevant numbers: the coef-

ﬁcients P, T,Z of the ﬁtted ﬂow, and the magnitude

Q =



f −



of the discrepancy between the ob-

served and ﬁtted ﬂows. Figure 3 shows the results of

the analysis for 50 consecutive images from video 1,

spanning a scene transition event. Note that the plots

of the P and Z coefﬁcients accurately identify and de-

scribe the transition event.

Figure 4 is a log-scale plot of the quantities V =

√

+ T

+ Z

(which can be interpreted as the mag-

nitude of ﬁtted ﬂow

f) and E =

√

Q =



f −



(the

norm of the residual ﬂow).

Observe that the two classes M and S can be

separated quite cleanly by a simple threshold in V.

Namely, frame pairs with V ≥ 0.0022 are almost al-

ways M, while pairs with V ≤ 0.0022 are almost al-

ways S. The threshold in V corresponds to a ﬁtted

ﬂow where the average vector has length less than one

pixel. The plot shows that the residual errors are very

small, probably due to camera vibration, video noise,

or tracking errors.

The single V threshold fails on a few S frame pairs

with large V, which get classiﬁed as M (such as pairs

L3 and L6 in ﬁgure 4). Those pairs happen to have

substantial amounts of scene motion, e.g. where most

of the image is blocked by a person walking across

the camera’s ﬁeld of view. In such cases, the iterative

least-squares procedure ends up rejecting the back-

ground data vectors, and interprets the optical ﬂow

of the moving person as being due to camera motion.

It turns out that many of those false positives also

have a very large residual ﬂow, so they can be dis-

carded by a second threshold in E. Namely, pairs with

V ≥ 0.0022 but E ≥ 7 are more likely to be S frames,

and therefore can be classiﬁed as such. An explana-

tion for this fact is that the walking person’s image

is usually out of focus, so the optical ﬂow algorithm

has a hard time ﬁnding the correct correspondences.

Moreover, the person is often moving in such a way

that its optical ﬂow cannot be well approximated by a

combination of the basic ﬂows. Both causes typically

lead to large errors. To complete the argument, we

observe that M pairs are rare to begin with, so a frame

pair with large amounts of scene motion is far more

likely to be S than M.

The method’s performance can be quantiﬁed by its

precision (p), recall (r) and error (e) metrics, deﬁned

p =

+ F

r =

+ F

−

e =

+ F

−

(10)

where T

is the number of true positives (frame pairs

1e-06

1e-05

1e-04

0.001

0.01

0.1

0.001 0.01 0.1 1 10 100

motion

stationary

threshold

Figure 4: Plot of estimated camera velocity V against residual error E for all frame pairs in test video 2. Frame pairs of type

M (moving camera) and S (stationary camera) are respresented by triangles and circles, respectively. Six events of the video

2 are choosed to show what they represent.

of class M that were correctly identiﬁed as such), F

is the number of false positives (S pairs identiﬁed as

M pairs) and F

−

false negatives (M pairs identiﬁed

as S). Table 2 highlights the good performance of the

method through the computation of the above men-

tioned metrics.

Table 2: Precision, recall and error of the algorithm on the

test videos.

Video p r e T

−

1 0.98 0.95 0.018 1292 25 68

2 0.99 0.95 0.010 655 7 33

7 CONCLUSION

This paper described a new method for detecting cam-

era motions in video images based on an efﬁcient

weighted least-squares procedure, which deﬁnes the

best-ﬁtting between a set of camera motion mod-

els and the computed optical ﬂow of an image se-

quence. This optical ﬂow was obtained through the

KLT tracking method properly modiﬁed to give infor-

mation about the camera motion and provide a quanti-

tative analysis of the movements (pan, tilt and zoom)

between two consecutive frames. This modiﬁcation

yielded also a rough segmentation of the frames into

“foreground” and “background” regions associated,

respectively, with the moving and stationary parts of

a scene.

Tests with real videos of meetings, recorded with

a camera in a ﬁxed location, illustrated the good be-

havior of the method in terms of robustness and ex-

ecution time, as well as its ability to deal with com-

binations of the considered movements. Extensions

to this work include, for instance, analysis of more

elaborated clustering methods to improve the classi-

ﬁcation step, ﬁltering the camera parameters to avoid

the inﬂuence of brief object motions, and selection of

video key-frames.

REFERENCES

Ewerth, R., Schwalb, M., Tessmanny, P., and Freisleben,

B. (2004). Estimation of arbitrary camera motion in

mpeg videos. 17th International Conference on Pat-

tern Recognition (ICPR), 1:512–515.

Huhn, P. M. (2000). Camera motion estimation using fea-

ture points in mpeg compressed domain. IEEE Inter-

national Conference on Image Processing (ICIP).

Lucas, B. D. and Kanade, T. (1981). An iterative image

registration technique with an application to stereo vi-

sion. In IJCAI81, pages 674–679.

Ngo, C.-W., Pong, T.-C., and Zhang, H.-J. (2003). Motion

analysis and segmentation through spatio-temporal

slices processing. IEEE Transactions on Image Pro-

cessing, 12(3):341–355.

Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vet-

terling, W. T. (1986). Numerical Recipes: The Art of

Scientiﬁc Computing. Cambridge University Press.

Srinivasan, M. V., Venkatesh, S., and Hosie, R. (1997).

Qualitative estimation of camera motion parame-

ters from video sequences. Pattern Recognition,

30(4):593–606.

Tomasi, C. and Kanade, T. (1991). Detection and tracking

of point features. Technical Report CMU-CS-91-132,

Carnegie Mellon University.