REAL-TIME TEMPLATE BASED TRACKING WITH GLOBAL
MOTION COMPENSATION IN UAV VIDEO
Yuriy Luzanov and Todd Howlett
Air Force Research Laboratory, USAF, 525 Brooks Rd, Rome NY, 13441, USA
Mark Robertson
Keywords:
Kalman filter, frame alignment, target tracking, UAV video.
Abstract:
In this paper we describe the combination of a Kalman filter with global motion estimation between consecutive
frames, implemented to improve target tracking in the presence of the rapid camera motions encountered
in human-operated UAV-based video surveillance systems. The global motion estimation makes it possible to retain the
tight localization of the tracked targets provided by the Kalman filter. The original target template is selected by the
operator, and an SSD error measure is used to find the best match for the template in subsequent video frames.
1 INTRODUCTION
Unmanned Aerial Vehicles (UAVs) are used in mili-
tary and law enforcement for surveillance and recon-
naissance missions. There are several types of sensors
available on UAV platforms. But, video cameras are
the most common type of information gathering sen-
sors. At this time, real-time detection and tracking of
ground targets in UAV video is accomplished man-
ually. Automatic target detection and tracking algo-
rithms do not possess the level of reliability required
for real-time operation in the field.
There are many challenges in automatic real-time
tracking of ground targets in video from UAVs. To
name a few, the background is continuously changing
due to the motion of the airframe and motion of the
pan/tilt/zoom camera; the quality of the video is poor;
camera assemblies tend to undergo sudden rapid mo-
tions, primarily due to human operators, but also tur-
bulence and air frame maneuvering.
A popular technique for tracking targets in a video
stream is template matching. This paper discusses
an improvement to the real-time, template match-
ing based, ground target tracking algorithm in UAV
video.
A typical target tracking system consisting of a
template matching module and a Kalman filter was
implemented. The template matching module was
used to locate the target in each successive video
frame. The Kalman filter was used to predict the lo-
cation of the target based on its previous location in
the image and a target’s motion model. The search
region was restricted to a small window, centered at
the predicted location. This allows for a tight local-
ization of the target and the reduction of false positive
matches. However, when sudden and rapid changes occur
in the video, the predictions of the targets' locations
can have significant errors, often resulting in the failure
of the tracking algorithm.
To account for the effects of such rapid
motions in the video, it is proposed to use a frame
alignment algorithm to determine the global motion
between consecutive frames. The global motion is
applied as a control input to the Kalman filter. As
a result, the prediction of the location of the target
is significantly improved, increasing the tracking per-
formance. The system presented in this paper achieves
real-time performance, requiring 24 ms of computation
per 640x480 video frame on a Pentium IV class
workstation.
The paper is organized as follows. First, tem-
plate matching techniques used in the implementation
of the tracking algorithm are discussed in section 2.
Then, an implementation of the Kalman filter is dis-
cussed in section 3. The next section, 4, presents the
details of the frame alignment algorithm, describing
the integration with the Kalman filter. Finally, in sec-
tion 5, the results and conclusions are presented. Val-
ues for all of the parameters used in the implementa-
tion are presented in the Appendix.
2 TEMPLATE MATCHING
Template matching is used to determine the location
of the target in each frame of the video sequence. The
main assumptions made with this technique are that
there are no significant occlusions of the tracked tar-
get and the target’s appearance changes gradually.
The objects in the video change appearance over
time, so the initial template selected by the operator
in the first frame will become inaccurate after a period
of time. A simple approach to overcoming this
problem is to update the template after a certain
number of frames have been processed. The main difficulty
is finding the proper update rate. If the rate is too
high, i.e. the template is updated too often, errors
accumulate quickly, resulting in a drift away from
the object originally selected by the operator. On the
other hand, when the update rate is too low, the target
will be lost due to dissimilarities between its current
appearance and the template. Several techniques
have been proposed to determine when and
how to update the template (Matthews et al., 2004).
Here the template is simply replaced every p frames
by the best matched region. The update rate, 1/p, is
determined experimentally.
Let I(x, y, t) denote the pixel intensity at location
(x, y) in the video frame of size N × M pixels at time
t, where x ∈ [0, N − 1] and y ∈ [0, M − 1]. The template
is initialized by the operator at time t = 0. Let
S(i, j, l) be the pixel intensity at location (i, j) of a
template extracted from the (lp)-th frame, I(x, y, lp), where l
and p are nonnegative integers. The size of the template
S(i, j, l) is K × K pixels, so that i, j ∈ [0, K − 1].
To eliminate any ambiguities in determining the center
of S(i, j, l), K is picked to be an odd positive integer.
Using the Sum of Squared Difference (SSD) error
measure, the error between the template S(i, j, l) and
the image I(x, y, t) at point (a, b) can be written as follows:

e = Σ_{i=0}^{K-1} Σ_{j=0}^{K-1} ( S(i, j, l) - I(a - K/2 + i, b - K/2 + j, t) )^2    (1)

where K/2 is rounded down to the nearest integer.
By computing the error in equation (1) for every
pixel of the frame at time t and finding the minimum
error, the best matched region could be determined.
However, this approach would result in long computation
times and poor localization of the target due to false
positive matches. One solution is to define a region
R_s of size W × W pixels, centered at the most probable
location of the target; the search for the best
match is then carried out only within R_s. In the
implementation, W ranges between W_min and W_max and is
determined by how fast the target moves in the image.
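To make the search concrete, the following is a minimal sketch of the SSD search of equation (1) restricted to a W × W region R_s, assuming 8-bit grayscale frames stored as NumPy arrays indexed as frame[y, x]; the function name ssd_match and its interface are illustrative choices, not from the paper.

```python
import numpy as np

def ssd_match(frame, template, center, W):
    """Exhaustive SSD search (equation (1)) for `template` inside a
    W x W search region R_s of `frame` centered at `center` = (cx, cy)."""
    K = template.shape[0]                # template is K x K, K odd
    half = K // 2                        # K/2 rounded down, as in equation (1)
    tmpl = template.astype(np.float64)
    cx, cy = center
    best_err, best_pos = np.inf, center
    for a in range(cx - W // 2, cx + W // 2 + 1):      # candidate x
        for b in range(cy - W // 2, cy + W // 2 + 1):  # candidate y
            if a - half < 0 or b - half < 0:
                continue                 # candidate region leaves the frame
            patch = frame[b - half:b + half + 1, a - half:a + half + 1]
            if patch.shape != tmpl.shape:
                continue                 # clipped at the right/bottom border
            diff = tmpl - patch.astype(np.float64)
            err = float(np.sum(diff * diff))           # SSD error, equation (1)
            if err < best_err:
                best_err, best_pos = err, (a, b)
    return best_pos, best_err
```

The best matched region returned by such a search is also what replaces the template every p frames under the update rule described above.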
3 KALMAN FILTER
In order to approximate the next location of the target
in the video a Kalman filter is employed (Welch and
Bishop, 2004). A basic Kalman filter was found to
be sufficient for this application. When the operator
initializes the template in the first frame, by select-
ing a point inside the target region, a Kalman filter
is also initialized. The point, originally selected by
the operator, will be tracked through the video. The
discrete-time state-space representation of the linear
process, the state of which is being estimated, can be
expressed as:
x_k = A x_{k-1} + B u_{k-1} + w_{k-1}    (2)

z_k = H x_k + v_k    (3)
In equation (2) the state vector, x ∈ R^4, contains the
position and the velocity of the 2D point being tracked.
The state transition matrix, A ∈ R^{4×4}, relates the previous
state, x_{k-1}, to the current state, x_k, where k is a
nonnegative integer. The control input matrix, B ∈ R^{4×2},
provides coupling between the control input vector,
u ∈ R^{2×1}, and the state. The control input is the 2D
global displacement of the image. In equation (3),
z ∈ R^2 is the measurement vector. The measurement
is the 2D position of the target's point. The observation
matrix, H ∈ R^{2×4}, relates the state to the measurement.
The random variable w represents the process
noise and the random variable v represents the measurement
noise. They are independent, white and normally
distributed:

w ~ N(0, Q)    (4)

v ~ N(0, R),    (5)

where Q ∈ R^{4×4} is the process noise covariance matrix
and R ∈ R^{2×2} is the measurement noise covariance
matrix. Both Q and R are assumed to be constant.
A Kalman filter consists of two stages: the prediction
and the correction. In the prediction stage an
estimate of the current state, x̂_k, and an estimate of the
current error covariance matrix, P̂_k, are made:

x̂_k = A x_{k-1} + B u_{k-1}    (6)

P̂_k = A P_{k-1} A^T + Q,    (7)

where the error covariance matrix P ∈ R^{4×4}. The correction
stage uses the measurement z_k to refine the prediction
and compute the Kalman gain K_k, the current
state x_k and the error covariance P_k:

K_k = P̂_k H^T (H P̂_k H^T + R)^{-1}    (8)

x_k = x̂_k + K_k (z_k - H x̂_k)    (9)

P_k = (I - K_k H) P̂_k    (10)

For each frame the following computations are performed.
First, a prediction of the target's location is
made using (6) and (7). Then the search region R_s is
initialized with the predicted target position from x̂_k.
Now the template matching is performed within the region
R_s, and the center of the best match is used as the
measurement z_k to correct the filter with equations (8), (9)
and (10).
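As an illustration, the predict/correct cycle of equations (6)-(10) can be written compactly with NumPy as below; the class name and interface are illustrative choices, and the matrices A, B, H, Q, R are those of equations (2)-(5), with the concrete values given in the Appendix.

```python
import numpy as np

class PointKalmanFilter:
    """Sketch of the Kalman filter of Section 3 for a tracked 2D point."""
    def __init__(self, A, B, H, Q, R, x0, P0):
        self.A, self.B, self.H, self.Q, self.R = A, B, H, Q, R
        self.x, self.P = x0, P0          # state and error covariance

    def predict(self, u):
        # Equations (6) and (7): project the state and covariance ahead,
        # with the global image displacement u as the control input.
        self.x = self.A @ self.x + self.B @ u
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x

    def correct(self, z):
        # Equations (8)-(10): fold in the template-matching measurement z.
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain, eq. (8)
        self.x = self.x + K @ (z - self.H @ self.x)    # state update, eq. (9)
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P  # eq. (10)
        return self.x
```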
The motion of the object in the video is usually
linear, except when the camera undergoes sudden and
rapid movements. If that happens the assumption of
linearity is violated and the Kalman filter fails.
4 ROBUST FRAME ALIGNMENT
Frame alignment is used to compute the global mo-
tion and supply the control input to the Kalman filter.
Section 4.1 provides the formulation of the alignment
algorithm, after which Section 4.2 shows how to make
the formulation robust to outliers and data not well
modeled by the original formulation.
4.1 Aligning Background
We denote the intensity of pixel (x, y) at time t as
I(x, y,t). To relate the pixels of two frames together,
one can apply the intensity constraint
I(x, y, t) = I(x + Δx, y + Δy, t − 1),    (11)

which, if we approximate with the linear terms of a
Taylor approximation, yields

Δx (∂/∂x) I(x, y, t) + Δy (∂/∂y) I(x, y, t) − (∂/∂t) I(x, y, t) ≈ 0    (12)
The motion at pixel (x, y) is (Δx, Δy). For notational
convenience, we define f_x, f_y, and f_t as the partial
derivatives in (12). These partial derivatives are
implemented as digital approximations of horizontal,
vertical, and temporal gradients. The various quantities
are subscripted with an i to denote their values
at the i-th pixel.
By imposing a global motion model at each pixel
location (x_i, y_i) in the image, we can parameterize
the motion to make a solution at each pixel more
tractable. A two-parameter motion model that accounts
for translation in the horizontal and vertical
directions would be

Δx_i = a_13,    Δy_i = a_23    (13)
A weighted LS derivation yields the following system
of equations:

[ Σ_i w_i f_{x_i}^2         Σ_i w_i f_{x_i} f_{y_i} ] [ a_13 ]   [ Σ_i w_i f_{x_i} f_{t_i} ]
[ Σ_i w_i f_{x_i} f_{y_i}   Σ_i w_i f_{y_i}^2       ] [ a_23 ] = [ Σ_i w_i f_{y_i} f_{t_i} ]    (14)

where w_i is the weight for the i-th pixel position. These
equations are easily solved for the two motion param-
eters. More complex global motion models are also
possible, but at the cost of additional computational
time.
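A minimal sketch of this solve follows, assuming f_x, f_y and f_t are NumPy arrays of per-pixel horizontal, vertical and temporal gradient estimates and w is the weight array; the function name estimate_translation is an illustrative choice.

```python
import numpy as np

def estimate_translation(fx, fy, ft, w):
    """Solve the 2x2 weighted LS system of equation (14) for (a13, a23)."""
    fx, fy, ft, w = (arr.ravel().astype(np.float64) for arr in (fx, fy, ft, w))
    M = np.array([[np.sum(w * fx * fx), np.sum(w * fx * fy)],
                  [np.sum(w * fx * fy), np.sum(w * fy * fy)]])
    b = np.array([np.sum(w * fx * ft), np.sum(w * fy * ft)])
    a13, a23 = np.linalg.solve(M, b)   # global translation (dx, dy)
    return a13, a23
```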
4.2 Making Alignment Robust
Frame alignment as described in the previous subsec-
tion works well when the two frames are well mod-
eled by the particular global motion model in use.
However, even when the underlying background is
well modeled, sometimes large errors can still result.
For example, if there are objects moving indepen-
dently in the scene, their motion will not match that
of the background, and large errors will occur in these
positions. Similarly, if there are burned-in metadata
on the frames, they will not move according to the
background, and large errors will result as well. To
make the frame alignment procedure more robust to
such errors, we employ an iterative reweighted LS
scheme (Odobez and Bouthemy, 1994). On iteration
k = 0, when there is no knowledge about errors, all
weights are chosen as one, w_i^{(0)} = 1. The motion
parameters (for whichever model is being used) are
solved to yield motion estimates for the k-th iteration,
(Δx_i^{(k)}, Δy_i^{(k)}). The residual error for location i at
iteration k is then computed as
r_i^{(k)} = I(x_i, y_i, t) − I(x_i + Δx_i^{(k)}, y_i + Δy_i^{(k)}, t − 1)    (15)
The residual error represents the difference between
the i-th pixel value of frame t and the pixel value at
the location in frame t − 1 to which we believe (at
iteration k) it has moved. If we use the Tukey biweight
function (Odobez and Bouthemy, 1994), we then pick
the weight for the next iteration k + 1 as:
√(w_i^{(k+1)}) = C^2 − [r_i^{(k)}]^2    if |r_i^{(k)}| < C
               = 0                      otherwise            (16)
This weight is then used in the weighted LS formulation
to get an estimate of the motion at iteration k + 1,
after which the procedure continues iteratively until a
convergence criterion is met or else some fixed number
of iterations have completed. A similar weighting
scheme to (16) is given as
√(w_i^{(k+1)}) = C − |r_i^{(k)}|    if |r_i^{(k)}| < C
               = 0                  otherwise            (17)
which makes use of an absolute value rather than a
square, and may be slightly faster on some architec-
tures.
Reweighting according to (16) or (17) is equiva-
lent to a formulation that minimizes not the sum of
the squares of the error as in LS, but rather the sum of
a function of the errors.
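A sketch of the reweighting loop, reusing estimate_translation from the sketch above, is given below. For brevity the residual of equation (15) is approximated linearly from the gradients (cf. equation (12)) instead of resampling frame t − 1, and the absolute-value weights of equation (17) are used; both simplifications are assumptions of this sketch.

```python
import numpy as np

def robust_translation(fx, fy, ft, C=40.0, iters=5):
    """Iterative reweighted LS estimate of the global translation,
    following equations (15)-(17)."""
    w = np.ones_like(fx, dtype=np.float64)   # w_i^(0) = 1: no error knowledge
    dx = dy = 0.0
    for _ in range(iters):
        dx, dy = estimate_translation(fx, fy, ft, w)
        # Linearized residual r_i^(k), cf. equations (12) and (15).
        r = dx * fx + dy * fy - ft
        # Weighting of equation (17): sqrt(w) = C - |r| for |r| < C, else 0.
        sqrt_w = np.clip(C - np.abs(r), 0.0, None)
        w = sqrt_w * sqrt_w
    return dx, dy
```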
Once the global motion, (Δx, Δy), between the last
and the current frames is estimated, it is applied to
the Kalman filter prediction step as the control input,
u_{k-1} = [Δx, Δy]^T for k ≥ 1, in equation (6).
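Composing the earlier sketches, one tracking cycle per frame could then look like the following; track_frame is an illustrative name, and the gradient arrays are assumed to be computed between the current and previous frames.

```python
import numpy as np

def track_frame(kf, frame, template, fx, fy, ft, W, C=40.0):
    """One frame of tracking: global motion -> predict -> match -> correct."""
    dx, dy = robust_translation(fx, fy, ft, C)           # Section 4
    pred = kf.predict(np.array([dx, dy]))                # equation (6)
    center = (int(round(pred[0])), int(round(pred[1])))  # predicted (x, y)
    (zx, zy), _ = ssd_match(frame, template, center, W)  # Section 2
    kf.correct(np.array([float(zx), float(zy)]))         # equations (8)-(10)
    return zx, zy                                        # measured location
```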
5 RESULTS AND CONCLUSIONS
The algorithm was implemented on a 3.0 GHz Intel
Xeon workstation in a single process, without using
any special instructions. Real-time performance was
achieved. On average, 24 ms was required to process
a single 640x480 frame of video. This includes the
computation of the global motion between the current
and the previous frame, Kalman filter prediction and
correction, and template matching.
As part of the test, sample UAV videos from
the DARPA-sponsored VIVID program were used
(Collins et al., 2005). Figure 1 illustrates the algorithm
tracking a car in three consecutive frames in the
presence of significant camera motion.

Figure 1: Tracking a car in consecutive frames, left to right.
As a result of using the global motion compensa-
tion the tracker was able to maintain the track of the
target using a small search region.
There are certain advantages to employing
global motion compensation. This is most apparent in
situations with complex scenes, e.g. urban scenes, where
multiple objects in the video frame have similar
appearance. In order to successfully track a target, a
small search window is used to avoid false positive
matches.
The disadvantage of using global motion compensation
is its computational cost. If the complexity of the
scene permits, i.e. when the target has a unique appearance,
it may be faster to use a larger search area
rather than to compute the global motion.
REFERENCES
Collins, R., Zhou, X., and Teh, S. K. (2005). An open
source tracking testbed and evaluation web site. In
IEEE International Workshop on Performance Evalu-
ation of Tracking and Surveillance.
Matthews, I., Ishikawa, T., and Baker, S. (2004). The tem-
plate update problem. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 26(4):810–815.
Odobez, J.-M. and Bouthemy, P. (1994). Robust multireso-
lution estimation of parametric motion models applied
to complex scenes. Publication Interne IRISA 788,
Institut de Recherche en Informatique et Systèmes
Aléatoires.
Welch, G. and Bishop, G. (2004). An introduction to the
Kalman filter. Technical Report TR 95-041, University
of North Carolina at Chapel Hill.
APPENDIX
The following are the values for the parameters used
in the implementation of the algorithms.
For template matching, 11 × 11 templates were
used, thus K = 11 and K/2 = 5. The size of the search
area R_s was between W_min = 10 and W_max = 32 pixels.
The template was updated every 15 frames, thus
p = 15.
In the implementation of the Kalman filter the fol-
lowing state-space matrices were employed:
A = [ 1 0 1 0        B = [ 1 0
      0 1 0 1              0 1
      0 0 1 0              0 0
      0 0 0 1 ],           0 0 ]    (18)

H = [ 1 0 0 0
      0 1 0 0 ]    (19)
The process noise and measurement noise covariance
matrices were:
Q = 0.01 I_{4×4},    R = I_{2×2}    (20)
The filter is initialized with x_{k-1}, P_{k-1} and u_{k-1} at
k = 0; thus the initial values are x_{-1}, P_{-1} and u_{-1}, and
they are:

x_{-1} = [ x_init  y_init  0  0 ]^T    (21)

P_{-1} = 10 I_{4×4}    (22)

u_{-1} = 0    (23)
In equation (21) the values x_init and y_init are provided
by the operator's initial selection of the target.
The Tukey biweight parameter was C = 40.
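For reference, a sketch of these parameter values as NumPy arrays is given below; the operator selection (x_init, y_init) is an assumed example value, and the filter class is the illustrative PointKalmanFilter sketched in Section 3.

```python
import numpy as np

K, p = 11, 15                       # template size and update period
W_min, W_max, C = 10, 32, 40.0      # search-window bounds, Tukey parameter

A = np.array([[1, 0, 1, 0],         # equation (18)
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=np.float64)
B = np.array([[1, 0],
              [0, 1],
              [0, 0],
              [0, 0]], dtype=np.float64)
H = np.array([[1, 0, 0, 0],         # equation (19)
              [0, 1, 0, 0]], dtype=np.float64)
Q, R = 0.01 * np.eye(4), np.eye(2)  # equation (20)

x_init, y_init = 320.0, 240.0       # example operator selection
x0 = np.array([x_init, y_init, 0.0, 0.0])  # equation (21)
P0 = 10.0 * np.eye(4)               # equation (22)
u0 = np.zeros(2)                    # equation (23)

kf = PointKalmanFilter(A, B, H, Q, R, x0, P0)
```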