In this paper, we introduce a novel visual attention
model learned from fixation data captured by an eye-
tracker. Based on this model, we predict the parts of
surveillance videos that are likely to attract visual at-
tention. Example prediction results for a single frame
are depicted in Figure 1. We use these predictions to
adapt the playback velocity of surveillance videos ac-
cording to the visual saliency of the frames. Uninter-
esting parts are accelerated while periods that show
high visual saliency are presented in slow-motion.
Hence, the time required for analyzing a sequence as
well as boredom are reduced while operators can keep
track of relevant activities.
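As a minimal sketch of this idea, assume a per-frame saliency score in [0, 1] has already been computed; the mapping below, including the hypothetical speed limits v_min and v_max, is purely illustrative and not the parameterization used in our system:

import numpy as np

def playback_speed(saliency, v_min=0.5, v_max=8.0):
    # Map a per-frame saliency score in [0, 1] to a playback speed factor.
    # Highly salient frames approach slow motion (speed near v_min < 1),
    # uninteresting frames are fast-forwarded (speed near v_max).
    # The mapping is linear here; any monotonically decreasing mapping would do.
    saliency = np.clip(saliency, 0.0, 1.0)
    return v_max - saliency * (v_max - v_min)

# Example: a sequence with a salient event in the middle.
scores = np.array([0.05, 0.1, 0.9, 0.95, 0.2])
print(playback_speed(scores))  # fast at the ends, slow in the middle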
1.1 Visual Attention Models
The guided search model by (Wolfe, 1994) claims that
attention is guided exogenously (i.e., based on the properties of visual stimuli; bottom-up) as well as endogenously (i.e., based on the demands of the observer; top-down). (Jasso and Triesch, 2007) explain in more de-
tail that “bottom-up mechanisms are frequently char-
acterized as automatic, reflexive, and fast, requiring
only a comparatively simple analysis of the visual
scene, top-down mechanisms are thought of as more
voluntary and slow, requiring more complex infer-
ences or the use of memory”. According to Wolfe,
early vision stages separate the visual stimuli into dif-
ferent feature maps. Each feature map contains a dif-
ferent feature channel, such as color, orientation, mo-
tion, or size. The feature maps are combined by a
weighted sum into a single activation map, where the
bottom-up activation represents a measure of how un-
usual a feature is compared to its vicinity (for each
feature map). In contrast, the top-down activation
emphasizes the features in which the subject is interested (e.g., a request for blue objects). The activa-
tion map determines which location receives attention
(winner-take-all mechanism) and in which order: first
the global maximum, then the second maximum, and
so on (inhibition-of-return). The bottom-up activation
depends neither on the knowledge of the user nor on the search task.
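The following sketch illustrates the combination and selection steps in the abstract; the feature maps, weights, and inhibition radius are placeholders for illustration and not part of Wolfe's model or of our implementation:

import numpy as np

def attended_locations(feature_maps, weights, n_fixations=3, inhibit_radius=2):
    # Combine feature maps into a single activation map by a weighted sum and
    # return attended locations in decreasing order of activation
    # (winner-take-all with inhibition-of-return).
    activation = sum(w * f for w, f in zip(weights, feature_maps))
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(activation), activation.shape)
        fixations.append((int(y), int(x)))
        # Inhibition-of-return: suppress a neighborhood of the current winner.
        y0, y1 = max(0, y - inhibit_radius), y + inhibit_radius + 1
        x0, x1 = max(0, x - inhibit_radius), x + inhibit_radius + 1
        activation[y0:y1, x0:x1] = -np.inf
    return fixations

# Toy example with two 8x8 feature maps (e.g., color contrast and motion).
rng = np.random.default_rng(0)
maps = [rng.random((8, 8)), rng.random((8, 8))]
print(attended_locations(maps, weights=[0.4, 0.6]))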
Different visual attention models have been developed to estimate the areas that attract attention. Most of these models are based on bottom-up cues (Itti
and Koch, 2001). One issue concerning such models
is that these “saliency models do not accurately pre-
dict human fixations” (Judd et al., 2009). Therefore,
learned models were proposed. For instance, (Judd
et al., 2009) utilize a linear support vector machine
to train a model of visual saliency including low-level
(e.g., intensity, orientation, color contrast), mid-level
(horizon line detector), and high-level features (face
detector, people detector) to combine bottom-up sig-
nal cues and semantic top-down cues.
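For illustration, a learned combination of this kind can be sketched as a linear SVM trained on per-pixel feature vectors labeled as fixated or not; the synthetic data and the use of scikit-learn's LinearSVC are assumptions of this example, not the setup of (Judd et al., 2009):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Assumed data: one row per sampled pixel, columns are low-, mid-, and
# high-level feature responses (e.g., intensity, orientation, color contrast,
# horizon line, face/person detector scores); labels mark fixated pixels.
X_train = rng.random((1000, 6))
y_train = rng.integers(0, 2, size=1000)  # 1 = fixated, 0 = not fixated

svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)

# The signed distance to the hyperplane serves as a saliency score, and the
# learned weight vector shows the contribution of each feature channel.
scores = svm.decision_function(rng.random((5, 6)))
print(scores, svm.coef_)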
(Itti, 2005) presents an approach to calculate
bottom-up saliency of video data. He collects eye-
tracking data of subjects, and creates saliency maps
using a computational model that considers low-level
features. He further finds that motion and temporal features are more important than color, in-
tensity, and orientation. However, the best predic-
tions are achieved by a combination of all these fea-
tures. (Davis et al., 2007) train a focus-of-attention
model to create pathways for PTZ (pan/tilt/zoom)
cameras. Their model utilizes a single feature, trans-
lating motion, to capture the amount of activity.
(Kienzle et al., 2007) train a feed-forward neural
net with sigmoid basis functions. In their approach,
the video is smoothed spatially and filtered tempo-
rally. Training of the neural net optimizes the tem-
poral filters together with their weights. Another ap-
proach (Nataraju et al., 2009) combines a modified
version of Kienzle’s method with the visual attention
model of (Itti et al., 1998), which is based on saliency
maps. This approach uses a neural net to train the co-
efficients of three low-level descriptors (color inten-
sity, orientation, and motion).
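A rough sketch of the forward pass of such a model at a single pixel is given below; the filter length, the number of hidden units, and all parameters are illustrative and do not reproduce the training procedures of the cited works:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency_at_pixel(intensity_window, temporal_filters, output_weights, bias):
    # Forward pass of a small feed-forward net with sigmoid basis functions.
    # intensity_window : spatially smoothed intensities at one pixel over the
    #                    last T frames, shape (T,)
    # temporal_filters : one learned temporal filter per hidden unit, shape (H, T)
    # output_weights   : output weights of the hidden units, shape (H,)
    hidden = sigmoid(temporal_filters @ intensity_window)  # temporal filtering + nonlinearity
    return float(output_weights @ hidden + bias)

# Toy example: 10-frame window, 4 hidden units with random (untrained) parameters.
rng = np.random.default_rng(0)
print(saliency_at_pixel(rng.random(10), rng.standard_normal((4, 10)),
                        rng.standard_normal(4), 0.0))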
In contrast to the above-mentioned methods, the
model we introduce in this paper is not restricted
to a single feature/channel (Kienzle et al., 2007), a
saliency map from a single feature/channel (Davis
et al., 2007), or a set of predefined channels (Nataraju
et al., 2009). Our learning approach is based on tem-
poral and spatial rectangle features and can thus rep-
resent rather arbitrary channels, such as lightness con-
trast, color contrast, motion, orientation, and symme-
try. This means we do not require manually modeled channels; instead, we learn the bottom-up cues from train-
ing data. Further, the contribution of each feature to
the final saliency map is determined by the training
process. Hence, two important issues with channel-
based saliency maps are addressed: the selection of
features as well as their weights. Note that our ap-
proach also covers top-down mechanisms, such as the
cues learned by (Judd et al., 2009): face and peo-
ple detectors. Such high-level features are implic-
itly learned by our method. However, our approach
does not consider the top-down mechanisms originat-
ing from memory effects.
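To make the notion of rectangle features concrete, the sketch below evaluates a simple two-rectangle contrast feature using an integral image (summed-area table); the concrete feature layout and coordinates are illustrative and not the feature set used in our experiments:

import numpy as np

def integral_image(frame):
    # Summed-area table with a zero row/column so box sums become four lookups.
    ii = np.zeros((frame.shape[0] + 1, frame.shape[1] + 1))
    ii[1:, 1:] = frame.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    # Sum of frame[y0:y1, x0:x1] in constant time.
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def two_rect_feature(ii, y0, x0, y1, x1):
    # Difference between the left and right halves of a rectangle (a spatial
    # contrast feature); a temporal rectangle feature would instead compare
    # the same box across integral images of consecutive frames.
    xm = (x0 + x1) // 2
    return box_sum(ii, y0, x0, y1, xm) - box_sum(ii, y0, xm, y1, x1)

frame = np.random.default_rng(0).random((64, 64))
ii = integral_image(frame)
print(two_rect_feature(ii, 10, 10, 30, 40))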
The main contribution of this paper is to show that visual attention (modeled by a classifier that
is trained on eye-tracking data) is an excellent mea-
sure of relevance for adaptive video fast-forward. Fur-
ther, we introduce a novel method to learn a visual at-
tention model and show that this model is able to pro-
vide proper relevance feedback for surveillance video