tasks, target features are a ubiquitous source of atten-
tion guidance (Einhauser et al., 2008). For complex
target objects in natural scenes, there are other fea-
tures that can drive visual attention. In (Kanan et al.,
2009), an appearance-based saliency model was de-
rived in a Bayesian framework. Responses of filters
derived from natural images using independent com-
ponent analysis (ICA) were used as the features. In
(Rao et al., 2002), targets and scenes were represented
as responses from oriented spatio-chromatic filters at
multiple scales, and saliency maps were computed
based on the similarity between a top-down iconic tar-
get representation and the bottom-up scene represen-
tation.
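To make the ICA features used in (Kanan et al., 2009) concrete, the following minimal sketch learns filters from flattened natural-image patches and takes their responses as features. It assumes scikit-learn's FastICA as a stand-in for the original training procedure; the patch matrix layout and filter count are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

def learn_ica_filters(patches, n_filters=64):
    """Learn ICA filters from natural-image patches.

    patches : array of shape (n_patches, N * N), one flattened
              (and roughly zero-mean) patch per row.
    Returns the fitted FastICA model; ica.components_ holds the
    learned filters, one per row.
    """
    ica = FastICA(n_components=n_filters, random_state=0)
    ica.fit(patches)
    return ica

def ica_responses(ica, patches):
    """Feature vector: responses of the ICA filters to each patch."""
    return ica.transform(patches)  # shape (n_patches, n_filters)
```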
A prevailing view is that bottom-up (BU) and top-down (TD) attention are combined to direct our attentional behav-
ior. An integration method should be able to ex-
plain when and how to attend to a top-down visual
item or skip it for the sake of a bottom-up salient
cue (Borji and Itti, 2013). In (Ehinger et al., 2009),
computational models of search guidance from three
sources, including bottom-up saliency, visual features
of target appearance, and scene context, were investi-
gated and combined by simple multiplication of the three components, as sketched below. In (Zelinsky et al., 2006), the pro-
portions of BU and TD components in a saliency-
based model were manipulated to investigate top-
down and bottom-up information in the guidance of
human search behavior. In (Navalpakkam and Itti,
2007), the top-down component, derived from accu-
mulated statistical knowledge of the visual features
of the desired target and background clutter, was used
to optimally tune the bottom-up maps such that the
speed of target detection is maximized.
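As a concrete illustration of the multiplicative integration in (Ehinger et al., 2009), the sketch below combines three guidance maps by pointwise multiplication. The per-map min-max normalization is an assumption made here for illustration, not necessarily the authors' exact scheme.

```python
import numpy as np

def combine_guidance(bottom_up, target_appearance, scene_context, eps=1e-12):
    """Multiplicative combination of three search-guidance maps.

    Each argument is a 2-D array over image locations. Maps are
    min-max normalized so no single source dominates by scale, then
    multiplied pointwise and renormalized to sum to one.
    """
    maps = (bottom_up, target_appearance, scene_context)
    normed = [(m - m.min()) / (m.max() - m.min() + eps) for m in maps]
    combined = normed[0] * normed[1] * normed[2]
    return combined / (combined.sum() + eps)
```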
A hierarchical Bayesian inference model for early
visual processing was proposed in (Lee and Mum-
ford, 2003). In this framework, the recurrent feed-
forward/feedback loops in the cortex serve to inte-
grate top-down contextual priors and bottom-up ob-
servations, effectively implementing concurrent prob-
abilistic inference along the visual hierarchy. It is well
known that the sizes of the receptive fields of neurons
increase dramatically as visual information traverses
successive visual areas along the two visual streams
(Serre et al., 2007; Tanaka, 1996). For example, the
receptive fields in V4 or the MT area are at least four
times larger than those in V1 at the corresponding ec-
centricities (Gattass et al., 1988), and the receptive
fields in the IT area tend to cover a large portion of
the visual field. This dramatic increase in receptive-
field sizes leads to a successive convergence of visual
information necessary for extracting invariance and
abstraction (e.g., translation and scaling), but it also
results in the loss of spatial resolution and fine details
in the higher visual areas (Lee and Mumford, 2003).
Inspired by the work of (Lee and Mumford, 2003) and the center-surround organization of receptive fields in the early visual cortex, we propose the hypothesis that neurons in the hierarchically organized visual cortex encode the conditional probability of observing visual variables in specific contexts.
To test this hypothesis, we developed a hierarchical Bayesian model of visual attention. We used a set of probability distributions (PDs) based on the independent components (ICs) of natural scenes in a hierarchical center-surround con-
figuration. The neurons at higher levels have larger re-
ceptive fields and lower resolutions, and provide local
contexts to the neurons at lower levels. We estimated
these PDs from natural scenes and derived measures
of BU visual saliency and TD attention, which can
be combined optimally. Finally, we conducted an ex-
tensive evaluation of this model and found that it is
a good predictor of human fixations in free-viewing
and object-searching tasks.
2 HIERARCHICAL BAYESIAN MODELING OF VISUAL ATTENTION
An input image is subsampled into a Gaussian pyra-
mid. The original image at scale 0 has the finest reso-
lution, and the subsampled image at the top scale has
the coarsest resolution. At any location in an image,
we sample a set of image patches of size $N \times N$ pixels at all levels of the pyramid. The local feature at scale $s$ is denoted as $F_s$. In this pyramid representation, the feature at scale $s + 1$ is the context of the feature at scale $s$, denoted $C_s$, as shown in Fig. 1. Thus, the nodes
in the higher levels have lower resolutions and larger
receptive fields than the nodes in the lower levels, and
provide the context for the features of the nodes in
the lower levels. It should be pointed out that the contextual patch and the object context are different: $C_s$ is the contextual patch of $F_s$, which may or may not include the object context. Generally, the contextual patches at higher levels are more likely to cover some object context, whereas a contextual patch at a lower level may contain only object features, as shown in Fig. 1. The hierarchical center-context structure is thus intended to capture both the object and its context features.
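A minimal sketch of this hierarchical center-context sampling is given below, assuming a simple blur-and-downsample Gaussian pyramid; the function names, boundary clamping, and the requirement that every level be at least $N \times N$ pixels are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, n_scales, sigma=1.0):
    """Build a Gaussian pyramid; scale 0 is the finest resolution."""
    pyramid = [image]
    for _ in range(1, n_scales):
        blurred = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(blurred[::2, ::2])  # subsample by a factor of 2
    return pyramid

def center_context_patches(pyramid, y, x, N):
    """Sample an N x N patch at the corresponding location of every scale.

    The patch F_s at scale s covers an image region twice as wide as
    the one at scale s - 1, so the patch at scale s + 1 acts as the
    contextual patch C_s of F_s.
    """
    half = N // 2
    patches = []
    for s, level in enumerate(pyramid):
        ys, xs = y // (2 ** s), x // (2 ** s)  # location at scale s
        # Clamp so the patch stays inside the image at this scale.
        ys = min(max(ys, half), level.shape[0] - half)
        xs = min(max(xs, half), level.shape[1] - half)
        patches.append(level[ys - half:ys + half, xs - half:xs + half])
    return patches
```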
The knowledge of a target object $O$ and its context includes the appearance features at all scales, $F_i$, and the location $X$. Assuming that the distribution of object features does not change with spatial location, we have
\[
P(F_0, F_1, \ldots, F_n, X) = P(F_0, F_1, \ldots, F_n)\, P(X). \tag{1}
\]
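One practical consequence of Eq. (1) is that the appearance term can be estimated once, independent of position, and combined with a separately specified location prior $P(X)$. A minimal sketch of this combination, with a uniform prior as an illustrative default:

```python
import numpy as np

def target_probability_map(appearance_prob, location_prior=None, eps=1e-12):
    """Per-location P(F_0, ..., F_n, X) under the independence of Eq. (1).

    appearance_prob : 2-D map of P(F_0, ..., F_n) at each location.
    location_prior  : 2-D map of P(X); defaults to a uniform prior.
    """
    if location_prior is None:
        location_prior = np.full_like(appearance_prob, 1.0 / appearance_prob.size)
    joint = appearance_prob * location_prior
    return joint / (joint.sum() + eps)
```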