all the features obtain positive weights even if they
may erode visual attention. As we have shown in our
previous work, nonlinear fusion of feature maps
is more biologically plausible. In (Kouchaki,
Nasrabadi and Maghooli, 2011), we proposed a
novel nonlinear strategy that fuses the three
conspicuity maps through a Fuzzy Inference System,
which detected the desired object better than the
basic saliency model. However, that method combines
only the three conspicuity maps rather than the 42
feature maps, whose direct fusion could be more
effective.
Moreover, in (Bahmani, Nasrabadi and Hashemi
Golpayegani, 2008), a multiplicative fusion
approach was proposed, which multiplies the 42
feature maps after weighting them purposefully
as in (Itti and Koch, 2001). Although a
remarkable improvement was achieved, the simplest
nonlinear function was employed. Unquestionably,
the real biological system of visual attention is more
complicated than a simple multiplication. In this
study, we try to model the biological process by
which the saliency map is formed from the 42
feature maps. To obtain a nonlinear function that
describes the formation of the saliency map from
the 42 feature maps more plausibly, we chose an
Artificial Neural Network (ANN), as it
considerably resembles the biological neural
network.
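The difference between linear and multiplicative fusion discussed above can be illustrated with a minimal sketch. It is not the formulation of any of the cited works: the weighted-product form, the function names and the small constant `eps` are assumptions made only for illustration.

```python
import numpy as np

def linear_fusion(maps, weights):
    """Weighted sum of feature maps: sum_i w_i * F_i."""
    return np.tensordot(weights, maps, axes=1)

def multiplicative_fusion(maps, weights, eps=1e-6):
    """Weighted product of feature maps: prod_i (F_i + eps) ** w_i.

    Assumed form for illustration; eps only avoids log(0).
    """
    return np.exp(np.tensordot(weights, np.log(maps + eps), axes=1))

# Two toy 1x2 feature maps: both respond at the first location,
# only the first map responds at the second location.
demo_maps = np.array([[[1.0, 1.0]],
                      [[1.0, 0.0]]])
demo_weights = np.array([1.0, 1.0])

linear_out = linear_fusion(demo_maps, demo_weights)
product_out = multiplicative_fusion(demo_maps, demo_weights)
```

With positive weights, the linear sum still reports a response at the second location, whereas the product suppresses any location at which one of the maps is silent.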
After extracting the 42 feature maps with Itti’s
model (Itti et al., 1998), we applied them as the
inputs of the network. The 42 feature maps were
weighted automatically during training, with the
target masks serving as the desired output of the
network. In effect, we compensated for the
insufficiency of bottom-up information by using
target information as top-down cues to adjust the
weights.
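The idea of learning the weights of the 42 feature maps from a target mask can be sketched as follows. This is a deliberately simplified stand-in, not the network used in this paper: it trains a single sigmoid unit per pixel by gradient descent, and the function names, learning rate, number of epochs and synthetic data are all assumptions for illustration.

```python
import numpy as np

def train_fusion_weights(feature_maps, target_mask, lr=0.3, epochs=300):
    """Fit one weight per feature map so the fused map matches the mask.

    feature_maps: array of shape (K, H, W), values in [0, 1]
    target_mask:  array of shape (H, W), values in {0, 1}
    """
    k = feature_maps.shape[0]
    x = feature_maps.reshape(k, -1).T          # one K-vector per pixel
    y = target_mask.reshape(-1).astype(float)  # desired output per pixel
    w = np.zeros(k)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # sigmoid output
        grad = p - y                            # cross-entropy gradient
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def fuse(feature_maps, w, b):
    """Apply the learned weights to obtain a saliency-like map in (0, 1)."""
    z = np.tensordot(w, feature_maps, axes=1) + b
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic example: only feature map 0 is related to the target.
rng = np.random.default_rng(0)
maps = rng.random((42, 32, 32))
mask = maps[0] > 0.5
w, b = train_fusion_weights(maps, mask)
saliency = fuse(maps, w, b)
```

In this toy setting the largest learned weight falls on the map that actually predicts the target, which is the sense in which the target mask acts as a top-down cue.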
The rest of this paper is organised as follows.
Section 2 reviews the basic bottom-up model of
visual attention. Section 3 presents our
methodology. The details of the Variadic Neural
Network structure are discussed in section 4.
Section 5 presents the details of our proposed model.
Experimental results are discussed in section 6.
Finally, section 7 concludes the paper.
2 THE BASIC BOTTOM-UP MODEL
This section describes how the bottom-up saliency
map proposed by Itti et al. (1998) is computed.
When an image is presented to Itti’s model, it is
first low-pass filtered. Dyadic Gaussian pyramids
then generate different spatial scales in three
channels: colour, intensity and orientation. These
pyramids subsample the input colour image at
successive scales. Next, the feature maps are
constructed in each of the three channels with the
“centre-surround” operation: point-by-point
subtraction between fine-scale and coarse-scale
images yields 42 feature maps, consisting of
12 colour, 6 intensity and 24 orientation maps.
The feature maps in each channel are linearly
fused into a conspicuity map, which gives three
conspicuity maps, one for each of the three
features. Finally, a linear combination of the
three conspicuity maps forms the saliency map,
which is based only on bottom-up cues.
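The centre-surround step and the count of 42 maps can be made concrete with a short sketch. The centre scales c in {2, 3, 4} and scale differences of 3 and 4 follow Itti et al. (1998); everything else is an assumption for illustration, in particular the 2x2 block averaging used in place of Gaussian filtering, the nearest-neighbour upsampling, and the function names.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by 2x2 block averaging.

    Crude stand-in for Gaussian low-pass filtering plus decimation.
    """
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def pyramid(img, levels=9):
    """Dyadic pyramid: level 0 is the input, each level halves the size."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

def upsample_to(img, shape):
    """Nearest-neighbour upsampling of a coarse map to a finer shape."""
    rows = np.arange(shape[0]) * img.shape[0] // shape[0]
    cols = np.arange(shape[1]) * img.shape[1] // shape[1]
    return img[np.ix_(rows, cols)]

def centre_surround(pyr, centres=(2, 3, 4), deltas=(3, 4)):
    """Point-by-point |centre - surround| for each (c, c + delta) pair."""
    maps = []
    for c in centres:
        for d in deltas:
            surround = upsample_to(pyr[c + d], pyr[c].shape)
            maps.append(np.abs(pyr[c] - surround))
    return maps

# A 256x256 input gives nine levels, from 256x256 down to 1x1.
img = np.random.default_rng(0).random((256, 256))
intensity_maps = centre_surround(pyramid(img))  # 6 maps per pyramid

# 1 intensity pyramid, 2 colour-opponency pyramids, 4 orientation pyramids.
n_maps = len(intensity_maps) * (1 + 2 + 4)
```

Six centre-surround pairs per pyramid, applied to one intensity, two colour-opponency and four orientation pyramids, give 6 + 12 + 24 = 42 feature maps.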
3 METHODOLOGY
In this study, we aim to address some of the
computational weaknesses of bottom-up visual
attention models for object detection. To this
end, we designed a nonlinear fusion kernel that
combines the 42 feature maps and can reflect the
biological details of how the saliency map is
formed. Furthermore, the feature maps should be
weighted purposefully to suit object detection
(Walther, 2006). We therefore considered an
Artificial Neural Network a good choice for
nonlinear fusion of the 42 feature maps, as it
resembles the biological neural network. At first,
however, combining 42 large images through a
neural network seemed infeasible because of the
size of the inputs. We eventually found the
Variadic Neural Network (McGregor, 2007) to be a
suitable network that meets our needs, since it
accepts n-dimensional vectors as its inputs.
Top-down information can be incorporated into
the model by training the network with the
available target masks. The supervised neural
network can thus be trained with target
information to weight the 42 feature maps
purposefully. Searching for a desired object that
is already known to the viewer is easier and
faster than searching for it without prior
knowledge. Moreover, learning ability is an
important factor in modelling human visual
attention; we account for it by training the
network. The proposed visual attention structure
is illustrated in Figure 1. As shown, after deriving 42
VISAPP 2012 - International Conference on Computer Vision Theory and Applications