ROTATIONAL INVARIANCE AT FIXATION POINTS
Experiments using Human Gaze Data
Johannes Steffen, Christian Hentschel, Afra’a Ahmad Alyosef,
Klaus Toennies and Andreas Nuernberger
Otto-von-Guericke University Magdeburg, Magdeburg, Germany
Keywords:
Eye-Tracking, Early Vision.
Abstract:
An important aspect in machine vision concerns the extraction of meaningful patterns at salient image regions.
Invariance w.r.t. affine transformations has usually been claimed to be a crucial attribute of these regions.
While continuing research on the human visual cortex has suggested the correctness of these assumptions,
at least in later stages of vision, only recently has the availability of accurate and inexpensive eye tracking
devices made it possible to provide empirical evidence for these claims. We present an experimental setting
suited to analysing various assumptions about human gaze target properties. The proposed setting aims at
reducing high-level influence on the fixation process as much as possible. As a proof of concept we present
results for the assumption that human fixation targeting is rotationally invariant. Even though high-level aspects
could not be completely suppressed, we were able to detect and analyse this relation in the gaze data. It was
found that there is a significant correlation between fixated regions within stimuli over different orientations.
1 INTRODUCTION
By analysing the visual field at different resolutions
and sampling points, the human visual system (HVS)
is able to rapidly perceive and understand a com-
plex visual stimulus. This ability is yet unmatched
by any technical approach to this problem. Being
able to understand the methodologies the HVS em-
ploys should help to improve existing approaches to
machine driven image analysis and understanding.
While it is still not fully known which information is
processed at which level of resolution, various studies
have given evidence that certain tasks such as read-
ing and visual search require a higher resolution than
others such as simple scene classification. The ma-
chine vision community has mainly focused on the
extraction of local features at salient regions in im-
ages. Algorithms that aim at the detection of these
regions usually strive to maximize invariance (e.g.
w.r.t. affine transformation) in order to provide detec-
tors robust to varying image acquisition conditions.
Whether or not invariance is equally important in hu-
man gaze targeting is the focus of this paper.
We conceived of an experimental setting suited to
analysing the impact of invariance on human fixation
point selection by recording gaze data for stimuli
carefully selected to reduce high-level influence on
the recognition process as much as possible. While
this setting is intended for later use in more complex
scenarios, we show its general validity by analysing
whether the specific example of rotational invariance
is a coherent property of early human gaze targeting.
In the following sections we will first give an
overview of the related literature in this field and pro-
vide a motivation for our own work. We then present
the experimental setting and the stimuli data used dur-
ing our tests. We will further present a question-
ing scheme we developed to reduce high-level influ-
ences during the experiment. Finally, we provide the
obtained results and give some evidence on the ex-
tendibility of the presented experimental approach to
broader and more complex scenarios.
2 RELATED WORK
The selection of fixation targets by the Human Visual
System (HVS) has been a subject of research for years,
and two major factors for eye movements have been
identified (Henderson, 2003). Bottom-up factors are
stimulus-intrinsic attractors for fixation targets, mean-
ing that the selection of these targets is solely driven
by neuronal analysis of the visual features within the
stimulus. On the other hand, top-down factors intro-
duced by the observer's "internal state" and imposed
by a search task or prior knowledge about the stim-
ulus likewise affect the selection of fixated regions
(Hopfinger et al., 2000; Corbetta et al., 2000). While
the latter factors are more challenging to model, mod-
els that predict visual saliency in compliance with hu-
man bottom-up factors have been a subject of research
as they are more easily understood.
Early psychophysical experiments with human
observers (Treisman and Gelade, 1980; Bergen and
Julesz, 1983) have proven that a limited set of bottom-
up features (among these color, orientation of line
segments, certain shape parameters such as curva-
ture) are detected at an early stage of human vision.
Later in the visual process, by combination of these
simple features, more complex neurons respond to
higher-level features such as corners and junctions
(Hubel and Wiesel, 1965; Pasupathy and Connor,
1999), edges (Marr and Hildreth, 1980), curved seg-
ments (Dobbins et al., 1989) and key points (Heit-
ger et al., 1992; Rodrigues and du Buf, 2006). Beyond
their biological plausibility, empirical support for
the importance of these features in human vision was
given in (Biederman, 1987), where it is shown that
human recognition performance decreases markedly
when the corners of an object are removed. In (Koch
and Ullman, 1985), the first explicit model that fuses
the response of several early visual features (intensity,
color, orientation and temporal change) into a single
saliency map was described. A computational im-
plementation of this model was derived later (Niebur
and Koch; Itti et al., 1998). While not necessarily
primarily with the aim to model properties of early
human vision, the computer vision community like-
wise has developed a vast corpus of local feature de-
tection algorithms, designed to detect edges, blobs,
key points etc. as basic elements for machine vision.
An extensive overview can be found in (Mikolajczyk
et al., 2005) and (Tuytelaars and Mikolajczyk, 2007).
The availability of inexpensive and rather accu-
rate eye trackers that record the direction of gaze of
a human observer while regarding a visual stimulus
(typically an image on a computer screen) recently
led to approaches that analyse the neighborhood of
fixation targets and try to model saliency as targets
of overt attention. The results of the conducted ex-
periments give support to the biologically inspired
models of saliency. Edge density was reported to be
higher at fixation points (Mannan et al., 1996) and
in (Krieger et al., 2000) two-dimensional image fea-
tures like curved lines and edges, occlusions, isolated
spots, etc. have been identified as important fixa-
tion candidates. Later studies of image statistics at
human fixation locations (Parkhurst and Niebur, 2003;
Rajashekar et al., 2007) obtained similar results. Finally,
(Parkhurst et al., 2002) describes a significant correla-
tion between computed visual saliency (derived from
the aforementioned biologically plausible models)
and human eye movement data.
In a previous work (Alyosef, 2011), we analysed
the consistency of five of these computational models
for local feature selection with the fixation targets of
human observers solving a retrieval task. The inten-
tion was to investigate to what extent these detectors
are able to predict bottom-up as well as top-down
gaze targets. While our results (the key point localiza-
tion step of the SIFT algorithm performed best) cor-
respond to the findings in (Harding and Robertson,
2009) and (Rajashekar et al., 2007), we identified a
strong interplay of top-down cues (i.e. the retrieval
task was too simple w.r.t. the presented stimuli).
We therefore decided to restrict ourselves to analysing
solely bottom-up properties of human gaze targeting.
Invariance w.r.t. varying image acquisition conditions
has been considered an important property of most of
the aforementioned algorithms in machine vision.
Similarly, transformation invariances have been
considered to be important by neurally motivated
models of vision (e.g. see (Wallis et al., 1993; Deco
and Rolls, 2004)). Whether or not invariance is
important in human gaze targeting is the focus of this
paper. We conceived of an experimental setting suited
to analysing the impact of invariance at human fixation
points. While this setting is intended for later use in
more complex scenarios, we demonstrate its general
validity by analysing human gaze data on stimuli
that are suited to support or dismiss the theory that
the selection of fixation targets by the HVS is
rotationally invariant.
3 GAZE DATA ACQUISITION
3.1 Experimental Setup
For obtaining our sampling data we used the T60 eye
tracker from Tobii Technology (http://www.tobii.com/).
The T60 is a table-
mounted video-based eye tracker that uses an infra-
red camera system to record gaze data. The cameras
are built into a 17” TFT screen with a native resolu-
tion of 1280x1024 pixels that is used to present the
stimuli data. Being non-intrusive, the tracker offers
freedom of head movements (within an eye tracking
box of 44 cm (W) × 22 cm (H) × 30 cm (D) at 70 cm from the
eye tracker), which allows the participants to behave as
naturally as in front of any other computer screen. No
additional chin or forehead rest was used. The tracker
collects raw eye movement data points every 16.6 ms
(i.e. the sampling rate is 60 Hz) and provides a
tracking accuracy of 0.5°. Drift effects (caused e.g.
by varying pupil size due to varying screen illumina-
tion levels) are reduced to below 0.3°.
We set up the T60 on a blank desk and removed all
possibly irritating objects behind it. We then mounted
a fixed chair that we adjusted for each participant,
assuring an eye-screen distance of about 70 cm. To
avoid varying illumination of the scene and to assure
consistent ambient illumination, we closed all curtains
and used artificial light sources instead, leading to
low-light conditions.
Stimuli presentation, tracker calibration and gaze
data acquisition were done using the freely available
OGAMA (OpenGazeAndMouseAnalyzer,
http://www.ogama.net/) software.
3.2 Stimuli Description
We rendered 10 black polygons on white background
with 3 to 8 edges. To make fixation points at the poly-
gons' corners clearly distinguishable, polygons were
selected to exhibit visually separable vertices (w.r.t.
distance) and a minimal and maximal opening angle
between two edges. Polygons were scaled to fit most
of the tracker's screen, yielding an average visual
angle of approx. 18°. The polygons are depicted in
Fig. 2. Each polygon was rotated around the center of
the screen, using rotation angles of multiples of 60°.
Thus, we obtained 6 different projections of the
same polygon: the original and 5 rotated versions.
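For illustration, the following sketch generates the six rotated projections of a polygon. It assumes each polygon is given as a list of vertex coordinates in screen pixels; the use of Pillow, the helper names, and the example triangle are our own assumptions, since the original rendering tool is not specified in the paper.

# Sketch: render the six rotated projections of a polygon (assumption:
# vertices are given in screen-pixel coordinates; Pillow is used for rendering).
import math
from PIL import Image, ImageDraw

SCREEN_W, SCREEN_H = 1280, 1024          # native resolution of the T60 screen
CENTER = (SCREEN_W / 2.0, SCREEN_H / 2.0)

def rotate_point(p, angle_deg, center=CENTER):
    # Rotate a point around the screen center by angle_deg degrees.
    a = math.radians(angle_deg)
    x, y = p[0] - center[0], p[1] - center[1]
    return (center[0] + x * math.cos(a) - y * math.sin(a),
            center[1] + x * math.sin(a) + y * math.cos(a))

def render_projections(vertices):
    # Return the original and the five rotated versions (multiples of 60 degrees).
    images = []
    for angle in range(0, 360, 60):
        img = Image.new("RGB", (SCREEN_W, SCREEN_H), "white")
        draw = ImageDraw.Draw(img)
        draw.polygon([rotate_point(v, angle) for v in vertices], fill="black")
        images.append(img)
    return images

# Hypothetical example polygon (a triangle); the actual stimuli are shown in Fig. 2.
projections = render_projections([(640, 300), (350, 750), (930, 750)])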
An additional set of 20 polygons was rendered in
the same way to be used as settling data and was pre-
sented to the participants before the actual experiment
started. In order not to bias the experiment, this was
not communicated to the participants. A last set of
16 randomly rendered polygons was used to distract
the participants from the fact that they were observing
10 unique polygons that were just rotated differently;
these were presented in random order after the settling
set. This yields a total of 96 images.
The motivation for this step was to prevent a
participant from forming any assumption about what to
look for within a presented polygon and thereby being
biased while examining succeeding images.
This set was selected so as to convince the participants
of the random structure of the presented stimuli and
to avoid that a participant would look for any regulari-
ties or any repetitive character of the presented images
throughout the rest of the experiment. Any such as-
sumption of a participant would represent a top-down
bias for the process of gaze targeting. The partici-
pants were unaware of this step and were left under
the impression of examining 96 distinct images.
3.3 Task Description
As reducing the impact of top-down factors on the
participants' fixated regions was crucial throughout
the design of our experimental setting, we conceived
of a generic viewing task that was intended to help
achieve this goal. To do so, we created a set of 30
questions, and the participants were asked to answer
one of them after each presented stimulus.
The questions were designed to be very simple, as-
suring a) that the cognitive load of the participants
stays at a constant level and that they are not too
distracted by understanding the question, and b) that
the questions can be answered quickly to reduce the
experiment duration. In addition, we shuffled the
questions randomly to make them as unpredictable
as possible and thus eliminated bias that could have
arisen from knowing which parts of the stimulus are
or could be important for answering the subsequent
question on it. The correct response to these
questions was neither given to the participants nor
was it of any importance for the outcome of the ex-
periment. As mentioned before the overall aim of the
questions was rather to ensure a level of overt atten-
tion spatially equally spread over the entire stimulus.
Here are some examples of the mentioned questions:
“Have there been more than 4 edges?”
“Have you seen circular edges?”
“Have you seen exactly one object?”
3.4 Participants and Process
Fixations were recorded for 15 participants, each of
whom examined the 20 + 76 polygons described in
Section 3.2. After the calibration process, the participants
were instructed to view each presented image care-
fully in order to be able to answer the subsequently
posed questions. All 96 images were presented in a
slide-show-like manner to each of the participants.
Each image was presented for a duration of 5 sec-
onds. After each presentation a random question was
selected from the catalog (see Section 3.3). The par-
ticipant was asked to answer the question. While the
correct answer did not matter for the experiment it-
self, a new image was not shown before the subject
had given an answer. To avoid distracting the
participants by posing the questions orally, we decided to present
the question as a part of the slide show directly af-
ter a stimulus. Thus, the participant had to answer
the question by remembering the presented stimulus
without seeing it; seeing it otherwise would have meant
a high-level top-down influence on the fixation selec-
tion process (e.g. a task-specific search on the image).
Additionally, presenting a neutral gray image before
each new stimulus prevented different stimulus im-
ages from interfering with each other during the
recording process. Once the question had been
answered orally, the next image was
presented.
3.5 Gaze Data Analysis
All gaze movements of a participant were recorded
using OGAMA. As a post-processing step, partici-
pants' fixations were extracted, containing information
about fixation duration, start time, end time, and
position for each trial within the experiment. The param-
eters for computing a fixation from the raw gaze points
were kept fixed for all participants and every trial.
Because eye gaze fixations are not steady, we con-
sidered a gaze point as part of a real fixation if the
distance between its coordinates and the coordinates
of the average fixation point did not exceed 20 pixels.
Furthermore, the minimal number of gaze point
samples to be considered a fixation was set to 5.
Finally, all calculated fixations that lie within the
maximum distance of a neighboring fixation point
were merged into a single fixation point. To avoid
overlapping fixations when switching to a consecutive
trial (e.g. when we switched from the neutral grey
background image used for questioning to the next
trial), the first fixation point was discarded if it was at
the same place as the last fixation point of the foregoing
trial and had a duration of less than 200 ms.
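A minimal sketch of such a dispersion-based fixation filter is given below. It assumes the raw gaze samples of one trial are (x, y) pixel coordinates sampled at 60 Hz and uses the thresholds stated above (20 px maximum distance from the running average, at least 5 samples); the function and constant names are ours and do not reflect OGAMA's internal implementation, and the subsequent merging of neighboring fixations and cross-trial discarding are omitted.

# Sketch of the dispersion-based fixation filter described above (our own
# reimplementation, not OGAMA's code). Samples are (x, y) tuples at 60 Hz.
MAX_DIST_PX = 20      # max distance of a sample from the running fixation centre
MIN_SAMPLES = 5       # minimum number of gaze samples forming a fixation

def detect_fixations(samples, sample_ms=1000.0 / 60):
    fixations = []    # each fixation: (centre_x, centre_y, duration_ms)
    current = []      # samples of the fixation candidate under construction
    for x, y in samples:
        if current:
            cx = sum(p[0] for p in current) / len(current)
            cy = sum(p[1] for p in current) / len(current)
            if ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 > MAX_DIST_PX:
                # Sample leaves the dispersion window: close the candidate.
                if len(current) >= MIN_SAMPLES:
                    fixations.append((cx, cy, len(current) * sample_ms))
                current = []
        current.append((x, y))
    # Close a candidate that is still open at the end of the trial.
    if len(current) >= MIN_SAMPLES:
        cx = sum(p[0] for p in current) / len(current)
        cy = sum(p[1] for p in current) / len(current)
        fixations.append((cx, cy, len(current) * sample_ms))
    return fixations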
For analysing the gaze data, saliency maps were
generated using an aggregated Gaussian distribution
(Vosskuehler, 2009):
f(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right)    (1)
where x, y ∈ [−s, s] and σ = s/5. All kernels of a trial
are then added to a new image having the same size as
the given stimulus (1280×1024 pixels), which is finally
normalized. We set the Gaussian kernel width to
σ = 40 pixels.
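The following sketch accumulates such kernels at the fixation positions of one trial. It assumes unweighted kernels and a kernel half-width of s = 5σ = 200 pixels (following σ = s/5 with σ = 40 pixels); it does not claim to reproduce OGAMA's exact implementation, in particular its normalization step.

# Sketch: build a per-trial saliency map from fixation positions by adding a
# Gaussian kernel (Eq. 1) at each fixation (assumption: unweighted kernels).
import numpy as np

W, H = 1280, 1024          # stimulus size in pixels
SIGMA = 40.0               # kernel width as used in the paper
S = int(5 * SIGMA)         # kernel half-width, since sigma = s / 5

def gaussian_kernel(sigma=SIGMA, s=S):
    xs = np.arange(-s, s + 1)
    xx, yy = np.meshgrid(xs, xs)
    return np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

def saliency_map(fixations, kernel=gaussian_kernel()):
    # fixations: list of (x, y) pixel positions of one trial.
    sal = np.zeros((H, W))
    s = (kernel.shape[0] - 1) // 2
    for x, y in fixations:
        x, y = int(round(x)), int(round(y))
        # Clip the kernel at the image borders.
        x0, x1 = max(0, x - s), min(W, x + s + 1)
        y0, y1 = max(0, y - s), min(H, y + s + 1)
        sal[y0:y1, x0:x1] += kernel[y0 - (y - s):y1 - (y - s),
                                    x0 - (x - s):x1 - (x - s)]
    # Normalise to [0, 1]; OGAMA's exact normalisation may differ.
    return sal / sal.max() if sal.max() > 0 else sal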
In the next step the average saliency map for each
of the 60 stimuli over all 15 participants was gener-
ated. For better comparability, the saliency maps were
rotated back to the original orientation of 0°. To give
an impression of the resulting data, Fig. 2 shows
the stimuli polygons as well as their corresponding 6
saliency maps colored using a rainbow gradient.
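A short sketch of this averaging step is given below, assuming per-participant saliency maps are available as numpy arrays and that scipy.ndimage.rotate is an acceptable stand-in for the rotation procedure actually used; the sign of the rotation angle depends on the rotation convention of the stimuli and is therefore an assumption.

# Sketch: rotate per-participant saliency maps back to 0 degrees and average
# them (assumption: scipy.ndimage.rotate approximates the original procedure).
import numpy as np
from scipy.ndimage import rotate

def mean_saliency_map(maps, stimulus_angle_deg):
    # maps: list of saliency maps (H x W arrays), one per participant, all
    # recorded for the same stimulus rotated by stimulus_angle_deg.
    # The angle sign below assumes counter-clockwise stimulus rotation.
    restored = [rotate(m, angle=-stimulus_angle_deg, reshape=False, order=1)
                for m in maps]
    return np.mean(restored, axis=0)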
To measure the similarity between two saliency
maps A and B, the two-dimensional correlation coeffi-
cient p is calculated using Pearson's linear corre-
lation as described in (Engelke et al., 2010):
p = \frac{\sum_{m}\sum_{n} (A_{mn} - \bar{A})(B_{mn} - \bar{B})}{\sqrt{\sum_{m}\sum_{n} (A_{mn} - \bar{A})^{2}} \; \sqrt{\sum_{m}\sum_{n} (B_{mn} - \bar{B})^{2}}}    (2)

where m ∈ [1, M], n ∈ [1, N] are the pixel coordinates,
and \bar{A}, \bar{B} denote the mean pixel values of A and B.
The larger the value of the correlation coefficient p,
the higher the correlation between the two saliency
maps, with p ∈ [−1, 1].
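Eq. 2 amounts to the ordinary Pearson correlation taken over all pixels of the two maps; a minimal numpy sketch (our own, not the implementation of Engelke et al.) is:

# Sketch: two-dimensional correlation coefficient of Eq. 2 between two
# saliency maps A and B of equal size (plain Pearson correlation over pixels).
import numpy as np

def correlation(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dA = A - A.mean()              # deviations from the mean pixel value
    dB = B - B.mean()
    denom = np.sqrt((dA ** 2).sum()) * np.sqrt((dB ** 2).sum())
    return (dA * dB).sum() / denom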
4 EXPERIMENTAL RESULTS
We computed the correlation p between the saliency
maps of any two rotated projections of each of the 10
polygons A_x, B_x, ..., J_x (where x ∈ {0°, 60°, 120°, ..., 300°}
denotes the rotation angle).
Figure 1 shows the aggregated correlation coefficient
for each polygon over all orientations.
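A sketch of this aggregation is shown below, assuming the back-rotated mean saliency maps of one polygon are stored per rotation angle and reusing the hypothetical correlation() helper sketched above:

# Sketch: mean correlation per polygon over all pairs of rotated projections
# (uses the hypothetical correlation() helper from the previous sketch).
from itertools import combinations
import numpy as np

ANGLES = [0, 60, 120, 180, 240, 300]

def mean_pairwise_correlation(maps_by_angle):
    # maps_by_angle: dict mapping rotation angle -> mean saliency map of one
    # polygon (already rotated back to 0 degrees).
    pairs = list(combinations(ANGLES, 2))          # 15 pairs per polygon
    return np.mean([correlation(maps_by_angle[a], maps_by_angle[b])
                    for a, b in pairs])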
Polygon            A       B       C       D       E       F       G       H       I       J
Mean correlation   0.8466  0.8127  0.8489  0.6669  0.6575  0.8157  0.7789  0.8537  0.7891  0.8224
Figure 1: Mean correlation coefficient for all polygons over all orientations and participants.
Assuming the participants treated the rotated pro-
jections of each polygon as unseen stimuli, the cor-
relation has to be considered rather high. The
smallest correlation p = 0.5132 was found when compar-
ing the saliency maps of polygon E for the rotation
angles x = 180° and x = 300°, whereas the highest
value p = 0.9211 was obtained when comparing the ro-
tated projections of polygon H with x = 240° and
x = 300°. Furthermore, considering the immediate
"neighbor" of each projection (i.e. x ± 60°), the corre-
lation was found to be highest for 83% of all neighboring
projections. Considering the mean correlation coef-
ficients (see Fig. 1), the smallest correlations are
found for polygons D and E.
Figure 2: Overview of all 10 polygons and their corresponding colored saliency maps (rotated back to their initial position of 0° for better visual comparison).
As these were the sim-
plest polygons, we assume that most of their proper-
ties could be perceived without overt attention and de-
tails were not fixated directly. When sorting the cor-
relation coefficients p by increasing rotation angle,
the absolute values of p were found to increase as
well. We assume this is due to at least one factor
influencing the fixation selection process of the
participants that is not yet represented in the
experimental setting. Considering that the segregation
or dichotomy of the ventral "what" and dorsal "where"
stream (two-stream hypothesis) is highly in doubt
(Farivar, 2009), we cannot be sure that our results
only represent outcomes of the low-level "where"
stream. Due to the strong interconnection of both
streams, we have to act on the assumption that
fixation positions are always a result of both
processing streams.
Another outcome worth mentioning is that the
highest density of fixation points over all partici-
pants was found at regions with high local complexity.
Thus, interesting characteristics like local convexity,
line-crossings, and junctions trigger the attention of
every observer, resulting in very high peaks in the
corresponding saliency maps. As can be seen in Fig. 2,
whenever there is a concave region or a region show-
ing high local complexity, participants tend to fixate
this region more often than outer corners and edges
(e.g. polygon D compared with H).
4.1 Conclusions
In summary, evidence for rotational invariance at fix-
ation points could be given to a certain degree. Assum-
ing that the presented design avoids most of the high-
level influences built on experience held before or
gained throughout the experimental process, the anal-
ysed data indicates that rotational invariance exists
for the given simple stimuli. The strong corre-
lation between neighboring orientations raises the
assumption that it was not possible to suppress all
top-down influences as initially intended. Moreover,
given that there is a significant difference between the
fixation data of simple convex polygons and that of
more complex concave versions, further investigation
of convexity and concavity should be performed.
More experiments with different polygon prototypes,
further reduced in complexity (i.e. the number of
vertices and nodes), should be carried out. Another
important fact that is currently not represented by the
proposed experimental setting is that humans usually
do not perceive objects and scenes in the discrete
steps we simulated.
REFERENCES
Alyosef, A. A. (2011). Comparison of interest points of
computer vision detectors with human fixation data.
Master’s thesis, University of Magdeburg, Germany.
Bergen, J. R. and Julesz, B. (1983). Parallel versus serial
processing in rapid pattern discrimination. Nature,
303:696–698.
Biederman, I. (1987). Recognition-by-components: A the-
ory of human image understanding. Psychological Re-
view, 94(2):115–147.
Corbetta, M., Kincade, J. M., Ollinger, J. M., McAvoy,
M. P., and Shulman, G. L. (2000). Voluntary orienting
is dissociated from target detection in human posterior
parietal cortex. Nature neuroscience, 3(3):292–7.
Deco, G. and Rolls, E. T. (2004). A neurodynamical cortical
model of visual attention and invariant object recogni-
tion. Vision research, 44(6):621–42.
Dobbins, A., Zucker, S. W., and Cynader, M. S.
(1989). Endstopping and curvature. Vision Research,
29(10):1371–1387.
Engelke, U., Liu, H., Zepernick, H.-J., Heynderickx, I.,
and Maeder, A. (2010). Comparing two eye-tracking
databases: The effect of experimental setup and image
presentation time on the creation of saliency maps. In-
ternational Picture Coding Symposium.
Farivar, R. (2009). Dorsal-ventral integration in object
recognition. Brain Research Reviews, 61(2):144–153.
Harding, P. and Robertson, N. (2009). A comparison of
feature detectors with passive and task-based visual
saliency. LNCS, 5575:716–725.
Heitger, F., Rosenthaler, L., von der Heydt, R., Peterhans,
E., and Kübler, O. (1992). Simulation of neural con-
tour mechanisms: from simple to end-stopped cells.
Vision Research, 32(5):963–981.
Henderson, J. M. (2003). Human gaze control during real-
world scene perception. Trends in Cognitive Neuro-
science, 7(11):498–504.
Hopfinger, J. B., Buonocore, M. H., and Mangun, G. R.
(2000). The neural mechanisms of top-down atten-
tional control. Nature neuroscience, 3(3):284–91.
Hubel, D. and Wiesel, T. (1965). Receptive fields and func-
tional architecture in two nonstriate visual areas (18
and 19) of the cat. Journal of Neurophysiology, 28.
Itti, L., Koch, C., and Niebur, E. (1998). A model of
saliency-based visual attention for rapid scene anal-
ysis. IEEE Transactions on pattern analysis and ma-
chine intelligence, 20(11):1254–1259.
Koch, C. and Ullman, S. (1985). Shifts in selective visual
attention: towards the underlying neural circuitry. Hu-
man Neurobiology, 4(4):219–227.
Krieger, G., Rentschler, I., Hauske, G., Schill, K., and Zet-
zsche, C. (2000). Object and scene analysis by sac-
cadic eye-movements: an investigation with higher-
order statistics. Spatial vision, 13(2-3):201–14.
Mannan, S., Ruddock, K., and Wooding, D. (1996). The
relationship between the locations of spatial features
and those of fixations made during visual exami-
nation of briefly presented images. Spatial Vision,
10(3):165–188.
Marr, D. and Hildreth, E. (1980). Theory of Edge Detec-
tion. Proceedings of the Royal Society B: Biological
Sciences, 207(1167):187–217.
Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A.,
Matas, J., Schaffalitzky, F., Kadir, T., and Gool, L. V.
(2005). A comparison of affine region detectors. In-
ternational Journal of Computer Vision, 65:43–72.
Niebur, E. and Koch, C. Control of selective visual at-
tention: modeling the "where" pathway. Advances
in Neural Information Processing Systems, pages 802–
808.
Parkhurst, D., Law, K., and Niebur, E. (2002). Modeling
the role of salience in the allocation of overt visual
attention. Vision research, 42(1):107–23.
Parkhurst, D. J. and Niebur, E. (2003). Scene content se-
lected by active vision. Spatial vision, 16(2):125–54.
Pasupathy, A. and Connor, C. E. (1999). Responses to con-
tour features in macaque area V4. Journal of neuro-
physiology, 82(5):2490–502.
Rajashekar, U., van der Linde, I., Bovik, A. C., and Cor-
mack, L. K. (2007). Foveated analysis of image fea-
tures at fixations. Vision Research, 47:3160–3172.
Rodrigues, J. and du Buf, J. (2006). Multi-scale keypoints
in v1 and beyond: object segregation, scale selection,
saliency maps and face detection. BioSystems, 86.
Treisman, A. M. and Gelade, G. (1980). A feature-
integration theory of attention. Cognitive psychology,
12(1):97–136.
Tuytelaars, T. and Mikolajczyk, K. (2007). Local Invariant
Feature Detectors: A Survey. Foundations and Trends
in Computer Graphics and Vision, 3(3):177–280.
Vosskuehler, A. (2009). OGAMA description (version 2.5).
Wallis, G., Rolls, E., and Foldiak, P. (1993). Learning
invariant responses to the natural transformations of
objects. Proceedings of 1993 International Confer-
ence on Neural Networks (IJCNN-93-Nagoya, Japan),
2:1087–1090.