physical context it is involved in, it is possible to apply a corresponding visual model. A
very general starting point for building a visual model is the assumption of the
existence of a ground plane and a gravity vector, which allows us to define upward
and downward directions. If we project the camera axis to the ground plane, then we
can also define forward, backward, left and right. We can also define altitude as the
distance to the ground plane. A single model can also include several different
horizontal planes: for example, if there is a table on the ground, other objects can
rest on the table. Most objects (more precisely, non-flying objects) are either
supported by a horizontal plane or accelerate under gravity.
Supported objects have an almost constant altitude, and their vertical orientation is
usually almost constant and sometimes predetermined.
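The ground-plane and gravity assumptions above can be made concrete with a short geometric sketch. The following is an illustrative example only (the coordinate frame, the z = 0 ground plane, and all names are our own assumptions, not from the text): it derives the upward direction from the gravity vector, obtains "forward" by projecting the camera axis onto the ground plane, and computes altitude as the distance to that plane.

```python
import numpy as np

# Hypothetical sketch: world frame with the ground plane at z = 0.
def ground_frame(camera_pos, camera_axis,
                 gravity=np.array([0.0, 0.0, -9.81])):
    """Derive up/forward/left directions and the camera altitude from
    a ground plane and a gravity vector, as described in the text."""
    up = -gravity / np.linalg.norm(gravity)        # opposite to gravity
    # Project the camera axis onto the ground plane to define "forward".
    forward = camera_axis - np.dot(camera_axis, up) * up
    forward /= np.linalg.norm(forward)
    left = np.cross(up, forward)                   # completes the frame
    altitude = np.dot(camera_pos, up)              # distance to the plane
    return up, forward, left, altitude

# A camera 1.5 m above the ground, tilted downward and looking along +x.
up, fwd, left, alt = ground_frame(np.array([0.0, 0.0, 1.5]),
                                  np.array([0.6, 0.0, -0.8]))
```

With this frame, "right" is simply the negation of `left`, and a supported object's near-constant altitude becomes a near-constant dot product with `up`.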
(iii) Temporal context: The cinematic models for the object’s and observer’s
movements define their relative positions in different time steps. Thus, if an object is
detected in a given position at time step k, then it should appear at a certain position in
time step k+1.
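The temporal-context idea above can be sketched with the simplest possible kinematic model. This is an assumed illustration (a constant-velocity model with made-up state names), not the paper's actual tracking method:

```python
# Hypothetical sketch: a constant-velocity kinematic model predicts
# where an object detected at step k should reappear at step k+1.
def predict_next(position_k, velocity_k, dt=1.0):
    """Predicted position at time step k+1 given the state at step k."""
    return tuple(p + v * dt for p, v in zip(position_k, velocity_k))

# An object at (10, 5) moving (2, -1) per step is expected near (12, 4).
predicted = predict_next((10.0, 5.0), (2.0, -1.0))
```

A detection far from the predicted position can then be flagged as inconsistent with the temporal context, or the prediction can restrict the search window at step k+1.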
(iv) Objects’ configuration context: Normally physical objects are seen in specific
spatial configurations or groups. For instance, a computer monitor is normally
observed near a keyboard and a mouse; and a face, when detected in its normal upright
pose, is seen above the shoulders and below the hair.
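One simple way to exploit such configurations is a spatial co-occurrence prior. The sketch below is our own illustrative assumption (names, coordinates, and the scoring rule are made up): it boosts a detection's confidence when an expected companion object is found on the expected side, e.g. a keyboard below a monitor in image coordinates.

```python
# Hypothetical sketch of an objects'-configuration prior.
def configuration_boost(detections, pairs, bonus=0.2):
    """detections: dict name -> (x, y, score), with image y growing
    downward. pairs: list of (obj, companion, expected_dy_sign),
    where dy = y_companion - y_obj; +1 means 'companion below obj'."""
    scores = {name: s for name, (_, _, s) in detections.items()}
    for obj, comp, dy_sign in pairs:
        if obj in detections and comp in detections:
            dy = detections[comp][1] - detections[obj][1]
            if dy * dy_sign > 0:        # companion on the expected side
                scores[obj] = min(1.0, scores[obj] + bonus)
    return scores

# A weak monitor detection is reinforced by a keyboard found below it.
scores = configuration_boost(
    {"monitor": (100, 80, 0.6), "keyboard": (105, 160, 0.7)},
    [("monitor", "keyboard", +1)])
```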
(v) Scene context: In some specific cases, scenes captured in images can be
classified into predefined types [8], for example “sunset”, “forest”, “office
environment”, “portrait”, etc. This scene context, which can be determined using a
holistic measurement from the image [1][2][7] and/or the objects detected in the same
image, can contribute to the final detection or recognition of the image’s objects.
(vi) Situation context: A situation is defined by the surroundings in which the observer
is immersed (environment and place), as well as by the task being performed. An
example of a situation context could be: “playing tennis on a red clay court, on a sunny
day, at 3PM”. The situation context is determined using several consecutive visual
perceptions, as well as other sources of perceptual information (e.g. auditory) and high-
level information (e.g. the task being carried out, the weather, the time of day).
In [12], the photometric context (the information surrounding the image acquisition
process, mainly the intrinsic and extrinsic camera parameters) and the computational
context (the internal state of processing of the observer) are also defined. However,
we believe that those do not correspond to contextual information in the same sense
we are defining it in this work.
Low-level context is frequently used in computer vision. Thus, most systems
performing color or texture perception use low-level context to some degree (see for
example [13]). Scene context has also been addressed in some computer vision [10]
and image retrieval [4] systems. However, we believe that not enough attention has
been given in robot and computer vision to the physical-spatial context, the temporal
context, the objects’ configuration context, and the situation context.
Having as our main motivation the development of robust, high-performing
robot vision systems that can operate in dynamic environments in real time, in this
work we propose a generic vision system for a mobile robot with a mobile camera,
which employs all of the defined spatiotemporal contexts. We strongly believe that, as
in the case of human vision, contextual information is a key factor for achieving high
performance in dynamic environments. Although other systems, such as [1][3]