AUTONOMOUS CAMERA CONTROL BY NEURAL MODELS
IN ROBOTIC VISION SYSTEMS
Tyler W. Garaas, Frank Marino and Marc Pomplun
Department of Computer Science, University of Massachusetts Boston
100 Morrissey Boulevard, Boston, MA 02125-3393, U.S.A.
Keywords: Robotic Vision, Neural Modeling, Camera Control, Auto White Balance, Auto Exposure.
Abstract: Recently there has been growing interest in creating large-scale simulations of certain areas in the brain.
The areas receiving the most focus are visual in nature, and modeling them may provide a means to
compute some of the complex visual functions that have plagued AI researchers for many decades –
robust object recognition, for example. Additionally, with the recent introduction of cheap computational
hardware capable of computing at several teraflops, real-time robotic vision systems will likely be
implemented using simplified neural models based on their slower, more realistic counterparts. This paper
presents a series of small neural networks that can be integrated into a neural model of the human retina to
automatically control the white-balance and exposure parameters of a standard video camera to optimize the
computational processing performed by the neural model. Results of a sample implementation including a
comparison with proprietary methods are presented. One strong advantage that these integrated sub-
networks possess over proprietary mechanisms is that ‘attention’ signals could be used to selectively
optimize areas of the image that are most relevant to the task at hand.
1 INTRODUCTION
Recent advances in neuroscience have given us
unprecedented insight into how assemblies of
neurons interact to establish complex
functions such as the deployment of visual attention
(Moore & Fallah, 2004), the conscious visual
perception of objects (Pascual-Leone & Walsh,
2001), or the remapping of visual items between eye
movements (Melcher, 2007). This accumulation of
detailed knowledge regarding the structure and
function of the individual neurons and neural areas
that are responsible for such functions has led a
number of neuroscientists to prepare large-scale
neural models in order to simulate these areas.
Some of the modeled areas include the primary
visual cortex (McLaughlin, Shapley, & Shelley,
2003), the middle temporal area (Simoncelli &
Heeger, 1998), and an amalgamation of areas
(Walther & Koch, 2006). Although the usual
motivation for creating such models is to ultimately
make predictions about their possible mechanisms or
functional roles in biological organisms, recent
advances in parallel computing – in particular, the
introduction of the Cell processor as well as the
graphics processing unit (GPU) – will likely direct
the attention of robotics researchers toward
developing comprehensive neural models for use in
robotic applications.
Robotic vision systems that are based on
biologically inspired neural models represent a
promising path toward finally achieving
intelligent vision systems with the power to
perform the complex visual tasks that we take for
granted on a daily basis. A classic example is that of
object recognition, at which computer vision
systems are notoriously poor performers. Humans,
on the other hand, can quickly – on the order of
hundreds of milliseconds – and effortlessly
recognize complex objects in a variety of
situations – e.g., various lighting conditions,
rotations, or levels of occlusion. Various models of
how this processing may occur in humans have been
proposed, which have resulted in increased object
recognition abilities by artificial systems (e.g.,
Riesenhuber & Poggio, 1999; Walther & Koch,
2006). Consequently, it is likely that subsequent
iterations of these models will make their way into
future robotic vision systems.
Most neural models of visual areas operate in an
idealized space (e.g., Lanyon & Denham, 2004), receiving pre-captured
and manipulated images, whereas robotic vision systems necessarily
impose certain constraints. These constraints largely revolve around
the need to process information in realistic time-frames – optimally
real-time – and to interact directly with the physical world, likely
through some form of video camera. Since these
models will operate on input that cannot be known
ahead of time, the system should be designed to
handle a wide range of situations that may arise.
Most video cameras suitable for a robotic vision
system include some ability to automatically
monitor and adjust white-balance, exposure, and
focus. However, in a robotic vision system that
employs large-scale neural models, these automatic
functions may lead to suboptimal processing
conditions or even conflicts with the neural
mechanisms. This paper presents a proof-of-concept
method for manually controlling certain parameters
in the camera to optimize the processing of a neural
model of the retina, which will likely form the initial
processing stage of future biologically inspired
vision systems. In particular, the control of white-
balance (WB) and exposure parameters are
considered. Implementation details and a
comparison to proprietary methods are given in the
following sections.
2 SYSTEM OVERVIEW
The vision system presented here consists of a
number of simple neural layers (2D layouts of
neurons that process image signals from nearby
neurons) interconnected to form the basis for a
robotic vision system; Figure 1 gives a simplified
illustration of how the neurons within each layer are
connected. The layers are modeled after a subset of
the neurons present in the human retina. Figure 2
illustrates the individual neuron types and
connections that constitute the artificial retina, which
are briefly described below in order to establish the
motivation for the WB and exposure sub-networks
(subnets) presented hereafter. The neural model
(excluding WB and exposure subnets) consists of 17
layers totaling approximately 225,000
individual neurons and 1.5 million connections. The
network is designed to be executed on the GPU of a
standard video card.
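To make the connection schemes of Figure 1 concrete, the sketch below shows how a neuron of each kind might be evaluated. The model computes the per-neuron arithmetic in GLSL on the GPU; this C-style host rendering, the pooling radii, the connection counts, and the uniform random sampling are illustrative assumptions rather than the model's actual parameters.

#include <algorithm>
#include <random>
#include <vector>

// One neural layer: a 2D grid of activations in [0, 1].
struct Layer {
    int w = 0, h = 0;
    std::vector<float> a;                      // row-major activations
    float at(int x, int y) const { return a[y * w + x]; }
};

// Figure 1 (left): a neuron pools k randomly chosen neurons within
// `radius` of its position in the previous layer and averages them.
float randomPoolAverage(const Layer& prev, int x, int y,
                        int radius, int k, std::mt19937& rng) {
    std::uniform_int_distribution<int> offset(-radius, radius);
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) {
        int sx = std::clamp(x + offset(rng), 0, prev.w - 1);
        int sy = std::clamp(y + offset(rng), 0, prev.h - 1);
        sum += prev.at(sx, sy);
    }
    return sum / static_cast<float>(k);
}

// Figure 1 (right): a center-surround neuron, excited by a small
// central pool and inhibited by a wider surrounding pool.
float centerSurround(const Layer& prev, int x, int y, std::mt19937& rng) {
    float center   = randomPoolAverage(prev, x, y, 1, 8,  rng);
    float surround = randomPoolAverage(prev, x, y, 4, 16, rng);
    return std::max(0.0f, center - surround); // keep activation nonnegative
}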
Figure 1: Simplified illustration of the connections used in
the neural model and camera control subnets: (left)
random connections to cells in previous layers and (right)
random connections demonstrating a center-surround
organization. Information in both cases flows from the
upper layer to the bottom layer.
Figure 2: Connections between the various neuron types in
the retina and camera control subnets. Solid arrows
indicate excitatory connections while arrows with a white
circle indicate inhibitory connections; ellipses between
WB1 and WB4 indicate a continuation of the connection
pattern directly above.
2.1 Apparatus
The network presented hereafter was simulated on
two computer graphics cards (Nvidia 380 GTX) using
an SLI setup. The OpenGL shading language
(GLSL) was used to implement the computations for
individual neurons. Video input was retrieved using
a Canon VC-C4 video camera, and images were
captured from the camera using the Belkin Hi-Speed
USB 2.0 DVD Creator. Activations of the entire
network can be computed very quickly: 100
iterations of computing activations for every single
neuron take approximately 0.5 seconds.

Figure 3: Activation maps and connection structure of the neural model and camera control subnets. Lighter areas represent higher activations while the colors indicate the spectral contributions to the activations.
2.2 Simulated Retina
The human retina is often considered a simple
means for sensing light that enters the eye. On the
contrary, the retina is actually a complex extension
of the brain that is responsible for both reducing the
amount of information transmitted to the various
visual centers of the brain and converting the
incoming signal into a form suited for higher-level
processing by the cortex. In the neural model
presented here, we simulate the cones (R, G, B;
referred to as long-, medium-, and short-wavelength
cones in biological organisms, which are responsible
for extracting the contributions of three primary
color-components of the image), the horizontal cells
(H1 & H2 cells, which essentially compute a
‘blurred’ version of the incoming image), on- and
off-center bipolar cells (R, G, B cells, which
compute an initial contrast-sensitive activation due
to the antagonistic center-surround arrangement),
and on- and off-center ganglion cells (R, G, B cells,
which compute a center-surround, contrast-sensitive
signal that is also spectrally opponent, due
to the inhibitory connections from bipolar cells).
For the sake of brevity, we do not describe the
specifics of individual neuron activations and
connections. However, the essentials of the retinal
neurons simulated here follow very closely those
laid out by Dacey (2000) and Dowling (1987).
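As a minimal illustration of the opponent arithmetic just described, the sketch below computes a red/green opponent on-center ganglion response from two bipolar-cell layers. It assumes a plain mean over square neighborhoods, whereas the model uses random connections evaluated in GLSL; the radii of 1 and 4 are arbitrary illustrative choices.

#include <algorithm>
#include <vector>

// Activations of one neural layer, row-major, values in [0, 1].
using Plane = std::vector<float>;

// Mean activation over a square neighborhood of radius r around (x, y).
float meanPool(const Plane& p, int w, int h, int x, int y, int r) {
    float sum = 0.0f;
    int n = 0;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int sx = std::clamp(x + dx, 0, w - 1);
            int sy = std::clamp(y + dy, 0, h - 1);
            sum += p[sy * w + sx];
            ++n;
        }
    return sum / n;
}

// Red/green opponent on-center ganglion cell: excited by red bipolar
// input over a small center, inhibited by green bipolar input over a
// wider surround (after Dacey, 2000).
float redGreenOnGanglion(const Plane& redBipolar, const Plane& greenBipolar,
                         int w, int h, int x, int y) {
    float center   = meanPool(redBipolar,   w, h, x, y, 1);
    float surround = meanPool(greenBipolar, w, h, x, y, 4);
    return std::max(0.0f, center - surround);
}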
3 WHITE-BALANCE CONTROL
WB control was built into cameras so that changes
in illumination can be countered to keep white
areas within an image looking white. For instance,
lighting that is stronger across the red end of the
visible spectrum will cause white areas to take on a
reddish hue. Many different algorithms, such as
white point estimation (Cardei, Funt & Barnard,
1999), chromaticity estimation using neural
networks (Funt & Cardei, 1999), and gray world
(Buchsbaum, 1980), have been proposed to control
for changes in color due to the endless variety of
light sources. Although humans do not have the
ability to directly control the color of objects as they
are being received by the various early visual areas,
neural mechanisms do exist to counter the effect of
illuminants on the actual perception of color
(Brainard, 2004). This ability is aptly referred to as
color constancy.
The automatic white-balance mechanism
described here is a subnet of the neural model
portrayed above. The basic goal is largely the same
as that of previously proposed mechanisms; that is,
to make white objects project white color onto the
incoming image regardless of the illumination color.
As such, the proprietary automatic white-balance
mechanism would provide an adequate means to
achieve this; however, there are a few caveats that
may make a specifically designed WB control
mechanism desirable. First, ganglion cells, from
which the WB function will be computed, do not
directly encode the primary image colors (i.e.,
RGB); instead, they encode a spatially and spectrally
opponent signal representing the differences
between red/green and blue/yellow channels. This
property may introduce differences between an
optimal white-balance parameter set by proprietary
mechanisms and the optimal white-balance
parameter for network computation. Second, certain
biologically inspired mechanisms may take
advantage of having such computations
implemented directly inside the network. This will
be discussed in detail later.
The WB subnet introduced here is conceptually
very simple. It begins by adding a layer (WB0) to
the network that ‘extracts’ areas of the image
that are candidates for white or light-gray
regions (technically, B/Y – R/G neutral). The
candidate areas are exactly those areas in which the
on-center ganglion cells have nearly the same level
of activation and where the sum of the activations is
greater than some threshold. The short code
fragment below gives a basic idea of how neurons’
activations in layer WB0 are computed; the red,
green, and blue variables store
the average activations of incoming red, green, and
blue on-center ganglion cells, respectively; on-center
activations range from 0.0 (no activation) to 1.0 (full
activation).
// Average activations of incoming on-center ganglion cells.
float intensity = red + green + blue;  // overall local brightness
// Proportion of each primary in the local signal.
float R = red / intensity;
float G = green / intensity;
float B = blue / intensity;
// Candidate white/gray area: all three channels contribute roughly
// equally and the area is sufficiently bright; output the scaled
// blue-red imbalance, otherwise stay silent.
if (R > 0.25 && G > 0.25 && B > 0.25 && intensity > 1.0)
    activation = (B - R) * 4.0;
else
    activation = 0.0;
After layer WB0 has extracted the areas that are
potentially white or light gray, neurons in WB1 then
compute a local maximum of the WB0 neurons to
which it is connected; Figure 1 (left) illustrates the
basic connection structure. Finally, layers WB2
through WB4 perform a simple averaging of the
neuron activations from incoming layers; however,
only neurons with non-zero activation (i.e., those
representing a candidate area) will contribute to the
average. The end product of the WB subnet is a
value that can be used to step the white-balance
parameter toward either a bluer or a redder hue,
depending on the situation. If, for instance, the
activations of red and blue ganglion cells are
close to equal across the image, the step value
will be zero and the white-balance parameter will
not change. However, if red ganglion cells generally
have larger activations, the step value will be
negative, causing the camera to introduce a slightly
bluer hue to the image.
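A minimal sketch of this final stage is given below, assuming the WB4 activations have been read back into a host-side array; the gain, dead-band, and clamping values are invented for illustration and are not the network's actual constants.

#include <algorithm>
#include <cmath>
#include <vector>

// Average the non-zero activations of the last pooling layer (WB4) and
// turn the result into a signed white-balance step: a negative step asks
// the camera for a bluer hue, a positive step for a redder one.
float whiteBalanceStep(const std::vector<float>& wb4) {
    float sum = 0.0f;
    int n = 0;
    for (float a : wb4)
        if (a != 0.0f) { sum += a; ++n; }   // only candidate areas vote
    if (n == 0) return 0.0f;                // no white-ish regions found
    float imbalance = sum / n;              // >0: blue > red, <0: red > blue
    const float kGain = 2.0f, kDeadBand = 0.02f;  // illustrative constants
    if (std::fabs(imbalance) < kDeadBand) return 0.0f;  // close enough
    return std::clamp(kGain * imbalance, -1.0f, 1.0f);
}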
The WB subnet was designed to balance the
activations across red and blue on-center ganglion
cells. Consequently, the subjective view of the
image cannot be used to assess the performance of
the subnet, which is contrary to proprietary
mechanisms.

Figure 4: White-balance results: (top) adjusted image using proprietary auto-WB mechanism, (middle) adjusted image using the WB subnet, and (bottom) activation map of WB2. Lighter portions in WB2 represent candidate areas that contain greater activations of blue ganglion cells, while darker portions represent candidate areas with greater activations of red ganglion cells.

With that said, the WB subnet adjusts
the WB of the camera in much the same way as the
proprietary mechanism in certain situations; see
Figure 4 (left), for example. In contrast, other
situations can produce deviations in WB settings
between the subnet and proprietary mechanisms; see
Figure 4 (right), for example. The size of the WB
steps should also be considered: too large a step
size will cause over-correction and, ultimately, a
‘ping-ponging’ of the WB parameter around the
correct value, while too small a step size will make
the WB adjust very slowly. Finally, in the current network,
following a change in the white-balance parameter,
it was necessary to insert a short delay before
another step could be made; this was needed to
allow the changes in image color due to the WB
parameter change to spread through the various
neural layers.
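For illustration, this pacing can be sketched as follows; the settle count of ten iterations and the setCameraWB() call are hypothetical stand-ins for the actual delay length and camera interface.

// Pacing the WB updates: after each parameter change, wait a fixed
// number of network iterations so the new colors can propagate through
// all layers before the next step is taken.
struct WbStepper {
    int   settleLeft = 0;         // iterations to wait before next step
    float wbParam    = 0.0f;      // current white-balance setting

    void tick(float step) {       // called once per network iteration
        if (settleLeft > 0) { --settleLeft; return; }
        if (step == 0.0f) return; // balanced; nothing to do
        wbParam += step;          // nudge toward blue (<0) or red (>0)
        // setCameraWB(wbParam);  // hypothetical camera call
        settleLeft = 10;          // let the change flow through the net
    }
};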
4 EXPOSURE CONTROL
One of the most remarkable properties of the human
visual system is its ability to function over a
strikingly large range of luminance conditions, a
span of approximately 10 billion to 1 (Dowling,
1987). The human eye has essentially two ways of
dealing with the variation it experiences in day-to-
day luminance levels. (1) The pupil can reduce its
area by a factor of approximately 16 due to changes
in ambient illumination. (2) The circuitry in the
retina is specially designed to handle two general
lighting conditions: dim light, primarily handled by
the rod-pathway in the retina; and bright light,
handled by the cone-pathway in the retina.
Video cameras, on the other hand, do not have
the luxury of such robust input mechanisms.
Nevertheless, various methods have been developed
to allow cameras to function under a rather
impressive span of luminance levels – at least when
all things are considered. The camera used for the
present study employs two primary parameters that
can be adjusted to compensate for luminance levels:
iris size and gain control.
The network control of exposure is similar to that
of WB in that a conceptually simple subnet
progressively computes various properties of the
incoming image, allowing it to ‘step’ the relevant
parameter toward optimizing some computation.
The computation optimized in exposure control is
contrast: too much light entering washes the image
out, while too little light creates an underexposed
image. This mechanism in particular will likely be
very important to robotic vision systems using
biologically inspired neural models, as contrast
has been shown to play a particularly critical role
in the neural computations that take place in the
primate visual cortex (Sceniak et al., 1999).
As with the WB subnet, the functioning of the
exposure subnet is conceptually very simple.
Essentially, the subnet attempts to maximize the
contrast of two spatially adjacent areas using the on-
and off-center ganglion cells. Recall that in the
neural model of the retina (and the biological retina)
contrast plays a specific role for two classes of
neurons, bipolar cells and ganglion cells. That is,
these cells compute an activation that highlights high
contrast areas of the image. Consequently, much of
the work required for computing our exposure
control function is already implemented.
The remaining work is performed by two
independent subnets, an off-subnet and an on-subnet.
Each subnet first computes a local maximum of the
incoming ganglion cell activations (on-center
ganglion cells will have higher activations in bright
areas, especially those adjacent to dark areas, and
vice-versa for off-center ganglion cells). This
maximum is then averaged across the image to
produce an exposure step-value similar in nature to
the WB step-value, with one difference: the
exposure step-value must control both the iris and
the gain of the camera. This is handled by a simple
scheme: changes to the iris take precedence over
changes to gain, which instead serves to fine-tune
the exposure using small step-values.
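One way to realize this precedence scheme is sketched below; the thresholds, step sizes, normalized parameter ranges, and the commented-out setter calls are all assumptions rather than the actual control values.

#include <algorithm>
#include <cmath>

// Iris takes precedence for large exposure errors; gain fine-tunes.
struct ExposureController {
    float iris = 0.5f;  // normalized iris opening, 0 = closed, 1 = open
    float gain = 0.5f;  // normalized sensor gain

    void step(float e) {                    // e > 0: underexposed
        const float kDead = 0.02f, kCoarse = 0.05f, kFine = 0.01f;
        if (std::fabs(e) < kDead) return;   // exposure is close enough
        float dir = (e > 0.0f) ? 1.0f : -1.0f;
        bool irisCanMove = (dir > 0.0f) ? (iris < 1.0f) : (iris > 0.0f);
        if (std::fabs(e) > 0.1f && irisCanMove) {
            iris = std::clamp(iris + kCoarse * dir, 0.0f, 1.0f);
            // setCameraIris(iris);         // hypothetical camera call
        } else {
            gain = std::clamp(gain + kFine * dir, 0.0f, 1.0f);
            // setCameraGain(gain);         // hypothetical camera call
        }
    }
};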
Sample results of the exposure subnet are shown
in Figure 5 for both bright-light conditions (left) and
dim-light conditions (right). Exposure values in
dim-light conditions closely follow those computed
by the proprietary control mechanisms. However,
significant differences can be seen in bright-light
conditions. This is likely due to the exposure
subnet’s goal of extracting a maximum-contrast
signal, which may occur in only a portion of the
image, as opposed to enhancing a global contrast
signal. This difference is most prominent around the
wheel in Figure 5 (left), where the proprietary
mechanism introduces specular highlights on the
rubber (black) portion of the wheel; under the
exposure subnet’s control, the gain is weaker,
allowing the natural blackness of the wheel to
maximize the contrast between the plastic (white)
and rubber (black) portions.
5 CONCLUSIONS
The current paper introduced small subnets that can
be integrated into biologically inspired neural
models of human visual areas to control the WB and
exposure parameters on most standard video
cameras. In contrast to their usual functions, WB
and exposure parameters are used to optimize the
actual processing that occurs in the neural model, as
opposed to simply providing a clearer image. The
subnets of this particular implementation are based
on the activations of on-center and off-center
ganglion cells from a neural model of an artificial
retina.
Figure 5: Exposure results: (top) adjusted image using the
proprietary auto-exposure mechanism, (upper-middle)
adjusted image using the exposure subnet, (bottom-
middle) activation map of on-center EX0, and (bottom)
activation map of off-center EX0.
Aside from customizing the parameters of the
camera to optimize model computation, the subnets
introduced here have other features that would make
them a desirable replacement for proprietary
mechanisms. One feature in particular could
provide a substantial benefit, which is the ability to
selectively optimize computation for areas in the
image in which the neural model is ‘interested’.
Indeed, one of the most studied neural signals in
biological organisms is that of attention, which is
often implemented in artificial neural models
(Lanyon & Denham, 2004). Consequently, with
very little modification, the subnets presented here
could selectively provide emphasis to attended
areas based on incoming attention signals.
For instance, imagine a robotic vision system that is
placed in a daylight setting receiving very bright
light from the sun. If the robot wishes to examine a
dark portion of the incoming image – say, the
lettering of a poster printed on a black background –
proprietary mechanisms will be inadequate, as they
will selectively optimize the range of high pixel
values, i.e., those representing bright areas.
Instead, if the image is adjusted to optimize the
range of low pixel values – i.e., those representing
the poster – the robot may then successfully achieve
its goal. Attention signals representing the robot’s
desire to inspect the poster would provide a perfect
indicator by which to optimize the correct portion of
the incoming image. Future implementations will be
directed toward realizing such models.
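As a sketch of how such an extension might look, the function below replaces the uniform average used by the step computations with an attention-weighted average; the attention map format – one weight per neuron in [0, 1] – is an assumption.

#include <cstddef>
#include <vector>

// Weight each neuron's vote by an attention map before computing the
// step value, so attended image areas dominate the camera adjustment.
float attentionWeightedStep(const std::vector<float>& activations,
                            const std::vector<float>& attention) {
    float num = 0.0f, den = 0.0f;
    for (std::size_t i = 0; i < activations.size(); ++i) {
        num += attention[i] * activations[i];
        den += attention[i];
    }
    return den > 0.0f ? num / den : 0.0f;  // 0 if nothing is attended
}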
REFERENCES
Brainard, D. H. (2004). Color constancy. In L. Chalupa &
J. Werner (Eds.), The Visual Neurosciences (pp. 948-
961): MIT Press.
Buchsbaum, G. (1980). A spatial processor model for
object colour perception. Journal of the Franklin
Institute, 310, 1-26.
Cardei, V., Funt, B. & Barnard, K. (1999). White point
estimation for uncalibrated images. Proceedings of the
IS&T/SID seventh color imaging conference. (pp. 97-
100). Scottsdale, AZ, USA.
Dacey, D. M. (2000). Parallel pathways for spectral coding in
primate retina. Annual Review of Neuroscience, 23,
743-775.
Dowling, J. E. (1987). The Retina: An Approachable Part
of the Brain. Cambridge, MA, USA: Belknap Press.
Funt, B. & Cardei, V. (1999). Bootstrapping color
constancy. SPIE Electronic Imaging ’99.
Lanyon, L. J. & Denham, S. L. (2004). A model of active
visual search with object-based attention guiding scan
paths. Neural Networks, 17, 873-897.
McLaughlin, D., Shapley, R. & Shelley, M. (2003). Large-scale
modeling of the primary visual cortex: influence of
cortical architecture upon neuronal response. Journal
of Physiology-Paris, 97, 237-252.
Melcher, D. (2007). Predictive remapping of visual
features precedes saccadic eye movements. Nature
Neuroscience, 10, 903-907.
Moore, T. & Fallah, M. (2004). Microstimulation of the
frontal eye field and its effects on covert spatial
attention. Journal of Neurophysiology, 91, 152-162.
Pascual-Leone, A. & Walsh, V. (2001). Fast
backprojections from the motion to the primary visual
area necessary for visual awareness. Science, 292,
510-512.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical
models of object recognition in cortex. Nature
Neuroscience, 2, 1019-1025.
Sceniak, M. P., Ringach, D. L., Hawken, M. J. & Shapley,
R. (1999). Contrast’s effect on spatial summation by
macaque V1 neurons. Nature Neuroscience, 2, 733-
739.
Simoncelli, E. P. & Heeger, D. J. (1998). A model of
neuronal responses in area MT. Vision Research, 38,
743-761.
Walther, D. & Koch, C. (2006). Modeling attention to
salient proto-objects. Neural Networks, 19, 1395-1407.