A Vision Architecture
Christoph von der Malsburg
Frankfurt Institute for Advanced Studies, Frankfurt, Germany
Keywords: Vision, Architecture, Perception, Memory.
Abstract: We offer a particular interpretation (well within the range of experimentally and theoretically
accepted notions) of neural connectivity and dynamics and discuss it as the data-and-process architecture of
the visual system. In this interpretation the permanent connectivity of cortex is an overlay of well-structured
networks, “nets”, which are formed on the slow time-scale of learning by self-interaction of the network
under the influence of sensory input, and which are selectively activated on the fast perceptual time-scale.
Nets serve as an explicit, hierarchical representation of visual structure in the various sub-modalities, as
constraint networks favouring mutually consistent sets of latent variables and as projection mappings to deal
with invariance.
1 INTRODUCTION
The performance of human visual perception is far
superior to that of any computer vision system and
we evidently still have much to learn from biology.
Paradoxically, however, the functionality of neural
vision models is worse, much worse, than that of
computer vision systems. We blame this shortfall on
the commonly accepted neural data structure, which
is based on the single neuron hypothesis (Barlow,
1972). We propose here a radically new
interpretation of neural tissue as data structure, in
which the central role is played by structured neural
nets, which are formed in a slow process based on
synaptic plasticity and which can be activated on the
fast psychological time-scale. This data structure
and the attendant dynamic processes make it
possible to formulate a vision architecture into
which, we argue, many of the algorithmic processes
developed in decades of computer vision can be
adapted.
2 STRUCTURED NETS
Although we live in a three-dimensional world,
biological vision is, as far as we know, based on "2.5-
dimensional" representations, that is, two-
dimensional views enriched with local depth
information. The vision modality has many sub-
modalities – texture, colour, depth, surface
curvature, motion, segmentation, contours,
illumination and more. All of these can naturally be
represented in terms of local features tied together
into two-dimensional “nets” by active links. Nets are
naturally embedded in two-dimensional manifolds
and have short-range links between neurons. Neural
sheets, in particular the primary visual cortex, can
support a very large number of nets by sparse local
selection of neurons, which are then linked up in a
structured fashion (see Fig. 1). Given the cell-
number redundancy in primary visual cortex
(exceeding geniculate numbers by estimated factors
of 30 or 50) there is much combinatorial space to
define many nets. These nets are formed by
statistical learning from input and by dynamic self-
interaction. In this way a distributed memory for
local texture in the various modalities can be stored
already in retinal coordinates, that is, in primary
visual cortex. We would like to stress here the
contrast of this mode of representation to the current
paradigm. To cope with the structure of the visual
world, a vision system has to represent a hierarchy
of sub-patterns, “features”. In standard multi-layer
perceptrons (for an early reference see Fukushima,
1980) all features of the hierarchy are represented by
neurons. What we are proposing here amounts to
replacing units as representatives of complex
features by local pieces of net structure tying
together low-level feature neurons (or “texture
elements”, neurons representing the elementary
features that are found in neurophysiological
experiments in primary visual cortex). This has a
number of decisive advantages. First, structured
nets represent visual patterns explicitly, as a two-
dimensional arrangement of local texture elements.
Second, as alluded to above, large numbers of nets
can be implemented on a comparatively narrow
neural basis in a combinatorial fashion. Third,
partial identities of different patterns are taken care
of by partial identity of the representing pieces of net
structure. Fourth, a whole hierarchy of features can
be represented in a flat structure already in primary
visual cortex (a hint of neurophysiological evidence for such lateral connections between neurons
appears in the form of non-classical receptive fields; see Allman et al., 1985). Fifth, nets that are
homeomorphic to each other (i.e., can be put into
neuron-to-neuron correspondence such that
connected neurons in one net correspond to
connected neurons in the other) can activate each
other directly, without this interaction having to be
taught, see below.
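As a purely illustrative sketch of this data structure (the present paper prescribes no implementation), a net can be held as a sparse choice of one neuron per participating unit plus the short-range links that tie these choices together; partial identity of two patterns then shows up as shared choices. All names in the sketch (Unit, Net, shares_structure_with) are hypothetical.

```python
# Purely illustrative sketch: a "net" as a sparse choice of one neuron per
# participating unit, tied together by short-range lateral links. All names
# are hypothetical; the paper prescribes no particular data layout.
from dataclasses import dataclass, field

@dataclass
class Unit:
    """A set of redundant neurons sharing one receptive field at one location."""
    location: tuple        # (x, y) position in the two-dimensional sheet
    feature_type: str      # elementary feature ("texture element") it codes for
    n_neurons: int = 30    # cell-number redundancy within the unit

@dataclass
class Net:
    """One structured net: which neuron is chosen from each participating unit,
    plus the lateral links that tie these choices together."""
    members: dict = field(default_factory=dict)  # unit id -> chosen neuron index
    links: set = field(default_factory=set)      # unordered pairs of unit ids

    def shares_structure_with(self, other: "Net") -> set:
        """Partial identity of two patterns = shared (unit, neuron) choices."""
        return {u for u, n in self.members.items() if other.members.get(u) == n}
```

Many such Net objects can draw on the same units and even the same neurons, which is the combinatorial point made above.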
3 ACTIVATION OF NETS
Once local net structure has been established by
learning and self-interaction, the activation by visual
input takes the following form (see Fig. 1). The
sensory input selects local feature types. Each
feature type is (at least in a certain idealization)
represented redundantly by a number of neurons
with identical receptive fields. Sets of such input-
identical redundant neurons form “units”. Within a
unit there is an inhibitory system inducing winner-
take-all (WTA) dynamics (only one or a few of the
redundant neurons surviving after a short time). The
winners in this process are those neurons that form
part of a net, that is, whose activity is supported by
lateral, recurrent input.
This process of selection of the input-activated
neurons that happen to be laterally connected as a
net is an important type of implementation of
dynamic links: although the connections are actually
static, nets are dynamically activated by selection of
net-bearing neurons. For another type see below.
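The following toy sketch, in our own simplified notation, illustrates this selection dynamics: sensory input picks a set of units, and iterated winner-take-all within each unit keeps the neuron that receives the most lateral support from the current winners of the other selected units. The function name activate_net and the dictionary representation of the static lateral connectivity are assumptions of the sketch, not part of the model specification.

```python
import numpy as np

# Toy sketch (our own simplification) of the activation step described above.
def activate_net(selected_units, lateral, n_neurons=30, iterations=5, rng=None):
    """selected_units: ids of the units driven by the sensory input.
    lateral: dict mapping ((unit, neuron), (unit, neuron)) -> static link strength."""
    rng = rng or np.random.default_rng(0)
    # weak random initial activity in every redundant neuron of every selected unit
    activity = {(u, i): 0.1 * rng.random()
                for u in selected_units for i in range(n_neurons)}
    winners = {}
    for _ in range(iterations):
        # lateral, recurrent support from the current winners of the other units
        support = {(u, i): sum(lateral.get(((v, j), (u, i)), 0.0)
                               for v, j in winners.items() if v != u)
                   for (u, i) in activity}
        # WTA inhibition: only the best-supported neuron of each unit survives
        winners = {u: max(range(n_neurons),
                          key=lambda i: support[(u, i)] + activity[(u, i)])
                   for u in selected_units}
    return winners  # the dynamically activated net: one winner neuron per unit
```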
Local pieces of net structure can be connected
like a continuous mosaic into a larger net. This may
be compared to the image-compression scheme in
which the texture within local blocks of an image is
identified with a code-book entry (only the
identifying number of the code-book entry being
transmitted), the code-book entries tiling the image.
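For concreteness, the image-compression analogy can be written down as follows; this is a sketch of standard code-book (vector-quantization) tiling, not of the cortical mechanism itself, and the block size and distance measure are arbitrary choices of ours.

```python
import numpy as np

# Sketch of the code-book analogy only: each local block of an image is
# replaced by the index of its nearest code-book entry; the entries tile
# the image on decoding.
def encode(image, codebook, block=8):
    """image: 2-D array; codebook: array of shape (n_entries, block*block)."""
    h, w = image.shape
    indices = np.zeros((h // block, w // block), dtype=int)
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            patch = image[r:r + block, c:c + block].ravel()
            # nearest code-book entry in Euclidean distance
            indices[r // block, c // block] = int(
                np.argmin(np.sum((codebook - patch) ** 2, axis=1)))
    return indices  # only these identifying numbers would be "transmitted"

def decode(indices, codebook, block=8):
    tiles = codebook[indices].reshape(*indices.shape, block, block)
    return np.block([[tiles[r, c] for c in range(indices.shape[1])]
                     for r in range(indices.shape[0])])
```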
4 GENERATION OF NETS
Net structure in primary visual cortex is shaped by
two influences, input statistics and self-interaction.
One may assume that the genetically generated
initial structure has random short-range lateral
connections. In a first bout of organization receptive
fields of neurons are shaped by image statistics,
presumably under the influence of a sparsity
constraint (Olshausen and Field, 1996). In this
period the WTA inhibition may not yet be active,
letting neurons in a unit develop the same receptive
field. Then the network becomes sensitive to the statistics of visual input within somewhat larger
patches (the scale being set by the range of lateral connections). Pieces of net structure are then
formed by synaptic plasticity, which strengthens connections between neurons that are often
co-activated and WTA-selected; net structure is further optimized by the interplay between
(spontaneous or induced) signal generation and Hebbian modification of synaptic strengths under
the influence of a synaptic sparsity constraint.
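A minimal sketch of this formation rule, under our own idealization of discrete learning episodes: links between co-active, WTA-selected neurons are strengthened in Hebbian fashion, and a synaptic sparsity constraint prunes each neuron's weakest links. The learning rate, link budget and dictionary representation are hypothetical.

```python
# Minimal sketch of the net-formation rule described above (our idealization).
def hebbian_step(weights, winners, lr=0.05, max_links=8):
    """weights: dict mapping (neuron_a, neuron_b) -> synaptic strength.
    winners: neuron ids that were co-activated and WTA-selected in this episode."""
    winners = list(winners)
    # Hebbian modification: co-activation strengthens the connecting links
    for a in winners:
        for b in winners:
            if a != b:
                weights[(a, b)] = weights.get((a, b), 0.0) + lr
    # synaptic sparsity constraint: keep only each neuron's strongest outgoing links
    for a in {pre for pre, _ in weights}:
        outgoing = sorted(((w, post) for (pre, post), w in weights.items() if pre == a),
                          key=lambda t: t[0], reverse=True)
        for w, post in outgoing[max_links:]:
            del weights[(a, post)]
    return weights
```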
5 MODALITIES
Different sub-modalities (texture, colour, depth,
motion, ..) form their own systems of net structure,
that is, representations of local patterns that are
statistically dominant in the sensory input. Each
modality is invariant to the others and has its own
local feature space structure with its own
dimensionality, three for colour, two for in-plane
motion, one for (stereo-)depth, perhaps 40 for grey-
level texture and so on. Different values along a
given feature dimension are represented by different
neurons, or rather units containing a number of
value-identical neurons. Different value-units of the
same feature dimension, forming a “column”, inhibit
each other, again in WTA fashion.
6 LATENT VARIABLES
Several units standing for different values of a sub-
modality feature may be simultaneously active to
varying degrees. They may be seen as representing
different hypotheses as to the actual value of the
feature dimension. These activities thus represent
heuristic uncertainty, which during the perceptual
process needs to be reduced to certainty. In
contrast to computer graphics, which is realized as a
NCTA2014-InternationalConferenceonNeuralComputationTheoryandApplications
346
deterministic process proceeding from definite
values of all involved variables determining a scene,
vision is an inverse problem, in which these values
first have to be found in a heuristic process that is
inherently non-deterministic. The initially unknown
quantities are called latent variables. The task of the
perceptual process is the iterative reduction of the
heuristic uncertainty of latent variables (“perceptual
collapse”), which is possible by the application of
consistency constraints and known memory patterns.
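One way to picture this collapse, as a sketch only: a column holds graded activities over the alternative values of one latent variable, and repeated application of consistency support followed by competitive normalization drives the activities toward a single surviving hypothesis. The exponential update rule below is a placeholder of ours, not a claim about cortical dynamics.

```python
import numpy as np

# Sketch of "perceptual collapse" as we read it: iterated support plus
# competitive normalization sharpens a column's activities to near certainty.
def collapse(column, support, steps=20, gain=2.0):
    """column: activities over value units (the competing hypotheses).
    support: consistency support for each value, e.g. from the constraint nets."""
    a = np.asarray(column, dtype=float)
    s = np.asarray(support, dtype=float)
    for _ in range(steps):
        a = a * np.exp(gain * s)   # favour values supported by the constraints
        a = a / a.sum()            # competitive (WTA-like) normalization
    return a                       # approaches one-hot: uncertainty reduced

# Example: three hypotheses, the second one best supported by the constraints.
print(collapse([0.4, 0.35, 0.25], [0.1, 0.3, 0.0]))
```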
Figure 1: Combinatorially many nets co-exist within a
cortical structure (schematic). Units (vertical boxes) are
sets of neurons with identical receptive fields. Visual input
selects a sparse subset of units (vertical arrows from
below). Neurons within units have WTA dynamics. The
winner neurons are those that are supported by lateral
connections from neurons in other selected units. Lateral
connections have net structure. A neuron can be part of
several nets; thus, many nets can co-exist.
7 CONSISTENCY CONSTRAINTS
Whereas the winner neurons within units are
selected by the pattern-representing lateral
connections (which may be called “horizontal nets”),
thus factoring in memory patterns, the winner units
inside feature columns are selected by another kind
of net structure, “vertical nets”,
which are formed by
connections running between value units in different
sub-modalities. A vertical net ties together feature
value units that are consistent with each other,
consistency meaning that signals that arrive at a unit
over alternate pathways within the net as well as
sensory signals agree with each other. Like the net
structures representing memory for local feature
distribution, consistency nets are established by a
combination of learning from sensory input and self-
interaction.
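An illustrative reading of this consistency criterion: the support a value unit receives is high when the signals reaching it over alternate pathways of the vertical net agree with each other and with the sensory signal. The variance-based agreement measure in the sketch is our own placeholder.

```python
import numpy as np

# Illustrative only: consistency support as agreement of signals arriving
# over alternate pathways and from the sensory input.
def consistency_support(sensory_value, pathway_values):
    """sensory_value: value suggested by the sensory input at this unit.
    pathway_values: values predicted for this unit over alternate routes
    through the vertical net (i.e. via other sub-modalities)."""
    signals = np.array([sensory_value, *pathway_values], dtype=float)
    disagreement = np.var(signals)       # zero when all pathways agree
    return float(np.exp(-disagreement))  # close to 1.0 = consistent

# Example: a depth value from stereo versus routes via motion and shading.
print(consistency_support(1.2, [1.25, 1.15]))  # high support: the values agree
print(consistency_support(1.2, [0.3, 2.6]))    # lower support: the values conflict
```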
8 INTRINSIC COORDINATE
DOMAIN
So far, we have spoken of structure in primary visual
cortex, which is dominated by retinal coordinates,
that is, image locations change with eye movements.
All local texture representations must therefore be
repeated for all positions. (This is possible only for
a limited number of local texture patches,
comparable to the sizes of codebooks in image-
compression schemes). In order to store and
represent larger chunks of visual structure, such as
for recurring patterns like familiar objects or abstract
whole-scene lay-outs, there is another domain, see
Figure 2, presumably infero-temporal cortex, in
which neurons, units and columns refer to pattern-
fixed, intrinsic coordinates. (For the structure of
fibre projections between the retinal-coordinate and
the intrinsic-coordinate domains see below.) The
intrinsic domain can be much more parsimonious than the retinal one, as it does not need to repeat
net structures over the whole visual field; it can therefore afford more redundancy at each intrinsic
location and store a very large number of pattern-spanning nets.
The intrinsic domain also contains sub-structures
for the representation of sub-modalities, and again
there are nets for the representation of mutual
constraints between the sub-modalities. Thus, the
two domains are qualitatively the same but
quantitatively very different.
9 DYNAMIC MAPPINGS
The two domains with retinal and intrinsic
coordinates are connected by dynamic point-to-point
and feature-to-feature fibre projections that can be
switched as quickly as retinal images move, so that
correspondence between homeomorphic structures is
maintained. This switching is achieved with the
help of “control units” (Anderson and Van Essen,
1987). These can be realized as neurons whose
outgoing synapses are co-localized with the
synapses of the projection fibres they control at
dendritic patches of the target neurons. If those
patches have threshold properties, the projection
fibres can transmit signals only if the
controlling fibre is also active. The hypothesis that
dendritic patches with non-linear response properties
act as decision units was proposed long ago,
see for instance (Polsky et al., 2004).
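Reduced to a toy sketch (all names and numbers being ours), the gating idea is simply multiplicative: the projection signal reaches the target only when the co-localized control fibre is active and the dendritic patch's threshold is exceeded.

```python
# Toy sketch of the gating idea (names and numbers are ours, not the paper's).
def dendritic_patch(projection_signal, control_signal, threshold=0.5):
    drive = projection_signal * control_signal   # co-localized synapses interact
    return drive if drive > threshold else 0.0   # threshold property of the patch
```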
A control neuron may, like any other neuron,
receive synaptic inputs (e.g., re-afferent signals
that can in this way switch projection fibres so as
to compensate for an intended eye movement), but it
may also get excited through its control fibres.
AVisionArchitecture
347
Figure 2: Overview of the Architecture. Each plane corresponds to one sub-modality, on the left side in retinal coordinates
(primary visual cortex), on the right side in pattern-intrinsic coordinates (infero-temporal cortex). A segment in the retinal-
coordinate domain is projected by dynamical mappings to the intrinsic-coordinate domain. Constraint interactions that help
to single out mutually consistent latent variable values run between corresponding points (which refer to the same point on a surface
within the visual scene).
We assume that these carry signals that are
proportional to the similarity of the signal pattern in
the controlled projection fibres on the one hand and
the signal pattern in the target neurons on the other.
(Processes of control neurons would thus transmit
and receive signals and should correspondingly be
called neurites.)
Different control units stand for different
transformation parameters (relative position, size or
orientation of connected sets of neurons in the two
domains) and may be responsible for connecting a
local patch in one domain to a local patch in the
other. The set of control units for different
transformation parameters for a given patch in the
target domain forms a column with WTA dynamics
and represents transformation parameters as latent
variables. In order to cover deformation,
transformation parameters may change slowly from
point to point in the target domain, and an entire
coherent mapping is represented by a net of laterally
connected control units (again, units contain a
number of redundant neurons to give leeway for
many nets to be stored side-by-side without mutual
interference).
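As a sketch under strong simplifications (pure image shifts, a dot-product similarity, and the hypothetical function name select_mapping), the role of a control column can be illustrated as follows: each candidate transformation gates one bundle of point-to-point fibres, is scored by the similarity of the patterns it connects, and competes in WTA fashion with the alternative transformations.

```python
import numpy as np

# Sketch of control-unit gating as we understand it; not a model specification.
def select_mapping(retinal_patch, intrinsic_patch, candidate_shifts):
    """retinal_patch: 2-D array in retinal coordinates.
    intrinsic_patch: smaller 2-D template in pattern-intrinsic coordinates.
    candidate_shifts: (dy, dx) offsets, at least one of which must fit."""
    h, w = intrinsic_patch.shape
    scores = {}
    for dy, dx in candidate_shifts:
        window = retinal_patch[dy:dy + h, dx:dx + w]
        if window.shape != (h, w):
            continue                          # this shift falls outside the patch
        # control-unit signal: similarity of the two connected signal patterns
        scores[(dy, dx)] = float(np.sum(window * intrinsic_patch))
    winner = max(scores, key=scores.get)      # WTA over the control column
    return winner, scores[winner]             # transformation as a latent variable
```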
10 SEGMENTATION
Vision is organized as a sequence of attention
flashes. During each such flash, analysis of sensory
input is restricted to a segment – a coherent chunk of
structure – e.g., to the region in retinal space that is
occupied by the image of an object. Like perception
in general, segmentation is a chicken-and-egg problem,
segmentation needing recognition, recognition
needing segmentation. Certain patterns indicative of
a coherent structure are already available in primary
cortex, such as the presence of coherent fields of
NCTA2014-InternationalConferenceonNeuralComputationTheoryandApplications
348
motion, depth or colour, or familiar contour shapes.
Others, however, need reference to patterns stored in
the intrinsic coordinate domain. For this to happen,
two types of latent variables have to be made to
converge first: the transformation parameters
identifying the segment's location and size in the
retinal-coordinate domain, and a fitting model in
memory. We have modelled this process for the
purpose of object recognition, which we tested
successfully on a benchmark, observing rather fast
convergence (Wolfrum et al., 2008). In general, the
intrinsic representation of the segment cannot be
found in memory but needs to be assembled from
partial patterns (just as the extended texture in
primary cortex is assembled from local texture
patches). Conceiving of objects as composites of
known elementary shapes is a well-established
concept (Biederman, 1987). This process of
assembly takes place in a coordinated fashion in the
different sub-modality modules. The de-
composition and re-composition of sensory patterns
is the basis for a very parsimonious system of
representing a large combinatorial universe of
surfaces of different shape, colouring and texture under a
range of illumination conditions and in different
states of motion.
11 RECOGNITION
The actual recognition process of a pattern in the
retinal coordinate domain against a pattern in the
intrinsic domain may be seen as a process of finding
a homeomorphic projection, or of graph matching,
performed by many control neurons simultaneously
checking for pattern similarity while competing
with alternate control neurons and cooperating with
compatible ones (compatible in the sense of forming
together a net structure), see (Wolfrum et al., 2008).
Recognition by graph matching has a long tradition,
see for instance (Kree and Zippelius, 1988) or
(Lades et al., 1993). A related approach is that of
Arathorn (2002), who has pointed out the value of
the information inherent in the shape of the
mapping, which is produced as a by-product.
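The scoring idea behind such a match can be caricatured as follows; this toy function is not the recurrent dynamics of Wolfrum et al. (2008), but it shows the two ingredients named above: feature similarity of corresponding nodes, and compatibility of neighbouring correspondences in the sense that links map onto links.

```python
import numpy as np

# Toy scoring of a candidate correspondence between a retinal-domain net and a
# stored intrinsic-domain net (illustrative only).
def match_score(correspondence, feat_a, feat_b, links_a, links_b):
    """correspondence: dict node_in_net_A -> node_in_net_B.
    feat_a, feat_b: dict node -> feature vector (numpy arrays of equal length).
    links_a, links_b: sets of node pairs, the lateral links of the two nets."""
    similarity = sum(float(feat_a[a] @ feat_b[b])
                     for a, b in correspondence.items())
    # homeomorphy term: connected nodes should map onto connected nodes
    preserved = sum(
        1 for a1, a2 in links_a
        if a1 in correspondence and a2 in correspondence
        and ((correspondence[a1], correspondence[a2]) in links_b
             or (correspondence[a2], correspondence[a1]) in links_b))
    return similarity + preserved
```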
12 PREDICTION
Once this process has converged for a moving
pattern and its motion parameters have also been
determined, the system can set the fibre projection
system in motion to track the object and send short-
term predictions of sensory input from the model in
the intrinsic domain down to the primary cortex.
Successful prediction of sensory input on the basis
of a constructed dynamic model is the ultimate basis
for our confidence in perceptual interpretations of
the environment, and is very important for the
adjustment of constraint interactions.
13 ONGOING WORK AND NEXT
STEPS
We are at present working on a simple version of the
architecture, implementing the modalities grey-level
(the input signal), surface reflectance, illumination,
depth, surface orientation and shading, all realized in
image coordinates. We are manually creating
constraint interactions between them and a small
number of lateral connectivity nets. The goal is to
model the perceptual collapse on simple sample
images. To suppress the tendency of the system to
break up spontaneously into local domains
(generating spurious latent-variable discontinuities)
we are working with coarse-to-fine strategies. As
we are embedded in the lab of the Bernstein Focus
Neurotechnology Frankfurt, which is engaged in an
effort to build a computer vision system by methods
of systems engineering, we plan to adapt more and
more known vision algorithms into the architecture.
14 CONCLUSIONS
All we are proposing is to re-interpret neural tissue
and dynamics so as to see them as the natural
basis for the structures and processes that are
required for vision. The essential point is the
assumption that neural tissue is an overlay of well-
structured “nets”, which are characterized by
sparsity (in terms of connections per neuron) and
consistency of different pathways between pairs of
neurons. Particular supporting assumptions are
exploitation of cell-number redundancy and winner-
take-all dynamics to disentangle different nets and to
represent latent variables, and co-localization and
non-linear interaction of synapses on dendritic
patches. Nets are activated on the perceptual
timescale and are generated by self-interaction and
Hebbian plasticity on the learning timescale. Nets
act as data structure for the representation of
memory patterns, as networks of constraints between
latent variables in different sub-modalities and as
projection fibre mappings between retinal and
AVisionArchitecture
349
intrinsic coordinate systems. The coherent
architecture that is shaped by these assumptions
promises to decisively expand the functional
repertoire of neural models. This architecture may
even help to unify the as yet very heterogeneous
array of algorithms and data structures that has
arisen in computer vision, an urgent precondition for
progress in that field.
There is an important type of experimental
prediction flowing from our proposal concerning the
detailed wiring diagram of cortical tissue.
Analogous to the network of molecular interactions,
which is dominated by “motifs” (Shen-Orr et al.,
2002), connectivity should be dominated by closed
loops or “diamond motifs”, short alternate pathways
starting in one neuron and ending in another, or even
on the same dendritic patch of another neuron. This
may turn out to be a very important type of results in
the upcoming era of connectomics.
REFERENCES
Allman, J., Miezin, F. and McGuinness, E., 1985.
Stimulus specific responses from beyond the classical
receptive field: neurophysiological mechanisms for
local-global comparisons in visual neurons. Annual
Review of Neuroscience 8, 407-430.
Anderson, C. H. and Van Essen, D. C., 1987. Shifter
circuits: a computational strategy for dynamic aspects
of visual processing. PNAS 84, 6297-6301.
Arathorn, D. W., 2002. Map-Seeking Circuits in Visual
Cognition: A Computational Mechanism for Biological
and Machine Vision. Stanford University Press,
Stanford, California.
Barlow, H.B., 1972. Single Units and Sensation: A
Neuron Doctrine for Perceptual Psychology.
Perception 1, 371-394.
Biederman, I., 1987. Recognition-by-components: a
theory of human image understanding. Psychological
Review 94, 115-147.
Fukushima, K., 1980. Neocognitron: A Self-Organizing
Neural Network Model for a Mechanism of Pattern
Recognition Unaffected by a Shift in Position.
Biological Cybernetics 36, 193-202.
Kree, R. and Zippelius, A., 1988. Recognition of
Topological Features of Graphs and Images in Neural
Networks. J. Phys. A 21, 813-818.
Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J.,
von der Malsburg, C., Würtz, R. P. and Konen, W.,
1993. Distortion invariant object recognition in the
dynamic link architecture. IEEE Transactions on
Computers 42, 300-311.
Olshausen, B. A. and Field, D. J., 1996. Emergence of
simple-cell receptive field properties by learning a
sparse code for natural images. Nature 381, 607-609.
Polsky, A., Mel, B. W. and Schiller, J., 2004.
Computational subunits in thin dendrites of pyramidal
cells. Nature Neuroscience 7, 621-627.
Shen-Orr, S.S., Milo, R., Mangan, S. and Alon, U., 2002.
Network motifs in the transcriptional regulation
network of Escherichia coli. Nature Genetics 31, 64-
68.
Wolfrum, P., Wolff, C., Lücke, J. and von der Malsburg,
C., 2008. A recurrent dynamic model for
correspondence-based face recognition. Journal of
Vision 8, 1-18. doi:10.1167/8.7.34.
NCTA2014-InternationalConferenceonNeuralComputationTheoryandApplications
350