object recognition (D. Wilkes, 1992). Several authors (Denzler and Brown, 2002; Farshidi et al., 2009; LaPorte and Arbel, 2006) have proposed probabilistic frameworks for active object recognition. These frameworks serve both to incorporate multiple viewpoints and to incorporate prior probabilities. However, most have been evaluated on only a small number of objects, using simple recognition schemes chosen specifically to highlight the benefits of active recognition.
We demonstrate the benefit of active object recognition by improving the results of a state-of-the-art approach, specifically in areas where performance is affected by the pose of an object. We recognize objects using Leabra (http://grey.colorado.edu/emergent/), a cognitive computational neural network simulation of the visual cortex. Its neural networks have hidden layers designed to mimic the functionality of the primary visual cortex (V1), the visual area V4, and the inferior temporal cortex (IT). We extend Leabra by adding a confidence measure to the resulting classification, then use active investigation when necessary to improve recognition results.
We demonstrate the performance of our system using the RGB-D database (Lai et al., 2011). The RGB-D dataset contains a full 360° range of yaw and three levels of pitch. We perform active object recognition on 115 instances of 28 object classes from the RGB-D dataset.
The remainder of the paper is organized as fol-
lows. We present related work in the field of active
object recognition in Section 2. We discuss our ap-
proach in Section 3, then present experimental results
in Section 4 with concluding remarks in Section 5.
2 RELATED WORK
Wilkes and Tsotsos’ (D. Wilkes, 1992) seminal work
on active object recognition examined 8 origami ob-
jects using a robotic arm. The next best viewpoint
was selected using a tree-based matching scheme.
This simple heuristic was formalized by Denzler and Brown (Denzler and Brown, 2002), who proposed an information-theoretic measure to select the next best viewpoint. They used average gray-level values to recognize objects, selecting the next pose in an optimal manner to provide the most information given the current set of probabilities for each object. They fused results using the product of the probabilities, demonstrating their approach on 8 objects.
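The product-rule fusion used by Denzler and Brown can be illustrated with a short sketch; the function name and the toy probability values below are ours, not drawn from the cited paper.

```python
import numpy as np

def fuse_views(per_view_probs):
    """Fuse per-viewpoint class posteriors by taking their
    elementwise product and renormalizing (naive-Bayes-style
    product-rule fusion)."""
    fused = np.prod(np.asarray(per_view_probs), axis=0)
    return fused / fused.sum()

# Two viewpoints, three object classes: each view is ambiguous
# on its own, but the product sharpens the joint estimate.
v1 = np.array([0.5, 0.4, 0.1])
v2 = np.array([0.5, 0.1, 0.4])
fused = fuse_views([v1, v2])
print(np.round(fused, 3))  # class 0 dominates: ~[0.758 0.121 0.121]
```

Because each additional viewpoint multiplies in its own evidence, classes that are plausible from only one angle are quickly suppressed.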
Jia et al. (Jia et al., 2010) demonstrated a slightly different approach to information fusion, using a
boosting classifier to weight each viewpoint accord-
ing to the importance for recognition. They used a
shape model to recognize objects, using a boosted
classifier to select the next best viewpoint. They rec-
ognized 9 objects in multiple viewpoints with arbi-
trary backgrounds.
Browatzki et al. (Browatzki et al., 2012) used an
active approach to recognize objects on an iCub hu-
manoid robot. Recognition in this case was performed
by segmenting the object from the background, then
recognizing the object over time using a particle filter.
The authors demonstrated this approach by recognizing 6 cups with differently colored bottoms.
3 METHODOLOGY
We use Leabra to recognize objects (Section 3.1). Once an object has been evaluated by Leabra, we determine the object pose (Section 3.2) and attach a confidence to the resulting classification (Section 3.5). Finally, when the resulting classification has low confidence, we actively investigate (Section 3.6).
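The classify-then-investigate loop just outlined can be sketched schematically as follows. All names here are illustrative placeholders rather than the Leabra API, and the confidence threshold and view limit are assumed parameters, not values from our system.

```python
def active_recognize(classify, next_view, confidence_threshold=0.8,
                     max_views=5):
    """Schematic active-recognition loop: classify the current view,
    and while confidence stays low, move to a new viewpoint and
    re-classify. `classify` returns (label, confidence); `next_view`
    moves the sensor to another viewpoint."""
    label, conf = classify()
    views = 1
    while conf < confidence_threshold and views < max_views:
        next_view()               # actively investigate: change viewpoint
        label, conf = classify()  # re-evaluate from the new pose
        views += 1
    return label, conf, views

# Toy usage with canned classifier outputs: two ambiguous views,
# then a confident one.
results = iter([("mug", 0.4), ("mug", 0.6), ("bowl", 0.9)])
moves = []
label, conf, n = active_recognize(lambda: next(results),
                                  lambda: moves.append(1))
print(label, conf, n)  # prints: bowl 0.9 3
```

The loop terminates either when a confident classification is reached or when the budget of viewpoints is exhausted.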
3.1 Leabra
The architecture of a Leabra neural network is broken
into three different layers, each with a unique func-
tion. The V1 layer takes the original image as input,
then uses wavelets (Gonzalez and Woods, 2007) at
multiple scales to extract edges. The V4 layer uses
these detected edges to learn a higher level representa-
tion of salient features (e.g., corners, curves) and their
spatial arrangement. The features extracted at the V1 layer include multiple scales; features extracted in the V4 layer therefore have a sense of both the large and small features present in the object. The V4 layer also collapses location information, providing invariance to the location of the object in the original input image. The V4 layer feeds directly into the
IT activation layer, which has neurons tuned to spe-
cific viewpoints (or visual aspects) of the object.
3.2 Visual Aspects
Object pose plays an important role in recognition.
We consider pose in terms of visual aspects (Cyr and Kimia, 2004) (see Figure 2). When an object under examination is viewed from a slightly different angle, the appearance generally should not change. When it does not, we refer to this as a "stable viewpoint": both the original and the modified viewpoint belong to the same visual aspect V1. However, if this small
VISAPP 2013 - International Conference on Computer Vision Theory and Applications