Unposed Object Recognition using an Active Approach
Wallace Lawson and J. Gregory Trafton
Naval Center for Applied Research in Artificial Intelligence, Washington D.C., U.S.A.
Keywords:
Object Recognition, Active Object Recognition, Neural Networks.
Abstract:
Object recognition is a practical problem with a wide variety of potential applications. Recognition becomes
substantially more difficult when objects have not been presented in some logical, “posed” manner selected by
a human observer. We propose to solve this problem using active object recognition, where the same object
is viewed from multiple viewpoints when it is necessary to gain confidence in the classification decision. We
demonstrate the effect of unposed objects on a state-of-the-art approach to object recognition, then show how
an active approach can increase accuracy. The active approach works by attaching confidence to recognition,
prompting further inspection when confidence is low. We demonstrate a performance increase on a wide
variety of objects from the RGB-D database, showing a significant increase in recognition accuracy.
1 INTRODUCTION
State-of-the-art-approaches to visual recognition have
focused mostly on situations when objects are
“posed” (i.e., the camera angle, lighting, and position
has been chosen by an observer). When conditions
become more variable, the ability to visually recog-
nize objects quickly decreases. In one prominent example demonstrating this effect, Pinto et al. (2008) produced very good accuracy classifying objects from the Caltech-101 dataset (Fei-Fei et al., 2004), but their state-of-the-art approach was reduced to chance performance when variation was introduced; specifically, viewing objects at arbitrary pan, tilt, scale, and rotation (both in plane and in depth).
Unfortunately, such variability is common in the
objects that we see scattered throughout our environ-
ment. In some cases (see figure 1(a)) it may be difficult even for the most robust visual object recognition approach to recognize an object, and the result is degraded recognition performance. Figure 1(a) shows two objects from the RGB-D dataset: from left to right, a dry battery and a hand towel. In both cases, the object could be mistaken for a similar class; the dry battery could easily be mistaken for a flashlight or a pack of chewing gum, and the hand towel for a 3-ring binder.
Figure 1(b) shows the accuracy of Leabra (described further in Section 3.1) recognizing a dry battery over a range of pan angles, with a slightly different camera tilt. While performance is generally good, there is a point at which it drops significantly: a system that had been recognizing objects with roughly 90% accuracy suddenly falls to 30% accuracy when the pan and tilt of the object are modified. An image from this region is shown in figure 1(a).
Figure 1: Images from the RGB-D dataset. (a) Challenging examples from the RGB-D dataset. (b) Recognizing the dry battery using a state-of-the-art approach.
The strategy of improving object recognition
through multiple viewpoints is referred to as active
object recognition (Wilkes and Tsotsos, 1992). Several probabilistic frameworks for active object recognition have been proposed (Denzler and Brown, 2002; Farshidi et al., 2009; LaPorte and Arbel, 2006). These frameworks serve both to incorporate multiple viewpoints and to incorporate prior probabilities. However, most have been evaluated on only a small number of objects, using simple recognition schemes chosen specifically to highlight the benefits of active recognition.
We demonstrate the benefit of active object recognition for improving the results of a state-of-the-art approach, specifically in areas where performance is affected by the pose of an object. We recognize objects using Leabra (http://grey.colorado.edu/emergent/), a cognitive computational neural network simulation of the visual cortex. Its networks have hidden layers designed to mimic the functionality of the primary visual cortex (V1), the visual area V4, and the inferior temporal cortex (IT). We extend Leabra by attaching a confidence measure to the resulting classification, then using active investigation when necessary to improve recognition results.
We demonstrate the performance of our system using the RGB-D database (Lai et al., 2011), which contains a full 360° range of yaw and three levels of pitch. We perform active object recognition on 115 instances of 28 object classes from the RGB-D dataset.
The remainder of the paper is organized as fol-
lows. We present related work in the field of active
object recognition in Section 2. We discuss our ap-
proach in Section 3, then present experimental results
in Section 4 with concluding remarks in Section 5.
2 RELATED WORK
Wilkes and Tsotsos’ (D. Wilkes, 1992) seminal work
on active object recognition examined 8 origami ob-
jects using a robotic arm. The next best viewpoint
was selected using a tree-based matching scheme.
This simple heuristic was formalized by Denzler and
Brown (Denzler and Brown, 2002) who proposed an
information theoretic measure to select the next best
viewpoint. They use average gray level value to rec-
ognize objects, selecting the next pose in an optimal
manner to provide the most information to the current
set of probabilities for each object. They fused results
using the product of the probabilities, demonstrating
their approach on 8 objects.
Jia et al. (2010) demonstrated a slightly different approach to information fusion, using a boosting classifier to weight each viewpoint according to its importance for recognition. They used a shape model to recognize objects and a boosted classifier to select the next best viewpoint, recognizing 9 objects in multiple viewpoints with arbitrary backgrounds.
Browatzki et al. (Browatzki et al., 2012) used an
active approach to recognize objects on an iCub hu-
manoid robot. Recognition in this case was performed
by segmenting the object from the background, then
recognizing the object over time using a particle filter.
The authors demonstrated this approach by recognizing 6 cups with differently colored bottoms.
3 METHODOLOGY
We use Leabra to recognize objects (section 3.1).
Once an object has been evaluated by Leabra, we estimate the object's pose (section 3.2) and attach confidence to the resulting classification (section 3.5). Finally, when the resulting classification has low confidence, we actively investigate (section 3.6).
3.1 Leabra
The architecture of a Leabra neural network is broken
into three different layers, each with a unique func-
tion. The V1 layer takes the original image as input,
then uses wavelets (Gonzalez and Woods, 2007) at
multiple scales to extract edges. The V4 layer uses
these detected edges to learn a higher level representa-
tion of salient features (e.g., corners, curves) and their
spatial arrangement. The features extracted at the V1 layer include multiple scales; therefore, features extracted in the V4 layer capture both the large and small features present in the object. The V4 layer also collapses across location, providing invariance to the object's position in the original input image. The V4 layer feeds directly into the
IT activation layer, which has neurons tuned to spe-
cific viewpoints (or visual aspects) of the object.
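To make the pipeline concrete, the following is a minimal sketch of the kind of multi-scale wavelet edge extraction performed at the V1 stage. It is illustrative only: Leabra's actual V1 layer is a biologically grounded neural model, and this sketch merely shows multi-scale edge energy computed with the pywt library; the function name v1_like_features is our own.

```python
# Illustrative sketch only: Leabra's V1 layer is a neural simulation, not
# reproduced here. This shows the general idea of multi-scale wavelet edge
# extraction, assuming a grayscale image as a NumPy array.
import numpy as np
import pywt

def v1_like_features(image, levels=3):
    """Return one 2-D edge-energy map per scale, coarse to fine, loosely
    analogous to the multi-scale edge responses of the V1 layer."""
    coeffs = pywt.wavedec2(image, wavelet='haar', level=levels)
    maps = []
    for horiz, vert, diag in coeffs[1:]:       # skip the approximation band
        maps.append(np.sqrt(horiz**2 + vert**2 + diag**2))
    return maps

# Example: a 128x128 image yields 3 edge maps of increasing resolution.
edge_maps = v1_like_features(np.random.rand(128, 128), levels=3)
print([m.shape for m in edge_maps])            # [(16, 16), (32, 32), (64, 64)]
```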
3.2 Visual Aspects
Object pose plays an important role in recognition.
We consider pose in terms of visual aspects (Cyr and Kimia, 2004; Sebastian et al., 2004) (see figure 2). When an object under examination is viewed from a slightly different angle, its appearance generally should not change. When it does not, we refer to this as a "stable viewpoint": both the original and the modified viewpoint belong to the same visual aspect $V_1$. However, if this small change in viewing angle affects the appearance of the object, we call this an "unstable viewpoint", representing a transition between two different visual aspects $V_1$ and $V_2$.

Figure 2: An example of visual aspects from one level of pitch. The images show different visual aspects, and the arrows show how these visual aspects are connected.
The human brain stores pose in a similar manner: neurophysiological evidence suggests that the brain uses view-specific encoding (Kietzmann et al., 2009; Frank et al., 2012). In this encoding scheme, neurons in the IT cortex activate differently depending on how an object appears. Referring to Figure 2, when we look at the
football in the first visual aspect, a certain set of neu-
rons in the IT layer activate. When we look at the
football in the second visual aspect, a different set of
neurons activate.
To find visual aspects, we use the IT layer in the
Leabra neural network. We find visual aspects using
unsupervised learning, clustering IT activations. We
describe this process in section 3.3.
3.3 Finding Aspects
Classifying an object using a Leabra network pro-
duces a set of neurons that have been activated in the
IT layer. Leabra contains a total of 210 neurons in this
layer, with similar activation patterns occurring when
an object is viewed in a similar pose. We group ac-
tivation patterns using unsupervised learning through
k-means clustering (Duda et al., 2000).
Some care must be taken in establishing the number of clusters, k, since this corresponds to the number of visual aspects of an object. This number varies depending on the complexity of the object. For example, a simple, uniformly colored, perfectly symmetric object such as a ball has only one aspect: a change in viewing angle will never affect its appearance. Contrast this with a more complicated object, such as an automobile, which would likely have a great number of visual aspects because of its complex structure.
The value of k cannot be estimated a priori, so we set it using a heuristic based on viewpoint stability: a small change in viewpoint ($\delta$) should generally not result in a new visual aspect. Therefore, when the correct value of k has been found, the elements of the resulting clusters will mostly belong to stable viewpoints. We determine the quality of the clustering using the heuristic shown in Eq. 1, where c represents a cluster and $cl(\cdot)$ maps a pose to its assigned cluster:

$$m(c) = \frac{|\{i \in c : cl(pose(i)) = cl(pose(i) + \delta)\}|}{|c|} \quad (1)$$

To determine the correct number of visual aspects, we set k to a large number, then evaluate each resulting cluster. If the majority of the elements of any cluster do not belong to stable viewpoints, k is decreased and the process is repeated. Some visual aspects from different object classes are shown in figure 3.

Figure 3: Four different visual aspects found using clustering.
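The following sketch illustrates this search over k, assuming IT activation vectors and a mapping from each view to its $\delta$-neighbor are available. The names (stability, find_aspects, neighbor_of) are illustrative rather than from the original system, and the 0.5 threshold simply encodes the "majority" criterion described above.

```python
# A minimal sketch of the aspect-finding procedure, assuming IT activations
# have already been extracted as row vectors of it_patterns.
import numpy as np
from sklearn.cluster import KMeans

def stability(labels, neighbor_of):
    """Eq. 1: per-cluster fraction of members whose delta-neighbor falls in
    the same cluster. neighbor_of[i] indexes the view a small viewpoint
    change (delta) away from view i."""
    scores = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        scores.append(np.mean(labels[neighbor_of[members]] == c))
    return scores

def find_aspects(it_patterns, neighbor_of, k_max=20, threshold=0.5):
    """Decrease k until every cluster is dominated by stable viewpoints."""
    for k in range(k_max, 1, -1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(it_patterns)
        if min(stability(labels, neighbor_of)) > threshold:
            return labels
    return np.zeros(len(it_patterns), dtype=int)   # fall back to one aspect
```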
3.4 Distinctiveness of Visual Aspects
A basic tenet of active object recognition is that some viewpoints are more distinctive than others. In this section, we establish the distinctiveness of each visual aspect using STRoud (Barbara et al., 2006), a test which evaluates the distinctiveness (or conversely, "strangeness") of the members of each class. The strangeness of a member m is evaluated as the ratio of its distance to other objects of its class c over its distance to objects of all other classes $\bar{c}$ (Eq. 2). In practice, we evaluate this by selecting the K smallest distances.
$$str(m, c) = \frac{\sum_{i=1}^{K} distance_c(i, m)}{\sum_{i=1}^{K} distance_{\bar{c}}(i, m)} \quad (2)$$
The sum of the distances to objects in the same class c should be much smaller than the sum of the distances to other classes $\bar{c}$; therefore, a distinctive data point has very low strangeness. For a visual aspect s, the probability that we have correctly identified object o in visual aspect s is determined using Eq. 3:
UnposedObjectRecognitionusinganActiveApproach
311
$$p(s|o) = \frac{|\{i \in s : str(i, s) \geq str(o, s)\}|}{|s|} \quad (3)$$
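A minimal sketch of how Eqs. 2 and 3 might be computed is given below, assuming each object is represented by a feature vector (e.g., an IT activation pattern). The function names are our own, and the >= comparison in the distinctiveness test reflects our reading of the STRoud test rather than the authors' exact implementation.

```python
# Hedged sketch of strangeness (Eq. 2) and aspect distinctiveness (Eq. 3),
# using the K nearest neighbors from each class as the paper describes.
import numpy as np

def strangeness(m, same_class, other_class, K=5):
    """Eq. 2: summed distances to the K nearest same-class points over
    summed distances to the K nearest other-class points."""
    d_same = np.linalg.norm(same_class - m, axis=1)
    d_same = np.sort(d_same[d_same > 0])[:K]     # drop self-distance if present
    d_other = np.sort(np.linalg.norm(other_class - m, axis=1))[:K]
    return d_same.sum() / d_other.sum()

def aspect_distinctiveness(o, aspect_members, other_class, K=5):
    """Eq. 3: fraction of the aspect's members at least as strange as o."""
    str_o = strangeness(o, aspect_members, other_class, K)
    strs = np.array([strangeness(i, aspect_members, other_class, K)
                     for i in aspect_members])
    return np.mean(strs >= str_o)
```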
3.5 Recognition from a Single
Viewpoint
To recognize object o, we use the IT activation pattern
from Leabra (a), then compare this against known
classes. We compute the strangeness that the object
belongs to each class using Eq. 2. The probability of
recognition is conditionally dependent upon the dis-
tinctiveness of the visual aspect s, as well as the con-
fidence that the object belongs to visual aspect s. The
probability that we have recognized an object of class o is $p(o_{ix}|a_x, s_x)$, for object i using image x.
$$p(o_{ix}|a_x, s_x) = p(o_{ix}|a_x)\, p(o_{ix}|s_x) \quad (4)$$

$$p(o_{ix}|a_x) = \alpha\, p(a_x|o_{ix})\, p(o_{ix}) \quad (5)$$

$$p(o_{ix}|s_x) = \alpha\, p(s_x|o_{ix})\, p(o_{ix}) \quad (6)$$
Eq. 5 can be interpreted as the probability that we have observed the object given a particular activation pattern: if the observed activation pattern is quite similar to known activation patterns for object $o_i$, we expect the probability to be high. Similarly, Eq. 6 can be interpreted as the general confidence of recognizing object o in estimated visual aspect s. Combining the two (Eq. 4) produces an uncertainty measure that accounts both for the similarity of activation patterns and for the confidence in the visual aspect.
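As a sketch, the single-viewpoint confidence of Eqs. 4-6 could be computed as follows, assuming per-class likelihoods of the observed activation pattern and aspect distinctiveness scores are available. The normalization constant $\alpha$ is implemented here by renormalizing each posterior, and a uniform prior is assumed unless one is supplied; the function name is illustrative.

```python
# Sketch of the single-viewpoint posterior (Eqs. 4-6) over all classes.
import numpy as np

def single_view_posterior(activation_likelihood, aspect_confidence, prior=None):
    """activation_likelihood[i] ~ p(a_x | o_i); aspect_confidence[i] ~ p(s_x | o_i).
    Returns p(o_i | a_x, s_x) for every class i (Eq. 4)."""
    n = len(activation_likelihood)
    prior = np.full(n, 1.0 / n) if prior is None else prior
    p_a = activation_likelihood * prior
    p_a /= p_a.sum()                    # alpha normalization, Eq. 5
    p_s = aspect_confidence * prior
    p_s /= p_s.sum()                    # alpha normalization, Eq. 6
    posterior = p_a * p_s               # Eq. 4
    return posterior / posterior.sum()
```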
3.6 Active Recognition
When confidence is low, a single image may not be
sufficient to correctly recognize the object. In these
cases, we make a small local movement to view the
object from a slightly different perspective, then com-
bine the measurements. The probability that the object belongs to class i, as suggested in (Denzler and Brown, 2002), is estimated using the product of all n measurements taken over time (Eq. 7). This formulation also allows a prior probability to be incorporated, which we set to uniform.
$$p(i) = \prod_{x=1}^{n} p(o_{ix}|a_x, s_x) \quad (7)$$
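The active loop can then be sketched as below. Here get_view_posterior and next_view are hypothetical stand-ins for evaluating Eq. 4 at a pose and making the small local movement; the confidence threshold matches the value reported in Section 4, and 28 classes matches our experimental setup.

```python
# A minimal sketch of active recognition (Eq. 7): multiply single-view
# posteriors across views until one class is confident enough.
import numpy as np

def active_recognize(get_view_posterior, next_view, threshold=0.99999,
                     max_views=10, n_classes=28):
    belief = np.full(n_classes, 1.0 / n_classes)    # uniform prior
    pose = 0
    for _ in range(max_views):
        belief *= get_view_posterior(pose)          # Eq. 7, incremental product
        belief /= belief.sum()
        if belief.max() >= threshold:
            break
        pose = next_view(pose)                      # small local movement
    return int(np.argmax(belief)), belief.max()
```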
4 EXPERIMENTAL RESULTS
We experimentally validate our approach using the
RGB-D dataset (Lai et al., 2011). This particular
dataset was selected due to its large number of object
classes, many instances of each class, and the range
of poses where each instance was imaged. A few ex-
amples of training images are in Figure 5. Our exper-
iments are conducted using 115 instances of 28 object
classes. RGB-D contains images of each object viewed from three different levels of camera pitch, rotating the object a full 360° at each pitch level. We use 39 randomly selected images per object for training (approximately 5% of the images). One third of the remaining images were used for validation, and the rest for testing (52,404 images).
Figure 4: Frequency and number of positions used during
the active object recognition process.
We extract the object using the foreground mask
provided in the RGB-D dataset. The foreground mask
represents the part of the scene that is not on the table, as estimated using the depth information provided by the Kinect. The size of the object was normalized in the same manner as previously described in (Pinto et al., 2008). The purpose of fore-
ground extraction and size normalization is to remove
irrelevant size cues and to provide a measure of scale
invariance.
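A rough sketch of this preprocessing is shown below, assuming the foreground mask is a boolean array aligned with the RGB image. The exact normalization of Pinto et al. (2008) may differ, so treat the cropping and resizing details as illustrative.

```python
# Hedged sketch of foreground extraction and size normalization.
import numpy as np
from PIL import Image

def extract_and_normalize(rgb, mask, out_size=128):
    """Zero out the background, crop to the mask's bounding box, and scale
    so the longer side equals out_size (padding the shorter side)."""
    fg = rgb * mask[..., None]                      # suppress background
    ys, xs = np.nonzero(mask)
    crop = fg[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape[:2]
    scale = out_size / max(h, w)
    new_w = min(out_size, max(1, round(w * scale)))
    new_h = min(out_size, max(1, round(h * scale)))
    img = Image.fromarray(crop.astype(np.uint8)).resize((new_w, new_h))
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    canvas[:new_h, :new_w] = np.asarray(img)        # pad to a square canvas
    return canvas
```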
Table 1 shows the recognition rates for Leabra (i.e., single-viewpoint or static object recognition) and for active object recognition. Active object recognition has been set for very high confidence (p=0.99999) and therefore only recognizes an object when it is extremely confident in the result. Note that this confidence applies on a per-decision basis and does not necessarily predict the overall performance of the system, since performance is driven by the variability of the testing data.
During active investigation, objects are examined at 2.4 positions on average before they are recognized. The frequency of the positions used during examination is shown in Figure 4.
Table 1: Object Recognition Results.

object   ball      binder     bowl       calculator camera    cellphone cerealbox coffeemug
Leabra   95.04%    88.92%     97.80%     84.46%     69.95%    93.47%    77.64%    94.68%
Active   97.30%    96.54%     99.92%     93.80%     87.98%    91.84%    96.15%    97.68%

object   comb      drybattery flashlight foodbag    foodbox   foodcan   foodcup   foodjar
Leabra   86.77%    96.50%     91.65%     82.93%     93.51%    96.38%    94.00%    63.07%
Active   94.06%    100.00%    98.68%     98.29%     99.69%    99.79%    99.19%    79.02%

object   gluestick handtowel  keyboard   kleenex    lightbulb marker    notebook  pitcher
Leabra   98.97%    80.55%     76.49%     83.63%     91.80%    96.06%    90.93%    80.62%
Active   99.87%    98.92%     85.30%     96.76%     99.01%    98.92%    97.86%    98.12%

object   plate     pliers     scissors   sodacan
Leabra   99.90%    96.62%     93.61%     97.85%
Active   99.90%    98.63%     98.99%     94.22%
Figure 5: Examples of training images from the RGB-D dataset.
Across all of the objects, the static approach has a precision of 90.55%, while the active approach has a precision of 96.81%. Furthermore, the standard deviation of precision differs greatly between the two approaches: 9.27% for the static approach versus 4.95% for the active approach. This indicates not only that the overall accuracy of the system improves, but also that fewer objects are recognized with low accuracy.
5 DISCUSSION
State-of-the-art approaches to object recognition have
been demonstrated to perform very well on posed ob-
jects. We have shown that unposed objects can be
more difficult to recognize, particularly in degener-
ate viewpoints. Further, an active strategy can boost
the performance of the system even when considering
a simple approach to next best viewpoint selection.
Using only a random movement strategy, we demonstrated a 6% improvement in accuracy without significantly impacting the recognition speed of the system (requiring only 2.4 positions on average).
ACKNOWLEDGEMENTS
This work was partially supported by the Of-
fice of Naval Research under job order num-
ber N0001407WX20452 and N0001408WX30007
awarded to Greg Trafton. Wallace Lawson was also
funded by the Naval Research Laboratory under a
Karles Fellowship.
UnposedObjectRecognitionusinganActiveApproach
313
REFERENCES
Barbara, D., Domeniconi, C., and Rodgers, J. (2006). De-
tecting outliers using transduction and statistical test-
ing. 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining.
Browatzki, B., Tikhanoff, V., Metta, G., Bulthoff, H., and
Wallraven, C. (2012). Active object recognition on a
humanoid robot. In IEEE International Conference on
Robotics and Automation.
Cyr, C. and Kimia, B. (2004). A similarity-based aspect-
graph approach to 3d object recognition. International
Journal of Computer Vision, 57(1).
Wilkes, D. and Tsotsos, J. K. (1992). Active object recognition. In Computer Vision and Pattern Recognition (CVPR).
Denzler, J. and Brown, C. (2002). Information theoretic
sensor data selection and active object recognition and
state estimation. IEEE. Trans. on Pattern Analysis and
Machine Intelligence, 24(2).
Duda, R., Hart, P., and Stork, D. (2000). Pattern Classifica-
tion. Wiley Interscience.
Farshidi, F., Sirouspour, S., and Kirubarajan, T. (2009). Ro-
bust sequential view planning for object recognition
using multiple cameras. Image and Vision Comput-
ing.
Fei-Fei, L., Fergus, R., and Perona, P. (2004). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE Conference on Computer Vision and Pattern Recognition.
Frank, M., Munakata, Y., Hazy, T., and O’Reilly, R. (2012).
Computational Cognitive Neuroscience.
Gonzalez, R. and Woods, R. (2007). Digital Image Process-
ing. Prentice Hall.
Jia, Z., Chang, Y., and Chen, T. (2010). A general boosting based framework for active object recognition. British Machine Vision Conference (BMVC).
Kietzmann, T. C., Lange, S., and Riedmiller, M. (2009). Computational object recognition: A biologically motivated approach. Biol. Cybern.
Lai, K., Bo, L., Ren, X., and Fox, D. (2011). A large-scale
hierarchical multi-view rgb-d object dataset. In Inter-
national Conference on Robotics and Automation.
LaPorte, C. and Arbel, T. (2006). Efficient discriminant
viewpoint selection for active bayesian recognition.
International Journal of Computer Vision, 68(3):267–
287.
Pinto, N., Cox, D., and DiCarlo, J. (2008). Why is real-
world visual object recognition hard? PLoS Compu-
tational Biology, 4(1).
Sebastian, T., Klein, P., and Kimia, B. (2004). Recognition of shapes by editing their shock graphs. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(5):550-571.
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
314