to obtain a category-specific figure-ground segmentation. Training images are used to build a visual vocabulary of interest points, containing information about their relative positions as well as their corresponding segmentation masks.
Borenstein et al. (Borenstein et al., 2004) use the same idea of selecting informative patches from training images, then applying their segmentation masks to new, unseen images. They combine bottom-up and top-down approaches into a single process. The top-down approach uses an object representation learned from examples to detect an object in a new image and provides an approximation to its segmentation. The bottom-up approach uses image-based criteria to define coherent groups of pixels that are likely to belong to the same part. The resulting combination benefits from both approaches.
Several approaches propose to use Conditional Random Fields (CRFs) for part-based detection (Quattoni et al., 2004) or segmentation (Kumar and Hebert, 2006). The latter authors extend the notion of CRFs to Discriminative Random Fields (DRFs) by exploiting probabilistic discriminative models instead of the generative models generally used with CRFs. Kumar et al. (Kumar et al., 2005) propose another methodology for combining top-down and bottom-up cues with CRFs: they combine CRFs and pictorial structures (PS). The PS provides good shape-specific priors to the CRF and yields much better results.
None of the previous approaches is able to cope with occlusion. Winn and Shotton (Winn and Shotton, 2006) were the first to address this problem specifically, using an enhanced CRF. Their approach allows the relative layout (above/below/left/right) of parts to be modeled, as well as the propagation of long-range spatial constraints.
1.2 Description of our Approach
Our approach shares several features with the approaches mentioned above. First, it combines bottom-up and top-down strategies.
The bottom-up process consists of densely sampling image patches (small square image sub-windows) and normalizing them in size, as in (Kumar and Hebert, 2006; Winn and Shotton, 2006); each patch is represented for subsequent processing by a SIFT descriptor (Lowe, 2004). These descriptors are then vector quantized into a discrete set of labels, the so-called visual words: each patch is described by the word of its nearest centroid. This process is illustrated in Figure 2. From this stage on, images are seen as sets of visual word occurrences. As the process assigns figure/ground labels to patches, pixel-level segmentation requires an additional process, responsible for combining the labels carried by overlapping patches into per-pixel hypotheses.

Figure 2: The visual vocabulary is obtained by vector quantizing a set of image patch descriptors. Images are modeled as sets of overlapping documents, each document being a set of patches.
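As a concrete illustration, the following minimal sketch implements the dense patch sampling and visual-word assignment steps using OpenCV's SIFT implementation and scikit-learn's k-means; the patch size, stride, and vocabulary size are illustrative assumptions, not values prescribed by this paper.

import cv2
import numpy as np
from sklearn.cluster import KMeans

PATCH_SIZE = 16   # illustrative patch side length, in pixels
STRIDE = 8        # illustrative sampling step

def extract_patch_descriptors(image_bgr):
    """Densely sample square patches and compute one SIFT
    descriptor per patch, centered on the patch."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints = [
        cv2.KeyPoint(x + PATCH_SIZE / 2, y + PATCH_SIZE / 2, PATCH_SIZE)
        for y in range(0, gray.shape[0] - PATCH_SIZE, STRIDE)
        for x in range(0, gray.shape[1] - PATCH_SIZE, STRIDE)
    ]
    keypoints, descriptors = sift.compute(gray, keypoints)
    return keypoints, descriptors

def build_vocabulary(training_descriptors, n_words=500):
    """Vector quantize the pooled training descriptors;
    each centroid is one visual word."""
    kmeans = KMeans(n_clusters=n_words, n_init=4, random_state=0)
    kmeans.fit(np.vstack(training_descriptors))
    return kmeans

def quantize(descriptors, vocabulary):
    """Describe each patch by the word of its nearest centroid."""
    return vocabulary.predict(descriptors)

After this step, each image reduces to the list of word indices returned by quantize, together with the positions of the sampled patches.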
The top-down process embeds object models and uses them to enforce global coherence by combining the local information provided by the bottom-up process. Most of the models previously used in this context cannot be used here, because of the strong variation in object appearance. Geometric models such as the Pictorial Structure (Kumar et al., 2005) or the Implicit Shape Model (Leibe and Schiele, 2003) would require a huge number of training images in order to capture the large variability of appearance. Approaches based on characteristic edge patches (Borenstein et al., 2004) are only usable when object outlines are sufficiently stable. As a consequence, a more flexible model is required to address such object categories.
For the recognition of complex object categories, the bag-of-words model (Csurka et al., 2004) has been shown to be one of the most effective. It was inspired by text classification; the text framework becomes applicable once images have been transformed into sets of visual words, each image playing the role of a document. More recently, techniques based on latent aspects were developed on top of this unordered word-based representation and were applied first to text classification (Griffiths and Steyvers, 2004), and then to image classification (Sivic et al., 2005). Such models usually derive from probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 2001) or its Bayesian form, Latent Dirichlet Allocation (LDA) (Blei et al., 2002). Visual words are considered as generated from latent aspects (or topics), and each image is modeled as its own mixture of topics.
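As an illustration of this representation, the sketch below fits an LDA model to bag-of-visual-words counts with scikit-learn; the vocabulary size and topic count are illustrative, and the randomly generated word lists stand in for the quantized patches of real training images.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

N_WORDS = 500    # vocabulary size (illustrative)
N_TOPICS = 10    # number of latent aspects (illustrative)

def count_matrix(word_lists, n_words=N_WORDS):
    """One row per image (document), one column per visual word."""
    return np.stack([
        np.bincount(np.asarray(words), minlength=n_words)
        for words in word_lists
    ])

# Random word lists stand in for real quantized patches here.
rng = np.random.default_rng(0)
word_lists = [rng.integers(0, N_WORDS, size=200) for _ in range(50)]

counts = count_matrix(word_lists)
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
topic_mixtures = lda.fit_transform(counts)  # per-image topic proportions
word_given_topic = lda.components_          # per-topic word weights

The fitted model exposes exactly the two quantities named above: a per-topic distribution over visual words and a per-image mixture of topics.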
Using this latent aspect framework for segmenting images is appealing for several reasons. First, object appearances (topics) can be automatically discovered and learned, limiting the amount of supervision required. Second, the flexibility of such a framework can handle large variations in appearance.