CONTOUR SEGMENT ANALYSIS

FOR HUMAN SILHOUETTE PRE-SEGMENTATION

Cyrille Migniot, Pascal Bertolino and Jean-Marc Chassery

CNRS Gipsa-Lab DIS, 961 rue de la Houille Blanche BP 46 - 38402 Grenoble Cedex, France

Keywords:

Human detection and segmentation, Silhouette, Histograms of oriented gradients.

Abstract:

Human detection and segmentation is a challenging task owing to variations in human pose and clothing. Combining descriptors based on Histograms of Oriented Gradients with a Support Vector Machine classifier is a classic and efficient method for detecting humans in images. However, as is often the case in detection, the detected persons are not accurately segmented, although many applications need such a segmentation. This paper tackles the problem of producing information that will guide the final segmentation step. It presents a method that uses the combination mentioned above to associate with each contour segment a likelihood of being part of a human silhouette. Thus, data previously computed during detection are reused in the pre-segmentation. A human silhouette database was created for learning.

1 INTRODUCTION

Simultaneous detection and segmentation of elements of a known class, and of persons in particular, is a much discussed problem. It is a challenging task because of the variety of colors and poses that a person can exhibit. The aim is to avoid user supervision. It is a critical component of many applications such as video surveillance, driver assistance or video indexing.

Dalal and Triggs (Dalal and Triggs, 2005) presented an efficient and reliable detection algorithm based on Histograms of Oriented Gradients (HOG) descriptors with a Support Vector Machine (SVM) classifier. Nevertheless, although the person is localized, no information about his/her shape is provided and no segmentation is performed.

This paper describes a local HOG process. The global HOG process applied to a detection window only decides whether a person is present in the window. With a local HOG process, information related to the person's silhouette is obtained. Our method uses this approach to determine, for each contour segment, its relevance as a part of a person's silhouette (Figure 1). These data might then allow segmentation, for example by looking for the shortest-path cycle in a graph.

Here, contour segments are a well-suited support. Given that only shape is discriminative of the person class, it is logical to study contours. Furthermore, a segment gathers pixels that share the same location and orientation. Associating segments with HOG-based results is therefore relevant and allows data computed during detection to be reused.

Our method studies a positive detection window (one containing a person) and generates, for each contour segment, a likelihood of belonging to a human silhouette. Learning and tests are carried out with the INRIA Static Person Data Set, which contains positive detection windows of Dalal's algorithm (Dalal and Triggs, 2005).

Figure 1: Input image (a), contour image computed with the Canny algorithm (b), contour pixels gathered in contour segments (c) and likely segments (d).

1.1 Related Work

Most human detection works have used a “descriptor/classifier” framework (Figure 2). The descriptor converts an image into a vector of features that are discriminative of the searched class. From the feature vector of an image, the classifier determines whether a person is in the image. To describe a class, the classifier is built from the feature vectors of positive detection windows (containing an instance of the target class) and negative ones. SVM (Vapnik, 1995), AdaBoost (Freund and Schapire, 1995) and neural networks (Toth and Aach, 2003) are the most widely used classifiers. Regarding descriptors, Haar wavelets (Oren et al., 1997) and HOGs are the most common, but Principal Component Analysis (PCA) (Munder et al., 2008), Riemannian manifolds (Tuzel et al., 2007) and Fourier descriptors (Toth and Aach, 2003) are used too. Detection can be done for the entire person or for body parts (limbs, torso, head, etc.) as in (Alonso et al., 2007). Other works rely on different techniques such as human movement recognition, in particular walking periodicity (Ran et al., 2005).

Figure 2: The “descriptor/classifier” framework principle.

Migniot C., Bertolino P. and Chassery J. (2010).
CONTOUR SEGMENT ANALYSIS FOR HUMAN SILHOUETTE PRE-SEGMENTATION.
In Proceedings of the International Conference on Computer Vision Theory and Applications, pages 74-80
DOI: 10.5220/0002845300740080
SciTePress

Simultaneous detection and segmentation require additional processing. Stereo vision (Kang et al., 2002) easily segments elements lying at different depths but requires particular equipment. Silhouette template matching (Lin et al., 2007) (Munder and Gavrila, 2006) segments the person with the nearest known template.

Contour segment analysis is used as well. (Wu and Nevatia, 2007) selects the most important edgelets with a cascade of classifiers, but essentially for detection. (Sharma and Davis, 2007) looks for the most likely contour segment cycles without any knowledge of the studied class. These cycles are then associated with person features using a Markov random field. This approach is the inverse of ours.

1.2 HOG and SVM Combination

Our method is based on the combination used by (Dalal and Triggs, 2005) for detection, which we recall in this section. The process consists in testing all possible detection windows at all locations and scales. A human presence measure is associated with each detection window.

Descriptors are based on contour orientation. In order to obtain spatial information, each studied detection window is divided into areas named blocks and cells.

1.2.1 Blocks and Cells

The detection window is divided into cells by a rectangular grid. Cells are grouped into blocks. Each block contains a fixed number of cells, and each cell belongs to at least one block.

Without block overlapping, a cell belongs to a single block, and the division into blocks is also a rectangular grid (Figure 3). With block overlapping, a cell may belong to several blocks. We first consider the case without block overlapping.

Figure 3: Image partitioned in blocks (in red) and cells (in blue) with no block overlapping.
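The grouping of cells into blocks can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `group_blocks`, the square block shape and the `stride` parameter are hypothetical and do not come from the paper. A stride equal to the block side gives the non-overlapping case, while a smaller stride makes blocks overlap.

```python
def group_blocks(n_cell_rows, n_cell_cols, block_side, stride):
    """Group a grid of cells into square blocks (illustrative helper).

    Each block is a list of (cell_row, cell_col) indices.
    stride == block_side -> no block overlapping.
    stride <  block_side -> overlapping blocks.
    """
    blocks = []
    for top in range(0, n_cell_rows - block_side + 1, stride):
        for left in range(0, n_cell_cols - block_side + 1, stride):
            blocks.append([(top + i, left + j)
                           for i in range(block_side)
                           for j in range(block_side)])
    return blocks


def blocks_containing(blocks, cell):
    """Return the indices of the blocks that contain a given cell."""
    return [b for b, cells in enumerate(blocks) if cell in cells]


if __name__ == "__main__":
    no_overlap = group_blocks(4, 4, block_side=2, stride=2)
    overlap = group_blocks(4, 4, block_side=2, stride=1)
    print(len(no_overlap), len(overlap))
    print(blocks_containing(overlap, (1, 1)))
```

With a 4 x 4 grid of cells and blocks of 4 cells, the non-overlapping layout assigns every cell to exactly one block, whereas with a stride of one cell an interior cell belongs to four blocks.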

1.2.2 Histograms of Oriented Gradients

For a given area, a HOG gives the proportion of contour pixels falling in each orientation bin. The orientation interval is evenly divided into $N_{bin}$ orientation bins that form the set $\Delta$:

$$\Delta = \left\{ \left[ \frac{2k-1}{2N_{bin}}\pi,\ \frac{2k+1}{2N_{bin}}\pi \right],\ k \in [1, N_{bin}] \right\} \qquad (1)$$

Let $\nu^{I}_{cell}$ be the occurrence frequency of contour pixels whose orientation belongs to interval $I$ in the cell $cell$. The cell HOG is given by:

$$HOG_{cell} = \left\{ \nu^{I}_{cell},\ I \in \Delta \right\} \qquad (2)$$
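The cell histogram of equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that orientations are taken modulo π and that bin k has width π/N_bin and is centered on kπ/N_bin; the function names `bin_index` and `cell_hog` are hypothetical.

```python
import math


def bin_index(theta, n_bin):
    """1-based index k of the orientation bin containing theta (radians).

    Assumption: bin k covers [(2k-1)pi/(2 n_bin), (2k+1)pi/(2 n_bin)),
    and orientations are considered modulo pi.
    """
    width = math.pi / n_bin
    lower = width / 2.0                      # lower bound of the first bin
    shifted = (theta - lower) % math.pi      # position relative to the first bin
    return min(int(shifted // width), n_bin - 1) + 1


def cell_hog(orientations, n_bin):
    """Occurrence frequency of contour-pixel orientations per bin."""
    counts = [0] * n_bin
    for theta in orientations:
        counts[bin_index(theta, n_bin) - 1] += 1
    total = len(orientations)
    if total == 0:
        return [0.0] * n_bin
    return [c / total for c in counts]


if __name__ == "__main__":
    thetas = [math.pi / 4, math.pi / 4, math.pi / 2, math.pi]
    print(cell_hog(thetas, n_bin=4))
```

Each contour pixel votes for exactly one bin here; the interpolated vote used later in the paper spreads the vote over two neighboring bins.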

1.2.3 Descriptor

HOGs are descriptors and are thus represented by feature vectors. Clothing colors and texture variations are not discriminative. HOGs describe shape and are therefore relevant for human detection.

1.2.4 Classifier

For a block $block$, the feature vector $V_{block}$ is made from all the HOGs of its cells:

$$V_{block} = \left\{ HOG_{cell},\ cell \in block \right\} \qquad (3)$$

From the normalized concatenation of the feature vectors of all blocks, the SVM classifier indicates whether a person is in the detection window. An overview of the method is shown in Figure 4.


Figure 4: An overview of person detection.

2 COMBINING HOG WITH SVM FOR EACH BLOCK

In our approach, the classification process is no longer performed globally but at the block level, in order to classify each piece of contour as being or not being a piece of the person's silhouette. First, a contour map of the detection window is computed using the Canny algorithm. Then a list of contour segments is created. The window is divided into blocks and cells. In the following, we present the method without and then with block overlapping.

2.1 Without Block Overlapping

The method described previously provides one value for each window. Our aim is to obtain one value for each contour segment.

2.1.1 Feature Vectors

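Each contour pixel with orientation $\theta \in \left[ \frac{k}{N_{bin}}\pi,\ \frac{k+1}{N_{bin}}\pi \right]$ updates two bins of the cell's HOG with weights $\omega$ (for bin $\nu_{k}$) and $1-\omega$ (for bin $\nu_{k+1}$) as follows:

$$\omega = \frac{\frac{k+1}{N_{bin}}\pi - \theta}{\frac{\pi}{N_{bin}}} \qquad (4)$$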

$HOG_{cell}$ is still obtained with equation 2 and $V_{block}$ with equation 3.
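The interpolated vote of equation 4 can be sketched in Python as below. This is a hedged illustration: it assumes that the weight is a linear interpolation between the two nearest bin centers kπ/N_bin and (k+1)π/N_bin, and that bin indices wrap around because orientations are taken modulo π. The names `soft_vote` and `cell_hog_soft` and the wrap-around rule are assumptions, not taken from the paper.

```python
import math


def _wrap(k, n_bin):
    """Map any integer bin index onto 1..n_bin (orientation is modulo pi)."""
    return (k - 1) % n_bin + 1


def soft_vote(theta, n_bin):
    """Return [(bin, weight), (next_bin, 1 - weight)] for one contour pixel.

    theta lies between the centers k*pi/n_bin and (k+1)*pi/n_bin;
    the closer theta is to a center, the larger that bin's weight.
    """
    width = math.pi / n_bin
    theta = theta % math.pi
    k = int(theta // width)
    omega = ((k + 1) * width - theta) / width
    return [(_wrap(k, n_bin), omega), (_wrap(k + 1, n_bin), 1.0 - omega)]


def cell_hog_soft(orientations, n_bin):
    """Cell HOG accumulated with interpolated votes, as frequencies."""
    hist = [0.0] * n_bin
    for theta in orientations:
        for k, weight in soft_vote(theta, n_bin):
            hist[k - 1] += weight
    total = sum(hist)
    return [h / total for h in hist] if total else hist


if __name__ == "__main__":
    # An orientation halfway between the centers of bins 1 and 2
    print(soft_vote(3 * math.pi / 8, n_bin=4))
```

A pixel whose orientation sits exactly on a bin center gives its full weight to that bin, and a pixel halfway between two centers gives half of its vote to each.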

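For the segment likelihood calculation, the block HOG is defined by:

$$HOG_{block} = \left\{ \nu^{I}_{block},\ I \in \Delta \right\} \qquad (5)$$

where, by analogy with equation 2, $\nu^{I}_{block}$ is the occurrence frequency of contour pixels whose orientation belongs to interval $I$ in the block $block$.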

2.1.2 Normalization

Illumination and contrast can greatly modify HOG values, which reduces the efficiency of the learning and classification process. A normalization of $V_{block}$ is therefore performed. Let $N$ be the vector size. Four different normalization schemes were tested:

• L1-norm: division by the L1 norm of the vector:

$$V^{L1}_{block}(i) = \frac{V_{block}(i)}{\sum_{k=1}^{N} V_{block}(k)} \qquad (6)$$

• L1-sqrt norm: the square root of the L1-normalized vector is used, so that the feature vector is treated as a probability distribution:

$$V^{L1sqrt}_{block}(i) = \sqrt{\frac{V_{block}(i)}{\sum_{k=1}^{N} V_{block}(k)}} \qquad (7)$$

• L2-norm: division by the L2 norm of the vector:

$$V^{L2}_{block}(i) = \frac{V_{block}(i)}{\sqrt{\sum_{k=1}^{N} V_{block}(k)^{2}}} \qquad (8)$$

• L2-Hyst norm: nonlinear illumination variations can cause saturation during acquisition and sharp magnitude variations. The influence of high-magnitude gradients is reduced here by applying a threshold of 0.2 after L2-normalization. Finally, L1-normalization is performed.

The effects of these norms are shown in section 3.
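The four normalization schemes listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the block vector is assumed to hold non-negative HOG values, and the small constant `eps`, which only guards against division by zero, is not part of the paper.

```python
import math


def l1_norm(v, eps=1e-12):
    """Divide each component by the L1 norm of the vector."""
    s = sum(abs(x) for x in v) + eps   # eps: assumption, avoids division by zero
    return [x / s for x in v]


def l1_sqrt_norm(v, eps=1e-12):
    """Square root of the L1-normalized vector (components assumed >= 0)."""
    return [math.sqrt(x) for x in l1_norm(v, eps)]


def l2_norm(v, eps=1e-12):
    """Divide each component by the L2 norm of the vector."""
    n = math.sqrt(sum(x * x for x in v)) + eps
    return [x / n for x in v]


def l2_hyst_norm(v, clip=0.2, eps=1e-12):
    """L2-normalize, clip values at 0.2, then L1-normalize (as in the text)."""
    clipped = [min(x, clip) for x in l2_norm(v, eps)]
    return l1_norm(clipped, eps)


if __name__ == "__main__":
    v_block = [3.0, 4.0, 0.0]
    for scheme in (l1_norm, l1_sqrt_norm, l2_norm, l2_hyst_norm):
        print(scheme.__name__, scheme(v_block))
```

The clipping step in `l2_hyst_norm` is what limits the influence of very strong gradients: in the example vector, both non-zero components exceed 0.2 after L2-normalization and end up with equal weight.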

2.1.3 Likely Segments

From the database examples, the SVM constructs a hyperplane that separates positive and negative elements. The algebraic distance between an example and this hyperplane provides the classification and associates with each block a value $S^{SVM}_{block}$. An overview is shown in Figure 5. For a contour segment $seg$ with orientation $\theta \in I$, let $t_k$ be the number of pixels of this segment belonging to block $b_k$. In a window containing $N_b$ blocks, the value associated with this segment is:

$$P_{seg} = \frac{\sum_{k=1}^{N_b} t_k \, HOG_{b_k}(I) \, S^{SVM}_{b_k}}{\sum_{k=1}^{N_b} t_k} \qquad (9)$$

This value is the likelihood that the segment is part of a human silhouette.
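The segment value of equation 9 can be sketched as a weighted mean. This sketch rests on one reading of the equation, namely that each block contributes the product of its HOG value in the segment's orientation bin and its SVM output, weighted by the number of segment pixels lying in that block. The function name `segment_likelihood` and its argument layout are hypothetical.

```python
def segment_likelihood(pixel_counts, hog_values, svm_scores):
    """Pixel-count-weighted mean of HOG(I) * SVM score over the blocks.

    pixel_counts[k]: number of pixels of the segment inside block k
    hog_values[k]:   HOG value of block k for the segment's orientation bin I
    svm_scores[k]:   SVM output (signed distance to the hyperplane) of block k
    """
    total = sum(pixel_counts)
    if total == 0:
        return 0.0                     # segment touches no block (assumption)
    weighted = sum(t * h * s
                   for t, h, s in zip(pixel_counts, hog_values, svm_scores))
    return weighted / total


if __name__ == "__main__":
    # A 4-pixel segment: 3 pixels in a block voted "person", 1 in a block voted "not person"
    print(segment_likelihood([3, 1, 0], [0.5, 0.2, 0.9], [1.0, -1.0, 2.0]))
```

Blocks that contain no pixel of the segment have a zero count and therefore do not influence the result, whatever their SVM output.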

VISAPP 2010 - International Conference on Computer Vision Theory and Applications

Figure 5: An overview of our approach.

2.2 With Block Overlapping

A large number of cells per block leads to an efficient descriptor. A large number of blocks gives more spatially accurate results. However, cells that are too small are not informative enough. The solution is block overlapping: a cell can belong to several blocks. Hence, the number and size of cells can be higher than previously.

In practice, the SVM output still assigns a value $S^{SVM}_{block}$ to each block. But, since a single cell can belong to several blocks, a value $S^{SVM}_{cell}$ is associated with each cell:

$$S^{SVM}_{cell} = \underset{block \ni cell}{\mathrm{mean}} \left( S^{SVM}_{block} \right) \qquad (10)$$

For a contour segment $seg$ with orientation $\theta \in I$, let $t_k$ be the number of pixels of this segment belonging to cell $c_k$. In a window containing $N_c$ cells, the value associated with this segment is:

$$P^{rec}_{seg} = \frac{\sum_{k=1}^{N_c} t_k \, HOG_{c_k}(I) \, S^{SVM}_{c_k}}{\sum_{k=1}^{N_c} t_k} \qquad (11)$$
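The per-cell score of equation 10 is the mean of the SVM outputs of all blocks containing the cell. A minimal Python sketch follows; the function name `cell_scores` and the representation of blocks as lists of cell identifiers are assumptions made for the illustration.

```python
from collections import defaultdict


def cell_scores(blocks, block_scores):
    """Average, for every cell, the SVM outputs of the blocks that contain it.

    blocks:       list of blocks, each a list of hashable cell identifiers
    block_scores: SVM output of each block, in the same order
    """
    collected = defaultdict(list)
    for cells, score in zip(blocks, block_scores):
        for cell in cells:
            collected[cell].append(score)
    return {cell: sum(scores) / len(scores)
            for cell, scores in collected.items()}


if __name__ == "__main__":
    # Two overlapping blocks sharing cell 1
    print(cell_scores([[0, 1], [1, 2]], [1.0, 3.0]))
```

The segment value of equation 11 then has the same form as in the non-overlapping case, with cells and their averaged scores in place of blocks.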

3 RESULTS

To evaluate the ability of our method to extract information about a person's silhouette, we tested it on the INRIA Static Person Data Set, which consists of images of persons in various upright poses. Learning is performed with 200 binary silhouette images obtained from positive examples and 200 negative examples. This database is available from http://www.gipsa-lab.inpg.fr/~cyrille.migniot/recherches.html. The SVM-light algorithm (Joachims, 1999) is used as classifier.

In Figures 6 and 10, each block (or cell) is colored with a gray level proportional to its SVM output, and each segment is then colored according to its associated value. Contrary to other methods, a non-binary value is associated with each segment. Thus, the evaluation of the method requires a new benchmark, which is described in the following. For a positive detection window $pdw$, let $M^{pdw}_{pos}$ be the mean of the $P$ values associated with the segment pixels belonging to the human silhouette and $M^{pdw}_{neg}$ the mean of the $P$ values associated with the other segment pixels.

Figure 6: Effects of block overlapping: SVM outputs and likely segments with block overlapping (a) and same results without block overlapping (b).

Let $benchmark$ be a value measuring the performance of our method, computed from $N$ detection windows as follows:

$$benchmark = \sum_{pdw=1}^{N} \frac{M^{pdw}_{pos}}{M^{pdw}_{neg}} \qquad (12)$$
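The benchmark of equation 12 can be sketched as below. This is a hedged illustration based on one reading of the equation: for each positive detection window, the mean value over silhouette pixels is divided by the mean value over the other segment pixels, and the ratios are summed over the windows. The function names, the data layout and the assumption that both pixel sets are non-empty with a non-zero negative mean are not taken from the paper.

```python
def window_means(values, is_silhouette):
    """Mean P value over silhouette pixels and over the other segment pixels.

    values:        P value of each contour-segment pixel of the window
    is_silhouette: True where the pixel belongs to the ground-truth silhouette
    Assumes both groups are non-empty.
    """
    pos = [v for v, s in zip(values, is_silhouette) if s]
    neg = [v for v, s in zip(values, is_silhouette) if not s]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def benchmark(windows):
    """Sum over windows of M_pos / M_neg (assumes M_neg is non-zero)."""
    total = 0.0
    for values, is_silhouette in windows:
        m_pos, m_neg = window_means(values, is_silhouette)
        total += m_pos / m_neg
    return total


if __name__ == "__main__":
    windows = [
        ([0.8, 0.6, 0.2, 0.2], [True, True, False, False]),
        ([0.9, 0.3], [True, False]),
    ]
    print(benchmark(windows))
```

A ratio well above 1 for a window means that silhouette pixels receive clearly higher values than the remaining contour pixels, which is the behavior the method aims for.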

3.1 Overlapping Effects

Benchmark values are very similar with and without block overlapping. Nevertheless, the behaviors differ. With block overlapping, the value associated with a cell depends on several blocks. Thus the suppression of false segments (in particular on window boundaries) is less efficient (Figure 6). However, block overlapping provides a spatially finer study, so that fewer silhouette segments are missed.

3.2 Cell and Block Size Effects

Cell size affects the spatial accuracy of the SVM outputs and the relevance of the studied area. Moreover, a bigger block size leads to more efficient descriptors but spatially less accurate results.

In Figure 7, the benchmark is computed with the L2-

norm. Blocks of 4 and 16 cells make block overlap-


ping of

, while blocks of 9 cells make block over-

lapping of

. Blocks of 4 cells give the best results.

Results are stable across cell sizes; for the following tests, cells of 25 pixels are used because they give the best benchmark.

Figure 7: Cell and block size effects on the algorithm performance.

3.3 Normalization Effects

Feature vector normalization limits the effects of illumination variations. The goal of this normalization is to increase the similarity between images of the same class. The results of Figure 8 come from tests with block overlapping and blocks of 4 cells of 25 pixels. The mean of $M^{pdw}_{pos}$, the mean of $M^{pdw}_{neg}$ and the benchmark are shown.

Without normalization, variations between elements of the same class are higher. The descriptor is thus less discriminative and the benchmark is lower. The benchmark favors the L1 and L2-Hyst norms. However, for these normalizations, the mean of $M^{pdw}_{pos}$ is low. This means that, with these norms, our method is accurate but not efficient. The L2-norm may therefore be the best choice.

Figure 8: Normalization effects on the algorithm performance.

3.4 Particular Cases

Segmenting a person is harder if portions of the silhouette contours are missing (Figure 9a). Indeed, the edge detector can miss parts of the silhouette if the transition is not sharp enough. This has no influence on the computation, but perturbs segment gathering.

A typical detection and segmentation difficulty is occlusion (Figure 9b). If a part of the studied person is hidden by an object, detection is a delicate issue and segmentation suffers from contour discontinuities. However, with our local study, each area is studied independently. Thus, each segment can be locally likely, even though it is not linked to other likely segments.

If the studied image has a too cluttered background (Figure 9c), edges are numerous. HOGs are then locally perturbed and do not permit recognition.

Illumination can pose problems. In example (d) of Figure 9, only a part of the leg is lit. One edge matches the boundary between the leg and the background, while another one matches the boundary between the lit and dark areas. These two edges are close and parallel. Our method cannot recognize which one belongs to the silhouette.

Figure 9: If a silhouette part does not belong to the computed edges (right arm), segmentation is difficult (a). With occlusion, the calves are not detected but this does not perturb the processing of the feet (b). A cluttered background increases difficulty (c). Illumination variations produce two parallel edges in the leg (d).

4 CONCLUSIONS

This paper presented a method for quantifying the likelihood that a contour segment is part of a human silhouette. To do so, in a positive detection window, the HOGs previously computed for people detection are used to give a local description of the person's silhouette.


Figure 10: First column: INRIA’s database images. Second column: SVM outputs. Third column: likely segments obtained

by our method with block overlapping, L2-normalization and blocks of 4 cells of 25 pixels.


The main perspective of this work is to use this local description to perform accurate segmentation. A directed graph may be created with all the contour segments. The graph edges will be weighted with the values found by our method. A shortest-path algorithm, such as Dijkstra's algorithm, will find the most likely contour segment cycle representing the person's silhouette.
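The perspective outlined above can be illustrated with a small Python sketch that searches for the cheapest directed cycle through a chosen node using Dijkstra's algorithm. Everything in it is an assumption made for illustration: the graph layout, the function names, the choice of a start node, and the idea of using non-negative edge costs that decrease when the likelihood increases (for example one minus a normalized likelihood). The paper does not specify these details.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances and predecessors from source (non-negative costs).

    graph: dict mapping a node to a list of (neighbor, cost) pairs.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # outdated queue entry
        for v, cost in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def cheapest_cycle_through(graph, start):
    """Cheapest cycle start -> v -> ... -> start, or (inf, None) if none exists."""
    best_cost, best_cycle = float("inf"), None
    for v, first_cost in graph.get(start, []):
        dist, prev = dijkstra(graph, v)
        if start in dist and first_cost + dist[start] < best_cost:
            # Walk predecessors back from start to v, then reverse
            path = [start]
            while path[-1] != v:
                path.append(prev[path[-1]])
            best_cost = first_cost + dist[start]
            best_cycle = [start] + path[::-1]
    return best_cost, best_cycle


if __name__ == "__main__":
    # Hypothetical costs: low cost = likely silhouette link
    graph = {
        "a": [("b", 0.1), ("c", 0.9)],
        "b": [("c", 0.2)],
        "c": [("a", 0.1)],
    }
    print(cheapest_cycle_through(graph, "a"))
```

In this toy graph the cycle a, b, c, a is preferred over the shorter cycle a, c, a because its total cost is lower, which mirrors the idea of following the most likely contour segments around the silhouette.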

REFERENCES

Alonso, I., Llorca, D., Sotelo, M., Bergasa, L., Toro, P. D., Nuevo, J., Ocania, M., and Garrido, M. (2007). Combination of feature extraction methods for svm pedestrian detection. Transactions on Intelligent Transportation Systems, 8:292–307.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. Computer Vision and Pattern Recognition, 1:886–893.

Freund, Y. and Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. European Conference on Computational Learning Theory, pages 23–37.

Joachims, T. (1999). Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola, MIT-Press.

Kang, S., Byun, H., and Lee, S. (2002). Real-time pedestrian detection using support vector machines. International Journal of Pattern Recognition and Artificial Intelligence, 17:405–416.

Lin, Z., Davis, L., Doermann, D., and DeMenthon, D. (2007). Hierarchical part-template matching for human detection and segmentation. International Conference in Computer Vision, pages 1–8.

Munder, S. and Gavrila, D. (2006). An experimental study on pedestrian classification. Transactions on Pattern Analysis and Machine Intelligence, 28:1863–1868.

Munder, S., Schnorr, C., and Gavrila, D. (2008). Pedestrian detection and tracking using a mixture of view-based shape-texture models. Transactions on Intelligent Transportation Systems, 9:303–343.

Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., and Poggio, T. (1997). Pedestrian detection using wavelet templates. Computer Vision and Pattern Recognition, pages 193–199.

Ran, Y., Zheng, Q., Weiss, I., Davis, L., Abd-Almageed, W., and Zhao, L. (2005). Pedestrian classification from moving platforms using cyclic motion pattern. International Conference on Image Processing, 2:854–857.

Sharma, V. and Davis, J. (2007). Integrating appearance and motion cues for simultaneous detection and segmentation of pedestrians. International Conference on Computer Vision, pages 1–8.

Toth, D. and Aach, T. (2003). Detection and recognition of moving objects using statistical motion detection and fourier descriptors. International Conference on Image Analysis and Processing, pages 430–435.

Tuzel, O., Porikli, F., and Meer, P. (2007). Human detection via classification on riemannian manifolds. Computer Vision and Pattern Recognition, 0:1–8.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.

Wu, B. and Nevatia, R. (2007). Simultaneous object detection and segmentation by boosting local shape feature based classifier. Computer Vision and Pattern Recognition, 0:1–8.
