A ROBUST BACKGROUND SUBTRACTION ALGORITHM USING

THE A Σ − ∆ ESTIMATION

Applied to the Visual Analysis of Human Motion

Juan Carlos León, Fabio Martínez and Eduardo Romero

CimaLab, Universidad Nacional de Colombia, Bogotá, Colombia

Keywords:

Background Subtraction, Motion Analysis, Σ∆ Estimation.

Abstract:

This paper introduces a novel method for segmenting the human silhouette in video sequences, based on a local version of the classical Σ∆ filter. A main difference of our approach is that the filter is not pixel-wise oriented, but rather region-wise adjusted by using scaled estimations of both the pixel intensity and the horizontal (vertical) gradient, i.e., a multiresolution wavelet decomposition using Haar functions. The classical Σ∆ filter is independently applied to each component of the obtained feature vector, previously normalized, and a single scalar value is associated to the pixel by averaging the feature vector components. The background is estimated by setting a threshold in a histogram constructed with these integrated values, attempting to maximize the interclass variance. This strategy was evaluated on a set of 6 videos taken from the HumanEva data set. Results show that the proposed algorithm provides a better segmentation of the human silhouette than the classical Σ∆ filter, especially in the limbs, which are critical for human movement analysis.

1 INTRODUCTION

Visual analysis of human motion implies the detection, tracking and characterization of relevant patterns in a sequence of images. Usually the main features to detect are the position and alignment of the human body parts (human pose). While visual markers can be employed for this task (Kirtley, 2005), the result is usually a simplified model of the human body. Most detection methods use a background estimation as a preprocessing step, attempting to eliminate pixels with no temporal change.

Background subtraction methods use a sequence of images ($\{I_i\}_{i=1:t}$) to build a model of the static scene ($M_t$), and establish a rule to set a pixel value in $I_t$ as either background or foreground.

A main contribution of the present paper is the adaptation of the classical pixel-wise Σ∆ estimation to a local version of the filter, which is much more robust to local variations and better tracks the edges of image objects. The basic idea is to approach the pixel information with a multiresolution decomposition, conserving the edge features in the gradient estimations while the low-frequency characteristics regularize the numerical difference, i.e., a classical wavelet approximation. The obtained Haar coefficients are used independently in a classical Σ∆ estimation, averaged and used to construct a histogram in which an optimal threshold maximizes the interclass variance. This paper is organized as follows: section 2 introduces the Σ∆ operator and the proposed extension, section 3 demonstrates the effectiveness of the method, and section 4 concludes with a discussion of the proposed method and possible future work.

2 MATERIALS AND METHODS

Among the background subtraction techniques, the Σ∆ operator represents a family of methods well known for their computational efficiency and their capability to work without any prior knowledge of the scene, even under uncontrolled illumination conditions.

While this operator offers a baseline for background subtraction in human movement analysis, it is still limited in its accuracy and robustness to noise. As observed in figure 1, relevant parts of the human figure, such as the shins and lower arms, are missing. Noise is present in the image, especially while the model converges to a good approximation of the background.

These limitations can be attributed to the selected pixel descriptors in a single frame, i.e., the regular Σ−∆ uses a single pixel intensity. This may be better approached by introducing local information. The present investigation proposes an extension of the Σ−∆ background subtraction algorithm, focusing on region features rather than on pixel intensity.

Figure 1: Output of the basic Σ−∆ algorithm for a sequence of the HumanEva dataset.

Carlos León J., Martínez F. and Romero E. A ROBUST BACKGROUND SUBTRACTION ALGORITHM USING THE A Σ-∆ ESTIMATION - Applied to the Visual Analysis of Human Motion. DOI: 10.5220/0003868204840489. In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 484-489. ISBN: 978-989-8565-03-7. © 2012 SCITEPRESS (Science and Technology Publications, Lda.)

2.1 The Σ − ∆ Operator

The non-linear Σ∆ operator increases the correlation between adjacent frames by oversampling a signal at rates higher than the one specified by the Nyquist theorem. This operator dynamically updates a background model $M_t(x)$ by comparing each image $I_t(x)$ with the current background model $M_{t-1}(x)$, using a simple updating rule: if $I_t(x)$ is greater (lower) than $M_{t-1}(x)$, then a positive increase (decrease) ∆ is performed. The absolute difference $|I_t(x) - M_t(x)|$ is used to compute an estimate of the per-pixel variance $V_t(x)$; based on this estimate, pixels are classified as either foreground or background (Manzanera and Richefeu, 2007). A detailed description can be found in Algorithm 1.
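The update and classification rules of Algorithm 1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation evaluated in this paper: frames are assumed to be flattened lists of integer intensities, the function names `sign` and `sigma_delta` and the default amplification factor `n=2` are hypothetical, and pixels with a zero difference are simply left as background.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sigma_delta(frames, n=2):
    """Basic sigma-delta background subtraction (sketch of Algorithm 1).

    frames: list of frames, each a flat list of integer intensities.
    n:      amplification factor N applied to the difference (assumed value).
    Returns one binary mask per frame after the first (1 = foreground).
    """
    model = list(frames[0])        # M_0(x) = I_0(x)
    variance = [0] * len(model)    # V_0(x) = Delta_0(x) = 0
    masks = []
    for frame in frames[1:]:
        mask = []
        for x, intensity in enumerate(frame):
            # Move the background model one step towards the current frame.
            model[x] += sign(intensity - model[x])
            delta = abs(model[x] - intensity)
            label = 0  # pixels with delta == 0 stay background
            if delta != 0:
                # Move the variance estimate one step towards N * delta.
                variance[x] += sign(n * delta - variance[x])
                label = 0 if delta < variance[x] else 1
            mask.append(label)
        masks.append(mask)
    return masks


if __name__ == "__main__":
    # Two pixels: the first is static, the second changes abruptly.
    frames = [[10, 10], [10, 10], [10, 100], [10, 100]]
    print(sigma_delta(frames))
```

Because both the model and the variance move by at most one unit per frame, the estimate adapts slowly, which is consistent with the convergence noise discussed above.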

Algorithm 1: Basic Σ−∆ Algorithm.

    Initialization: $M_0(x) = I_0(x)$
    for each frame t do
        $M_t(x) = M_{t-1}(x) + \mathrm{sgn}(I_t(x) - M_{t-1}(x))$
        $\Delta_t(x) = |M_t(x) - I_t(x)|$
    end for
    Initialize: $V_0(x) = \Delta_0(x)$
    for each frame t do
        for each pixel x such that $\Delta_t(x) \neq 0$ do
            $V_t(x) = V_{t-1}(x) + \mathrm{sgn}(N \times \Delta_t(x) - V_{t-1}(x))$
            if $\Delta_t(x) < V_t(x)$ then
                $D_t(x) = 0$
            else
                $D_t(x) = 1$
            end if
        end for
    end for

2.2 Region Features

As stated above, a main limitation of the Σ∆ background subtraction is that it operates exclusively on the intensity values of a pixel through an image sequence $I_t$, thereby restricting the accuracy and robustness of the background estimation process. We approached this problem by projecting each frame $I_t$ into a multiresolution space. Unlike a classical multiresolution decomposition, the different image scales are not obtained by a simple downsampling of the original image, but rather by a local pixel neighbourhood smoothing upon which a block Haar wavelet analysis is carried out. In our scheme we use a set of features calculated from the Speeded Up Local Descriptor (SULD), proposed by Zhao et al. (2009) and used from now on as a pixel descriptor. The low frequency is computed as the average of a neighbourhood centred at the pixel, while the high frequency is calculated by first averaging spatially shifted versions of that neighbourhood and then storing the differences between the up-down (left-right) shifted neighbourhoods. The calculated values are closely related to the gradients, and therefore to the edges along the x and y axes, as seen in figure 2. Each image pixel is associated with a feature vector of three components, each containing an average of the pixel descriptors at the different scales, i.e., the neighbourhood sizes. This makes it possible to systematically remove finer details or high-frequency information from an image, achieving a compact description of the most relevant information, which is usually preserved through multiple scales. Therefore, the first step of our approach was to build, for each pixel, a multidimensional feature vector containing the local first-order information.

These features are calculated for each of the n

channels of the image and used as input of the Σ − ∆

algorithm, after normalization, yielding a 3n dimen-

sional descriptor for each pixel.
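The three-component descriptor described above can be illustrated with a small Python sketch for a single-channel image. The exact box layout of the SULD features is not detailed in this section, so several choices here are assumptions: the helper names `box_mean` and `pixel_features` are hypothetical, boxes are clipped at the image border, and each shifted box is displaced by half the box size (at least one pixel). The normalization step is omitted.

```python
def box_mean(img, r, c, half):
    """Mean intensity of the square box of side 2*half+1 centred at (r, c).

    The centre is clamped to the image and the box is clipped at the border.
    """
    rows, cols = len(img), len(img[0])
    r = min(max(r, 0), rows - 1)
    c = min(max(c, 0), cols - 1)
    r0, r1 = max(r - half, 0), min(r + half, rows - 1)
    c0, c1 = max(c - half, 0), min(c + half, cols - 1)
    total = sum(img[i][j] for i in range(r0, r1 + 1) for j in range(c0, c1 + 1))
    return total / ((r1 - r0 + 1) * (c1 - c0 + 1))


def pixel_features(img, r, c, scales=(1, 3, 5)):
    """Return (low, dx, dy) for pixel (r, c), averaged over the box sizes.

    low: mean of the centred neighbourhood (low frequency).
    dx:  right-shifted mean minus left-shifted mean (horizontal gradient).
    dy:  down-shifted mean minus up-shifted mean (vertical gradient).
    """
    low = dx = dy = 0.0
    for size in scales:
        half = size // 2
        shift = max(half, 1)  # assumed displacement of the shifted boxes
        low += box_mean(img, r, c, half)
        dx += box_mean(img, r, c + shift, half) - box_mean(img, r, c - shift, half)
        dy += box_mean(img, r + shift, c, half) - box_mean(img, r - shift, c, half)
    k = len(scales)
    return (low / k, dx / k, dy / k)


if __name__ == "__main__":
    # Vertical edge: left half dark, right half bright.
    edge = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
    print(pixel_features(edge, 3, 2, scales=(1,)))
```

On a flat region both gradient components vanish, so only the low-frequency component carries information there.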

2.2.1 Efficient Feature Calculation

Two of the features are calculated as the difference of the sum of pixel intensities within two shifted boxes, either vertically or horizontally. This can be efficiently computed using the summed area table known as the integral image (Viola and Jones, 2001), in which an image (ii) replaces a pixel value (i) with the sum of the intensities of every pixel located above and to the left of it, the pixel itself included. Formally:

$$ii(x, y) = \sum_{x' \le x} \sum_{y' \le y} i(x', y') \qquad (1)$$

The use of the integral image optimizes the calculation of the region intensity sum as:

$$\sum_{j < x' \le k} \sum_{m < y' \le n} i(x', y') = ii(j, m) + ii(k, n) - ii(k, m) - ii(j, n) \qquad (2)$$

Figure 2: Descriptors, from left to right: original image, vertical and horizontal filter response maps, sum of values in region.
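Equations 1 and 2 can be checked with a short Python sketch. The function names `integral_image` and `box_sum` are hypothetical, images are assumed to be lists of rows indexed as `img[y][x]`, and an index of -1 is treated as lying outside the image so that boxes touching the border are handled uniformly.

```python
def integral_image(img):
    """Summed area table: ii[y][x] is the sum of img[y'][x'] for x' <= x, y' <= y (Eq. 1)."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        row_sum = 0
        for x in range(cols):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def box_sum(ii, j, k, m, n):
    """Sum of the pixels with j < x' <= k and m < y' <= n using four lookups (Eq. 2)."""
    def at(x, y):
        # Entries outside the image (index -1) contribute zero.
        return ii[y][x] if x >= 0 and y >= 0 else 0
    return at(j, m) + at(k, n) - at(k, m) - at(j, n)


if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    ii = integral_image(img)
    print(box_sum(ii, 0, 2, 0, 2))    # lower-right 2x2 block: 5 + 6 + 8 + 9
    print(box_sum(ii, -1, 2, -1, 2))  # whole image
```

Once the table is built, every box sum costs four lookups regardless of the box size, which is why larger scales do not increase the cost of each feature.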

2.3 Foreground Classification Criteria

The basic Σ−∆ algorithm uses a simple classification criterion: the last pixel intensity variation ($\Delta_t(x)$) is compared with an estimation of the accumulated variance ($V_t(x)$). If $\Delta_t(x)$ is not smaller than $V_t(x)$, the pixel is marked as foreground; otherwise it is considered background (see Algorithm 1). This criterion does not fit our multidimensional representation: while it may be applied to each feature, another metric must be built to produce a final decision from the obtained set of per-feature decisions. To overcome this limitation, we propose a multidimensional metric that associates the feature vector to a single scalar value, obtained by integrating over every feature component and shifting from the [−1, 1] to the [0, 2] interval. Each image pixel is then assigned a particular value ($P_t$), an estimate of the regional changes: the higher (lower) a $P_t$ value is, the more (less) likely the corresponding pixel in $I_t$ is a foreground pixel. A change is then defined if the history of regional changes is smaller than the change reported by the current local analysis. To achieve this, we exploit the shape of the histogram of $P_t$, where background pixels are near 0 and their number is significantly larger than that of the foreground pixels. Hence we build two classes, one with a small (large) number of bins which contains most (few) scene pixels: the background (foreground). We are interested in a value that maximizes the intra-class variance by comparing the variances of the two previously defined classes. To do so, suppose that we have k different bins. Starting from an initial bin, a class is composed of a set of bins that is progressively enlarged by including new bins. The algorithm includes new bins in each class by running forward (backward) over the histogram, starting from 0 and k for the background and foreground classes, respectively. The goal is to stop when the variances of the two classes are alike and their magnitude is maximum. We then search for a bin (γ) where the consecutive per-group variances are close and large in magnitude for both classes, as follows:

for a histogram with k bins let

$$\alpha_i = \mathrm{var}(bin_0, \ldots, bin_{i-1}) - \mathrm{var}(bin_0, \ldots, bin_i) \qquad (3)$$

be the consecutive variance of a background estimation composed of bins 0 to i; likewise let

$$\beta_i = \mathrm{var}(bin_k, \ldots, bin_{i+1}) - \mathrm{var}(bin_k, \ldots, bin_i) \qquad (4)$$

be the difference of variances for the foreground group up to bin i. A set of candidate bins Γ is established with

$$i \in \Gamma \iff \frac{\alpha_i}{\beta_i} \approx -1 \qquad (5)$$

Among all the candidates in Γ we choose γ as the one with the largest magnitude in the variance differences, i.e.,

$$\gamma = \arg\max_{i \in \Gamma} |\alpha_i| \qquad (6)$$
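As a point of comparison, the goal stated in the abstract (a histogram threshold that maximizes the interclass variance) is the one pursued by the standard Otsu criterion. The sketch below implements that standard criterion, not the candidate-bin search of equations 3 to 6; the function name `otsu_threshold` and the example histogram are hypothetical and serve only to show how a two-class split of a histogram dominated by background bins near 0 can be obtained.

```python
def otsu_threshold(hist):
    """Return the bin index t that maximizes the between-class variance.

    Bins 0..t form the background class and bins t+1.. the foreground class.
    hist: list of non-negative bin counts.
    """
    total = sum(hist)
    weighted_total = sum(i * h for i, h in enumerate(hist))
    best_t, best_between = 0, -1.0
    w_bg = 0      # number of samples in the background class
    sum_bg = 0    # sum of bin indices weighted by counts in the background class
    for t, h in enumerate(hist[:-1]):
        w_bg += h
        sum_bg += t * h
        w_fg = total - w_bg
        if w_bg == 0 or w_fg == 0:
            continue  # one class is empty, no valid split here
        mean_bg = sum_bg / w_bg
        mean_fg = (weighted_total - sum_bg) / w_fg
        # Between-class variance up to a constant factor 1 / total**2.
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_between:
            best_t, best_between = t, between
    return best_t


if __name__ == "__main__":
    # Most pixels fall in the first bins (background), a few in the last ones.
    hist = [50, 40, 5, 0, 0, 0, 3, 2]
    print(otsu_threshold(hist))
```

With ties, the first maximizing bin is kept, so empty bins between the two modes do not move the threshold.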

2.4 Dataset Description

Validation was carried out with a subset of the HumanEva dataset (Sigal et al., 2010), composed of 3 different subjects, each captured from 2 different cameras, for a total of 6 sequences. Each sequence was manually labeled. As frame n has almost the same foreground and background as frame n ± 1, labeling was done only once every 10 frames. Additionally, the labeling only started at the 40th frame, which allows for an initial estimation of the background (stabilization) by both algorithms.

3 EVALUATION AND RESULTS

There are well-known metrics to evaluate the performance of a binary classification; however, most of these metrics assume that the classes contain approximately balanced numbers of elements. In this dataset the foreground usually amounts to less than 10% of the pixels in the image, hence we chose the true positive rate (TPR) and the Matthews Correlation Coefficient (MCC): the former is independent of the class distribution, while the latter is designed to measure the quality of the classification even with unbalanced classes.
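Both metrics follow directly from the confusion counts of a predicted mask against its manual labeling. The sketch below uses the standard definitions, TPR = TP / (TP + FN) and MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)); the function names and the toy masks are hypothetical, and returning 0 when a denominator vanishes is a convention chosen here.

```python
import math


def confusion_counts(predicted, truth):
    """Count (tp, fp, tn, fn) for two equally long binary masks (1 = foreground)."""
    tp = fp = tn = fn = 0
    for p, t in zip(predicted, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def true_positive_rate(tp, fn):
    """Fraction of true foreground pixels that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews Correlation Coefficient, in [-1, 1]; 0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
    truth     = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    tp, fp, tn, fn = confusion_counts(predicted, truth)
    print(true_positive_rate(tp, fn), matthews_corrcoef(tp, fp, tn, fn))
```

Note that the TPR ignores false positives altogether, whereas the MCC penalizes them, which is why the two metrics can rank the same pair of algorithms differently.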

During the evaluation of the algorithm it was clear that scales (box sizes) larger than 11 were not appropriate for the segmentation of relatively small moving objects (like the hands and forearms); in addition, the body boundaries are not properly located at those scales. Therefore we first searched for a combination of scales between 1 and 11 that provides the best results; for this particular dataset the selected scales were 1, 3 and 5. The results are summarized in tables 1 and 2.

Table 1: True Positive Rate.

Sequence    Regular Σ−∆ TPR    Proposed Σ−∆ TPR
1           37.94%             67.46%
2           55.73%             68.95%
3           31.61%             69.23%
4           57.63%             73.25%
5           48.62%             71.94%
6           65.86%             78.05%

Table 2: Matthews Correlation Coefficient.

Sequence    Regular Σ−∆ MCC    Proposed Σ−∆ MCC
1           0.557              0.683
2           0.707              0.678
3           0.524              0.721
4           0.725              0.731
5           0.656              0.713
6           0.767              0.774

The TPR of the proposed method outperforms that of the regular Σ∆ in every test sequence. This can be attributed to the better detection of the limbs in motion, especially the shins and forearms (see figure 6).

An interesting feature of the proposed algorithm can be analysed with table 1: our method offers a large improvement for sequences 1, 3 and 5 (30.06% on average), whereas the improvement for sequences 2, 4 and 6 is smaller (13.71% on average). This is related to the background of the sequences: in the first group the background has several objects of different colors, i.e., it contains borders, while the background of the latter group has a single color and is nearly flat (see figure 3). The absence of borders lowers the effectiveness of the proposed algorithm, as the input information for the Σ∆ comes mainly from the intensities of neighbouring pixels.

Figure 3: Sequences 2,4,6 on the left side, sequences 1, 3, 5

on the right side.

While the TPR shows a significant improvement of our algorithm over the regular Σ∆, the MCC shows cases where there is no significant improvement over the base algorithm. This can be attributed to the nature of the dataset, where the moving object (the human body) is present and in motion from the first frames. This generates ghosts in every scene for both algorithms; these ghosts last longer in our algorithm, increasing the number of false positives in the first frames and drawing down the average MCC for the sequence. This can be seen in figures 4 and 5: while in the first frames the MCC of the proposed algorithm is below that of the regular Σ∆, in the following frames (when the ghost starts to fade) the MCC of our algorithm is better, even in the second sequence, where our algorithm had an average MCC below that of the regular Σ∆ (fig. 4).

Again, the nature of the background seems to influence how long the ghosts last: the ghosts in scenes 1, 3 and 5 are shorter-lived than those in scenes 2, 4 and 6.

3.1 Performance

As stated in section 2, one of the main features of the Σ∆ background subtraction is its computational efficiency; therefore we briefly analyze the performance penalty of the multiscale features and the new classification criterion.

A GNU Octave implementation of both algorithms was tested on a Core i7 processor at 3.3 GHz; on this setup the regular Σ∆ processes 6.72 million pixels per second on average. The speed of the proposed extension depends on the number of scales used for the analysis. We calculated the average speed of the proposed algorithm for a number of scales between 1 and 9; the results are summarized in table 3.

Table 3: Proposed algorithm processed pixels per second.

Number of scales    Average pixels per second (millions)
1                   1.61
2                   1.31
3                   1.06
4                   0.89
5                   0.77

Figure 4: Comparison of the per frame MCC for sequence

Figure 5: Comparison of the per frame MCC for sequence

Figure 6: Results of the segmentation, from left to right: original image, regular Σ∆ segmentation, proposed algorithm segmentation.

Although the base Σ∆ is faster, when the proposed algorithm is compared with other variations of the Σ∆ operator for background subtraction proposed in the literature (Manzanera, 2007; Lacassagne et al., 2009; Richefeu and Manzanera, 2006), it shows an average performance (see figure 7).

Figure 7: Speed of other Σ∆ algorithms (million pixels per

second).

4 CONCLUSIONS

A novel method for segmenting the human silhouette in video sequences, based on the Σ∆ background subtraction, was introduced in this paper. This method offers a significant improvement in the background segmentation over the base Σ∆, at the expense of computational cost.

The proposed algorithm enhances the pixel description with local features, allowing a multiscale representation of each frame, which results in an improved detection of the human body silhouette. The method improves the quality of the segmentation, especially at the arms and lower limbs, which is critical for tasks that require a proper description of the dynamics of the human body, such as gait analysis and video surveillance.


REFERENCES

Kirtley, C. (2005). Clinical Gait Analysis: Theory and Practice. Churchill Livingstone.

Lacassagne, L., Manzanera, A., and Dupret, A. (2009). Motion detection: Fast and robust algorithms for embedded systems. In IEEE International Conference on Image Processing.

Manzanera, A. (2007). Sigma-delta background subtraction and the Zipf law. In Progress in Pattern Recognition, Image Analysis and Applications.

Manzanera, A. and Richefeu, J. (2007). A new motion detection algorithm based on Σ-∆ background estimation. Pattern Recognition Letters, 28(3):320–328.

Richefeu, J. and Manzanera, A. (2006). A new hybrid differential filter for motion detection. In Wojciechowski, K., Smolka, B., Palus, H., Kozera, R., Skarbek, W., and Noakes, L., editors, Computer Vision and Graphics, volume 32 of Computational Imaging and Vision, pages 727–732. Springer Netherlands.

Sigal, L., Balan, A. O., and Black, M. J. (2010). HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. International Journal of Computer Vision (IJCV), 87.

Viola, P. and Jones, M. (2001). Robust real-time object detection. In International Journal of Computer Vision.

Zhao, G., Chen, L., and Chen, G. (2009). A speeded-up local descriptor for dense stereo matching. In Image Processing (ICIP), 2009 16th IEEE International Conference on, pages 2101–2104.
