Compressed Domain Moving Object Detection based on H.264/AVC Macroblock Types

Marcus Laumer (1,2), Peter Amon (2), Andreas Hutter (2) and André Kaup (1)

(1) Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, Erlangen, Germany
(2) Imaging and Computer Vision, Siemens Corporate Technology, Munich, Germany
Keywords:
H.264/AVC, Compressed Domain, Object Detection, Macroblock Type.
Abstract:
This paper introduces a low complexity frame-based object detection algorithm for H.264/AVC video streams.
The method solely parses and evaluates H.264/AVC macroblock types extracted from the video stream, which
requires only partial decoding. Different macroblock types indicate different properties of the video content.
This fact is used to segment a scene into fore- and background or, more precisely, to detect moving objects within the scene. The main advantage of this algorithm is its suitability for massively parallel processing: it is very fast and can be combined with several other pre- and post-processing algorithms without decreasing their performance. The algorithm is able to process about 3600 frames per second for video streams in CIF resolution, measured on an Intel® Core™ i5-2520M CPU @ 2.5 GHz with 4 GB RAM.
1 INTRODUCTION
Moving object detection is probably one of the most
widely used video analysis procedures in many dif-
ferent applications. Video surveillance systems need
to detect moving persons or vehicles, trackers have
to be initialized with the objects they should track,
and recognition algorithms require the regions within
the scene where they should identify objects. For this
reason, several proposals for efficient object detec-
tion have been published. Most of them operate in
the pixel domain, i.e., on the actual pixel data of each
frame. This usually leads to a very high accuracy, but
at the expense of computational complexity.
As most video data is stored or transferred in com-
pressed representation, the bit stream has to be com-
pletely decoded beforehand in such scenarios. There-
fore, attempts have been made to eliminate the costly
step of decoding and to perform the analysis directly
in the compressed domain.
Detection algorithms can therefore be divided into
two categories: pixel domain detection and compressed
domain detection. The pixel domain is well defined:
the entire video content is decoded and all video
frames are available in pixel representation. The term
compressed domain, on the other hand, does not
clearly express which part of the video content has to
be decoded and which part may remain compressed.
Several compressed domain detection methods that
achieve good results by analyzing different entropy
decoded syntax elements have been presented.
The remainder of this paper is organized as fol-
lows. Section 2 introduces some state-of-the-art com-
pressed domain detection algorithms. Section 3 pro-
vides a brief overview of the H.264/AVC syntax ele-
ments that are relevant for our algorithm and their ex-
traction. It also describes how macroblock types are
grouped to categories and defines which categories in-
dicate moving regions. Section 4 describes the actual
algorithm and the segmentation process in detail. Af-
ter that, some experimental results are given in Sec-
tion 5. Section 6 concludes this paper with a summary
and an outlook.
2 RELATED WORK
Established moving object detection methods in hy-
brid video codecs are based on solely extracting and
analyzing motion vectors. For instance, (Szczerba
et al., 2009) showed an algorithm to detect objects
in video surveillance applications using H.264/AVC
video streams. Their algorithm assigns a motion vec-
tor to each 4x4 pixel block of the examined frame.
To this end, macroblocks with partitions larger than 4x4
are divided and their motion vector is assigned to
the smaller blocks. Since intra-coded macroblocks
have no corresponding motion vector, the algorithm
interpolates a vector from previous and consecutive
frames. This results in a dense motion vector field.
This dense motion vector field is further analyzed to
estimate vectors that represent real motion by calcu-
lating spatial and temporal confidences as introduced
by (Wang et al., 2000).
Other object detection methods do not solely
analyze motion vectors but also exploit additional
compressed information, like macroblock partition
modes, e.g., (Fei and Zhu, 2010) and (Qiya and
Zhicheng, 2009), or transform coefficients, e.g., (Mak
and Cham, 2009) and (Porikli et al., 2010).
(Fei and Zhu, 2010), for instance, presented a
study on mean shift clustering based moving object
segmentation for H.264/AVC video streams. In a first
step, their method refines the extracted raw motion
vector field by normalization, median filtering, and
global motion compensation, whereby already at this
stage the algorithm uses macroblock partition modes
to enhance the filtering process. The resulting dense
motion vector field and the macroblock modes then
serve as input for a mean shift clustering based object
segmentation process, adopted from pixel domain ap-
proaches, e.g., introduced by (Comaniciu and Meer,
2002).
(Mak and Cham, 2009) on the other hand analyze
motion vectors in combination with transform coeffi-
cients to segment H.264/AVC video streams into fore-
and background. Quite similar to the techniques de-
scribed before, their algorithm initially extracts and
refines the motion vector field by normalization, fil-
tering, and background motion estimation. After that,
the foreground field is modeled as a Markov random
field. Thereby, the transform coefficients are used as
an indicator for the texture of the video content. The
resulting field indicates fore- and background regions,
which are further refined by assigning labels for dis-
tinguished objects.
(Poppe et al., 2009) introduced an algorithm for
moving object detection in the H.264/AVC com-
pressed domain that evaluates the size of macroblocks
(in bits) within video streams. Thereby, the size of
a macroblock includes all corresponding syntax ele-
ments and the encoded transform coefficients. The
first step of their algorithm is to find the maximum
size of background macroblocks, which is performed
in an initial training phase. During the subsequent
analysis, each macroblock that exceeds this size is re-
garded as foreground, as an intermediate step.
Macroblocks of smaller size are divided into those
in Skip mode and all others. Labeling of macroblocks
in Skip mode depends on the labels of their direct
neighbors, while all other macroblocks are directly
labeled as background. Subsequent steps are spa-
tial and temporal filtering. These two steps are per-
formed to refine the segmentation. During spatial fil-
tering background macroblocks will be changed to
foreground, if most of their neighbors are foreground
as well. Foreground macroblocks will be changed to
background during temporal filtering, if they are nei-
ther foreground in the previous frame nor in the next
frame. The last refinement step is to evaluate bound-
ary macroblocks on a sub-macroblock level of size 4
by 4 pixels.
Extracting motion vectors and transform coeffi-
cients from a compressed video stream requires more
decoding steps than just extracting information about
macroblock types and partitions. Hence, attempts
have been made to directly analyze these syntax el-
ements.
(Verstockt et al., 2009) proposed an algorithm
for detecting moving objects by just extracting mac-
roblock partition information from H.264/AVC video
streams. First, they perform a foreground segmen-
tation by assigning macroblocks to foreground and
background, which results in a binary mask for the ex-
amined frame. Thereby, macroblocks in 16x16 parti-
tion mode (i.e., no sub-partitioning of the macroblock,
including the skip mode) are regarded as background
and all other macroblocks are labeled foreground. To
further enhance the generated mask, their algorithm
then performs temporal differencing of several masks
and median filtering of the results. In a final step,
objects are extracted by blob merging and convex
hull fitting techniques. (Verstockt et al., 2009) de-
signed their algorithm for multi-view object localiza-
tion. Hence, the extracted objects of a single view
then serve as input for the multi-view object detection
step.
A more basic detection method than moving ob-
ject detection is to detect global content changes
within scenes. (Laumer et al., 2011) designed a
change detection algorithm for RTP streams that does
not require video decoding at all. They presented the
method as a preselection for further analysis modules,
since change detection can be seen as a preliminary
stage of, e.g., moving object detection. Each moving
object causes a global change within the scene. Their
algorithm evaluates RTP packet sizes and number of
packets per frame. Since no decoding of video data is
performed, the method is codec-independent and very
efficient.
The algorithm we present in this paper solely ex-
tracts and evaluates macroblock types to detect mov-
ing objects in H.264/AVC video streams. It can either
be performed as stand-alone application or be based
on the results of the change detection algorithm pre-
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
220
sented by (Laumer et al., 2011). Once a global change
within the scene is detected, the object detection al-
gorithm can be started to identify the cause of this
change.
3 MACROBLOCK TYPE CATEGORIES AND SYNTAX ELEMENTS
3.1 Categories and Weights
The H.264/AVC video compression standard was
jointly developed by the ITU-T VCEG (VCEG, 2011)
and the ISO/IEC MPEG (MPEG, 2010). It belongs to
the class of block-based hybrid video coders. In a
first step, each frame of a video sequence is divided
into several so-called slices, and each slice is further
divided into so-called macroblocks, which have a
size of 16 by 16 pixels. In a second step,
the encoder decides, according to a rate-distortion
optimization (RDO), how each macroblock will be
encoded. Thereby, several different macroblock types
of three classes are available. The first class is used if
the macroblock should be intra-frame predicted from
its previously encoded neighbors. The second and
third classes are used in an inter-frame prediction
mode, which allows the encoder to exploit similarities
between frames. Macroblocks of the second class are
predicted using just one predictor, whereas macroblocks
of the third class are predicted using two different
predictors. They are called I, P, and B macroblocks,
respectively.
The same classification is defined for slices. In the
scope of this work, the H.264/AVC Baseline profile
is assumed. Within this profile, only I and P slices
are allowed. The 32 macroblock types available for
these two slice classes are grouped into six self-defined
macroblock type categories (MTC):
MB_I_4x4. Intra-frame predicted macroblocks that are further divided into smaller blocks of size 4 by 4 pixels.

MB_I_16x16. Intra-frame predicted macroblocks that are not further divided.

MB_P_8x8. Inter-frame predicted macroblocks that are further divided into smaller blocks of size 8 by 8 pixels.

MB_P_RECT. Inter-frame predicted macroblocks that are further divided into smaller blocks of rectangular (not square) shape (16x8 or 8x16).

MB_P_16x16. Inter-frame predicted macroblocks that are not further divided.

MB_P_SKIP. No additional data is transmitted for these macroblocks. Instead, the motion vector predictor that points to the first reference frame is used directly.

Table 1: Macroblock type weights (MTW) of macroblock type categories (MTC).

Slice Type  MTC                    Assumption              MTW
I           MB_I_4x4, MB_I_16x16   n/a                     n/a
P           MB_I_4x4               most likely motion      3
P           MB_I_16x16             most likely motion      3
P           MB_P_8x8               likely motion           2
P           MB_P_RECT              likely motion           2
P           MB_P_16x16             maybe motion            1
P           MB_P_SKIP              most likely no motion   0
The RDO's decision on which macroblock type is used
for encoding a block heavily depends on the actual
pixel data of this block and its difference to previous
frames. Therefore, evaluating macroblock types can
give a good guess of the location of moving objects
within the scene. In order to determine which
macroblock types indicate moving objects, an initial
macroblock type weight (MTW) first has to be defined
for each category MTC; these weights are shown in Table 1.
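For illustration, the mapping from categories to initial weights in Table 1 can be written down directly. The following C++ sketch is not taken from the implementation described later in this paper; the enum and function names are our own, hypothetical choices.

// C++ sketch of the category-to-weight mapping of Table 1 (illustrative).
enum class Mtc { I4x4, I16x16, P8x8, PRect, P16x16, PSkip };

// Initial macroblock type weight (MTW) per category in P slices.
int mtwForCategory(Mtc c) {
    switch (c) {
        case Mtc::I4x4:
        case Mtc::I16x16: return 3;  // most likely motion
        case Mtc::P8x8:
        case Mtc::PRect:  return 2;  // likely motion
        case Mtc::P16x16: return 1;  // maybe motion
        case Mtc::PSkip:  return 0;  // most likely no motion
    }
    return 0;  // unreachable; silences compiler warnings
}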
In I slices, only intra-coded macroblocks are allowed.
In this case, only two categories MTC are available
and no information about moving objects can be
derived. Different solutions to this problem are
imaginable. One approach is to inter- or extrapolate
from neighboring slices if the current frame consists
of several slices. If the encoder configuration allows
just one slice per frame, the resulting fore- and
background segmentation mask of the previous frame
could also be used for the subsequent I frame. To
further enhance this result, the mask could be
interpolated by also considering the mask of the
subsequent P frame, if the system configuration permits.
Intra-coded macroblocks are also available in P
slices. Within a P slice it is assumed that the two cat-
egories MB_I_4x4 and MB_I_16x16 indicate blocks
with high motion, because usually the encoder de-
cides to choose an I macroblock type if similar video
content could not be found in previous frames. There-
fore, it is most likely that an object has moved or just
entered the scene within this region.
Macroblock types of the two categories
MB_P_8x8 and MB_P_RECT will usually be selected
by the encoder if blocks that are smaller than
16 by 16 pixels can be encoded more efficiently than
the entire macroblock. That usually means that these
regions are very structured and/or have been slightly
changed compared to previous frames. Hence, it is
assumed that a moving object is likely present here.

Figure 1: Sample hierarchical structure of block-based video coding: a coded picture (frame) consists of slices, a slice of 16x16 macroblocks, and a macroblock of smaller blocks down to a size of 4 by 4 pixels.
Macroblocks that are not further divided (i.e., of
category MB_P_16x16) indicate high uncertainty con-
cerning moving objects. On the one hand it is con-
ceivable that slowly moving objects with constant di-
rections are present in these regions, but on the other
hand the corresponding motion vector could be quite
short and this type has been selected because of a
slightly noisy source. Therefore, the assumption here
is that there is maybe motion.
The last category MB_P_SKIP is selected by the
encoder if the predicted motion vector points to an
area within the previous frame that is quite similar to
the current macroblock. That means that it is most
likely that there is no motion since there is nearly no
difference between the current and the previous frame
within this region.
Since objects usually extend over several mac-
roblocks, the moving object certainty (MOC) of a
macroblock highly depends on its neighboring mac-
roblocks. This is further described in Section 4.
3.2 Syntax Extraction
To be able to assign macroblocks to the previously
defined categories MTC, the macroblock types have
to be extracted from the bit stream. As already men-
tioned, H.264/AVC is a block-based video compres-
sion standard and has a hierarchical structure consist-
ing of five levels. Figure 1 illustrates this hierarchy.
The highest hierarchical level is a coded picture.
Since the Baseline profile of H.264/AVC does not
support interlaced coding, a coded picture within this
profile is always an entire frame. On the next level a
frame consists of at least one slice. If flexible mac-
roblock ordering (FMO) is not used, which is as-
sumed since FMO is rarely used in practice, a slice
consists of several consecutive macroblocks on the
third level. Each macroblock can be further divided
into smaller blocks, where the smallest available block
has a size of 4 by 4 pixels.
H.264/AVC defines a huge number of syntax
elements. The most important ones for the presented
algorithm are discussed in the following. The
nal_unit_type in the network abstraction layer
(NAL) unit header indicates whether the contained
coded slice belongs to an instantaneous decoding
refresh (IDR) or a non-IDR frame. IDR frames can
only consist of I slices, while non-IDR frames are
composed of slices of any type. The actual type of
each slice is then encoded within its header by the
syntax element slice_type. The beginning of the
slices within the current frame is encoded by the
element first_mb_in_slice, which can also be
extracted from the slice headers. On macroblock
level, two elements are extracted. As already
mentioned, no further information is transmitted if a
macroblock is encoded with the P_SKIP type. In this
case, the bit stream contains an element called
mb_skip_run that indicates the number of consecutive
macroblocks in skip mode. For all macroblocks in
non-skip mode the algorithm extracts the available
syntax element mb_type.
As soon as all these syntax elements are extracted
and parsed accordingly, the algorithm starts to evalu-
ate them, as described in the following section.
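To illustrate this extraction step, the following C++ sketch outlines the per-slice flow for a P slice. It reuses the Mtc categories from the earlier sketch; BitReader with its Exp-Golomb helper ue() is a hypothetical stand-in for the entropy decoder, real CAVLC parsing involves considerably more state than shown, and the mb_type mapping is simplified (it ignores special cases such as I_PCM).

#include <vector>

struct BitReader { unsigned ue(); /* Exp-Golomb decoding, stand-in */ };
void skipRemainingMbSyntax(BitReader& r);  // residuals etc. are not needed

// Simplified mapping from a P-slice mb_type value to a category MTC.
Mtc categorize(unsigned mbType) {
    switch (mbType) {
        case 0:  return Mtc::P16x16;  // no sub-partitioning
        case 1:
        case 2:  return Mtc::PRect;   // 16x8 / 8x16 partitions
        case 3:
        case 4:  return Mtc::P8x8;    // 8x8 partitions
        case 5:  return Mtc::I4x4;    // intra types follow the inter types
        default: return Mtc::I16x16;
    }
}

void parsePSlice(BitReader& r, int numMbs, std::vector<Mtc>& mtc) {
    int mb = 0;
    while (mb < numMbs) {
        unsigned skipRun = r.ue();    // mb_skip_run: consecutive skipped MBs
        for (unsigned s = 0; s < skipRun && mb < numMbs; ++s)
            mtc[mb++] = Mtc::PSkip;
        if (mb >= numMbs) break;
        mtc[mb++] = categorize(r.ue());  // mb_type of a non-skipped MB
        skipRemainingMbSyntax(r);        // advance past the remaining syntax
    }
}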
4 MOVING OBJECT DETECTION ALGORITHM
4.1 Foreground/Background Segmentation
The presented object detection algorithm relies on the
assumptions defined in Section 3. The H.264/AVC
syntax elements are extracted from the bit stream and
decoded, if required. The nal_unit_type is directly
accessible without decoding. To access the other syn-
tax elements the bit stream has to be parsed, i.e., en-
tropy decoded. Already during the parsing process
each macroblock is assigned to one of the six cate-
gories MTC and the corresponding weight MTW is
set.
Example category MTC and weight MTW maps are
shown in Figure 2(b) and Figure 2(c), respectively.
The colors within the category MTC map are defined
as follows:
- MB_I_4x4: light red
- MB_I_16x16: red
- MB_P_8x8: light blue
- MB_P_RECT: blue
- MB_P_16x16: dark blue
- MB_P_SKIP: black
Figure 2: Sample maps and masks created by the algorithm (sequence: door): (a) original frame, (b) MTC map, (c) MTW map, (d) MOC map, (e) binary mask before box filtering, (f) segmented frame before box filtering, (g) binary mask after box filtering, (h) segmented frame after box filtering.
Figure 3: 3-dimensional illustration of the discrete kernels for different MTWs: (a) w_t[x, y] = 1, (b) w_t[x, y] = 2, (c) w_t[x, y] = 3. Following (1), the kernel for w_t = 1 is a 3x3 block of ones; for w_t = 2, a 3x3 block of twos with ones at the four cardinal positions at distance 2; for w_t = 3, a 3x3 block of threes with twos at the four cardinal distance-2 positions and ones at the remaining positions of the 5x5 support.
The weight MTW map (and also the certainty MOC
map in Figure 2(d)) is illustrated as a grayscale
picture, in which brighter gray levels denote a higher
weight (or a higher certainty in the case of the MOC map).
The main challenge of the algorithm is to create
a robust map that indicates where within the scene
moving objects are located, or in other words to
transform category MTC/weight MTW maps to cer-
tainty MOC maps. These maps significantly differ
from each other, since weight MTW maps do not
take dependencies between neighboring macroblocks
into account while certainty MOC maps do.
Macroblocks have a size of 16 by 16 pixels, and the
assumption that actual moving objects usually span
several macroblocks requires processing them jointly.
The certainty c[x, y] of a single macroblock m[x, y]
(with Cartesian coordinates [x, y]) that depends on the
weights w_t[x + i, y + j] of all macroblocks in a
designated neighboring area (translation indicated by
(i, j)) is defined as

c[x, y] = \sum_{j=-2}^{2} \sum_{i=-2}^{2} w_{ij}[x, y], \qquad (1)

where

w_{ij}[x, y] = \begin{cases} w_t[x + i, y + j], & i, j \in \{-1, 0, 1\} \\ (w_t[x + i, y + j] - 1)^+, & (i, j) \in \{(-2, 0), (0, -2), (0, 2), (2, 0)\} \\ (w_t[x + i, y + j] - 2)^+, & \text{otherwise}. \end{cases}

Thereby, the operator (\cdot)^+ is defined as

(\cdot)^+ : \mathbb{Z} \to \mathbb{N}_0, \quad a \mapsto (a)^+ := \max(0, a).
Figure 4: Sample calculation of the MOC of a macroblock (black dashed box): (a) MTC/MTW map and resulting MOC; (b) kernels of the neighboring macroblocks (white dotted boxes) and of the macroblock itself (black dashed box).

According to (1) the certainty MOC of a macroblock
depends on the weights MTW of its eight direct
neighbors and on the weights MTW of the 16
neighbors of their direct neighbors. Thereby, the val-
ues are weighted according to their distance to the
current macroblock. Direct neighbors are weighted
just like the macroblock itself. Neighbors at a greater
distance factor into the certainty MOC with a decreased
weight, since it is assumed that their mutual
interdependence with respect to the presence of an
object is also lower.
A more illustrative description of the algorithm is
depicted in Figure 4. At each macroblock position
(white dotted boxes in Figure 4(b)) a discrete kernel
is set according to the macroblock's weight MTW. In
case the weight MTW equals 0, all points of the kernel
are also 0, i.e., the weight MTW of this macroblock
does not affect any other macroblock. The three other
kernels can be seen in Figure 3. Once the kernels of
the relevant neighboring macroblocks are set, the
certainty MOC of the current macroblock is calculated
by summing all overlapping kernel values at its
position (black dashed box). In the example in
Figure 4(a) this equals 1 + 0 + 2 + 0 + 2 + 3 = 8.
Note that if the current macroblock lies near the
frame border, some of its neighbors will not exist. In
this case the weight MTW map is extended to the re-
quired size and the weights MTW of the new border
macroblocks are set to 0.
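To make the calculation concrete, the following C++ sketch computes the certainty MOC map from a weight MTW map according to (1), including the border rule just described. The data layout and function names are our own illustration and not taken from the implementation.

#include <algorithm>
#include <cstdlib>
#include <vector>

// Compute the MOC map from the MTW map following Eq. (1).
// W and H are the frame dimensions in macroblocks, stored row by row.
std::vector<int> computeMocMap(const std::vector<int>& mtw, int W, int H) {
    // Weights of macroblocks beyond the frame border count as 0.
    auto wt = [&](int x, int y) {
        return (x < 0 || y < 0 || x >= W || y >= H) ? 0 : mtw[y * W + x];
    };
    auto pos = [](int a) { return std::max(0, a); };  // the (.)^+ operator
    std::vector<int> moc(W * H, 0);
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            int c = 0;
            for (int j = -2; j <= 2; ++j) {
                for (int i = -2; i <= 2; ++i) {
                    int w = wt(x + i, y + j);
                    if (std::abs(i) <= 1 && std::abs(j) <= 1)
                        c += w;           // macroblock itself and direct neighbors
                    else if ((std::abs(i) == 2 && j == 0) ||
                             (i == 0 && std::abs(j) == 2))
                        c += pos(w - 1);  // cardinal neighbors at distance 2
                    else
                        c += pos(w - 2);  // remaining positions of the 5x5 area
                }
            }
            moc[y * W + x] = c;
        }
    }
    return moc;
}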
The next step of the algorithm is to segment the
current frame into fore- and background. Thereto, the
calculated certainty MOC map is thresholded by t.
Whether a macroblock m[x, y] is part of the foreground
is calculated by

m[x, y] = \begin{cases} 1, & c[x, y] \geq t \\ 0, & \text{otherwise}, \end{cases} \qquad (2)
where 1 indicates the foreground and 0 indicates the
background of the scene, which is illustrated within
the binary masks in Figure 2(e) and Figure 2(g) by
white and black blocks, respectively.
4.2 Box Filtering
The resulting binary mask of the segmentation
process is then further refined by an n×n box filtering
process. That means that if most macroblocks in the
surrounding n×n region of a single macroblock
(including the macroblock itself) are labeled as
foreground, this macroblock is also labeled as
foreground, and vice versa. The purpose of this step
is to eliminate very rarely occurring holes within
objects and to filter out remaining isolated
foreground-labeled macroblocks. Furthermore, object
edges are smoothed, as can be seen in Figure 2(g),
which represents the filtered version of Figure 2(e).
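The segmentation and filtering steps can be sketched accordingly. The following C++ fragment thresholds the MOC map as in (2) and applies the n×n majority vote; counting only macroblocks that exist inside the frame at the borders is an assumption of ours, since the border handling of the filter is not specified above.

#include <vector>

// Threshold the MOC map (Eq. (2)) and refine the binary mask with an
// n x n majority box filter as described in Section 4.2.
std::vector<int> segmentAndFilter(const std::vector<int>& moc,
                                  int W, int H, int t, int n = 3) {
    std::vector<int> mask(W * H);
    for (int k = 0; k < W * H; ++k)
        mask[k] = (moc[k] >= t) ? 1 : 0;  // 1 = foreground, 0 = background

    std::vector<int> filtered(W * H);
    const int r = n / 2;
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            int fg = 0, total = 0;
            for (int j = -r; j <= r; ++j) {
                for (int i = -r; i <= r; ++i) {
                    const int xx = x + i, yy = y + j;
                    if (xx < 0 || yy < 0 || xx >= W || yy >= H) continue;
                    ++total;
                    fg += mask[yy * W + xx];
                }
            }
            // Majority vote over the region, including the macroblock itself.
            filtered[y * W + x] = (2 * fg > total) ? 1 : 0;
        }
    }
    return filtered;
}

With t = 6 and n = 3, this corresponds to the setup used in Section 5.2.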
5 EXPERIMENTAL RESULTS
5.1 Performance Measures
The performance of the method is measured by the
following procedure. Since the analysis is frame-based,
for each frame k a manually labeled ground truth
states the set of pixels S_pix[k] of the moving objects.
The proposed algorithm segments macroblocks into
fore- and background. For this reason, we also defined
the set of macroblocks S_mb[k] for each frame k as
ground truth. Thereby, a macroblock is denoted as
foreground if at least one of its pixels is considered
foreground.

Two conventional measures are used to evaluate
the results: recall and precision. For comparing sets
of pixels, they are defined as

r_{\text{pix}}[k] = \frac{N_c^{\text{pix}}[k]}{N_c^{\text{pix}}[k] + N_m^{\text{pix}}[k]}, \qquad (3)

p_{\text{pix}}[k] = \frac{N_c^{\text{pix}}[k]}{N_c^{\text{pix}}[k] + N_f^{\text{pix}}[k]}, \qquad (4)
where N_c^pix[k] is the number of correctly detected
pixels, N_m^pix[k] is the number of missed pixels, i.e.,
pixels that are labeled foreground in the ground truth
but have not been detected, and N_f^pix[k] is the number
of pixels that have falsely been considered foreground.

Table 2: Experimental results of several test sequences.

Sequence            r_pix  p_pix  r_mb  p_mb
door                0.96   0.81   0.95  0.88
room1pFreeBlk       0.99   0.40   0.98  0.60
room2pXingDiagBlk   0.97   0.47   0.95  0.65
room2pXingEqMix     0.58   0.50   0.57  0.64
room2pXingDiagMix   0.84   0.48   0.83  0.60
campus4-c0          0.75   0.48   0.71  0.74
campus7-c1          0.98   0.65   0.95  0.82
laboratory4p-c0     0.99   0.38   0.98  0.57
laboratory6p-c1     0.98   0.16   0.97  0.29
terrace1-c0         0.98   0.40   0.96  0.66
terrace2-c1         0.98   0.41   0.97  0.68
Similarly, recall and precision on macroblock
level are defined as

r_{\text{mb}}[k] = \frac{N_c^{\text{mb}}[k]}{N_c^{\text{mb}}[k] + N_m^{\text{mb}}[k]}, \qquad (5)

p_{\text{mb}}[k] = \frac{N_c^{\text{mb}}[k]}{N_c^{\text{mb}}[k] + N_f^{\text{mb}}[k]}. \qquad (6)
The final step to get recall and precision measures
for a whole sequence is an averaging process. The
pixel measures are defined as

r_{\text{pix}} = \frac{1}{N_{\text{frame}}} \sum_k r_{\text{pix}}[k], \qquad (7)

p_{\text{pix}} = \frac{1}{N_{\text{frame}}} \sum_k p_{\text{pix}}[k], \qquad (8)

and the macroblock level measures are defined as

r_{\text{mb}} = \frac{1}{N_{\text{frame}}} \sum_k r_{\text{mb}}[k], \qquad (9)

p_{\text{mb}} = \frac{1}{N_{\text{frame}}} \sum_k p_{\text{mb}}[k], \qquad (10)

where N_frame is the number of frames of the sequence.
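As a small illustration of how these measures can be computed, the following C++ sketch evaluates the per-frame values and the sequence averages; the Counts struct and the function names are our own and not part of the described implementation.

#include <vector>

// Per-frame counts: N_c (correct), N_m (missed), N_f (false detections).
struct Counts { int correct; int missed; int falsePos; };

// Per-frame recall and precision, Eqs. (3)-(6).
double recall(const Counts& c) {
    return static_cast<double>(c.correct) / (c.correct + c.missed);
}
double precision(const Counts& c) {
    return static_cast<double>(c.correct) / (c.correct + c.falsePos);
}

// Sequence average, Eqs. (7)-(10): 1/N_frame times the sum over all frames k.
double averageRecall(const std::vector<Counts>& perFrame) {
    double sum = 0.0;
    for (const Counts& c : perFrame) sum += recall(c);
    return sum / perFrame.size();
}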
5.2 Test Sequences and Setup
The algorithm has been tested with several
H.264/AVC video sequences, including sequences
from the data set of CVLAB (Berclaz et al., 2011)
and self-created sequences. A detailed description
for each test sequence is given in the Appendix.
The sequences have been encoded at variable bit rate
with our own implementation of the H.264/AVC
Baseline profile. The GOP size has been set to ten
frames. During the segmentation process we set t =
6, which fits best to the defined macroblock weights
MTW. For box filtering we applied a 3x3 filter.
5.3 Result Discussion
An overview of the results is given in Table 2.
The first column r_pix represents the recall values
of the comparison between the ground truth in pixel
accuracy and the results of the algorithm in macroblock
accuracy. Although the resulting foreground masks
are block-based, for the majority of sequences the
method achieves 96% and above. That means that
almost all foreground pixels could be detected correctly
and only very few have been missed. This can also
be seen in the third column r_mb, which represents the
recall values of the comparison between the results of
the algorithm and the ground truth in macroblock
accuracy. Macroblock accuracy in this scope means that
each macroblock with at least one foreground pixel
is regarded as foreground. In many cases, e.g., if the
object is located at the edge of a macroblock row or
column, this consideration will lead to more pixels
being labeled as foreground than their actual amount.
That is the reason why values in the third column are
always slightly smaller than in the first column.
For a few sequences in Table 2 the algorithm
does not achieve very high recall values. This oc-
curs when objects stop moving within the scene. In
case an annotated object stops but is still visible, it
is correctly labeled foreground in the ground truth,
but most encoders will decide to use Skip mode
for its macroblocks. Hence, our algorithm is not
able to detect these objects anymore because they
do not differ from the background. This happens in
sequences room2pXingEqMix, room2pXingDiagMix,
and campus4-c0. Figure 5 illustrates the recall r_pix
and precision p_pix values for each frame with
available ground truth of sequence campus4-c0.
Approximately between frames 200 and 250 the only visible
person stops moving. During this period, recall val-
ues drop to almost 0, while corresponding precision
values increase to 100%. This means that on the one
hand this object can admittedly not be detected cor-
rectly, but on the other hand also no false detections
occur. The same behavior can be seen at the end of
the sequence, where the three visible persons stop one
after another to talk to each other.
The second and fourth columns in Table 2 represent
the precision of the algorithm. The values for
pixel accuracy comparison p_pix do not reach
percentages as high as their corresponding recall
values. The main reason for this is that it is not possible
to completely eliminate false detections with an
accuracy of macroblock size. Therefore, comparing the
detection results to ground truth in macroblock
accuracy, as can be seen in column p_mb, achieves a
significantly increased precision, up to 27 percentage points.
CompressedDomainMovingObjectDetectionbasedonH.264/AVCMacroblockTypes
225
Figure 5: Recall and precision in pixel accuracy of sequence campus4-c0: (a) recall r_pix[k], (b) precision p_pix[k], plotted over frame k.
Even though macroblock accuracy comparison
improves the results, precision values mostly do not
exceed 70%. There are mainly two reasons: object
shadows and image noise.
Mainly in outdoor sequences, moving objects cast
shadows that move along with them. The algorithm
detects these shadows as moving regions as well, be-
cause it is not possible to distinguish between ac-
tual objects and shadows on macroblock level. This
leads to an increased number of false detections. Fig-
ure 6 shows recall and precision values of sequence
campus7-c1. Within this scene, approximately be-
tween frames 175 and 250 and frames 455 and 865
no visible moving objects occur. During these periods,
both recall and precision are constantly at 100%.
The latter demonstrates that during the absence of
moving objects no macroblocks are falsely detected
as moving regions, i.e., in this setup false detections
are mostly caused by shadows.
false detections are caused by shadows.
The second reason for false detections is im-
age noise. The video content of the sequences
laboratory4p-c0 and laboratory6p-c1 is quite noisy.
In such sequences the difference between frames with
similar content is significantly larger than in high-
quality sequences. Therefore, macroblocks in
non-partitioned or Skip modes are rarely used during
the encoding process, i.e., the algorithm will not only detect
the actual objects as moving regions but significantly
noisy regions as well.
Figure 6: Recall and precision in macroblock accuracy of sequence campus7-c1: (a) recall r_mb[k], (b) precision p_mb[k], plotted over frame k.
5.4 Processing Speed
The processing speed of the algorithm depends on the
video resolution of the test sequence and the number
of moving objects that are present within the scene.
Several measurements showed that our C++
implementation (without code optimizations or parallel
processing) is able to process about 3600 frames
per second for sequences in CIF resolution and 1900
frames per second for sequences in VGA resolution,
measured on an Intel® Core™ i5-2520M CPU @ 2.5 GHz
with 4 GB RAM. The average number of processed
frames per second for each sequence is given
in Table 3.
6 CONCLUSIONS
In this paper, we presented a novel compressed do-
main moving object detection method based on ana-
lyzing macroblock types only. The frame-based al-
gorithm extracts and evaluates the type of each single
macroblock. Each macroblock is assigned a moving
object certainty that is calculated by factoring in the
types of neighboring macroblocks. The results
demonstrate that this approach reaches suitable
detection rates within the limits of compressed domain
processing, despite its very low complexity.
This enables its use as an adequate preselection step
within a multi-tier parallel processing system.
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
226
Table 3: Average number of processed frames per second.

Resolution  Sequence            Frames per Second
CIF         door                7587.82
            campus4-c0          4636.33
            campus7-c1          4878.68
            laboratory4p-c0     1953.28
            laboratory6p-c1     1366.58
            terrace1-c0         2496.29
            terrace2-c1         2628.99
            Average:            3649.71
VGA         room1pFreeBlk       1693.33
            room2pXingDiagBlk   1921.42
            room2pXingEqMix     2202.27
            room2pXingDiagMix   1820.84
            Average:            1909.47
It is envisioned to further enhance the method by
refining the segmentation process to eliminate spurious
detections caused by, e.g., shadows, and by exploiting
temporal dependencies between consecutive frames.
The latter would also enable the algorithm to track
moving objects.
ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the European Union’s Seventh Frame-
work Programme ([FP7/2007-2013]) under grant
agreement no. 285248 (FI-WARE).
REFERENCES
Berclaz, J., Fleuret, F., Turetken, E., and Fua, P. (2011).
Multiple Object Tracking Using K-Shortest Paths Op-
timization. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 33(9):1806–1819.
Comaniciu, D. and Meer, P. (2002). Mean Shift: A Ro-
bust Approach Toward Feature Space Analysis. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 24(5):603–619.
Fei, W. and Zhu, S. (2010). Mean Shift Clustering-based
Moving Object Segmentation in the H.264 Com-
pressed Domain. IET Image Processing, 4(1):11–18.
Laumer, M., Amon, P., Hutter, A., and Kaup, A. (2011). A
Compressed Domain Change Detection Algorithm for
RTP Streams in Video Surveillance Applications. In
Proc. IEEE 13th Int. Workshop on Multimedia Signal
Processing (MMSP), pages 1–6.
Mak, C.-M. and Cham, W.-K. (2009). Real-time Video Ob-
ject Segmentation in H.264 Compressed Domain. IET
Image Processing, 3(5):272–285.
MPEG (2010). ISO/IEC 14496-10:2010 - Coding of Audio-
Visual Objects - Part 10: Advanced Video Coding.
Poppe, C., De Bruyne, S., Paridaens, T., Lambert, P., and
Van de Walle, R. (2009). Moving Object Detec-
tion in the H.264/AVC Compressed Domain for Video
Surveillance Applications. Journal of Visual Commu-
nication and Image Representation, 20(6):428–437.
Porikli, F., Bashir, F., and Sun, H. (2010). Compressed
Domain Video Object Segmentation. IEEE Transac-
tions on Circuits and Systems for Video Technology,
20(1):2–14.
Qiya, Z. and Zhicheng, L. (2009). Moving Object Detection
Algorithm for H.264/AVC Compressed Video Stream.
In Proc. Int. Colloquium on Computing, Communica-
tion, Control, and Management (CCCM), volume 1,
pages 186–189.
Szczerba, K., Forchhammer, S., Stottrup-Andersen, J., and
Eybye, P. T. (2009). Fast Compressed Domain Motion
Detection in H.264 Video Streams for Video Surveil-
lance Applications. In Proc. Sixth IEEE Int. Conf.
on Advanced Video and Signal Based Surveillance
(AVSS), pages 478–483.
VCEG (2011). H.264: Advanced Video Coding for Generic
Audiovisual Services.
Verstockt, S., De Bruyne, S., Poppe, C., Lambert, P., and
Van de Walle, R. (2009). Multi-view Object Local-
ization in H.264/AVC Compressed Domain. In Proc.
Sixth IEEE Int. Conf. on Advanced Video and Signal
Based Surveillance (AVSS), pages 370–374.
Wang, R., Zhang, H.-J., and Zhang, Y.-Q. (2000). A Confi-
dence Measure Based Moving Object Extraction Sys-
tem Built for Compressed Domain. In Proc. IEEE
Int. Symp. on Circuits and Systems (ISCAS), volume 5,
pages 21–24.
APPENDIX
A detailed description of each test sequence is given
in Table 4 and Table 5. The column 'GT Distance'
indicates the distance (in frames) between frames with
available ground truth.
CompressedDomainMovingObjectDetectionbasedonH.264/AVCMacroblockTypes
227
Table 4: Detailed description of self-created test sequences (sample frame images omitted).

Sequence            N_frame  Resolution  FPS  GOP Size  GT Distance
door                794      352x288     30   10        1
room1pFreeBlk       423      640x480     30   10        1
room2pXingDiagBlk   196      640x480     30   10        10
room2pXingEqMix     246      640x480     30   10        10
room2pXingDiagMix   174      640x480     30   10        10
Table 5: Detailed description of CVLAB test sequences (sample frame images omitted).

Sequence          N_frame  Resolution  FPS  GOP Size  GT Distance
campus4-c0        1005     352x288     25   10        10
campus7-c1        1005     352x288     25   10        10
laboratory4p-c0   1005     352x288     25   10        10
laboratory6p-c1   1005     352x288     25   10        10
terrace1-c0       1005     352x288     25   10        10
terrace2-c1       1005     352x288     25   10        10
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
228