have no corresponding motion vector, the algorithm
interpolates a vector from the preceding and succeeding
frames. This results in a dense motion vector field.
This dense motion vector field is further analyzed to
estimate vectors that represent real motion by calcu-
lating spatial and temporal confidences as introduced
by (Wang et al., 2000).
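As a rough illustration, such a densification step might be sketched as follows. This is a simplified sketch, not the authors' implementation: it assumes block-level motion vector fields stored as NumPy arrays with missing vectors marked as NaN, and it reduces the confidence measures of Wang et al. to a single spatial similarity term.

```python
import numpy as np

def densify_mv_field(mv_prev, mv_cur, mv_next):
    """Fill blocks without a motion vector (marked NaN) by
    averaging the co-located vectors of the previous and next
    frame -- a simplified temporal interpolation."""
    dense = mv_cur.copy()
    missing = np.isnan(dense).any(axis=-1)
    dense[missing] = 0.5 * (mv_prev[missing] + mv_next[missing])
    return dense

def spatial_confidence(mv, y, x):
    """Simplified spatial confidence: similarity of a vector to
    the median of its 3x3 neighborhood (1.0 = identical)."""
    h, w, _ = mv.shape
    ys = slice(max(0, y - 1), min(h, y + 2))
    xs = slice(max(0, x - 1), min(w, x + 2))
    med = np.median(mv[ys, xs].reshape(-1, 2), axis=0)
    return 1.0 / (1.0 + np.linalg.norm(mv[y, x] - med))
```

A vector that agrees with its spatial neighborhood receives a confidence near 1, which is the intuition behind treating it as representing real motion.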
Other object detection methods do not solely
analyze motion vectors but also exploit additional
compressed information, like macroblock partition
modes, e.g., (Fei and Zhu, 2010) and (Qiya and
Zhicheng, 2009) or transform coefficients, e.g., (Mak
and Cham, 2009) and (Porikli et al., 2010).
(Fei and Zhu, 2010), for instance, presented a
study on mean shift clustering based moving object
segmentation for H.264/AVC video streams. In a first
step, their method refines the extracted raw motion
vector field by normalization, median filtering, and
global motion compensation; already at this stage, the
algorithm uses macroblock partition modes to enhance
the filtering process. The resulting dense
motion vector field and the macroblock modes then
serve as input for a mean shift clustering based object
segmentation process, adopted from pixel domain ap-
proaches, e.g., introduced by (Comaniciu and Meer,
2002).
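The refinement stage of such a pipeline might look as follows. This is only a sketch under simplifying assumptions: global motion is approximated by the field's median vector, the use of macroblock partition modes during filtering is omitted, and the mean shift clustering step itself is not shown.

```python
import numpy as np

def refine_mv_field(mv):
    """Simplified motion vector field refinement: 3x3
    component-wise median filtering followed by global motion
    compensation, where global motion is approximated by the
    median vector of the whole field (an illustrative
    simplification, not Fei and Zhu's estimator)."""
    h, w, _ = mv.shape
    filtered = np.empty_like(mv)
    for y in range(h):
        for x in range(w):
            ys = slice(max(0, y - 1), min(h, y + 2))
            xs = slice(max(0, x - 1), min(w, x + 2))
            filtered[y, x] = np.median(mv[ys, xs].reshape(-1, 2), axis=0)
    global_motion = np.median(filtered.reshape(-1, 2), axis=0)
    # Subtracting the global component leaves local (object) motion.
    return filtered - global_motion
```

After this step, a field dominated by camera pan becomes near zero everywhere except where objects move independently, which is what the subsequent clustering operates on.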
(Mak and Cham, 2009) on the other hand analyze
motion vectors in combination with transform coeffi-
cients to segment H.264/AVC video streams into fore-
ground and background. Similar to the techniques
described before, their algorithm initially extracts and
refines the motion vector field by normalization, fil-
tering, and background motion estimation. After that,
the foreground field is modeled as a Markov random
field, in which the transform coefficients serve as
an indicator of the texture of the video content. The
resulting field indicates fore- and background regions,
which are further refined by assigning labels to distinct
objects.
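A toy version of such a Markov-random-field labeling can be sketched with iterated conditional modes (ICM): the data term favors foreground where residual motion is large, weighted by a texture indicator (since motion vectors are unreliable in flat regions), and a Potts smoothness term encourages neighboring blocks to share a label. All weights and terms here are illustrative choices, not those of Mak and Cham.

```python
import numpy as np

def icm_foreground(mv_mag, texture, beta=1.0, iters=5):
    """Toy MRF foreground segmentation solved with ICM.
    mv_mag: per-block residual motion magnitude; texture:
    per-block texture indicator (e.g., from transform
    coefficients). Returns 0/1 labels (1 = foreground)."""
    evidence = mv_mag * texture
    labels = (evidence > np.mean(evidence)).astype(int)
    h, w = labels.shape
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                nb = [labels[yy, xx]
                      for yy, xx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                      if 0 <= yy < h and 0 <= xx < w]
                cost = {}
                for lab in (0, 1):
                    # Background is costly where evidence of motion
                    # is strong; foreground is costly where it is weak.
                    data = evidence[y, x] if lab == 0 \
                        else 1.0 / (1.0 + evidence[y, x])
                    smooth = beta * sum(n != lab for n in nb)
                    cost[lab] = data + smooth
                labels[y, x] = min(cost, key=cost.get)
    return labels
```

The smoothness term is what makes the result a coherent region map rather than a noisy per-block threshold.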
(Poppe et al., 2009) introduced an algorithm for
moving object detection in the H.264/AVC com-
pressed domain that evaluates the size of macroblocks
(in bits) within video streams. Here, the size of
a macroblock comprises all corresponding syntax ele-
ments and the encoded transform coefficients. The
first step of their algorithm is to find the maximum
size of background macroblocks, which is performed
in an initial training phase. During the subsequent
analysis, each macroblock that exceeds this size is re-
garded as foreground as an intermediate step. Mac-
roblocks of smaller size are divided into macroblocks
in Skip mode and all others. The labeling of mac-
roblocks in Skip mode depends on the labels of their
direct neighbors, while all other macroblocks are di-
rectly labeled as background. Subsequent spatial and
temporal filtering steps refine this segmentation: dur-
ing spatial filtering, background macroblocks are
changed to foreground if most of their neighbors are
foreground; during temporal filtering, foreground
macroblocks are changed to background if they are
foreground in neither the previous nor the next frame.
The last refinement step evaluates boundary mac-
roblocks on a sub-macroblock level of 4x4 pixels.
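The training and intermediate labeling steps can be sketched as follows. This is a simplified illustration under assumptions of our own: macroblock sizes are given as a 2D array of bit counts per frame, the training frames are assumed to contain only background, and the spatial/temporal refinement described above is omitted.

```python
import numpy as np

def train_threshold(bg_frames):
    """Maximum macroblock size (in bits) observed during an
    initial training phase assumed to show only background."""
    return max(f.max() for f in bg_frames)

def segment_frame(sizes, skip, thr):
    """Intermediate labeling of one frame: macroblocks larger
    than the learned threshold -> foreground; smaller ones in
    Skip mode -> foreground if any direct neighbor is
    foreground; all other small macroblocks -> background."""
    fg = sizes > thr
    h, w = sizes.shape
    out = fg.copy()
    for y in range(h):
        for x in range(w):
            if not fg[y, x] and skip[y, x]:
                nb = [fg[yy, xx]
                      for yy, xx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                      if 0 <= yy < h and 0 <= xx < w]
                out[y, x] = any(nb)
    return out
```

The appeal of this approach is that macroblock sizes are available after parsing alone, without reconstructing any pixel data.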
Extracting motion vectors and transform coeffi-
cients from a compressed video stream requires more
decoding steps than just extracting information about
macroblock types and partitions. Hence, attempts
have been made to directly analyze these syntax el-
ements.
(Verstockt et al., 2009) proposed an algorithm
for detecting moving objects by just extracting mac-
roblock partition information from H.264/AVC video
streams. First, they perform a foreground segmen-
tation by assigning macroblocks to foreground and
background, which results in a binary mask for the ex-
amined frame. Here, macroblocks in 16x16 partition
mode (i.e., no sub-partitioning of the macroblock,
including the skip mode) are regarded as background
and all other macroblocks are labeled foreground. To
further enhance the generated mask, their algorithm
then performs temporal differencing of several masks
and median filtering of the results. In a final step,
objects are extracted by blob merging and convex
hull fitting techniques. (Verstockt et al., 2009) de-
signed their algorithm for multi-view object localiza-
tion. Hence, the extracted objects of a single view
then serve as input for the multi-view object detection
step.
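The mask generation and a simplified temporal refinement might be sketched as follows. The mode encoding (0 = 16x16 partition including Skip) is a made-up convention for illustration, and the temporal differencing and median filtering of Verstockt et al. is reduced here to a majority vote over consecutive masks.

```python
import numpy as np

def partition_mask(partition_modes):
    """Binary foreground mask from macroblock partition modes:
    16x16 mode (encoded as 0 here, including Skip) counts as
    background; every sub-partitioned macroblock as foreground."""
    return partition_modes != 0

def temporal_median(masks):
    """Majority vote over a stack of consecutive binary masks --
    a simplified stand-in for the temporal differencing and
    median filtering step."""
    stack = np.stack(masks).astype(int)
    return np.median(stack, axis=0) >= 0.5
```

A foreground block that appears in only a single mask is suppressed by the vote, which removes much of the noise caused by encoder mode decisions.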
A more basic detection method than moving ob-
ject detection is to detect global content changes
within scenes. (Laumer et al., 2011) designed a
change detection algorithm for RTP streams that does
not require video decoding at all. They presented the
method as a preselection for further analysis modules,
since change detection can be seen as a preliminary
stage of, e.g., moving object detection. Each moving
object causes a global change within the scene. Their
algorithm evaluates RTP packet sizes and the number
of packets per frame. Since no decoding of video data
is performed, the method is codec-independent and
very efficient.
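In spirit, such a detector might be sketched as follows. This is our own simplified illustration, not Laumer et al.'s method: per-frame RTP statistics are assumed to be given as (total bytes, packet count) pairs, and a frame is flagged when its size deviates strongly from a sliding-window average; the window size and factor are arbitrary illustrative parameters.

```python
def detect_change(frame_stats, window=10, factor=1.5):
    """Codec-independent change detection on RTP metadata.
    frame_stats: list of (total_bytes, packet_count) per frame.
    Returns a list of booleans, True where the frame size
    exceeds factor times the average of the preceding window."""
    changes = []
    for i, (size, _count) in enumerate(frame_stats):
        past = frame_stats[max(0, i - window):i]
        if past:
            avg = sum(s for s, _ in past) / len(past)
            changes.append(size > factor * avg)
        else:
            changes.append(False)  # no history yet
    return changes
```

In practice such a detector would also have to account for the periodic size spikes of intra-coded frames, which this sketch ignores.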
The algorithm we present in this paper solely ex-
tracts and evaluates macroblock types to detect mov-
ing objects in H.264/AVC video streams. It can either
be performed as a stand-alone application or be based
on the results of the change detection algorithm pre-
VISAPP 2013 - International Conference on Computer Vision Theory and Applications