Hybrid Person Detection and Tracking in H.264/AVC Video Streams
Philipp Wojaczek¹, Marcus Laumer¹,², Peter Amon², Andreas Hutter² and André Kaup¹

¹Multimedia Communications and Signal Processing, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
²Imaging and Computer Vision, Siemens Corporate Technology, Munich, Germany
Keywords: Object Detection, Person Detection, Tracking, Compressed Domain, Pixel Domain, H.264/AVC, Macroblocks, Compression, Color Histogram, Hue, HSV, Segmentation.
Abstract: In this paper we present a new hybrid framework for detecting and tracking persons in surveillance video streams compressed according to the H.264/AVC video coding standard. The framework consists of three stages and operates in both the compressed and the pixel domain of the video; this combination of compressed and pixel domain constitutes its hybrid character. Its main objective is to significantly reduce the amount of computation required, in particular for frames and image regions with few people present. In its first stage, the proposed framework evaluates the header information of each compressed frame in the video sequence, namely the macroblock type information. This results in a coarse binary mask segmenting the frame into foreground and background. Only the foreground regions are processed further in the second stage, which searches for persons in the pixel domain by applying a person detector based on the Implicit Shape Model. The third stage segments each detected person further with a newly developed method that fuses information from the first two stages. This helps to obtain a finer segmentation for calculating a color histogram suitable for tracking the person with the mean shift algorithm. The proposed framework was experimentally evaluated on a publicly available test set. The results demonstrate that the framework reliably separates frames with and without persons, such that the computational load is significantly reduced while the detection performance is maintained.
1 INTRODUCTION
Over the past few years, multiple approaches for video-based person detection and tracking have been studied (Yilmaz et al., 2006). Among the most relevant
applications is video surveillance of security-relevant
areas. This comprises very crowded open places like train stations, which are observed to increase the level of security. On the other hand, there are also
security-relevant areas with very few persons present,
e.g., in perimeter protection or wide area surveillance
applications. In this latter case there is typically a high
number of cameras installed. Employing a common
object detection and tracking algorithm that evaluates
every single frame will waste a lot of resources be-
cause most of the frames will not contain any person
or the person(s) in the scene will be present only in
a small region of the image. To address this problem
we propose a new approach to reduce the computa-
tional complexity. The basic idea is to create a hy-
brid framework consisting of three subsequent stages.
The first stage will analyze the video streams in the
compressed domain, enabling a very fast evaluation
of each frame and providing an estimate about the
image content. The following stages will only be trig-
gered upon detection of an object in the first stage and
will then further evaluate the images in the pixel do-
main. This way the video decoding and processing
is only performed when necessary. The framework’s
hybrid character results from the combination of com-
pressed domain and pixel domain and the exchange of
information between these two domains.
The remainder of the paper is organized as fol-
lows: Related work is discussed in Section 2. Sec-
tion 3 describes the proposed hybrid framework and
the algorithms applied. Experimental results of our
approach are reported in Section 4, and Section 5 con-
cludes the paper.
2 RELATED WORK
There exists a large number of approaches for the task of person detection. For instance, (Yilmaz et al., 2006) present a survey that covers different possibilities for representing objects, such as rectangular or elliptical patches or silhouettes, in the context of object detection, seg-
mentation and tracking.
(Eiselein et al., 2013) present a method that im-
proves person detection for crowded scenes based on
crowd density measures. They use the histogram of
oriented gradients (HoG) for detecting people. While the HoG detector normally provides good results, it fails in the analysis of crowded scenes. Therefore, they
create a so-called crowd density map to improve the
detection. The crowd density map can be regarded as a probability map defining image regions that are more or less likely to contain people. The map creation is based
on feature extraction algorithms like Scale-Invariant
Feature Transform (SIFT) (Lowe, 2004). The ex-
tracted features are tracked over some frames us-
ing Robust Local Optical Flow (RLOF) (Senst et al.,
2012) to exclude static features found in the back-
ground. The remaining features are weighted with a
probability density function (pdf) using a 2-D Gaus-
sian kernel density. The resulting 2-D pdf is the crowd
density map, which is used to weight the features
found by HoG to improve the detection results.
(Poppe et al., 2009) face the challenge of people
detection by establishing a method which evaluates
the video sequence in the compressed domain. The
idea is based on an observation made on the num-
ber of bits in every macroblock. The macroblocks describing the background are well predicted and, since P frames are used, the resulting number of bits is very low. However, when a person enters the scene,
the number of bits of the macroblocks containing
parts of the person increases, because good compres-
sion is difficult to achieve since it is hard to find a
reference block. First, the number of bits needed for
each macroblock is counted when a frame contains
only background. This number of bits serves as a ref-
erence. The object detection starts by counting the
number of bits of every macroblock in the subsequent
frames. If the number of bits of a macroblock in-
creases compared to the number of reference bits of
the macroblock, it is likely that this macroblock rep-
resents an object.
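As an illustration only (not the implementation of (Poppe et al., 2009)), such a bit-count test might look as follows in Python; the growth factor is an assumed simplification of their decision rule.

```python
import numpy as np

def detect_by_bitcount(ref_bits, frame_bits, factor=2.0):
    """Flag macroblocks whose coded size grew markedly over the
    background reference. ref_bits and frame_bits are 2-D arrays
    holding the number of bits per macroblock; 'factor' is a
    hypothetical growth threshold (the original paper derives its
    decision rule differently)."""
    return frame_bits > factor * np.maximum(ref_bits, 1)

# Usage: a macroblock whose bit count roughly doubles is marked as object.
ref = np.array([[10, 12], [11, 9]])
cur = np.array([[11, 40], [10, 9]])
print(detect_by_bitcount(ref, cur))  # [[False  True] [False False]]
```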
(Evans et al., 2013) present a multicamera ap-
proach for object detection and tracking. The ap-
proach follows the idea of projecting a foreground
mask onto a coordinate system. The foreground
mask is derived from processing each image using
the Adaptive Gaussian Mixture Model. The coordi-
nate system is a rectangular grid structure, which is
defined on the ground plane of the scene. It is called
synergy map. Based on the number of foreground pix-
els backprojected from each image onto the synergy
map, a value is computed which expresses the probability that a person is present. An object is created in 3D
space based on that value. The object is defined by a 3D bounding box.

Figure 1: Flowchart of the proposed framework.

To track the object in the new
frame the dimensions of the 3D bounding box are op-
timized to fit the object in the new frame. This is done
by projecting the 3D bounding box into the images
of each camera. The dimensions of the resulting 2D
bounding boxes are optimized such that the perime-
ter of the box surrounds the foreground region. The
ideal 3D bounding box is the smallest box, which sur-
rounds all 2D boxes.
There is a drawback in all of the mentioned ap-
proaches. The methods analyzing the sequences in
the compressed domain are very fast but cannot de-
termine object positions as precisely as pixel domain
approaches. Conversely, methods analyzing the sequences in the pixel domain achieve very good results in people detection, but they are generally more complex and therefore often not appropriate for real-time scenarios. Our approach combines both approaches: it has lower complexity while achieving people detection results comparable to those of commonly used pixel domain methods.
3 HYBRID PERSON DETECTION
AND TRACKING
Figure 1 shows a flowchart of the framework with
the stages “Compressed domain moving object detec-
tion”, “Pixel domain object detection” and “Tracking
algorithm”. In each stage, a different algorithm is applied.
The first stage consists of a compressed domain
moving object detection (CDMOD) algorithm.

Figure 2: Binary map (black regions) from CDMOD and bounding box (yellow rectangle) from PDOD as overlay on frame 138 from sequence “terrace1-c0”.

It analyzes each frame of the video sequence, which is en-
coded following the H.264/AVC video coding stan-
dard. This algorithm evaluates syntax elements ex-
tracted from the bitstream without the necessity of full
decoding. Based on that information a binary map for
each frame is created. The binary map consists of
ones and zeros defining foreground and background
of the analyzed image, respectively. Foreground is
defined as the image regions, where moving objects
are assumed.
If a frame can be segmented into foreground and
background, it is likely that there is an object some-
where in the segmented foreground. The binary map
is handed over to an algorithm searching for objects
in the so-called pixel domain: the pixel domain ob-
ject detection (PDOD). For this purpose, the frame has to be decoded before analysis. The second algorithm is needed because the CDMOD cannot differentiate between objects of different types; for example, slightly moving trees or noise would also produce foreground in the binary mask. The CDMOD merely provides a coarse segmentation. The PDOD algorithm analyzes the image of the
sequence only in the foreground defined by the seg-
mentation made by the CDMOD. If no object can be
detected, the next frame will be analyzed by the CD-
MOD. Otherwise, if an object is found, information describing the object’s position and scale is handed over to the last stage, the tracking.
It is most likely that the PDOD detects the same
object again in consecutive frames. Therefore, new
detections are regarded as candidates for new, still
unknown persons and the tracking algorithm initially
tries to match new detections with already known ob-
jects. This guarantees that only as many objects are tracked as are actually present in the sequence. Each detected object is tracked in the following frames. As soon as the tracking algorithm finishes, the CDMOD analyzes the next frame of the sequence. Details of the employed algorithms are given in the following subsections.
A person that does not move in consecutive frames remains undetected by the PDOD, since the CDMOD detects only moving objects and thus does not create a binary mask. However, a person has to move in order to appear in the image plane in the first place. Therefore, a binary mask is created as soon as the person enters the scene, and it is very likely that the person is detected before stopping. The person’s position is then known and the person can be tracked. In case the person stops moving, the position remains constant until the person moves and can be tracked again.
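To make this control flow concrete, the following Python sketch chains the three stages; cdmod, pdod, and tracker are hypothetical interfaces standing in for the algorithms of Sections 3.1 to 3.3.

```python
def process_stream(frames, cdmod, pdod, tracker, detect_interval=1):
    """Control-flow sketch of the three-stage framework.

    cdmod(frame)            -> binary foreground mask (numpy array)
    pdod(frame, mask)       -> list of (x, y, w, h) detections
    tracker.update(f, dets) -> match candidates and run mean shift

    All three callables are hypothetical interfaces to the stages
    described in the text.
    """
    for idx, frame in enumerate(frames):
        mask = cdmod(frame)                 # compressed domain, cheap
        if not mask.any():                  # no foreground: skip this frame
            continue
        detections = []
        if idx % detect_interval == 0:      # PDOD may run on every n-th frame only
            detections = pdod(frame, mask)  # pixel domain, foreground regions only
        tracker.update(frame, detections)   # merge candidates, track known objects
```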
3.1 Compressed Domain Moving Object
Detection
The approach described in (Laumer et al., 2013)
is used as CDMOD algorithm. The algorithm an-
alyzes the type of macroblocks used for compres-
sion of videos in the H.264/AVC video coding stan-
dard (MPEG, 2010). The available macroblock types are grouped into so-called macroblock type categories (MTC). Each MTC is assigned a specific weight, the macroblock type weight (MTW). The higher the weight, the higher the assumed probability that motion is present in that macroblock. By analyzing the macroblocks of each frame, a map consisting of MTWs can be created. As the area of a moving object spans multiple macroblocks, the map is further processed such that each macroblock weights its neighboring macroblocks according to its own weight. The resulting map is thresholded, yielding a binary map that defines foreground and background.
An example of a binary mask can be seen in Figure 2.
A prediction can only be made for P frames, since in I frames only intra-frame prediction is allowed. In our configuration we therefore reuse the binary mask from the previous P frame as the binary mask for an I frame.
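The following Python sketch illustrates this weighting-and-thresholding scheme; the category names and weights in MTW are hypothetical placeholders (the actual MTCs and MTWs are defined in (Laumer et al., 2013)), and the 3×3 box filter is one simple way to let each macroblock weight its neighbors.

```python
import numpy as np

# Hypothetical weights per macroblock type category (MTC); the actual
# categories and their weights are defined in (Laumer et al., 2013).
MTW = {"skip": 0.0, "inter_16x16": 0.2, "inter_8x8": 0.6, "intra": 1.0}

def cdmod_mask(mb_types, threshold=0.5):
    """Binary foreground map from per-macroblock type labels.

    mb_types: 2-D array of MTC labels, one entry per macroblock.
    Each macroblock spreads its weight to its 8 neighbors (here a
    simple 3x3 box filter); the map is then thresholded."""
    w = np.vectorize(MTW.get)(mb_types).astype(float)
    padded = np.pad(w, 1)
    h, wd = w.shape
    spread = sum(padded[i:i + h, j:j + wd]
                 for i in range(3) for j in range(3)) / 9.0
    return spread > threshold
```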
3.2 Pixel Domain Object Detection
To find actual objects within segmented frames, the
Implicit Shape Model (ISM) is used (Leibe et al.,
2004). The method consists of a training and a detec-
tion phase. The training phase is done offline. Train-
ing images are searched for keypoints. A codebook is
created, based on computed descriptors describing the
keypoints. In the detection phase, the objects trained
on are searched in the image. An image is searched
for keypoints and the descriptors are matched to the
codebook. The higher the match, the more likely a
person is detected.
In our configuration the feature detection is done
with the SIFT algorithm (Lowe, 2004). Unlike the approach described in (Leibe et al., 2004), we do not use a final pixel-based segmentation. Our configuration of the ISM describes the position of a found object with four parameters: x, y, w, and h, which denote the coordinates of the upper left corner and the width and height of a rectangle surrounding the object. For an example see Figure 2, where a bounding box is visualized as a yellow rectangle.

Figure 3: Flowchart of the tracking algorithm.
The training is based on the TUD-Pedestrians
training data set (Andriluka et al., 2008). It consists
of 400 images showing several persons. Each image is scaled such that each person has the same height of 200 pixels.
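As a rough sketch of how the detection can be restricted to the CDMOD foreground, the following code limits SIFT keypoint extraction to a binary mask with OpenCV (SIFT_create is available in recent OpenCV builds); the ISM codebook matching and voting that turn descriptors into detections are omitted, and the function interface is hypothetical.

```python
import cv2

def keypoints_in_foreground(frame_bgr, fg_mask):
    """Extract SIFT keypoints only inside the CDMOD foreground.

    fg_mask: uint8 array of the frame's size, nonzero where the
    CDMOD binary map indicates foreground. The ISM voting stage
    that turns matched descriptors into (x, y, w, h) detections
    is not shown here."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, fg_mask)
    return keypoints, descriptors
```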
3.3 Tracking Algorithm
Figure 3 shows the flowchart of the tracking algo-
rithm. For each found object four parameters are pro-
vided as input as described in Section 3.2. A color his-
togram for each object is computed. This histogram
is used for the matching of new objects with known
objects and for tracking. Based on that matching, ei-
ther an unknown object is found or it is merged with
a known object. Finally, new positions for all known
objects are evaluated based on the mean shift algo-
rithm (Comaniciu et al., 2003). The mean shift algo-
rithm shifts the bounding box from the known object’s
position to a new one. As the mean shift algorithm
needs a probability density function (pdf) describing
the object to track, the previously computed color his-
togram is reused. The next step is the decision whether the object is leaving the frame. Finally, the next frame is analyzed by the CDMOD.
3.3.1 Computation of Histogram
If an object, i.e. a person, is found, a color histogram
as mentioned above needs to be computed. The
bounding box usually contains too much background
since it is too coarse for describing people’s complex
geometries. Since neither the person’s color appearance is known in advance nor the person’s geometry is rigid, the following method is established to segment the person automatically with less background: the tracking algorithm receives, on the one hand, the bounding box describing the person’s position and, on the other hand, the binary map. These two descriptions are combined as shown in Figure 2.
The image’s black part is background declared by
the CDMOD. The yellow rectangle is the visualized
bounding box from the PDOD. It can clearly be seen
that a rectangle does not fit a person’s shape very well,
as the person’s geometry is too complex. Some parts of the background lie inside the bounding box; this region is denoted $B$. The remaining regions inside the bounding box are defined as $U$. It is still unknown which parts of $U$ belong to the foreground or background.
In the next step of the algorithm, a histogram $h_B$ is computed for $B$. We selected the hue component of the HSV color space for histogram computation, because (Corrales et al., 2009) reported good results in segmenting objects with hue. After evaluating the histogram $h_B$, histograms $h_i$ for smaller image regions belonging to $U$ are computed. Image regions defined as $B$ are not taken into account. Every smaller image region $U_i^{N \times N}$ is of size $N \times N$ pixels. In our configuration $N$ is set to 4. Additionally, for every histogram $h_i$ the Bhattacharyya distance $d_i$ to $h_B$ is calculated according to (Kailath, 1967). The next step is the segmentation of $U$ into foreground and background. Every block $U_i^{N \times N}$ is compared to the background $B$ by comparing its distance $d_i$ to a threshold $t_{Avg}$. $t_{Avg}$ is based on the average distance $d_v = \frac{1}{K} \sum_{i=1}^{K} d_i$, where $K$ is the number of blocks inside $U$. We set $t_{Avg} = 1.2 \, d_v$. If $d_i < t_{Avg}$, then $U_i^{N \times N}$ has more in common with the background than a fictive average block of size $N \times N$ inside $U$, and $U_i^{N \times N}$ is labeled as background. Otherwise, if $d_i > t_{Avg}$, the block $U_i^{N \times N}$ is less similar to the background than the average block and is labeled as foreground. Based on the labeling with foreground
and background, a binary map similar to the one from
the CDMOD can be created. Based on the map, a
color histogram in hue describing the person’s appear-
ance is calculated. An example of a segmentation is
shown in Figure 4. The person is well segmented but
some parts of the object’s head and feet are labeled as
background. The reason is that the bounding box does
not surround the person totally, as the yellow bound-
ing box in Figure 4 shows. Only the inner parts of the
bounding box are taken into account by the above described algorithm; therefore, these parts of the object are missing.

Figure 4: Example of a segmentation of a person based on hue. Cutouts from frame 145 of sequence “terrace1-c0”. (a) Black regions: binary mask; yellow rectangle: visualization of the bounding box. (b) Resulting segmentation of the object.
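A condensed sketch of this refinement using OpenCV is given below. The values N = 4 and the factor 1.2 follow the text, whereas the histogram size of 30 bins and the use of cv2.compareHist for the Bhattacharyya distance are implementation assumptions.

```python
import cv2
import numpy as np

def refine_segmentation(frame_bgr, bbox, fg_mask, n=4, factor=1.2, bins=30):
    """Hue-based refinement of Section 3.3.1 (sketch).

    bbox: (x, y, w, h) from the PDOD; fg_mask: CDMOD binary map
    (nonzero = foreground). Returns a refined binary mask of the
    bounding box region; 'bins' is an assumed histogram size."""
    x, y, w, h = bbox
    hue = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[y:y + h, x:x + w, 0].copy()
    # Region B: pixels inside the box that the CDMOD declared background.
    b_mask = np.where(fg_mask[y:y + h, x:x + w] == 0, 255, 0).astype(np.uint8)

    def hue_hist(region, mask):
        hist = cv2.calcHist([region], [0], mask, [bins], [0, 180])
        return cv2.normalize(hist, hist).flatten()

    h_b = hue_hist(hue, b_mask)                     # background histogram h_B
    blocks, dists = [], []
    for by in range(0, h - n + 1, n):               # N x N blocks of region U
        for bx in range(0, w - n + 1, n):
            u_mask = 255 - b_mask[by:by + n, bx:bx + n]  # pixels belonging to U
            if not u_mask.any():
                continue                            # block lies entirely in B
            blk = np.ascontiguousarray(hue[by:by + n, bx:bx + n])
            d_i = cv2.compareHist(h_b, hue_hist(blk, u_mask),
                                  cv2.HISTCMP_BHATTACHARYYA)
            blocks.append((by, bx)); dists.append(d_i)
    t_avg = factor * (sum(dists) / max(len(dists), 1))  # t_Avg = 1.2 * d_v
    refined = np.zeros((h, w), np.uint8)
    for (by, bx), d_i in zip(blocks, dists):
        if d_i > t_avg:                             # dissimilar to background
            refined[by:by + n, bx:bx + n] = 1       # -> label as foreground
    return refined
```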
3.3.2 Matching of Objects
An object newly found by the PDOD is called a candidate. It can either be a new, unknown object or a person whose presence is already known. The decision whether the candidate is an already known person is based on two criteria calculated by the tracking algorithm: the amount of overlap of the bounding boxes and the matching of both histograms. The overlap $O$ of two bounding boxes is calculated according to

$O = \frac{A_n \cap A_k}{A_n \cup A_k}$ , (1)

where $A_n$ and $A_k$ are the areas of the rectangles defining the bounding boxes of the newly found and the known object, respectively. The matching of the histograms $p$ and $q$ of the known and the newly found object is done according to

$d = \sqrt{1 - \rho[p, q]}$ , (2)

where $\rho[p, q] = \sum_{u=1}^{m} \sqrt{p_u q_u}$ is the so-called Bhattacharyya coefficient (Kailath, 1967). (2) measures the distance between two color histograms and is bounded to $[0, 1]$. The smaller $d$, the higher the similarity between the two color histograms. (1) and (2) are calculated between a new object and every known object and are compared to two individually chosen thresholds $t_O$ and $t_H$:

$1 - O \leq t_O$ (3)

$d < t_H$ (4)

If the candidate and a known object fulfill (3) and (4), they are merged, since it is assumed that the detections represent the same person that was detected multiple times by the PDOD.
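A compact sketch of this merge test in Python (the threshold values t_o and t_h below are placeholders; the paper determines suitable values experimentally in Section 4.2):

```python
import numpy as np

def overlap(b1, b2):
    """O from (1): intersection over union of two boxes (x, y, w, h)."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union else 0.0

def bhattacharyya_distance(p, q):
    """d from (2) for histograms p and q normalized to sum 1."""
    return float(np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q)))))

def is_same_object(cand_box, cand_hist, known_box, known_hist,
                   t_o=0.5, t_h=0.3):
    """Merge rule (3) and (4); t_o and t_h are placeholder values."""
    return (1.0 - overlap(cand_box, known_box) <= t_o and
            bhattacharyya_distance(cand_hist, known_hist) < t_h)
```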
3.3.3 Using the Binary Mask for Tracking
As the mean shift algorithm may converge to false positions due to similarities between the background and a person’s appearance, the binary mask is used to force the mean shift algorithm to converge only within the defined foreground. The bounding box is shifted from the object’s old position to a new one that lies in the segmented foreground. Even if the background is similar to the object and convergence towards the background is likely, the bounding box thus stays near the actual object’s position.
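One way to realize this constraint with OpenCV's standard mean shift is to zero the back-projection probabilities outside the CDMOD foreground, as in the following sketch; the paper does not prescribe this particular implementation.

```python
import cv2

def masked_mean_shift(frame_bgr, window, hue_hist, fg_mask, iters=10):
    """Shift 'window' (x, y, w, h) via mean shift, restricted to the
    CDMOD foreground: back-projected hue probabilities are set to
    zero on background, so the window cannot settle there."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hue_hist, [0, 180], 1)
    backproj[fg_mask == 0] = 0          # suppress background positions
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, iters, 1)
    _, new_window = cv2.meanShift(backproj, window, criteria)
    return new_window
```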
3.3.4 The Bounding Box as Binary Mask
To further enhance the tracking results and to reduce
the amount of computations, the bounding box from
each known object is used in addition to the binary
mask. That means that as soon as an object is detected, its bounding box is also used as a binary mask in the following frames, defining an image region in which feature points are not searched. Since feature points cannot be detected in the bounding box’s region, the object is not detected a second time. This is useful, as it is not necessary to detect an already known object again.
4 EVALUATION
The framework has been evaluated with video se-
quences from a data set of CVLAB (Berclaz et al.,
2011). All video sequences have been encoded using
the H.264/AVC Baseline profile with a GOP size of
ten frames.
4.1 Measures for Video Analysis
As evaluation measures we used two different criteria. For the segmentation of objects we used precision and recall, defined as

$\text{precision} = \frac{TP}{TP + FP}$ (5)

$\text{recall} = \frac{TP}{TP + FN}$ (6)

where $TP$ (true positives) is the number of correct detections, $FN$ (false negatives) is the number of missed detections, and $FP$ (false positives) is the number of false detections.
Table 1: Results for object segmentation based on hue.

Sequence      Recall   Precision
campus4-c0    0.33     0.76
campus7-c1    0.51     0.79
terrace1-c0   0.40     0.77
terrace2-c1   0.25     0.66

Table 2: METE for using only the PDOD algorithm and for the combined CDMOD and PDOD algorithms.

Sequence      PDOD   CDMOD & PDOD
campus4-c0    0.71   0.76
campus7-c1    0.91   0.38
terrace1-c0   0.78   0.61
terrace2-c1   0.74   0.57

The outcome of our framework is not pixel-based; instead, each found object is described with a bounding box. It is not appropriate to reuse the measures recall and precision here, because a parameter defining the minimal amount of overlap required to count as a true positive would have to be chosen. Such a hard threshold decision makes it difficult to compare tracking results. The Multiple Extended-target Tracking Error (METE) described in (Nawaz et al., 2014) is independent of such parameters. Therefore, we chose METE to evaluate the performance of the framework. First,
the accuracy error $A_k$ and the cardinality error $C_k$ are calculated for each frame $k$. $A_k$ represents the accuracy error in frame $k$: $A_k = \sum_i A_k^{ij}$, where $A_k^{ij}$ is derived from the amount of overlap between the bounding box of a tracked object $i$ and the bounding box of the associated ground-truth object $j$. $C_k$ is the difference between the number of estimated objects $u_k$ and the number of ground-truth objects $v_k$. METE is calculated as

$\text{METE}_k = \frac{A_k + C_k}{\max(u_k, v_k)}$ (7)

$\text{METE}_k$ is bounded to $[0, 1]$, where zero is the best result.
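For illustration, a per-frame METE computation under the stated definitions might look as follows; the pairwise matching of tracks to ground-truth objects and the exact form of the accuracy terms follow (Nawaz et al., 2014) and are assumed to be given here.

```python
def mete_frame(accuracy_terms, num_estimated, num_ground_truth):
    """Per-frame METE according to (7).

    accuracy_terms: the accuracy error contributions A_k^{ij} for the
    matched track/ground-truth pairs of frame k (assumed precomputed
    as in (Nawaz et al., 2014)); num_estimated = u_k,
    num_ground_truth = v_k."""
    a_k = sum(accuracy_terms)                     # A_k
    c_k = abs(num_estimated - num_ground_truth)   # C_k
    denom = max(num_estimated, num_ground_truth)
    return (a_k + c_k) / denom if denom else 0.0  # empty frames: no error
```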
4.2 Evaluation Results
First, we evaluated the object segmentation based on the hue component (see Section 3.3.1). Table 1 shows the results of the segmentation.
The values for recall are quite low compared to the precision values. This is mostly because the bounding boxes from the PDOD do not surround the object’s contour completely but exclude some parts of the body, like the feet, the head, or even the complete upper part of the body. The excluded parts are not taken into account for the segmentation into foreground and background, which leads to low recall values. Moreover, some persons in the video sequences wear clothes that are partially similar to the background; these parts are often labeled as background, which lowers the recall values as well.
Figure 5: Error METE_k per frame k for sequence terrace1-c0 using the PDOD algorithm only.

Figure 6: Error METE_k per frame k for sequence terrace1-c0 using the concatenation of the CDMOD and PDOD algorithms.

For the evaluation of the framework we compared the
performance of the framework to the performance of
the algorithms when used separately. First we used
only the PDOD algorithm without the binary mask
from the CDMOD and the tracking algorithm. That
means every image from a sequence is searched for
objects without separating image regions into fore-
ground and background. The results for using only
the PDOD algorithm are shown exemplarily in Figure 5 for sequence terrace1-c0. METE is shown per frame, i.e., the detection results are compared to the ground truth of each frame. The error is influenced by missed detections and false detections. True detections whose bounding box overlap is not accurate enough also influence the error.
Especially at the beginning of the sequence the er-
ror METE equals 1. In this sequence there are no per-
sons which can be detected in the first 100 frames.
That means the error cannot result from missed de-
tections or detections with insufficient overlap but
from false-positive detections. This shows the influ-
ence of false-positive detections on a tracking system. In the following frames the error is influenced by false and missed detections and by insufficient overlap of bounding boxes. Figure 6 shows the results when
using the combination of CDMOD and PDOD algo-
rithm exemplarily for sequence terrace1-c0.
Table 3: Time of analysis for 25 frames [s] for PDOD only and for PDOD with CDMOD.

Sequence      PDOD   PDOD & CDMOD
campus4-c0    2.9    2.5
campus7-c1    2.9    1.5
terrace1-c0   3.6    3.1
terrace2-c1   4.1    3.7

Table 4: Time of analysis for 25 frames [s] for the framework.

Sequence      w/o extensions   with extensions
campus4-c0    1.1              1.0
campus7-c1    0.8              0.8
terrace1-c0   1.4              1.2
terrace2-c1   1.7              1.3

The comparison of Figures 5 and 6 shows that METE equals 0 at the beginning of the sequence for the first 100 frames. The CDMOD algorithm creates
a binary mask for each frame. As mentioned before,
there are no moving objects in the first 100 frames
of that sequence. Therefore, the binary mask defines
each frame of the first 100 frames as background and
the PDOD algorithm does not search for objects. This
shows the advantage of the CDMOD algorithm. It
separates frames without persons from frames with
persons and prevents false-positive detections which
would be tracked wrongly.
The averaged error METE is listed in Table 2. The second and third columns contain the results for the PDOD alone and for the combination of CDMOD and PDOD, respectively. As expected, the error decreases; only for the sequence campus4-c0 does it increase. This is because some persons do not move in some frames, so they are declared as background and cannot be found by the PDOD algorithm.
As described in Section 3.3.2, the matching of objects depends on two parameters: on the one hand the amount of overlap of the bounding boxes and on the other hand the similarity of the color histograms. In the evaluation we parameterized over two thresholds $t_o$ and $t_s$ to determine the best overall values for object merging. The two thresholds define the maximal values that the overlap measure $o_{ij}$ and the histogram similarity measure $s_{ij}$ of a newly detected object $i$ and a previously found object $j$ may reach. If $o_{ij} < t_o$ and $s_{ij} < t_s$, both objects are merged. Additionally, the PDOD algorithm analyzed only every 5th frame, because once an object is detected it is not necessary to detect it again in the immediately following frame; the CDMOD and the tracking algorithm still analyze each frame. Another benefit is the reduction of computation time.
The computation time for the analysis with the PDOD only and with PDOD and CDMOD is listed in Table 3. As expected, the computation time is reduced when using the combination of the CDMOD and PDOD algorithms compared to using only the PDOD algorithm. The CDMOD algorithm declares background regions in the images, which are then not searched for objects by the PDOD algorithm. This leads to the reduction of computation time. In sequence campus7-c1 only one person appears, in approximately 50% of the sequence, and the computation time is reduced by almost 50%. This shows that the CDMOD algorithm is very well suited to reduce the complexity. Table 4 shows the computation time when the analysis is done with the complete framework, that means the combination of CDMOD, PDOD and the tracking algorithm. The computation time is further reduced to approximately 1 second for 25 frames. This result shows that the main computational effort is caused by the PDOD algorithm, as the PDOD analyzes only every 5th frame in this configuration.
In Figure 7 the averaged METE is shown when
parameterizing both thresholds. METE is not shown
per frame but averaged for the sequence with fixed
parameters. In Figure 8 the averaged METE is shown
for using the extensions described in Section 3.3.3 and
Section 3.3.4.
Unfortunately, the error increases compared to the results in Table 2. The main reason is that, when using the PDOD algorithm only or combined with the CDMOD algorithm, there is no knowledge about previous true-positive or false-positive detections. On the contrary, when using the framework, each false-positive detection is tracked in the subsequent frames. That means that if an object is detected repeatedly but the matching was not successful, the object is tracked with more than one bounding box. Even if all bounding boxes follow the object correctly, the cardinality error increases, which results in a high METE. As one can see, a low error is reached for a low threshold $t_s$, whereas the threshold $t_o$ has to have a high value to achieve a low METE. That means the color histogram is more suitable for object merging than the overlap of the bounding boxes. Another influence on the error is the mutual occlusion of objects. The mean shift algorithm is not able to follow a hidden object; instead, it converges to false positions.
5 CONCLUSION
In this paper we presented a framework for the detec-
tion and tracking of objects. The framework consists
of three stages. For each stage an individual algo-
rithm is applied. The stages are concatenated in a
way that they exchange information about the pres-
ence and the position of objects. An algorithm ana-
lyzing the compressed video stream is used as a pre-
selection step to provide a binary mask, which seg-
ments the regions of an image into foreground and
background. We selected the Implicit Shape Model
as algorithm to actually find the position of objects.
A tracking algorithm using the mean shift algorithm
was established to track the detected objects. The novelties lie in the concatenation of algorithms analyzing the video sequence in the compressed domain and in the pixel domain, and in the method of object segmentation for obtaining a color histogram, as needed by the mean shift algorithm. The evaluation shows good results in object segmentation and tracking when using the new method, and it demonstrates that the complexity can be reduced significantly. Remaining challenges are multiple person tracking and the mutual occlusion of persons; these could be handled with prior knowledge, for example by evaluating individual trajectories.
ACKNOWLEDGEMENTS
The research leading to these results has received funding from EIT ICT Labs’ Action Line “Future Cloud” under activity no. 11882.
REFERENCES
Andriluka, M., Roth, S., and Schiele, B. (2008).
People-Tracking-by-Detection and People-Detection-
by-Tracking. In Proc. 2008 IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), pages 1–8.
Berclaz, J., Fleuret, F., Turetken, E., and Fua, P. (2011).
Multiple Object Tracking using K-Shortest Paths Op-
timization. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 33(9):1806–1819.
Comaniciu, D., Ramesh, V., and Meer, P. (2003). Kernel-
based Object Tracking. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 25(5):564–577.
Corrales, J., Gil, P., Candelas, F., and Torres, F. (2009).
Tracking based on Hue-Saturation Features with a
Miniaturized Active Vision System. In Proc. 40th Int.
Symposium on Robotics, pages 107–112.
Eiselein, V., Fradi, H., Keller, I., Sikora, T., and Dugelay, J.-
L. (2013). Enhancing Human Detection Using Crowd
Density Measures and an Adaptive Correction Filter.
In Proc. 2013 10th IEEE Int. Conf. on Advanced Video
and Signal Based Surveillance (AVSS), pages 19–24.
Evans, M., Osborne, C., and Ferryman, J. (2013). Mul-
ticamera Object Detection and Tracking with Object
Size Estimation. In Proc. 2013 10th IEEE Int. Conf.
on Advanced Video and Signal Based Surveillance
(AVSS), pages 177–182.
Kailath, T. (1967). The Divergence and Bhattacharyya Dis-
tance Measures in Signal Selection. IEEE Transac-
tions on Communication Technology, 15(1):52–60.
Laumer, M., Amon, P., Hutter, A., and Kaup, A. (2013).
Compressed Domain Moving Object Detection Based
on H.264/AVC Macroblock Types. In Proc. of the
International Conference on Computer Vision Theory
and Applications (VISAPP), pages 219–228.
Leibe, B., Leonardis, A., and Schiele, B. (2004). Combined
Object Categorization and Segmentation With an Im-
plicit Shape Model. In Proc. Workshop on Statisti-
cal Learning in Computer Vision (ECCV workshop),
pages 17–32.
Lowe, D. G. (2004). Distinctive Image Features from Scale-
Invariant Keypoints. International Journal of Com-
puter Vision, 60:91–110.
MPEG (2010). ISO/IEC 14496-10:2010 - Coding of Audio-
Visual Objects - Part 10: Advanced Video Coding.
Nawaz, T., Poiesi, F., and Cavallaro, A. (2014). Measures
of Effective Video Tracking. IEEE Transactions on
Image Processing, 23(1):376–388.
Poppe, C., De Bruyne, S., Paridaens, T., Lambert, P., and
Van de Walle, R. (2009). Moving Object Detec-
tion in the H.264/AVC Compressed Domain for Video
Surveillance Applications. Journal of Visual Commu-
nication and Image Representation, 20(6):428–437.
Senst, T., Eiselein, V., and Sikora, T. (2012). Robust Local
Optical Flow for Feature Tracking. IEEE Transac-
tions on Circuits and Systems for Video Technology,
22(9):1377–1387.
Yilmaz, A., Javed, O., and Shah, M. (2006). Object Track-
ing: A Survey. ACM Computing Surveys, 38(4).
HybridPersonDetectionandTrackinginH.264/AVCVideoStreams
485