Exploiting Scene Cues for Dropped Object Detection
Adolfo López-Méndez, Florent Monay and Jean-Marc Odobez
IDIAP Research Institute, Martigny, Switzerland
Keywords:
Dropped Object Detection, Human Detection, Background Subtraction, Geometry Constraints.
Abstract:
This paper presents a method for the automated detection of dropped objects in surveillance scenarios, which
is a very important task for abandoned object detection. Our method works in single views and exploits
prior information of the scene, such as geometry or the fact that a number of false alarms are caused by
known objects, such as humans. The proposed approach builds dropped object candidates by analyzing blobs
obtained with a multi-layer background subtraction approach. The created dropped object candidates are
then characterized both by appearance and by temporal aspects such as the estimated drop time. Next, we
incorporate prior knowledge about the possible sizes and positions of dropped objects through an efficient
filtering approach. Finally, the output of a human detector is exploited over time in order to filter out static objects
that are likely to be humans remaining still. Experimental results on the publicly available PETS2006 dataset
and on several long sequences recorded in metro stations show the effectiveness of the proposed approach.
Furthermore, our approach can operate in real-time.
1 INTRODUCTION
Automated detection of dropped or abandoned ob-
jects is one of the foremost concerns within automatic
video surveillance systems. Being able to robustly de-
tect suspicious items is a key issue in the protection of
public spaces such as airports or train stations.
A cornerstone of abandoned object detection is
the automated detection of static objects and, more
specifically, of dropped objects. Here dropped objects
are understood as items such as luggage, packets, etc.
that remain static in the scene for a given period of
time. Being able to automatically detect such objects
over time is important in order to draw the attention
of security operators or to identify object owners.
Several challenges are involved in dropped ob-
ject detection: lighting conditions, changing back-
grounds, object occlusions and false alarms caused
by non-suspicious static objects such as humans that
remain still. The latter is very common in typical
surveillance scenarios (Fan and Pankanti, 2012). This
problem can be alleviated by exploiting object detec-
tors for known elements in the scene (such as hu-
mans). Surprisingly, and despite recent advances, ex-
isting dropped and abandoned object detection meth-
ods aiming at realistic surveillance scenarios (Fan and
Pankanti, 2012)(Fan and Pankanti, 2011)(Caro Campos
et al., 2011)(Tian et al., 2011) do not attempt to use
state-of-the-art object detection methods (Dalal and
Triggs, 2005)(Felzenszwalb et al., 2010), which could
help in reducing false alarms caused by such known
objects.
In this paper, we propose a dropped object detec-
tion method that leverages on prior information about
the scenario. This information is mainly exploited
in two ways. Firstly, with an efficient implementa-
tion of geometric constraints, that allows filtering ob-
jects based on contextual cues such as feasible ground
plane positions and object sizes. Secondly, with a
method that reasons about how likely a static object
is to be generated by a human remaining still. For
the latter, we rely on the output of a state-of-the-art hu-
man detector (Dubout and Fleuret, 2012). Addition-
ally, the proposed dropped object detector character-
izes objects by their appearance and a set of temporal
variables, such as the set down time (estimated time of
dropping). This information is valuable for potential
abandonment analysis. Experiments conducted in dif-
ferent datasets show the effectiveness of the proposed
dropped object detection approach and the contribu-
tion of each one of the components of the system.
Furthermore, the proposed method runs in real-time
and has been deployed in real operational conditions
in two metro stations.
2 RELATED WORK
Substantial research effort has been recently de-
voted to providing robust solutions to dropped
and abandoned object detection (Fan and Pankanti, 2012)(Caro Campos et al., 2011)(Smith et al.,
2006)(Tian et al., 2011). Because dropped objects can
belong to different classes, foreground detection and
adaptive background modeling are central elements in
most of the existing approaches.
A recent approach by Fan and Pankanti (Fan and
Pankanti, 2011) models temporarily static objects with
a finite state machine, where the similarity between
objects and foreground matches is used as a cue to
update the background model. The same authors later
proposed a robust foreground detection approach (Fan
and Pankanti, 2012) based on classifying patches as
being foreground or background. Such a classifica-
tion involves training RBF-SVMs using several tex-
ture and region features. Liao et al. (Liao et al., 2008)
propose a foreground-mask sampling approach con-
sisting of the intersection of foreground masks within
a period of time. Campos et al. (Caro Campos et al.,
2011) rely on active contours obtained from foreground
masks in order to detect abandoned and stolen objects.
Automated tracking of people in the scene is an
important component of some approaches, especially
those that aim at finding owners of abandoned ob-
jects (Liao et al., 2008)(Smith et al., 2006). Inferring
people location in the scene might also be used to
filter false alarms caused by still people. However,
real surveillance scenes might be crowded, making
tracking unreliable. Human detection is an alterna-
tive to tracking, but despite recent advances in the task
of detecting humans in unconstrained scenes (Dalal
and Triggs, 2005)(Felzenszwalb et al., 2010), current
dropped and abandoned object detection approaches
usually do not rely on human detection. In some
methods, human detection is reduced to classifying
foreground blobs as humans or static objects based on
generic features such as blob size or skin color (Liao
et al., 2008). According to (Fan and Pankanti, 2012),
the main cause of false alarms is still persons be-
ing identified as static objects. This result emphasizes
the importance of using advanced methods in order
to detect known classes (e.g., people) in surveillance
scenes.
3 DROPPED OBJECT DETECTOR
Figure 1: Overview of the proposed method (best viewed in color). A static object blob (blue pixels) is used to define a dropped candidate, represented by the dropping or set down time $t^{sd}_i$, the vector $h_i$, a bounding box $b_i$ and a color patch $S_i$ (see details in Section 3.2). Incoming candidates undergo a geometric verification. If they fulfill the geometric constraints, candidates are matched against those objects in the pool that are not likely to be generated by a human at time $t$. In the example, the stored object is repeatedly matched, indicating that it is present in the scene. After $n_s$ seconds since the estimated set down time $t^{sd}_0$, the system triggers an alarm (red bounding box).

The proposed approach comprises several phases. Firstly, we rely on a multi-layer background subtraction algorithm (Yao and Odobez, 2007) to detect blobs that represent static objects. Our method uses these blobs to create dropped object candidates by gathering appearance and temporal aspects associated with the blobs. In this process, prior knowledge about the position and size of dropped objects is used to remove spurious candidates. Finally, because some detections are caused by still people in the scene, we integrate the output of a human detector over time to filter false alarms. An overview of the method is depicted in Figure 1.
3.1 Multi-layer Static Object Detection
Static objects are identified by relying on a multi-
layer background modeling method (Yao and Odobez,
2007), which is described in this section for the sake
of completeness. The motivation of the multi-layer
framework is that layers can be regarded as stacked
objects, where the bottom layer contains the domi-
nant background, and successive layers contain static
objects.
Multi-modal Pixel Representation. The approach
in (Yao and Odobez, 2007) is based on capturing the
appearance statistics of each pixel x in terms of RGB
color and Local Binary Patterns (LBP) (Heikkila and
Pietikainen, 2006). At each time instant $t$, these statistics are represented by a set of modes that we denote as $\mathcal{M}^t(x) = \{K^t_x, \{m^t_k(x), t^0_k(x)\}_{k=1,\dots,K^t_x}\}$, where $K^t_x$ is the number of modes, $m^t_k(x)$ represents the appearance statistics of one mode, and $t^0_k(x)$ is the time instant when the $k$-th mode was observed for the first time at pixel $x$. In the following, we drop $x$ for readability.
Each mode is associated with two other variables: a weight $w_k$ and a layer $L_k$. Each time a mode $m_k$ is observed in the target pixel, its weight $w_k$ grows; on the contrary, if it is not observed, $w_k$ decays. Hence, $w_k$ represents how likely $m_k$ is to belong to the background. The layer $L_k$ determines whether a mode is a reliable background mode or not, i.e., whether the mode has been observed for a sufficient amount of time. $L_k = 0$ indicates that $m_k$ is not a reliable background mode, whereas $L_k > 0$ indicates a reliable background mode. Clearly, an important aspect of the presented method is the speed at which $w_k$ grows and decays, since it determines the time elapsed until modes are set into reliable background layers. The rate of change is different depending on whether the weight is increased or decreased, and it is controlled by the mode weight learning rate $\alpha_w$ and a constant $\tau$ (Yao and Odobez, 2007). When a mode is matched, its weight increases according to the following expressions:
$$w^t_k = (1 - \alpha^i_w)\, w^{t-1}_k + \alpha^i_w, \qquad \hat{w}^t_k = \max(\hat{w}^{t-1}_k,\, w^t_k), \qquad \alpha^i_w = \alpha_w (1 + \tau\, w^{t-1}_k) \qquad (1)$$
where $\hat{w}_k$ is the maximum weight achieved by the $k$-th mode. The remaining modes at a given pixel, which are not matched, get their weight decreased as follows:

$$w^t_k = (1 - \alpha^d_w)\, w^{t-1}_k, \qquad \alpha^d_w = \frac{\alpha_w}{1 + \tau\, \hat{w}^{t-1}_k} \qquad (2)$$
To simplify the parameters of the proposed
dropped object detection method, we fix τ = 5. This
value was found to be convenient after several exper-
iments on background subtraction.
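To make the update rule concrete, the following is a minimal Python sketch of the per-mode weight update of Eqs. (1) and (2). The Mode container, the function name and the constants are our own illustration, not the implementation of (Yao and Odobez, 2007).

```python
# Minimal sketch of the mode weight update in Eqs. (1)-(2).
ALPHA_W = 0.05   # mode weight learning rate alpha_w (example value)
TAU = 5.0        # fixed constant tau (see text)

class Mode:
    def __init__(self):
        self.w = 0.0        # current weight w_k
        self.w_max = 0.0    # maximum weight achieved, \hat{w}_k
        self.layer = 0      # 0 = not a reliable background mode

def update_weights(modes, matched_idx):
    """Increase the weight of the matched mode, decay all the others."""
    for k, m in enumerate(modes):
        if k == matched_idx:
            a_inc = ALPHA_W * (1.0 + TAU * m.w)       # alpha^i_w
            m.w = (1.0 - a_inc) * m.w + a_inc         # Eq. (1)
            m.w_max = max(m.w_max, m.w)
        else:
            a_dec = ALPHA_W / (1.0 + TAU * m.w_max)   # alpha^d_w
            m.w = (1.0 - a_dec) * m.w                 # Eq. (2)
```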
Layer Management. Layer management consists in assigning a layer to each mode based on their respective weights. The goal is to represent the scene as a set of layers containing static objects, where the most static object, the true background, is represented by layer $L_k = 1$ (see Fig. 2(b)). To this end, layer addition and layer removal mechanisms are proposed.

Layer addition takes place when a pixel mode $m_{\tilde{k}}$ has been observed for enough time. Equivalently, a mode is promoted to a layer mode when its weight is larger than a threshold $T_{bw}$. Mode $m_{\tilde{k}}$ is then assigned the first available layer.
Figure 2: Multi-Layer Static Object Detection (best viewed in color). (a) Original image. (b) Layer 1 (true background). (c) Generated static object blob from layers $L_k > 1$ (blue pixels).
Analogously to layer addition, modes $m_r$ that have not been observed for a sufficient amount of time must be removed from the background layers. That is, when the weight of a mode $m_r$ drops below a threshold $T_{br}$, it indicates that the mode disappeared or that it has been occluded for a long time. The mode is then set as a non-layer mode.
The described removal procedure might be slow for static object analysis. Consider the case in which an object is left in the scene and, after some time, is promoted to layer 2 (layer 1 is assigned to true background modes). If the object is removed from the scene, the described removal approach will need a long time to remove the pixel modes belonging to the object from layer 2. This is clearly inefficient, since background modes in layer 1 will be observed again, indicating that the object was removed (we assume that bottom layers cannot get in front of top layers). Based on this observation, a second removal mechanism is defined.

When a mode $m_a$ in layer $L_a$ re-appears and is observed for a sufficient proportion of time over a given period (its weight increases), the background layers on top of $L_a$ are removed. The ultimate criterion to decide when such removal takes place is the percentage $T_{ba} \in (0, 1)$ of the total weight held by $m^t_a$ (see Algorithm 1).
Algorithm 1: Removal due to $m_a$ re-appearing.

if $m^t_a$ re-appears and $\dfrac{w^t_a}{\sum_{k=1}^{K^t} w^t_k} > T_{ba}$ then
    for $k = 1 \dots K^t$ do
        if $L^t_k > L^t_a$ then remove the mode $m^t_k$
    end
end
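A compact sketch of Algorithm 1 follows, reusing the hypothetical Mode container from the previous sketch; interpreting "remove" as demoting the mode from its layer is our assumption, not a detail stated in the text.

```python
T_BA = 0.6  # threshold T_ba on the fraction of total weight (value used in the experiments)

def remove_layers_above(modes, a):
    """Algorithm 1: if the re-appearing mode m_a holds a large enough share of
    the total pixel weight, remove the background layers stacked on top of it."""
    total_w = sum(m.w for m in modes)
    if total_w > 0 and modes[a].w / total_w > T_BA:
        for m in modes:
            if m.layer > modes[a].layer:
                m.layer = 0   # demote to a non-layer mode (interpreted as removal)
                m.w = 0.0
```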
Layer Post-processing. All layers above the first,
which is considered to be the true background, are
regarded as candidates for static objects. We first
gather the content of these layers into a single image
whose non-zero pixels represent modes that are likely
to belong to static objects (see Fig. 2c). To remove
noise and obtain blobs from the objects modeled in
this image, we apply a Gaussian smoothing (in our
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
16
experiments we use a kernel of 11x11 and σ = 3.5).
We subsequently apply a morphological opening with a square 5x5 structuring element to further remove artifacts.
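For illustration, this post-processing step can be reproduced with standard OpenCV calls; the thresholding of the smoothed image into a binary mask before the opening is our own assumption about how the blobs are extracted.

```python
import cv2
import numpy as np

def postprocess_static_layers(static_img):
    """Smooth the static-layer image and clean it with a morphological
    opening before extracting blobs (kernel sizes taken from the text)."""
    blurred = cv2.GaussianBlur(static_img, (11, 11), sigmaX=3.5)
    mask = (blurred > 0).astype(np.uint8) * 255          # assumed binarization
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # connected components give the candidate static-object blobs
    n, labels, stats, _ = cv2.connectedComponentsWithStats(opened)
    return [stats[i] for i in range(1, n)]  # skip background label 0
```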
3.2 Dropped Object Management
The output of the multi-layer static object detection
block is a set of blobs on a per-frame basis. Blobs
being detected as static objects are further processed
in order to decide whether they might be dropped ob-
jects. In the following, we describe the method we
propose in order to keep track of potentially dropped
objects.
Let a dropped object candidate be defined using a static object blob. Specifically, a candidate $i$ is represented by the tuple $\{t^{sd}_i, h_i, b_i, S_i\}$ (see Fig. 1). The parameter $t^{sd}_i$ is the set down time and represents the time instant when the object was dropped. The set down time is estimated upon candidate creation (see details on the estimation below) and is an important feature of our approach: it is used to trigger alarms, but it also allows users to find the moment when the object was dropped, which is important for potential abandonment analysis. The vector $h_i$ accumulates evidence of the dropped object along time. This evidence is ultimately used to trigger alarms for dropped objects. Specifically, for a given time instant $t$, $h_i(t)$ is non-zero if the object was observed at time $t$. This history only contains the time instants between the object creation and its deletion. Finally, the appearance of the candidate is represented by the bounding box coordinates $b_i$ that enclose the blob and the RGB patch $S_i$ inside that bounding box.
Set Down Time Estimation. An important aspect of the proposed dropped object detection is the set down time. This variable is estimated using the pixel creation time described in Section 3.1. At a given time instant $t$, each pixel $x$ in a static object blob has an associated set of $K^t_x$ modes with their respective creation times $t^0_k(x) \in \mathbb{N}$.

A mode creation time set is defined by collecting the pixel mode creation times of the most recent reliable background modes (excluding those in layer 1):

$$\mathcal{M} = \{t^0_k(x) \mid L^t_k > 1,\; L^t_k = \max\{L^t_{\tilde{k}}\}_{\tilde{k}=1 \dots K^t_x}\} \qquad (3)$$
Based on the collected set of times, our goal is to estimate the set down time of the object represented by the blob. The likelihood of $t^{sd}_i$ can be derived by non-parametric density estimation from the observed creation times:

$$l(\mathcal{M}) = \sum_{\psi} \phi(t^0_\psi)\, \delta(t - t^0_\psi) \qquad (4)$$

where $t^0_\psi$ are the unique creation times within $\mathcal{M}$ and $\phi(t^0_\psi)$ represents the counts for pixel creation time $t^0_\psi$. Since $\mathcal{M}$ is created from pixel-wise measurements, the distribution of creation times in $\mathcal{M}$ is, in general, multimodal.
Fortunately, we can use prior information given by the time elapsed from the drop until the object is detected by our approach (i.e., until it generates a blob). This time, which we denote as $t^{ud}$, is common for every detected object, and it depends on the mode weight learning rate $\alpha_w$ and on the constant $\tau$ (see Section 3.1). Hence, we can experimentally learn a regressor $t^{ud}(\alpha_w, \tau, T_{bw}, T_{br}, T_{ba})$. In practice, after background subtraction and dropped object detection experiments on validation sets, we found it convenient to set $\tau = 5$, $T_{bw} = 0.9$, $T_{br} = 0.0001$, $T_{ba} = 0.6$ (see Section 3.1). This simplifies the regressor to $t^{ud}(\alpha_w)$.
To generate the data for learning $t^{ud}(\alpha_w)$, we run experiments with synthetic objects, and we measure the time elapsed since the object is created until it is detected as a static blob. Specifically, 10 equispaced values of $\alpha_w$ in the interval $[0.01, 0.1]$ are tested. Then, a nonlinear exponential regressor of the form $t^{ud}(\alpha_w) = \frac{A}{\alpha_w} \exp(\lambda \alpha_w) + C$ is fitted to the data.
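Assuming the synthetic measurements $(\alpha_w, t^{ud})$ are available, the regressor can be fitted with a standard least-squares routine; this is an illustrative sketch, not the authors' code, and the initial guess is arbitrary.

```python
import numpy as np
from scipy.optimize import curve_fit

def t_ud_model(alpha_w, A, lam, C):
    """t_ud(alpha_w) = (A / alpha_w) * exp(lam * alpha_w) + C."""
    return (A / alpha_w) * np.exp(lam * alpha_w) + C

def fit_t_ud(alpha_values, t_ud_measured):
    """alpha_values: the 10 equispaced learning rates in [0.01, 0.1];
    t_ud_measured: time (s) until each synthetic object became a static blob."""
    params, _ = curve_fit(t_ud_model, alpha_values, t_ud_measured,
                          p0=(1.0, -1.0, 0.0), maxfev=10000)
    return params  # (A, lambda, C)
```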
Matching Dropped Object Candidates. Let us denote by $\mathcal{D}$ the pool of potential dropped objects. Since the pool $\mathcal{D}$ is initially empty, the first candidate is directly stored in the pool.

Subsequent candidates are compared against the objects in $\mathcal{D}$. In order to determine if candidate $j$ matches an object $i$ contained in $\mathcal{D}$, we define two tests. The first one is the overlap between bounding boxes:

$$f_{ij} = \frac{2\, \mathrm{area}(b_i \cap b_j)}{\mathrm{area}(b_i) + \mathrm{area}(b_j)} \qquad (5)$$
the second is the averaged $L_2$ norm between the color patches of both objects in the overlapping area:

$$d_{ij} = \frac{\| S_i(b_i \cap b_j) - S_j(b_i \cap b_j) \|_2}{\mathrm{area}(b_i \cap b_j)} \qquad (6)$$
where patches have pixel values between 0 and 1.
We say that an object $i$ is matched with candidate $j$ if $f_{ij} > T_f$ and $d_{ij} < T_d$.

If a match occurs at time $t$, $h_i(t)$ is set to 1; otherwise it is set to 0. Similarly, a candidate $j$ which is not matched against any object in $\mathcal{D}$ is added to the pool as a new object.
The appearance model $\{S_i, b_i\}$ of a stored object being matched is updated by incorporating $\{S_j, b_j\}$. This is done by first updating the bounding box as:

$$b_i = \alpha\, b_i + (1 - \alpha)\, b_j \qquad (7)$$

where $\alpha \in (0, 1)$ is a learning rate. In practice, we set $\alpha = 0.9$. Then, we obtain $S_i$ as the patch inside the updated $b_i$.
Objects in $\mathcal{D}$ that have not been matched during the last few minutes are removed.
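The matching tests of Eqs. (5)-(6) and the update of Eq. (7) can be sketched as follows, using the hypothetical DroppedCandidate record from above. For simplicity the sketch crops the overlap region from full frames rather than from the stored patches, and all names and thresholds are ours (thresholds taken from the experiments section).

```python
import numpy as np

T_F, T_D, ALPHA_BOX = 0.3, 0.02, 0.9   # T_f, T_d and the bbox learning rate alpha

def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(bi, bj):
    x1, y1 = max(bi[0], bj[0]), max(bi[1], bj[1])
    x2, y2 = min(bi[2], bj[2]), min(bi[3], bj[3])
    return (x1, y1, x2, y2) if x2 > x1 and y2 > y1 else None

def matches(obj, cand, frame_obj, frame_cand):
    """Eq. (5)-(6): bounding box overlap and averaged color distance on the overlap."""
    inter = intersection(obj.bbox, cand.bbox)
    if inter is None:
        return False
    f_ij = 2.0 * area(inter) / (area(obj.bbox) + area(cand.bbox))
    x1, y1, x2, y2 = map(int, inter)
    diff = frame_obj[y1:y2, x1:x2].astype(float) - frame_cand[y1:y2, x1:x2].astype(float)
    d_ij = np.linalg.norm(diff) / area(inter)     # pixel values assumed in [0, 1]
    return f_ij > T_F and d_ij < T_D

def update_appearance(obj, cand, frame_cand):
    """Eq. (7): exponential update of the bounding box, then re-crop the patch S_i."""
    obj.bbox = ALPHA_BOX * np.asarray(obj.bbox, float) + (1 - ALPHA_BOX) * np.asarray(cand.bbox, float)
    x1, y1, x2, y2 = obj.bbox.astype(int)
    obj.patch = frame_cand[y1:y2, x1:x2].copy()
```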
3.3 Geometric Filtering
If camera calibration is available, one can take advan-
tage of the prior knowledge about dropped object ex-
pected size and ground plane position in the scene.
Our method incorporates such prior information in a
look-up-table (LUT). At test time, the LUT allows us to efficiently discard dropped object candidates that do not fulfill the implemented geometric constraints, i.e., such candidates are neither stored nor matched against the candidates in the pool $\mathcal{D}$.
The LUT is computed offline or during the algorithm initialization, and it requires camera calibration parameters and a binary mask denoting the possible dropped object locations, i.e., the floor (see Fig. 3); if the mask is not provided, the whole image plane is considered valid.
We start by densely sampling the binary mask where
pixel values are different than zero. This procedure
gives a list of possible ground plane positions where
objects can be dropped. In each of these positions
we generate a 3D object of minimum and maximum
allowed sizes. Such objects are projected onto the im-
age plane and modeled by their enclosing bounding
boxes (see Fig. 3). These bounding boxes model the
minimum and maximum apparent size of dropped ob-
jects. The set of bounding boxes after dense sampling
in the image plane is stored in a LUT.
Figure 3: Example of the proposed geometric constraints. Left: scene. Middle: binary mask with valid ground plane positions. Right: sampled ground plane positions (red dots) and valid objects (green bounding boxes). The picture depicts a sparse version of the actual constraints.
Figure 4: Example of still human filtering. (a) Detected static objects causing a dropped object alarm. (b) Human detections in frames where static objects triggered alarms are used to filter out false alarms. (c) Dropped object alarms after filtering.

At test time, we efficiently retrieve the minimum and maximum allowed sizes by looking up the closest pixel position with associated minimum and maximum bounding boxes. A dropped object candidate is considered valid if its associated bounding box $b_j$ falls inside the area between the minimum and maximum retrieved bounding boxes. If the object lies on a non-valid position, no pair of valid bounding boxes exists such that the constraints are fulfilled. The described procedure is very efficient, since bounding box retrieval is fast thanks to the LUT and checking the constraints amounts to comparing bounding boxes.
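A possible sketch of the LUT construction and of the validity test follows. The project_box helper (which projects a 3D box of a given size placed at an image floor position to an image bounding box using the calibration), the sampling step and the size bounds are hypothetical, and the exact form of the "fits between min and max" test is our reading of the constraint.

```python
import numpy as np

def build_geometry_lut(floor_mask, project_box, min_size=0.2, max_size=1.2, step=4):
    """For each sampled valid floor pixel, precompute the image bounding boxes of
    the smallest and largest allowed 3D objects placed at that ground position."""
    lut = []
    ys, xs = np.nonzero(floor_mask)
    for x, y in zip(xs[::step], ys[::step]):
        lut.append(((x, y), project_box(x, y, min_size), project_box(x, y, max_size)))
    return lut

def candidate_is_valid(lut, bbox):
    """Look up the sampled position closest to the candidate's bottom-centre point
    and check that bbox lies between the retrieved minimum and maximum boxes."""
    if not lut:
        return True
    foot = np.array([(bbox[0] + bbox[2]) / 2.0, bbox[3]])
    positions = np.array([p for p, _, _ in lut], dtype=float)
    _, b_min, b_max = lut[int(np.argmin(np.linalg.norm(positions - foot, axis=1)))]
    fits_in_max = (bbox[0] >= b_max[0] and bbox[1] >= b_max[1] and
                   bbox[2] <= b_max[2] and bbox[3] <= b_max[3])
    covers_min = ((bbox[2] - bbox[0]) >= (b_min[2] - b_min[0]) and
                  (bbox[3] - bbox[1]) >= (b_min[3] - b_min[1]))
    return fits_in_max and covers_min
```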
3.4 Filtering Static Humans
A particular issue in most abandoned object detection
scenarios is that one typically has more information
about uninteresting classes (mostly humans and/or
cars) than the target class (abandoned objects can be-
long to different classes and their appearance might
strongly differ). We leverage this prior information to filter spurious dropped object detections that are in fact caused by people standing still. To this end, we rely on human detection with deformable part
models (Felzenszwalb et al., 2010) and more specifi-
cally, on the fast implementation proposed in (Dubout
and Fleuret, 2012). These methods obtain state-of-
the-art performance in detecting humans in several
surveillance scenes.
As described in Section 3.2, the vector $h_i$ denotes whether the $i$-th object has been observed in the scene during specific time instants. Our goal is to determine whether an observation of the $i$-th object is in fact generated by a person standing still, in order to be able to avoid alarms. This is achieved by computing the intersection over union (IOU) between $b_i$ and each one of the bounding boxes generated by the human detector at time $t$. If the maximum IOU is above a threshold $T_h$, we consider that the $i$-th object is not matched and $h_i(t)$ is set to 0. That is, even if there exists a candidate $j$ that repeatedly matches the object $i$ in $\mathcal{D}$, the proposed human filtering strategy avoids accumulating dropped object evidence.
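The still-human filter thus reduces to an IOU test between the object's bounding box and the person detections in the current frame; a minimal sketch is given below, again reusing the hypothetical DroppedCandidate record (the value of $T_h$ is the one reported in the experiments section).

```python
T_H = 0.06  # IOU threshold T_h used in the experiments

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def filter_still_human(obj, human_boxes, t):
    """If the object overlaps a person detection enough, cancel its evidence at t."""
    if human_boxes and max(iou(obj.bbox, hb) for hb in human_boxes) > T_H:
        obj.history[t] = 0
```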
3.5 Dropped Object Alarms
Dropped objects that stay in the scene for a given number of seconds $n_s$ must generate alarms. Consequently, if for the $i$-th object there is enough evidence $n_s$ seconds after the set down time, the proposed system triggers an alarm. Denoting by $t$ the current time instant and by $\Delta t$ a given time interval, we formulate the mentioned conditions as follows:

$$t - t^{sd}_i > n_s, \qquad \frac{1}{|\Delta t|} \sum_{n=t-\Delta t+1}^{t} h_i(n) > \theta$$

where $\theta \in (0, 1)$ is the detection score.
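The alarm rule can be sketched as follows, assuming per-second evidence stored in the candidate's history as in the earlier sketches; the helper name and the integer time grid are our assumptions.

```python
def should_alarm(obj, t, n_s, delta_t, theta):
    """Trigger an alarm if the object has existed longer than n_s seconds and was
    observed in a sufficient fraction of the last delta_t seconds."""
    if t - obj.t_sd <= n_s:
        return False
    window = range(t - delta_t + 1, t + 1)
    evidence = sum(obj.history.get(n, 0) for n in window) / float(delta_t)
    return evidence > theta
```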
4 EXPERIMENTAL RESULTS
We conduct several experiments in order to evalu-
ate the effectiveness of our approach. We use differ-
ent datasets: the publicly available PETS2006 dataset
(PETS 2006) and two datasets recorded in Torino
and Paris metro stations. The use of the latter datasets
is motivated by the fact that the proposed approach is
deployed in those scenarios. Compared to PETS2006,
these datasets capture more realistic conditions.
Hereinafter, we use ML-DOD to denote our base-
line multi-layer dropped object detection approach.
ML-DOD+Geometry denotes the use of geometry
whereas ML-DOD-HD denotes the baseline plus hu-
man filtering.
In all the experiments, we run the algorithms at 1 fps. This is convenient since we do not need more redundancy and because, for all the selected datasets, our C++ implementation on a desktop PC with an Intel i7 at 3.2 GHz needs less than 1 second to process a frame. In our implementation we use several constant parameters: $T_{bw} = 0.9$, $T_{br} = 0.0001$, $T_{ba} = 0.6$, $\tau = 5$, $T_f = 0.3$, $T_d = 0.02$, $T_h = 0.06$. We employed several validation datasets in order to set these parameters. This yields a dropped object detector depending on three variable parameters: the number of seconds $n_s$ after which alarms should be triggered, the weight learning rate $\alpha_w$, which can be interpreted as the speed at which objects are incorporated into static background layers, and the detection score $\theta$. From a user viewpoint, this reduced set of variable parameters makes the algorithm easier to understand.
For the human detector (Dubout and Fleuret, 2012) we use a person model trained on the INRIA person dataset (INRIA) and we set a detection threshold of -0.1.
Finally, for the geometry constraints, we employ
scenario specific masks and we tune the object sizes
according to expected values (in all the experiments
we allow objects to have minimum width and height
of 20 cm, while we conveniently vary the maximum).
4.1 PETS2006
Figure 5: Examples of detected dropped objects in PETS2006 (best viewed in color).

The PETS2006 dataset (PETS 2006) contains 7 sequences for the evaluation of abandoned luggage detection. All 7 sequences have been recorded with 4
calibrated 720x576 cameras, at 25 fps. In our ex-
periments we use view 3, since it is the most com-
monly used by state-of-the-art approaches. Overall,
this dataset contains approximately 15 min of video
data for evaluation.
In this work, we focus on dropped luggage detec-
tion, understood as the task of finding luggage be-
ing static in the scene for more than 30 seconds. In
the PETS2006 dataset, there are a total of 6 sequences
with objects fulfilling the dropped object criteria, and
they are the same sequences annotated with abandon-
ment alarms.
To determine whether we correctly detect dropped objects, we check whether a spatio-temporal overlap exists between the alarms triggered by our approach and the ground truth.
We run the proposed ML-DOD with all the possible configurations (ML-DOD, ML-DOD+Geometry, ML-DOD-HD, ML-DOD-HD+Geometry). Additionally, we experiment with $\alpha_w \in \{0.05, 0.1\}$, $n_s = 30$ seconds and thresholds $\theta \in [0, 1]$. In this dataset we allow objects of large sizes (maximum width and height of 2 m) due to the presence of big luggage (skis). Example dropped object detections are shown in Fig. 5.
In our experiments only ML-DOD-HD+Geometry is able to attain 100% precision and recall in this dataset for both values of $\alpha_w \in \{0.05, 0.1\}$. For $\alpha_w = 0.05$, when geometric constraints or human filtering are not used, the algorithm fires false alarms on wrong objects in the background or on humans. This is summarized in Table 1, where the best performance for $\alpha_w = 0.05$ is shown.
Additionally, we manually annotate the dropping times in order to evaluate the accuracy of the estimated set down time $t^{sd}_i$ for the alarmed objects. The proposed method has a $t^{sd}_i$ estimation error of 0.5 ± 0.7 seconds, with a position estimation error of 28.6 ± 24.2 cm. Hence, even if the proposed method does not target abandonment analysis, the set down time feature allows the time instant when the object was dropped to be accurately located without having to analyze past frames.
We compare the proposed method to exist-
ing approaches reporting dropped object results in
PETS2006 dataset (see Table 1). The approach by
ExploitingSceneCuesforDroppedObjectDetection
19
Table 1: Comparative results for dropped object detection in PETS2006 ($\alpha_w = 0.05$).
Method Precision Recall
(Smith et al., 2006) 0.83 1
(Tian et al., 2011) 0.88 1
ML-DOD 0.67 1
ML-DOD + Geometry 0.86 1
ML-DOD-HD 0.75 1
ML-DOD-HD +Geometry 1 1
Figure 6: Examples of successfully detected dropped objects in metro station sequences (best viewed in color). Top: Torino metro station. Bottom: Paris metro station.
(Smith et al., 2006) detects one bag that is not dropped
for a sufficient amount of time and some objects in the
background, while (Tian et al., 2011) detects a person
as dropped luggage. ML-DOD-HD+Geometry out-
performs these approaches by using prior knowledge
about object positioning and the human filtering strat-
egy.
4.2 Torino Metro Station
The Torino metro station dataset consists of 2 se-
quences of 45 min. each. Videos are recorded with
one single camera at 5 fps. The image resolution is
704x288.
We manually annotated the ground-truth of
dropped object events for evaluation. Similarly to
the PETS2006 dataset, the ground-truth events cor-
respond to objects that are dropped for more than
30 seconds. We annotate the bounding boxes and
dropping time of each object. In total, we annotated
24 drops fulfilling the 30 seconds criterion. This is
a challenging dataset, since the station is crowded,
some dropped objects are left for slightly more than
30 seconds and these objects often get occluded by people.
In order to evaluate the accuracy of the proposed
methods, we compute spatio-temporal overlaps be-
tween predicted and ground truth dropped objects.
We run the proposed ML-DOD with all the possible configurations and different ranges of parameters. Specifically, $\alpha_w \in [0.05, 0.1]$, $n_s = 20$ seconds and thresholds $\theta \in [0, 1]$. We choose $n_s = 20$ since several dropped objects stay static for only slightly more than 30 seconds. Since objects often get occluded, our approach requires an additional time gap to properly trigger dropped object alarms.

Table 2: Dropped object detection results in the Torino metro dataset for $n_s = 20$ seconds.

Method                    Video 1        Video 2
                          Prec.  Rec.    Prec.  Rec.
ML-DOD                    0.11   1       0.15   0.83
ML-DOD + Geometry         0.23   1       0.24   0.83
ML-DOD-HD                 0.14   1       0.19   0.83
ML-DOD-HD + Geometry      0.33   1       0.36   0.83
Similarly to PETS, we allow objects of rather large size (maximum width and height of 1.2 m). For a security operator, the target is to detect the majority of dropped objects. Accordingly, we report the maximum precision at the best achieved recall obtained by each variant of the proposed approach (see Table 2).
Results in this dataset show that despite the chal-
lenging evaluation conditions, our approach is able
to detect the majority of the dropped objects (exam-
ples are shown in Fig. 6). Precision values reveal
that adding geometric constraints yields larger accu-
racy increase compared to human filtering alone. By
varying the sizes of objects we found that the most de-
termining aspect is the ground plane position. With-
out these constraints, the proposed method can trigger
alarms in almost every position of the image. Most
interestingly, accuracy is further increased when geo-
metric constraints and human filtering are combined,
which suggests that these cues are complementary.
For the correctly detected dropped objects, the av-
erage set down time estimation error is 4.2 ± 3.3 sec-
onds. Considering that some objects have occlusions
during the dropping instant, this is an accurate esti-
mate.
4.3 Paris Metro Dataset
The Paris metro station dataset consists of 2 se-
quences recorded at the footbridge and one of the
platforms of the station respectively. In total, both
recordings consist of 2 hours of video data. Videos
are recorded with one single camera at 25 fps. The
image resolutions are 640x480 and 800x600 respec-
tively.
We follow the same annotation and evaluation
methodology as in the Torino dataset experiments
(see Section 4.2), with the difference that in this
dataset we consider objects that remain static for a
minimum of 2 minutes, which renders more realistic evaluation conditions (more similar to (Fan and Pankanti, 2011)(Fan and Pankanti, 2012)).
Table 3: Dropped object detection results in the Paris metro dataset for $n_s = 90$ seconds.

Method                    Footbridge     Platform
                          Prec.  Rec.    Prec.  Rec.
ML-DOD                    0.43   1       0.03   1
ML-DOD + Geometry         1      1       0.2    1
ML-DOD-HD                 0.43   1       0.04   1
ML-DOD-HD + Geometry      1      1       0.5    1
A total of 4 objects are annotated across both videos (3 objects in the first video and 1 in the second).
ML-DOD is run with all the possible configurations and $\alpha_w \in [0.02, 0.05]$, $n_s = 90$ seconds (similarly to Torino, some objects are removed just after 2 minutes, thus requiring extra time for triggering alarms robustly). Since larger $n_s$ are targeted, lower $\alpha_w$ values are chosen. Regarding the geometric constraints, we allow objects of maximum width and height of 80 cm. As in the Torino experiments, we report the best precision at maximum recall (see Table 3).
This dataset serves to evaluate the ability of the
proposed approach to deal with scenes where people
normally stay still, such as the metro platform (see
Fig. 6). Both the geometry constraints and the filter-
ing of humans are very important in order to success-
fully deal with these cases. Clearly, in an area where
people is transiting, human filtering does not help in
increasing the precision (see Table 3). On the con-
trary, it gives a performance boost on the platform.
While the proposed approach is generally very ac-
curate in estimating the set down time, it predicts one
of the dropping times almost 1 minute in advance,
hence yielding an average error of 11.5 ± 9.9 seconds. Such
a failure is caused by mode creation times generated
by the owner (who remains in the same position be-
fore dropping the object). The average set down time
error for the remaining cases is 1.4 ± 1.2 seconds.
5 CONCLUSIONS
We have presented an automatic approach for detecting dropped objects in surveillance scenarios. Our method leverages multiple elements to address this problem. First, objects are characterized by their appearance and by temporal statistics that allow the dropping time to be accurately retrieved. Secondly, an efficient implementation of position and size constraints removes a number of false alarms. Finally, the proposed approach leverages state-of-the-art detectors to filter spurious objects produced by known classes such as still humans. Experimental results conducted on several datasets show the effectiveness of our approach, which, in addition, runs in real-time.
ACKNOWLEDGEMENTS
This work was supported by the Integrated Project
VANAHEIM (248907) of the European Union under
the 7th framework program.
REFERENCES

Caro Campos, L., SanMiguel, J., and Martinez, J. (2011). Discrimination of abandoned and stolen object based on active contours. In AVSS 2011, pages 101–106.

Comaniciu, D. and Meer, P. (2002). Mean shift: a robust approach toward feature space analysis. TPAMI, 24(5):603–619.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR 2005, volume 1, pages 886–893.

Dubout, C. and Fleuret, F. (2012). Exact acceleration of linear object detectors. In ECCV 2012, pages 301–311, Berlin, Heidelberg. Springer-Verlag.

Fan, Q. and Pankanti, S. (2011). Modeling of temporarily static objects for robust abandoned object detection in urban surveillance. In AVSS 2011, pages 36–41.

Fan, Q. and Pankanti, S. (2012). Robust foreground and abandonment analysis for large-scale abandoned object detection in complex surveillance videos. In AVSS 2012, pages 58–63.

Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645.

Heikkila, M. and Pietikainen, M. (2006). A texture-based method for modeling the background and detecting moving objects. TPAMI, 28(4):657–662.

INRIA. http://pascal.inrialpes.fr/data/human/.

Liao, H.-H., Chang, J.-Y., and Chen, L.-G. (2008). A localized approach to abandoned luggage detection with foreground-mask sampling. In AVSS 2008, pages 132–139.

PETS 2006. http://www.cvg.rdg.ac.uk/PETS2006/data.html.

Smith, K. C., Quelhas, P., and Gatica-Perez, D. (2006). Detecting abandoned luggage items in a public space. In IEEE PETS.

Tian, Y., Feris, R., Liu, H., Hampapur, A., and Sun, M.-T. (2011). Robust detection of abandoned and removed objects in complex surveillance videos. TSMC-C, 41(5):565–576.

Yao, J. and Odobez, J.-M. (2007). Multi-layer background subtraction based on color and texture. In CVPR 2007, pages 1–8.
ExploitingSceneCuesforDroppedObjectDetection
21