object recognition, which often focus on two aspects
of the problem: extracting features from images, and
classifying these features.
Even though Computer Vision researchers have achieved impressive results on object detection in recent years (Lowe, 2004; Viola and Jones, 2004), it is still an open research field. Many factors, such as changes in viewpoint and scale, illumination, partial occlusions, and multiple instances, further complicate the problem of object detection (Uijlings et al., 2013; Felzenszwalb et al., 2010; Vedaldi et al., 2009; Lopez et al., 2012). Attentional frameworks have been pro-
posed to speed up the visual search (Bonaiuto and Itti,
2005) without exploiting top-down knowledge about
the target. The VOCUS model of (Frintrop, 2006) uses both a bottom-up and a top-down version of the saliency map: the bottom-up map is similar to that of Itti and Koch, while the top-down map is a tuned version of the bottom-up one; the total saliency map is a linear combination of the two maps with user-provided weights. The authors of (Oliva et al.,
2003) show that top-down information extracted from
the context of the scene can modulate the saliency
of image regions during the task of object detection.
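As a minimal sketch, a VOCUS-style linear combination of a bottom-up and a top-down saliency map with user-provided weights could look like the following; the function name, map values, and weights are purely illustrative and not taken from the cited work:

```python
import numpy as np

def combine_saliency(bottom_up, top_down, w_bu=0.5, w_td=0.5):
    """Linearly combine two saliency maps with user-provided weights.

    Both maps are 2-D arrays of the same shape, each assumed to be
    normalized to [0, 1].
    """
    total = w_bu * bottom_up + w_td * top_down
    # Re-normalize so the combined map again lies in [0, 1].
    return total / total.max() if total.max() > 0 else total

# Toy example: two 2x2 maps, weighting the bottom-up map more heavily
bu = np.array([[0.2, 0.8], [0.0, 0.4]])
td = np.array([[1.0, 0.0], [0.5, 0.5]])
s = combine_saliency(bu, td, w_bu=0.7, w_td=0.3)
```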
Regarding firearm detection, the topic of this paper, only a few works have been proposed, notwithstanding its importance for the authorities in the era of social networks and anti-terrorism efforts. Among these, (Zhang and Blum, 1997; Yang and Blum, 2002; Xue and Blum, 2003) proposed techniques to reveal concealed firearms by fusing information from multiple sources (thermal/infrared (IR), millimeter wave (MMW), and visual sensors). However, to the best of our knowledge, no methods exist that rely solely on visual image information.
3 PROPOSED SYSTEM
The proposed system is based on the combination
of the information from two different attention pro-
cesses: a bottom-up saliency map and a top-down
saliency map. Figure 1 shows the scheme of the over-
all system. Regarding the bottom-up analysis, we
used in our system the GBVS approach by (Harel
et al., 2007) which is based on a biologically plau-
sible model, and it consists of two steps: activation
maps on certain feature channels and normalization,
which highlights conspicuity. The top-down anal-
ysis is based on the construction of a probabilistic
model, able to estimate the regions of an image where
a firearm is more likely to be found, with respect to
the position of the person’s face. The main idea is to
build the statistics of a large set of samples and then fit
a model onto it, which is then applied to every image
to analyze. This approach will be further explained in
the next subsections, after the description of dataset
used to create the probabilistic model.
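A minimal sketch of such a top-down model, assuming for illustration a single 2-D Gaussian over the firearm's offset from the face (the function names are our own, and the actual model is described in the following subsections):

```python
import numpy as np

def fit_offset_model(face_centers, firearm_centers, face_widths):
    """Fit a 2-D Gaussian to firearm-from-face offsets, normalized by
    face width so that different subject sizes are comparable."""
    offsets = (np.asarray(firearm_centers, float) -
               np.asarray(face_centers, float)) / \
              np.asarray(face_widths, float)[:, None]
    return offsets.mean(axis=0), np.cov(offsets, rowvar=False)

def offset_likelihood(x, mean, cov):
    """Unnormalized Gaussian likelihood of a normalized offset x."""
    d = np.asarray(x, float) - mean
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

# Toy annotations: firearm centers expressed relative to face centers
faces = [[0, 0]] * 4
firearms = [[1.0, 1.5], [0.8, 1.2], [1.2, 1.7], [0.9, 1.4]]
widths = [1, 1, 1, 1]
mean, cov = fit_offset_model(faces, firearms, widths)
```

At test time, evaluating `offset_likelihood` over a grid of image positions (relative to a detected face) would yield a top-down saliency map peaked where firearms were most often annotated.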
3.1 Dataset Description
Due to the large number of images required, a comprehensive dataset had to be acquired. However, a large, realistic database with a variety of firearms is hard to build from scratch. For this reason, the images available in the "Internet Movie Firearms Database" (IMFDB)^1 were used. The database is composed of several thousand images taken from movie scenes. Each image depicts one or more persons holding one or more firearms. The images are medium-quality color images with resolutions ranging from 0.06 to 2 megapixels. Figure 2 shows some examples of images taken from the database.
3.2 Dataset Annotation
In order to obtain reliable statistics from the images, they had to be manually annotated with a set of labels, from which several metrics were then measured. In particular, 1000 images were labeled with the following information:
• Image filename;
• Image size, both horizontal I_w and vertical I_h;
• Firearm position, W_px and W_py, and size, both horizontal W_w and vertical W_h;
• Face position, F_px and F_py, and size, both horizontal F_w and vertical F_h.
From these elements, additional information is ex-
tracted, namely:
• Distance from face to firearm, d_{w-f}, normalized w.r.t. face size;
• Orientation of the firearm w.r.t. the subject's face, α_{w-f};
• The area of the firearm bounding box;
• The area of the face bounding box.
Note that each measure is normalized w.r.t. the face size, in order to make the values comparable regardless of subject size or image resolution.
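As an illustrative sketch of how such derived measures could be computed from the annotated bounding boxes (the function name and box convention are our own assumptions: boxes given as (px, py, w, h) with (px, py) the top-left corner, and distance normalized by face width):

```python
import math

def derived_measures(face, firearm):
    """Compute face-normalized distance, orientation, and relative
    firearm area from a face box and a firearm box, each (px, py, w, h)."""
    fx, fy = face[0] + face[2] / 2, face[1] + face[3] / 2       # face center
    wx, wy = firearm[0] + firearm[2] / 2, firearm[1] + firearm[3] / 2
    dx, dy = wx - fx, wy - fy
    # Distance normalized w.r.t. face width, per the note above.
    d = math.hypot(dx, dy) / face[2]
    # Orientation of the firearm w.r.t. the face, in radians.
    alpha = math.atan2(dy, dx)
    # Firearm bounding-box area, normalized by the face-box area.
    rel_area = (firearm[2] * firearm[3]) / (face[2] * face[3])
    return d, alpha, rel_area

# Example: firearm directly to the right of the face, twice its area
d, alpha, rel_area = derived_measures((100, 100, 50, 50), (200, 100, 100, 50))
```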
1 Internet Movie Firearms Database (IMFDB) - http://www.imfdb.org/
SIGMAP 2014 - International Conference on Signal Processing and Multimedia Applications