ing boxes with high scores that do not overlap with the
true bounding box, and true bounding boxes missed
by our detection method.
8 CONCLUSION AND PERSPECTIVES
We have proposed a novel weakly supervised lo-
calization method based on classification of image
patches represented with large Fisher vectors. The
main advantage of our method is fast evaluation due
to a sparse linear classification model trained with L1
regularization. The sparsity of our model is
determined by the parameter λ, which regulates the
trade-off between the loss and the regularizer. We set
this parameter by cross-validation, which implies that
the selected model outperforms less sparse models on
the held-out folds of the train split. A larger λ would
yield a sparser model and faster evaluation at the
expense of some performance loss.
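The effect of λ on sparsity can be illustrated with a minimal sketch. It assumes a logistic loss and a plain proximal-gradient (ISTA) solver on synthetic data; neither the loss nor the solver is stated in the text, and the names `soft_threshold` and `train_l1_logreg` are hypothetical.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1 (elementwise shrinkage)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def train_l1_logreg(X, y, lam, n_iter=500):
    """Minimize mean logistic loss + lam * ||w||_1 by proximal gradient.

    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}.
    The loss and solver are assumptions made for this illustration.
    """
    n, d = X.shape
    # Step size 1/L, where L = ||X||_2^2 / (4 n) upper-bounds the
    # Lipschitz constant of the gradient of the mean logistic loss.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)
        # 1 / (1 + exp(margin)), written with tanh for numerical stability.
        sigma = 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = -(X.T @ (y * sigma)) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic stand-in for patch descriptors: only 5 of 50 dimensions matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:5] = 1.0
y = np.where(X @ w_true >= 0, 1.0, -1.0)

# Above this value the all-zero model is optimal for the objective above.
lam_max = np.max(np.abs(X.T @ y)) / (2 * len(y))
for lam in (0.05 * lam_max, 0.5 * lam_max, 2.0 * lam_max):
    w = train_l1_logreg(X, y, lam)
    print(f"lambda={lam:.4f}  non-zero weights={np.count_nonzero(w)}")
```

Since a linear score only needs the dimensions with non-zero weights, the number of non-zeros printed here is a proxy for evaluation cost in this toy setting.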
To the best of our knowledge, this is the first ac-
count of patch Fisher vectors being used for weakly
supervised object localization. The method has
been experimentally validated on a challenging pub-
lic dataset and the obtained performance (90% recall,
75% precision) is comparable with strongly super-
vised approaches. The most interesting qualities of
the proposed approach include:
1. it is based on a slightly downgraded state-of-the-
art image classification approach;
2. it does not require ad hoc or bottom-up
initialization;
3. it is trainable on images of very small objects (less
than 1% of the image content);
4. it is trainable on very large datasets (thousands of
images) in reasonable time.
Our results suggest that Fisher vectors hold great
potential in the field of weakly supervised object
localization. An interesting direction for future work
would be to use a block-sparse model to directly en-
force sparsity over GMM components. This would
also help to improve soft-assign time, which is cur-
rently the bottleneck of the method (our unoptimized
Python implementation takes around 20 s in the de-
tection stage). To this end we shall explore cascade
classifiers in the original feature space for quick
rejection of the patches that cannot contribute to the
top scores. An interesting extension would be a more
expressive spatial layout model for proposing bounding
boxes. Finally, we would like to tackle weakly
supervised localization of fine-grained object classes,
as this problem has many interesting applications.
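As a rough illustration of the block-sparse idea mentioned above, the sketch below applies the proximal operator of a group-lasso penalty, with one group per GMM component. The contiguous per-component layout of the weight vector, the choice of penalty, and the names `group_soft_threshold` and `active_components` are assumptions made for illustration, not part of the presented method.

```python
import numpy as np

def group_soft_threshold(w, n_components, t):
    """Proximal operator of t * sum_k ||w_k||_2 (group lasso).

    w is split into n_components equal-sized contiguous blocks, one per
    GMM component (an assumed layout; a real Fisher vector may need to
    be reordered first).
    """
    blocks = w.reshape(n_components, -1)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    # Shrink each block towards zero; blocks with norm <= t vanish.
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return (blocks * scale).reshape(w.shape)

def active_components(w, n_components):
    """Indices of components whose weight block is not entirely zero."""
    blocks = w.reshape(n_components, -1)
    return np.flatnonzero(np.any(blocks != 0.0, axis=1))

# Toy weight vector: 3 components with 2 weights each.
w = np.array([3.0, 4.0, 0.1, 0.1, 0.0, 0.0])
w_sparse = group_soft_threshold(w, n_components=3, t=1.0)
print(w_sparse)
print(active_components(w_sparse, n_components=3))
```

Unlike elementwise L1 shrinkage, this operator removes whole blocks at once. A component whose block is zero no longer contributes to the patch score, so in principle it could be skipped; whether that reduces soft-assign time in practice depends on the implementation.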
ACKNOWLEDGEMENTS
This work has been supported by the project VISTA
- Computer Vision Innovations for Safe Traffic,
IPA2007/HR/16IPO/001-040514 which is cofinanced
by the European Union from the European Regional
Development Fund.
Parts of this work have been performed while the
first author was funded by the European Commu-
nity Seventh Framework Programme under grant no.
285939 (ACROSS).
Parts of this work have been performed in the
frame of the Croatian Science Foundation project I-
2433-2014.
VISAPP 2015 - International Conference on Computer Vision Theory and Applications