an idea of the pipeline we follow in the case of application to an object classification problem.
The properties of the proposed functional not only ensure good performance with linear classifiers, but can also be exploited directly in the classification stage. Indeed, we also propose to leverage the dictionary mechanism for the classification task by classifying each single feature on the basis of the dictionary response, rather than using the reconstruction error (Yang et al., 2008; Skretting and Husøy, 2006; Peyré, 2009; Mairal et al., 2008a). The main advantage of this choice is that the classification of local features allows us to deal with occlusions and the presence of cluttered background.
Most approaches focus on learning dictionaries based on the reconstruction error (Yang et al., 2008; Skretting and Husøy, 2006; Peyré, 2009), and do not exploit prior knowledge of the classes even in supervised tasks. In (Mairal et al., 2008a), a discriminative method to learn dictionaries was proposed, namely learning one dictionary for each class. Later, in (Mairal et al., 2008b), the authors extended (Mairal et al., 2008a) by learning a single shared dictionary together with models for the different classes, mixing generative and discriminative methods. There have been some attempts to learn invariant mid-level representations (Wersing and Körner, 2003; Boureau et al., 2010), while other works use sparse representation as the main ingredient of feed-forward architectures (Hasler et al., 2007; Hasler et al., 2009). The most recent works focus on learning general-purpose, task-driven dictionaries (Mairal et al., 2012), or look at the pooling stage (Jia et al., 2012), trying to learn the receptive fields that better capture the image statistics.
In this work, we exploit the power of low-level features from a different perspective, i.e. by taking advantage of sparsity. The main contributions of our work can thus be summarized as follows:
• A new functional for learning discriminative and sparse image representations, which exploits prior knowledge of the classes. Unlike other approaches, when building the dictionary of a given class we also consider the contributions of negative examples. This allows us to obtain more discriminative representations of the image content.
• A new classification scheme based on the dictionary response, as opposed to the reconstruction error, which allows us to exploit the representative power of the dictionaries and to be robust to occlusions. This solution is naturally applicable to multi-class scenarios and preserves the local feature configuration (see the sketch after this list).
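To make the distinction concrete, the following is a minimal sketch of the two classification rules, under stated assumptions: per-class dictionaries with atoms as columns, scikit-learn's Lasso as a stand-in sparse solver, and the ℓ1 energy of the code as a purely illustrative "dictionary response" (the actual functional and response used in this work are introduced in the next sections; all function names are hypothetical).

```python
import numpy as np
from sklearn.linear_model import Lasso


def sparse_code(x, D, lam=0.1):
    """Sparse code of feature x on dictionary D (columns = atoms) via the Lasso."""
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    lasso.fit(D, x)
    return lasso.coef_


def classify_by_reconstruction_error(x, dictionaries, lam=0.1):
    """Standard rule: assign x to the class whose dictionary reconstructs it best."""
    errors = [np.linalg.norm(x - D @ sparse_code(x, D, lam)) for D in dictionaries]
    return int(np.argmin(errors))


def classify_by_dictionary_response(x, dictionaries, lam=0.1):
    """Illustrative response-based rule: compare the energy of the sparse codes
    themselves, so each local feature is labeled on its own."""
    responses = [np.linalg.norm(sparse_code(x, D, lam), ord=1) for D in dictionaries]
    return int(np.argmax(responses))
```

Since each local feature is classified independently, features falling on occluders or cluttered background can, for instance, simply be outvoted when aggregating per-feature labels at the image level.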
We experimentally validate the proposed method, showing its applicability to two different classification tasks, namely single-instance object recognition and object categorization. For the first task, we use a dataset acquired in-house, including 20 objects of different complexity and characterized by variability in lighting conditions, scale, and background. For categorization, instead, we consider a collection of 20 classes from the benchmark Caltech-101 dataset. In both cases, we show that the solution we propose outperforms other approaches from the literature.
2 PRELIMINARIES
In this section we review the traditional approach to dictionary learning and describe the classification pipeline commonly used in the literature in combination with such a representation scheme. This sets the basis for discussing the contributions of our approach.
2.1 General Classification Framework
We first briefly introduce the classification pipeline commonly adopted with sparse coding. It can be divided into four main stages.
1. Features Extraction. A set of descriptors $x_1, \dots, x_{m_I}$ is extracted from a test image $I$. Examples of local descriptors are image patches, SIFT (Lowe, 2004), or SURF (Bay et al., 2008), either sparse or dense.
2. Coding Stage. The coding stage maps the input features $x_1, \dots, x_{m_I}$ into a new overcomplete space, producing the codes $u_1, \dots, u_{m_I}$.
3. Pooling Stage. The locality of the coded descriptors $u_1, \dots, u_{m_I}$ cannot capture the high-level statistics of an image, therefore a pooling step is required. It can be performed at image level or with a multi-scale approach (see e.g. (Boureau et al., 2010)). It has been experimentally shown that the max pooling operator achieves the highest performance in classification tasks (Boureau et al., 2010). With this operator an image is encoded with a single feature vector $\bar{u} \in \mathbb{R}^d$, whose components are

$$\bar{u}_j = \max_{i = 1, \dots, m_I} u_{ji} \qquad (1)$$
4. Classification. The final description is fed to a classifier such as an SVM (Vapnik, 1998). Codes obtained through vector quantization usually require ad-hoc kernels to obtain good performance; sparse coding approaches, instead, have been shown to be effective when combined with linear classifiers, also ensuring real-time performance (Yang et al., 2009). A minimal sketch of stages 2–4 follows.
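To fix ideas, here is a minimal end-to-end sketch of stages 2–4, assuming stage 1 has already produced the local descriptors (random vectors stand in for, e.g., dense SIFT); DictionaryLearning and LinearSVC from scikit-learn are illustrative stand-ins, not the learners evaluated in this paper.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stage 1 (assumed done elsewhere): each image yields m_I local descriptors;
# random 128-dimensional vectors stand in for e.g. dense SIFT.
train_images = [rng.standard_normal((50, 128)) for _ in range(20)]
train_labels = rng.integers(0, 2, size=20)

# Stage 2: learn an overcomplete dictionary (d atoms > 128 input dimensions)
# and sparse-code every descriptor on it.
d = 256
dico = DictionaryLearning(n_components=d, alpha=1.0, max_iter=10,
                          transform_algorithm="lasso_lars",
                          transform_alpha=1.0)
dico.fit(np.vstack(train_images))

def encode_and_pool(descriptors):
    u = dico.transform(descriptors)  # shape (m_I, d): one sparse code per descriptor
    return u.max(axis=0)             # Stage 3: max pooling over i, as in Eq. (1)

X_train = np.array([encode_and_pool(img) for img in train_images])

# Stage 4: a linear SVM on the pooled image-level codes.
clf = LinearSVC().fit(X_train, train_labels)
```

Note that the pooled vector $\bar{u}$ lives in $\mathbb{R}^d$ regardless of the number $m_I$ of local descriptors, which is what makes a fixed-size linear classifier applicable.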