standard algorithm for object localization and recog-
nition. Places are described by the frequency of ob-
jects found in them combined with constraints on
their position. However, object categorization is
still a difficult task and the position of objects can
vary greatly from one environment to another.
Therefore, these approaches have not been used on
large databases.
The vast majority of research on place recognition
uses techniques developed for visual scene
classification. We can distinguish methods using global
features (see (Torralba et al., 2003)) and methods using
descriptors computed around interest points (see
(Ullah et al., 2008)). (Filliat, 2008) uses the
Bag-of-Words (BoW) model: local features are first
clustered into a so-called dictionary of visual words
learned by means of a vector quantization algorithm.
An image is then represented by the distribution of
visual words found in it. The major advantage is that
the learning space is discretized; the drawback is that
all geometrical information is lost.
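As an illustration of the BoW representation described above, the following minimal sketch learns a dictionary with plain k-means and turns the local features of one image into a normalized histogram of visual words. The choice of k-means as the vector quantization algorithm, the function names and the toy 2-D features are assumptions made for this example; they are not taken from the cited work.

```python
import numpy as np

def learn_dictionary(features, k, n_iter=20, seed=0):
    """Plain k-means (assumed quantizer): returns k centroids, the visual words."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=k, replace=False)
    centroids = features[start].astype(float)
    for _ in range(n_iter):
        # Squared distance of every feature to every centroid, shape (n, k).
        d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def bow_histogram(features, centroids):
    """Map each local feature to its nearest visual word and count the words."""
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = d.argmin(axis=1)
    counts = np.bincount(words, minlength=len(centroids)).astype(float)
    return counts / counts.sum()

# Toy data (hypothetical): two well-separated groups of 2-D local features.
rng = np.random.default_rng(1)
train = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                   rng.normal(5.0, 0.1, (50, 2))])
dictionary = learn_dictionary(train, k=2)

# One "image" with three features from the first group and one from the second.
image_features = np.vstack([rng.normal(0.0, 0.1, (3, 2)),
                            rng.normal(5.0, 0.1, (1, 2))])
hist = bow_histogram(image_features, dictionary)
```

The histogram keeps only how often each word occurs, which is why the positions of the features in the image, and hence all geometrical information, are discarded.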
Generally speaking, using a single image or a single
type of information is not enough for place
recognition tasks. Therefore, a lot of research has been
conducted to disambiguate perception. (Pronobis and
Caputo, 2007) use a confidence criterion to iteratively
compute several cues from the same image until
confidence in the classification is sufficiently high
(or no more cues are available).
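The control flow of such a confidence-driven scheme can be sketched as follows. This is only an illustration of the loop "compute one more cue until confident or out of cues": the cue functions are hypothetical, and both the score averaging and the confidence measure (margin between the two best averaged scores) are assumptions of this example, not the criterion of the cited work.

```python
import numpy as np

def classify_with_cues(image, cues, threshold):
    """Evaluate cues one by one until the decision is confident enough.

    cues: list of callables mapping an image to one score per place
    (at least two places). Returns (place index, confidence, cues used).
    """
    if not cues:
        raise ValueError("at least one cue is required")
    total = None
    used = 0
    confidence = 0.0
    for cue in cues:
        scores = np.asarray(cue(image), dtype=float)
        total = scores if total is None else total + scores
        used += 1
        mean = total / used
        second, best = np.sort(mean)[-2:]
        confidence = best - second  # assumed confidence: margin of the winner
        if confidence >= threshold:
            break  # confident enough: the remaining cues are never computed
    return int(np.argmax(total)), float(confidence), used

# Hypothetical cues returning fixed scores over three places.
def cue_global(image):
    return [0.50, 0.45, 0.05]  # ambiguous between places 0 and 1

def cue_local(image):
    return [0.90, 0.05, 0.05]  # clearly favours place 0

def cue_unused(image):
    return [0.10, 0.10, 0.80]

place, confidence, used = classify_with_cues(
    None, [cue_global, cue_local, cue_unused], threshold=0.3)
```

In this toy run the first cue alone is ambiguous, the second one raises the margin above the threshold, and the third cue is never evaluated, which is where the computational saving comes from.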
Another method to reduce ambiguity is to use
several images to mutually disambiguate perception.
In (Pronobis et al., 2010), the authors use a simple
spatio-temporal accumulation process to filter the
decisions of a discriminative confidence-based place
recognition system (which uses only one image to
recognize the place). One problem with this method
is that the system needs to wait some time before
giving a response. Also, special care must be taken to
detect place boundaries and to adjust the size of the
bins. (Torralba et al., 2003) use an HMM where each
place is a hidden state and the feature vector stands for
the observation. The drawback is that the input space
is continuous and high-dimensional, so the learning
procedure is computationally expensive.
(Ranganathan, 2010) uses a technique called Bayesian
online change-point detection. The main idea is to
detect abrupt changes in the parameters of the input’s
statistics caused by moving from one place to another.
The main advantage is that the robot is able to learn
in an unsupervised way, but the method relies on the
hypothesis that the shape of the distribution is the
same for every place.
Several works (see (Wu et al., 2009; Guillaume
et al., 2011; Dubois et al., 2011)) have combined
global image description and vector quantization. In
this case, each image is described by a single
visual word. The sequence of images is then
translated into a sequence of words. Such techniques
make it possible to draw a parallel between place
recognition and language modelling. (Wu et al., 2009)
propose to use an HMM with discretized signatures.
Temporal integration is performed with Bayesian
filtering (see section 3). (Dubois et al., 2011) propose
to use an extended model called auto-regressive HMM
to take into account the dependence between images.
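The parallel with language modelling can be made concrete with a small sketch: each global image descriptor is replaced by the index of its nearest dictionary entry, and the resulting word sequence is summarized by n-gram counts, as one would do for text. The dictionary, the descriptors and the function names below are hypothetical toy values chosen for this example.

```python
from collections import Counter
import numpy as np

def to_word_sequence(descriptors, dictionary):
    """One global descriptor per image -> one visual word index per image."""
    d = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1).tolist()

def ngram_counts(words, n):
    """Count every run of n consecutive words in the sequence."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Hypothetical dictionary of three visual words in a 2-D descriptor space.
dictionary = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Hypothetical global descriptors of five consecutive images.
descriptors = np.array([[0.1, 0.0], [0.9, 0.1], [0.1, 0.0],
                        [0.9, 0.1], [0.0, 0.8]])

words = to_word_sequence(descriptors, dictionary)  # [0, 1, 0, 1, 2]
bigrams = ngram_counts(words, 2)
```

Here the bigram (0, 1) occurs twice, so a model built on such counts captures which words tend to follow each other, something a per-image histogram cannot express.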
In this paper we push this idea a step further. The
next section presents our model and its relation to
the standard HMM.
3 PLACE RECOGNITION WITH
n-GRAMS
Our model is similar to the one described in
(Guillaume et al., 2011; Dubois et al., 2011). Each image is
described by a single feature vector, which is mapped
to a visual word by a vector quantization
algorithm (see section 3.3). The main novelty lies
in the use of a High-Order Hidden Markov Model (see
section 3.1) and of techniques for visual word selection
(see section 3.4).
3.1 High-Order Hidden Markov Model
In HMMs the relationship between $x_t$, the robot’s
knowledge of the world at time $t$, and $z_t$, its
perception, is represented by figure 1(a). In the case
of place recognition, the state is a discrete random
variable which represents the place the robot is in
at time $t$. In this model, each place $c_i \in C$ is
modelled by the continuous probability distribution
$p(z_t \mid x_t = c_i)$. This formalism makes it possible
to efficiently estimate the a posteriori probability
$\mathrm{bel}(x_t) = P(x_t \mid z_{1:t})$
by a recursive equation (see (Wu et al., 2009)) given
the discrete place transition probability distribution
$P(x_t \mid x_{t-1})$, which encodes the topology of the
environment.
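For reference, the standard discrete Bayes filter computes this posterior as $\mathrm{bel}(x_t) = \eta \, p(z_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1}) \, \mathrm{bel}(x_{t-1})$, where $\eta$ is a normalization constant; we assume this is the form of the recursive equation referred to above. The sketch below implements one update step; the two-place transition matrix and the likelihood values are hypothetical.

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One update of the discrete Bayes filter.

    belief[i]        = bel(x_{t-1} = c_i)
    transition[i, j] = P(x_t = c_j | x_{t-1} = c_i)
    likelihood[j]    = p(z_t | x_t = c_j)
    Assumes at least one place has a non-zero posterior.
    """
    predicted = transition.T @ belief   # sum over the previous place
    posterior = likelihood * predicted  # weight by the current observation
    return posterior / posterior.sum()  # normalize (the constant eta)

# Hypothetical two-place environment: the robot tends to stay where it is.
transition = np.array([[0.9, 0.1],
                       [0.1, 0.9]])
prior = np.array([0.5, 0.5])
likelihood = np.array([0.8, 0.2])  # the observation favours the first place

belief_1 = bayes_filter_step(prior, transition, likelihood)
belief_2 = bayes_filter_step(belief_1, transition, likelihood)
```

Starting from a uniform prior, the first step gives a belief of (0.8, 0.2); a second consistent observation sharpens it further, which is the temporal integration effect.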
It is assumed that the current observation depends
only on the current hidden state, i.e., that the state is
complete. However, there is a huge semantic gap
between the human notion of a place and what can be
extracted from an image. Several authors have
proposed extensions of the classic HMM to take into
account long-term dependencies between observations
(see (Berchtold, 2002; Lee and Lee, 2006)). In this
paper we will call this model the High-Order Hidden
Markov Model (HOHMM). In this case, the current
knowledge $x_t$ depends on the last $\ell$ states
$x_{t-\ell:t-1}$. Similarly, the current observation
$z_t$ depends on $x_t$ and