EVIDENTIAL COMBINATION OF ONTOLOGICAL
AND STATISTICAL INFORMATION FOR ACTIVE SCENE
CLASSIFICATION
Thomas Reineking, Niclas Schult
Cognitive Neuroinformatics, University of Bremen, Enrique-Schmidt-Straße 5, 28359 Bremen, Germany
Joana Hois
Research Center on Spatial Cognition SFB/TR 8, University of Bremen
Enrique-Schmidt-Straße 5, 28359 Bremen, Germany
Keywords:
Ontology, Statistics, Dempster-Shafer theory, Scene classification, Information gain, Knowledge representation, Domain analysis and modeling.
Abstract:
We introduce an information-driven scene classification system that combines different types of knowledge
derived from a domain ontology and a statistical model in order to analyze scenes based on recognized objects.
The domain ontology structures and formalizes which kind of scene classes exist and which object classes
occur in them. Based on this structure, an empirical analysis of annotations from the LabelMe image database
results in a statistical domain description. Both forms of knowledge are utilized for determining which object
class detector to apply to the current scene according to the principle of maximum information gain. All
evidence is combined in a belief-based framework that explicitly takes into account the uncertainty inherent
to the statistical model and the object detection process as well as the ignorance associated with the coarse
granularity of ontological constraints. Finally, we present preliminary classification performance results for
scenes from the LabelMe database.
1 INTRODUCTION
One of the reasons why humans are so successful at
solving complex problems is that they are capable of
utilizing different forms of knowledge. Here, we fo-
cus on two distinct aspects of an agent’s knowledge
representations in particular. On the one hand, for-
mal domain ontologies provide background knowl-
edge about the world by expressing necessary and
general logical relations between entities. These re-
lations reflect the definitions or inherent determina-
tions of the entities involved, or they reflect general
rules or laws effective in the domain under considera-
tion. Degrees of belief, on the other hand, account for
the fact that perception of the world is intrinsically
uncertain. They are usually obtained from empirical
data and thus belong to the realm of statistics. These
two forms of knowledge are complementary in nature
since ontological models generally abstract from the
uncertainty associated with perception while statisti-
cal models do not explicitly represent the semantics
of variables in addition to being restricted to encod-
ing low-order relations due to the resulting combina-
torial complexity. We argue that both forms should
be used when reasoning about problems that involve
uncertainty and that cannot be accurately modeled by
statistics alone.
A good example of a difficult reasoning problem
where both ontologies and statistics can offer valu-
able information is that of vision-based scene classi-
fication. Distinguishing between semantic classes of
scenes is essential for an autonomous agent in order
to interact with its environment in a meaningful way.
While coarse perceptual classes (e.g., coast, forest)
can be recognized based on low-level features (Oliva
and Torralba, 2001), classifying scenes which differ
mainly with respect to activities an agent could per-
form in them (e.g., mall–shopping, kitchen–cooking)
requires an object-centric analysis. The problem
therefore consists of detecting objects and combining
this information with knowledge about the relations
of object and scene classes. These relations can either
be of statistical nature (e.g., 70% of all street scenes
contain cars) or they can be categorical (e.g., all bed-
rooms contain a bed). The former can be naturally
represented by object-scene co-occurrence probabilities, whereas the latter can be expressed by domain ontologies.

[Reineking, T., Schult, N. and Hois, J. (2009). In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD), pages 72-79. DOI: 10.5220/0002304300720079. © SciTePress.]
The problem of assigning a semantic class label to
a scene on the basis of sensory information has been
discussed in several works recently. In (Schill et al.,
2009), a domain ontology is used for inferring room
concepts from sensorimotor features associated with
objects. Objects are analyzed via saccadic eye move-
ments which are generated by an information max-
imization process. In contrast to our approach, the
system does not distinguish between categorical and
statistical knowledge, but rather uses expert knowl-
edge for expressing degrees of belief about relations
of room concepts and objects. In (Martínez Mozos
et al., 2007), semantic scene classification is used for
enriching the spatial representation of a mobile robot.
The focus of their paper is to apply boosting to the
classification of laser range scans as well as Haar-like
features extracted from camera images. However, the
system exclusively uses statistics for modeling the re-
lation of features and scene classes without utilizing
explicit knowledge about objects.
The idea of utilizing co-occurrence statistics for
modeling context has been proposed in (Kollar and
Roy, 2009). The authors obtained their model via
dense sampling over annotations in the Flickr image
database. Using this information, a robot is then able
to predict the locations of objects based on previously
observed objects. (Maillot and Thonnat, 2008) in-
troduce an object recognition system for classifying
pollen grains. The classification is based on a training
set of sample images supported by a domain ontol-
ogy. Based on this training set, classification-relevant
dependencies between pollen grain classes and their
features are extracted, and this information is added
to the ontological information. Although their on-
tology provides a vocabulary for the domain, it does
not itself, i.e., without the training set, indicate such
classification-relevant dependencies.
In this paper, we propose an information-driven
architecture that combines a domain ontology with a
statistical model which we apply to the problem of
scene classification based on observed objects. An
overview of the system architecture is given in the
next section. The knowledge representation consist-
ing of the domain ontology and the statistical model
is described in section 3. Section 4 explains the belief
update and the system’s information gain maximiza-
tion strategy. Object detection is discussed in sec-
tion 5 along with preliminary results for classifying
scenes from the LabelMe database. The paper concludes with a discussion of the proposed architecture and an outline of future work.
Figure 1: Sketch of the system architecture showing the four main components of the scene classification system (domain ontology, LabelMe statistics, information-gain-driven belief update, and object detectors).
2 SYSTEM ARCHITECTURE
The proposed scene classification system is composed
of four main components: a domain ontology for
scenes, a statistical model, a reasoning module for
updating the scene belief via information gain max-
imization, and an image processing module consist-
ing of class-specific object detectors (see figure 1).
The domain ontology defines the vocabulary of object
and scene classes as well as occurrence constraints
between the two (e.g., kitchens contain cooking facil-
ities). The statistical model, on the other hand, pro-
vides co-occurrence probabilities of object and scene
classes which are estimated from the large annotated
image database LabelMe (Russell et al., 2008).
Based on the current scene belief, the system com-
putes the most disambiguating object class accord-
ing to the principle of maximum information gain.
The gain is defined as the expected decrease in un-
certainty between the current and the predicted scene
belief. It depends on the discriminatory power of the
object class (e.g., ‘stove’ when having to distinguish
kitchens from offices) as well as the system’s
confidence in being able to recognize objects of that
class (if stoves are hard to recognize, their discrimina-
tory power is of little use). Once the most informative
object class (i.e., the one minimizing the expected re-
sulting uncertainty) is determined, the system invokes
a vision-based search for this class in order to support
or reject the current scene hypotheses.
The search is modeled by an object detection
mechanism which determines whether an instance of
this object class is present in the image corresponding
to the current scene. Since the main focus of our work
is the interplay between ontology-based and statistic-
based knowledge representations, the system’s vision
part is currently modeled as a simplified object detec-
tion mechanism consisting of a set of class-specific
binary Support Vector Machines (SVMs). Each SVM
is evaluated on a separate data set in order to obtain
a probabilistic model, which reflects the average error
rate during classification. This model is later used to
quantify the level of uncertainty associated with the
detection of a specific object class, which allows the
system to judge the reliability of detection results.
In order to combine the uncertainty resulting from
the object detection and from the statistical model
with the set-based propositions obtained from the on-
tology, we use Dempster-Shafer theory. Represent-
ing the scene belief within this framework allows the
assignment of belief masses to sets of propositions.
The framework is thus suited for expressing the non-
specificity associated with knowledge obtained from
the ontology.
Overall, the architecture analyzes a scene in a cy-
cle of bottom-up object detection followed by a belief
update based on statistical and logical inferences, and
top-down information gain maximization for select-
ing the next object class for detection. In order to clas-
sify a scene, the system first computes the expected
scene information gain associated with the action of
looking for a specific object class. After selecting the
most informative object class, the vision module in-
vokes the corresponding object detector and updates
the current scene belief depending on the detection
result, the constraints defined by the ontology for this
object class and the co-occurrence probabilities of the
statistical model.
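The detect/update cycle described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; `expected_gain`, `update_belief`, and `uncertainty` are hypothetical stand-ins for the components defined in sections 4 and 5, and the threshold value mirrors the one used in section 5.2.

```python
def classify_scene(scene, detectors, expected_gain, update_belief,
                   uncertainty, threshold=1.5):
    """Run the top-down/bottom-up analysis cycle until the scene
    uncertainty drops below the termination threshold."""
    belief = {"__vacuous__": 1.0}   # vacuous initial scene belief
    remaining = set(detectors)      # object classes not yet searched for
    while remaining and uncertainty(belief) > threshold:
        # top-down: pick the object class with maximal expected information gain
        best = max(remaining, key=lambda k: expected_gain(belief, k))
        remaining.discard(best)
        # bottom-up: run the detector and fuse its result into the scene belief
        presence_belief = detectors[best](scene)
        belief = update_belief(belief, best, presence_belief)
    return belief
```

The loop terminates either when every detector has been applied or when the remaining uncertainty no longer justifies further detections.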
3 KNOWLEDGE
REPRESENTATION
The underlying knowledge representation of the sys-
tem comprises (i) statistical and (ii) ontological infor-
mation on the domain. Statistical information results
from empirical data, in our case from the LabelMe
database. It is obtained by averaging over occurrences
of objects (scene entities) in certain scenes. Statistics
are generally subject to noise and they depend on the
availability of sufficient sample data. Furthermore,
they are restricted to representing low-order relations
since the complexity of representation and data ac-
quisition increases exponentially with the number of
variables.
While statistical information reflects the proba-
bility of relations between objects and scenes, onto-
logical information reflects logically strict constraints
and relations between them. A domain ontology for
scenes primarily has to formalize the kind of scenes
and objects that exist as well as their relationships.
In contrast to statistics, it does not rely on a sam-
ple set of data, but on expert knowledge and general
common-sense knowledge of the domain. It may es-
pecially be formalized on the basis of foundational
developments in ontological engineering, as outlined
below. In essence, (i) statistics contribute a proba-
bilistic correspondence between objects and scenes
obtained from a (finite) data set, while (ii) the domain
ontology contributes a formalization of entities and
relations that exist in the domain, which also provides
the vocabulary for the statistics.
We introduce the domain ontology for scene
recognition in the next section, and the statistical anal-
ysis of the domain subsequently.
3.1 Ontology
The domain ontology for visual scene recognition
provides the system with information on scenes, ob-
jects in scenes, and their relations. Although ontolo-
gies can be defined in any logic, we focus here on
ontologies as theories formulated in description logic
(DL) (Baader et al., 2003). DL is supported by the web ontology language OWL 2 DL¹. Even though ontologies may be formulated in more or less expressive logics, DL ontologies have the following benefits: they are widely used and a common standard for ontology specifications, they provide constructions that are general enough for specifying complex ontologies, and they provide a balance between expressive power and computational complexity in terms of reasoning practicability (Horrocks et al., 2006). Our scene recognition system uses a domain ontology for specific scenes (SceneOntology²), which is formulated in OWL 2 DL. Furthermore, the system uses Pellet (Sirin et al., 2007) for ontological reasoning.
In our scenario, the ontology provides background
knowledge about the domain to support scene classi-
fication. Its structure adopts methods from formal on-
tology developments. In particular, it is a logical re-
finement of the foundational ontology DOLCE (Ma-
solo et al., 2003). For practical reasons, we use the
OWL version DOLCE-Lite. The domain ontology for
scenes conservatively extends DOLCE-Lite, i.e., all
assertions made in the language of DOLCE-Lite that
follow from the scene ontology already follow from
DOLCE-Lite. Essentially, this means that the scene
ontology completely and independently specifies its
vocabulary, i.e., it can be seen as an ontological mod-
ule (Konev et al., 2009).
¹ http://www.w3.org/TR/owl2-semantics/
² http://www.ontospace.uni-bremen.de/ontology/domain/SceneOntology.owl
Reusing DOLCE ensures that the domain ontol-
ogy is based on a well-developed foundational ontol-
ogy. Certain types of classes and relations can be re-
used and their axiomatizations can be inherited. Par-
ticular ontological classes specified in the scene ontol-
ogy that are involved in the scene recognition process
are SceneClass, SceneEntity, and Scene. Their refine-
ments of DOLCE-Lite are defined as follows:
Scene ⊔ SceneEntity ⊑ dolce:physical-object
SceneClass ⊑ dolce:non-physical-object
The class dolce:physical-object is a subcategory
of physical-endurant in DOLCE, which represents
those entities that have a physical extent and
which are wholly present in time (Masolo et al.,
2003). Scene and SceneEntity are subclasses of this
dolce:physical-object. The class SceneEntity represents
physical entities that occur in spatial scenes and that
correspond to segmented objects in scene images.
These entities are determined by their intrinsic (inher-
ent) properties. Examples are Furniture, Refrigerator,
Chair, Appliance, Tree, and Plant, namely entities that
are contained in indoor and outdoor scenes. These
scenes are represented by the class Scene. The re-
lation contain specifies precisely the relation that cer-
tain instances of SceneEntity are contained in a certain
Scene. The class Scene can be informally described
as a collection of contained SceneEntities. In practice,
it is related to a specific view on the environment that
is perceived by the visual system.
In contrast to Scene and SceneEntity, SceneClass
formalizes the type of the scene, i.e., it indi-
cates the category of collections (Scene) of entities
(SceneEntity). Examples are Kitchen, Office, ParkingLot,
and MountainScenery. SceneClass is a subclass of
dolce:non-physical-object, which is an endurant that has
no mass. It constantly depends on a physical en-
durant, in this case, it depends on the collection of
entities that are physically located at a certain Scene
or that are ‘commonly’ perceivable in this scene.
In the scene ontology, SceneClass therefore defines
the DOLCE-relation generically-dependent-on to one
Scene, which is defined by a conjunction of disjunc-
tions of restrictions on those SceneEntity that may oc-
cur in the scene. Specific subclasses of SceneClass
and SceneEntity are taken from the database of La-
belMe.
Hence, for a specific SceneClass $s_i$, there is a number of subclasses $t_k$ of SceneEntity that necessarily have to occur at the Scene $r_j$ of the SceneClass $s_i$. Specific subclasses of SceneEntity are taken into account by the following conjunction, with $K_{s_i}$ indicating the index set of the SceneEntity classes $t_k$ that constrain $s_i$ as defined by (2), and $N$ indicating the total number of subclasses of SceneEntity:

$$\xi_{s_i} = \bigwedge_{k \in K_{s_i}} t_k \quad \text{with } K_{s_i} \subseteq \{1, \dots, N\}. \quad (1)$$

Each SceneEntity $t_k$ is taken from the constraints of the SceneClass $s_i$ as follows:

$$\text{generically-dependent-on}(s_i, (\text{contain}(r_j, t_k))). \quad (2)$$
The distinction being drawn between SceneEntity and
SceneClass is based on an agent-centered perspective
on the domain of possible scenes from the LabelMe
database. While instances of SceneEntity (e.g., chair,
refrigerator, or sink) are on the same level of granular-
ity, instances of SceneClass (e.g., kitchen, street cor-
ner, or warehouse) are on a broader level of granular-
ity and they particularly depend on a collection of the
former. The levels of granularity depend on the agent,
i.e., in our case the vision system, that perceives its
environment, i.e., a specific scene. The ontologi-
cal representation of entities differing in granularity
aspects is grounded in this agent-based (embodied)
vision, as outlined, for instance, in (Vernon, 2008).
Note, however, that although an ‘open world’ as-
sumption underlies the ontological representation, the
ontology takes into account precisely the objects that
are classifiable by the system. Currently, the scene on-
tology distinguishes between 7 different scene classes
and 24 scene objects.
Constraints of specific SceneClasses, such as
Kitchen, are given by the scene ontology on the ba-
sis of SceneEntities. A sample of such constraints
is illustrated in the following example (formulated
in Manchester Syntax (Horridge and Patel-Schneider,
2008)):
Class: Kitchen
SubClassOf: SceneClass,
generically-dependent-on only (Scene and
(contain some Oven)),
generically-dependent-on only (Scene and
(contain some Sink)), . . .
Queries over such constraints using the reasoner Pel-
let support the scene recognition process by provid-
ing general background knowledge about the domain.
Given a request for a specific scene class, the reasoner
returns the constraints given by $\xi_{s_i}$ as formulated in (1).
3.2 Statistics
The statistical model represents the relation of a scene class $s_i$ and an object class $t_k$ by their co-occurrence probability $p(t_k|s_i)$. These conditional probabilities are estimated by computing relative scene-object tag
Figure 2: Conditional probabilities of object occurrences for a given scene (monitor, keyboard, PC, phone, table, desk, plant, trash bin, bookshelf, chair). In this example, probabilities of object classes in the scene class Office are shown.
frequencies from the LabelMe database. We re-
strict our model to these second-order relations since
higher-order models exhibit combinatorial complex-
ity, even though this implies ignoring possible sta-
tistical dependencies between object classes. After
excluding scenes not containing any known object
classes, 9601 scenes containing 28701 instances of known object classes remain for the statistical analysis. An example
of the co-occurrence distribution for the scene class
Office is shown in figure 2.
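The estimation of $p(t_k|s_i)$ can be sketched as a relative-frequency count over annotations. This is an illustrative simplification under the assumption that scenes arrive as (scene class, set of object tags) pairs; it is not the actual LabelMe processing pipeline.

```python
from collections import defaultdict

def cooccurrence_probabilities(annotated_scenes):
    """Estimate p(object | scene) as relative tag frequencies.

    annotated_scenes: iterable of (scene_class, object_tags) pairs.
    Returns {scene_class: {object_tag: probability}}, where the probability
    is the fraction of scenes of that class containing the object at least once.
    """
    scene_counts = defaultdict(int)
    pair_counts = defaultdict(lambda: defaultdict(int))
    for scene_class, tags in annotated_scenes:
        scene_counts[scene_class] += 1
        for tag in set(tags):   # count each object class at most once per scene
            pair_counts[scene_class][tag] += 1
    return {s: {t: n / scene_counts[s] for t, n in objs.items()}
            for s, objs in pair_counts.items()}
```

Counting each object class at most once per scene matches the second-order (presence-based) model; object multiplicity within a scene is deliberately ignored.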
4 REASONING
In order to compute the belief about the current scene
class, knowledge about scene-object relations from
the statistical model and the domain ontology are
combined with the object detection beliefs from the
vision module. While the statistical model and object
detection can be accurately described by Bayesian
probabilities, the constraints defined by the ontol-
ogy result in propositions about sets of scene classes
without any form of belief measure assigned to the
single elements within these sets. We therefore use
Dempster-Shafer theory (Shafer, 1976) throughout
the architecture since it generalizes the Bayesian no-
tion of belief to set-based propositions, thus making
ignorance explicit and avoiding unjustified equiprob-
ability assumptions. In particular, we use a variant of
Dempster-Shafer theory known as the transferable be-
lief model (Smets and Kennes, 1994), which is based
on an open world assumption accounting for the fact
that not all scenes can be mapped to the modeled
classes.
4.1 Belief Update
Let Θ be a finite and exhaustive set of mutually ex-
clusive hypotheses. The belief induced by a piece of
evidence can be expressed as an (unnormalized) mass
function m : 2
Θ
[0, 1] that assigns belief values to
arbitrary subsets A Θ (including
/
0 (Smets, 1992))
such that
A
m(A) = 1. Combining two pieces of evi-
dence which induce a belief m
1
and m
2
respectively is
done by applying the conjunctive rule of combination
denoted by
:
(m
1
m
2
)(A) =
BC=A
m
1
(B)m
2
(C). (3)
If there is a prior belief m and the truth of a hypothesis
A can be established with certainty, m can be condi-
tioned on A yielding the conditional belief m(·¦ A).
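As a concrete illustration, the conjunctive rule and conditioning can be realized over mass functions represented as dictionaries mapping focal sets to mass values. This is a sketch under our own representation choice, not the authors' implementation; consistent with the open world assumption, conflict mass is allowed to accumulate on the empty set without normalization.

```python
def conjunctive_combination(m1, m2):
    """Unnormalized conjunctive rule of combination, eq. (3).

    m1, m2: dicts mapping frozenset focal elements to mass values.
    Conflicting mass lands on the empty frozenset (no normalization).
    """
    combined = {}
    for b, mb in m1.items():
        for c, mc in m2.items():
            a = b & c                      # intersect the focal sets
            combined[a] = combined.get(a, 0.0) + mb * mc
    return combined

def condition(m, a):
    """Condition a mass function on a hypothesis set `a` (unnormalized):
    equivalent to combining with a categorical mass function on `a`."""
    return conjunctive_combination(m, {frozenset(a): 1.0})
```

The total mass is preserved by the combination; only its distribution over focal sets (including the empty set) changes.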
Let $\hat{T}$ be the set of indices $k$ corresponding to all object classes $t_k$ for which a visual detection was performed up to this point. The scene belief $m_{\hat{T}}$ is then computed by conjunctively combining the detection beliefs $m(t_k)$ of each object class $t_k$ with the conditional scene class belief derived from the statistical model and the domain ontology, which can be written as (Dubois and Prade, 1986):

$$m_{\hat{T}} = \bigodot_{k \in \hat{T}} \sum_{t_k} m(\cdot \| \{t_k\})\, m(t_k). \quad (4)$$
This is the basic update equation of the system. It is
important to note that it can be computed recursively,
i.e., the scene belief is updated based on the prior be-
lief and the information from a new detection.
The change of the scene belief depends on the belief about the object class’ presence $m(t_k)$ (see section 5) as well as on the conditional model $m(S \| \{t_k\})$ that assigns mass values to sets of scene classes $S$ given the (non-)presence of object class $t_k$. As mentioned above, the latter belief reflects two sources of knowledge and can therefore be expressed as a conjunctive combination of a statistical part $m_{\text{sta}}$ and an ontological part $m_{\text{ont}}$:

$$m(\cdot \| \{t_k\}) = m_{\text{sta}}(\cdot \| \{t_k\}) \odot m_{\text{ont}}(\cdot \| \{t_k\}). \quad (5)$$
The statistical model $m_{\text{sta}}$ is obtained by applying Bayes’ rule (without normalization) to the conditional probability $p(t_k|s_i)$, which is generated from training data (see section 3.2):

$$m_{\text{sta}}(s_i \| \{t_k\}) = p(t_k|s_i)\, p(s_i), \quad (6)$$
$$m_{\text{sta}}(s_i \| \{\neg t_k\}) = (1 - p(t_k|s_i))\, p(s_i). \quad (7)$$
The ontological model $m_{\text{ont}}$ does not yield any information if $t_k$ is true because the presence of an object in a scene is never impossible. In this case, mass 1 is assigned to the whole hypothesis space $\Theta_S$, expressing a state of total ignorance. However, if a scene class $s_i$ requires an object class $t_k$ according to the domain ontology, the non-presence of $t_k$ implies the rejection of $s_i$. Therefore, $\neg t_k$ rules out all those scene classes $s_i$ for which the index set $K_{s_i}$ belonging to the conjunctive constraint $\xi_{s_i}$ defined by (1) contains the object class index $k$:

$$m_{\text{ont}}(\Theta_S \| \{t_k\}) = 1, \quad (8)$$
$$m_{\text{ont}}(\Theta_S \setminus \{s_i \mid k \in K_{s_i}\} \| \{\neg t_k\}) = 1. \quad (9)$$
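A sketch of how equations (5) to (9) assemble the conditional scene belief, with mass functions as dictionaries from focal sets to values. The parameter names (`p_obj_given_scene`, `required_by`) are our own stand-ins for the statistical model and the ontological constraint index sets; this is an illustration, not the actual implementation.

```python
def conditional_scene_belief(t_present, p_obj_given_scene, prior, required_by, theta):
    """Conditional scene belief m(. || {t_k}) or m(. || {not t_k}).

    t_present: whether object class t_k is assumed present.
    p_obj_given_scene: {scene: p(t_k|s_i)}; prior: {scene: p(s_i)}.
    required_by: frozenset of scenes whose constraint xi_{s_i} contains t_k.
    theta: frozenset of all modeled scene classes (Theta_S).
    """
    if t_present:
        # eq. (6): unnormalized Bayes on singleton scene hypotheses
        m_sta = {frozenset({s}): p_obj_given_scene[s] * prior[s] for s in theta}
        m_ont = {theta: 1.0}                 # eq. (8): total ignorance
    else:
        # eq. (7): complementary likelihoods
        m_sta = {frozenset({s}): (1 - p_obj_given_scene[s]) * prior[s] for s in theta}
        m_ont = {theta - required_by: 1.0}   # eq. (9): reject constrained scenes
    # eq. (5): conjunctive combination of the statistical and ontological parts
    combined = {}
    for b, mb in m_sta.items():
        for c, mc in m_ont.items():
            a = b & c
            combined[a] = combined.get(a, 0.0) + mb * mc
    return combined
```

Note how, in the absence of $t_k$, the singleton mass of every scene class that ontologically requires $t_k$ is intersected away onto the empty set, i.e., it becomes conflict mass.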
4.2 Information Gain
Aside from passively updating the scene belief in a
bottom-up fashion, the system also utilizes a top-
down mechanism for selecting object detectors in or-
der to actively reduce uncertainty. If there is little
doubt about a scene’s class, then it would be wasteful
to apply all possible detectors knowing that it would
be unlikely to change the scene belief. For this, we
need a measure of uncertainty that is applicable to
mass functions. We use the local conflict measure $H(m)$ (Pal et al., 1993) here since it is a measure of total uncertainty that generalizes the concept of information entropy:

$$H(m) = \sum_A m(A) \log_2 \frac{|A|}{m(A)}. \quad (10)$$
Selecting the most informative object class $t^*$ for the subsequent detection is done by maximizing the expected information gain. The gain associated with an object class $t_k$ from the set of previously ignored classes $\hat{T}^C$ is defined as the difference between the current uncertainty $H(m_{\hat{T}})$ and the expected uncertainty $E(H(m_{\hat{T} \cup \{k\}}))$ after applying the detector for $t_k$:

$$t^* = \operatorname*{arg\,max}_{k \in \hat{T}^C} \left[ H(m_{\hat{T}}) - E\!\left(H(m_{\hat{T} \cup \{k\}})\right) \right]. \quad (11)$$
The extent of the decrease according to (4) depends on the a priori presence belief for $t_k$ on the one hand and on the discriminatory power of $t_k$ with respect to the current scene belief $m_{\hat{T}}$ expressed by the object-scene model $m(\cdot \| \{t_k\})$ on the other hand. Since $t_k$ is not directly observable, the integration for computing the expected value must be performed over the belief space of $m(t_k)$:

$$E\!\left(H(m_{\hat{T} \cup \{k\}})\right) = \sum_{m(t_k)} p(m(t_k))\, H(m_{\hat{T} \cup \{k\}}). \quad (12)$$
Most importantly, the belief distribution over $t_k$ is characterized by a single value ($t_k$ is a binary variable) so that the integration can be approximated by summing over a normalized histogram of belief values $m(t_k)$. The histogram provides the probability of the $t_k$ detector returning a belief value in a given interval. In the current implementation, it contains 100 bins and is computed during the classifier performance evaluation described in the following section.
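The uncertainty measure (10) and the selection rule (11) can be sketched as follows, representing a mass function as a dictionary from frozenset focal elements to mass values. The callable `expected_entropy` is a hypothetical stand-in for the histogram-based approximation of (12); mass on the empty set is skipped in this sketch since $\log_2 |A|$ is undefined there.

```python
import math

def local_conflict(m):
    """Local conflict measure, eq. (10): H(m) = sum_A m(A) * log2(|A| / m(A)).
    Focal sets with zero mass and the empty set are skipped."""
    return sum(v * math.log2(len(a) / v)
               for a, v in m.items() if v > 0 and len(a) > 0)

def select_detector(m_current, remaining, expected_entropy):
    """Selection rule, eq. (11): pick the object class maximizing the
    expected drop from the current uncertainty to the expected uncertainty
    after applying its detector."""
    h_now = local_conflict(m_current)
    return max(remaining, key=lambda k: h_now - expected_entropy(k))
```

For a vacuous belief over $n$ scene classes, $H(m)$ equals $\log_2 n$, i.e., pure nonspecificity; for a certain singleton belief it is zero.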
5 SCENE CLASSIFICATION
In this section, we show how the system manages
to classify scenes from the LabelMe image database
and present preliminary results. Since we are not in-
terested in solving the difficult full object detection
problem (Schneiderman and Kanade, 2004) here, we
use the object pre-segmentation provided by Label-
Me. First, we describe how object detection is per-
formed on these segmented images. Subsequently,
we explain the complete scene classification process
using an example and present results for a randomly
selected set of scenes from the LabelMe database.
5.1 Object Detection
Each object class has a specialized detector that is independently applied to the current scene. Detecting the presence of an object class $t_k$ is realized by checking whether at least one segment contains an object of type $t_k$. In this way, the object
detection problem can be reduced to a set of binary
classification problems. Here, we deliberately ignore
the fact that using a fixed set of segments contradicts
the independent classification of segments (each seg-
ment belongs to exactly one class) since using the pre-
segmentation is just a means of simplification.
Each segment is scaled to a 128 × 128 gray-value image before being processed by a class-specific binary SVM which was trained on several thousand positive and negative instances of $t_k$. Since this classifier can exhibit significant error rates due to the oftentimes suboptimal sample quality, the disjunctive combination of classification results is done in a probabilistic fashion. We define the belief about the presence of object class $t_k$ as the probability of at least one segment $l$ being of type $t_k$, given the classifier response $c_{k;l}$ for each segment:
$$m(t_k) = p\Big(\bigvee_l t_{k;l} \,\Big|\, \{c_{k;l}\}\Big) = 1 - \frac{\prod_l p(c_{k;l}|\neg t_{k;l})\, p(\neg t_{k;l})}{\prod_l \sum_{t_{k;l}} p(c_{k;l}|t_{k;l})\, p(t_{k;l})}. \quad (13)$$
The above equation can be derived by applying Bayes’ rule, along with an independence assumption between segments, to the probability of $t_k$ not being present. All involved probabilities are learned from a representative set of sample scenes, where $p(c_{k;l}|t_{k;l})$ expresses the classifier’s true/false positive/negative rates, which are a measure of its reliability.
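Equation (13) can be sketched numerically as follows. This is a simplified illustration assuming a single sensitivity/specificity pair and a shared per-segment presence prior (the actual system learns these quantities per object class); all parameter names are our own.

```python
def detection_belief(responses, sensitivity, specificity, prior_present):
    """Probability that at least one segment contains the object class,
    given binary classifier responses per segment, cf. eq. (13).

    responses: list of bool classifier outputs c_{k;l}, one per segment.
    sensitivity = p(c|t), specificity = p(not c|not t),
    prior_present = p(t) per segment; segments are assumed independent.
    """
    num = 1.0   # product over l of p(c_l | not t_l) * p(not t_l)
    den = 1.0   # product over l of sum over t_l of p(c_l | t_l) * p(t_l)
    for c in responses:
        p_c_not_t = (1 - specificity) if c else specificity
        p_c_t = sensitivity if c else (1 - sensitivity)
        num *= p_c_not_t * (1 - prior_present)
        den *= p_c_t * prior_present + p_c_not_t * (1 - prior_present)
    return 1.0 - num / den
```

With no positive responses the belief stays small but nonzero, reflecting the possibility of false negatives, which matches the behavior described for the table detector in section 5.2.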
5.2 Results
Figure 3: Example scene with marked segments and detection beliefs for present object classes (office=0.78, bookshelf=0.15, desk=0.07, pc=0.36, monitor=0.66, keyboard=0.48, trash bin=0.29). Even though the detection of single object classes is ambiguous, the system manages to correctly recognize an office scene based on fusing all available information.

The test set on which we evaluated the system’s performance consisted of 7 scene classes with 12 randomly³ selected scenes from the LabelMe database
each. On average, the classifiers for the 24 object
classes exhibited a sensitivity of $p(c_{k;l}|t_{k;l}) = 0.71$ and a specificity of $p(\neg c_{k;l}|\neg t_{k;l}) = 0.79$. A scene
counted as correctly classified if the highest belief was
assigned to its actual class. Overall, the system cor-
rectly classified 61% of all 84 scenes. The recognition
rate strongly varied across some scene classes, e.g.,
offices: 70% and living rooms: 43%. This can be
explained by the fact that certain scene classes con-
tained less typical objects as well as by the fact that
the classification of certain object classes is more dif-
ficult than for others. Considering the simplicity of
the detection scheme and the poor quality of many of
the displayed objects (caused by occlusion, poor seg-
mentation, low resolution, unusual perspectives, etc.)
the overall recognition rate is surprisingly high even
though more comprehensive tests are necessary for
validation. The information gain strategy for selecting
object detectors based on their expected disambiguat-
ing influence reduced the average number of detec-
tions per scene by 29% compared to random selection
for reaching the same level of uncertainty.
Figure 3 shows an office scene example that was
analyzed by the system. Starting with a vacuous scene
belief, the system first computes the expected infor-
mation gain over all object classes and selects the ta-
ble detector since tables regularly occur in different
scene classes. The table classifier is then applied to
all 7 segments, but with no positive responses, the re-
sulting detection probability is only 0.06 (not 0 due
to the possibility of false negatives). As a result, the
belief for scene classes that often contain tables de-
creases while the belief for the remaining classes increases.

³The only selection criterion was that each selected scene contained at least 3 known object classes.
creases. Once office-specific objects are detected (see
figure 2), the strong statistical evidence along with
the ontological constraints (offices generally contain
desks and electronic equipment) cause the office class
belief to reach a value of 0.78. At this point, the sys-
tem has analyzed 13 object classes and the uncertainty
level has dropped below the termination threshold of
H(m
b
T
) 1.5.
6 DISCUSSION
In this paper, we showed how scene classification can
be performed by combining the complementary infor-
mation provided by a domain ontology and a statisti-
cal model. The consistent propagation of uncertainty
from the recognition of local features up to the scene
level enables the system to draw inferences even if the
input data is very noisy. We chose Dempster-Shafer
theory as a framework for representing uncertainty
since it can be used for expressing both set-based im-
plications obtained from the domain ontology as well
as probabilistic relations. By using unnormalized be-
lief functions, we make an explicit open world as-
sumption which accounts for the fact that not every
scene can be accurately mapped to the set of modeled
scene classes.
Besides the bottom-up updating of the scene be-
lief, we presented a top-down reasoning strategy for
targeted feature selection based on information gain
maximization. This selective processing leads to a
more efficient analysis of the scene by filtering out
uninformative features, which appears to be an impor-
tant principle of how humans analyze scenes with sac-
cadic eye movements (Henderson and Hollingworth,
1999). Information gain maximization can be inter-
preted as an attention mechanism, and it reflects find-
ings in neuro-psychology showing that object recog-
nition in humans is not simply a feature-driven pro-
cess, but rather an interplay of bottom-up processing
and top-down reasoning where recognition is influ-
enced by the context (Schill et al., 2001).
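A minimal sketch of this selection criterion, using the Shannon entropy of the pignistic transform as a stand-in for the total-uncertainty measure actually used in the system; the detectors, outcome probabilities, and resulting beliefs below are invented for illustration:

```python
import math

def pignistic(m):
    """Spread each focal set's mass uniformly over its elements,
    discarding any open-world mass on the empty set."""
    p = {}
    for focal, mass in m.items():
        for x in focal:
            p[x] = p.get(x, 0.0) + mass / len(focal)
    total = sum(p.values())
    return {x: v / total for x, v in p.items()}

def entropy(m):
    """Shannon entropy of the pignistic probabilities (bits)."""
    return -sum(p * math.log2(p) for p in pignistic(m).values() if p > 0)

FRAME = frozenset({"office", "kitchen"})
m = {FRAME: 1.0}  # vacuous belief: maximal ignorance

# For each candidate detector: (outcome probability, belief after update).
outcomes = {
    # windows are equally likely in both classes: uninformative either way
    "window": [(0.6, {FRAME: 1.0}), (0.4, {FRAME: 1.0})],
    # an oven strongly indicates a kitchen; its absence favors the office
    "oven": [(0.4, {frozenset({"kitchen"}): 0.9, FRAME: 0.1}),
             (0.6, {frozenset({"office"}): 0.6, FRAME: 0.4})],
}

def expected_gain(m, outs):
    """Expected entropy reduction over the detector's possible outcomes."""
    return entropy(m) - sum(p * entropy(mb) for p, mb in outs)

best = max(outcomes, key=lambda o: expected_gain(m, outcomes[o]))
print(best)  # → oven
```

The uninformative detector yields zero expected gain, so the system always prefers the detector whose outcomes concentrate the belief, mirroring the saccade-like selective analysis described above.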
We argue that domain ontologies and statistics can
complement each other since ontologies provide a
more general description of the world whereas statis-
tics can offer more detailed but noisy information
that depends on the availability of suitable training data. The LabelMe database is a good example of the problem of partially insufficient data, because for
many scene classes it contains only a small number
of samples. A more exhaustive domain model in the
form of an ontology thus enables the system to draw
inferences about classes for which no statistics might
be available at all.
KEOD 2009 - International Conference on Knowledge Engineering and Ontology Development
A general problem in the context of scene classifica-
tion is the processing of images that only show parts
of a larger scene. Essentially, this means that it is
not possible to reason on the basis of an object class’
absence. While the explicit representation of uncer-
tainty reduces the severity of the problem in practice,
there is always a chance of misclassifying a scene
due to critical objects being out of view. A possible
solution to this problem could be to have the system
analyze images taken from different viewpoints in the
scene.
In the future, we plan to integrate the presented
scene classification system into a mobile agent (Zet-
zsche et al., 2008). Not only does this provide the system with a strong prior based on the agent's past observations, but mobility would also ease the problem
of only sensing parts of a scene. In particular, this will
require the detection of objects to be performed with-
out any pre-segmentation, which is why we are cur-
rently working on providing the system with a more
sophisticated vision module. This will also allow us
to produce more conclusive experimental results on
other data sets. Finally, we think it would be inter-
esting to see whether the generic approach of reason-
ing based on ontologies and statistics in a belief-based
framework could be applied to other domains beyond
scene classification.
ACKNOWLEDGEMENTS
This work was supported by DFG, SFB/TR8 Spa-
tial Cognition, projects A5-[ActionSpace] and I1-
[OntoSpace].
REFERENCES
Baader, F., Calvanese, D., McGuinness, D., Nardi, D., and
Patel-Schneider, P. (2003). The Description Logic
Handbook. Cambridge University Press.
Dubois, D. and Prade, H. (1986). On the unicity of Demp-
ster’s rule of combination. International Journal of
Intelligent Systems, 1(2):133–142.
Henderson, J. and Hollingworth, A. (1999). High-level
scene perception. Annual Review of Psychology,
50(1):243–271.
Horridge, M. and Patel-Schneider, P. F. (2008). Manchester
OWL syntax for OWL 1.1. OWL: Experiences and
Directions (OWLED 08 DC), Gaithersburg, Maryland.
Horrocks, I., Kutz, O., and Sattler, U. (2006). The Even
More Irresistible SROIQ. In Knowledge Representa-
tion and Reasoning (KR). AAAI Press.
Kollar, T. and Roy, N. (2009). Utilizing object-object and
object-scene context when planning to find things. In
International Conference on Robotics and Automation
(ICRA).
Konev, B., Lutz, C., Walther, D., and Wolter, F. (2009). For-
mal properties of modularisation. In Stuckenschmidt,
H., Parent, C., and Spaccapietra, S., editors, Modular
Ontologies. Springer.
Maillot, N. E. and Thonnat, M. (2008). Ontology based
complex object recognition. Image and Vision Com-
puting, 26(1):102–113.
Martínez Mozos, Ó., Triebel, R., Jensfelt, P., Rottmann, A., and Burgard, W. (2007). Supervised semantic labeling of places using information extracted from sensor data. Robotics and Autonomous Systems, 55(5):391–402.
Masolo, C., Borgo, S., Gangemi, A., Guarino, N., and
Oltramari, A. (2003). Ontologies library. WonderWeb
Deliverable D18, ISTC-CNR.
Oliva, A. and Torralba, A. (2001). Modeling the shape
of the scene: A holistic representation of the spatial
envelope. International Journal of Computer Vision,
42(3):145–175.
Pal, N., Bezdek, J., and Hemasinha, R. (1993). Uncertainty
measures for evidential reasoning II: A new measure
of total uncertainty. International Journal of Approxi-
mate Reasoning, 8(1):1–16.
Russell, B., Torralba, A., Murphy, K., and Freeman, W.
(2008). LabelMe: a database and web-based tool for
image annotation. International Journal of Computer
Vision, 77(1–3):157–173.
Schill, K., Umkehrer, E., Beinlich, S., Krieger, G., and
Zetzsche, C. (2001). Scene analysis with saccadic
eye movements: Top-down and bottom-up modeling.
Journal of Electronic Imaging, 10(1):152–160.
Schill, K., Zetzsche, C., and Hois, J. (2009). A belief-
based architecture for scene analysis: From sensori-
motor features to knowledge and ontology. Fuzzy Sets
and Systems, 160(10):1507–1516.
Schneiderman, H. and Kanade, T. (2004). Object detection
using the statistics of parts. International Journal of
Computer Vision, 56(3):151–177.
Shafer, G. (1976). A Mathematical Theory of Evidence.
Princeton University Press, Princeton, NJ.
Sirin, E., Parsia, B., Grau, B. C., Kalyanpur, A., and Katz,
Y. (2007). Pellet: A practical OWL-DL reasoner. Web
Semantics: Science, Services and Agents on the World
Wide Web, 5(2):51–53.
Smets, P. (1992). The nature of the unnormalized beliefs
encountered in the transferable belief model. In Un-
certainty in Artificial Intelligence, pages 292–297.
Smets, P. and Kennes, R. (1994). The transferable belief
model. Artificial Intelligence, 66(2):191–234.
Vernon, D. (2008). Cognitive vision: The case for embodied
perception. Image and Vision Computing, 26(1):127–
140.
Zetzsche, C., Wolter, J., and Schill, K. (2008). Sensorimo-
tor representation and knowledge-based reasoning for
spatial exploration and localisation. Cognitive Pro-
cessing, 9:283–297.