between random variables. BNs are not suitable to
represent symmetric relationships that mutually relate
random variables. RFs are natural methods to model
symmetric relationships, but they are not suitable to
model causal or part-of relationships.
Spatial and hierarchical relationships are two
valuable cues for image interpretation of man-made
scenes. In this paper we will develop a consistent
graphical model representation for image interpreta-
tion that includes both information about the spatial
structure and the hierarchical structure. We assume
some preprocessing leads to regions, either as a parti-
tioning of the image area or as a set of overlapping or
non-overlapping segments. The key idea for integrat-
ing the spatial and the hierarchical structural informa-
tion into the interpretation process is to combine them
with the low-level region class probabilities in a clas-
sification process by constructing the graphical model
on the multi-scale image regions.
The remainder of this paper is organized as follows.
Related work is discussed in Sec. 2. In Sec. 3,
the statistical model for the interpretation problem is
formulated. The relations to previous models are
discussed in Sec. 4. In Sec. 5, experimental results are
presented. Finally, Sec. 6 concludes this work.
2 RELATED WORK
There are many recent works on contextual models
that exploit the spatial structures in the image. Mean-
while, the use of multiple different over-segmented
images as a preprocessing step is not new to computer
vision. In the context of multi-class image
classification, (Plath et al., 2009) couple local and
global evidence in two ways: by constructing a
tree-structured CRF on image regions at multiple
scales, and by using global image classification
information. However, (Plath et al., 2009) neglect
direct local neighborhood dependencies. The work
of (Schnitzspan et al., 2008) extends the classical
one-layer CRF to a multi-layer CRF by restricting the
pairwise potentials to a regular 4-neighborhood model
and introducing higher-order potentials between
different layers.
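
As a minimal sketch of what restricting the pairwise potentials to a regular 4-neighborhood means, the following Python snippet enumerates the pairwise edges of a small grid. The helper name `grid_4_neighborhood_edges` and the grid size are illustrative assumptions. This is not the implementation of (Schnitzspan et al., 2008), and it does not cover their higher-order potentials between layers.

```python
def grid_4_neighborhood_edges(height, width):
    """Return undirected edges between 4-connected sites of a grid.

    Sites are (row, col) tuples. Each edge is listed once, linking a site
    to its right and lower neighbor (hypothetical helper for illustration).
    """
    edges = []
    for r in range(height):
        for c in range(width):
            if c + 1 < width:   # horizontal neighbor
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < height:  # vertical neighbor
                edges.append(((r, c), (r + 1, c)))
    return edges


edges = grid_4_neighborhood_edges(2, 3)
print(len(edges))  # 2*(3-1) + 3*(2-1) = 7 pairwise edges
```

Diagonal sites are never linked, so every pairwise potential in such a layer couples only horizontally or vertically adjacent sites.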
Although not as popular as CRFs, BNs have
also been used to solve computer vision problems
(Mortensen & Jia, 2006; Sarkar & Boyer, 1993). BNs
provide a systematic way to model the causal
relationships among the entities. By explicitly exploiting
the conditional independence relationships (known as
prior knowledge) encoded in the structure, BNs can
simplify the modelling of joint probability
distributions. Based on the BN structure, the joint
probability is decomposed into a product of local
conditional probabilities, which are much easier to
specify because of their semantic meaning (Zhang & Ji,
2010).
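
For reference, this decomposition is the standard BN factorization. The notation below (variables $x_1, \dots, x_n$ and parent sets $\mathrm{pa}(x_i)$) is introduced here only for illustration and may differ from the notation used in the rest of the paper.

```latex
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
```

For a variable without parents, $\mathrm{pa}(x_i)$ is empty and the corresponding factor reduces to the prior $P(x_i)$.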
Graphical models have reached a state where both
hierarchical and spatial neighborhood structures can
be efficiently handled. RFs and BNs are suitable for
representing different types of statistical relationships
among the random variables. Yet only a few previous
works focus on integrating RFs with BNs. In (Ku-
mar & Hebert, 2003b), the authors present a genera-
tive model based approach to man-made structure de-
tection in 2D natural images. They use a causal multi-
scale random field as a prior model on the class labels.
Labels over an image are generated using Markov
chains defined over coarse to fine scales. However,
the spatial neighborhood relationships are only con-
sidered at the bottom scale. So, essentially, this model
is a tree-structured belief network plus a flat Markov
random field. Recently, a unified graphical model
that can represent both the causal and noncausal
relationships among the random variables was proposed
in (Zhang & Ji, 2010). They first employ a CRF to
model the spatial relationships among the image
regions and their measurements. Then, they introduce
a multilayer BN to model the causal dependencies.
The CRF model and the BN model are then combined
through the theory of factor graphs to form a
unified probabilistic graphical model. Although their
model improves on state-of-the-art results on the
Weizmann horse dataset and the MSRC dataset, it
requires a lot of expert domain knowledge to design
the local constraints. Also, they use a combination of
supervised parameter learning and manual parameter
setting for the model parameterization. Simultaneously
learning the BN and CRF parameters automatically from
the training data is not a trivial task. Compared to
the graphical models in (Kumar & Hebert, 2003b),
which are too simple, the graphical models in (Zhang
& Ji, 2010) are too complex in general. Our graphical
model lies in between, cf. Fig. 1. We aim for a
graphical model that is rich enough to capture the
relationships among neighboring pixels and image
regions in the scene, yet simple enough to keep
parameter learning and probabilistic inference
efficient. Furthermore, our model has a clear
semantic meaning. If the
undirected edges are ignored, meaning no spatial rela-
tionships are considered, the graph is a tree represent-
ing the hierarchy of the partonomy among the scales.
Within each scale, the spatial regions are connected
by the pairwise edges.
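
The structure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the region names, the three scales, the single top-level region, and the helpers `build_graph` and `is_tree` are all hypothetical and are not taken from the paper. Directed edges encode the part-of hierarchy across scales, and undirected edges encode spatial neighborhood within a scale.

```python
from collections import defaultdict


def scale_of(region):
    """Scale tag of a region name such as 's2_r1' (hypothetical naming)."""
    return region.split("_")[0]


def build_graph(parent, neighbors):
    """Return (directed, undirected) edge lists.

    parent: dict mapping each region to its parent at the next coarser scale.
    neighbors: iterable of region pairs that are adjacent within one scale.
    """
    directed = [(p, c) for c, p in parent.items()]
    for a, b in neighbors:
        if scale_of(a) != scale_of(b):
            raise ValueError("spatial edges must stay within one scale")
    undirected = sorted({tuple(sorted(pair)) for pair in neighbors})
    return directed, undirected


def is_tree(nodes, directed):
    """True if the directed edges form one rooted tree covering all nodes."""
    children = defaultdict(list)
    has_parent = set()
    for p, c in directed:
        if c in has_parent:  # a second parent would break the tree
            return False
        has_parent.add(c)
        children[p].append(c)
    roots = [n for n in nodes if n not in has_parent]
    if len(roots) != 1:
        return False
    seen, stack = set(), [roots[0]]
    while stack:
        n = stack.pop()
        seen.add(n)
        stack.extend(children[n])
    return seen == set(nodes)


# Toy example: one region at the coarsest scale, two below, four at the bottom.
parent = {
    "s1_r0": "s0_r0", "s1_r1": "s0_r0",
    "s2_r0": "s1_r0", "s2_r1": "s1_r0",
    "s2_r2": "s1_r1", "s2_r3": "s1_r1",
}
neighbors = [("s1_r0", "s1_r1"), ("s2_r0", "s2_r1"),
             ("s2_r1", "s2_r2"), ("s2_r2", "s2_r3")]
nodes = sorted(set(parent) | set(parent.values()))
directed, undirected = build_graph(parent, neighbors)
tree_without_spatial_edges = is_tree(nodes, directed)
```

In this toy example, dropping the undirected edges leaves a tree over the seven regions, while the four undirected edges connect neighboring regions within the middle and bottom scales.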