ever, it is highly appropriate to determine the location of the eyes given a pure face image or a face region within a complex image. To some extent, this approach is similar to the Pictorial Structures that Felzenszwalb and Huttenlocher (Felzenszwalb and Huttenlocher, 2000) elaborate on, because we also define a tree-like structure in which a superordinate element (the face) contains subordinate elements (eyes, mouth, etc.) and in which the geometric relations between these elements are given.
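To make this structure concrete, the following minimal Python sketch (our own illustration, not code from the cited work) models such a tree of facial elements; the element names and the normalized offsets are purely illustrative placeholders.

```python
# Sketch of a tree-like structure: a superordinate element (face) holds
# subordinate elements (eyes, mouth) plus a geometric relation, here
# simplified to an expected offset from the parent's position.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FacialElement:
    name: str
    # Expected offset (dx, dy) relative to the parent, in normalized
    # face coordinates; the values used below are illustrative only.
    offset: Tuple[float, float] = (0.0, 0.0)
    children: List["FacialElement"] = field(default_factory=list)

    def locate(self, parent_pos: Tuple[float, float]):
        """Propagate a position estimate down the tree."""
        x = parent_pos[0] + self.offset[0]
        y = parent_pos[1] + self.offset[1]
        yield self.name, (x, y)
        for child in self.children:
            yield from child.locate((x, y))

face = FacialElement("face", children=[
    FacialElement("left_eye",  (-0.2, -0.15)),
    FacialElement("right_eye", ( 0.2, -0.15)),
    FacialElement("mouth",     ( 0.0,  0.25)),
])

for name, pos in face.locate((0.5, 0.5)):
    print(name, pos)
```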
Skin Color Extraction. Acquires reliable information about the face and the facial components, as opposed to raw pixel values. It provides evidence about the location and the contour lines of skin-colored parts, on which subsequent steps rely. Unfortunately, skin color varies with the scenery, the person, and the technical equipment, which challenges automatic detection. As in our previous work (Wimmer et al., 2006), a high-level vision module determines an image-specific skin color model, on which the actual skin color classification is based. This color model represents the context conditions of the image, and dynamic skin color classifiers adapt to it. Therefore, our approach makes it possible to distinguish skin color from very similar colors, such as lip color or eyebrow color, see Figure 4. Our approach makes use of this concept because it clearly extracts the borders of the skin regions, and subsequent steps fit the contour model to these borders with high accuracy.
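As a rough illustration of such an image-specific model (a simplified sketch, not the actual module of Wimmer et al., 2006), one could fit a Gaussian color model to pixels sampled from a region assumed to be skin and classify all remaining pixels by their Mahalanobis distance to that model; the threshold value is an assumption.

```python
# Sketch: estimate a Gaussian skin color model from a seed region,
# then classify every image pixel against it.
import numpy as np

def fit_skin_model(seed_pixels: np.ndarray):
    """seed_pixels: (N, 3) RGB values sampled from a face region."""
    mean = seed_pixels.mean(axis=0)
    cov = np.cov(seed_pixels, rowvar=False) + 1e-6 * np.eye(3)
    return mean, np.linalg.inv(cov)

def classify_skin(image: np.ndarray, mean, inv_cov, threshold=3.0):
    """image: (H, W, 3); returns a boolean skin mask."""
    diff = image.reshape(-1, 3) - mean
    # Squared Mahalanobis distance of every pixel to the skin model.
    d2 = np.einsum("ni,ij,nj->n", diff, inv_cov, diff)
    return (d2 < threshold**2).reshape(image.shape[:2])
```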
The Objective Function. f(I, p) yields a comparable value that specifies how accurately a parameterized model p matches an image I. It is also known as the likelihood, similarity, energy, cost, goodness, or quality function. Without loss of generality, we consider lower values to denote a better model fit. Traditionally, objective functions are specified manually by first selecting a small number of simple image features, such as edges or corners, and then formulating mathematical calculation rules. Afterwards, their appropriateness is determined subjectively by inspecting the result on example images and example model parameterizations. If the result is not satisfactory, the function is tuned or redesigned from scratch. This heuristic approach relies on the designer's intuition about a good measure of fitness. Our earlier publications (Wimmer et al., 2007b; Wimmer et al., 2007a) show that this methodology is error-prone and tedious.
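For illustration, a hand-crafted objective function of this kind might reward strong image gradients beneath the model's contour points. The sketch below is our own example of such a manually designed measure (the grayscale input and contour-point representation are assumptions), not a function from the cited publications.

```python
# Sketch of a manually designed objective function: lower return
# values denote a better model fit, following the convention above.
import numpy as np

def objective(image: np.ndarray, contour_points: np.ndarray) -> float:
    """image: (H, W) grayscale; contour_points: (N, 2) array of (x, y)."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    xs = contour_points[:, 0].round().astype(int).clip(0, image.shape[1] - 1)
    ys = contour_points[:, 1].round().astype(int).clip(0, image.shape[0] - 1)
    # Negate: strong edges under the contour yield a low (good) value.
    return -float(magnitude[ys, xs].mean())
```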
To avoid this, we propose to learn the objective
function from annotated example images. Our ap-
proach splits up the generation of the objective func-
tion into several independent steps that are mostly
automated. This provides several benefits: first, au-
tomated steps replace the labor-intensive design of
the objective function.

Figure 4: Deriving skin color from the camera image (left) using the non-adaptive classifier (center) and adapting the classifier to the person and to the context (right). The numbers indicate the percentage of correctly identified skin color (s) and background (b), non-adaptive vs. adaptive: Sequence 4: s 64%, b 50% vs. s 67%, b 99%; Sequence 6: s 100%, b 11% vs. s 89%, b 100%; Sequence 8: s 89%, b 11% vs. s 87%, b 99%; Sequence 11: s 37%, b 98% vs. s 69%, b 100%; Sequence 15: s 100%, b 11% vs. s 84%, b 94%. These images have been extracted from image sequences of the Boston University Skin Color Database (Sigal et al., 2000).

Second, this approach is less error-prone, because giving examples of a good fit is much easier than explicitly specifying rules that cover all examples. Third, this approach does not require any expert knowledge and is therefore generally applicable and not domain-dependent. The bottom line is that this approach yields more robust and accurate objective functions, which greatly facilitate the task of the fitting algorithms. For a detailed description of our approach, we refer to (Wimmer et al., 2007b).
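The sketch below illustrates the general idea under simplifying assumptions; it is not the procedure of (Wimmer et al., 2007b). Displaced versions of annotated ideal parameters are labeled with their distance to the ideal, and a regressor trained on image features, here a hypothetical feature_fn, serves as the learned objective function.

```python
# Sketch: learn an objective function from annotated example images.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def training_data(images, ideal_params, feature_fn, n_samples=50, scale=0.1):
    X, y = [], []
    rng = np.random.default_rng(0)
    for image, p_star in zip(images, ideal_params):
        for _ in range(n_samples):
            # Displace the annotated ideal parameters p_star.
            p = p_star + rng.normal(0.0, scale, size=p_star.shape)
            X.append(feature_fn(image, p))         # local image features
            y.append(np.linalg.norm(p - p_star))   # target: distance to ideal
    return np.asarray(X), np.asarray(y)

# Usage (images, ideal_params, and feature_fn are assumed to be given):
# X, y = training_data(images, ideal_params, feature_fn)
# regressor = DecisionTreeRegressor(max_depth=8).fit(X, y)
# learned_objective = lambda img, p: regressor.predict([feature_fn(img, p)])[0]
```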
The Fitting Algorithm. Searches for the model that best describes the face visible in the image. To do so, it needs to find the model parameters that minimize the objective function. Fitting algorithms have been the subject of intensive research and evaluation, e.g. Simulated Annealing, Genetic Algorithms, Particle Filtering, RANSAC, CONDENSATION, and CCD. We refer to (Hanek, 2004) for a recent overview and categorization. Since we adapt the objective function rather than the fitting algorithm to the specifics of the face interpretation scenario, we are able to use any of these standard fitting algorithms.
Since emotion interpretation applications mostly require real-time capabilities, our experiments in Section 5 have been conducted with a quick hill climbing algorithm.
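For illustration, the following minimal sketch shows a generic hill climbing fitter of this kind; it is an assumed, simplified variant, not necessarily the exact algorithm used in the experiments. It greedily accepts random local perturbations of the parameter vector that lower the objective value.

```python
# Sketch of a quick hill climbing fitter over model parameters p.
import numpy as np

def hill_climb(objective, image, p0, step=0.05, iterations=200, seed=0):
    rng = np.random.default_rng(seed)
    p, best = p0.copy(), objective(image, p0)
    for _ in range(iterations):
        candidate = p + rng.normal(0.0, step, size=p.shape)
        value = objective(image, candidate)
        if value < best:            # lower objective value = better fit
            p, best = candidate, value
    return p, best
```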