contrary, the classifier fusion method has the key advantage of simplicity, because the recognizers are independent, but its drawback is that it cannot utilize the feature vectors directly. The simplest form of classifier fusion is the weighted integration of the two recognition results.
In recent studies, several improvements to these methods have been proposed: a method that uses the information of each modality asynchronously (Alissali et al., 1996); methods that dynamically calculate a suitable weight according to the confidence of each modality (Heckmann et al., 2001), (Glotin et al., 2001), (Ghosh et al., 2001); and a method that adds a new recognizer in order to combine the individual recognition results. The usual procedure for the fusion recognition process is as follows. From each modality, a feature vector is extracted from the received input data. A recognition score is then estimated by the recognition algorithm of each modality. Finally, a weight is applied to the recognition score of each modality in order to integrate the modalities (Heckmann et al., 2001), (Glotin et al., 2001), (Ghosh et al., 2001), (Kim et al., 2003), (Kwak et al., 2006).
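As an illustration of the final step of this procedure, the following minimal Python sketch integrates the per-class scores of the two recognizers with a fixed weight; the class labels, score values, and the weight of 0.6 are illustrative assumptions, not values taken from the cited systems.

    def fuse_scores(speech_scores, gesture_scores, w=0.6):
        # Weight the score of each modality and integrate them per class.
        return {c: w * speech_scores[c] + (1.0 - w) * gesture_scores[c]
                for c in speech_scores}

    speech_scores = {"go": 0.7, "stop": 0.3}    # output of the speech recognizer
    gesture_scores = {"go": 0.4, "stop": 0.6}   # output of the gesture recognizer
    fused = fuse_scores(speech_scores, gesture_scores)
    decision = max(fused, key=fused.get)        # "go" in this example

A dynamic-weight scheme such as that of (Heckmann et al., 2001) would replace the fixed w with a value derived from the confidence of each modality.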
One of the most important problems in multimodal fusion recognition is how to integrate the two modalities. Usually, the number of samples has to be large enough to yield an acceptable probability density distribution (PDD) before the recognition process can be carried out.
In this paper, a discrete probability density function (PDF) estimated by a histogram is used extensively for speech-gesture multimodal fusion. The discrete PDF estimated by a histogram is known to approximate a distribution well when the number of samples is sufficient, but the drawback of this approach is its inherent discontinuity. To avoid this problem, this paper proposes to use the integrated discrete PDF instead of the discrete PDF itself. In this way, a reasonable estimate can be obtained over the entire range of values, even when the number of samples is not sufficiently large.
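A minimal sketch of this idea follows, assuming one-dimensional scores; the synthetic sample data, bin count, and linear interpolation between bin edges are illustrative choices, not the exact construction of Section 3.

    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.normal(0.0, 1.0, 50)       # deliberately few samples

    counts, edges = np.histogram(samples, bins=10)
    pdf = counts / counts.sum()              # discrete PDF: piecewise constant
    cdf = np.cumsum(pdf)                     # integrated discrete PDF: monotone

    def integrated_pdf(x):
        # Interpolating the integral gives a continuous estimate for every
        # value in the range, even between sparsely populated bins.
        return np.interp(x, edges[1:], cdf, left=0.0, right=1.0)

    print(integrated_pdf(0.5))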
The proposed method is tested with a microphone and a 3-axis accelerometer in a real-time environment. The test compares two methods: a simple method that adds and accumulates the speech and gesture probability density distributions (PDDs) separately, and a more sophisticated method that creates a new probability density distribution by integrating the two PDDs of speech and gesture.
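The two strategies can be sketched as follows; since the paper's exact integration of the two PDDs is described in Section 3, the 2-D histogram below is only one plausible reading, and all sample data are synthetic.

    import numpy as np

    rng = np.random.default_rng(1)
    s = rng.normal(0.6, 0.1, 200)   # synthetic speech scores for one class
    g = rng.normal(0.5, 0.1, 200)   # synthetic gesture scores for one class

    # Method 1: estimate the two PDDs separately and add-and-accumulate them.
    s_pdf, s_edges = np.histogram(s, bins=10, density=True)
    g_pdf, g_edges = np.histogram(g, bins=10, density=True)

    def fused_add(x, y):
        i = np.clip(np.searchsorted(s_edges, x) - 1, 0, s_pdf.size - 1)
        j = np.clip(np.searchsorted(g_edges, y) - 1, 0, g_pdf.size - 1)
        return s_pdf[i] + g_pdf[j]

    # Method 2: create one new PDD over the joint (speech, gesture) scores.
    joint, sx, gx = np.histogram2d(s, g, bins=10, density=True)

    def fused_joint(x, y):
        i = np.clip(np.searchsorted(sx, x) - 1, 0, joint.shape[0] - 1)
        j = np.clip(np.searchsorted(gx, y) - 1, 0, joint.shape[1] - 1)
        return joint[i, j]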
The integrated PDF method proposed in this paper shows a performance improvement of about 3% compared to the add-and-accumulate method.
In Section 2, the proposed speech-gesture fusion recognition system is described. Section 3 explains the multimodal fusion algorithm using the integrated PDF. The utterance and gesture model lists used in the experiments are described in Section 4. The experimental results are given in Section 5. Finally, conclusions are drawn in Section 6.
2 SYSTEM ARCHITECTURE
The speech-gesture fusion recognition system architecture implemented for the experiments in this paper is shown in Figures 1 and 2.
Figure 1: Multi-modal fusion system using add-and-accumulated PDDs.
Figure 2: Multi-modal fusion system using integrated PDDs.
The architecture consists of a sensor module, a feature extraction module, an independent recognition module, and a fusion recognition module. The sensor module consists of a 3-axis accelerometer and a microphone, which capture the gesture and speech input data. The feature extraction module extracts the feature vectors from the speech and gesture input data. The speech feature extraction module comprises two submodules: a start- and end-point detection module based on frame energy, and a feature extraction module based on Zero-Crossing with Peak Amplitude (ZCPA) and the RelAtive SpecTrAl (RASTA) algorithm.
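The core of ZCPA can likewise be sketched. The full algorithm first applies a bank of band-pass filters, which this simplified single-band version omits, and the frequency-bin layout here is an illustrative choice.

    import numpy as np

    def zcpa_histogram(frame, fs=16000, n_bins=16):
        # Upward zero-crossings: the inverse interval length estimates frequency.
        up = np.flatnonzero((frame[:-1] < 0) & (frame[1:] >= 0))
        hist = np.zeros(n_bins)
        for a, b in zip(up[:-1], up[1:]):
            freq = fs / (b - a)
            peak = np.abs(frame[a:b]).max()       # peak amplitude in the interval
            k = min(int(freq / (fs / 2) * n_bins), n_bins - 1)
            hist[k] += np.log1p(peak)             # log-compressed peak weight
        return hist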