of some of the monomodal systems can be
eliminated by other ones.
To avoid this type of problem, a normalization
step can be useful, and most SVM-based systems
apply Min-Max normalization before the classifier.
In this work, the effect of several normalization
techniques on an SVM multimodal system is
explored.
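As an illustration of the Min-Max step mentioned above, the following minimal sketch maps the scores of each expert to the interval [0, 1] before fusion. The function names and the score values are hypothetical, and estimating the bounds on training scores is an assumption rather than a detail given in this paper.

```python
def fit_min_max(train_scores):
    """Estimate the Min-Max bounds from a list of training scores."""
    lo, hi = min(train_scores), max(train_scores)
    if hi == lo:
        raise ValueError("scores are constant; Min-Max is undefined")
    return lo, hi


def min_max(score, lo, hi):
    """Map a raw score linearly so that lo -> 0 and hi -> 1."""
    return (score - lo) / (hi - lo)


# Hypothetical raw scores from two experts with very different ranges.
speech_train = [-12.0, -3.0, 6.0]
face_train = [0.2, 0.5, 0.8]

speech_lo, speech_hi = fit_min_max(speech_train)
face_lo, face_hi = fit_min_max(face_train)

# After normalization both experts live on a comparable scale.
speech_norm = [min_max(s, speech_lo, speech_hi) for s in speech_train]
face_norm = [min_max(s, face_lo, face_hi) for s in face_train]
```

Test scores that fall outside the training range map outside [0, 1] with this sketch; whether to clip them is a design choice not addressed here.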
5 EXPERIMENTS
This section presents the speaker and face
recognition systems used in the fusion experiments,
together with the experimental results obtained with
the different normalization methods in an SVM
fusion system.
5.1 Experimental Setup
The monomodal scores used in the experiments have
been provided by three experts: an SVM fusion of 9
speech prosodic features, a voice-spectrum-based
speaker recognition system, and a facial recognition
expert based on the NMFFaces algorithm (Tefas et
al., 2005).
In the prosody-based recognition system, a vector
of 9 prosodic features was extracted for each
conversation side (Wolf, 1972). The system was
tested with 1 conversation side, using the k-Nearest
Neighbour method. The prosodic vectors have been
fused by means of an SVM classifier with an
RBF kernel to obtain a single monomodal score.
The spectrum-based speaker recognition system
was a 32-component GMM system with diagonal
covariance matrices; 20 Frequency Filtering
parameters were generated (Nadeu et al., 1996), and
20 corresponding delta and acceleration coefficients
were included. The UBM was trained with 116
conversations.
The face recognition expert is based on the
NMFFaces algorithm (Tefas et al., 2005). In the
work of Tefas et al., non-negative matrix
factorization is used to yield a sparse representation
of localized features that represent the constituent
facial parts over the face images.
Prosodic and spectrum scores have been obtained
from speech records of the Switchboard-I database
(Godfrey et al., 1990), and the face scores have been
obtained from still images of the XM2VTS database
(Lüttin et Maître, 1998). Switchboard-I is a
collection of 2,430 two-sided telephone
conversations among 543 speakers from the United
States. XM2VTS is a multimodal database
consisting of face images, video sequences and
speech recordings of 295 subjects. A chimeric
database has been created by combining the
three expert scores. A total of 5,000 score vectors
have been generated for the training of the models,
and 46,500 score vectors have been used in the test
phase.
5.2 Results
In the experiments, several normalization techniques
have been applied to the monomodal scores. The
normalized scores have then been fused by means of
an SVM system. The normalization methods are
those presented in previous sections: Min-Max (MM),
Z-Scores (ZS), a tanh-based technique (TANH),
histogram equalization to the best monomodal
system (HEQ), and Bi-Gaussian equalization
(BGEQ).
To compare the effect of each normalization
method on the SVM fusion system, an RBF-kernel
configuration has been tested. Specifically, three
values of the Gaussian variance σ of the RBF kernel
have been tested: 1/3, 1, and 3.
Furthermore, the regularization parameter C has
been set to 10, 100, and 200.
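The parameter sweep described above can be sketched as a 3 x 3 grid. This is only an illustration: it assumes that σ enters the kernel as exp(-||x - y||² / (2σ²)), which is one common convention and is not confirmed by the text, and the names `SIGMAS`, `CS` and `rbf_kernel` are hypothetical.

```python
import math
from itertools import product

SIGMAS = (1.0 / 3.0, 1.0, 3.0)  # kernel width values quoted in the text
CS = (10, 100, 200)             # regularization values quoted in the text


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel, ASSUMING the exp(-d^2 / (2 sigma^2)) convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Every (sigma, C) pair that an SVM would be trained with: 9 configurations.
grid = list(product(SIGMAS, CS))

# Two hypothetical normalized score vectors (prosody, spectrum, face).
u = [0.0, 0.0, 0.0]
v = [1.0, 0.0, 0.0]
similarities = {sigma: rbf_kernel(u, v, sigma) for sigma in SIGMAS}
```

With this convention a small σ makes the kernel decay quickly with distance, so the scale of the normalized scores directly changes how similar two score vectors appear to the SVM.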
The minimum error percentages provided by
the SVM verification system and the equal error
rates (EER) obtained with each normalization
technique are presented in Tables 1 and 2,
respectively, for each combination of the SVM
parameters.
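For reference, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from client and impostor scores by scanning the observed scores as thresholds. The paper does not state how its EER values were computed, so the procedure, the function names and the score values are all assumptions.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER at the threshold where |FAR - FRR| is smallest."""
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Hypothetical fused scores: higher means "more likely the claimed identity".
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

In this toy example the threshold 0.6 accepts one impostor out of four and rejects one client out of four, so the estimated EER is 25 %.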
BGEQ obtains the best results; the remaining
techniques differ from the best result by at
least 10.51 %.
Furthermore, the EER obtained by BGEQ is 5.40 %
better than that obtained by the non-equalization
techniques. In particular, Min-Max, the most widely
used normalization technique in SVM systems, is
outperformed by Bi-Gaussian equalization with a
relative error improvement of 22.19 %.
The minimum errors obtained with the
equalization techniques range from 0.533 % to
0.643 %, while the best result obtained with
Min-Max normalization is 0.826 %. Likewise,
the EERs obtained with the equalization
techniques range from 0.667 % to 0.750 %, while the
best result obtained with MM is 0.815 %. That is, in
these experiments, the selection of an adequate
normalization method has been more decisive for
obtaining the best results than the choice of the
characteristics of the SVM system.
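The 22.19 % figure quoted for Min-Max against Bi-Gaussian equalization is consistent with a relative difference between the best EERs, taken with respect to the better value. The short check below assumes that 0.667 % is the best EER of BGEQ (the text gives it only as the lower end of the equalization range) and that the difference is divided by the BGEQ value; both are inferences, not statements from the paper, and the function name is hypothetical.

```python
def relative_difference(best, other):
    """Percentage by which `other` exceeds `best`, relative to `best`."""
    return 100.0 * (other - best) / best


BGEQ_EER = 0.667  # assumed: lower end of the equalization EER range
MM_EER = 0.815    # best EER reported for Min-Max

improvement = relative_difference(BGEQ_EER, MM_EER)
print(f"{improvement:.2f} %")
```

Dividing by the Min-Max value instead would give roughly 18 %, so the choice of denominator matters when such relative figures are compared.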
SECRYPT 2007 - International Conference on Security and Cryptography