MODEL-MAPPING BASED VOICE CONVERSION SYSTEM
A Novel Approach to Improve Voice Similarity and Naturalness using Model-based
Speech Synthesis Techniques
Baojie Li, Dalei Wu and Hui Jiang
Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, Canada
Keywords:
Voice conversion, HMM-based speech synthesis, GMM, Model mapping.
Abstract:
In this paper we present a novel voice conversion application in which no knowledge of the source speakers is
available, and only sufficient utterances from a target speaker and a number of other speakers are in hand. Our
approach consists of two separate stages. At the training stage, we estimate a speaker dependent (SD) Gaussian
mixture model (GMM) for the target speaker, and we also estimate a speaker independent (SI)
GMM by using the data from a number of speakers other than the source speaker. A mapping correspondence
between the SD and the SI models is maintained during the training process in terms of each phone label. At
the conversion stage, we use the SI GMM to recognize each input frame and find the closest Gaussian mixture
component for it. Next, according to a mapping list, the counterpart Gaussian of the SD GMM is obtained and then
used to generate a parameter vector for each input frame. Finally, all the generated vectors are concatenated
to synthesize speech of the target speaker. By using the proposed model-mapping approach, we can not
only avoid the over-fitting problem by keeping the number of mixtures of the SI GMM to a fixed value, but
also simultaneously improve voice quality in terms of similarity and naturalness by increasing the number of
mixtures of the SD GMM. Experiments showed the effectiveness of this method.
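The conversion stage described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state: diagonal covariances, "closest" taken to mean the highest weighted log-likelihood, a one-to-one mapping list, and the mean of the mapped SD Gaussian used as the generated parameter vector. The names `closest_component`, `convert` and `mapping` are hypothetical, and smoothing and waveform synthesis are omitted.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def closest_component(x, gmm):
    """Index of the component with the highest weighted log-likelihood.

    Assumption: this is what "closest Gaussian" means; the text does not
    define the distance measure.
    """
    scores = [
        math.log(c["weight"]) + log_gauss_diag(x, c["mean"], c["var"])
        for c in gmm
    ]
    return max(range(len(gmm)), key=scores.__getitem__)


def convert(frames, si_gmm, sd_gmm, mapping):
    """Replace each source frame by the mean of its mapped SD Gaussian.

    `mapping` (hypothetical) sends an SI component index to one SD
    component index; using the SD mean as the output is an assumption.
    """
    converted = []
    for x in frames:
        k = closest_component(x, si_gmm)
        converted.append(list(sd_gmm[mapping[k]]["mean"]))
    return converted


# Toy two-dimensional example with made-up parameters.
si_gmm = [
    {"weight": 0.5, "mean": [0.0, 0.0], "var": [1.0, 1.0]},
    {"weight": 0.5, "mean": [5.0, 5.0], "var": [1.0, 1.0]},
]
sd_gmm = [
    {"mean": [10.0, 10.0]},
    {"mean": [-10.0, -10.0]},
]
mapping = {0: 1, 1: 0}
frames = [[0.2, -0.1], [4.8, 5.3]]
converted = convert(frames, si_gmm, sd_gmm, mapping)
```

In this sketch the SI model is only used to index the frames, so the number of SD Gaussians could be raised without changing how source frames are recognized, provided the mapping list is extended accordingly.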
1 INTRODUCTION
Voice conversion (VC) is a technique that converts
the voice of a source speaker to that of a target
speaker. Generally speaking, text-dependent and
text-independent voice conversion are the two main
research directions. In text-dependent voice
conversion, the target voice can be produced with
high correctness and acceptable smoothness
based on the transcription provided for the input
speech waveform, e.g. (Yoshimura, 2002). By contrast,
text-independent systems have no knowledge
of the transcription of the input waveform; therefore
more mismatches between source and target speakers
are present and the quality of the generated speech
degrades. Because of this difficulty, text-independent
voice conversion attracts a wider range of studies. The
techniques presented in this paper also focus on
text-independent voice conversion.
In the field of text-independent voice conversion,
some form of transform is usually estimated from
training data of both source and target speakers, such
as K-means clustering in VTLN-based voice conversion
(Suedermann et al., 2003), codebook based mapping
(Arslan et al., 1999) and GMM based clustering
(Ye et al., 2006). In some applications, however,
no knowledge about the source speakers is available
beforehand. It is therefore impossible to estimate
the transforms between source and target speakers
using the conventional techniques. In our previous
work, we built a GMM-based VC system using
hidden Markov model (HMM) based speech synthesis
to address this particular requirement. At the
training stage, an SD GMM is trained for the target
speaker using his/her pre-recorded training data. At
the conversion stage, for each utterance from a source
speaker, the best-matched Gaussian mixture components
are chosen from the GMM. Next, the mean vectors of the
selected components are concatenated, smoothed and then
sent as inputs to the sound synthesizer, which is
provided by the HTS engine (Tokuda et al., 2000; Yoshimura
et al., 2002). In experiments, we found that this
approach was capable of conducting voice conversion
with acceptable quality. However, we also found
noticeable discontinuity and flatness in the synthesized
voices. Through investigation, we found that the
Li B., Wu D. and Jiang H. (2010).
MODEL-MAPPING BASED VOICE CONVERSION SYSTEM - A Novel Approach to Improve Voice Similarity and Naturalness using Model-based
Speech Synthesis Techniques.
In Proceedings of the Third International Conference on Bio-inspired Systems and Signal Processing, pages 442-446
DOI: 10.5220/0002747104420446
Copyright © SciTePress