MODEL-MAPPING BASED VOICE CONVERSION SYSTEM - A Novel Approach to Improve Voice Similarity and Naturalness using Model-based Speech Synthesis Techniques

Baojie Li, Dalei Wu, Hui Jiang

2010

Abstract

In this paper we present a novel voice conversion application in which no knowledge of the source speakers is available; only sufficient utterances from a target speaker and a number of other speakers are at hand. Our approach consists of two separate stages. At the training stage, we estimate a speaker-dependent (SD) Gaussian mixture model (GMM) for the target speaker, and we also estimate a speaker-independent (SI) GMM using data from a number of speakers other than the source speaker. A mapping between the SD and the SI model is maintained during training for each phone label. At the conversion stage, we use the SI GMM to recognize each input frame and find the closest Gaussian mixture component for it. Next, according to a mapping list, the counterpart Gaussian of the SD GMM is obtained and used to generate a parameter vector for that frame. Finally, all the generated vectors are concatenated to synthesize speech of the target speaker. With the proposed model-mapping approach, we can avoid the over-fitting problem by keeping the number of mixtures of the SI GMM at a fixed value, and at the same time improve voice quality in terms of similarity and naturalness by increasing the number of mixtures of the SD GMM. Experiments showed the effectiveness of this method.
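
The conversion stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: diagonal covariances, "closest" taken to mean the highest weighted log-likelihood, and the mean of the mapped SD Gaussian standing in for the generated parameter vector. The names `log_gaussian`, `convert`, `si_components`, `mapping`, and `sd_means` are hypothetical, and the per-phone organisation of the mapping list is omitted.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def convert(frames, si_components, mapping, sd_means):
    """Replace each input frame by the mean of the SD Gaussian paired
    with the best-scoring SI Gaussian (assumed selection criterion)."""
    converted = []
    for x in frames:
        # Score every SI component: log weight + log density of the frame.
        best = max(
            range(len(si_components)),
            key=lambda k: math.log(si_components[k][0])
            + log_gaussian(x, si_components[k][1], si_components[k][2]),
        )
        # Look up the counterpart SD Gaussian through the mapping list.
        converted.append(sd_means[mapping[best]])
    return converted


# Toy data (made up for illustration): two SI components as
# (weight, mean, variance), and a mapping from SI index to SD index.
si_components = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.5, [5.0, 5.0], [1.0, 1.0]),
]
mapping = {0: 1, 1: 0}
sd_means = [[10.0, 10.0], [-10.0, -10.0]]

frames = [[0.2, -0.1], [4.8, 5.3]]
converted = convert(frames, si_components, mapping, sd_means)
```

In this toy setup the first frame lies nearest SI component 0 and the second nearest component 1, so each is replaced by the mean of the SD Gaussian that the mapping pairs with it. Because the SI model is only used for lookup, the SD model can hold more components than the SI model without changing this loop.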



Paper Citation


in Harvard Style

Li B., Wu D. and Jiang H. (2010). MODEL-MAPPING BASED VOICE CONVERSION SYSTEM - A Novel Approach to Improve Voice Similarity and Naturalness using Model-based Speech Synthesis Techniques. In Proceedings of the Third International Conference on Bio-inspired Systems and Signal Processing - Volume 1: BIOSIGNALS, (BIOSTEC 2010), ISBN 978-989-674-018-4, pages 442-446. DOI: 10.5220/0002747104420446


in Bibtex Style

@conference{biosignals10,
author={Baojie Li and Dalei Wu and Hui Jiang},
title={MODEL-MAPPING BASED VOICE CONVERSION SYSTEM - A Novel Approach to Improve Voice Similarity and Naturalness using Model-based Speech Synthesis Techniques},
booktitle={Proceedings of the Third International Conference on Bio-inspired Systems and Signal Processing - Volume 1: BIOSIGNALS, (BIOSTEC 2010)},
year={2010},
pages={442-446},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002747104420446},
isbn={978-989-674-018-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Third International Conference on Bio-inspired Systems and Signal Processing - Volume 1: BIOSIGNALS, (BIOSTEC 2010)
TI - MODEL-MAPPING BASED VOICE CONVERSION SYSTEM - A Novel Approach to Improve Voice Similarity and Naturalness using Model-based Speech Synthesis Techniques
SN - 978-989-674-018-4
AU - Li B.
AU - Wu D.
AU - Jiang H.
PY - 2010
SP - 442
EP - 446
DO - 10.5220/0002747104420446