Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27.

Gui, J., Sun, Z., Wen, Y., Tao, D., and Ye, J. (2021). A review on generative adversarial networks: Algorithms, theory, and applications. IEEE Transactions on Knowledge and Data Engineering.

Helander, E., Virtanen, T., Nurminen, J., and Gabbouj, M. (2010). Voice conversion using partial least squares regression. IEEE Transactions on Audio, Speech, and Language Processing.

Huang, C.-F. and Akagi, M. (2008). A three-layered model for expressive speech perception. Speech Communication, 50(10):810–828.

Huang, X. and Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510. IEEE.

Huang, X., Liu, M.-Y., Belongie, S., and Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In The European Conference on Computer Vision (ECCV).

Kameoka, H., Kaneko, T., Tanaka, K., and Hojo, N. (2018). StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 266–273.

Kaneko, T. and Kameoka, H. (2018). CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. In 2018 26th European Signal Processing Conference (EUSIPCO), pages 2100–2104. IEEE.

Luo, Z., Takiguchi, T., and Ariki, Y. (2016). Emotional voice conversion using deep neural networks with MCC and F0 features. In 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS).

Luo, Z.-H., Chen, J., Takiguchi, T., and Sakurai, T. (2017). Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform. Journal of Audio, Speech, and Music Processing, 18.

Morise, M., Kawahara, H., and Katayose, H. (2009). Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech. Paper 11.

Morise, M., Yokomori, F., and Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99-D(7):1877–1884.

Olaronke, I. and Ikono, R. (2017). A systematic review of emotional intelligence in social robots.

Scherer, K. R., Banse, R., Wallbott, H. G., and Goldbeck, T. (1991). Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15(2):123–148.

Schröder, M. (2006). Expressing degree of activation in synthetic speech. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1128–1136.

Shah, N., Singh, M. K., Takahashi, N., and Onoe, N. (2023). Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing.

Sisman, B., Zhang, M., and Li, H. (2019). Group sparse representation with WaveNet vocoder adaptation for spectrum and prosody conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(6):1085–1097.

Tao, J., Kang, Y., and Li, A. (2006). Prosody conversion from neutral speech to emotional speech. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1145–1154.

Toda, T., Black, A. W., and Tokuda, K. (2007). Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, 15(8):2222–2235.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. S. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv, abs/1607.08022.

Xue, Y., Hamada, Y., and Akagi, M. (2018). Voice conversion for emotional speech: Rule-based synthesis with degree of emotion controllable in dimensional space. Speech Communication.

Zhou, K., Sisman, B., Liu, R., and Li, H. (2021). Emotional voice conversion: Theory, databases and ESD. arXiv.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232. IEEE.
ICINCO 2023 - 20th International Conference on Informatics in Control, Automation and Robotics