Figure 3: The vocal-tract models, VTM-T20. Vowels /i/,
/e/, /a/, /o/, and /u/ (from left to right).
Figure 4: A model of the human vocal tract with a flexible
tongue and a movable mandible.
Figure 3 shows a set of vocal-tract models
called VTM-T20 (Arai, 2016). When glottal
sounds are fed into the bottom end of each tube,
different vowels can be heard from the top end. While
the shape of the vocal tract mainly determines what
vowel sound is produced (i.e., “articulation”), the
glottal sound mainly determines the pitch (height) of
the voice and the voice quality. Thus, vocal tracts act
as resonators, and their shapes produce different
speech sounds.
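To make this resonance idea concrete, the following Python sketch passes a glottal pulse train (the source) through two second-order resonators that stand in for the first two formants of a vowel (the filter). The sampling rate, fundamental frequency, formant frequencies, and bandwidths are illustrative assumptions, not measurements of the VTM-T20 models.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000   # sampling rate (Hz)
    f0 = 120     # fundamental frequency of the glottal source (Hz)
    dur = 0.5    # duration (s)

    # Source: an impulse train at f0, a crude stand-in for glottal pulses.
    source = np.zeros(int(fs * dur))
    source[::fs // f0] = 1.0

    def resonator(x, fc, bw, fs):
        # Second-order all-pole resonator at centre frequency fc with bandwidth bw.
        r = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * fc / fs
        return lfilter([1.0], [1.0, -2 * r * np.cos(theta), r ** 2], x)

    # Filter: a cascade of two resonators, roughly F1 and F2 of /a/ (assumed values).
    vowel = resonator(resonator(source, 700, 130, fs), 1200, 70, fs)
    vowel /= np.abs(vowel).max()   # normalize before saving or playback

Raising or lowering f0 changes the perceived pitch without changing the vowel, while changing the resonator frequencies changes the vowel without changing the pitch, mirroring the division of labor between the glottal source and the vocal-tract shape described above.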
A real vocal tract changes its shape over time.
Therefore, our model in Fig. 4 has a flexible tongue
and a movable mandible (Arai, 2020). By
manipulating the tongue configuration and adjusting
the jaw opening, we can produce different speech
sounds dynamically with this model.
3 SPEECH CHAIN IN THE DIGITAL ERA
As described above, the speech chain illustrates how
the human speech communication system works, with
each event linked to the next like a chain. The original
speech chain depicts a simple situation, but it can be
extended to many scenarios. When we talk over a telephone
network, acoustic signals are fed into the telephone,
converted into electric signals, and transmitted over
the network. In human-computer communication, the
speaker can be a speech synthesis system, or the
listener can be an automatic speech recognition
system. A speech synthesis system can improve the
quality of life of people who have lost the ability to
talk, and an automatic speech recognition system can
be a great help to people who have impaired hearing.
In the following section, I will describe the “My
Voice” project, in which I worked with a patient, as
an example of the speech chain in the digital era.
3.1 What Is My Voice?
We lose our voices for various reasons, one of which
is amyotrophic lateral sclerosis (ALS). When an ALS
patient has difficulty breathing, they may receive a
tracheotomy, which causes them to lose the ability to
speak. A laryngectomy is another procedure that
causes people to lose their voice. “My Voice” is
free, widely used Japanese speech synthesis software
that lets patients keep using their own voice after
surgery takes away their ability to speak.
My Voice is widely used for several reasons. First,
it is free. Second, it is easy to use. Recording time
is kept to a minimum, which reduces the patients’
workload, because the main recording covers only the
basic Japanese syllable units, and words are then
concatenated from these recorded units.
Furthermore, its GUI is well designed, so a
patient’s therapist, family members, and friends can
also use it with simple training. By recording words
and/or phrases before surgery, patients can keep
communicating in their own voice, with the help of
technology, even after physically losing the ability to speak.
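As a rough illustration of this concatenative approach, the Python sketch below joins separately recorded syllable units into a word with a short crossfade. The file names, paths, and crossfade length are hypothetical; they are not details of the actual My Voice software.

    import numpy as np
    import soundfile as sf

    def concatenate_units(unit_paths, crossfade_ms=10):
        # Join recorded syllable units with a short linear crossfade.
        pieces, rate = [], None
        for path in unit_paths:
            data, sr = sf.read(path, dtype="float32")
            if data.ndim > 1:          # mix down to mono if needed
                data = data.mean(axis=1)
            rate = rate or sr
            pieces.append(data)
        fade = int(rate * crossfade_ms / 1000)
        out = pieces[0]
        for nxt in pieces[1:]:
            ramp = np.linspace(0.0, 1.0, fade)
            overlap = out[-fade:] * (1.0 - ramp) + nxt[:fade] * ramp
            out = np.concatenate([out[:-fade], overlap, nxt[fade:]])
        return out, rate

    # Hypothetical unit files: "a.wav" + "ri.wav" -> the word "ari"
    word, sr = concatenate_units(["a.wav", "ri.wav"])
    sf.write("ari.wav", word, sr)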
3.2 Why My Voice?
Commercial speech synthesizers usually aim for
clarity and intelligibility of the synthesized speech.
My Voice, however, preserves the speaker’s own voice
and vocal characteristics. When an ALS patient has
difficulty breathing as their disease progresses, they
must choose to have a tracheotomy and lose their