5 RESULTS AND DISCUSSION
The vocal tract visualisation tool has been designed to operate in an MS Windows-based PC environment. The multi-display window and other user-interface features of the complete system are shown in Figure 5. As can be seen, the system’s screen is divided into four windows displaying the vocal tract graphics, the sound intensity, the pitch and the first three formants of the speech signal. The system can operate in two main modes: (a) a near real-time mode, in which the speech signal is picked up by a microphone connected to the PC sound card (as in the case shown in Figure 5), and (b) a non real-time mode, in which the speech signal is either recorded by the system or read from a stored audio file before its features are displayed. The system also allows speech/sound signals to be saved.
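As an illustration of how such a near real-time mode might be organised, the sketch below captures microphone blocks and hands short segments to an analysis routine. This is a minimal sketch only: the sounddevice library, the 16 kHz sampling rate, the block and segment sizes, and the analyse() placeholder are illustrative assumptions, not details of the system described here.

import queue
import numpy as np
import sounddevice as sd   # third-party audio I/O library (assumed choice)

FS = 16000    # sampling rate in Hz (assumed; the system's rate is not restated here)
BLOCK = 512   # samples delivered per audio callback

captured = queue.Queue()

def on_audio(indata, frames, time, status):
    # Runs on the audio thread: queue each mono block for later analysis.
    captured.put(indata[:, 0].copy())

def analyse(segment):
    # Hypothetical placeholder for the processing chain
    # (LP analysis, formant tracking, display update).
    pass

with sd.InputStream(samplerate=FS, channels=1, blocksize=BLOCK,
                    callback=on_audio):
    buffer = np.zeros(0)
    while True:                            # run until interrupted
        buffer = np.concatenate([buffer, captured.get()])
        if len(buffer) >= FS // 4:         # analyse roughly every 250 ms
            analyse(buffer)
            buffer = np.zeros(0)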
For vowel articulation, the user can compare the shape of his/her vocal tract with a reference trace (shown as a dashed line in Figure 5) for the correct tongue position, derived from the measurement data reported in (Miller & Mathews, 1963). The deviation from the reference trace is given in this case as a computed mean squared error (MSE) over all the estimated mid-sagittal distances.
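A minimal statement of this deviation measure, assuming N estimated mid-sagittal distances $d_i$ with corresponding reference distances $\hat{d}_i$ (our notation, for illustration), is

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(d_i - \hat{d}_i\right)^2 .$$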
Figure 6 shows the vocal tract profiles for 10
American English vowels, as estimated by the
system (dashed lines represent reference trace for
tongue position). For comparison and evaluation
purposes, the deviations, in terms of MSE values,
from the reference tongue position data adopted
from (Harshman et al., 1977) are also indicated. In
general, the obtained results seem to correlate well
with the reference data. They were also found to
correlate well with x-ray data and the PARAFAC
analysis. Referring to the MSE values shown in Figure 6, the system seems to perform particularly well for all the ‘front vowels’ (/IY/, /EY/, /IH/, /EH/ and /AE/), with the MSE increasing as the vowel height decreases. With the exception of /AA/ and /UH/, the results correlate less accurately with the reference data in the cases of the ‘back vowels’. As the classification into front and back vowels is related to whether the tongue is raised towards the front or the back of the mouth, we believe that the higher accuracy for the front vowels can be attributed to the additional formant-based adjustments of the lips, jawbone and front sections of the vocal tract used in our approach.
On the other hand, the duration of the vowel’s vocalisation seems to affect the accuracy of the estimated area functions and hence the displayed vocal tract shape. Specifically, the system gives relatively lower accuracy for longer vowels, such as /AO/, and for complex vowels that involve changes in the configuration of the mouth during sound production, such as /OW/. We believe this is because the system, in its current design, bases its estimation of the speech parameters on information extracted from the 2-3 middle frames of the analysed speech waveform.
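To make the last point concrete, the fragment below sketches one way such a middle-frames strategy could be realised: the waveform is split into fixed-length frames and only the central few are kept for parameter estimation. The frame length, hop size and frame count are illustrative assumptions, not the system’s actual settings.

import numpy as np

def middle_frames(signal, frame_len=400, hop=200, n_mid=3):
    # Split the waveform into overlapping frames (25 ms frames with a
    # 12.5 ms hop at an assumed 16 kHz) and keep only the n_mid central
    # frames, mirroring a 'middle frames only' estimation strategy.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    centre = n_frames // 2
    lo = max(0, centre - n_mid // 2)
    return frames[lo:lo + n_mid]

Because everything outside the central frames is discarded, a change in mouth configuration early or late in a long or complex vowel never reaches the estimator, which is consistent with the lower accuracy observed for /AO/ and /OW/.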
6 CONCLUSIONS
We have described the design and development of a computer-based system for the near real-time and non real-time visualisation of the vocal tract shape during vowel articulation.
Compared to other similar systems, our system uses
a new approach for estimating the vocal tract mid-
sagittal distances based on both the area functions
and the first three formants as extracted from the
acoustic speech signal. It also utilises a novel and
simple technique for mapping the extracted
information to corresponding mid-sagittal distances
on the displayed graphics. The system is also
capable of displaying the sound intensity, the pitch
and the first three formants of the uttered speech. It
extracts the required parameters directly from the
acoustic speech signal using an AR speech
production model and LP analysis.
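As a rough illustration of this kind of LP-based extraction (a sketch of the standard autocorrelation method, not the system’s actual implementation), the fragment below fits an all-pole (AR) model to one speech frame and reads formant candidates off the roots of the prediction polynomial; the model order and the frequency threshold are illustrative assumptions.

import numpy as np
from scipy.linalg import solve_toeplitz

def lp_formants(frame, fs, order=12):
    # Estimate the first three formants of one speech frame by LP
    # (all-pole/AR) analysis using the autocorrelation method.
    frame = frame * np.hamming(len(frame))            # taper frame edges
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])     # Yule-Walker equations
    roots = np.roots(np.concatenate(([1.0], -a)))     # roots of A(z)
    roots = roots[np.imag(roots) > 0]                 # one per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    return freqs[freqs > 90.0][:3]                    # drop near-DC roots

In practice the pole bandwidths would also be checked before accepting a root as a formant; the sketch keeps only the frequency test for brevity.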
Reported preliminary experimental results have shown that, in general, the system is able to reproduce the shapes of the vocal tract well, with a real-time sensation, for vowel articulation. Work is well underway to optimise the algorithm used for extracting the required acoustic information and the mapping technique, so that dynamic descriptions of the vocal tract configuration can be obtained for long and complex vowels, as well as for vowel-consonant and consonant-vowel transitions. Enhancement of the system’s
real-time capability and features, and facilitation of
an integrated speech training aid for the hearing-
impaired are also being investigated.
REFERENCES
Choi, C.D., 1982. A Review on Development of Visual Speech Display Devices for Hearing Impaired Children. Commun. Disorders, 5, 38-44.
Bunnell, H.T., Yarrington, D.M. & Polikoff, J.B., 2000. STAR: articulation training for young children. In Intl. Conf. on Spoken Language Processing (INTERSPEECH 2000), 4, 85-88.
Mashie, J.J., 1995. Use of sensory aids for teaching speech to children who are deaf. In Spens, K-E. and Plant, G.