Can Ultrasonic Doppler Help Detecting Nasality for Silent Speech Interfaces?
An Exploratory Analysis based on Alignment of the Doppler Signal with Velum Aperture Information from Real-Time MRI

João Freitas 1,2, António Teixeira 2 and Miguel Sales Dias 1,3

1 Microsoft Language Development Center, Lisboa, Portugal
2 Dep. Electronics Telecommunications & Informatics/IEETA, University of Aveiro, Aveiro, Portugal
3 ISCTE-Lisbon University Institute/ADETTI-IUL, Lisboa, Portugal

Keywords: Nasality Detection for Silent Speech Interaction, Velum Movement Detection, Ultrasonic Doppler, Nasal Vowels, Portuguese.
Abstract: This paper describes an exploratory analysis of the usefulness of the information made available by Ultrasonic Doppler signal data, collected from a single speaker, for detecting the velum movement associated with European Portuguese nasal vowels. This is directly related to the unsolved problem of detecting nasality in silent speech interfaces. The applied procedure uses Real-Time Magnetic Resonance Imaging (RT-MRI) collected from the same speaker, providing a method to interpret the reflected ultrasonic data. By ensuring compatible scenario conditions and proper time alignment between the Ultrasonic Doppler signal data and the RT-MRI data, we are able to accurately estimate the time when the velum moves and the type of movement during a nasal vowel occurrence. The combination of these two sources revealed a moderate correlation between the average energy of frequency bands around the carrier and the velum aperture, indicating a probable presence of velum information in the Ultrasonic Doppler signal.
1 INTRODUCTION
A known challenge in Silent Speech Interfaces (SSI), including those based on Ultrasonic Doppler Sensing (UDS) (Freitas et al., 2012a), is the detection of the nasality phenomenon in speech production, since it is unclear whether information on nasality is present in the UDS signal. Nasality is an important characteristic of several languages, such as French and European Portuguese (EP) (Teixeira, 2000), the latter being the language selected for the experiments reported here. Additionally, it has been shown before that nasality can cause severe word recognition degradation in UDS (Freitas et al., 2012a) and Surface Electromyography (Freitas et al., 2012b) based interfaces for this language.
An SSI can be seen as a possible alternative to conventional speech interfaces since it allows communication to occur in the absence of an acoustic signal. It brings advantages in situations where privacy or confidentiality is required, in the presence of environmental noise, such as in office settings, or when used by speech-impaired persons such as those who have undergone a laryngectomy, making it a suitable candidate for an interface to be used in Ambient Assisted Living scenarios. A UDS-based SSI could eventually be included in a multimodal interface as one of the core input modalities (Zhu et al., 2007; Freitas et al., 2013).
The main advantages of the UDS approach are: its non-invasive nature, since the device is completely non-obtrusive and has been proven to work without requiring any attachments; not being affected by environmental noise in the audible frequency range; commercially available hardware; and very low cost. These advantages make UDS an interesting approach and an attractive research topic in the area of Human-Computer Interaction (HCI) (Raj et al., 2012). The sensing method is based on the emission of a pure tone in the ultrasonic range towards the moving target; the reflected signal is captured by an ultrasound receiver tuned to the transmitted frequency. The movement
of the target will cause Doppler shifts in the
reflected signal, creating components at different
frequencies, proportional to their velocity relative to
the sensor. This technique has been applied to many
areas of speech technology (Kalgaonkar and Raj,
2008; Toth et al., 2010), including speech and silent
speech recognition (Srinivasan et al., 2010; Freitas
et al., 2012a).
This paper describes an exploratory analysis of the existence of velum movement information in the Ultrasonic Doppler signal. The reflected signal contains information about the articulators and the moving parts of the speaker's face; however, it is yet unclear how to distinguish between articulators and whether velum movement information is actually being captured. Therefore, considering our aim of detecting velum movement, and to provide a ground truth for our research, we used images collected with Real-Time Magnetic Resonance Imaging (RT-MRI) and extracted the velum aperture information during the nasal vowels of European Portuguese. Then, by combining and registering these two sources, ensuring compatible scenario conditions and proper time alignment, we are able to accurately estimate the time when the velum moves and the type of movement (i.e. ascending or descending) during a nasal vowel production. Using this method we are able to correlate the features extracted from the UDS signal with the signal that represents the velum movement and analyse, for all nasal vowels, whether velum information is being captured in the UDS signal.
The remainder of this paper is structured as
follows: section 2 presents background notions of
the nasality phenomenon and its impact on European
Portuguese, as well as a description of how the
Doppler Effect works; section 3 presents UDS
related work in the area of HCI; section 4 describes
the methodology used for extracting information
from the RT-MRI images, the UDS device, how
both signals were synchronized and the features
extracted from the Ultrasonic signal; in section 5 the
results of our exploratory analysis are presented; in
section 6 we discuss these results and finally, in
section 7, we present the conclusions of this study.
2 BACKGROUND
2.1 Nasality in European Portuguese
The production of a nasal sound involves air flow through the oral and nasal cavities. The air passage through the nasal cavity is essentially controlled by the velum, which, when lowered, allows the velopharyngeal port to be open, enabling resonance in the nasal cavity and the sound to be perceived as nasal. The production of oral sounds occurs when the velum is raised and the access to the nasal cavity is closed (Teixeira, 2000).

Nasality is a common characteristic of several languages around the world; however, only 20% of these languages have nasal vowels (Rossato et al., 2006). In EP there are five nasal vowels ([ɐ̃, ẽ, ĩ, õ, ũ]); three nasal consonants ([m], [n], and [ɲ]); several nasal diphthongs, such as [wɐ̃] (e.g. quando), [wẽ] (e.g. aguentar), [jɐ̃] (e.g. fiando) and [wĩ] (e.g. ruim); and triphthongs such as [wɐ̃w] (e.g. enxaguam). Nasal vowels also diverge among languages; for example, nasal vowels in EP differ from French in their wider variation in the initial segment and stronger nasality at the end (Trigo, 1993; Lacerda and Head, 1966). Differences at the pharyngeal cavity level and in the velum port opening quotient were also detected by Martins et al. (2008) when comparing the articulation of EP and French nasal vowels.
2.2 The Doppler Effect
The Doppler Effect is the modification of the frequency of a wave when the observer and the wave source are in relative motion. If $v_s$ and $v_o$ are the speeds of the source and the observer, measured along the observer-source direction, if $c$ is the propagation velocity of the wave in the medium and if $f_0$ is the source frequency, the observed frequency will be:

$$f = f_0 \frac{c + v_o}{c + v_s} \qquad (1)$$

Considering a standstill observer ($v_o = 0$) and $v_s \ll c$, the following approximation is valid:

$$f \approx f_0 \left(1 - \frac{v_s}{c}\right) \quad \text{or} \quad \Delta f = f - f_0 \approx -f_0 \frac{v_s}{c} \qquad (2)$$

We are interested in echo ultrasound to characterize the moving articulators of a human speaker. In this case a moving body with a speed $v$ (positive when the object is moving towards the emitter/receiver) reflects an ultrasound wave, whose frequency is measured by a receiver placed close to the emitter. The observed Doppler shift will then be double:

$$\Delta f = \frac{2 f_0 v}{c} \qquad (3)$$
CanUltrasonicDopplerHelpDetectingNasalityforSilentSpeechInterfaces?-AnExploratoryAnalysisbasedon
AlignementoftheDopplerSignalwithVelumApertureInformationfromReal-TimeMRI
233
3 RELATED WORK
In this section we present the work related to ultrasonic sensors applied to HCI, in particular to speech recognition. Ultrasonic sensors have been applied to diverse areas, ranging from industrial automation to medical solutions; however, only in 1995 was this technology first applied to speech recognition, by Jennings and Ruck (1995), who presented the first "Ultrasonic Mike" with the goal of improving automatic speech recognition in noisy environments. In their work, Jennings and Ruck used an emitter and a receiver based on piezoelectric material and a 40 kHz oscillator to create a continuous-wave ultrasonic signal.
More than a decade later, in 2007, Ultrasonic
Doppler research saw new developments being
applied to distinct areas of HCI, including speech
recognition (Zhu et al., 2007). Since then, Ultrasonic
Doppler has been applied to characterization and
analysis of human gait (Kalgaonkar and Raj, 2007),
gesture recognition (Kalgaonkar and Raj, 2009),
speaker recognition (Kalgaonkar and Raj, 2008),
speech synthesis (Toth et al., 2010), voice activity
detection (Kalgaonkar et al., 2007), silent speech (Freitas et al., 2013) and speech recognition (Srinivasan et al., 2010; Freitas et al., 2012a).
Still, several issues found in the state of the art remain unsolved: speaker dependence, sensitivity to sensor distance, spurious movements made by the speaker, silent articulation, amongst
others. Since Doppler shifts capture the articulators’
movement, we believe that some of these problems
can be attenuated or even solved if information
about each articulator can be extracted.
In terms of UDS signal analysis, Livescu et al. (2009) studied the phonetic discrimination present in the UDS signal. In this study the authors tried to
determine a set of natural sub-word units,
concluding that the most prominent groupings of
consonants include both place and manner of
articulation classes and, for vowels, the most salient
groups include close, open and round vowels.
In this paper we focus on determining whether a particular articulator, the velum, is actually captured by the sensor, and on determining in which cases it is more evident, by looking at the occurrence of nasal vowels in EP, a language with strong and particular nasal characteristics.
4 DATA COLLECTION,
SYNCHRONIZATION
AND FEATURE EXTRACTION
In order to understand if velum movement
information can be found in the Doppler shifts of the
echo signal, a signal that describes the velum
movement is used as a reference. This signal was
extracted from RT-MRI images, as described in
section 4.2. This section also describes the hardware
and setup of the Ultrasonic device and how
synchronization of both signals is achieved.
4.1 Ultrasonic Doppler Setup
A custom-built device, depicted in Figure 1, with a dedicated circuit board was developed based on the work of Zhu (2008). It includes: 1) the ultrasound transducers (400ST and 400SR, working at 40 kHz) and a microphone to receive the speech signal; 2) a crystal oscillator at 7.2 MHz and frequency dividers to obtain 40 and 36 kHz; 3) all the amplifiers and linear filters needed to process the echo signal and the speech signal. Since the board is placed in front of the speaker, the echo signal will be the sum of the contributions of all the articulators. If the generated ultrasound is a sine wave $\sin(2\pi f_0 t)$, an articulator with a velocity $v(t)$ will generate an echo wave that can be characterized by:

$$x(t) = a \sin\!\left(2\pi \int_0^t \left(f_0 + \frac{2 f_0 v(\tau)}{c}\right) d\tau + \phi\right) \qquad (4)$$

$a$ and $\phi$ are parameters defining the reflection and are functions of the distance. Although they are also functions of time, they are slowly varying and will be considered constants. The total signal will be the sum over all articulators and the moving parts of the speaker's face:

$$x(t) = \sum_i a_i \sin\!\left(2\pi \int_0^t \left(f_0 + \frac{2 f_0 v_i(\tau)}{c}\right) d\tau + \phi_i\right) \qquad (5)$$

The signal is a sum of frequency modulated signals. It was decided to make a frequency translation by multiplying the echo signal by a sine wave of frequency $f_1 = 36$ kHz; low-pass filtering the result yields a similar frequency modulated signal centered at $f_c = f_0 - f_1 = 4$ kHz:

$$y(t) = \sum_i b_i \sin\!\left(2\pi \int_0^t \left(f_c + \frac{2 f_0 v_i(\tau)}{c}\right) d\tau + \psi_i\right) \qquad (6)$$

This analogue operation is performed on the board using an AD633 analogue multiplier. The
PhyCS2014-InternationalConferenceonPhysiologicalComputingSystems
234
Doppler echo signal and speech are then digitized at
44.1 kHz; all subsequent processing is digital and implemented in Matlab.
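The following Python sketch illustrates the signal model of equations (4)-(6) and the frequency translation described above; it is a simulation under assumed values (simulation rate, articulator velocity profile, filter order), not the board's actual processing chain.

import numpy as np
from scipy.signal import butter, filtfilt

fs = 192_000                            # simulation rate (assumed), high enough for 40 kHz
f0, f1, c = 40_000.0, 36_000.0, 343.0   # carrier, mixing frequency, speed of sound (m/s)
t = np.arange(0, 0.5, 1 / fs)

v = 0.05 * np.sin(2 * np.pi * 2 * t)    # assumed slow articulator velocity profile (m/s)
inst_f = f0 + 2 * f0 * v / c            # instantaneous echo frequency, as in eq. (4)
phase = 2 * np.pi * np.cumsum(inst_f) / fs
echo = np.sin(phase)                    # frequency-modulated echo

mixed = echo * np.sin(2 * np.pi * f1 * t)   # multiplication by the 36 kHz sine (AD633 role)
b, a = butter(6, 8_000 / (fs / 2))          # low-pass keeps the f0 - f1 = 4 kHz image
baseband = filtfilt(b, a, mixed)            # FM signal now centered at 4 kHz, as in eq. (6)

After this step, decimating the simulated baseband to 44.1 kHz would mimic the digitization stage described above.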
Figure 1: Custom built UDS device with two ultrasound
transducers and a microphone.
4.2 RT-MRI Data Collection
The RT-MRI data collection was previously
conducted at IBILI/Coimbra for nasal production
studies. Images were acquired at the mid-sagittal and
coronal oblique planes of the vocal tract (see Figure
2) using an Ultra-Fast RF-spoiled Gradient Echo
(GE) pulse sequence and yielding a frame rate of 14
frames/second. Each recorded sequence contained
75 images. Additional information concerning the
image acquisition protocol can be found in Silva et
al. (2012).
Figure 2: From left to right: mid-sagittal plane depicting
orientation of the oblique plane used during acquisition,
sample oblique plane showing the oral and nasal cavities
and image sequence details (Teixeira et al., 2012).
Audio was recorded simultaneously with the real-
time images, inside the scanner, at a sampling rate of
16 kHz, using a fiber optic microphone. For
synchronization purposes, a TTL pulse was generated by the RT-MRI scanner (Teixeira et al., 2012).
4.3 Extraction of Information on Nasal
Port from RT-MRI Data
Since the main interest was to interpret velum position/movement from the mid-sagittal RT-MRI sequences of the vocal tract, instead of measuring distances (e.g., from the velum tip to the posterior pharyngeal wall) we opted for a method based on the variation of the area between the velum and the pharynx, which is closely related to velum position.
An image with the velum fully lowered was used to define a region of interest (ROI). Then, a region growing algorithm was applied with a seed defined in a hypointense pixel inside the ROI. This ROI is roughly positioned between the open velum and the back of the vocal tract, and the main idea is that the velum will move over that region when closing. Since this first ROI could also be defined enclosing a larger region, even including a part of the velum (which will not influence the process), it is only important that the seed is placed in a dark (hypointense) pixel inside it, in order to exclude most of the velum from the region growing. Figure 3 presents the contours of the segmented region over different image frames encompassing velum lowering and rising. For representation purposes, in order not to occlude the image beneath, only the contour of the segmented region is presented; processing is always performed over the pixels enclosed in the depicted region. Notice that the white boundaries presented in the images depict the result of the region growing inside the defined ROI (which just limits the growth) and not the ROI itself. The number of hypointense pixels (corresponding to an area) inside the ROI decreases when the velum closes and increases when the velum opens. Therefore, a closed velum corresponds to area minima while an open velum corresponds to local area maxima, which allows detecting the frames where the velum is open. Since for all image sequences there was no informant movement, the ROI only has to be set once, for each informant, and can then be reused throughout all the
processed sagittal real-time sequences. After ROI definition (around one minute, reusable throughout all image sequences from the same speaker), setting a seed, revising the results and storing the data took one minute per image sequence.

Figure 3: Mid-sagittal RT-MRI images of the vocal tract for several velum positions, over time, showing the evolution from a raised velum to a lowered velum and back to initial conditions. The presented curve, used for analysis, was derived from the images.
These images allowed deriving a signal over time that describes the velum movement (also shown in Figure 3 and depicted as a dashed line in Figure 4). As can be observed, minima correspond to a closed velopharyngeal port (oral sound) and maxima to an open port (nasal sound).
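A minimal sketch of the area measure just described is given below, assuming a grayscale frame, a boolean ROI mask, a manually chosen seed and an intensity threshold (these names and the 4-connectivity are illustrative assumptions):

import numpy as np
from collections import deque

def hypointense_area(frame, roi, seed, thresh):
    # Count pixels reachable from the seed through hypointense (dark) pixels
    # inside the ROI; this area tracks the velopharyngeal port aperture.
    visited = np.zeros(frame.shape, dtype=bool)
    queue = deque([seed])
    area = 0
    while queue:
        y, x = queue.popleft()
        if visited[y, x] or not roi[y, x] or frame[y, x] > thresh:
            continue
        visited[y, x] = True
        area += 1
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < frame.shape[0] and 0 <= nx < frame.shape[1]:
                queue.append((ny, nx))
    return area

Applying this function to every frame of a sequence yields the velum aperture signal: maxima where the velum is lowered (open port) and minima where it is raised.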
4.4 Corpora
The corpora used in this study, both RT-MRI and UDS, share a set of prompts composed of several nonsense words that contain the five EP nasal vowels ([ɐ̃, ẽ, ĩ, õ, ũ]) in isolation and in word-initial, word-internal and word-final contexts (e.g. ampa [ɐ̃pɐ], pampa [pɐ̃pɐ], pam [pɐ̃]). The nasal vowels are flanked by the bilabial stop or the labiodental fricative. This set contains 3 utterances per nasal vowel and data from a single speaker. The UDS data was recorded at a distance of 12 cm from the speaker.
4.5 Signals Synchronization
In order to be able to take advantage of the RT-MRI
velum information we need to synchronize the UDS
and RT-MRI signals. We start by aligning both UDS
and the information extracted from the RT-MRI with
the corresponding audio recordings. We resample
the audio recordings to 12 kHz and apply Dynamic
Time Warping (DTW) to the signals, finding the
optimal match between the two sequences. Based on
the DTW result we map the information extracted
from RT-MRI from the original production to the
UDS time axis, establishing the needed
correspondence between the UDS and the RT-MRI
information, as depicted in Figure 4.
Figure 4: Exemplification of the warped signal representing the nasal information extracted from RT-MRI (dashed line) superimposed on the speech recorded during the corresponding RT-MRI and UDS acquisition, for the sentence [ɐ̃pɐ, pɐ̃pɐ, pɐ̃].
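A hedged sketch of this alignment step, using DTW over MFCC features (the feature choice, hop size and librosa-based implementation are assumptions; the paper does not specify them):

import numpy as np
import librosa

def warp_velum_to_uds(audio_mri, audio_uds, velum, sr=12_000, hop=256):
    # DTW between the two audio recordings, then warping of the RT-MRI
    # velum signal onto the UDS recording's time axis.
    m1 = librosa.feature.mfcc(y=audio_mri, sr=sr, hop_length=hop)
    m2 = librosa.feature.mfcc(y=audio_uds, sr=sr, hop_length=hop)
    _, wp = librosa.sequence.dtw(X=m1, Y=m2)   # optimal match between the two sequences
    wp = wp[::-1]                              # the path is returned end-to-start
    t_mri = wp[:, 0] * hop / sr                # matched times on the MRI audio axis
    t_uds = wp[:, 1] * hop / sr                # matched times on the UDS audio axis
    # velum is sampled on the MRI frame axis (14 fps); read it at the matched times
    v = np.interp(t_mri, np.linspace(0, t_mri[-1], len(velum)), velum)
    return t_uds, v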
4.6 UDS Feature Extraction
For this experiment we have selected two types of features: frequency-band energy averages and energy-band frequency averages (Livescu et al., 2009; Zhu, 2008). To obtain the frequency-band energy averages, we split the signal spectrum into several non-linearly divided bands centered around the carrier. Then, the mean energy is computed for each band. The frequency interval for each band is given by:

$$F_n = \left[f_c + \operatorname{sgn}(n)\,\beta_{|n|-1},\; f_c + \operatorname{sgn}(n)\,\beta_{|n|}\right], \qquad n = \pm 1, \dots, \pm 7 \qquad (7)$$

where $f_c = 4000$ Hz (carrier frequency), $\beta_0 = 0$ and $\beta_k = \beta_{k-1} + 40k$. As such, the bandwidths slowly increase from 40 Hz, nearest the carrier, to 280 Hz, capturing higher resolution information near the carrier.
In order to compute the energy-band frequency averages, we split the spectrum into several energy bands and compute the frequency centroid for each band. We extract values from 14 bands (7 below and
7 above the carrier frequency) using 10 dB energy
thresholds that range from 0 dB to -70 dB.
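As an illustration of the frequency-band energy averages under the band layout of equation (7), the sketch below computes the 14 features from one analysis frame of the 4 kHz-centered signal (the FFT-based spectrum estimate and frame handling are assumptions):

import numpy as np

def band_energy_features(frame, fs=44_100.0, fc=4_000.0):
    # Mean spectral energy in 7 bands below and 7 bands above the carrier,
    # with widths growing from 40 Hz to 280 Hz as in eq. (7).
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    edges = np.cumsum([0] + [40 * k for k in range(1, 8)])  # 0, 40, 120, ..., 1120 Hz
    feats = []
    for sign in (-1, 1):                                    # below, then above the carrier
        for lo, hi in zip(edges[:-1], edges[1:]):
            f1, f2 = sorted((fc + sign * lo, fc + sign * hi))
            sel = (freqs >= f1) & (freqs < f2)
            feats.append(spec[sel].mean())                  # mean energy in the band
    return np.array(feats)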
5 EXPLORATORY ANALYSIS
In order to achieve our aim of finding whether velum movement information is present in the ultrasonic signal, we measure the strength of association between the obtained features, which describe the ultrasonic signal, and the RT-MRI information, which is an accurate representation of the ground truth. Below, we present several results based on Pearson's product-moment correlation coefficient, which measures how well the two signals are linearly related, as well as the results of applying Independent Component Analysis to the extracted features. Correlation values range between -1 and 1; the greater the absolute value of a correlation coefficient, the stronger the linear relationship, with a coefficient of 0 indicating the weakest relationship.
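The association measure used below can be sketched as follows (a straightforward application of SciPy's Pearson correlation; the per-utterance framing of the two tracks is an assumption):

import numpy as np
from scipy.stats import pearsonr

def correlation_magnitude(feature_track, velum_track):
    # Absolute Pearson correlation between one UDS feature trajectory and
    # the time-aligned velum aperture signal for the same utterance.
    r, _ = pearsonr(feature_track, velum_track)
    return abs(r)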
5.1 Results
When comparing the RT-MRI velum information with the obtained features along each frequency band, based on the correlation magnitudes presented in Figure 5, it is not clear which band presents the highest correlation, although the values near the carrier are slightly higher. However, if we split our analysis by vowel, more interesting results become visible. Figure 6 shows the correlation results for
utterances where only the nasal vowel [ɐ̃] occurs (e.g. ampa [ɐ̃pɐ], pampa [pɐ̃pɐ], pam [pɐ̃]), and a more distinct group of correlation values is visible in the frequency interval [4040..4120] Hz. When looking at the nasal vowel [ẽ], a stronger correlation is also noticed in that interval. However, in the case of the nasal vowels [õ] and [ũ], higher correlation values are found in the [3880..4040] Hz range, with an average correlation magnitude of 0.42 for [õ] and
0.44 for [ũ] (depicted in Figure 7). For the nasal
vowel [ĩ], we find much lower correlation values
when compared with the remaining vowels such as [ɐ̃], [õ] or [ũ]; the best interval can be found in the [4240..4400] Hz range, with an average correlation magnitude of 0.25.

Figure 5: Boxplot for all utterances. The x axis lists the frequency-band features and the y axis corresponds to the absolute Pearson's correlation value. The central mark is the median and the edges of the box are the 25th and 75th percentiles.

Figure 6: Boxplot for utterances with [ɐ̃]. The x axis lists the frequency-band features and the y axis corresponds to the absolute Pearson's correlation value.

Figure 7: Boxplot for utterances with [ũ]. The x axis lists the frequency-band features and the y axis corresponds to the absolute Pearson's correlation value.
When looking at the energy-band features for all vowels, we find similar values for the energy bands below -30 dB, where the highest average correlation value, 0.23, is achieved in the [-30..-40] dB range above and below the carrier. If we split our analysis by vowel, the highest value is achieved by the nasal vowel [õ], with an average correlation of 0.43 for the [-40..-50] dB interval above the carrier. The second best result using energy-band features is obtained by the nasal vowel [ũ] in the [-30..-40] dB range, with values of 0.40 above the carrier and 0.39 below the carrier.
5.1.1 Applying Independent Component
Analysis
As mentioned earlier, the Ultrasonic Doppler signal can be seen as the sum of the contributions of all articulators and the moving parts of the speaker's face. Thus, the signal can be interpreted as a mix of multiple signals. Considering our goal, an ideal solution would be to find a process to isolate the signal created by the velum. Independent Component Analysis (ICA) is a method for separating a multivariate signal into linearly mixed independent sources; the underlying idea is therefore to understand whether, by applying blind source separation, we can obtain independent components that relate to the movement of each articulator, including the velum.
For that purpose we applied the FastICA algorithm (Hyvarinen, 1999), using the RT-MRI information as a prior to build the separating matrix. This allowed obtaining independent components with a higher correlation value than the extracted features without any transformation, as shown in Table 1.

Table 1: Average correlation magnitude values with 95% confidence interval using frequency-band and energy-band features for the best independent components of each utterance.

              Frequency        Energy
All vowels    0.42 ± 0.05      0.41 ± 0.04
[ɐ̃]           0.44 ± 0.04      0.33 ± 0.05
[ẽ]           0.41 ± 0.09      0.41 ± 0.10
[ĩ]           0.30 ± 0.05      0.42 ± 0.03
[õ]           0.47 ± 0.08      0.41 ± 0.05
[ũ]           0.48 ± 0.14      0.50 ± 0.07

Also, due to the
CanUltrasonicDopplerHelpDetectingNasalityforSilentSpeechInterfaces?-AnExploratoryAnalysisbasedon
AlignementoftheDopplerSignalwithVelumApertureInformationfromReal-TimeMRI
237
singularity of the covariance matrix, we observe a dimensionality reduction to 4 to 8 components, depending on the utterance, when using frequency-band features, and to 6 to 12 components when using energy-band features.
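A generic sketch of this step using scikit-learn's FastICA is shown below; the paper's use of the RT-MRI signal as a prior for the separating matrix is not reproduced here (it could be approximated via the w_init argument), so this plain fit is an assumption rather than the authors' exact procedure:

import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import FastICA

def best_component(features, velum, n_components):
    # features: (n_frames, n_features); velum: (n_frames,).
    # Returns the independent component most correlated with the velum signal.
    ica = FastICA(n_components=n_components, random_state=0)
    sources = ica.fit_transform(features)      # (n_frames, n_components)
    corrs = [abs(pearsonr(sources[:, i], velum)[0]) for i in range(n_components)]
    return sources[:, int(np.argmax(corrs))], max(corrs)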
6 DISCUSSION
The applied methodology uses the audio signal to synchronize two distinct signals that would otherwise be very hard to align. Although the two sources of information were recorded at different times, it is our belief that, by reproducing the articulation movements, we are able to obtain a very good indication of how the velum behaves under the same stimulus in most cases. The utterances containing the nasal vowel [ũ] presented some alignment inaccuracies, mainly at the end of the first phoneme, and further improvements need to be considered for this particular case.
Knowing that the velum is a slow articulator, as shown by the RT-MRI velum movement information in Figure 4, and considering equation (2), it is expected that velum movement, if detected by UDS, is found in the regions near the carrier, which is where the results for [ɐ̃, ẽ, õ, ũ] present higher correlation. However, the velum is not the only slowly moving articulator, and a different corpus which allows, for example, discarding jaw movements should be considered for future studies.
Another point of discussion is the differences found between nasal vowels. When looking at the correlation results of the frequency-band features, a difference is noticed between [ĩ] and the remaining vowels. One possible explanation for this difference might be the articulation differences of each nasal vowel previously reported in the literature (Schwartz, 1968). Since our technique is based on the reflection of the signal, it is plausible that the tongue position influences the detection of the velum, particularly in the case of [ĩ], in which the tongue posture may block the UDS signal.
One would also expect to find a clear difference between close and open vowels (Livescu et al., 2009). Although this is true for the nasal vowel [ĩ], it was not verified in the [ũ] case, which presented the highest correlation values along with [õ]. Further investigation is required to understand whether, for example, the rounding of the lips during the articulation of these two vowels influences the signal reflection, and in which way.
In this study we have also applied blind source separation in an attempt to split the signal into independent components. This technique has given slightly better results for both sets of features, showing that isolating the velum movement in the Doppler shifts might be possible. It is also noteworthy that this process has led to a dimensionality reduction to 4 to 8 components, depending on the utterance, which may be related to the number of mobile articulators that can cause Doppler shifts in the signal (i.e. tongue, lower jaw, velum, lips, cheeks, oral cavity).
7 CONCLUSIONS
This paper analyses the presence of information about velum movement in the Ultrasonic Doppler signal for European Portuguese nasal vowels. As ground truth for our study, we use previously collected RT-MRI information from the same speaker and, after extracting a signal that describes the movement of the velum, we apply a synchronization technique based on the audio signals collected with both corpora. With this approach we are able to estimate the velum behaviour and measure the strength of association between the features that describe the ultrasonic signal data and the RT-MRI data, by computing Pearson's product-moment correlation coefficient.
The obtained results show that, for features based on the energy of pre-determined frequency bands, we are able to find moderate correlation values for the vowels [ɐ̃], [õ] and [ũ], and weaker correlation values in the [ĩ] case. Moderate correlation values were also found using energy-based features for bands below -30 dB. We have also applied a blind source separation technique, obtaining components that provide a better description of the velum movement.
For future work, based on this methodology, we plan to apply the same process to other articulators, such as the tongue or the lips, which will help to determine important aspects and more details about the captured information. We also intend to expand the current corpora with more speakers and prompts adequate for these scenarios. It would also be important to analyse the impact of the distance between the UDS emitter and the speaker's face on the captured information.
ACKNOWLEDGEMENTS
This work was partially funded by Marie Curie
PhyCS2014-InternationalConferenceonPhysiologicalComputingSystems
238
Golem (ref. 251415, FP7-PEOPLE-2009-IAPP), by
FEDER through the Program COMPETE under the
scope of QREN 5329 FalaGlobal (PTDC/EEA-
PLP/098298/2008) and by National Funds (FCT-
Foundation for Science and Technology) in the
context of IEETA Research Unit funding FCOMP-
01-0124-FEDER-022682 (FCT-PEst-
C/EEI/UI0127/2011). The authors would also like to
thank the speaker.
REFERENCES
Freitas, J., Teixeira, A., Dias, M. S., 2012b. Towards a
Silent Speech Interface for Portuguese: Surface
Electromyography and the nasality challenge. In Int.
Conf. on Bio-inspired Systems and Signal Processing,
Vilamoura, Algarve, Portugal.
Freitas, J., Teixeira, A., Dias, M. S., 2013. Multimodal
silent speech interface based on Video, depth, surface
electromyography, and ultrasonic Doppler, Workshop
on Speech Production in Automatic Speech
Recognition, Lyon, France.
Freitas, J., Teixeira, A., Vaz, F., Dias, M. S., 2012a.
Automatic Speech Recognition based on Ultrasonic
Doppler Sensing for European Portuguese. Advances
in Speech and Language Technologies for Iberian
Languages, vol. CCIS 328, Springer.
Hyvarinen, A., 1999. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, Vol. 10, no. 3, pp. 626-634.
Jennings, D. L., Ruck, D. W., 1995. Enhancing automatic speech recognition with an ultrasonic lip motion
detector, In Int. Conf. on Acoustics, Speech, and
Signal Processing, Detroit.
Kalgaonkar, K. and Raj, B., 2007. Acoustic Doppler Sonar
for Gait Recognition, IEEE International Conference
on Advance Video and Signal-based Surveillance
(AVSS2007).
Kalgaonkar, K. and Raj, B., 2008. Ultrasonic Doppler
Sensor for Speaker Recognition, IEEE Intl. Conf. on
Acoustics Speech and Signal Processing 2008.
Kalgaonkar, K. and Raj, B., 2009. One-handed Gesture
Recognition using Ultrasonic Doppler Sonar, IEEE
Intl. Conf. on Acoustics, Speech and Signal Processing
2009.
Kalgaonkar, K., Raj, B., Hu, R., 2007. Ultrasonic Doppler for voice activity detection. IEEE Signal Processing Letters, vol. 14(10), pp. 754-757.
Lacerda, A. and Head, B. F., 1966. Análise de sons nasais
e sons nasalizados do Português. Revista do
Laboratório de Fonética Experimental (de Coimbra).
VI:5_70.
Livescu, K., Zhu, B. and Glass, J., 2009. On the phonetic information in ultrasonic microphone signals. In Intl. Conf. on Acoustics, Speech and Signal Processing 2009.
Martins, P., Carbone, I., Pinto, A., Silva, A. and Teixeira, A., 2008. European Portuguese MRI based speech
production studies. Speech Communication. NL:
Elsevier, Vol.50, No.11/12, ISSN 0167-6393, pp. 925–
952.
Raj, B., Kalgaonkar, K. Harrison, C. and Dietz, P., 2012.
Ultrasonic Doppler Sensing in HCI. Pervasive
Computing, IEEE 11, no. 2, pp. 24-29.
Rossato, S. Teixeira, A. and Ferreira, L., 2006. Les
Nasales du Portugais et du Français: une étude
comparative sur les données EMMA. In XXVI
Journées d'Études de la Parole. Dinard, France.
Schwartz, M. F. 1968. The acoustics of normal and nasal
vowel production. Cleft Palate Journal 5: 125-40.
Silva, S., Martins, P., Oliveira, C., Silva, A., Teixeira, A.,
2012. Segmentation and Analysis of the Oral and
Nasal Cavities from MR Time Sequences. Image
Analysis and Recognition. Proceedings of ICIAR
2012, LNCS, Springer.
Srinivasan, S., Raj, B., Ezzat., T., 2010. Ultrasonic sensing
for robust speech recognition, In Intl. Conf. on
Acoustics, Speech, and Signal Processing 2010.
Teixeira, A., Martins, P., Oliveira, C., Ferreira, C., Silva,
A., Shosted, R., 2012. Real-time MRI for Portuguese:
database, methods and applications, Proceedings
of PROPOR 2012, LNCS vol. 7243. pp. 306-317.
Teixeira, J. S., 2000. Síntese Articulatória das Vogais
Nasais do Português Europeu. PhD Thesis,
Universidade de Aveiro.
Toth, A. R., Kalgaonkar, K., Raj, B., Ezzat, T., 2010.
Synthesizing speech from Doppler signals. IEEE
International Conference on Acoustics Speech and
Signal Processing, pp.4638-4641.
Trigo, R. L., 1993. The inherent structure of nasal
segments, In Nasals, Nasalization, and the Velum,
Phonetics and Phonology, M. K. Huffman and R. A.
Krakow (eds.), Vol. 5, pp.369-400, Academic Press
Inc.
Zhu, B., 2008. Multimodal speech recognition with
ultrasonic sensors. Master’s thesis. Massachusetts
Institute of Technology, Cambridge, Massachusetts.
Zhu, B., Hazen, T. and Glass, J. R., 2007. Multimodal
speech recognition with ultrasonic sensors. In
Eurospeech 2007.
CanUltrasonicDopplerHelpDetectingNasalityforSilentSpeechInterfaces?-AnExploratoryAnalysisbasedon
AlignementoftheDopplerSignalwithVelumApertureInformationfromReal-TimeMRI
239