and increment it as the intervals move away, we define
an integer for each relation. Thus the "b" symbol is
coded into "1", "m" into "2", and so on for the 14 relations,
which are: before, meets, overlaps, starts, during, fin-
ishes, equals, and their inverses (see (Fraihat et al.,
2008) for details). The 'no-relation' holds between
two empty intervals. We propose to use this time
representation for coding speech events into a small
set of discrete integers. In order to define the intervals, we
use the voicing levels as described in the next section.
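The coding above can be sketched as follows; the relation test and the particular integer assignment are illustrative assumptions, not the exact table of (Fraihat et al., 2008):

```python
def allen_relation(a, b):
    """Return the Allen relation symbol between intervals a=(s, e) and b=(s, e)."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:               return 'b'   # before
    if e1 == s2:              return 'm'   # meets
    if s1 == s2 and e1 == e2: return 'e'   # equals
    if s1 == s2 and e1 < e2:  return 's'   # starts
    if e1 == e2 and s1 > s2:  return 'f'   # finishes
    if s1 > s2 and e1 < e2:   return 'd'   # during
    if s1 < s2 < e1 < e2:     return 'o'   # overlaps
    # otherwise the inverse relation holds: swap arguments, add suffix 'i'
    return allen_relation(b, a) + 'i'

# Code each symbol into a small integer, e.g. 'b' -> 1, 'm' -> 2, ...
SYMBOLS = ['b', 'm', 'o', 's', 'd', 'f', 'e',
           'bi', 'mi', 'oi', 'si', 'di', 'fi']
CODE = {sym: k + 1 for k, sym in enumerate(SYMBOLS)}
CODE['n'] = 0  # 'no-relation' between two empty intervals
```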
In order to get the subband voicing activity inter-
vals, we estimate the TF voicing activity interval us-
ing the voicing measure R (Glotin, 2001), which is well
correlated with the SNR and equivalent to the harmonic-
ity index (HNR). R is calculated by autocorrelogram
of the demodulated signal. In the case of Gaussian
noise, the correlogram of a noisy frame is less modu-
lated than that of a clean one. We first compute the demodu-
lated signal after half-wave rectification, followed by
band-pass filtering in the pitch domain. Then we au-
tocorrelate each frame of LVW (Local Voicing Win-
dow) ms length and calculate R = R1/R0, where R1
is the local maximum in the time-delay segment corre-
sponding to the fundamental frequency ([90, 350] Hz),
and R0 is the window energy. We showed in (Glotin,
2001) that R is strongly correlated with the SNR in the
5-20 dB range, as illustrated in Fig. 1. The SB are
defined as in Allen's analysis (Allen, 1994;
Glotin, 2001): [216, 778], [707, 1631], [1262, 2709],
[2121, 3800], [3400, 5400], [5000, 8000] Hz.
For vowel recognition we set LVW = 32 ms, with a shift
of 4 ms.
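The computation of R can be sketched as follows; this is a minimal sketch in which the FFT-domain band-pass, the framing, and the lag bounds are our assumptions, not the exact procedure of (Glotin, 2001):

```python
import numpy as np

def bandpass(x, fs, lo, hi):
    # crude FFT-domain band-pass (an assumption; the paper does not specify the filter)
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

def voicing_measure(frame, fs, f0_range=(90.0, 350.0)):
    """R = R1/R0 on one LVW frame: half-wave rectification, band-pass
    in the pitch domain, then autocorrelation."""
    rectified = np.maximum(frame, 0.0)                  # half-wave rectification
    demod = bandpass(rectified, fs, *f0_range)          # band-pass in pitch domain
    ac = np.correlate(demod, demod, mode='full')[len(demod) - 1:]
    r0 = ac[0]                                          # window energy
    lo = int(fs / f0_range[1])                          # shortest pitch lag (350 Hz)
    hi = int(fs / f0_range[0])                          # longest pitch lag (90 Hz)
    r1 = ac[lo:hi + 1].max()                            # local max in pitch-lag segment
    return r1 / r0 if r0 > 0 else 0.0
```

For a strongly voiced frame the autocorrelation peak in the pitch-lag segment is close to the window energy, so R approaches 1; for noise it stays lower.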
3 BINARIZATION AND REPRESENTATION
In order to generate well-separated principal time intervals
for the Allen relations, we threshold the voicing levels:
for each band and each Local Binary Win-
dow (LBW, 32 ms shift and 64 ms length), we binarize
to 1 the frames in the highest T% quantile, the others to 0.
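A minimal sketch of this binarization, assuming the T% quantile is taken per band within each LBW window:

```python
import numpy as np

def binarize(voicing, T=0.5):
    """voicing: array (n_bands, n_frames) of R levels in one LBW window.
    Frames at or above the (1 - T) quantile of their band become 1, others 0."""
    thresh = np.quantile(voicing, 1.0 - T, axis=1, keepdims=True)
    return (voicing >= thresh).astype(int)
```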
In order to remove noisy relations, we discard any in-
terval that is connected to a window border. Finally,
we keep only the windows containing at least 4 connected in-
tervals. We then derive their Allen temporal relations
(see Fig. 1). The vowel labels for the training task
are obtained by forced realignment on a standard HMM-
GMM model (Galliano et al., 2005).
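The interval-cleaning step can be sketched as follows; the run extraction and the border test are our reading of the text, not the exact implementation:

```python
def runs_of_ones(binary_row):
    """Return (start, end) index pairs of consecutive-1 runs in one band."""
    runs, start = [], None
    for i, v in enumerate(binary_row):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(binary_row)))
    return runs

def clean_intervals(binary_row):
    """Discard intervals touching the window borders (assumed noisy)."""
    n = len(binary_row)
    return [(s, e) for (s, e) in runs_of_ones(binary_row) if s > 0 and e < n]

def keep_window(intervals_per_band, min_intervals=4):
    """Keep the window only if it contains at least 4 intervals in total."""
    return sum(len(iv) for iv in intervals_per_band) >= min_intervals
```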
As we have 6 SB, we have 15 temporal relations
(one for each pair), ordered from low to high fre-
quency. In our example (Fig. 1), from I'1 to I'5, we
get the parameter vector [di di di oi oi d d d d s oi
d oi f d], where the suffix i denotes the inverse relation.
These TFQ features, estimated in each LBW window, feed a
neural network (any classifier could be used) that we
trained for automatic vowel decoding.

Figure 1: From voicing levels to the Allen interval rela-
tions: (a) voicing signal, (b) the voicing level by subband, (c)
the binarized voicing levels by subband using a mean thresh-
old (from (Glotin, 2001)).
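The 15-dimensional relation vector can be sketched as follows, assuming one principal interval per subband and any function computing the Allen relation of two intervals:

```python
from itertools import combinations

def allen_vector(intervals, relation):
    """intervals: list of 6 (start, end) pairs, one principal interval per
    subband, ordered from low to high frequency; relation: a function
    returning the Allen symbol of two intervals.
    Returns the 15-symbol parameter vector, one entry per subband pair."""
    return [relation(a, b) for a, b in combinations(intervals, 2)]
```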
Moreover, in order to confirm that the voicing levels
and interval definitions are informative, we build a
6-integer feature, called RANK, ranking the subbands
of each window using the relative R level of each in-
terval. This information may be correlated with the for-
mant positions, which are lost in the simple Allen rela-
tions.
Thus the binarization and extraction functions
also integrate the hierarchy of SB frequencies into the
concatenated ALLEN+RANK features.
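The RANK part of the concatenated feature can be sketched as follows (tie handling and the exact ordering convention are assumptions):

```python
import numpy as np

def rank_feature(r_levels):
    """r_levels: mean R level of the retained interval in each subband.
    Returns one integer per subband: its rank (1 = most voiced subband)."""
    order = np.argsort(np.argsort(-np.asarray(r_levels)))
    return (order + 1).tolist()
```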
4 DATABASE
Our experiments are made over all the speak-
ers on the six most frequent French vowels:
/Aa/, /Ai/, /An/, /Ei/, /Eu/, /Ii/. The SB are defined as in the pre-
vious section. We set the shift of each voicing window
LVW to 4 ms, and the LVW length to 32 ms. We vary
the T% parameter in [0.4 0.5 0.6 0.7]. The training
windows are labelled with the vowel label that covers
most of the window. The features from 1 h of continuous
speech are used to train an MLP, and we test on another
20 minutes; the best results over the number of hidden units
are given in Tab. 1.
SIGMAP 2008 - International Conference on Signal Processing and Multimedia Applications