If triphones are used in place of monophones, the
number of models required increases and the
training data may become insufficient. An efficient
solution to this problem is to tie the acoustically
similar states of the triphone models built for each
context. For example, figure 2b shows four models
for different contexts of the phoneme “a”, namely
the triphones “k – a + S”, “g – a + z”, “n – a + j”,
“m – a + j”. Figures 2c and 2d show the clusters
formed from the acoustically similar states of the
corresponding HMMs.
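To make the growth in the number of models concrete, the following minimal Python sketch expands a phoneme sequence into triphone labels of the form left - phoneme + right. The function name `to_triphones` and the boundary symbol `sil` are assumptions made for illustration and do not come from the system described here.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phoneme sequence into 'left-phone+right' labels.

    `boundary` is a hypothetical symbol used as context at the
    start and end of the sequence.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


labels = to_triphones(["k", "a", "S"])
print(labels)  # the middle label is the triphone "k-a+S"
```

With N monophones there are at most N^3 such labels, which is why many triphones are seen only rarely in the training data and their states have to share parameters.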
The states are selected and clustered into phonetic
classes by means of phonetic decision trees. A
phonetic decision tree is a binary tree, as shown in
figure 3, whose root node holds all the training
frames to be tied, in other words all the contexts of
a phoneme. Each node of the tree, beginning with
the root, is associated with a question about the
context of the phoneme. Possible questions are, for
example: is the right context a consonant
(R = Consonant?), or is the left context the phoneme
“a” (L = a?); the first question designates a large
class of phonemes, the second only a single
phonetic element. Depending on the answer, yes or
no, child nodes are created and the frames are placed
in them. Further questions are then asked at the
child nodes, and the frames are divided again.
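The way a question divides the contexts held by a node can be sketched as follows. This is only an illustration: the vowel class, the example contexts of the phoneme “m” and the names `parse`, `QUESTIONS` and `split_node` are assumptions, not the question set used in this work.

```python
# Illustrative vowel class; the actual phonetic classes are not given here.
VOWELS = {"a", "e", "i", "o", "u"}


def parse(label):
    """Split 'l-c+r' into (left context, phoneme, right context)."""
    left, rest = label.split("-")
    centre, right = rest.split("+")
    return left, centre, right


# Each question is a yes/no predicate over (left, centre, right).
QUESTIONS = {
    "R=Consonant?": lambda left, centre, right: right not in VOWELS,
    "L=a?": lambda left, centre, right: left == "a",
}


def split_node(labels, question):
    """Place each context in the 'yes' or the 'no' child node."""
    ask = QUESTIONS[question]
    yes = [lab for lab in labels if ask(*parse(lab))]
    no = [lab for lab in labels if not ask(*parse(lab))]
    return yes, no


# Hypothetical contexts of the phoneme "m".
contexts = ["a-m+p", "a-m+e", "o-m+a", "s-m+e"]
yes, no = split_node(contexts, "R=Consonant?")
# The 'no' child is divided again by a second question.
no_yes, no_no = split_node(no, "L=a?")
print(yes, no_yes, no_no)
```

The broad question (R = Consonant?) separates whole classes of contexts, while the narrow one (L = a?) isolates a single left context, as described above.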
The questions are chosen so as to maximize the
increase in the log likelihood of the data after
splitting. Splitting stops when the increase in log
likelihood falls below an imposed threshold, and the
node becomes a leaf node. A leaf node gathers all
the states that gave the same answers to the
questions asked along the path from the root node;
states reaching the same leaf node are therefore
regarded as acoustically similar and can be tied.
For each pair of leaf nodes the occupancy must be
calculated so that insufficiently occupied leaf
nodes can be merged.
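The selection criterion can be illustrated with a small sketch. It assumes, as is common in tree-based state tying but not stated in this paper, that each state is summarized by its occupancy, mean and diagonal variance, and that the states in a node are modelled by one shared Gaussian. All names (`pooled_loglik`, `split_gain`, `best_question`) and the numerical values are hypothetical.

```python
import math


def pooled_loglik(states):
    """Approximate log likelihood of the frames in `states` under a
    single shared diagonal Gaussian (assumption of this sketch)."""
    occ = sum(s["occ"] for s in states)
    dim = len(states[0]["mean"])
    log_var_sum = 0.0
    for d in range(dim):
        mean = sum(s["occ"] * s["mean"][d] for s in states) / occ
        second = sum(
            s["occ"] * (s["var"][d] + s["mean"][d] ** 2) for s in states
        ) / occ
        log_var_sum += math.log(second - mean ** 2)
    return -0.5 * occ * (dim * (1.0 + math.log(2.0 * math.pi)) + log_var_sum)


def split_gain(yes, no):
    """Increase in log likelihood obtained by splitting a node."""
    return pooled_loglik(yes) + pooled_loglik(no) - pooled_loglik(yes + no)


def best_question(states, questions, threshold):
    """Return the question with the largest gain above `threshold`,
    or None, in which case the node becomes a leaf."""
    best_name, best_gain = None, threshold
    for name, ask in questions.items():
        yes = [s for s in states if ask(s["label"])]
        no = [s for s in states if not ask(s["label"])]
        if yes and no:
            gain = split_gain(yes, no)
            if gain > best_gain:
                best_name, best_gain = name, gain
    return best_name


# Toy one-dimensional statistics, invented for the example.
states = [
    {"label": "a-m+p", "occ": 100.0, "mean": [0.0], "var": [1.0]},
    {"label": "a-m+e", "occ": 100.0, "mean": [10.0], "var": [1.0]},
    {"label": "o-m+p", "occ": 100.0, "mean": [0.0], "var": [1.0]},
    {"label": "o-m+e", "occ": 100.0, "mean": [10.0], "var": [1.0]},
]
questions = {
    "R=p?": lambda label: label.endswith("+p"),
    "L=a?": lambda label: label.startswith("a-"),
}
print(best_question(states, questions, threshold=50.0))
```

In this toy case the right-context question separates states with different means and gives a large gain, whereas the left-context question leaves both children as mixed as the parent and gives essentially no gain. Raising the threshold above the best gain makes the function return None, which corresponds to stopping the split.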
A decision tree is built for each state of each
phoneme. The sequential top-down construction of
the decision trees was carried out automatically,
with an algorithm selecting the questions from a
large set of 130 questions, established from
knowledge of the phonetic rules of the Romanian
language.
4 DATABASE
The data are sampled at 16 kHz, quantized with 16
bits, and recorded in a laboratory environment.
For continuous speech recognition, the training
database consists of 3300 phrases uttered by 11
speakers, 7 males and 4 females, each speaker
reading 300 phrases.
The testing database contains 220 phrases
uttered by 11 speakers, each of them reading 20
phrases.
The training database contains over 3200 distinct
words; the testing database contains 900 distinct
words.
In order to carry out our experiments on speaker
independence, the database was reorganized as
follows: one database for male speakers (MS),
one database for female speakers (FS) and one
database for male and female speakers (MS and FS).
In all cases we excluded one MS and one FS from
the training and used them for testing.
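One possible reading of this reorganization is sketched below. The speaker identifiers, the choice of the excluded speakers and the function `make_split` are hypothetical; the sketch also assumes that the two excluded speakers form the test set in all three configurations, which the text does not state in detail.

```python
PHRASES_PER_SPEAKER = 300  # training phrases read by each speaker

# Hypothetical speaker IDs: 7 males and 4 females.
SPEAKERS = {f"M{i}": "male" for i in range(1, 8)}
SPEAKERS.update({f"F{i}": "female" for i in range(1, 5)})

# Assumed choice of the excluded MS and FS.
HELD_OUT = {"M7", "F4"}


def make_split(genders):
    """Return (training speakers, test speakers) for one configuration."""
    train = sorted(
        spk for spk, g in SPEAKERS.items()
        if g in genders and spk not in HELD_OUT
    )
    test = sorted(HELD_OUT)
    return train, test


for name, genders in [
    ("MS", {"male"}),
    ("FS", {"female"}),
    ("MS and FS", {"male", "female"}),
]:
    train, test = make_split(genders)
    print(name, len(train), "training speakers,",
          len(train) * PHRASES_PER_SPEAKER, "training phrases, test on", test)
```

Keeping the excluded speakers out of every training set means that the test speakers are always unseen, which is what a speaker-independence experiment requires.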
Figure 3: Phonetic tree for phoneme m in state 2.
[Node questions as extracted from the figure: R=Consonant?, =Vowel?, L=a, L=Vowel_medial?, =p?; branches are labelled y (yes) and n (no).]
PROGRESSES IN CONTINUOUS SPEECH RECOGNITION BASED ON STATISTICAL MODELLING FOR
ROMANIAN LANGUAGE