cross-validation training of neural networks. The entire test set of 1344 utterances from
168 speakers was used for the classification experiment. None of the test speakers are in the training set; hence all experiments are open and speaker-independent. There are 39 phone classes.
We quantised the neural network output activations using a quantisation level of 10 and removed the redundant tokens from the training and test sets. The sizes of the symbolic training and test sets thus obtained are 124962 and 46633 tokens, respectively.
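For illustration, a minimal Python sketch of such a symbolisation step is given below. It assumes per-frame activations in [0, 1] and reads "redundant" as "duplicate"; the frame layout and all function names are illustrative assumptions, not details of our implementation.

    # Illustrative sketch of the symbolisation step. Assumes each frame is
    # a sequence of activations in [0, 1]; "redundant" is read as
    # "duplicate". Names are illustrative, not the actual implementation.
    def quantise(frame, levels=10):
        """Map each activation in [0, 1] to one of `levels` discrete symbols."""
        return tuple(min(int(a * levels), levels - 1) for a in frame)

    def symbolise(frames, levels=10):
        """Turn a sequence of activation frames into a symbolic token."""
        return tuple(quantise(f, levels) for f in frames)

    def deduplicate(tokens):
        """Drop duplicate tokens, preserving first-occurrence order."""
        seen, unique = set(), []
        for t in tokens:
            if t not in seen:
                seen.add(t)
                unique.append(t)
        return unique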
In order to obtain the sets $C^+$, the training set of each class $P$ was reduced to 5 cluster centroids using $k$-medians clustering, employing the weighted Levenshtein edit distance for similarity computations and the set median algorithm for template selection. The clustering algorithm's initialisation criterion was duration-based [4].
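A minimal sketch of this clustering step could look as follows, with a plain (unweighted) Levenshtein distance and random initialisation standing in for the weighted distance and duration-based initialisation used above; all names are illustrative.

    # Sketch of k-medians over symbol strings with set-median centroids.
    # Uses unweighted edit distance and random initialisation as
    # simplifying stand-ins for the weighted, duration-based variants.
    import random

    def levenshtein(a, b):
        """Unweighted edit distance between two symbol strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def set_median(strings):
        """String minimising the sum of distances to all others in the set."""
        return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))

    def k_medians(strings, k=5, iters=10, seed=0):
        """Assign strings to nearest centroid, then re-pick set medians."""
        rng = random.Random(seed)
        centroids = rng.sample(strings, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for s in strings:
                nearest = min(range(k), key=lambda i: levenshtein(s, centroids[i]))
                clusters[nearest].append(s)
            centroids = [set_median(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        return centroids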
During the learning stage, for each class $P$ out of the 39 classes, represented by its training set $C^+_P$, we derived its corresponding inductive structure $\Pi_P$ by using the algorithm outlined in Sect. 4. We defined the stopping criterion for the optimisation problem to be $\lambda = f^{-1}(\hat{\omega})$, where $f(\hat{\omega})$ is given by (1). The particular value of $\lambda$ we used was $10^{-8}$.
During the recognition stage, an efficient $k$-NN AESA search technique [13] was used to compare each of the 46633 test tokens with the class prototypes, using the template-based Generalised Levenshtein Distance defined by the respective class inductive structure. The classification accuracy we obtained was 51%, correctly classifying 23783 out of 46633 tokens.
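As a brute-force stand-in for this recognition step (AESA [13] computes the same nearest-neighbour result while pruning most distance computations via the triangle inequality), the classification logic can be sketched as follows; `distance` stands in for the class-specific Generalised Levenshtein Distance, and all names are illustrative.

    # Brute-force 1-NN classification over class prototypes. AESA would
    # return the same nearest neighbour with far fewer distance
    # evaluations; `distance` stands in for the class-specific
    # Generalised Levenshtein Distance.
    def classify(token, prototypes, distance):
        """prototypes: dict mapping class label -> list of centroid tokens."""
        best_label, best_dist = None, float("inf")
        for label, centroids in prototypes.items():
            for c in centroids:
                d = distance(token, c)
                if d < best_dist:
                    best_label, best_dist = label, d
        return best_label

    def accuracy(test_tokens, test_labels, prototypes, distance):
        """Fraction of test tokens whose nearest prototype has the true label."""
        correct = sum(classify(t, prototypes, distance) == y
                      for t, y in zip(test_tokens, test_labels))
        return correct / len(test_tokens)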
6 Conclusions and Future Work
In this paper we gave an outline of a linguistically inspired structural representation for speech, an attempt to find an inductively meaningful definition of the speech recognition problem, focusing on a low-level phonological representation of speech patterns. We showed how an inductively “rigid” representation can be made expressive with the introduction of an evolving metric, and described the results of the initial experiments conducted with highly non-trivial continuous speech data. We believe that the emphasis on the class representation of linguistic phenomena will facilitate the development of the speech recognition field, since the recognition problem cannot be approached adequately without a meaningful representation.
There are several ways of improving the representation described in this paper. For example, instead of using extensions of standard string-based dissimilarity measures, such as the weighted Levenshtein distance, we can introduce linguistically inspired distance functions along the lines of [14]. The learning algorithms, developed in the grammatical inference setting [7, 11], can be further improved to take into account the stream-based phonological structure. In addition, some basic phonological constraints can potentially be introduced (the stream independence assumption, for instance, can be relaxed to account for similar classes of distinctive phonological features, place of articulation being one of them). An efficient prototype selection algorithm for reducing the training set and selecting the templates containing inductively “interesting” features is also needed. It is expected that the above modifications will lead to significant improvements in the classification accuracy on the TIMIT task.