6 RESULTS
In section 2 we described a new noise-robust feature
extraction front-end based on the ETSI standard. To
assess its merit, we compared it with the ETSI
front-end using the Tecnovoz database. Whole-word
models were trained and tested as described in
section 3, using both front-ends. The models use
10-component Gaussian mixtures. As can be seen in
Table 3, our front-end outperforms the ETSI front-end
by 2 percentage points in recognition rate.
Table 3: Front-end comparison results.
Front-end Recognition rate
ETSI front-end 94.88%
New front-end 96.88%
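The gain in Table 3 can also be read as a relative reduction in error rate, which is often easier to compare across tasks. The short Python sketch below is only an illustration computed from the two rates in Table 3. It assumes that the error rate is simply 100% minus the recognition rate, and the function names are hypothetical, not part of the recognizer.

```python
def error_rate(recognition_rate_pct):
    """Error rate in percent, assuming error = 100 - recognition rate."""
    return 100.0 - recognition_rate_pct


def relative_error_reduction(baseline_pct, improved_pct):
    """Relative reduction (in percent) of the error rate of the baseline."""
    base_err = error_rate(baseline_pct)
    new_err = error_rate(improved_pct)
    return 100.0 * (base_err - new_err) / base_err


# Recognition rates taken from Table 3.
etsi_rate = 94.88
new_rate = 96.88

absolute_gain = new_rate - etsi_rate
rel_reduction = relative_error_reduction(etsi_rate, new_rate)

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative error reduction: {rel_reduction:.1f}%")
```

Under this reading the error rate falls from 5.12% to 3.12%, a relative reduction of roughly 39%.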
In section 3 we described three possible approaches
for the recognition units: whole-word models,
context-free phone models, and context-dependent
triphone models. The training and testing datasets
were the same as in the previous experiment and the
results are presented in Table 4.
Table 4: Results obtained for Tecnovoz database.
Acoustic Model Recognition rate
Whole-word 96.88% (10 mixtures)
Phone model 91.41% (16 mixtures)
Triphone model 97.13% (10 mixtures)
Triphone model 97.50% (16 mixtures)
With the whole-word model set the recognition
rate is 96.88%. This model set has 46.7k Gaussians
(about 3.6M parameters).
For the context-free phone model set we used a
multiple-pronunciation dictionary but, despite this,
we obtained a recognition rate of only 91.41%, even
with a larger number of mixtures (16). As the main
concern is word recognition performance, the
silence/pause models are simply ignored in the
recognition evaluation. This low recognition rate is
most likely due to the small number of parameters:
only 1,888 Gaussians (about 150k parameters).
For the triphone model set, no multiple-pronunciation
dictionary was used and, as before, the silence/pause
models were ignored. The result of 97.50% was
obtained for 16 mixtures, with 32,208 Gaussians
(about 2.5M parameters) and 846 physical triphone
models. The result for 10 mixtures is also shown and
compares favourably with the whole-word case.
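The parameter counts quoted for the three model sets can be roughly reproduced from the number of Gaussians alone. The Python sketch below is a back-of-the-envelope check, not the authors' accounting: it assumes 39-dimensional feature vectors and diagonal covariances (neither is stated in this section), counts only means and variances, and ignores mixture weights and transition probabilities. The function name is hypothetical.

```python
def gmm_param_count(num_gaussians, feature_dim=39, count_weights=False):
    """Estimate the number of parameters in a set of diagonal Gaussians.

    feature_dim=39 is an assumption, not a value given in this section.
    Each Gaussian contributes one mean vector and one variance vector,
    plus optionally one mixture weight.
    """
    per_gaussian = 2 * feature_dim
    if count_weights:
        per_gaussian += 1
    return num_gaussians * per_gaussian


# Gaussian counts as reported in the text ("46.7k" is already rounded).
model_sets = {
    "whole-word (10 mix)": 46_700,
    "phone (16 mix)": 1_888,
    "triphone (16 mix)": 32_208,
}

estimates = {name: gmm_param_count(n) for name, n in model_sets.items()}

for name, total in estimates.items():
    print(f"{name}: ~{total / 1e6:.2f}M parameters")
```

With these assumptions the estimates come to about 3.64M, 0.15M and 2.51M parameters, close to the figures quoted above.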
7 CONCLUSIONS
A speech command recognizer for the Portuguese
language was presented in this paper. It incorporates
a new noise robust front-end and a Viterbi decoder
optimized for real time operation in embedded
applications. A compact programming interface for
application development was also presented. Several
recognition units were discussed and evaluated.
Results with the presented recognizer show that
models based on triphone units achieve higher
recognition rates than whole-word or context-free
phone models. A possible explanation for this result
is that, on average, there are many more occurrences
of each triphone than of each whole word, which
leads to better parameter estimation. Triphones are
also the better solution because they combine fewer
parameters with a higher recognition rate.
The noise robustness of our front-end was also
evaluated, showing improved performance when
compared to the ETSI front-end. This result suggests
that the ETSI front-end may be biased towards the
database used in its development.
REFERENCES
ETSI, 2003. ETSI ES 202 050 v1.1.3. Speech Processing,
Transmission and Quality Aspects (STQ); Distributed
Speech Recognition; Advanced Front-end Feature
Extraction Algorithm; Compression Algorithms.
Technical Report ETSI ES 202 050, ETSI.
HTK3, 2006. The HTK book (for HTK version 3.4).
Technical report, Cambridge University. England.
http://htk.eng.cam.ac.uk/.
Li, J.-Y., Liu, B., Wang, R.-H., and Dai, L.-R., 2004. A
Complexity Reduction of ETSI Advanced Front-end
for DSR. In Proc. of ICASSP’2004, vol. I, pp. 61-64.
Montreal, Canada.
Neves, C., Veiga, A., Sá, L., and Perdigão, F., 2008.
Efficient Noise-Robust Speech Recognition Front-end
Based on the ETSI Standard. Submitted to
INTERSPEECH’2008. Brisbane, Australia.
Peinado, A., and Segura, J., 2006. Speech Recognition
over Digital Channels: Robustness and Standards,
John Wiley & Sons, Ltd. England.
Tecnovoz, 2008. http://www.tecnovoz.pt/web/home.asp.
Yu, D., Ju, Y., Wang, Y.-Y., and Acero, A., 2006. N-Gram
Based Filler Model for Robust Grammar Authoring.
In Proc. of ICASSP’2006, vol. I, pp. 565-568.
Toulouse, France.
A ROBUST SPEECH COMMAND RECOGNIZER FOR EMBEDDED APPLICATIONS