order was sorted by the duration. The average duration of all keywords was 0.51 [sec]. "i-vector (3 utterances)" shows the i-vector results using three enrollment utterances. "GMM(0) 1st+4th (3 utterances)" and "GMM(10) 1st+4th (3 utterances)" show the results of GMM(0) and GMM(10), respectively, using three enrollment utterances. "GMM(0) 1st+4th (2 utterances)" and "GMM(10) 1st+4th (2 utterances)" show the corresponding results using two enrollment utterances.
Table 4 lists the numerical results. Overall, the performance improved as the keyword duration increased. With GMM(10), the identification rates using two and three enrollment utterances were mostly above 90% for durations longer than 0.5 [sec], and both exceeded the i-vector result obtained with three enrollment utterances, even when GMM(10) used only two.
Figure 6 and Tab. 4 also show that the augmentation was effective for most of the evaluation keywords. The proposed method outperformed the i-vector method, and the augmentation further increased the performance. These results show that the proposed method is effective for speaker identification using short keywords.
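The enrollment-and-scoring scheme compared above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: a single diagonal-covariance Gaussian per speaker stands in for the GMMs, the detector-derived features are simulated with random vectors, and all names (fit_diag_gaussian, identify, spk_a, spk_b) are hypothetical.

```python
import numpy as np

def fit_diag_gaussian(frames):
    # Model an enrolled speaker with one diagonal-covariance Gaussian
    # over frame-level features (a simplification of a GMM).
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # variance floor for stability
    return mu, var

def log_likelihood(frames, mu, var):
    # Average per-frame Gaussian log-likelihood of a test keyword.
    ll = -0.5 * (np.log(2.0 * np.pi * var) + (frames - mu) ** 2 / var)
    return ll.sum(axis=1).mean()

def identify(test_frames, models):
    # Choose the enrolled speaker whose model best explains the keyword.
    scores = {spk: log_likelihood(test_frames, mu, var)
              for spk, (mu, var) in models.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
# Simulated detector features: two speakers, few enrollment frames,
# standing in for the features extracted from the keyword detector's NN.
enroll = {
    "spk_a": rng.normal(0.0, 1.0, size=(200, 16)),
    "spk_b": rng.normal(1.5, 1.0, size=(200, 16)),
}
models = {spk: fit_diag_gaussian(x) for spk, x in enroll.items()}
test = rng.normal(1.5, 1.0, size=(50, 16))  # a short keyword from spk_b
print(identify(test, models))  # prints "spk_b"
```

Even this crude single-Gaussian stand-in conveys why short keywords are hard: with only 50 test frames, the likelihood estimate is noisy, which is why the paper evaluates performance as a function of keyword duration.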
8 CONCLUSION
This paper proposed a speaker identification method that works even for very short keywords recognized by an NN-based detector. Because the speaker-identification feature is keyword-independent, the proposed method can be used in various applications with flexible keywords. Moreover, the computation cost of speaker identification is very small because the feature is derived from the detector's NN without any additional NNs. The identification rate of a conventional i-vector method was 71.22%, whereas the proposed method achieved 89.29% while maintaining a low computation cost. The proposed method was thus clearly better than the conventional i-vector method for speaker identification using short keywords and few enrollment utterances.