see that the AGender-Tuning system outperforms all
the baseline systems of the manually fused system.
With our method, the accuracy of speaker age and
gender classification improves by approximately 3%
compared with the fused system.
5 CONCLUSION
In this work, we proposed the AGender-Tuning DNN
system, which classifies a speaker's age and gender
by combining two DNN architectures: Age-DNN, which
classifies four age groups, and Gender-DNN, which
classifies gender. A third output layer combines
the output layers of the Age and Gender DNNs using
element-wise summation. The proposed system is
compared with two baseline systems, I-Vector and
GMM-UBM, on the public aGender database. It
achieves better overall accuracy and also better
accuracy for the individual classes. It also
performs well relative to the baseline systems
regardless of the duration of the speaker's
utterance. The overall accuracies of the proposed
system, the I-Vector system, and the GMM-UBM system
are 55.16%, 47.89%, and 43.8%, respectively.
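
The element-wise summation used in the third output layer can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that both output layers emit one score per class over the same class set, so the two vectors have equal length. The function names `fuse_by_summation` and `predict_class` and the example scores are hypothetical.

```python
from typing import List, Sequence


def fuse_by_summation(age_out: Sequence[float],
                      gender_out: Sequence[float]) -> List[float]:
    """Add two output-layer score vectors element by element.

    Assumption (not stated in the paper text above): both vectors
    score the same classes in the same order, so they share a length.
    """
    if len(age_out) != len(gender_out):
        raise ValueError("output layers must have the same length")
    return [a + g for a, g in zip(age_out, gender_out)]


def predict_class(age_out: Sequence[float],
                  gender_out: Sequence[float]) -> int:
    """Return the index of the largest fused score."""
    fused = fuse_by_summation(age_out, gender_out)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for four classes, for illustration only.
age_scores = [0.1, 0.5, 0.3, 0.1]
gender_scores = [0.2, 0.1, 0.6, 0.1]
print(predict_class(age_scores, gender_scores))
```

In this sketch the class with the largest summed score is selected, so a class that neither network ranks first on its own can still win if both networks give it moderately high scores.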