Combining Two Different DNN Architectures for Classifying Speaker’s Age and Gender

Arafat Abu Mallouh, Zakariya Qawaqneh, Buket D. Barkana

2017

Abstract

Speaker age and gender classification is one of the most challenging problems in speech processing. Remarkable developments have recently been achieved in neural networks; the deep neural network (DNN) is now considered one of the state-of-the-art classifiers and has been successful in many speech applications. Motivated by this success, we jointly fine-tune two different DNNs to classify a speaker's age and gender. The first DNN is trained to classify the speaker's gender, while the second is trained to classify the speaker's age. The two pre-trained DNNs are then reused to tune a third DNN (AGender-Tuning) that classifies age and gender together. The results show an improvement in accuracy for the proposed work over the i-vector and GMM-UBM baseline systems. The performance of the proposed work is also compared with other published works on a publicly available database.
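The abstract describes the transfer-learning idea (pre-train a gender DNN and an age DNN, then reuse both to tune a joint classifier) without giving the exact architecture or fine-tuning schedule. The sketch below is only a minimal illustration of that idea in PyTorch under stated assumptions: one plausible reading is that the hidden layers of the two pre-trained networks are reused and a joint head is fine-tuned on top of them. All feature dimensions, layer widths, class counts, and the class AGenderTuning itself are hypothetical, not taken from the paper.

import torch
import torch.nn as nn

FEAT_DIM = 39   # assumed MFCC-style frame features (not from the paper)
HIDDEN = 512    # assumed hidden-layer width
N_GENDER = 2    # male / female
N_AGE = 3       # assumed coarse age groups
N_JOINT = 6     # assumed joint age-and-gender classes

def make_dnn(n_out: int) -> nn.Sequential:
    # Simple feed-forward DNN: two hidden layers plus a task-specific output layer.
    return nn.Sequential(
        nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, n_out),
    )

# Step 1: pre-train one DNN per task (training loops omitted for brevity).
gender_dnn = make_dnn(N_GENDER)  # ... trained on gender labels ...
age_dnn = make_dnn(N_AGE)        # ... trained on age labels ...

# Step 2: build the joint network by reusing both pre-trained bodies
# (everything except the task-specific output layers) and fine-tuning
# a joint head that predicts age and gender together.
class AGenderTuning(nn.Module):
    def __init__(self, gender_dnn: nn.Sequential, age_dnn: nn.Sequential):
        super().__init__()
        self.gender_body = nn.Sequential(*list(gender_dnn.children())[:-1])
        self.age_body = nn.Sequential(*list(age_dnn.children())[:-1])
        self.joint_head = nn.Linear(2 * HIDDEN, N_JOINT)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the two learned representations, then classify jointly.
        h = torch.cat([self.gender_body(x), self.age_body(x)], dim=-1)
        return self.joint_head(h)

model = AGenderTuning(gender_dnn, age_dnn)
logits = model(torch.randn(8, FEAT_DIM))  # batch of 8 feature vectors
print(logits.shape)                       # torch.Size([8, 6])

During fine-tuning, the whole joint network (or only the joint head, depending on the chosen schedule) would be trained on the combined age-gender labels; the paper itself does not specify which layers are frozen.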



Paper Citation


in Harvard Style

Abu Mallouh A., Qawaqneh Z. and Barkana B. (2017). Combining Two Different DNN Architectures for Classifying Speaker’s Age and Gender. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017) ISBN 978-989-758-212-7, pages 112-117. DOI: 10.5220/0006096501120117


in Bibtex Style

@conference{biosignals17,
author={Arafat Abu Mallouh and Zakariya Qawaqneh and Buket D. Barkana},
title={Combining Two Different DNN Architectures for Classifying Speaker’s Age and Gender},
booktitle={Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017)},
year={2017},
pages={112-117},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006096501120117},
isbn={978-989-758-212-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017)
TI - Combining Two Different DNN Architectures for Classifying Speaker’s Age and Gender
SN - 978-989-758-212-7
AU - Abu Mallouh A.
AU - Qawaqneh Z.
AU - Barkana B.
PY - 2017
SP - 112
EP - 117
DO - 10.5220/0006096501120117
ER -