training compared to the matched training when the
test utterances are solely from one language.
The main limitation of our work is the difference in the recording conditions of the two
databases. As a result, the classification results reflect
not only the language mismatch but also the
mismatch between the recording conditions. In the
future, we plan to extend our speech database with
new languages recorded under more uniform
environmental and background conditions.
In the experiments, we observed that none of the
three methods performs well under language and
background mismatch conditions with MFCC
features. To improve performance under the
mismatch conditions, we will investigate other
features that are more suitable for the multi-
language age identification task. We also plan to
implement deep neural network architectures such as
recurrent neural networks (RNNs) and convolutional
neural networks (CNNs) for age identification in the
future.
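Since all three methods operate on MFCC inputs, the standard MFCC pipeline they share (framing, windowing, power spectrum, mel filterbank, log compression, DCT) can be sketched as follows. This is a minimal illustration only: the frame length, hop size, filter count, and coefficient count below are generic assumptions, not the exact configuration used in our experiments.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Illustrative MFCC extraction: frame -> window -> |FFT|^2 -> mel -> log -> DCT."""
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)

    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel-scale filterbank
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Log mel energies, then DCT-II to decorrelate into cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return logmel @ dct.T

# Example: one second of synthetic noise stands in for a speech utterance
x = np.random.randn(16000)
feats = mfcc(x)
print(feats.shape)  # (97, 13): 97 frames x 13 cepstral coefficients
```

With a 16 kHz signal, a 512-sample frame, and a 160-sample (10 ms) hop, one second of audio yields 97 frames of 13 coefficients each; a classifier would then consume these frames directly or after utterance-level pooling.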
ACKNOWLEDGEMENTS
This work was supported by The Scientific and
Technological Research Council of Turkey
(TUBITAK) under project number 3150312.