Table 5: Comparison between the proposed work and previous works (%).

Work               System                  Accuracy (%)
(Li et al., 2013)  GMM Base-1              43.1
                   Mean Super Vector-2     42.6
                   MLLR Super Vector-3     36.2
                   TPP Super Vector-4      37.8
                   SVM Base-5              44.6
                   MFuse-1+2+3+4+5         52.7
This work          SDC Class Model-1       47.7
                   SDC Speaker Model-2     49.3
                   Our fused model (1+2)   57.21
Figure 5: Performance of the fused (SSM + SCM) system versus the fusion weight α.
Table 5 compares the classification accuracies of the proposed models with previous work. The best result in (Li et al., 2013) was achieved by manually fusing all five systems (MFuse-1+2+3+4+5). Our fused model improves the accuracy of speaker age and gender classification by approximately 4.5 percentage points over MFuse-1+2+3+4+5 (57.21% versus 52.7%). In addition, the individual SDC class and SDC speaker models each outperform every individual baseline system.
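The weight α in Figure 5 can be read as a score-level fusion weight between the two proposed models. A minimal sketch of such linear fusion follows; the linear rule, the variable names, and the toy scores are assumptions for illustration, since the fusion formula is not spelled out in this excerpt.

```python
import numpy as np

def fuse_scores(scm_scores, ssm_scores, alpha):
    """Weighted score-level fusion of the SDC class model (SCM) and the
    SDC speaker model (SSM) outputs. The linear rule and names are
    illustrative assumptions; the paper reports only accuracy as a
    function of the fusion weight alpha."""
    scm = np.asarray(scm_scores, dtype=float)
    ssm = np.asarray(ssm_scores, dtype=float)
    return alpha * ssm + (1.0 - alpha) * scm

# Toy per-class posteriors for a single utterance (hypothetical values).
scm = [0.2, 0.5, 0.3]
ssm = [0.1, 0.7, 0.2]
fused = fuse_scores(scm, ssm, alpha=0.5)
predicted_class = int(np.argmax(fused))  # index of the winning age/gender class
```

In practice α would be tuned on a held-out set; Figure 5 corresponds to sweeping it from 0.1 to 0.9 and reporting the resulting accuracy.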
5 CONCLUSIONS
In this paper, we proposed DNN-based speaker models using the SDC feature set to improve classification accuracy in speaker age and gender classification. The proposed speaker models and the SDC feature set were compared against class models and the MFCC feature set as a baseline system. Our experimental results show that the speaker models and the SDC feature set outperform the class models and the MFCC feature set. The proposed speaker models perform better on the challenging middle-aged female and male classes, which the other methods fail to classify. We compared the proposed work with the GMM Base, Mean Super Vector, MLLR Super Vector, TPP Super Vector, and SVM Base systems, as well as the fusion of all of them. The results show that the proposed fusion of the SDC speaker model and the SDC class model outperforms all the other systems, achieving an overall classification accuracy of 57.21%.
REFERENCES
Bahari, M.H., McLaren, M. and van Leeuwen, D.A., 2014.
Speaker age estimation using i-vectors. Engineering
Applications of Artificial Intelligence, 34, pp.99-108.
Barkana, B. and Zhou, J., 2015. A new pitch-range based feature set for a speaker's age and gender classification. Applied Acoustics, 98, pp.52-61.
Bocklet, T., Stemmer, G., Zeissler, V. and Nöth, E., 2010,
September. Age and gender recognition based on
multiple systems-early vs. late fusion.
In INTERSPEECH, pp. 2830-2833.
Campbell, W.M., Campbell, J.P., Reynolds, D.A., Singer,
E. and Torres-Carrasquillo, P.A., 2006. Support vector
machines for speaker and language
recognition. Computer Speech & Language, 20(2),
pp.210-229.
Ciregan, D., Meier, U. and Schmidhuber, J., 2012. Multi-
column deep neural networks for image classification.
In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pp. 3642-3649.
Davis, S. and Mermelstein, P., 1980. Comparison of
parametric representations for monosyllabic word
recognition in continuously spoken sentences. IEEE
Transactions on Acoustics, Speech, and Signal
Processing, 28(4), pp.357-366.
Dobry, G., Hecht, R.M., Avigal, M. and Zigel, Y., 2011. Supervector dimension reduction for efficient speaker age estimation based on the acoustic speech signal. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), pp.1975-1985.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P. and Sainath, T.N., 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), pp.82-97.
Li, M., Han, K.J. and Narayanan, S., 2013. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech & Language, 27(1), pp.151-167.
Metze, F., Ajmera, J., Englert, R., Bub, U., Burkhardt, F., Stegmann, J., Muller, C., Huber, R., Andrassy, B., Bauer, J.G. and Littel, B., 2007. Comparison of four approaches to age and gender recognition for telephone applications. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).