The second one is a two-microphone system based on the
coherence function. Experiments showed better performance
of the single-microphone system for unvoiced background
sounds such as passing cars. For voiced background sounds,
simple filtering of modulation components was insufficient
to discriminate speech from the voiced sounds, and the
coherence-based system produced far fewer false speech
detections. Future work will concentrate on applying the
coherence function in the modulation domain.
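
As a rough illustration of the coherence-based idea summarized above, the following minimal sketch estimates the magnitude-squared coherence between two microphone signals with a Welch-style average and compares its mean over frequency with a threshold. It is not the system evaluated in this paper: the function names `msc` and `is_speech`, the frame length, the hop size, and the threshold of 0.5 are all assumptions chosen for the example, and it relies on the common assumption that a nearby talker is strongly coherent across the microphones while diffuse background noise is not.

```python
import numpy as np


def msc(x, y, frame_len=256, hop=128):
    """Welch-style estimate of magnitude-squared coherence per frequency bin.

    Hypothetical helper: frame_len and hop are illustrative defaults.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    n_bins = frame_len // 2 + 1
    sxx = np.zeros(n_bins)                 # auto-spectrum of microphone 1
    syy = np.zeros(n_bins)                 # auto-spectrum of microphone 2
    sxy = np.zeros(n_bins, dtype=complex)  # cross-spectrum
    for i in range(n_frames):
        seg = slice(i * hop, i * hop + frame_len)
        X = np.fft.rfft(window * x[seg])
        Y = np.fft.rfft(window * y[seg])
        sxx += np.abs(X) ** 2
        syy += np.abs(Y) ** 2
        sxy += X * np.conj(Y)
    # A small constant guards against division by zero in silent bins.
    return np.abs(sxy) ** 2 / (sxx * syy + 1e-12)


def is_speech(x, y, threshold=0.5):
    """Flag a block as speech when the mean coherence exceeds the threshold.

    The threshold is an assumed value, not one taken from the paper.
    """
    return float(np.mean(msc(x, y))) > threshold
```

In this sketch a block is a few thousand samples, so that enough frames are averaged for the estimate to be meaningful; with a single frame the estimated coherence is trivially one for any pair of signals.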
ACKNOWLEDGEMENTS
The work presented was developed within VISNET 2,
a European Network of Excellence
(http://www.visnet-noe.org), funded under the
European Commission IST FP6 Programme.