ROBUST VOICE ACTIVITY DETECTION BASED ON PITCH
AND SUB-BAND ENERGY
Zhihao Zhang and Jinlong Lin
School of Software and Microelectronics, Peking University, Beijing 100871, China
Keywords: Voice activity detection, Pitch, Sub-band energy criteria.
Abstract: A new Voice Activity Detection (VAD) method is proposed that tracks varying background noise and
is robust in both stationary and variable noise environments. Many previous VAD methods assume that
the background contains only certain kinds of noise, so they cannot deal efficiently with the noise found in
practical applications. In the proposed approach, determinate speech, determinate noise and potential speech
regions are defined. The first two regions are located using extracted pitch contour information, and the
ambiguous (potential speech) region is then resolved using sub-band energy thresholds that are updated in
the frequency domain of the obtained determinate noise. Experiments provide an exhaustive comparison with
three standard VAD methods: G729b, ETSI AFE and AMR. The results show that our approach is more
robust than the others under real conditions.
1 INTRODUCTION
Voice Activity Detection (VAD) is defined as a
procedure that separates speech from silence, noise and
other non-voice segments. Because it both
facilitates speech processing and improves the
performance of most recognition applications, VAD
has become an essential front-end processing step
for various speech signal processing systems, such
as speech recognition (L. Karray & A. Martin, 2003),
speech coding (ITU-T, 1997) and speech
communication (Syed W.Q. & H. Wu, 2007).
The importance of VAD for speech signal
processing has attracted growing attention from
researchers. Early VAD algorithms used zero
crossing rates (ITU-T, 1997), linear predictive
coding coefficients, energy thresholds (K. Woo et
al., 2000) and statistical models. The standard VAD
methods G729b (ITU-T, 1997), proposed by the
International Telecommunication Union (ITU), and
Advanced Front-End (AFE) (ETSI, 2007) and
Adaptive Multi-Rate (AMR) (3GPP, 2001),
introduced by the European Telecommunications
Standards Institute (ETSI), are used in speech coding
and communication. They achieve a high speech
detection hit rate, but their non-speech detection
performs less well, especially in variable
background noise. Recently, a method (Syed W.Q.
& H. Wu, 2007) based on an adaptive threshold
related to the Signal-to-Noise Ratio (SNR) was
proposed. This method tracks variable white noise,
babble noise and vehicular noise well, but it has
been tested only to a limited extent on practical
data. A harmonic plus noise model VAD, which
presented a new pitch tracking algorithm, was
introduced in (E. Fisher et al., 2006). However, the
computational complexity of this method makes it
hard to use in real-time applications. Among the
available methods, it is hard to find a robust VAD
method that achieves real-time performance on both
speech and non-speech detection under complex
conditions.
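As a rough illustration of the early frame-level features mentioned above (energy thresholds and zero crossing rates), the following minimal sketch computes both on fixed-length frames and applies a fixed energy threshold. It is not taken from any of the cited standards: the function names, the frame length of 160 samples, the hop of 80 samples and the threshold value are all assumptions chosen for illustration.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames (incomplete tail dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def energy_vad(samples, frame_len=160, hop=80, energy_threshold=0.01):
    """Label each frame True (active) when its energy exceeds a fixed threshold.

    The fixed threshold is an illustrative assumption; it is exactly the kind
    of setting that does not follow a changing noise floor.
    """
    return [short_time_energy(f) > energy_threshold
            for f in frame_signal(samples, frame_len, hop)]


if __name__ == "__main__":
    sr = 8000  # assumed sampling rate in Hz
    silence = [0.0] * 800
    tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
    print(energy_vad(silence + tone))
```

A fixed threshold of this kind works on clean recordings but has no way to adapt when the noise level rises or falls, which is the weakness under variable background noise that the paragraph above describes.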
This paper presents a new robust real-time VAD
method that efficiently tracks real-world noise.
Harmonics, pitch value and sub-band energy criteria
are introduced to locate the speech region and to
track time-varying noise, respectively, without
training time. First, most vowel segments, which are
named the determinate speech region, are detected
by pitch measurement. Then, the sub-band division
and energy thresholds are updated from the
determinate noise region to retrieve the remaining
voiced parts in the potential speech region.
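The two-stage idea described above can be sketched as follows. This is only a schematic reading of the outline, not the authors' algorithm: it assumes that a per-frame pitch decision and per-frame sub-band energies are already available, and the guard interval, the margin factor, the rule "any sub-band above its threshold" and all function names are hypothetical choices made for illustration.

```python
def label_regions(has_pitch, guard=2):
    """Assign each frame to 'speech', 'noise' or 'potential'.

    Assumption: frames with a detected pitch are determinate speech, frames
    within `guard` frames of a pitched frame are potential speech, and all
    remaining frames are treated as determinate noise.
    """
    labels = []
    for i, pitched in enumerate(has_pitch):
        if pitched:
            labels.append("speech")
        elif any(has_pitch[max(0, i - guard):i + guard + 1]):
            labels.append("potential")
        else:
            labels.append("noise")
    return labels


def noise_thresholds(subband_energy, labels, margin=2.0):
    """Per-sub-band threshold: mean energy over noise frames times a margin."""
    noise_frames = [e for e, lab in zip(subband_energy, labels)
                    if lab == "noise"]
    if not noise_frames:
        raise ValueError("no determinate noise frames to estimate thresholds")
    n_bands = len(noise_frames[0])
    return [margin * sum(f[b] for f in noise_frames) / len(noise_frames)
            for b in range(n_bands)]


def resolve_potential(subband_energy, labels, thresholds):
    """Final decision per frame; potential frames pass if any band exceeds its threshold."""
    decisions = []
    for energies, lab in zip(subband_energy, labels):
        if lab == "speech":
            decisions.append(True)
        elif lab == "noise":
            decisions.append(False)
        else:
            decisions.append(any(e > t for e, t in zip(energies, thresholds)))
    return decisions


if __name__ == "__main__":
    has_pitch = [False, False, False, False, True, True,
                 False, False, False, False]
    # Two hypothetical sub-bands (low, high) per frame.
    energy = [[1.0, 1.0]] * 3 + [[1.0, 5.0]] + [[10.0, 3.0]] * 2 \
        + [[1.2, 1.1]] + [[1.0, 1.0]] * 3
    labels = label_regions(has_pitch, guard=1)
    thresholds = noise_thresholds(energy, labels)
    print(labels)
    print(resolve_potential(energy, labels, thresholds))
```

In this toy input, the frame just before the pitched segment has high energy in the second sub-band (as an unvoiced consonant might) and is recovered as speech, while the frame just after it stays close to the noise level and is rejected. A real-time version would presumably update the thresholds as new noise frames arrive rather than averaging over the whole signal.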
This paper is organized as follows: Section 2
presents the principles and framework of the proposed
method. In Section 3, pitch measure enhancement
Zhang Z. and Lin J. (2009).
ROBUST VOICE ACTIVITY DETECTION BASED ON PITCH AND SUB-BAND ENERGY.
In Proceedings of the International Conference on Signal Processing and Multimedia Applications, pages 44-48
DOI: 10.5220/0002221000440048
Copyright © SciTePress