having a certain population size is found to vary as a
power of the size of the population, and hence follows
a power law.
The paper is structured as follows: Section 2 describes the database used, which includes speech and EGG data; Section 3 explains the feature extraction process, including SUV distinguishing and PLDC calculation; Section 4 presents the SVM-based experiments; and finally, conclusions are drawn.
2 DATABASE DESCRIPTION
To evaluate the proposed features, the Beihang University Database of Emotional Speech (BHUDES) was set up to provide speech utterances. All utterances were recorded in stereo, with the left channel containing the acoustic data and the right channel containing the EGG data.
SUBJECTS
Fifteen healthy volunteers, seven male and eight female, were invited to establish the database. The emotions used follow the widespread MPEG-4 set, namely joy, anger, disgust, fear, sadness and surprise, with neutrality added. The database contains twenty texts with no emotional tendency. Each sentence was repeated three times for each emotion, so 6,300 utterances (15 speakers × 7 emotions × 20 sentences × 3 repetitions) were obtained. All utterances have a sampling frequency of 11,025 Hz and a mean duration of 1.2 s.
INSTRUMENTATION
Acoustic data were obtained with a BE-8800 electret condenser microphone, and the EGG signals were measured with a TIGEX-EGG3 device (Tiger DRS, Inc., USA). The output of the EGG device was processed by an electronic preamplifier and then by a 16-bit analog-to-digital (A/D) converter housed in an OPTIPLEX 330 personal computer. Both the EGG and acoustic data were analyzed in MATLAB. Raw acoustic and EGG data are shown in Fig. 1, where the silence, voiced and unvoiced segments are marked with vertical lines.
EVALUATION
In addition, an emotional speech evaluation system was established to ensure the reliability of the utterances. Utterances whose emotion is correctly recognized by at least p% of listeners unfamiliar with the speakers are collected into a subset, where p ∈ {50, 60, 70, 80, 90, 100}.
The subset S70 is selected for further experiments because of its appropriate quality and quantity. The S70 subset contains 3,456 Mandarin utterances in total, covering all the emotion categories.
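As an illustration of how such subsets can be formed, the following minimal Python sketch groups utterances by listener recognition rate. The data layout and function name are assumptions made for illustration only, not the actual evaluation tool used for BHUDES.

```python
# Minimal sketch (assumed data layout): build S_p subsets from listener ratings.
# Each utterance carries the percentage of listeners who identified its intended emotion.

def build_subsets(utterances, thresholds=(50, 60, 70, 80, 90, 100)):
    """utterances: list of (utterance_id, recognition_rate_percent) pairs."""
    subsets = {p: [] for p in thresholds}
    for utt_id, rate in utterances:
        for p in thresholds:
            if rate >= p:                      # recognized by at least p% of listeners
                subsets[p].append(utt_id)
    return subsets

# Under this layout, S70 would simply be subsets[70].
```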
Figure 1: Raw acoustic and EGG data.
3 FEATURE EXTRACTION
Feature extraction consists of two steps. First, the voiced speech, unvoiced speech and silence segments are separated using information from both the acoustic and EGG data. Second, we focus on the distribution of time-domain characteristics: the duration distributions of voiced segments, pitch-rise segments and pitch-fall segments are characterized by the power-law distribution coefficient (PLDC).
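As a rough illustration of this kind of characterization, a power-law exponent can be estimated from a set of segment durations by a least-squares line fit on a log-log histogram. The sketch below assumes that fitting procedure; the paper's own PLDC definition may differ.

```python
import numpy as np

def power_law_exponent(durations, n_bins=20):
    """Sketch: estimate a power-law exponent from segment durations by a
    least-squares fit on the log-log histogram (assumed procedure, for
    illustration; not necessarily the paper's PLDC definition)."""
    counts, edges = np.histogram(durations, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = counts > 0                          # keep non-empty bins only
    slope, _ = np.polyfit(np.log(centers[mask]), np.log(counts[mask]), 1)
    return -slope                              # exponent alpha in p(d) ~ d^(-alpha)
```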
SUV DISTINGUISHING
In speech analysis, the SUV decision determines whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal. The measured parameters include the zero-crossing rate, the speech energy, the correlation between adjacent speech samples, etc. (Atal and Rabiner, 1976). The SUV decision is usually performed in conjunction with pitch analysis; however, without EGG information, linking the SUV decision to pitch analysis results in unnecessary complexity. Fig. 2 shows the log energy histograms of the acoustic and EGG data.
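For reference, the following minimal Python sketch computes two of the frame-level measurements mentioned above, short-time log energy and zero-crossing rate. The frame length and hop size are illustrative assumptions, not settings reported in the paper.

```python
import numpy as np

def frame_features(x, frame_len=256, hop=128):
    """Sketch: short-time log energy and zero-crossing rate per frame.
    Frame length and hop are illustrative choices, not the paper's settings."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        log_energy = np.log(np.sum(frame ** 2) + 1e-12)      # avoid log(0)
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)   # fraction of sign changes
        feats.append((log_energy, zcr))
    return np.array(feats)
```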
In Fig. 2, the log energy histograms of both the acoustic and the EGG data have two peaks. The left peak represents the unvoiced or silent segments, while the right peak represents the voiced segments. We use the maximum a posteriori (MAP) method to fit the two classes of data near the two peaks for both the acoustic and the EGG signals; the recognition rates obtained are 95.98% and 99.96%, respectively. This indicates that the EGG signal is an excellent basis for recognizing voiced segments.
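One way to realize such a MAP decision on log energy is to model each class with a Gaussian and choose the class with the higher posterior. The sketch below makes that modelling assumption; it is not necessarily the authors' exact fitting procedure, and the class labels and function names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def map_classifier(voiced_le, unvoiced_le):
    """Sketch: fit one Gaussian per class to labelled log-energy values and
    return a MAP decision function (assumed model; the paper's fit may differ)."""
    classes = {
        "voiced":   (np.mean(voiced_le),   np.std(voiced_le),   len(voiced_le)),
        "unvoiced": (np.mean(unvoiced_le), np.std(unvoiced_le), len(unvoiced_le)),
    }
    total = sum(n for _, _, n in classes.values())

    def classify(log_energy):
        # posterior is proportional to likelihood times class prior
        scores = {c: norm.pdf(log_energy, mu, sd) * (n / total)
                  for c, (mu, sd, n) in classes.items()}
        return max(scores, key=scores.get)

    return classify
```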
Based on the above analysis, we designed three thresholds, which are determined from the statistics shown in Fig. 2. A SUV division algorithm based on these thresholds is shown in Fig. 3.
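Since Fig. 3 itself is not reproduced in the text, the following is only one plausible reading of a three-threshold SUV rule: EGG log energy to detect voiced frames, and acoustic log energy together with zero-crossing rate to separate unvoiced speech from silence. The threshold names and the ordering of the tests are assumptions, not the algorithm of Fig. 3 verbatim.

```python
def suv_decide(egg_log_energy, ac_log_energy, zcr,
               egg_thresh, ac_thresh, zcr_thresh):
    """Plausible three-threshold SUV rule (assumed ordering, not Fig. 3 verbatim)."""
    if egg_log_energy > egg_thresh:            # strong EGG energy: vocal folds vibrating
        return "voiced"
    if ac_log_energy > ac_thresh and zcr > zcr_thresh:
        return "unvoiced"                      # acoustic activity without EGG activity
    return "silence"
```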