BIMODAL QUANTIZATION OF WIDEBAND SPEECH SPECTRAL
INFORMATION
Driss Guerchi
College of Information Technology, UAE University, 17555, Al Ain, U.A.E.
Keywords:
Wideband CELP coding, Bimodal vector quantization, Low-rate LPC quantization.
Abstract:
In this work we introduce an efficient method to reduce the coding rate of the spectral information in an al-
gebraic code-excited linear prediction (ACELP) wideband codec. The Bimodal Vector Quantization (BMVQ)
exploits the interframe correlation in spectral information to reduce the coding rate while maintaining high
coded speech quality. In the BMVQ training phase, two codebooks are separately designed for voiced and
unvoiced speech. For each speech frame, the optimal codebook for the search procedure is selected according
to the interframe correlation of the spectral information.
The BMVQ was successfully implemented in an ACELP wideband coder. The objective and subjective performance was found to be comparable to that of the combination of split vector quantization and multistage vector quantization at 2.3 kbit/s.
1 INTRODUCTION
Vector Quantization (VQ) is the most popular framework for quantizing the spectrum parameters in Code-Excited Linear Prediction (CELP) codecs. Several variants of the vector quantization technique have been developed to further reduce the coding rate of the Linear Predictive Coding (LPC) vectors while maintaining high coded-speech quality (Agiomyrgiannakis and Stylianou, 2007). For example, both the narrowband G.729 (Salami et al., 1998) and wideband G.722.2 (ITU-T G.722.2, 2003) codec standards use a combination of Split Vector Quantization (SVQ) and Multistage Vector Quantization (MSVQ) to reduce the search algorithm complexity.
While the coding efficiency of vector quantization
relies on its use of the intraframe correlation (that is,
the correlation between the parameters of the same
LPC vector), split vector quantization sacrifices some
of this correlation to decrease the computational com-
plexity. In SVQ, an input vector is divided into two or more subvectors, which are coded individually using
separate codebooks. This quantization method low-
ers the search complexity but at the expense of re-
duced speech quality. In order to reduce the CELP
coding rate, VQ algorithms are often applied to the
error between successive LPC vectors (after conversion to one of the LPC frequency representations, such as Line Spectral Frequencies (LSF) in G.729 or Immittance Spectral Frequencies (ISF) in G.722.2). Better LPC quantization is achieved since some of the
interframe correlation of the spectral information is
exploited. However, this approach is still not fully
efficient since it neglects the interaction of the vocal-tract shape with the pitch of the vocal cords in voiced speech. The suboptimality of the above methods is reflected in the disjoint operation of spectrum-parameter quantization and pitch analysis.
The interframe error of the LPC vectors is not uniform: its variance in voiced speech is very small, whereas in unvoiced speech two consecutive LPC vectors show very weak correlation.
In this paper, we introduce a new technique exploit-
ing the interframe correlation of the spectral infor-
mation to lower the spectrum coding rate. Our ap-
proach, which we name Bimodal Vector Quantization
(BMVQ), codes separately the spectral information in
voiced and unvoiced speech.
In the BMVQ training phase, two disjoint code-
books are individually populated from the spectral pa-
rameters of voiced and unvoiced speech, but only one
codebook is used in the encoding of each frame spec-
trum. As a consequence, the LPC quantization pro-
cess in the BMVQ technique is preceded by the se-
lection of the appropriate codebook. This selection
is based on the interframe correlation of the current
and previous LPC vectors. This approach not only
reduces the LPC coding rate but also produces high
coded speech quality.
2 CLASSICAL LPC VECTOR
QUANTIZATION
2.1 General-purpose Vector
Quantization
Most of the recent speech coder standards use one of
the various vector quantization algorithms to code the
spectral information. In contrast to scalar quantization, VQ techniques reduce the coding rate at the expense of an increase in search computation. The performance of a VQ method is a function of the size of the codebook. A codebook with more entries models the spectral parameters more accurately. However, this improvement in speech quality comes at the price of a higher coding rate and greater computational complexity.
For example, in the G.729 narrowband codec standard, a combination of Multistage VQ and Split VQ (SMVQ) is used to determine which 10-dimensional LSF vector (among all the LSF codebook entries) corresponds most closely to the current frame LSF vector. In the first stage of the search procedure, a 7-bit codebook is searched for the closest match to the difference between the input and predicted LSF vectors, while in the second stage two codebooks of 5 bits each are examined, for a total coding rate of 1.8 kbit/s.
In the G.722.2 wideband coding standard (Bessette et al., 2002), the same VQ technique, with slight modifications, is employed to code 16 ISF coefficients. A total of 46 bits per 20-ms frame is allocated to coding the input ISF vector for all the codec modes, except for the 6.60 kbit/s coder, which searches for the closest codeword among 832 entries (for a bit rate of 1.8 kbit/s). These bit allocations ignore the acoustical characteristics of voiced and unvoiced speech (Tammi et al., 2005). The above vector quantization algorithms can be categorized as general-purpose quantization techniques since they are applied jointly to both voiced and unvoiced speech.
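To make the size/complexity trade-off concrete, the following Python sketch performs a plain full-search VQ; the random codebook is purely hypothetical, and the point is only that the search cost grows linearly with the number of entries (and hence exponentially with the number of bits).

```python
import numpy as np

def vq_search(codebook, x):
    """Full-search VQ: index of the codebook entry closest
    to x under the squared-error distortion measure."""
    distances = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(distances))

# Hypothetical 7-bit codebook of 10-dimensional LSF-like vectors;
# each extra bit doubles the entry count, and with it the search cost.
rng = np.random.default_rng(0)
codebook = rng.random((2 ** 7, 10))   # 128 entries
index = vq_search(codebook, rng.random(10))
```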
2.2 Shortcomings of Classical Vector
Quantization Techniques
To code the LPC coefficients efficiently, one must
employ separate codebooks for voiced and unvoiced
frames. In both G.729 and G.722.2 standards, the er-
ror between the current and past frame LPC vectors
is quantized using a combination of split and multi-
stage vector quantization. While the amplitude of this
error is very small in voiced speech, its magnitude and dynamic range in unvoiced speech are significantly
higher. Figure 1 shows the squared error between
consecutive ISF vectors in a wideband speech signal.
Each of the G.729 and G.722.2 codecs quantizes this error with the same quantizer for both voiced and unvoiced frames. This approach is certainly inefficient since it does not exploit the high interframe correlation
of the voiced speech spectrum. The variable-rate mul-
timode wideband speech codec (Jelinek and Salami,
2007), which is based on a source-controlled coding
paradigm, utilizes separate coding modes for differ-
ent classes of speech. However, the spectral informa-
tion is encoded using the same quantizer in all coding
modes.
An obvious and better approach consists of quantiz-
ing the voiced and unvoiced spectrum information
separately. The source-controlled quantization of the
spectrum parameters will evidently provide higher
coding performance. In (Guerchi, 2007) the inter-
frame correlation of spectrum parameters is used to
reduce the computational complexity by almost 30%
while maintaining the coding rate fixed. An alterna-
tive method of exploiting the high interframe correla-
tion in voiced speech consists of using a smaller-size
quantizer for this class of speech signal.
Figure 1: (a) Wideband speech signal (amplitude versus time in samples); (b) Squared error between consecutive ISF vectors.
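The quantity plotted in Figure 1(b) is straightforward to compute. The sketch below is a minimal illustration, assuming the ISF vectors have already been extracted (one 16-dimensional vector per 20-ms frame); it accumulates the squared error between consecutive ISF vectors, which stays near zero over steady-state voiced segments and rises sharply in unvoiced ones.

```python
import numpy as np

def interframe_squared_error(isf):
    """isf: array of shape (num_frames, 16), one ISF vector per
    20-ms frame. Returns, for each frame after the first, the sum
    of squared differences from the previous frame's ISF vector."""
    return np.sum(np.diff(isf, axis=0) ** 2, axis=1)
```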
3 BIMODAL VECTOR
QUANTIZATION
In this section, we introduce the bimodal vector quan-
tization (BMVQ) technique. This technique, which
consists of two disjoint ISF codebooks, reduces the
coding rate of the LPC coefficients while maintaining toll speech quality. A voiced-speech ISF codebook (VCB) quantizes the spectral information in voiced speech, while a separate ISF codebook (UCB) is used for unvoiced frames.
Each ISF input vector is quantized using exclusively
either the VCB or UCB codebook. The two code-
books are trained individually from voiced and un-
voiced speech segments.
The search for the closest match to an input LP filter vector is confined to a single codebook. The ISF vectors in steady-state voiced speech are highly correlated: both the error between two consecutive ISF vectors and its variance are very small. The ISF error in voiced speech can therefore be quantized with fewer bits than in traditional source-independent vector quantization techniques.
In the BMVQ approach, the LPC parameters are quantized differently according to the relative value of their interframe correlation. For each speech frame, linear prediction analysis is performed to extract 16 LPC coefficients. The estimated LPC vector is compared (after conversion to an equivalent ISF vector) to the quantized ISF vector of the past frame using a squared-error distortion measure. The selection of the optimal codebook (VCB versus UCB) is based on the relative magnitude of this distortion; a small error distortion is a cue of quasi-stationary voiced speech. We propose in this paper to exploit this interframe correlation to estimate the current ISF vector from the past frame ISF coefficients. The residual ISF vector $r_n$ is quantized using either one of the BMVQ codebooks. The details of this algorithm are given in the next section. Figure 2 illustrates the procedure of the BMVQ algorithm.
Figure 2: Concept of the Bimodal Vector Quantization. (Block diagram: speech frame $n$ undergoes LPC analysis and LPC-to-ISF conversion to give $p_n$; the residual $r_n = p_n - \hat{p}_{n-1}$ is routed to the VCB if $e_n \le \varepsilon$ and to the UCB if $e_n > \varepsilon$.)
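The per-frame procedure of Figure 2 can be sketched as follows. This is a simplified illustration: the LPC analysis and LPC-to-ISF conversion are assumed to have been done already, a flat full-search codebook stands in for the split/multistage structure of Tables 1 and 2, and the threshold `eps` plays the role of $\varepsilon$.

```python
import numpy as np

def bmvq_encode_frame(p_n, p_hat_prev, vcb, ucb, eps):
    """One BMVQ frame. p_n: current 16-dim ISF vector;
    p_hat_prev: quantized ISF vector of the previous frame.
    Returns the 1-bit codebook flag, the codeword index, and
    the reconstructed ISF vector used for the next frame."""
    r_n = p_n - p_hat_prev                # interframe ISF residual
    voiced = np.sum(r_n ** 2) <= eps      # small energy -> quasi-stationary
    codebook = vcb if voiced else ucb     # VCB for voiced, UCB otherwise
    idx = int(np.argmin(np.sum((codebook - r_n) ** 2, axis=1)))
    p_hat_n = p_hat_prev + codebook[idx]  # decoder-side reconstruction
    return (0 if voiced else 1), idx, p_hat_n
```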
4 BIMODAL CODEBOOK
DESIGN
In the BMVQ algorithm, two codebooks are trained
from a large speech database. In the first phase of the
training process, we manipulate the speech database
to build two sets of speech segments. The first set contains voiced speech, while the second is populated from unvoiced speech.

Table 1: Bit Allocation of the 33-bit VCB.

    Stage 1            Stage 2
    $r_1$ (6 bits)     $(r_1 - \hat{r}_1)(0:2)$  (5 bits)
                       $(r_1 - \hat{r}_1)(3:5)$  (5 bits)
                       $(r_1 - \hat{r}_1)(6:8)$  (5 bits)
    $r_2$ (6 bits)     $(r_2 - \hat{r}_2)(0:2)$  (3 bits)
                       $(r_2 - \hat{r}_2)(3:6)$  (3 bits)

Table 2: Bit Allocation of the 33-bit UCB.

    Stage 1            Stage 2
    $r_1$ (6 bits)     $(r_1 - \hat{r}_1)(0:2)$  (5 bits)
                       $(r_1 - \hat{r}_1)(3:5)$  (4 bits)
                       $(r_1 - \hat{r}_1)(6:8)$  (4 bits)
    $r_2$ (6 bits)     $(r_2 - \hat{r}_2)(0:2)$  (4 bits)
                       $(r_2 - \hat{r}_2)(3:6)$  (4 bits)
In the second phase, linear predictive analysis of
order 16 is performed on 20-ms speech frames from
each set. The obtained LPC vectors from each speech
class are first converted to ISF vectors and then used
to design two ISF codebooks, VCB and UCB. We
have adopted in the codebook design a combination of SVQ and MSVQ. The quantization procedure is similar to the one used in the G.722.2 standard but with fewer bits.
It is worth noting that for the VCB codebook, a much smaller bit rate is sufficient to produce the same objective and subjective performance as that of the G.722.2 codec standard: the error between two consecutive LPC vectors is very small in voiced speech, and its small variance allows a significant bit-rate reduction.
At the end of the training process, each codebook is characterized by one index $l$, $l = 0, 1$. Tables 1 and 2 show the bit allocation of the VCB and UCB, respectively, where $r_n$ is the ISF error vector between the current ISF vector, $p_n$, and the last frame quantized vector, $\hat{p}_{n-1}$.
In the first stage of the quantization process, the error vector $r_n$ is split into two subvectors $r_{n,1}$ and $r_{n,2}$ of 9 and 7 coefficients, respectively. These two subvectors are quantized to $\hat{r}_{n,1}$ and $\hat{r}_{n,2}$. In the second stage, the resulting quantization errors $r_{n,1} - \hat{r}_{n,1}$ and $r_{n,2} - \hat{r}_{n,2}$ are split into three subvectors and two subvectors, respectively.
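A sketch of this two-stage split quantization, following the VCB bit allocation of Table 1, is given below. The random codebooks are placeholders for the trained ones, and a plain full search is used in every stage.

```python
import numpy as np

def nearest(cb, x):
    """Index of the codebook row closest to x (squared error)."""
    return int(np.argmin(np.sum((cb - x) ** 2, axis=1)))

def quantize_residual(r, stage1, stage2):
    """Two-stage split quantization of a 16-dim ISF residual r.
    Stage 1: split into 9 + 7 coefficients, 6 bits each.
    Stage 2: the stage-1 errors are split into subvectors of
    3, 3, 3 and 3, 4 coefficients (Table 1 layout)."""
    r1, r2 = r[:9], r[9:]
    i1, i2 = nearest(stage1[0], r1), nearest(stage1[1], r2)
    e1, e2 = r1 - stage1[0][i1], r2 - stage1[1][i2]
    subvectors = [e1[0:3], e1[3:6], e1[6:9], e2[0:3], e2[3:7]]
    return (i1, i2), [nearest(cb, s) for cb, s in zip(stage2, subvectors)]

# Hypothetical random codebooks with the 33-bit VCB allocation:
# 6 + 6 bits in stage 1, then 5, 5, 5, 3, 3 bits in stage 2.
rng = np.random.default_rng(0)
stage1 = [rng.normal(size=(2 ** 6, 9)), rng.normal(size=(2 ** 6, 7))]
bits, dims = [5, 5, 5, 3, 3], [3, 3, 3, 3, 4]
stage2 = [rng.normal(size=(2 ** b, d)) for b, d in zip(bits, dims)]
indices = quantize_residual(rng.normal(size=16), stage1, stage2)
```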
To achieve a speech quality comparable to that produced by the G.722.2 standard, our experiments confirmed that a total of 33 bits is required for the voiced-speech ISF codebook (i.e., for the VCB).
Even though the energy of the error $r_n$ is comparatively higher in unvoiced speech, the same amount of 33 bits (for the UCB) is sufficient to provide high coded-speech quality, since unvoiced speech is less sensitive to quantization errors.
One additional bit is transmitted as an index of the selected codebook. The average coding rate for the spectral information is thus equal to 1.7 kbit/s (34 bits per 20-ms frame), a gain of 0.6 kbit/s compared to the SMVQ method in the G.722.2 standard.
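As a quick sanity check of these rate figures, the arithmetic in Python:

```python
frame_s = 0.020            # 20-ms frames
print((33 + 1) / frame_s)  # BMVQ: 33 codebook bits + 1 flag bit -> 1700.0 bit/s
print(46 / frame_s)        # G.722.2 SMVQ allocation -> 2300.0 bit/s
```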
5 SOURCE-CONTROLLED
CODEBOOK
For every speech frame, the input ISF vector, $p_n$, is compared to the last frame quantized ISF vector $\hat{p}_{n-1}$. A comparator checks the error distortion, $r_n = p_n - \hat{p}_{n-1}$, between the two vectors. If the energy of $r_n$ is smaller than a certain threshold $\varepsilon$, then the VCB will be used for the search of the closest codeword to the input vector $p_n$. Otherwise, the UCB will
be selected as the optimal codebook for the ISF quan-
tization. The selection of the best codebook (VCB
versus UCB) is controlled by the type of the source
signal (voiced versus unvoiced). The advantage of
this source-controlled method is that for steady-state
speech frames, the chance of hitting the optimum ISF
vector in the VCB codebook is very high. Table 3
illustrates the algorithm for optimal codebook selec-
tion.
6 EVALUATION
We have conducted several simulations to compare
the performance, in terms of objective and subjec-
tive measures, of the BMVQ technique to the G.722.2
SMVQ approach. The codebooks in both techniques
have been trained using the same database. This is to
avoid any effects of the selection of the database on
the performance results. As an objective measure, we
adopt the Segmental Signal-to-Noise Ratio (SegSNR)
at the output of the decoder. The systems to be evaluated are two versions of the same wideband Algebraic CELP (ACELP) coder. The two coders are identical except for the ISF quantization: the first coder uses the SMVQ approach, while the second implements the BMVQ algorithm.

Table 3: Selection of the optimal codebook.

    $e_n = \sum_{i=0}^{15} r_n^2(i)$
    if $e_n \le \varepsilon$
        optimal codebook = VCB
    else
        optimal codebook = UCB
    end

Table 4: Objective performance of the BMVQ technique.

    Speaker     SegSNR (dB)
                SMVQ      BMVQ
    Female      10.65     10.62
    Male         9.90      9.74
    Average     10.275    10.18

Table 5: Spectral Distortion of the BMVQ technique.

    Technique    Avg SD (dB)    Outliers 2-4 dB (%)    Outliers > 4 dB (%)
    SMVQ         1.31           2.41                   0.06
    BMVQ         1.32           2.44                   0.08

The database for the codebook training consists of 150 minutes of English speech uttered by 8 speakers: four women and four men. Each speaker read the same short utterance
10 times. We used the squared-error ISF distortion for training and testing, while the weighted distortion measure of Paliwal and Atal (Paliwal and Atal, 1993) was used to evaluate the ISF quantization in both versions of the ACELP coder. The evaluation simulations were conducted on six different input sentences uttered by other speakers. In Table 4, we
sentences uttered by other speakers. In Table 4, we
present the SegSNR for both ACELP versions. Ta-
ble 5 shows the average spectral distortion (Avg SD)
between the input ISF vectors and their correspond-
ing quantized versions. Informal listening tests were performed as a subjective measure. In these comparative tests, for each speech signal, listeners heard the two signals produced by the BMVQ- and SMVQ-based coders. The coded signals were presented one pair at a time, in random order, to each listener. For each pair of coded signals, listeners had to state their preference for one of the two. The subjective tests showed that the average preference score is slightly in favor of the SMVQ technique.
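For reference, a minimal segmental SNR computation consistent with the objective measure reported in Table 4 is sketched below; the 320-sample frame (20 ms at 16 kHz) and the absence of any per-frame SNR clipping are assumptions, since these details are implementation-dependent.

```python
import numpy as np

def segsnr(clean, coded, frame_len=320):
    """Segmental SNR in dB: the mean of per-frame SNRs over
    non-overlapping frames of frame_len samples."""
    n = min(len(clean), len(coded)) // frame_len
    snrs = []
    for k in range(n):
        s = clean[k * frame_len:(k + 1) * frame_len]
        e = s - coded[k * frame_len:(k + 1) * frame_len]
        if np.any(s) and np.any(e):            # skip degenerate frames
            snrs.append(10 * np.log10(np.sum(s ** 2) / np.sum(e ** 2)))
    return float(np.mean(snrs))
```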
7 CONCLUSIONS
The objective measures illustrate that the performance of the 1.7 kbit/s BMVQ approach is comparable to that of the 2.3 kbit/s SMVQ method. The BMVQ average coding rate is reduced by 0.6 kbit/s compared to the coding rate of the G.722.2 SMVQ combination. However, the BMVQ technique is not yet robust in the presence of background noise, and its efficiency for highly noisy speech is affected. The correlation between two consecutive LPC vectors is not as high in noisy speech, even for voiced stationary segments. Unlike in clean speech, the number of wrong decisions in the selection of the optimal codebook may increase in noisy voiced segments; the UCB may be selected to quantize the ISF vectors of a voiced frame. Since each of the two BMVQ codebooks is optimized principally for only one class of speech (voiced or unvoiced), a wrong decision in the selection of the optimal codebook will generate relatively high quantization errors.
In future work, we plan to enhance the robustness of the BMVQ approach in the presence of background noise. We intend to design a prefilter that will reduce (or cancel) the background noise before speech coding.
REFERENCES
Y. Agiomyrgiannakis and Y. Stylianou, "Conditional Vector Quantization for Speech Coding", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 377-386, February 2007.

R. Salami, et al., "Design and description of CS-ACELP: A toll quality 8 kb/s speech coder", IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 116-130, March 1998.

ITU-T G.722.2, Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB), July 2003.

B. Bessette, et al., "The adaptive multirate wideband speech codec (AMR-WB)", IEEE Transactions on Speech and Audio Processing, vol. 10, no. 8, pp. 620-636, November 2002.

M. Tammi, M. Jelinek, and V. T. Ruoppila, "Signal modification method for variable bit rate wideband speech coding", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, September 2005.

M. Jelinek and R. Salami, "Wideband speech coding advances in VMR-WB standard", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1167-1179, May 2007.

D. Guerchi, T. Rabie, and A. Louzi, "Voicing-based codebook in low-rate wideband CELP coding", in Proc. of the Tenth European Conference on Speech Communication and Technology (Interspeech 2007 - Eurospeech), Antwerp, Belgium, August 2007, pp. 2505-2508.

K. K. Paliwal and B. S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame", IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, January 1993.