Adaptive Speech Watermarking in Wavelet Domain based on Logarithm

Mehdi Fallahpour, David Megias and Hossein Najaf-Zadeh

Estudis d’Informàtica, Multimèdia i Telecomunicació, Internet Interdisciplinary Institute (IN3),

Universitat Oberta de Catalunya, Rambla del Poblenou, 156, 08018 Barcelona, Spain

Advanced Audio Systems, Communications Research Centre Canada (CRC),

3701 Carling Ave., Ottawa, K2H 8S2, Ontario, Canada

Keywords: Speech Watermarking, Data Hiding, Wavelet Transform, Logarithm.

Abstract: Considering the fact that the human auditory system requires more precision at low amplitudes, the use of a

logarithmic quantization algorithm is an appropriate design strategy. Logarithmic quantization is used for

the approximation coefficients of a wavelet transform to embed the secret bits. To improve robustness, the

approximation coefficients are packed into frames and each secret bit is embedded into a frame. The

experimental results show that the distortion caused by the embedding algorithm is adjustable and lower

than that introduced by a standard ITU-T G.723.1 codec. Therefore, the marked signal has high quality

(PESQ-MOS score around 4.0) and the watermarking scheme is transparent. The capacity is adjustable and

ranges from very low bit-rates to 4000 bits per second. The scheme is shown to be robust against different

attacks such as ITU-T G.711 (a-law and u-law companding), amplification and low-pass RC filters.

1 INTRODUCTION

Practical applications for digital watermarking vary

from copy prevention and traitor tracing to broadcast

monitoring or archiving, among others.

Watermarking can be used to identify the license

information, owners, or other information related to

the digital object carrying the watermark.

Imperceptibility, robustness and capacity are three

important conflicting properties which are used to

evaluate watermarking systems.

Speech watermarking systems usually embed

watermarks in inaudible parts of the speech signals.

Many speech watermarking and information

embedding schemes have been proposed. These

methods can be classified into seven approaches:

least significant bit, phase coding, echo hiding,

analysis by synthesis-based, spectrum techniques in

the transform domain, feature-based and

watermarking combined with coding frames.

Several quantization-based methods have been

proposed in recent years, using discrete

Hartley transform coefficients (Sagi and Malah,

2007), autoregressive model parameters (Chen and

Leung, 2007), and the pitch period (Celik et al.,

2005). The payload range of those systems is from a

few bits to a few hundred bits per second, with

varying robustness against different types of attacks.

The proposed method takes advantage of the

wavelet transform, which divides the signal into low

and high frequency bands. The signal-to-noise ratio

(SNR) values in the low frequency region are in the

range of 15 to 20 dB, and gradually decrease to

zero as the frequency increases. To achieve

robustness, using the low frequency bands is more

advisable, whereas embedding the watermark in

high frequency bands leads to better transparency.

The human auditory system requires more

precision (in terms of absolute errors) at low-energy

audible amplitudes (i.e., audible soft sounds), but is

less sensitive at higher amplitudes. Considering this

fact, and making use of the logarithm, a logarithmic

quantization algorithm is used for approximating the

cA coefficients of a wavelet transform to embed the

secret bits. To improve robustness, the cA samples

are grouped into frames and a single secret bit is

embedded into the corresponding frame. Increasing

the frame size decreases the embedding capacity and

increases the robustness.

The experimental results show that the distortion

caused by the embedding algorithm is adjustable and

lower than that caused by the ITU-T G.723.1 speech

Fallahpour M., Megias D. and Najaf-Zadeh H..

Adaptive Speech Watermarking in Wavelet Domain based on Logarithm.

DOI: 10.5220/0004091704120415

In Proceedings of the International Conference on Security and Cryptography (SECRYPT-2012), pages 412-415

ISBN: 978-989-8565-24-2

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

codec (ITU-T, Recommendation G.723). Since G.723 is a

standard speech codec designed to preserve the quality

of compressed speech, this indicates that the

marked signal has high quality (PESQ-MOS around

4, i.e., near transparent). The embedding rate is

adjustable and ranges from very low bit-rates to

4000 bits per second (bps).

The rest of the paper is organized as follows.

Section 2 introduces the proposed method. In

Section 3, a discussion on the transparency and

robustness of the proposed scheme is provided and

the experimental results are presented. Finally,

Section 4 summarizes our work with a conclusion.

2 PROPOSED SCHEME

The proposed scheme consists of two processes:

embedding and extraction.

2.1 Embedding Process

To embed secret information into the approximation

coefficients of the wavelet transform (cA), these

coefficients are divided into frames and each single

secret bit is embedded into the corresponding frame.

Each wavelet coefficient is mapped into the

logarithm domain and changed depending on the

secret bit. The embedding steps are described below:

1. Compute the first level Daubechies wavelet

transform (db10) of the original signal.

2. Divide the cA samples into frames of a given size

(f), where cA is the approximation of the input signal

(i.e., the output of the low-pass filter of the wavelet

transform).

3. Assume that $s_j$ is the $j$-th secret bit, embedded into the $j$-th frame of cA. Then let

$$
c'_i =
\begin{cases}
\operatorname{sign}(c_i)\,\exp\!\left(\dfrac{\lfloor \Delta \log|c_i| \rfloor}{\Delta}\right), & \text{if } s_j = 0,\\[2ex]
\operatorname{sign}(c_i)\,\exp\!\left(\dfrac{\lfloor \Delta \log|c_i| \rfloor + \tfrac{1}{2}}{\Delta}\right), & \text{if } s_j = 1,
\end{cases}
$$

where $c'_i$ is the marked coefficient, $c_i$ represents a cA coefficient, $\Delta$ is the quantization value and $j = \lfloor (i-1)/f \rfloor + 1$. For example, when j = 1 and f = 4, $s_1$ (the first secret bit) is embedded into $c_1$, $c_2$, $c_3$ and $c_4$; then the second secret bit ($s_2$) is embedded into the second frame ($c_5$, $c_6$, $c_7$ and $c_8$). This process is repeated for the remaining secret bits. Finally, the inverse DWT is applied to the marked wavelet coefficients to obtain the marked audio signal in the time domain.
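The embedding steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the cA coefficients are taken as already computed (the db10 DWT and its inverse are omitted), the logarithm is assumed to be natural, the scaled log-magnitude is assumed to be quantized with a floor for both bit values, and zero-valued coefficients are left unchanged. The function name `embed_bits` is hypothetical.

```python
import math

def embed_bits(cA, bits, f, delta):
    """Embed one bit per frame of f coefficients via logarithmic quantization.

    Assumptions (not confirmed by the paper): natural logarithm, floor-based
    quantizer, zero coefficients and coefficients without a corresponding
    bit are left untouched.
    """
    marked = list(cA)
    for i, c in enumerate(cA):
        j = i // f                                  # 0-based frame index
        if j >= len(bits) or c == 0.0:
            continue
        q = math.floor(delta * math.log(abs(c)))    # integer lattice point
        if bits[j] == 1:
            q += 0.5                                # bit 1 -> half-integer lattice point
        # Map back from the log domain and restore the original sign.
        marked[i] = math.copysign(math.exp(q / delta), c)
    return marked
```

For instance, with delta = 5 a coefficient of 0.8 becomes about 0.670 when carrying a "0" and about 0.741 when carrying a "1"; a larger delta gives a finer lattice and therefore smaller changes.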

2.2 Extracting Process

In the extracting process, we first obtain the wavelet

coefficients of the marked signal. After that, the

difference between the marked samples and the

rounded marked samples is computed to extract a

secret bit. The extracting steps are listed below:

1. Compute the first level Daubechies wavelet

transform (db10) of the marked signal.

2. Extract the bit embedded into each marked

wavelet coefficient based on the following equation:





=

0, if

(

log







<0.25,

1, otherwise,

where 

(



)

=



−round().

To make a decision about the embedded bit in each frame, a voting algorithm is used: for each frame, we count the number of zeros and ones among the extracted values (i.e., $s'_i$). The embedded secret bit in that frame is taken to be the value with the highest count. For instance, if the frame size is f = 5 and the number of $s'_i$ with a value of “1” is more than two, then the embedded bit in the frame is “1”; otherwise it is “0”.
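The per-coefficient decision and the frame vote can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the marked cA coefficients are already available, that the logarithm is natural, that the distance to the nearest integer is taken as an absolute value, that ties in the vote resolve to "0", and that a zero coefficient never votes for "1". The name `extract_bits` is hypothetical.

```python
import math

def extract_bits(marked_cA, f, delta):
    """Recover one bit per full frame of f marked coefficients by majority vote."""
    bits = []
    for start in range(0, len(marked_cA) - f + 1, f):
        ones = 0
        for c in marked_cA[start:start + f]:
            if c == 0.0:
                continue                    # assumption: zeros never vote for "1"
            x = delta * math.log(abs(c))
            # |F(x)| with F(x) = x - round(x): distance to the nearest integer.
            dist = abs(x - math.floor(x + 0.5))
            if dist >= 0.25:
                ones += 1                   # closer to a half-integer -> "1"
        bits.append(1 if ones > f / 2 else 0)   # ties resolve to "0"
    return bits
```

Because the decision threshold sits halfway between the two lattices, a coefficient is still decoded correctly as long as an attack moves delta * log|c| by less than 0.25.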

3 EXPERIMENTAL RESULTS

AND DISCUSSION

Four male speech files sp01.wav – sp04.wav and

five female files sp11.wav – sp15.wav taken from

the Noizeus speech corpus (Hu and Loizou, 2007)

have been selected for our experiments. The

sampling frequency is 8000 Hz and each sample is

represented with 32 bits. Among the properties of

a watermarking system,

capacity, transparency, and robustness are the most

relevant ones. In this section, we discuss these

properties for the proposed watermarking scheme.

3.1 Capacity

Capacity is defined by the number of bits embedded

in one second of the speech file. For different

applications, different ranges of capacity are

demanded. For example, copyright protection only

requires embedding a short identification code, so

a capacity of about 100 bps suffices;

in this kind of application, however, robustness

against various attacks is essential.

The capacity of the proposed scheme can be

modified adaptively according to the requirements.

The embedded capacity can be adjusted using the

AdaptiveSpeechWatermarkinginWaveletDomainbasedonLogarithm

413

frame size f. This is a relevant feature of the

proposed scheme. When the frame size is one, every

approximation coefficient carries one bit and the

capacity reaches its maximum. For instance, when the

sampling rate of a speech file is 8000 samples per

second, there are 4000 coefficients per second

available in the approximation part (cA), which gives

4000 bps; a frame size f divides this capacity by f.
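The relationship between frame size and payload can be written out explicitly. The sketch below assumes a single-level DWT that keeps half of the samples in cA (boundary effects of the db10 filter are ignored) and one bit per frame; `payload_bps` is a hypothetical helper name.

```python
def payload_bps(sample_rate_hz, frame_size):
    """Bits per second when one bit is embedded per frame of cA coefficients."""
    coeffs_per_second = sample_rate_hz / 2   # first-level DWT: cA holds half the samples
    return coeffs_per_second / frame_size

for f in (1, 2, 4, 5):
    print(f, payload_bps(8000, f))
```

At 8000 Hz this gives 4000, 2000, 1000 and 800 bps for f = 1, 2, 4 and 5, which are the payloads listed in Table 1.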

3.2 Transparency

In general, it is difficult to prove that the distortion

caused by coding, watermarking or other processing

operations is imperceptible. In fact, as the perceptual

properties of the human auditory system are quite

complex, it is difficult to map them into linear

equations and prove transparency.

In the proposed scheme, speech samples are

changed based on a logarithmic function, which is a

common technique used by several speech codecs.

We have obtained some results showing that the

distortion introduced by watermarking is lower than

the coding distortion introduced by a G.723 codec

(the corresponding plots are omitted here due to

space limitations). As G.723 is a standard speech

codec designed for high perceived quality, this suggests

that the proposed scheme is perceptually transparent as well.

Experimental results using the Perceptual Evaluation

of Speech Quality (PESQ) objective measurement

(ITU-T, Recommendation P.861) show that the

PESQ-MOS is around 4. PESQ results principally

model mean opinion scores (MOS), which cover a

scale from 1 (bad) to 5 (excellent).

3.3 Robustness

A trade-off between capacity and robustness is

always a challenge for audio watermarking systems.

High capacity usually results in a very fragile

method and, conversely, robust schemes lead to very

low capacity. Repeating a single bit is a simple but

effective idea to increase robustness. Suppose that

a burst error destroys 10

ms of the speech signal. If the secret information

were embedded only in the missing part, we would

lose a relevant part of it. On

the other hand, if the secret information were

repeated in other places, we would not lose all the

secret bits due to the burst error.

In the proposed scheme, we just embed each

single bit into a frame, i.e., the same secret bit is

embedded into all the coefficients in that frame.

Thus, if we were not able to extract the secret bit

from some coefficients, this bit may be extractable

from other coefficients in the same frame. Finally, a

voting technique leads us to extract the embedded

bit in the frame. As mentioned above, considering a

trade-off between capacity and robustness is

necessary. For instance, if the frame size is f = 4,

capacity is decreased by a factor of 4 (compared to f

= 1) but, in return, robustness is increased. Thus, the

parameters of the scheme should be chosen

according to the demands and the specific

application.
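The gain from repetition can be quantified under a simplifying assumption that is not made in the paper: each coefficient in a frame is decoded wrongly with the same probability p, independently of the others. The frame bit is then wrong exactly when the wrong votes form a majority. `frame_error_prob` is a hypothetical helper, and for simplicity it handles only odd frame sizes, where no ties occur.

```python
from math import comb

def frame_error_prob(p, f):
    """Probability that majority voting over f coefficients yields the wrong bit.

    Assumes independent, identically distributed per-coefficient errors.
    """
    if f % 2 == 0:
        raise ValueError("this sketch handles odd frame sizes only (no ties)")
    # Sum the binomial probabilities of k wrong votes for every majority k.
    return sum(comb(f, k) * p**k * (1 - p)**(f - k)
               for k in range(f // 2 + 1, f + 1))
```

With p = 0.1, a frame size of 1 leaves the error at 0.1, while f = 5 reduces it to about 0.0086, at the cost of a five times lower payload. Real attacks such as burst errors produce correlated errors, so this figure is only indicative.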

Table 1 shows the capacity and transparency for the test files with different parameters. By increasing Δ (the quantization value), the distortion decreases. In all the results, the PESQ-MOS lies between 3.4 and 4.1, which means that the quality of the marked signal is high. In addition, changing the frame size changes the capacity (800–4000 bps).

Table 2 shows the robustness of the proposed method against different attacks. The results were obtained with two values of Δ: for Δ = 5, the PESQ-MOS is 3.5 and for Δ = 3, the PESQ-MOS is 3.1; in both cases the frame size is f = 4 and thus the capacity is 1000 bps. The scheme has been tested against the attacks provided in the Stirmark Benchmark (Lang, A., Stirmark Benchmark for Audio), where each attack has different parameters defined on the benchmark’s website. As expected, decreasing Δ increases robustness (and distortion). The table illustrates that this technique is robust against the G.711 codec, with a bit error rate (BER) of at most 0.05.

Table 1: Results of two signals (robust against the attacks in Table 2).

Audio file        Frame size (f)   PESQ-MOS of marked   Payload (bps)
GH (Sp01–Sp04)    1                3.4                  4000
                  4                3.4                  1000
                  1                4.1                  4000
                  4                4.1                  1000
JE (Sp11–Sp15)    2                3.5                  2000
                  5                3.5                  800
                  2                4.0                  2000
                  5                4.0                  800

Table 2: Robustness test results for selected files.

                       Δ = 5             Δ = 3
Attack name            Params.   BER     Params.   BER
Amplify                60–170    0.00    10–250    0.00
AddDynNoise            10        0.05    20        0.05
AddNoise               10        0.05    30        0.05
FFT_Invert             1024      0.00    1024      0.00
RC Low pass filter     2000      0.09    2000      0.07
RC High pass filter    50        0.05    50        0.03
G.711 a-law            –         0.05    –         0.02
G.711 u-law            –         0.01    –         0.01

SECRYPT2012-InternationalConferenceonSecurityandCryptography

414

In Table 3, we compare the performance of the proposed watermarking algorithm with several recent speech watermarking strategies. In (Sagi and Malah, 2007), the MOS of the narrow band (NB) speech is 3.7 and the MOS of the NB speech with embedded data is 3.625. The small difference between the MOS results demonstrates the transparency of their data-embedding scheme. In their simulations, the embedding data rate is 600 information bits/second. The method of (Celik et al., 2005) allows a relatively low embedding capacity (about 3 bps), which is suitable for metadata tagging and authentication applications. However, it is robust to low data-rate (5–8 kbps) speech coders. The focus of (Gurijala and Deller, 2007) is on the robustness performance of linear prediction embedded speech watermarking. The technique is robust to a wide range of attacks including noise addition, cropping, compression, and filtering, but the achieved capacity is low.

Table 3: Comparison of different speech watermarking algorithms.

Algorithm                     SNR (dB)   PESQ-MOS   Payload (bps)
(Sagi and Malah, 2007)        35         3.6        600
(Celik et al., 2005)          –          –          3
(Girin and Marchand, 2004)    High       –          200
(Gurijala and Deller, 2007)   –          –          24
Proposed                      30–40      ~ 4        800–4000

4 CONCLUSIONS

Using the wavelet transform and a logarithmic

quantization results in an adaptive speech

watermarking scheme. Considering the fact that the

human auditory system requires more precision at

low amplitudes (soft sounds) and taking advantage

of the logarithm, a logarithmic quantization

algorithm is used to quantize the approximation

coefficients of the wavelet transform (cA) to embed

the secret bits. To improve robustness, the cA

samples are split into frames and each single secret

bit is embedded into all the samples in the

corresponding frame. Increasing the frame size

decreases the embedding capacity and increases the

robustness.

The experimental results show that the distortion

caused by the embedding algorithm is adjustable and

lower than that introduced by the G.723 speech

codec. Therefore, the marked signal has high quality

(PESQ-MOS around 4), i.e. the proposed

watermarking scheme is transparent. The embedding

rate is adjustable and ranges from very low bit rates

up to 4000 bps, depending on the application. The

scheme is shown to be robust against some attacks

such as ITU-T G.711 compression (a-law and u-law

companding), amplification and RC filters.

ACKNOWLEDGEMENTS

This work was partially funded by the Spanish

Government through projects TSI2007-65406-C03-

03 E-AEGIS, TIN2011-27076-C03-02 CO-

PRIVACY and CONSOLIDER INGENIO 2010

CSD2007-0004 ARES.

REFERENCES

Celik, M., Sharma, G., Tekalp, A. M., 2005. Pitch and

duration modification for speech watermarking. In

Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.

(ICASSP), vol. 2, pp. 17–20.

Chen, S., Leung, H., 2007. Speech bandwidth extension

by data hiding and phonetic classification. In Proc.

IEEE Int. Conf. Acoust., Speech, Signal Process.

(ICASSP), vol. 4, pp. 593–596.

Girin, L., Marchand, S., 2004. Watermarking of speech

signals using the sinusoidal model and frequency

modulation of the partials. In Proc. IEEE Int. Conf.

Acoust., Speech, Signal Process. (ICASSP), vol. 1, pp.

633–636.

Gurijala, A., Deller, J., 2007. On the robustness of

parametric watermarking of speech. In Multimedia

Content Analysis and Mining, ser. Lecture Notes in

Computer Science, Springer, vol. 4577/2007, pp. 501–

510.

Hu, Y., Loizou, P., 2007. Subjective evaluation and

comparison of speech enhancement algorithms.

Speech Communication, 49, 588-601.

ITU-T, Recommendation P.861. http://www.itu.int/rec/T-

REC-P.861/en (accessed on June 22nd, 2012).

ITU-T, Recommendation G.711. http://www.itu.int/rec/T-

REC-G.711/en (accessed on June 22nd, 2012).

ITU-T, Recommendation G.723. http://www.itu.int/rec/T-

REC-G.723/en (accessed on June 22nd, 2012).

Lang, A., Stirmark Benchmark for Audio.

http://wwwiti.cs.uni-magdeburg.de/~alang/smba.php

(accessed on June 22nd, 2012).

Sagi, A., Malah, D., 2007. Bandwidth extension of

telephone speech aided by data embedding. EURASIP

J. Adv. Signal Process., vol. 2007, article ID 64921.

Salomon, D., 2007. Data Compression: the Complete

Reference. Springer.

AdaptiveSpeechWatermarkinginWaveletDomainbasedonLogarithm

415