are used for training, and features obtained from two utterances are used for testing. The YOHO database, with four different recording sessions, is used to compare the performance of the features under session variability. It is sampled at 8 kHz, and we use recordings from 138 speakers. For each speaker, the first 20 utterances of the first session are used for training, and the features obtained from the next four utterances of all four sessions are used together for testing.
The feature vectors extracted from the training data are scaled to unit norm. Using these features, we model each speaker by a 32-component Gaussian mixture with diagonal covariance matrices, denoted by $S_{\theta}$. Here, $\theta$ denotes the model parameters (means $\mu$, covariances $\Sigma$ and weights $w$ of the mixture components). The same GMM configuration is used for all the databases, and also for the MFCC features. The test data is classified as belonging to the speaker with the maximum per-sample average log-likelihood, obtained as
$$\mathcal{L}(\theta \,|\, x) = \frac{1}{n} \sum_{i=1}^{n} \log\big(S_{\theta}(x_i)\big) \qquad (1)$$
where $S_{\theta}(x_i) = \sum_{j=1}^{k} w_j \, f(x_i \,|\, \mu_j, \Sigma_j)$; $k\,(=32)$ is the number of mixture components and $n$ is the number of feature vectors $[x = (x_1, x_2, x_3, \ldots, x_n)]$ available in the
test data. The modelling and likelihood estimation are
performed using scikit-learn (Pedregosa et al., 2011).
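As a minimal sketch of this modelling and classification step (not the authors' code; feature arrays, speaker names and dimensions are synthetic placeholders), each speaker gets a 32-component diagonal-covariance GMM fitted on unit-norm feature vectors, and `GaussianMixture.score` in scikit-learn returns exactly the per-sample average log-likelihood of Eq. (1):

```python
# Sketch of per-speaker GMM modelling and max-likelihood classification.
# Training/test features are random placeholders, not real PS-DCT vectors.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Synthetic training features for two hypothetical speakers (frames x dim).
train = {
    "spk_A": rng.normal(0.0, 1.0, size=(500, 56)),
    "spk_B": rng.normal(0.5, 1.0, size=(500, 56)),
}

# Scale each feature vector to unit norm, then fit a 32-component
# diagonal-covariance GMM per speaker.
models = {}
for spk, X in train.items():
    X = normalize(X)  # unit-norm scaling of each feature vector
    models[spk] = GaussianMixture(n_components=32, covariance_type="diag",
                                  random_state=0).fit(X)

# score(X) is the per-sample average log-likelihood L(theta|x) of Eq. (1);
# the test data is assigned to the speaker with the maximum score.
test = normalize(rng.normal(0.5, 1.0, size=(200, 56)))
scores = {spk: gmm.score(test) for spk, gmm in models.items()}
predicted = max(scores, key=scores.get)
```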
The performance of the PS-DCT features is compared with those of existing glottal-based features and 13-dimensional MFCCs. For a fair comparison, MFCCs are computed only from the voiced segments of the speech signal for all the databases. A frame length of 30 ms with a Hanning window and a frame shift of 10 ms are used for computing the MFCC features. The GMM described above is used for the MFCC features as well. The other features compared
are: (i) the deterministic plus stochastic model (DSM) of the residual signal, proposed by Drugman and Dutoit (Drugman and Dutoit, 2012), which they used for SID; and (ii) the DCT of the integrated linear prediction residual (ILPR), proposed in (Abhiram et al., 2015), where the ILPR is used as a voice source estimate. The results reported for these two features are taken from the literature.
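The framing parameters above can be made concrete with a short sketch (the MFCC computation itself, and the voiced-segment selection, are omitted; the signal here is a random placeholder): 30 ms Hanning-windowed frames with a 10 ms shift at the 8 kHz sampling rate used for the databases.

```python
# Sketch of the MFCC framing: 30 ms Hanning window, 10 ms shift, 8 kHz.
import numpy as np

sr = 8000
frame_len = int(0.030 * sr)    # 30 ms -> 240 samples
frame_shift = int(0.010 * sr)  # 10 ms -> 80 samples
window = np.hanning(frame_len)

signal = np.random.default_rng(0).normal(size=sr)  # 1 s of placeholder audio
n_frames = 1 + (len(signal) - frame_len) // frame_shift
frames = np.stack([window * signal[i * frame_shift : i * frame_shift + frame_len]
                   for i in range(n_frames)])
```

Each row of `frames` would then go through the usual mel-filterbank and DCT stages to yield the 13 MFCCs per frame.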
3.2 Speaker Verification (SV) Studies
An i-vector based speaker verification system has been implemented for this work using the Microsoft Identity Toolbox (Sadjadi and Omid, 2013). The entire TIMIT database and a subset of the Mandarin corpus (7000 utterances from 700 speakers) are used as development data for obtaining the universal background model (UBM) of 256 mixtures and the total variability subspace (T-matrix of 400 columns). In total, we use approximately 12 hours of data from 542 female and
788 male speakers. From the YOHO database, we have used recordings from four sessions of each of the 138 speakers for training the speaker models and evaluating the performance. The verification trials consist of all possible model-test combinations, resulting in a total of 19,044 (138 × 138) trials (138 target versus 18,906 impostor trials). From the first session, we have used the first 20 utterances of each speaker as enrollment data for training the speaker models, and from all four sessions, the data obtained from the next four utterances are used for testing. From the Mandarin corpus, 24,025 trials (155 target versus 23,870 impostor trials) from 155 speakers are used for testing; ten utterances from each speaker are used for enrollment, and data from four utterances are used for testing.
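The reported trial counts follow directly from taking all model-test combinations: for N speakers there are N × N trials, of which N are target (same-speaker) and N × (N − 1) are impostor trials. A quick check:

```python
# Verify the trial counts for the YOHO (138 speakers) and Mandarin
# (155 speakers) verification experiments described above.
def trial_counts(n_speakers):
    total = n_speakers * n_speakers   # all model-test combinations
    target = n_speakers               # one same-speaker trial per model
    impostor = total - target
    return total, target, impostor

yoho = trial_counts(138)      # (19044, 138, 18906)
mandarin = trial_counts(155)  # (24025, 155, 23870)
```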
In i-vector based SV systems, both the training and test segments are represented by i-vectors. The dimensionality of the i-vectors is reduced to 200 using linear discriminant analysis (LDA), which removes channel directions in order to increase the discrimination between speaker subspaces. The Baum-Welch statistics (Dehak et al., 2011) are computed from the training and test feature vectors. Using these statistics along with the T-matrix, we compute the train and test i-vectors. After mean and length normalization (Garcia-Romero and Espy-Wilson, 2011), the i-vectors are modelled via a generative factor analysis approach called probabilistic LDA (PLDA). After that, a whitening transformation is applied, which is learned from the i-vectors of the development set (Sadjadi and Omid, 2013). Finally, a linear strategy is used for scoring the verification trials (Sadjadi and Omid, 2013), which computes the log-likelihood ratio of the same-speaker and different-speaker hypotheses.
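The mean and length normalization step mentioned above can be sketched in a few lines (the development i-vectors here are random placeholders; only the normalization itself is shown): subtract the development-set mean, then scale each i-vector to unit Euclidean length before PLDA modelling.

```python
# Sketch of i-vector mean and length normalization
# (Garcia-Romero and Espy-Wilson, 2011). Data are placeholders.
import numpy as np

rng = np.random.default_rng(1)
dev_ivectors = rng.normal(size=(1000, 400))  # placeholder development i-vectors
dev_mean = dev_ivectors.mean(axis=0)

def normalize_ivector(w, mean):
    w = w - mean                    # mean normalization
    return w / np.linalg.norm(w)    # length normalization to the unit sphere

w_norm = normalize_ivector(rng.normal(size=400), dev_mean)
```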
Separate SV experiments are performed using MFCCs, PS-DCT and their score-level combination. For more details about the implemented SV system, one can refer to the Microsoft Identity Toolbox (Sadjadi and Omid, 2013). We have extracted 13 MFCCs along with their delta and delta-delta coefficients to form 39-dimensional feature vectors, and cepstral mean and variance normalization (CMVN) is applied before further processing. Since the dimension of PS-DCT depends on the sampling rate, we have extracted the PS-DCT features after re-sampling all the utterances to 8 kHz. The 56-dimensional PS-DCT features are obtained as mentioned in Sec. 2. Since PS-DCT is not in the cepstral domain, we have not applied CMVN; instead, as post-processing, we have scaled each feature vector individually to unit norm.
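The two post-processing schemes can be contrasted in a short sketch (feature matrices are random placeholders): CMVN standardizes each MFCC coefficient over the utterance, while the PS-DCT vectors are each scaled to unit norm.

```python
# Sketch of the two post-processing schemes: per-utterance CMVN for the
# 39-dimensional MFCCs versus per-vector unit-norm scaling for the
# 56-dimensional PS-DCT features. Arrays are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(2)
mfcc = rng.normal(loc=3.0, scale=2.0, size=(300, 39))  # frames x 39
psdct = rng.normal(size=(300, 56))                     # frames x 56

# CMVN: zero mean, unit variance per coefficient over the utterance.
mfcc_cmvn = (mfcc - mfcc.mean(axis=0)) / mfcc.std(axis=0)

# PS-DCT: each feature vector scaled individually to unit norm.
psdct_norm = psdct / np.linalg.norm(psdct, axis=1, keepdims=True)
```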
ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods