First Experiments on Speaker Identification Combining a New Shift-invariant Phase-related Feature (NRD), MFCCs and F0 Information
Aníbal Ferreira
Department of Electrical and Computers Engineering, University of Porto - Faculty of Engineering, Porto, Portugal
Keywords:
Phase-related Voice Feature, Speaker Identification.
Abstract:
In this paper we report on a number of speaker identification experiments that assume a phonetic-oriented seg-
mentation scheme exists such as to motivate the extraction of psychoacoustically-motivated phase and pitch
related features. MFCC features are also considered for benchmarking. An emphasis is given to an innovative
shift-invariant phase-related feature that is closely linked to the glottal source. A very simple statistical mod-
eling is proposed and adapted in order to highlight the relative discrimination capabilities of different feature
types. Results are presented for individual features and a discussion is also developed regarding possibilities
of fusing features at the speaker modeling stage, or fusing distances at the speaker identification stage.
1 INTRODUCTION
Automatic speaker identification involves using real-
time or recorded samples of the voice of different
speakers, building speaker models, and finding distances between those speaker models. In closed-set
speaker identification, speaker models for all speak-
ers are available. In this case, a successful iden-
tification is simply determined by finding the min-
imum distance between the model of a test speaker
and the different speaker models of all speakers that
are built during the training (or enrolment) phase.
In open-set speaker identification, not all speakers
are known and therefore a special ‘general’ (or av-
erage) speaker model is needed that is frequently
referred to as the Universal Background Model (UBM) (Reynolds, 1997). In this case, the distance between
a test speaker model and the general speaker model
is also computed, in addition to the distances found
between the test speaker model and the models of all
known speakers. Successful identification is achieved
if the distance between the test speaker model and
that of a known speaker is safely shorter (i.e. smaller) than the distance to the general speaker model (this is usually evaluated using Likelihood Ratios (LR) (Ramachandran et al., 2002, page 2808)). If this happens
not to be the case, speaker identification is inconclu-
sive. Automatic speaker identification may also be
characterized and influenced by other important con-
textual and application factors such as matched or
mismatched conditions in the communication channel
and signal acquisition equipment during training and
testing, or such as text-dependent or text-independent
assumptions during training and testing. For a com-
prehensive treatment on the subject of speaker identi-
fication, the reader is referred to (Hansen and Hasan,
2015).
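As a minimal sketch of the closed-set decision rule just described (illustrative only, not the statistical modeling used later in this paper):

```python
import numpy as np

# Minimal sketch of closed-set identification: assign the test speaker to
# the enrolled model at minimum distance. In the open-set case one would
# additionally require the winning distance to beat the UBM distance by a
# sufficient margin (e.g. evaluated via a likelihood ratio).
def identify_closed_set(distances_to_models):
    """distances_to_models: 1-D array with one distance per enrolled speaker."""
    return int(np.argmin(distances_to_models))

print(identify_closed_set(np.array([4.2, 1.7, 3.9])))  # -> 1
```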
In this paper, we assume closed-set speaker
identification as well as channel-matched and text-
independent training and testing. Deliberately, these
are simplifying assumptions allowing us to minimize
variability factors that affect the voice signal and are external to the speaker, and to focus instead ex-
clusively on the diversity of voiced sounds, i.e. those
sounds (e.g. voiced vowels and voiced consonants)
that involve phonation as a result of the vibration of
vocal folds. In this case, the vocal source excitation
has a pattern which is periodic in the time domain, and
has a harmonic structure in the spectral domain which
is also manifested in the voice signal that is captured
by means of a microphone (Sundberg, 1987). In turn,
the harmonic structure has an underlying phase struc-
ture relating all harmonics and whose speaker dis-
criminative potential is the main aspect studied in this
paper.
Differently from automatic speech recognition,
automatic speaker identification focuses on acoustic
aspects of the voice signal that reflect either idiosyn-
cratic traits due to the vocal source signal, i.e. the
glottal excitation, or idiosyncratic traits due to the in-
fluence of supra-laryngeal structures, notably the vo-
cal tract and nasal tract resonances. Dynamic aspects
of voice expression are also idiosyncratic and include
articulation gestures related for example to formant
trajectories (i.e. trajectories of the vocal and nasal
tract resonant frequencies and that can be easily iden-
tified, for example, in spectrograms of diphthong re-
gions of the speech signal) and consonant-to-vowel
(and vowel-to-consonant) co-articulation gestures.
Concerning the signal analysis and feature ex-
traction front-end, the dominant approach in current
speaker identification technology involves capturing
speaker-specific voice and articulation traits by means
of signal features that are extracted from the speech
signal. Feature extraction is usually performed using
i) a simple signal segmentation strategy that merely tries to exclude silence and non-speech sounds, and ii) exclusively the magnitude of a spectral representation of short signal segments (in the order of 20 ms), for example through such features as Mel-
Frequency Cepstral Coefficients (MFCCs) (Davis and
Mermelstein, 1980) or parameters extracted from Lin-
ear Predictive Coding (LPC) analysis (Rabiner and
Juang, 1993).
This approach reflects however two assumptions
we believe are not entirely correct. The first assump-
tion is that the same idiosyncratic voice traits per-
sist in all phonetic realizations by the same speaker,
irrespective of their nature being voiced vowels or
unvoiced consonants, for example. In a previous
study, we have shown that using phonetic-oriented
signal segmentation and conventional MFCC and
GMM speaker modeling, speaker identification per-
formance improves significantly relative to the case
where just uniform, blind, signal segmentation is used
(Mendes and Ferreira, 2012). We therefore argue that
speaker-specific vocal traits are highly localized in
time according to the specificity of the categorical
phonetic realization. It can be argued however that
GMM accommodates this because in a GMM ‘A uni-
modal Gaussian density can be thought of as mod-
eling an acoustic class representing a phonetic event
(like vowel, nasals and fricatives)’ (Ramachandran
et al., 2002, page 2808). Although this is plausible,
to a significant extent we think this can also be a mat-
ter of belief since MFCCs have an intrinsic low res-
olution at high frequencies and are therefore not tai-
lored to capture the subtleties that make for example a
sibilant a special case of a fricative. Although strate-
gies for phonetic-oriented signal segmentation are not
the focus of this paper, we presume one such strat-
egy exists that leads to the segmentation of vowel-like
sounds only, which are those for which it makes sense
to extract features that are related to the phases of the
signal harmonics.
The second incorrect assumption is that phase is
irrelevant in the voice signal analysis and feature ex-
traction process and, thus, has no useful potential in
speaker discrimination. Fortunately, this aspect has
been extensively tackled in the literature and many
authors have addressed the speaker discriminative po-
tential of phase-related features, with positive results,
e.g. (Alam et al., 2015; Wang et al., 2010; Wang
et al., 2009; Nakagawa et al., 2007; Rajan et al.,
2013; Padmanabhan et al., 2009). Taking as a ref-
erence the speaker identification performance that is
achieved using MFCC-related features alone, which
have proven so far to be the most effective acous-
tic features and, therefore, constitute a benchmark-
ing standard typically offering above 70% correct identification, it has been concluded that, in general, phase-related features help to improve the MFCC-based scores by between 0.0% and less than 10.0%, e.g.
(Wang et al., 2010).
The novelty of our paper lies in the nature of
the phase-related features. In fact, our approach is
based on a phase-related feature involving the ac-
curate phase relationships between the relevant har-
monic components in a quasi-periodic signal. We
named this feature the Normalized Relative Delay
(NRD). In short, NRD coefficients are equivalent to
the parametric phase of the harmonics when a Fourier
analysis is performed for any periodic signal. To-
gether with the accurate magnitudes of those harmon-
ics, they fully characterize the shape invariance of any
periodic waveform. Section 3.1 provides additional
information on the NRD feature.
This paper is a follow-up to a previous paper (Fer-
reira, 2014) in which we concluded NRD coefficients
(NRDs) possess the potential to discriminate between
speakers because NRDs exhibit essentially the same
profile even for different vowels uttered by the same
speaker, i.e. they reflect the idiosyncratic nature of
the glottal pulse of a specific speaker rather than the
contribution of the vocal/nasal tract filter which obvi-
ously changes for different vowels because it is their
spectral envelope that conveys the linguistic meaning.
This is a conclusion which we believe also brings in-
novative insights as we discuss in (Ferreira and Tri-
bolet, 2018). In this paper, however, we take as a ref-
erence other results in the literature that also address
phase-related features.
For example, (Nakagawa et al., 2007; Wang et al.,
2009; Wang et al., 2010) use the phase of the first 12
coefficients of a DFT analysis which is made relative
to the phase of the 1 kHz reference DFT bin. This
relative phase is projected into the coordinates of a
unit circle and the result is subject to Gaussian Mix-
ture Modeling (GMM). The speaker identification de-
cision is obtained at the score level by combining the
likelihoods due on one hand to the GMM modeling
of MFCCs, and due on the other hand to the GMM
modeling of the projected phase.
The modified group delay (MODGD) is used in
(Padmanabhan et al., 2009) to indirectly model phase
and the result is further converted into cepstral fea-
tures (MODGDF). Both MFCCs and MODGDF are
modelled using GMMs. Interestingly, the authors
mention that MODGDF and MFCCs are significantly
related, which suggests they are not independent. Neither the features nor the feature scores are combined.
Speaker identification performance is graphically il-
lustrated by means of the Equal Error Rate (EER) for the NIST 2003 database and, interestingly, MODGDFs appear to offer around EER = 17%, which improves on the EER achieved by MFCCs.
In (Alam et al., 2015) MODGD features are also
used although in an indirect way: they are extracted
from the LPC model that fits the signal under analysis
and then they are converted into 18 cepstral coef-
ficients. Speaker modeling uses a GMM-UBM ap-
proach and results on the NIST 2010 database indi-
cate that the EER performance varies between 10%
and 23% for LPC-based MODGD features, while for
MFCC features the EER performance varies between
about 7% and 35%.
In (Rajan et al., 2013) the authors also follow a
similar approach but also consider other feature sets
including MODGD and LPC-based MODGD features
and study several score fusion variations. GMM-
UBM modeling is also used and EER performance
results indicate that MODGD may achieve a perfor-
mance similar to that of MFCCs (including ∆ and ∆∆ coefficients), and that their combination may reach an EER below 1%.
It is interesting to note that all of these studies
avoid phonetic signal segmentation or dedicated har-
monic analysis. Also, all studies mention that diago-
nal covariance matrices are used.
The remainder of the paper is organised as fol-
lows. In Sec. 2 we describe the database and charac-
terize its specificities and, in Sec. 3, we describe the
four types of features under focus in this paper. The
simple performance criteria used in this paper are ad-
dressed in Sec. 4. In Sec. 5 we present and discuss
results characterizing the performance of individual
features. Section 6 describes our attempts to com-
bine different feature types in the same feature vector
that is used for speaker modeling, and Sec. 7 presents
and discusses our results combining distances aris-
ing from separate speaker models based on individual
features. Finally, Sec. 8 concludes this paper.
2 THE SPEAKERS DATABASE
Our database includes the speech/voice recordings of 37 speakers, 20 female and 17 male. The particularities of this database
have been described in (Fernandes and Ferreira,
2017; Ferreira and Fernandes, 2017). For example,
it includes 5 pairs of twin brothers or sisters, and 9 triplets consisting of a pair of twins and a relative of the same gender and about the same age. In addition, it includes two in-
dependent conversations, one conducted over a GSM
channel, and another one conducted over a VoIP
channel. Both telephonic-quality and full-quality
versions of the same conversation are available as
contemporaneous recordings made at the subject end,
and at the interviewer end. In both conversations, in
addition to a simulated dialogue lasting more than
2 minutes, the subjects were also asked to utter a
sequence of five sustained vowels each one lasting
about 1 second. In this paper, we are using only the
high-quality versions of the vowel recordings which
were manually segmented and labelled. Thus, in this
paper, for each speaker, we use the five sustained
vowels of one conversation to build a speaker model,
and we use the vowels of the other conversation to
test the speaker model. This way all data is mutually
exclusive. In this paper, the sampling frequency of all recordings is 22050 Hz, and the sample resolution is 16 bits.
3 NRD, MFCC & F0 FEATURES
In this section, we address the three types of features
under study in this paper, and place an emphasis on
the time-shift invariant phase-related NRD feature.
3.1 The NRD Feature
We have introduced the NRD concept in (Sousa and
Ferreira, 2010), and have been using it in singing
voice analysis (Sousa and Ferreira, 2011), glottal
source modeling (Dias et al., 2011; Dias and Ferreira,
2013; Dias and Ferreira, 2014), speaker identification
(Mendes and Ferreira, 2012; Ferreira, 2014), para-
metric audio coding (Ferreira and Sinha, 2016) and
dysphonic voice reconstruction (Ferreira, 2016).
Smooth phase descriptors for harmonic signals
that are similar to NRD were also proposed by
Stylianou in 1996 (phase envelope (Stylianou, 1996,
page 44)) and Saratxaga in 2009 (Relative Phase Shift, RPS (Saratxaga et al., 2009)).
The Normalized Relative Delay (NRD) feature
results from the accurate estimation of the absolute
phase of each harmonic pertaining to a periodic wave-
form, which is further converted to a relative phase
taking as a reference that of the fundamental fre-
quency. This makes the NRD time-shift invari-
ant and, by definition, the NRD of the fundamental
frequency is always zero. The result is further normal-
ized taking into consideration the accurate period (in
samples) of each harmonic. Thus, for each harmonic,
the NRD is a real number between −1 and +1. Because it is a number which represents a fraction of the
half-period of that harmonic, it expresses how much
that harmonic is ‘delayed’ to build the time shape of
the periodic waveform it belongs to. In short, the
NRD is a phase-related feature that is relative to the
phase of the fundamental frequency and that is further
normalized by the accurate period of the harmonic it
is associated with. Hence, the NRD is intrinsically
time-shift invariant, and is independent of the period
and overall amplitude of the periodic waveform. Il-
lustrative examples are provided in (Sousa and Fer-
reira, 2010; Ferreira, 2014). Still, NRDs preserve the
properties of phase, which means phase wrapping and phase unwrapping also apply.
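A minimal sketch of how such a feature can be computed is given below; it assumes a steady harmonic frame with a known fundamental frequency and uses simple nearest-DFT-bin phase picking, whereas our actual analysis relies on accurate per-harmonic frequency and phase estimation:

```python
import numpy as np

def nrd_sketch(frame, f0, fs, n_harm=19):
    """Minimal NRD sketch: relative, period-normalized harmonic phases.

    Assumes n_harm * f0 is below the Nyquist frequency; this nearest-bin
    approximation does not reproduce the paper's accurate estimator."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    phi1 = np.angle(spec[np.argmin(np.abs(freqs - f0))])  # phase of the fundamental
    nrds = np.zeros(n_harm)
    for k in range(1, n_harm + 1):
        phi_k = np.angle(spec[np.argmin(np.abs(freqs - k * f0))])
        rel = phi_k - k * phi1  # relative phase: invariant under time shifts
        # wrap to (-pi, pi], then normalize by the half-period -> [-1, +1]
        nrds[k - 1] = np.angle(np.exp(1j * rel)) / np.pi
    return nrds
```

Note that for k = 1 the relative phase is zero by construction, which matches the property that the NRD of the fundamental frequency is always zero.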
Figures 1 and 2 represent the unwrapped NRD
feature vectors that were extracted from a sustained
vowel (about 1 second long) uttered by a male and a
female speaker, respectively.

Figure 1: Unwrapped NRDs extracted for a sustained vowel uttered by a male speaker. The thick magenta line represents the average NRD model up to harmonic 19.

These figures are quite
representative of many other examples that can be ex-
tracted for different vowels and speakers and essen-
tially lead to the conclusion that a stable and consis-
tent trend is perfectly identifiable for harmonic or-
ders up to 20, as we have already anticipated and il-
lustrated in (Ferreira, 2014). The fact that inconsis-
tencies arise in the unwrapped NRDs for very high
harmonic orders is mainly due to two reasons: i) the
magnitude of those harmonics is extremely low which
means they are significantly affected by noise which
makes the accurate phase estimation process more
difficult, and ii) the accurate period of each harmonic,
which is computed individually accounting for some
degree of inharmonicity, is extremely short, e.g. less
than three samples, which makes the NRD estimation
even more difficult. This, however, is not problematic because the most important NRD trend is defined by the lower order and stronger harmonics.

Figure 2: Unwrapped NRDs extracted for a sustained vowel uttered by a female speaker. The thick magenta line represents the average NRD model up to harmonic 19.

The lower 20
harmonics represent a spectral region easily encom-
passing the first three and even four formant frequen-
cies. If higher order NRDs are needed, as in paramet-
ric voice or audio synthesis, for example (Ferreira, 2016; Ferreira and Sinha, 2016), they can simply be
extrapolated from the lower order harmonics.
In (Ferreira and Tribolet, 2018) we describe NRDs
in more detail and we highlight that NRDs reflect es-
sentially phase attributes due to the vibration of the
vocal folds.
3.2 MFCC Features
As we mentioned in Section 1, MFCC features are
standard benchmarking features that are very impor-
tant to assess the relative performance of new fea-
tures. We use the melcepst() function that is avail-
able in the popular and freely available Voicebox tool-
box. The melcepst() function is very convenient
because we can configure the analysis parameters so
that they are comparable to the analysis parameters
involved in NRD computation. In both cases, we
use 22050 Hz sampling frequency and the maximum
number of both MFCC and NRD coefficients is 19. It
should be noted that because our training and testing
conditions are quite simplistic (vowel sounds only, to-
tal duration about 5 sec. both for training and testing),
we are not using ∆ or ∆∆ MFCC coefficients.
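For reference, a rough Python counterpart of this configuration (a sketch only: librosa's MFCC pipeline is not bit-exact with melcepst(), and the file name is hypothetical) could look like:

```python
import librosa

# Hedged Python stand-in for Voicebox's melcepst(); parameters mirror the
# paper's setup (22050 Hz, up to 19 coefficients, ~20 ms frames) but do not
# reproduce the MATLAB implementation exactly.
y, sr = librosa.load("vowel_a.wav", sr=22050)  # hypothetical file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19,
                            n_fft=512, hop_length=256)  # ~23 ms windows
print(mfcc.shape)  # (19, n_frames)
```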
3.3 F0-related Features
Fundamental frequency estimation can be imple-
mented using any accurate and robust algorithm, e.g.
(de Cheveigné and Kawahara, 2002). In this paper,
we use an algorithm we have developed for accurate
singing analysis (Ventura et al., 2012) and general au-
dio and harmonic-based sounds (Ferreira and Sousa,
2010; Ferreira and Sinha, 2016). Using this algo-
rithm, the average fundamental frequency is found for
each speaker and vowel signal (F0) as well as its stan-
dard deviation, which we represent as F0std.
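As an illustration, the two F0-related features can be obtained with any robust estimator; the sketch below uses librosa's YIN implementation as a stand-in for our own algorithm, with illustrative search limits:

```python
import numpy as np
import librosa

# Sketch of the two F0-related features (mean F0 and its standard deviation)
# for one sustained vowel. The file name and the fmin/fmax search limits are
# illustrative assumptions, not the paper's actual configuration.
y, sr = librosa.load("vowel_a.wav", sr=22050)        # hypothetical file
f0_track = librosa.yin(y, fmin=60, fmax=500, sr=sr)  # per-frame F0 estimates
f0, f0std = np.mean(f0_track), np.std(f0_track)
```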
4 PERFORMANCE CRITERIA
In this paper, we are not aiming at state-of-the-art
performance. We are instead motivated by a prag-
matic approach that was suggested by Hansen and
Hasan: ‘Ideally, if features could be designed in
such a way that no intra-speaker variation is present
while inter-speaker discrimination is maximum, the
simplest methods of modeling might have sufficed’
(Hansen and Hasan, 2015, page 85) and ‘if acoustic
features are improved, simple modeling techniques
will be sufficient’ (Hansen and Hasan, 2015, page
86). Thus, we follow a simple approach, also adopted
by several authors, e.g. (Segundo et al., 2017; Fer-
reira, 2007), of statistical modeling using mean (µ)
and variance (σ) vectors and covariance matrices (C).
Distances are found using the Euclidean-based Mahalanobis metric

$$d^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu}). \qquad (1)$$
Since our training and test conditions are symmet-
ric, we use a more general definition that expresses
symmetric distance. Assuming we want to find the
distance between speaker k using data pertaining to
his/her conversation A (please note that as indicated
in Section 2, two similar but different recordings exist
for the same speaker, for simplicity we denote his/her
conversations as A and B), and speaker ℓ using data
pertaining to his/her conversation B, distances are ac-
tually computed using
$$d^2_{k,\ell} = (\mathbf{x}_{k,A} - \boldsymbol{\mu}_{\ell,B})^T \mathbf{C}^{-1}_{\ell,B} (\mathbf{x}_{k,A} - \boldsymbol{\mu}_{\ell,B}) + (\mathbf{x}_{\ell,B} - \boldsymbol{\mu}_{k,A})^T \mathbf{C}^{-1}_{k,A} (\mathbf{x}_{\ell,B} - \boldsymbol{\mu}_{k,A}). \qquad (2)$$
It should be noted that, in general, $d^2_{k,\ell} \neq d^2_{\ell,k}$. In most cases in the literature, the covariance matrices are forced to be diagonal. This means that correlations between different coefficients in the feature vector are ignored and, therefore, distances are normalized Euclidean. This has significant computational advantages as the inverse of a diagonal covariance matrix is easy to compute, and all values contributing to $d^2_{k,\ell}$ in (2) are positive. In Section 7 we will redefine (2) so that singularity of full covariance matrices is avoided and all values involved in the computation of $d^2_{k,\ell}$ are positive.
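For the diagonal-covariance case, Eq. (2) admits a compact sketch; as a simplification we take each side's mean feature vector as the test vector x:

```python
import numpy as np

# Sketch of the symmetric distance of Eq. (2) for diagonal covariances.
# Each model is summarized by a mean vector and a per-coefficient variance;
# each quadratic form reduces to a variance-normalized Euclidean distance.
def symmetric_distance(mu_kA, var_kA, mu_lB, var_lB):
    d_ab = np.sum((mu_kA - mu_lB) ** 2 / var_lB)  # x_{k,A} vs model (l, B)
    d_ba = np.sum((mu_lB - mu_kA) ** 2 / var_kA)  # x_{l,B} vs model (k, A)
    return d_ab + d_ba
```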
A matrix of size 37×37, including the distances for all pairs of speakers, is used to produce a
match matrix by looking at distances along the rows
and along the columns and by setting the identifica-
tion according to the minimum distance found. Ide-
ally, this matrix should be diagonal and contain the
value 2 along the diagonal. The percent correct iden-
tification is found by summing all values along the
main diagonal and dividing the result by the sum of
all values in the match matrix.
The matrix containing all distances can also be
converted into a ‘score’ matrix by just making all dis-
tances negative. It becomes then possible to compute
other performance metrics such as the Equal Error
Rate (EER). This performance criterion will be used
in Sec. 5.1 and Sec. 7.
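A sketch of this evaluation procedure under the above conventions (a 37×37 distance matrix, with voting over row and column minima) might read:

```python
import numpy as np

# Sketch of the evaluation in this section: build the match matrix from the
# distance matrix D by taking the minimum along each row and each column,
# then compute the percent correct identification. Negating D yields the
# score matrix used for the EER analysis.
def percent_correct(D):
    n = D.shape[0]
    match = np.zeros_like(D)
    for i in range(n):
        match[i, np.argmin(D[i, :])] += 1  # closest model for test speaker i
        match[np.argmin(D[:, i]), i] += 1  # closest test speaker for model i
    # ideally `match` is diagonal with the value 2 along the diagonal
    return 100.0 * np.trace(match) / np.sum(match)

# scores = -D  # distances converted to scores for the EER computation
```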
5 PERFORMANCE OF
INDIVIDUAL FEATURES
In this section, we present the performance results for
each individual feature type (NRD, MFCC, F0 and
F0std). Before we address each feature type individ-
ually, a few preliminary considerations are in order
concerning the reliability of the phase information ac-
cording to the specific vowel.
Overlapping the average NRD feature vectors for
different vowels uttered by the same speaker, we were
expecting to observe a clear consistent trend as the
preliminary results in (Ferreira, 2014) suggested. In
most cases we indeed observed that trend as the ex-
amples illustrated in Figs. 3 and 4 confirm.
However, in a number of cases, we also observed
a significant deviation of the mean NRD feature vec-
tor for vowels /i/ and /u/, from the trend defined by
the mean feature vectors found for vowels /a/, /e/ and
/o/. Figures 5 and 6 provide an illustration of this out-
come.
An analysis of this outcome led to the conclusion
that, as already noted earlier in this paper (Section
3.1) and as further discussed in (Ferreira and Tribo-
let, 2018), the relative magnitude of the different har-
monics plays an important role.

Figure 3: Overlap of the average NRD feature vector found for different vowels by a female speaker.

Figure 4: Overlap of the average NRD feature vector found for different vowels by a male speaker.

In fact, of the five tonic vowels considered in this paper, vowel /u/ is
the one whose spectral decay is the steepest from low
frequencies. This means that harmonics whose order
is higher than one have a very small magnitude and,
as a consequence, phase estimation is adversely af-
fected. On the other hand, vowel /i/ is the one with
the widest separation between the two important res-
onant frequencies (in the literature this is also referred to as the separation between the F1 and F2 formant frequencies). This
also creates a long spectral valley in which harmon-
ics have a reduced magnitude which also adversely
affects phase estimation.
This outcome suggested, however, that speaker
identification performance may depend on whether data from just three vowels (/a/, /e/ and /o/) or from all five vowels is used to build speaker models.
This will be addressed in the following subsections.
Figure 5: Overlap of the average NRD feature vector found
for different vowels by a female speaker. It can be seen that
the NRDs associated with vowels /i/ and /u/ deviate from
the trend defined by the remaining vowels.
Figure 6: Overlap of the average NRD feature vector found
for different vowels by a male speaker. It can be seen that
the NRDs associated with vowels /i/ and /u/ deviate from
the trend defined by the remaining vowels.
5.1 Performance of NRD Features
It has been illustrated in Sec. 3.1 that, in most cases,
NRD coefficients are reliable and stable up to har-
monic order 20 and even beyond, especially in the
case of male voices, whose spectrum is known to be
less sparse than in the case of female voices. It is thus
an interesting matter to assess what the percent cor-
rect identification is when NRD coefficients are used
alone, and how that depends on the size of the NRD
feature vector. We studied these alternatives using the
distance metric given by Eq. 2, the performance crite-
rion defined in Sec. 4, and using diagonal covariance
matrices.
Figure 7 represents the percent correct speaker
identification when the size of the NRD feature vec-
tor varies between 1 and 19, and when three vow-
els (/a/, /e/ and /o/) are used for speaker modeling,
or when all five vowels are used.

Figure 7: Percent correct speaker identification when NRD features are used for speaker modeling. Results are presented when the NRD feature vector size varies between 1 and 19, and when speaker modeling uses data from either 3 vowels or 5 vowels.

It can be con-
cluded that speaker modeling using data from just
three vowels is especially beneficial when the NRD
feature vector size exceeds 14. This is to be expected
as the higher the harmonic number, the less reliable
the phase estimation process is, especially for vow-
els /u/ and /i/, as noted before. The lesson learned at
this point is that, quite conveniently, a vowel-oriented
phonetic segmentation scheme is not strictly required
for phase-based speaker modeling.
Secondly, it can be observed in Fig. 7 that the
percent correct speaker identification using NRD fea-
tures only varies between 32% and around 42%. Al-
though this performance level is not impressive, it
is significant and suggests that NRDs may positively
contribute to the speaker identification performance
when different features are combined.
The simplistic statistical modeling that is adopted
in this paper, and whose motivation is to focus on the
intrinsic discriminative capabilities of different fea-
tures, as it was explained in Sec. 4, does not make it
easy to compare our results against other results in the
literature. Even so, for example Rajan et al. (Rajan
et al., 2013) indicate that the EER is 32.9% when us-
ing standard (18-dimensional) group-delay features,
a 128-mixture GMM-UBM, the NIST SRE 2010 data
and normal vocal effort conditions. In order to ob-
tain EER-like scores from our data, as we indicated
in Sec. 4, we converted the distance values into neg-
ative values and took the resulting score matrix to the
Biometrics software for analysis. The result for the 3-vowel speaker modeling condition is represented in Fig. 8. Results for 3-vowel and 5-vowel speaker modeling are presented in Table 1.

Figure 8: Equal Error Rate (EER) when 19-dimensional NRD feature vectors are used and speaker modeling is based on 3 vowels.

Table 1: EER for 19-dimensional NRD feature vectors.
EER (3 vowels): 26.61%
EER (5 vowels): 31.72%

In another result, Wang et al. (Wang et al., 2010) report that using 128-mixture GMM speaker modeling (and diagonal covariance matrices), a database containing 35
speakers, and a 12-dimensional phase-related feature
vector, the achieved correct identification varies be-
tween 63.4% and 73.4%. These results are better than
those we obtained which is probably due to the so-
phisticated statistical modeling that is used in (Wang
et al., 2010), relative to ours.
Lastly, we emphasize that our results have been
obtained using diagonal covariance matrices. We
tested full covariance matrices but these were found to
be singular which confirms that a linear dependency
exists among different NRD coefficients.
5.2 Performance of MFCC Features
In this section, we report on the speaker identification
results using MFCC coefficients alone. We were mo-
tivated by the same concern as in Sec. 5.1: evaluating
how results vary as a function of the feature vector
size and according to the two possibilities of 3 or 5
vowel speaker modeling.
As in Sec. 5.1, we also consider here the distance
metric given by Eq. 2, the performance criterion de-
fined in Sec. 4, and diagonal covariance matrices.
Figure 9 represents the percent correct speaker
identification when the size of the MFCC feature vec-
tor varies between 1 and 19, and when three vowels
(/a/, /e/ and /o/) are used for speaker modeling, or
when all five vowels are used. A first important out-
come is that results for 5-vowel speaker modeling are
almost always better than those that are obtained for
3-vowel speaker modeling. This is particularly noticeable for MFCC feature sizes larger than 10. This
is somewhat expected and just confirms that a richer
diversity in the data is beneficial. Secondly, the high-
est speaker identification performance is obtained for MFCC feature sizes between 13 and 14.

Figure 9: Percent correct speaker identification when MFCC features are used for speaker modeling. Results are presented when the MFCC feature vector size varies between 1 and 19, and when speaker modeling uses data from either 3 vowels or 5 vowels.

Although
this is perfectly in line with the typical 13-coefficient
MFCC feature size commonly found in the literature
and currently used by actual software and hardware
systems, we were expecting to see a different number
given that the speaker modeling conditions are pecu-
liar. Lastly, the best percent correct identification is
77% which is fair but clearly not competitive, as we
have anticipated already, and which can be explained
by the fact that (deliberately) we use a very simple,
Euclidean-based distance metric.
It is also worth stressing that MFCC results have
been obtained using diagonal covariance matrices.
We also tested with full covariance matrices but these
were found to be singular which suggests a linear de-
pendency may exist among different MFCC coeffi-
cients.
5.3 Performance of F0 & F0std Features
Table 2 presents the percent correct speaker identi-
fication results when the average fundamental fre-
quency (F0) alone is used, when the F0 standard de-
viation is used, when both values are used and the co-
variance matrices are forced to be diagonal, and when
full covariance matrices are allowed in the compu-
tation of the distance metric (2).

Table 2: Percent correct speaker identification using F0 frequency only, using the standard deviation of F0, using both frequency values and diagonal covariance matrices, or using full covariance matrices.
F0: 29.73%
F0std: 25.68%
F0+F0std (diagonal): 39.19%
F0+F0std (full): 40.54%

The most relevant
conclusion is that both F0 and F0std possess visible potential for helping the speaker discrimination task, and that this potential is even more pronounced when both values are combined.
6 PERFORMANCE WHEN
FEATURES ARE COMBINED
We extended our simulation environment to allow any
combination of any number (from 1 to 19) of NRD
coefficients, any number (from 1 to 19) of MFCC co-
efficients, and any number of F0-related features (i.e.
from 1 to 2, 1 meaning just F0 and 2 meaning F0 plus
F0std) in an extended feature vector. Iterating over all
possibilities looking for the maximum overall identi-
fication performance, and using diagonal covariance
matrices, we were surprised to find that the best re-
sults were even poorer than those achieved with in-
dividual features. For example, we obtained 67.57%
which is less than the best results (77%) obtained us-
ing MFCCs alone (as reported in Sec. 5.2). In short,
using diagonal covariance matrices, data fusion at the
feature level, when all three types of features are si-
multaneously present in the feature vectors, does not
appear to lead to improvements.
As expected, using full covariance matrices leads
to numerical errors as covariance matrices are singu-
lar.
Before moving to a different fusion strategy, we
tried to arrange the covariance matrices so that they
are diagonal only in those regions where covariance
elements relate feature coefficients of the same type.
In other words, considering that each feature vec-
tor includes NRD, MFCC and F0-related coefficients,
the overall covariance matrix can be split into 9 sub-
matrices, of which only the three lying along the main diagonal are forced to be locally diagonal. Al-
though the number of cases in which the overall co-
variance matrix is singular is smaller, the problem still
persists.
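A sketch of this partially diagonal arrangement follows; the per-type feature sizes are illustrative assumptions:

```python
import numpy as np

# Sketch of the partially diagonal covariance described above: the three
# sub-matrices on the main diagonal (NRD-NRD, MFCC-MFCC, F0-F0) are forced
# to be diagonal, while cross-feature sub-matrices keep their full values.
def partially_diagonal(C, sizes=(19, 19, 2)):  # hypothetical feature sizes
    C = C.copy()
    start = 0
    for n in sizes:
        block = C[start:start + n, start:start + n]
        C[start:start + n, start:start + n] = np.diag(np.diag(block))
        start += n
    return C
```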
7 PERFORMANCE WHEN
DISTANCES ARE COMBINED
Since our attempts to ‘fuse’ features into extended
feature vectors did not produce useful results, we
moved to “score fusion”, which is a more promising approach (Ramachandran et al.,
2002). According to this approach, distances (or
scores) for speaker identification are obtained sepa-
rately for different types of features (in our case we
have three: NRD, MFCC and F0-related features).
These distances are then combined, or ‘fused’, giving
rise to an overall distance which determines a success-
ful or unsuccessful speaker identification.
In this process, we also investigated why full co-
variance matrices did not produce useful results as re-
ported in Secs. 5.1 and 5.2.
Let us assume that we deal with a full-rank covariance matrix $\mathbf{C}$. Simple examples may help to highlight the fact that even when all elements in $\mathbf{C}^{-1}$ are positive-valued (which is not always guaranteed), there are good chances that some terms in computing $d^2$ according to Eq. (1) that involve off-diagonal elements in $\mathbf{C}^{-1}$ are in fact negative, which goes against the concept of distance which, by definition, is always positive. As a consequence, we redefined Eq.
(1) according to two simple and fair principles:
- the new distance metric should reduce to Eq. (1) as a particular case when diagonal covariance matrices are used;
- it should involve only positive values in the computations.
Hence, considering that when diagonal covariance matrices are used, Eq. (1) delivers a variance-normalized Euclidean distance, we extended this concept such that when products of difference factors are involved that pertain to different feature coefficients, the absolute value of those difference factors is considered instead, followed by a normalization using the standard deviations of the feature coefficients involved. Thus, the modified distance is computed as follows:

$$d^2 = |\mathbf{x} - \boldsymbol{\mu}|^T \, \mathbf{S} \, |\mathbf{x} - \boldsymbol{\mu}|, \qquad (3)$$

where $\mathbf{S}$ is first obtained as the outer product of the standard deviations of the feature vector, i.e. $\mathbf{S} = \boldsymbol{\sigma}\boldsymbol{\sigma}^T$, and finally all elements in $\mathbf{S}$ are replaced by their reciprocals, i.e. $s_{i,j} \leftarrow (s_{i,j})^{-1}$. Clearly, when $\mathbf{C}$ is diagonal, then $\mathbf{S} = \mathbf{C}^{-1}$ and Eq. (3) gives rise to the same result as Eq. (1).
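In code, the modified distance admits a direct sketch:

```python
import numpy as np

# Sketch of the modified distance of Eq. (3): absolute differences weighted
# by the reciprocal outer product of the standard deviations. When only the
# diagonal of S is kept, this reduces to the variance-normalized Euclidean
# distance of Eq. (1), and every contribution to d^2 is positive.
def modified_distance(x, mu, sigma):
    S = 1.0 / np.outer(sigma, sigma)  # s_ij = (sigma_i * sigma_j)^-1
    diff = np.abs(x - mu)
    return diff @ S @ diff
```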
When the modified distance metric according to
Eq. (3) is introduced in the distance metric defined by
Eq. (2), new results are obtained for the three types of
features. Those concerning NRD and MFCC are rep-
resented in Figs. 10 and 11, respectively. The results in these figures for diagonal S matrix are exactly the same as those represented in Figs. 7 and 9 for the case of 5-vowel speaker modeling.

Figure 10: Percent correct speaker identification when NRD features based on 5 vowels are used for speaker modeling. Results are presented when the NRD feature vector size varies between 1 and 19, and when the modified distance metric uses a full or diagonal S matrix.

Figure 11: Percent correct speaker identification when MFCC features based on 5 vowels are used for speaker modeling. Results are presented when the MFCC feature vector size varies between 1 and 19, and when the modified distance metric uses a full or diagonal S matrix.

It can be concluded
that concerning NRDs, the maximum performance in-
creases slightly (from about 42% to 43%) and then
performance decreases for feature vector sizes larger
than 14. Concerning MFCCs, results for full S ma-
trix are almost always slightly above those obtained
for diagonal S matrix and, in particular, the maximum
performance increases from around 77%, as already
seen in Sec. 5.2, to around 84%. Finally, the results
regarding F0-related features are presented in Table
3. The only new fact in this case, taking into consideration the results presented in Table 2, is that the
results for full S matrix are even better than the results
that were obtained with full covariance C matrix.
Table 3: Percent correct speaker identification using F0 frequency only, using F0 + F0std features and diagonal S matrix, and using F0 + F0std features and full S matrix.
F0: 29.73%
F0+F0std (diag S): 39.19%
F0+F0std (full S): 41.89%
According to Ramachandran et al. (Ramachan-
dran et al., 2002), improvements in fusion should be
expected only when the data leads to uncorrelated errors when the features are considered individually. We have
thus analysed the behaviour of distances for those
cases of correct identification using each feature type
individually. These results are shown in Fig. 12.
The feature vector size is 14 in the case of NRD and
MFCC, and 2 in the case of F0-related features. Sev-
eral interesting conclusions can be extracted from Fig.
12. First, when MFCC features are used, speakers numbered 2, 6, 16, 17 and 18 are not correctly identified, and the same happens for NRD and F0-related features, with the exception of speaker number 18, who is correctly identified using NRD features. This shows, however, that identification errors for different features are not truly uncorrelated, which suggests that fusion of distances is likely to face difficulties.

Figure 12: Distances associated with correct identification cases when NRD features are used (top panel), when MFCC features are used (middle panel) and when F0-related features are used (lower panel). The numbers on the horizontal axis represent the speaker number; no bar means that the speaker has not been correctly identified.
Our attempts to fuse distances involved additive
and multiplicative combinations as we were looking
for the simplest possible ways. Only multiplicative
combinations produced useful results. We started by
combining MFCC and NRD distances and enforcing
a saturation point on the NRD distance. We allowed
our simulation environment to search for the maxi-
mum performance using any combination of feature
vector sizes. We concluded that the maximum percent
correct speaker identification was 86.48%. As Fig.
13 illustrates, an analysis of the resulting score matrix
(after simple conversion of distances to scores, as sug-
gested in Sect. 4), leads to an EER = 8.3%.

Figure 13: Equal Error Rate (EER) when 14-dimensional NRD and MFCC feature vectors are separately used in speaker modeling but distances are combined in speaker identification.

For a given saturation threshold (25), this result emerged by us-
ing 12-dimensional NRD vectors and 14-dimensional
MFCC vectors. Interestingly, the same correct identi-
fication performance (86.48%) was also obtained for
other saturation levels, e.g. 35, in which case the di-
mension of NRD vectors is 15, and that of MFCC fea-
ture vectors is 14.
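A sketch of this multiplicative fusion, under our reading of the saturation mechanism, is:

```python
import numpy as np

# Sketch of the multiplicative distance fusion described above: the NRD
# distance is clipped at a saturation threshold (25 in the best reported
# configuration) before being multiplied with the MFCC distance. This is
# one plausible reading of "enforcing a saturation point on the NRD
# distance", not a verbatim reproduction of the implementation.
def fused_distance(d_mfcc, d_nrd, saturation=25.0):
    return d_mfcc * np.minimum(d_nrd, saturation)
```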
Trying to combine in a multiplicative way the dis-
tances due to F0-based modeling, we have observed that
results do not improve. In fact, they drop instead to
81.1% correct identification.
Using the best scenario obtained so far combining
NRD and MFCC features, we also assessed the im-
pact of using diagonal S matrices. We concluded that
the correct identification level dropped significantly
from 86.48% to 74.3%.
Quick overall conclusions are as follows:
- despite the deliberately simplistic statistical modeling conditions in this paper, the best speaker identification score that we achieved, in the order of 86.5%, can be considered fairly good, especially given that about half of the subjects in the database are twins, and that the recordings for speaker model training and testing consist of five sustained vowels, each about 1 second long;
- mixing acoustic features of different types does not appear to add constructively in the same feature vector, which possibly explains why only MFCC-related features are used by most speaker identification systems;
- when speaker modeling and identification are implemented separately using different categories of features, and the performance of the individual methods is quite asymmetric, their combination is not guaranteed to deliver improved results; on the contrary, chances are that the opposite happens;
- while improving fusion at the score level or at the distance level is beneficial, optimizing fusion too much can turn into a kind of black magic that serves rather the purpose of overfitting the training and testing data;
- our results confirm that even for the very special conditions governing database construction, MFCC features deliver the best performance;
- despite the fact that the innovative time-shift invariant phase-related feature, as well as the F0-related features, exhibit a potentially interesting speaker discrimination capability when considered separately for speaker modeling and identification, further research is needed to combine them constructively such as to improve the overall performance.
8 CONCLUSIONS
In this paper we described a number of speaker iden-
tification experiments which focused on simple meth-
ods evaluating the performance of three types of
acoustic features, notably an innovative, time-shift
invariant, phase-related feature. We assessed the
speaker identification performance of all three types
of features individually considered, and studied fea-
ture fusion and distance fusion methods looking for
improved performance. Immediate future work will
address assessing how the performance in speaker
identification is affected when the high-quality voice
recordings we have used in this paper are replaced by
their corresponding telephonic versions. More long-
term future work will concentrate on extended, re-
fined or new features that possess a psychoacousti-
cal meaning and allow us to effectively capture speaker
idiosyncratic traits (or gestures) without requiring so-
phisticated statistical modeling approaches. Of par-
ticular interest are ‘minutia level’ features, namely
pitch striations (Hansen and Hasan, 2015).
ACKNOWLEDGEMENTS
The author would like to thank Anil Alexander and
Oscar Forth at Oxford Wave Research for the use of
the Bio-Metrics software making it possible to pro-
duce some of the results presented in this paper.
This work was financed by FEDER - Fundo Europeu de Desenvolvimento Regional funds through the COMPETE 2020 - Operational Programme for Competitiveness and Internationalization (POCI), and by Portuguese funds through FCT - Fundação para a Ciência e a Tecnologia in the framework of the project POCI-01-0145-FEDER-029308.
REFERENCES
Alam, M. J., Kenny, P., and Stafylakis, T. (2015). Com-
bining amplitude and phase-based features for speaker
verification with short duration utterances. In Pro-
ceedings of Interspeech, pages 249–253.
Davis, S. B. and Mermelstein, P. (1980). Comparison
of parametric representations for monosyllabic word
recognition in continuously spoken sentences. IEEE
Transactions on Acoustics, Speech and Signal Pro-
cessing, 28(4):357–366.
de Cheveigné, A. and Kawahara, H. (2002). YIN, a funda-
mental frequency estimator for speech and music. J.
Acoustical Society of America, 111(4):1917–1930.
Dias, S. and Ferreira, A. (2013). A hybrid LF-Rosenberg
frequency-domain model of the glottal source. In 2013
IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics.
Dias, S. and Ferreira, A. (2014). Glottal pulse estimation -
a frequency domain approach. In Speech Processing
Conference. Tel-Aviv, Israel.
Dias, S., Sousa, R., and Ferreira, A. (2011). Glottal inverse
filtering: a new road-map and first results. In Speech
Processing Conference. Tel-Aviv, Israel.
Fernandes, V. and Ferreira, A. (2017). On the relevance
of F0, Jitter, Shimmer and HNR acoustic parameters
in forensic voice comparisons using GSM, VOIP and
contemporaneous high-quality voice records. In AES
Conference on Audio Forensics.
Ferreira, A. (2014). On the possibility of speaker discrim-
ination using a glottal pulse phase-related feature. In
IEEE International Symposium on Signal Processing
and Information Technology -ISSPIT. Noida, India.
Ferreira, A. (2016). Implantation of voicing on whispered
speech using frequency-domain parametric modelling
of source and filter information. In International Sym-
posium on Signal, Image, Video and Communications
(ISIVC), pages 159–166. Tunis, Tunisia.
Ferreira, A. and Fernandes, V. (2017). Consistency of the
F0, Jitter, Shimmer and HNR voice parameters in
GSM and VOIP communication. In DSP 2017.
Ferreira, A. and Sinha, D. (2016). Advances to a frequency-
domain parametric coder of wideband speech. 140th
Convention of the Audio Engineering Society. Paper
9509.
Ferreira, A. J. and Tribolet, J. M. (2018). A holistic glottal phase related feature. In 21st International Confer-
ence on Digital Audio Effects (DAFx-18). Aveiro, Por-
tugal.
Ferreira, A. J. S. (2007). Static features in real-time recog-
nition of isolated vowels at high pitch. Journal of the
Acoustical Society of America, 112(4):2389–2404.
Ferreira, A. J. S. and Sousa, R. (2010). DFT-based fre-
quency estimation under harmonic interference. In 4th
International Symposium on Communications, Con-
trol and Signal Processing.
Hansen, J. H. L. and Hasan, T. (2015). Speaker recognition
by machines and humans. IEEE Signal Processing
Magazine, 32(6):136–145.
Mendes, D. and Ferreira, A. (2012). Speaker identification
using phonetic segmentation and normalized relative
delays of source harmonics. In 46th AES International
Conference on Audio Forensics, pages 215–222. Paper
9-1.
Nakagawa, S., Asakawa, K., and Wang, L. (2007). Speaker
recognition by combining MFCC and phase informa-
tion. In Proceedings of Interspeech, pages 2005–
2008.
Padmanabhan, R., Parthasarathi, S. H. K., and Murthy,
H. A. (2009). Using group delay functions from all-
pole models for speaker recognition. In Proceedings
of Interspeech, pages 2355–2358.
Rabiner, L. and Juang, B.-H. (1993). Fundamentals of
Speech Recognition. Prentice-Hall, Inc.
Rajan, P., Kinnunen, T., Hanilci, C., Pohjalainen, J., and
Alku, P. (2013). Using group delay functions from all-
pole models for speaker recognition. In Proceedings
of Interspeech, pages 2489–2493.
Ramachandran, R. P., Farrell, K. R., Ramachandran, R., and
Mammone, R. J. (2002). Speaker recognition - general
classifier approaches and data fusion methods. Pattern
Recognition, 35(12):2801–2821.
Reynolds, D. A. (1997). Comparison of background normalization methods for text-independent speaker verification. In Proceedings of Eurospeech, pages 963–
966.
Saratxaga, I., Hernaez, I., Erro, D., Navas, E., and Sanchez,
J. (2009). Simple representation of signal phase for
harmonic speech models. Electronics Letters, 45(381).
Segundo, E. S., Tsanas, A., and Gómez-Vilda, P. (2017). Euclidean distances as measures of speaker similarity
including identical twin pairs: a forensic investigation
using source and filter voice characteristics. Forensic
Science International, (270):25–38.
Sousa, R. and Ferreira, A. (2010). Importance of the rel-
ative delay of glottal source harmonics. In 39th AES
International Conference on Audio Forensics - prac-
tices and challenges, pages 59–69.
Sousa, R. and Ferreira, A. (2011). Singing voice analysis
using relative harmonic delays. In 12th Annual Con-
ference of the International Speech Communication
Association (Interspeech-2011), pages 1997–2000.
Stylianou, Y. (1996). Harmonic plus noise models for
speech, combined with statistical methods, for speech
and speaker modification. PhD thesis, École Nationale Supérieure des Télécommunications, France.
Sundberg, J. (1987). The Science of the Singing Voice.
Northern Illinois University Press.
Ventura, J., Sousa, R., and Ferreira, A. (2012). Accurate
analysis and visual feedback of vibrato in singing.
In 5th International Symposium on Communications
Control and Signal Processing (ISCCSP).
Wang, L., Minami, K., Yamamoto, K., and Nakagawa, S.
(2010). Speaker identification by combining MFCC
and phase information in noisy environments. In IEEE
International Conference on Acoustics, Speech and
Signal Processing, pages 4502–4505.
Wang, L., Ohtsuka, S., and Nakagawa, S. (2009). High
improvement of speaker identification and verification
by combining mfcc and phase information. In IEEE
International Conference on Acoustics, Speech and
Signal Processing, pages 4529–4532.