are used for training, and features obtained from two utterances are used for testing. The YOHO database, with four different recording sessions, is used to compare the performance of the features under session variability. It is sampled at 8 kHz, and we use recordings from 138 speakers. For each speaker, the first 20 utterances of the first session are used for training, and the features obtained from the next four utterances of all four sessions are used together for testing.
The feature vectors extracted from the training data are scaled to unit norm. Using these features, we model each speaker by a 32-component Gaussian mixture with diagonal covariance matrices, denoted by $S_{\theta}$. Here, $\theta$ denotes the model parameters (means $\mu$, covariances $\Sigma$ and weights $w$ of the mixture components). The same GMM configuration is used for all the databases, and also for the MFCC features. The test data is classified as belonging to the speaker with the maximum per-sample average log-likelihood, obtained as
$$\mathcal{L}(\theta \,|\, x) = \frac{1}{n} \sum_{i=1}^{n} \log\big(S_{\theta}(x_i)\big) \qquad (1)$$
where $S_{\theta}(x_i) = \sum_{j=1}^{k} w_j \, f(x_i \,|\, \mu_j, \Sigma_j)$; $k\,(=32)$ is the number of mixture components and $n$ is the number of feature vectors $[x = (x_1, x_2, x_3, \ldots, x_n)]$ available in the
test data. The modelling and likelihood estimation are
performed using scikit-learn (Pedregosa et al., 2011).
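As a minimal sketch of this modelling and classification step (not the authors' code; feature arrays, speaker names and dimensions are synthetic placeholders), each speaker gets a 32-component diagonal-covariance GMM fitted on unit-norm feature vectors, and `GaussianMixture.score` in scikit-learn returns exactly the per-sample average log-likelihood of Eq. (1):

```python
# Sketch of per-speaker GMM modelling and max-likelihood classification.
# Training/test features are random placeholders, not real PS-DCT vectors.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Synthetic training features for two hypothetical speakers (frames x dim).
train = {
    "spk_A": rng.normal(0.0, 1.0, size=(500, 56)),
    "spk_B": rng.normal(0.5, 1.0, size=(500, 56)),
}

# Scale each feature vector to unit norm, then fit a 32-component
# diagonal-covariance GMM per speaker.
models = {}
for spk, X in train.items():
    X = normalize(X)  # unit-norm scaling of each feature vector
    models[spk] = GaussianMixture(n_components=32, covariance_type="diag",
                                  random_state=0).fit(X)

# score(X) is the per-sample average log-likelihood L(theta|x) of Eq. (1);
# the test data is assigned to the speaker with the maximum score.
test = normalize(rng.normal(0.5, 1.0, size=(200, 56)))
scores = {spk: gmm.score(test) for spk, gmm in models.items()}
predicted = max(scores, key=scores.get)
```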
The performance of the PS-DCT features is compared with those of existing glottal-based features and 13-dimensional MFCCs. For a fair comparison, MFCCs are computed only from the voiced segments of the speech signal for all the databases. A frame length of 30 ms with a Hanning window and a frame shift of 10 ms are used for computing the MFCC features. The GMM described above is used for the MFCC features as well. The other features compared
are: (i) the deterministic plus stochastic model (DSM) of the residual signal, proposed by Drugman and Dutoit (Drugman and Dutoit, 2012), which they used for SID; and (ii) the DCT of the integrated linear prediction residual (ILPR), proposed in (Abhiram et al., 2015), where the ILPR is used as a voice source estimate. The results reported for these two features are taken from the literature.
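The framing parameters above can be made concrete with a short sketch (the MFCC computation itself, and the voiced-segment selection, are omitted; the signal here is a random placeholder): 30 ms Hanning-windowed frames with a 10 ms shift at the 8 kHz sampling rate used for the databases.

```python
# Sketch of the MFCC framing: 30 ms Hanning window, 10 ms shift, 8 kHz.
import numpy as np

sr = 8000
frame_len = int(0.030 * sr)    # 30 ms -> 240 samples
frame_shift = int(0.010 * sr)  # 10 ms -> 80 samples
window = np.hanning(frame_len)

signal = np.random.default_rng(0).normal(size=sr)  # 1 s of placeholder audio
n_frames = 1 + (len(signal) - frame_len) // frame_shift
frames = np.stack([window * signal[i * frame_shift : i * frame_shift + frame_len]
                   for i in range(n_frames)])
```

Each row of `frames` would then go through the usual mel-filterbank and DCT stages to yield the 13 MFCCs per frame.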
3.2 Speaker Verification (SV) Studies
An i-vector based speaker verification system has been implemented for this work using the Microsoft Identity Toolbox (Sadjadi and Omid, 2013). The entire TIMIT database and a subset of the Mandarin corpus (7000 utterances from 700 speakers) are used as development data for obtaining the universal background model (UBM) of 256 mixtures and the total variability subspace (T-matrix of 400 columns). In total, we use approximately 12 hours of data from 542 female and
788 male speakers. From the YOHO database, we have used recordings from four sessions of each of the 138 speakers for training the speaker models and evaluating the performance. The verification trials consist of all possible model-test combinations, resulting in a total of 19,044 (138 × 138) trials (138 target versus 18,906 impostor trials). From the first session, we have used the first 20 utterances of each speaker as enrollment data for training the speaker models, and from all four sessions, the data obtained from the next four utterances are used for testing. From the Mandarin corpus, 24,025 trials (155 target versus 23,870 impostor trials) from 155 speakers are used for testing; ten utterances from each speaker are used for enrollment, and data from four utterances are used for testing.
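The reported trial counts follow directly from taking all model-test combinations: for N speakers there are N × N trials, of which N are target (same-speaker) and N × (N − 1) are impostor trials. A quick check:

```python
# Verify the trial counts for the YOHO (138 speakers) and Mandarin
# (155 speakers) verification experiments described above.
def trial_counts(n_speakers):
    total = n_speakers * n_speakers   # all model-test combinations
    target = n_speakers               # one same-speaker trial per model
    impostor = total - target
    return total, target, impostor

yoho = trial_counts(138)      # (19044, 138, 18906)
mandarin = trial_counts(155)  # (24025, 155, 23870)
```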
In i-vector based SV systems, both the training and test segments are represented by i-vectors. The dimensionality of the i-vectors is reduced to 200 using linear discriminant analysis (LDA), which removes channel directions in order to increase the discrimination between speaker subspaces. The Baum-Welch statistics (Dehak et al., 2011) are computed from the training and test feature vectors. Using these statistics along with the T-matrix, we compute the train and test i-vectors. After mean and length normalization (Garcia-Romero and Espy-Wilson, 2011), the i-vectors are modelled via a generative factor analysis approach called probabilistic LDA (PLDA). After that, a whitening transformation is applied, which is learned from the i-vectors of the development set (Sadjadi and Omid, 2013). Finally, a linear strategy is used for scoring the verification trials (Sadjadi and Omid, 2013), which computes the log-likelihood ratio of the same-speaker and different-speaker hypotheses.
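The mean and length normalization step mentioned above can be sketched in a few lines (the development i-vectors here are random placeholders; only the normalization itself is shown): subtract the development-set mean, then scale each i-vector to unit Euclidean length before PLDA modelling.

```python
# Sketch of i-vector mean and length normalization
# (Garcia-Romero and Espy-Wilson, 2011). Data are placeholders.
import numpy as np

rng = np.random.default_rng(1)
dev_ivectors = rng.normal(size=(1000, 400))  # placeholder development i-vectors
dev_mean = dev_ivectors.mean(axis=0)

def normalize_ivector(w, mean):
    w = w - mean                    # mean normalization
    return w / np.linalg.norm(w)    # length normalization to the unit sphere

w_norm = normalize_ivector(rng.normal(size=400), dev_mean)
```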
Separate SV experiments are performed using MFCCs, PS-DCT and their score-level combination. For more details about the implemented SV system, one can refer to the Microsoft Identity Toolbox (Sadjadi and Omid, 2013). We have extracted 13 MFCCs along with their delta and delta-delta coefficients to form 39-dimensional feature vectors, and cepstral mean and variance normalization (CMVN) is applied before further processing. Since the dimension of PS-DCT depends on the sampling rate, we have extracted the PS-DCT features after re-sampling all the utterances to 8 kHz. The 56-dimensional PS-DCT features are obtained as mentioned in Sec. 2. Since PS-DCT is not in the cepstral domain, we have not applied CMVN; instead, as post-processing, we have scaled each feature vector individually to unit norm.
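The two post-processing schemes can be contrasted in a short sketch (feature matrices are random placeholders): CMVN standardizes each MFCC coefficient over the utterance, while the PS-DCT vectors are each scaled to unit norm.

```python
# Sketch of the two post-processing schemes: per-utterance CMVN for the
# 39-dimensional MFCCs versus per-vector unit-norm scaling for the
# 56-dimensional PS-DCT features. Arrays are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(2)
mfcc = rng.normal(loc=3.0, scale=2.0, size=(300, 39))  # frames x 39
psdct = rng.normal(size=(300, 56))                     # frames x 56

# CMVN: zero mean, unit variance per coefficient over the utterance.
mfcc_cmvn = (mfcc - mfcc.mean(axis=0)) / mfcc.std(axis=0)

# PS-DCT: each feature vector scaled individually to unit norm.
psdct_norm = psdct / np.linalg.norm(psdct, axis=1, keepdims=True)
```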
ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods