the segmentation has been provided in Figure 3. The pixel values in these overlapping
ordered matrix cells $A(f,t)$ ($f = 1, 2, \ldots, F$; $t = 1, 2, \ldots, T$) may be interpreted
as the energy content in the cells (which uniquely characterizes an individual) as the
speech signal is swept through time. The matrix $A(f,t)$ is the intersection of the pixel
cells of the $f^{th}$ frequency band and the $t^{th}$ time band.
As is depicted in the figure, the spectrogram has been segmented into several overlapping
matrices. Let the mean of the pixel values of the $(f,t)^{th}$ matrix, $A(f,t)$, be
denoted by $\mu_{ft}$, where $f$ denotes the frequency band and $t$ denotes the time band.
Given a spectrogram of the speech signal of a speaker, the $F$-dimensional vector
$(\mu_{1t}, \mu_{2t}, \ldots, \mu_{Ft})$ ($t = 1, 2, \ldots, T$) represents the vocal properties of the speaker in
the $t^{th}$ time band.
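The overlapping segmentation and the cell means $\mu_{ft}$ can be sketched as follows. This is a minimal NumPy illustration; the window-width formula, the 50% overlap fraction, and the function name `mean_matrix` are our own assumptions, not values taken from the paper.

```python
import numpy as np

def mean_matrix(S, F, T, overlap=0.5):
    # Segment spectrogram S (frequency x time) into F x T overlapping
    # rectangular cells and return mu[f, t], the mean pixel value of
    # cell A(f, t).  The 50% overlap between adjacent bands is an
    # illustrative assumption.
    nf, nt = S.shape
    # window widths chosen so that F (resp. T) windows with the given
    # overlap span the whole axis
    wf = int(np.ceil(nf / (F * (1 - overlap) + overlap)))
    wt = int(np.ceil(nt / (T * (1 - overlap) + overlap)))
    sf = max(1, int(wf * (1 - overlap)))  # stride along frequency
    st = max(1, int(wt * (1 - overlap)))  # stride along time
    mu = np.empty((F, T))
    for f in range(F):
        for t in range(T):
            cell = S[f * sf : f * sf + wf, t * st : t * st + wt]
            mu[f, t] = cell.mean()
    return mu
```

For a spectrogram stored as a $10 \times 10$ pixel array, `mean_matrix(S, 4, 4)` returns the $4 \times 4$ matrix of cell means with adjacent cells sharing half their pixels.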
In the database samples, let $\mu_{ijrft}$ denote the mean pixel value for replicate $r$ corresponding
to the $(f,t)^{th}$ matrix, $A(f,t)$, of the spectrogram of the $i^{th}$ speaker's utterance
of the $j^{th}$ word. Here, $i = 1, \ldots, N$; $j = 1, \ldots, M$; $r = 1, \ldots, R$; $f = 1, \ldots, F$
and $t = 1, \ldots, T$. $N$ denotes the number of speakers in the closed set; $M$, the number
of different words uttered; and $R$, the number of replications per word used for training
for each known speaker. $F$ denotes the number of frequency bands into which the
spectrograms are segmented in the frequency domain, and $T$ the number of
time bands into which the spectrograms are split along the time axis. We use these observations
to prepare a codebook for each spectrogram. A typical codebook,
corresponding to the $r^{th}$ replicate of the $j^{th}$ word of the $i^{th}$ speaker, consists of $T$
code vectors. The elements of each code vector are the means of the
ordered overlapping matrices of the time band segmented along the frequency axis,
and the vector is given by $\Psi_{ijrt} = (\mu_{ijr1t}, \mu_{ijr2t}, \ldots, \mu_{ijrFt})'$, where $t = 1, \ldots, T$. This
technique of data compression closely resembles quantization, in which
each time band is represented by an $F$-dimensional vector conditioned on the $F$ frequency
bands. Quantization by conditioning on frequency bands enhances the recognition
rate because it performs superior template matching of the images in question, compared with unconditional
vector quantization (of the pixels in a particular time band); in the latter case,
the ordering/distribution of the centroids is not taken into consideration. Moreover, in vector
quantization, the formation of empty clusters is likely, especially in time bands representing
silence or uniform energy content, thus leading to erroneous results. This fact lays the
basis of our methodology to verify and, more importantly, identify a speaker.
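The codebook construction above can be sketched in NumPy. The dictionary layout keyed by $(i, j, r)$ and the function names are hypothetical, and for brevity the cell means here use non-overlapping blocks rather than the overlapping segmentation of the paper:

```python
import numpy as np

def cell_means(S, F, T):
    # F x T matrix of mean pixel values over a segmentation of the
    # spectrogram S; mu[f, t] plays the role of mu_ft in the text.
    # Non-overlapping blocks are used here for simplicity.
    nf, nt = S.shape
    mu = np.empty((F, T))
    for f in range(F):
        for t in range(T):
            block = S[f * nf // F : (f + 1) * nf // F,
                      t * nt // T : (t + 1) * nt // T]
            mu[f, t] = block.mean()
    return mu

def build_codebooks(samples, F, T):
    # samples: dict mapping (speaker i, word j, replicate r) to a
    # spectrogram array (a hypothetical layout of the training database).
    # Returns a dict mapping the same keys to an F x T codebook whose
    # t-th column is the code vector Psi_ijrt = (mu_ijr1t, ..., mu_ijrFt)'.
    return {key: cell_means(S, F, T) for key, S in samples.items()}
```

Each spectrogram thus compresses to $T$ code vectors of dimension $F$, the columns of its $F \times T$ codebook.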
3 Speaker Recognition
3.1 Identification in a Closed Set
Having collected our training database of spectrograms for 40 speakers, we choose one
sample for every word of every speaker at random for testing. We consider
a test sample comprising the 3 words of an unknown speaker (in the closed set).
An important assumption is that the unknown speaker is in the closed set and utters
the three prescribed words in a predefined order, enabling us to identify which sample
corresponds to which word.
Let $\theta$ represent the actual identity of the unknown speaker based on the mean pixel
values of the matrices of the segmented spectrogram. For simplicity, let the $i^{th}$ speaker