Center of Gravity (COG), left and right borders, and
average spectral power, give rise to relevant percep-
tual cues that we believe are used by the human au-
ditory system to recognize and discriminate among
vowels.
The rest of this paper is structured as follows. In
section 2 we describe the PSC concept and address
the estimation of PSC related features when pitch is
either seen as an additional feature, or when it is used
as an explicit normalization factor. In section 3 we
describe the classification criterion used in the auto-
matic vowel recognition tests and present two sets of
known features that have been used as a reference in
those tests. In section 4 we characterize the training
and testing database. In section 5 we discuss the main
results and conclusions of the vowel recognition tests.
Section 6 summarizes and concludes the paper.
2 THE PERCEPTUAL SPECTRAL
CLUSTER CONCEPT
The PSC concept was inspired by Klatt’s discussion
of ‘prominent energy concentrations’
in the magnitude spectrum of a vowel sound (Klatt,
1982). First experimental results were reported in
(Ferreira, 2005) and further investigated in
(Ferreira, 2007).
The PSC concept is strongly rooted in the idea
that the human recognition of a sustained voiced
vowel results from the identification of both its pitch
and its timbre, both being perceptual sensations. It is
known that the partials of a harmonic structure are
fused (or integrated) into a single pitch perception, even
if some of the partials are missing (Moore, 1989). On
the other hand, timbre is commonly seen as the ‘color’
of a sound and, in the case of a harmonic sound such
as a voiced vowel utterance, depends on the spectral
power of its partials. Thus, for a voiced vowel sound,
timbre analysis requires the identification of the
underlying harmonic structure. The PSC concept builds
on this perceptual integration of partials pertaining
to the same harmonic structure, and tries to identify
clusters of harmonic partials, and their attributes, that
explain the ability of the human auditory system to
discriminate among vowels. It is thus assumed that
a second level of perceptual integration, involving the
harmonic partials within each PSC, is carried out by
the human auditory system.
2.1 Estimation of PSCs
[Figure 1: Estimation of PSC boundaries and PSC features. Labels in the diagram: audio PCM, windowing, |ODFT|, pitch & harmonic analysis, F0, PSC pre-processing, smoothing, PSC merge, PSC feature extraction, COG(1, 2), PSCR, difdB.]
PSC features are extracted after PSC boundaries have
been estimated according to the algorithm illustrated
in Fig. 1. Each audio frame of audio samples,
x(n), is 32 ms long (1024 samples, 32 kHz sam-
pling frequency) and adjacent frames are 50% over-
lapped. A frame is first multiplied by the square
root of a shifted Hanning window, h(n), before being
transformed to the Odd-DFT domain by computing
$X_{ODFT}(k) = \sum_{n=0}^{N-1} h(n)\,x(n)\,e^{-j\frac{2\pi}{N}\left(k+\frac{1}{2}\right)n}$. A pitch and
harmonic analysis is subsequently implemented using
a frequency domain pitch estimator (Hess, 1983) that
takes into account the specificity of the Odd-DFT and
analysis window (Ferreira, 2007).
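As an illustration of the framing, windowing and transform steps just described, the following Python sketch evaluates the Odd-DFT sum directly. It is not the authors' implementation: the function names are hypothetical, and the half-sample shift of the Hanning window (which makes its square root equal to sin(π(n + 1/2)/N)) is an assumption, since the text does not state the amount of shift.

```python
import cmath
import math


def sqrt_shifted_hann(N):
    """Square root of a Hanning window shifted by half a sample.

    The half-sample shift is an assumption. With it,
    sqrt(0.5 * (1 - cos(2*pi*(n + 0.5) / N))) = sin(pi * (n + 0.5) / N).
    """
    return [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]


def odft(frame, window):
    """Direct evaluation of X(k) = sum_n h(n) x(n) exp(-j*2*pi/N*(k + 1/2)*n)."""
    N = len(frame)
    return [
        sum(
            window[n] * frame[n] * cmath.exp(-2j * math.pi * (k + 0.5) * n / N)
            for n in range(N)
        )
        for k in range(N)
    ]


def split_frames(x, N=1024):
    """Split x into N-sample frames with 50% overlap (hop of N // 2 samples)."""
    hop = N // 2
    return [x[s:s + N] for s in range(0, len(x) - N + 1, hop)]


if __name__ == "__main__":
    # Toy example with a short frame: a cosine centred on Odd-DFT bin 5.
    N, k0 = 64, 5
    x = [math.cos(2 * math.pi * (k0 + 0.5) * n / N) for n in range(N)]
    X = odft(x, sqrt_shifted_hann(N))
    print(max(range(N // 2), key=lambda k: abs(X[k])))
```

The direct sum costs O(N²) operations per frame and is shown only for clarity. An equivalent and much faster route is to multiply h(n)x(n) by e^{-jπn/N} and apply a standard FFT of length N.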
The lower and upper borders and the average spectral
power of each PSC are found as a result of
PSC pre-processing and merge operations. First, a
new frequency domain is created that includes all
harmonic partials in the magnitude spectrum of the
voiced vowel, and then a magnitude smoothing in
the new frequency domain is implemented so as to
avoid small local peaks. All local peaks are subsequently
identified as potential PSC candidates. Starting
from the center of each PSC candidate, left and
right borders are found by integrating into the PSC
neighboring partials whose magnitude is not more than
8 dB¹ below the average magnitude of the PSC (this
average is updated every time one more partial is
integrated into the PSC). This PSC pre-processing does
not merge different PSCs, but may result in PSCs with
abutting borders corresponding to local minima. These
PSCs are first identified and, if their absolute magnitude
difference is below 8 dB, they are merged. Finally,
adjacent but non-abutting PSCs are identified and, if
sufficiently close to each other, their magnitude
difference is tested and they may be merged.
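The border-growing and merging steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: all names are hypothetical; the input is assumed to be the already smoothed magnitudes in dB, one value per harmonic partial; averages are taken in dB; growth is assumed to stop at a local minimum; and the "magnitude difference" of two PSCs is assumed to be the difference of their average magnitudes. The final step for non-abutting PSCs is omitted because the text does not specify what "sufficiently close" means.

```python
THRESHOLD_DB = 8.0  # threshold stated in the text


def local_peaks(mag_db):
    """Partials higher than their left neighbour and not lower than their right one."""
    peaks = []
    last = len(mag_db) - 1
    for i, m in enumerate(mag_db):
        left = mag_db[i - 1] if i > 0 else float("-inf")
        right = mag_db[i + 1] if i < last else float("-inf")
        if m > left and m >= right:
            peaks.append(i)
    return peaks


def grow_psc(mag_db, peak):
    """Grow [lo, hi] around a peak; return (lo, hi, average magnitude in dB)."""
    lo = hi = peak
    total = mag_db[peak]
    grew = True
    while grew:
        grew = False
        # A neighbour is integrated if it is within THRESHOLD_DB of the
        # running average and does not rise again (assumed stop at minima).
        floor = total / (hi - lo + 1) - THRESHOLD_DB
        if lo > 0 and floor <= mag_db[lo - 1] <= mag_db[lo]:
            lo -= 1
            total += mag_db[lo]
            grew = True
            floor = total / (hi - lo + 1) - THRESHOLD_DB  # average updated
        if hi < len(mag_db) - 1 and floor <= mag_db[hi + 1] <= mag_db[hi]:
            hi += 1
            total += mag_db[hi]
            grew = True
    return lo, hi, total / (hi - lo + 1)


def merge_abutting(mag_db, pscs):
    """Merge neighbouring PSCs that touch or share a border partial."""
    pscs = list(pscs)
    merged = True
    while merged:  # iterate until there is nothing left to merge
        merged = False
        for i in range(len(pscs) - 1):
            (lo_a, hi_a, mean_a), (lo_b, hi_b, mean_b) = pscs[i], pscs[i + 1]
            if hi_a >= lo_b - 1 and abs(mean_a - mean_b) < THRESHOLD_DB:
                lo, hi = lo_a, max(hi_a, hi_b)
                mean = sum(mag_db[lo:hi + 1]) / (hi - lo + 1)
                pscs[i:i + 2] = [(lo, hi, mean)]
                merged = True
                break
    return pscs


def estimate_pscs(mag_db):
    """Candidate detection, border growing, then merging of abutting PSCs."""
    candidates = [grow_psc(mag_db, p) for p in local_peaks(mag_db)]
    return merge_abutting(mag_db, candidates)
```

For example, the hypothetical partial magnitudes `[60, 62, 58, 40, 30, 28, 45, 50, 47, 30]` yield two separate PSCs spanning partials 0 to 2 and 6 to 8, whereas `[50, 55, 52, 50, 53, 56, 51]` yields two candidates that share the local minimum at partial 3 and are merged into a single PSC spanning partials 0 to 6.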
This algorithm is iterated for each frame until there
are no more PSCs to merge. Subsequently, a mapping
¹ This value has been found experimentally (Ferreira, 2007).
SIGMAP 2008 - International Conference on Signal Processing and Multimedia Applications