works ROS is defined, inter alia, as: phones per
second (PPS) (Mirghafori, 1996), vowels per second
(VPS) (Pfau, 1998), phones per second normalized
to the probability of the specific phone duration
(Zheng, Franco, Stolcke, 2000), and word duration
normalized to the probability of its duration
(Zheng, Franco, Weng, 2000). Some measures, like
those proposed by Zheng et al., require ASR output
or a transcription of the utterances. Therefore, for
an unknown input signal processed in real time, ROS
can be estimated only by statistical analysis. In this
work the VPS parameter, a derivative of the SPS
measure, is used as the ROS definition. ROS is
therefore defined as (Eq. 6):
$$ROS(n) = \frac{N_{vowels}}{\Delta t} \qquad (6)$$
For every signal frame, ROS is estimated using
knowledge of the frame content, which is provided
by the vowel and voice activity detectors. The ROS
value is therefore updated every 23 ms (the length
of the vowel detector analysis frame). The
instantaneous ROS value is calculated as the mean
number of vowels per second over the last 2 s of the
speech signal. The averaging period was chosen
experimentally so that local ROS changes could be
captured.
The highest ROS value that can be measured by this
method equals 21 vowels/s, which occurs when all
vowel and consonant durations are equal to 23 ms.
It is worth mentioning that the instantaneous ROS
value is updated only when the current frame
contains neither silence nor a prolongation of a
vowel. To avoid a start-up phase in which the ROS
value rises from zero, the initial ROS value is set
to 5.16 vowels/s when the algorithm starts.
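The update rule described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the class name `RosEstimator`, the frame labels, and the choice to keep the initial value until a full 2 s window has been collected are assumptions, as is the choice to exclude silence and vowel-prolongation frames from the window.

```python
from collections import deque

FRAME_S = 0.023      # vowel detector frame length from the text (23 ms)
WINDOW_S = 2.0       # averaging period from the text (2 s)
INITIAL_ROS = 5.16   # start-up value from the text (vowels/s)


class RosEstimator:
    """Hypothetical sliding-window estimator of ROS in vowels per second."""

    def __init__(self):
        self.window_frames = int(WINDOW_S / FRAME_S)  # 86 frames
        self.onsets = deque(maxlen=self.window_frames)
        self.prev_label = "silence"
        self.ros = INITIAL_ROS

    def update(self, label):
        """label is 'vowel', 'consonant' or 'silence' (assumed detector output)."""
        is_vowel = label == "vowel"
        prolongation = is_vowel and self.prev_label == "vowel"
        self.prev_label = label
        if label == "silence" or prolongation:
            return self.ros  # value is held for silence and prolonged vowels
        self.onsets.append(is_vowel)  # True marks the first frame of a vowel
        if len(self.onsets) == self.window_frames:
            # N_vowels / delta_t over the frames kept in the window (assumption)
            self.ros = sum(self.onsets) / (self.window_frames * FRAME_S)
        return self.ros


# Usage: alternating 23 ms vowels and consonants, the fastest measurable case.
est = RosEstimator()
for i in range(200):
    ros = est.update("vowel" if i % 2 == 0 else "consonant")
```

With alternating one-frame vowels and consonants, one vowel occurs every 46 ms, so the sketch settles at roughly 1/0.046 ≈ 21.7 vowels/s, in line with the maximum of 21 vowels/s stated above.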
During the analysis, the instantaneous ROS value is
used to assign one of two speech rate categories,
high or low, to the current utterance. This division
is made using a ROS threshold value (ROSth).
ROSth was determined by analysing the mean ROS
values of speech recorded for 8 persons. Each person
read five different phrases at three speech rates:
high, medium and low. The resulting ROS statistics
are presented in Tab. 1.
Table 1: Mean value and standard deviation of ROS
calculated for the different speech rates.

speech rate          low    medium   high
µ(ROS) [vowels/s]    4.80   5.17     5.52
σ(ROS) [vowels/s]    0.76   0.75     0.79
It can be seen that, because the standard deviation
(about 0.76 for all classes) is high relative to the
small distance between the means of neighbouring
classes, only two classes can be separated linearly
using the instantaneous ROS value. On the basis of
these statistics, the threshold ROSth was set to
5.16 vowels/s. The threshold was calculated
according to Eq. (7):
$$ROS_{th} = \frac{\mu(ROS)_{low} + \mu(ROS)_{high}}{2} \qquad (7)$$
where µ(ROS)_low is the mean value of ROS for the
low rate speech and µ(ROS)_high is the mean value of
ROS for the high rate speech.
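As a quick check of the threshold arithmetic, the midpoint rule can be evaluated with the means from Tab. 1. The function names below are hypothetical, and assigning values exactly equal to the threshold to the low class is an assumption, since the text does not specify that case.

```python
def ros_threshold(mean_low, mean_high):
    """Midpoint between the mean ROS of low-rate and high-rate speech."""
    return (mean_low + mean_high) / 2.0


def classify_rate(ros, threshold):
    """Two-class decision; ties go to 'low' (assumption, not from the text)."""
    return "high" if ros > threshold else "low"


# Means taken from Tab. 1 (vowels/s).
ROS_TH = ros_threshold(4.80, 5.52)
print(round(ROS_TH, 2))             # 5.16
print(classify_rate(5.40, ROS_TH))  # high
```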
Sec. 3 investigates the accuracy of speech rate class
recognition and its applicability to non-uniform
speech stretching.
2.4 Time-scale Modification Algorithm Selection
Many algorithms dedicated to speech time-scaling
can be found in the literature. All of them are based
on the overlap-and-add technique. Most of the known
algorithms were not optimized for real-time signal
processing; therefore, only a few methods can be
used for real-time speech stretching. The best
quality of time-scaled speech is achieved by complex
methods that combine precise speech signal analysis,
such as speech periodicity judgment, with adjustment
of the analysis and synthesis frame sizes to the
current part of the signal (Moulines, 1995).
Algorithms such as PSOLA (Pitch Synchronous
Overlap and Add) or WSOLA (Waveform Similarity
Based Overlap and Add) produce high quality
modified signals (Grofit, 2008; Verhelst, 1993), but
require changing the analysis shift sizes (WSOLA)
or the synthesis frame sizes (PSOLA) according to
the current speech content.
It was shown that those algorithms can be used for
real-time signal processing (Verhelst, 1993; Le
Beux, 2010), but for non-uniform time-scale
modification, variable sizes of the analysis time
shift or the synthesis frame would add complexity to
the detection algorithms (voice activity detection,
vowel detection). For this reason, the NU-RTSM
algorithm is based on the SOLA algorithm
(Synchronous Overlap-and-Add), which in its
fundamental form uses constant analysis/synthesis
frame sizes and constant analysis/synthesis time
shifts (Pesce, 2000), and which ensures a quality of
the processed speech nearly as good as that of the
other methods (Verhelst, 1993; Kupryjanow, 2009).
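For orientation, a textbook form of SOLA with constant frame size and constant analysis and synthesis shifts can be sketched as below. This is not the NU-RTSM implementation: the function name `sola_stretch`, its parameters, the normalized cross-correlation search, and the linear cross-fade are illustrative assumptions, and the code is written in plain Python for readability rather than for real-time use.

```python
import math


def sola_stretch(x, frame_len, analysis_shift, synthesis_shift, max_offset):
    """Time-scale x by about synthesis_shift / analysis_shift (basic SOLA)."""
    if frame_len <= synthesis_shift + 2 * max_offset:
        raise ValueError("frame_len too small for a guaranteed overlap")
    out = list(x[:frame_len])  # the first frame is copied unchanged
    m = 1
    while m * analysis_shift + frame_len <= len(x):
        start = m * analysis_shift
        frame = x[start:start + frame_len]
        nominal = m * synthesis_shift  # nominal position in the output
        best_k, best_score = max_offset, -math.inf
        # Search the offset that best aligns the frame with the output tail.
        for k in range(-max_offset, max_offset + 1):
            pos = nominal + k
            overlap = len(out) - pos
            if pos < 0 or overlap <= 0 or overlap > frame_len:
                continue
            tail = out[pos:]
            head = frame[:overlap]
            num = sum(p * q for p, q in zip(tail, head))
            den = math.sqrt(sum(p * p for p in tail) * sum(q * q for q in head))
            score = num / den if den > 0 else 0.0
            if score > best_score:
                best_score, best_k = score, k
        pos = nominal + best_k
        overlap = len(out) - pos
        # Linear cross-fade over the overlapping region.
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)
            out[pos + i] = (1 - w) * out[pos + i] + w * frame[i]
        out.extend(frame[overlap:])  # append the non-overlapping remainder
        m += 1
    return out


# Usage: stretch a synthetic sine by a factor of about 2.
signal = [math.sin(2 * math.pi * 5 * n / 200) for n in range(2000)]
stretched = sola_stretch(signal, frame_len=200, analysis_shift=50,
                         synthesis_shift=100, max_offset=20)
```

Because the frame size and both shifts stay constant, every input frame starts at a fixed multiple of the analysis shift, which is the property that keeps frame-based detectors simple; only the small alignment offset varies from frame to frame.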
SIGMAP 2011 - International Conference on Signal Processing and Multimedia Applications