for Automatic Recognition of Injunctive Values in
Interaction Oral Language). This corpus consists of a
large body of authentic spoken French data gathered
in spontaneous, noisy interactions involving various
speakers. As a first study, this work is limited to a
labelled subset of sentences containing the French
word ‘allez’, which can carry an injunctive value. It
examines the role of prosody in the injunctive
interpretation that emerges at the syntagm and
sentence levels.
More precisely, the first aim of this work is to
evaluate the contribution of prosodic descriptors to
the task of speech injunction classification using
Gaussian Mixture Models (GMM). The study
investigates prosodic descriptors (pitch, energy) and
other descriptors typically used in speech processing,
such as Linear Predictive Cepstral Coefficients (LPCC),
Perceptual Linear Prediction coefficients (PLP) and
Mel-Frequency Cepstral Coefficients (MFCC)
(Basu, Chakraborty, Bag, & Aftabuddin, 2017) (Wu,
Falk, & Chan, 2011) (Hacine-Gharbi, Deriche,
Ravier, Harba, & Mohamadi, 2013), together with
their first and second derivatives for dynamic
modelling. The performance of the prosodic descriptors
is compared with that of the LPCC, PLP and MFCC
descriptors.
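As an illustration of the dynamic modelling mentioned above, the following minimal sketch appends first and second derivatives to a sequence of frame-level feature vectors. The regression-based delta formula, the window half-width `n=2`, the edge handling and the function names are assumptions made for illustration; the text does not specify how the derivatives are computed.

```python
def delta(frames, n=2):
    """Regression-based delta of a list of equal-length feature vectors.

    Edge frames are handled by repeating the first/last frame
    (an assumption, not taken from the paper).
    """
    num_frames = len(frames)
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(num_frames):
        vec = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                nxt = frames[min(t + k, num_frames - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            vec.append(acc / denom)
        out.append(vec)
    return out


def add_dynamic_features(frames, n=2):
    """Concatenate static, delta and delta-delta features frame by frame."""
    d1 = delta(frames, n)
    d2 = delta(d1, n)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

With two static descriptors per frame (for example pitch and energy), each output frame then holds six values: the two static values, their deltas and their delta-deltas.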
The second aim of the paper is to select the
relevant prosodic features for this task using a
wrapper method.
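A wrapper method scores candidate feature subsets by the performance of the classifier itself. The sketch below shows one common variant, greedy sequential forward selection; the search strategy, the stopping rule and the `evaluate` callback are assumptions made for illustration, since the text only states that a wrapper method is used.

```python
def forward_wrapper_selection(feature_names, evaluate):
    """Greedy sequential forward selection.

    `evaluate(subset)` is a hypothetical callback that trains and tests
    the classifier on the given feature subset and returns its
    classification rate. Selection stops when no remaining feature
    improves the rate.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(feature_names)
    while remaining:
        # Score every candidate feature added to the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```

In practice `evaluate` would wrap a full train-and-test cycle of the classifier, so each call is expensive; the greedy search keeps the number of calls quadratic in the number of features rather than exponential.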
The paper is organized as follows. Section 2
describes the injunction classification system after a
brief review of the state of the art in prosody-based
classification systems. Section 3 introduces the
RAVIOLI database, details the system and reports
classification rates for several experiments. In
addition, the results of a feature selection procedure
are analysed before the paper concludes.
2 SPEECH INJUNCTIONS
CLASSIFICATION SYSTEM
Several systems based on prosodic features have been
proposed for pattern recognition tasks (Singh, Khan,
& Pandey, 2012) (Mary & Yegnanarayana, 2008)
(Ferrer, et al., 2015) (Szaszak, Tündik, & Gerazov,
2018). In (Hacine-Gharbi, Petit, Ravier, & Nemo,
2015), the authors proposed an automatic system for
classifying the speech signal into the semantic
categories “conviction (CV)” and “lack of conviction
(NCV)” in the use of the single French word ‘oui’.
Each of these semantic categories was associated with
a class in an automatic classification task, using
hidden Markov models (HMM) for class modelling. The
authors specifically studied the contribution of
prosodic features to the discrimination between the
two classes. These features are pitch and energy,
together with their dynamic delta and delta-delta
features. In (Hacine-Gharbi, Ravier, & Nemo, 2017),
the authors go one step further by extracting local
and global prosodic features and combining them with
an SVM classifier to classify the signal of French
‘oui’ into the classes CV or NCV. In (Hacine-Gharbi &
Ravier, 2019), the authors investigated the
combination of MFCC coefficients with prosodic
features for emotion recognition, modelling each
emotion class with a GMM.
The purpose of the present system is to classify
speech audio signals into an injunctive class (INJ) or
a non-injunctive class (NINJ). We propose using GMMs
combined with a feature extraction method for this
classification task. In this paper, we consider
prosodic features and classical spectral features.
Such a system requires a learning phase, which
produces a GMM for each class, followed by a testing
phase, which identifies the class to which an unknown
utterance belongs. The dataset is therefore divided
into a training dataset for the learning phase and a
testing dataset for evaluating the performance of the
classification system. Each phase requires the
extraction of features that carry information useful
for discriminating between the two classes INJ and
NINJ.
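The two phases described above can be sketched as follows: one GMM is fitted per class on training frames, and a test utterance is assigned to the class whose model gives the highest total log-likelihood over its frames. This is a minimal sketch, not the authors' implementation; the diagonal covariances, the number of components, the EM settings, the equal class priors and all function names are assumptions.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def frame_log_likelihood(x, gmm):
    """log p(x | GMM), computed with the log-sum-exp trick."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def train_gmm(frames, n_components=2, n_iter=20, seed=0, var_floor=1e-3):
    """Fit a diagonal GMM with EM; returns a list of (weight, mean, var)."""
    rng = random.Random(seed)
    dim = len(frames[0])
    means = rng.sample(frames, n_components)
    gmm = [(1.0 / n_components, list(m), [1.0] * dim) for m in means]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        resp = []
        for x in frames:
            logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
            top = max(logs)
            probs = [math.exp(lg - top) for lg in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means and (floored) variances.
        new_gmm = []
        for k in range(n_components):
            nk = sum(r[k] for r in resp) + 1e-10
            mean = [sum(r[k] * x[d] for r, x in zip(resp, frames)) / nk
                    for d in range(dim)]
            var = [max(var_floor,
                       sum(r[k] * (x[d] - mean[d]) ** 2
                           for r, x in zip(resp, frames)) / nk)
                   for d in range(dim)]
            new_gmm.append((nk / len(frames), mean, var))
        gmm = new_gmm
    return gmm


def classify(utterance, models):
    """Return the label whose GMM gives the highest total log-likelihood."""
    scores = {label: sum(frame_log_likelihood(x, gmm) for x in utterance)
              for label, gmm in models.items()}
    return max(scores, key=scores.get)
```

Here `models` would be a dictionary such as `{"INJ": train_gmm(inj_frames), "NINJ": train_gmm(ninj_frames)}`, where each argument pools the feature frames of all training occurrences of that class.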
Figure 1 shows the diagram of our automatic system
for classifying speech occurrences into injunctive and
non-injunctive classes.
Figure 1: Diagram of the GMM classification system for the
two classes INJ and NINJ. The system comprises a training
phase (dashed lines), which learns the GMM injunction
models from the occurrences of the training dataset with
their corresponding text, and a testing phase (solid
lines), which decides whether the test signal is injunctive
or not.
Automatic Classification of French Spontaneous Oral Speech into Injunction and No-injunction Classes