signal, and apply the mappings to the synthesizer controls, in a similar manner to (Janer, 2005), but here focusing on note-to-note articulations. The synthesis is a two-step process: sample selection and sample transformation.
1.2 Toward User-adapted Mappings
We claim that the choice of phonetics when imitating different instruments and different articulations (note-to-note transitions) is subject-dependent. To evaluate whether such behaviour can be learned automatically from real imitation cases, we carry out several experiments. We propose a system consisting of two main modules: an imitation segmentation module and an articulation type classification module. In the former, a probabilistic model automatically locates note-to-note transitions in the imitation utterance by attending to the phonetics. In the latter, for each detected note-to-note transition, a classifier determines the intended type of articulation from a set of low-level audio features.
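As a rough illustration of the segmentation idea, the sketch below decodes a hidden Markov model over cepstral frames and treats decoded state changes as candidate note-to-note boundaries. The libraries (librosa, hmmlearn), the two-state topology, and all names are our assumptions for illustration, not the authors' implementation.

    import numpy as np
    import librosa
    from hmmlearn.hmm import GaussianHMM

    def candidate_transitions(y, sr, n_states=2, hop_length=512):
        # Cepstral frames of the imitation utterance, one row per frame.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    hop_length=hop_length).T
        # Unsupervised HMM: the states roughly separate steady (vowel)
        # frames from changing (consonantal/transition) frames.
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag")
        states = hmm.fit(mfcc).predict(mfcc)
        # A decoded state change marks a candidate transition boundary.
        changes = np.flatnonzero(np.diff(states)) + 1
        return changes * hop_length / sr  # boundary times in seconds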
In our experiment, subjects were requested to imitate real instrument performance recordings, consisting of a set of short musical phrases played by professional saxophone and violin performers. We asked the musicians to perform each musical phrase using different types of articulation. From each recorded imitation, our imitation segmentation module automatically segments note-to-note transitions. After that, a set of low-level descriptors, mainly based on cepstral analysis, is extracted from the audio excerpt corresponding to each segmented note-to-note transition. Then, we train the articulation type classification module in a supervised manner using machine learning techniques, feeding the classifier with different sets of low-level phonetic descriptors and the target labels corresponding to the imitated musical phrase (see Figure 1). The results of the supervised training are compared to those of an articulation type classifier based on heuristic rules.
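As one plausible realization of this training step (not the paper's exact setup), the sketch below summarizes each segmented excerpt with MFCC statistics and cross-validates a support vector classifier; librosa, scikit-learn, and all function names are our own choices.

    import numpy as np
    import librosa
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def transition_features(y, sr, start, end, n_mfcc=13):
        # Excerpt spanning one segmented note-to-note transition.
        excerpt = y[int(start * sr):int(end * sr)]
        mfcc = librosa.feature.mfcc(y=excerpt, sr=sr, n_mfcc=n_mfcc)
        # Summarize the frame sequence with per-coefficient mean and std.
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def evaluate(X, labels):
        # X: one feature vector per transition; labels: articulation types.
        clf = SVC(kernel="rbf")
        return cross_val_score(clf, X, labels, cv=5).mean()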
2 IMITATION SEGMENTATION MODULE
In the context of instrument imitation, the singing voice signal has distinct characteristics in relation to traditional singing. This imitation style is often referred to as syllabling (Sundberg, 1994). For both traditional singing and syllabling, the principal musical information involves pitch, dynamics, and timing, all of which are independent of the phonetics.
[Figure 1 diagram: the voice imitation feeds the imitation segmentation module; its phonetic features feed the articulation type classification module, whose output maps to synthesizer parameters; target performances supply labels for supervised training.]
Figure 1: Overview of the proposed system. After the imitation segmentation, a classifier is trained with low-level phonetic features and the articulation type labels of the target performances.
In vocal imitation, though, the role of phonetics is reserved for determining articulation and timbre aspects. For the former, we use phonetic changes to determine the boundaries of musical articulations. For the latter, phonetic aspects such as formant frequencies within vowels can signify a timbre modulation (e.g. brightness). We can conclude that, unlike in speech recognition, a phoneme recognizer is not required, and a simpler classification will fulfill our needs.
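A minimal sketch of such a simpler classification, under our own assumptions: label each analysis frame with a broad phonetic class by training one Gaussian mixture per class and picking the most likely one. The class inventory itself is left open here; concrete inventories are discussed next.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_class_models(frames_by_class, n_components=4):
        # frames_by_class: {class name: (n_frames, n_features) array}
        # of labelled training frames for each broad phonetic class.
        return {name: GaussianMixture(n_components=n_components).fit(frames)
                for name, frames in frames_by_class.items()}

    def classify_frames(models, frames):
        # Assign each frame the class whose mixture scores it highest.
        names = list(models)
        scores = np.stack([models[n].score_samples(frames) for n in names])
        return [names[i] for i in scores.argmax(axis=0)]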
In phonetics, one can find various classifications of phonemes depending on the point of view, e.g. based on acoustic properties or on articulatory gestures. A commonly accepted classification based on acoustic characteristics consists of six broad phonetic classes (Lieberman and Blumstein, 1986): vowels, semi-vowels, liquids and glides, nasals, plosives, and fricatives. Alternatively, we might consider a new phonetic classification that better suits the acoustic characteristics of the voice signal in our particular context. As noted above, a reduced set of phonemes is mostly employed in syllabling. Furthermore, this set of phonemes tends to convey musical information: vowels constitute the nucleus of a syllable, while some consonants are used in note onsets (i.e. note attacks) and nasals are mostly employed as codas. Building on previous studies of syllabling (Sundberg, 1994) and taking syllabling characteristics into account, we propose a classification based on musical function, comprising: attack, sustain, release, articulation, and other (additional).
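For illustration only, the proposed categories could be encoded as follows; the example phonemes echo the syllabling observations above and are not an exhaustive inventory from the paper.

    from enum import Enum

    class MusicalFunction(Enum):
        # Example phonemes are illustrative, following the syllabling
        # observations above; they are not an inventory from the paper.
        ATTACK = "attack"              # e.g. onset consonants such as /d/, /t/
        SUSTAIN = "sustain"            # e.g. vowels forming the syllable nucleus
        RELEASE = "release"            # e.g. nasal codas such as /n/, /m/
        ARTICULATION = "articulation"  # phonetic change between two notes
        OTHER = "other"                # additional material outside the above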