a potentially powerful means for understanding users’
needs, problems, and desires.
The tool, written using the Praat scripting lan-
guage, relies on two sets of prosodic features and
two LDA-based classifiers. The experiments, per-
formed on a custom corpus of tagged audio record-
ings, showed encouraging results: for classification
of emotions, we obtained a value of about 71% for
average Pr, average Re, average F1, and Ac, with
K=0.64; for classification of communication styles,
we obtained a value of about 86% for average Pr,
average Re, average F1, and Ac, with K=0.78.
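The reported figures are macro-averaged precision, recall, and F1, plus accuracy and Cohen's kappa. As a minimal illustration of how these five metrics relate (using scikit-learn on toy labels, not the paper's LDA classifiers, prosodic features, or corpus), one might compute:

```python
# Illustrative sketch only: toy labels stand in for the real corpus.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_recall_fscore_support)

y_true = ["joy", "anger", "sadness", "joy", "anger", "sadness"]
y_pred = ["joy", "anger", "joy", "joy", "sadness", "sadness"]

# Macro averaging: compute Pr/Re/F1 per class, then take the unweighted mean.
pr, re, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
ac = accuracy_score(y_true, y_pred)    # overall accuracy (Ac)
k = cohen_kappa_score(y_true, y_pred)  # chance-corrected agreement (K)
print(f"Pr={pr:.2f} Re={re:.2f} F1={f1:.2f} Ac={ac:.2f} K={k:.2f}")
```

Unlike raw accuracy, Cohen's kappa discounts the agreement expected by chance, which is why it is reported alongside the averaged scores.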
As future work, we plan to test other classification
approaches, such as HMMs and CRFs, experimenting
with them on a larger corpus. Moreover, we plan to
investigate text-based features provided by NLP tools,
like POS taggers and parsers. Finally, the analysis
will be enhanced according to the “musical behavior”
methodology (Sbattella, 2006; Sbattella, 2013).
REFERENCES
Anolli, L. (2002). Le emozioni. Ed. Unicopoli.
Anolli, L. and Ciceri, R. (1997). The voice of emotions.
Milano, Angeli.
Asawa, K., Verma, V., and Agrawal, A. (2012). Recognition
of vocal emotions from acoustic profile. In Proceed-
ings of the International Conference on Advances in
Computing, Communications and Informatics.
Avesani, C., Cosi, P., Fauri, E., Gretter, R., Mana, N., Roc-
chi, S., Rossi, F., and Tesser, F. (2003). Definizione ed
annotazione prosodica di un database di parlato-letto
usando il formalismo ToBI. In Proc. of Il Parlato Ital-
iano, Napoli, Italy.
Balconi, M. and Carrera, A. (2005). Il lessico emotivo nel
decoding delle espressioni facciali. ESE - Psychofenia
- Salento University Publishing.
Banse, R. and Scherer, K. R. (1996). Acoustic profiles in
vocal emotion expression. Journal of Personality and
Social Psychology.
Boersma, P. (1993). Accurate Short-Term Analysis of the
Fundamental Frequency and the Harmonics-to-Noise
Ratio of a Sampled Sound. Institute of Phonetic Sci-
ences, University of Amsterdam, Proceedings, 17:97–
110.
Boersma, P. (2001). Praat, a system for doing phonetics by
computer. Glot International, 5(9/10):341–345.
Boersma, P. and Weenink, D. (2013). Manual of Praat: do-
ing phonetics by computer [computer program].
Bonvino, E. (2000). Le strutture del linguaggio: un’intro-
duzione alla fonologia. Milano: La Nuova Italia.
Borchert, M. and Düsterhöft, A. (2005). Emotions in
speech - experiments with prosody and quality fea-
tures in speech for use in categorical and dimensional
emotion recognition environments. Natural Language
Processing and Knowledge Engineering, IEEE.
Caldognetto, E. M. and Poggi, I. (2004). Il parlato emotivo.
aspetti cognitivi, linguistici e fonetici. In Il parlato
italiano. Atti del Convegno Nazionale, Napoli 13-15
febbraio 2003.
Canepari, L. (1985). L’Intonazione Linguistica e paralin-
guistica. Liguori Editore.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G.,
Kollias, S., and Fellenz, W. (2001). Emotion recogni-
tion in human-computer interaction. Signal Process-
ing Magazine, IEEE.
D’Anna, L. and Petrillo, M. (2001). Apa: un prototipo di
sistema automatico per l’analisi prosodica. In Atti delle
11e giornate di studio del Gruppo di Fonetica Speri-
mentale.
Delmonte, R. (2000). Speech communication. In Speech
Communication.
Ekman, P. and Davidson, R. J. (1994). The Nature of
Emotion: Fundamental Questions. New York Oxford,
Oxford University Press.
Gobl, C. and Chasaide, A. N. (2000). Testing affective cor-
relates of voice quality through analysis and resynthe-
sis. In ISCA Workshop on Emotion and Speech.
Hammarberg, B., Fritzell, B., Gauffin, J., Sundberg, J., and
Wedin, L. (1980). Perceptual and acoustic correlates
of voice qualities. Acta Oto-laryngologica, 90(1–
6):441–451.
Hastie, H. W., Poesio, M., and Isard, S. (2001). Automat-
ically predicting dialog structure using prosodic fea-
tures. In Speech Communication.
Hirschberg, J. and Avesani, C. (2000). Prosodic disambigua-
tion in English and Italian. In Botinis, editor, Intonation.
Kluwer.
Hirst, D. (2001). Automatic analysis of prosody for mul-
tilingual speech corpora. In Improvements in Speech
Synthesis.
Izard, C. E. (1971). The face of emotion. Ed. Appleton
Century Crofts.
Juslin, P. (1998). A functionalist perspective on emotional
communication in music performance. Acta Universi-
tatis Upsaliensis, 1st edition.
Juslin, P. N. (1997). Emotional communication in music
performance: A functionalist perspective and some
data. In Music Perception.
Koolagudi, S. G., Kumar, N., and Rao, K. S. (2011). Speech
emotion recognition using segmental level prosodic
analysis. Devices and Communications (ICDeCom),
IEEE.
Lee, C. M. and Narayanan, S. (2005). Toward detecting
emotions in spoken dialogs. Transaction on Speech
and Audio Processing, IEEE.
Leung, C., Lee, T., Ma, B., and Li, H. (2010). Prosodic
attribute model for spoken language identification. In
Acoustics, speech and signal processing. IEEE inter-
national conference (ICASSP 2010).
López-de Ipiña, K., Alonso, J.-B., Travieso, C. M., Solé-
Casals, J., Egiraun, H., Faundez-Zanuy, M., Ezeiza,
A., Barroso, N., Ecay-Torres, M., Martinez-Lage, P.,
and Lizardui, U. M. d. (2013). On the selection of
PhyCS 2014 - International Conference on Physiological Computing Systems