EMPHASIZING ON THE TIMING AND TYPE

Enhancing the Backchannel Performance of Virtual Agent

Xia Mao, Na Luo and Yuli Xue

School of Electronic and Information Engineering, Beihang University, Bejing, China

Keywords: Human-Computer Interaction (HCI), Virtual Agent, Backchannel, Personality Rules, Emotional

Backchannel Lexicon.

Abstract: Addressing backchannel feedbacks to virtual agent listener gives the agent human-like conversation skills

and creates rapport in Human-Computer Interaction. We argue the limitations of current approaches in

predicting and generating backchannel. Following two hypotheses emphasizing on the timing and type of

backchannel, we introduce an improved system to enhance the agent listener’s performance. By using

Newcastle Personality Assessor before parasocial consensus sampling and then neural networks, we can

obtain the personality rules and select different backchannel timing thresholds for specific agent listener

according to its own personality. After a context-free perceptual study, we will build two emotional

backchannel lexicons showing positive affection and negative affection respectively. In accordance with the

empathy strategy, the system will select one type of backchannel from the corresponding emotional

backchannel lexicon. The improved system will be more suitable for different conversation occasions and

greatly increase the naturalness between the human speaker and the virtual agent listener in the future.

1 INTRODUCTION

Human-Computer Interaction focuses on human’s

feelings and aims at making the interaction process

more natural and efficient. In multimodal

interactions, users can use their facial expressions,

voice, gestures, postures to express themselves to the

computer, but corresponding feedbacks from the

computer are limited. Virtual agent, which is the

geometrical and active presentation of human in

virtual environment and can realize multimodal

interaction with people, has been widely applied. At

the very start, virtual agents used in human-

computer interaction were acting passively and

stiffly. With the development of technology, users

prefer more intelligent, autonomous, interactive,

reactive agents to ones with predefined behaviour.

Consequently, researches on how to make agents

percept and act like human and how to make the

interaction between agent and human more natural

and vivid have won universal attentions.

When two people have a conversation, the

listener will naturely produce some behaviours or

short utterances while the speaker is talking. These

feedbacks include nod, smile, shake head, say

‘yeah’, ’hmm’, etc. However, the listener is usually

not conscious of giving these feedbacks. We deem

that it is related to the listener’s subconsciousness.

Backchannel is pervasive in conversations and is

an important kind of feedback in a dialogue as a

signal of presenting listener’s interest and

encouraging speaker to continue speaking. It was

first found by Victor Yngve (1970). He found

backchannel communication when analyzing

English conversation. Later, Duncan (1974) and his

colleagues (Duncan and Fiske, 1977) termed this

listener’s feedback phenomenon as backchannel.

Earlier study found that when people interacted with

others, they used backchannel feedbacks such as

speech prosody, gesture, gaze, posture and facial

expression to establish a sense of rapport.

Backchannel plays an important role in everyday

conversation, especially in the speaker-listener

dialogue. Application of appropriate backchannel

has been found to improve the narrator’s

performance (Bavelas et al., 2000).

In face-to-face interaction between a human

speaker and a virtual agent listener, addressing

backchannel feedbacks to virtual agent listener gives

the agent human-like conversation skills and creates

rapport in Human-Computer Interaction.

Researchers have made great efforts in predicting

backchannel for the agent listener. Ward and

259

Mao X., Luo N. and Xue Y..

EMPHASIZING ON THE TIMING AND TYPE - Enhancing the Backchannel Performance of Virtual Agent.

DOI: 10.5220/0003833102590263

In Proceedings of the 4th International Conference on Agents and Artiﬁcial Intelligence (ICAART-2012), pages 259-263

ISBN: 978-989-8425-96-6

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Tsukahara (2000) paid attention to regions of low

pitch late in an utterance and made five rules to

produce backchannel feedback by the low pitch cue.

Cathcart et al. (2003) proposed that backchannel

often produced after a short pause in the speaker’s

discourse. In addition to only audio-based

prediction, visual evidences are also very useful in

the multimodal interaction. Kendon (1967) and

Bavelas et al. (2002) found out that there were

relationships between eye gaze and backchannels.

Maatman et al. (2005) presented a mapping from

posture shifts, head movements and speech quality

to agent listening behaviours.

However, current backchannel systems are not

well accepted in the subjective evaluation progress.

The agent listener’s feedbacks are considered not

natural and precise. In order to enhance the

backchannel performance of the agent listener, we

propose to build an improved system which

emphasizes on the timing and type of backchannel

and is implemented by analyzing the listener’s

personality and emotion state.

2 CURRENT APPROACHES AND

LIMITATIONS

A virtual agent with appropriate backchannel

feedbacks do act more like a real human listener, but

we find that the subjective evaluation results of the

current experiments are not so satisfactory in rapport

scale.

Poppe et al. (2010) evaluated six different

multimodal rule-based strategies for backchannel

generation in face-to-face conversation. Features

they used were speaker’s speeches and eye gaze

while backchannel performed by the agent listener

was nods, vocalizations and the combination of them

randomly. Evaluation results showed that Copy

strategy got evident higher scores than the other five

strategies. In Copy strategy, the timing of

backchannel feedbacks was the same to the actual

human listener. The other five strategies used by

Poppe et al. almost covered all the rule-based

approaches widely used for backchannel prediction,

but the results of their subjective evaluations were

not so good. When participants were asked how

likely the agent’s backchannel feedbacks were

performed by a human listener, the average scores

were all below 45(the full score is 100). It proved

that current rule-based strategies themselves could

difficultly perform as good as a human listener.

In contrast with rule-based methods, prediction

models were wildly used in generating agent’s

backchannel. Morency et al. (2008) used sequential

probabilistic model (Hidden Markov Model and

Conditional Random Fields) to predict listener

backchannel. Features they used were speaker’s

prosody, spoken words and eye gaze. Prediction

results proved that their method with Conditional

Random Fields outperformed Ward and Tsukahara’s

rule-based approach, but the value of precision and

recall was 0.1862 and 0.4106 separately which was

quiet low. It was a little far from the ideal

backchannel prediction of human listener. Some of

the reasons were due to the database which the

prediction models learned from was limited and not

taking into account individual differences.

Huang et al. (2010b) proposed a method to learn

an effective prediction model from parasocial

consensus sampling. This novel data collection

method could collect large amount of behaviour data

quickly and get rid of individual differences

limitations in backchannel responses. Evaluation

results showed that the virtual agent driven by

Conditional Random Fields model trained on

parasocial consensus sampling data obtained more

rapport scale, perceived accuracy and naturalness

than rule-based Rapport Agent (Gratch et al., 2006).

Moreover, it performed better than the ones driven

by actual human listener in low-rapport videos.

Huang’s efforts in innovating methods make

sense indeed. We deem that it is really necessary to

research on backchannel feedback for virtual agent

and develop improved approaches to enhance the

agent’s performance. We notice that the timing and

type of backchannel are most significant to the

human-like backchannel behaviour of virtual agent

(Poppe et al., 2010), but in Huang et al.’s

experiments, the agent listener only used nods as

backchannel feedbacks. Furthermore, the listener’s

personality and emotion state were not taken into

consideration during the backchannel predicting and

generating process.

3 AN IMPROVED

BACKCHANNEL SYSTEM

3.1 Two Hypotheses for the

Backchannel System

Trait models of personality assume that traits

influence behaviour, and that they are fundamental

properties of an individual (McRorie et al., 2009).

For improving rapport in the conversation, we also

consider giving the agent capacity of empathy. In

this way, the emotional state of the speaker will

ICAART 2012 - International Conference on Agents and Artificial Intelligence

260

influence the agent listener’s feedback. Emphasizing

on the timing and type of backchannel, we formulate

two hypotheses:

H1: the timing of backchannel is related with the

listener’s personality.

H2: the type of backchannel is connected with

the speaker’s emotional state.

Following our two hypotheses, we propose to

make an improvement in the previous backchannel

system for virtual agent. Architecture of the

improved system is illustrated in Figure 1. It consists

of two main parts: (1) backchannel prediction, which

extracts useful features from human speaker’s video

and predicts the timing of backchannel feedbacks;

(2) backchannel generation, which generates action

commands for the virtual agent listener and animates

its backchannel feedback behaviours.

Figure 1: Architecture of the improved backchannel

system for virtual agent system.

3.2 Backchannel Prediction

After recording conversational videos, we will apply

parasocial consensus sampling (Huang et al., 2010a)

to collect different listener’s backchannel behaviour

data and then learn a probabilistic model for

backchannel prediction. According to our H1

Hypothesis, we plan to add listener’s personality

rules module to the prediction model which will

make an influence on the timing of backchannel.

To find the relationship between the timing of

backchannel and the listener’s personality, we can

ask the participants to fill out the Daniel Nettle’s

Newcastle Personality Assessor (NPA) before

participating in parasocial consensus sampling.

Results of the questionnaire will quantify the tester’s

personality on five dimensions: Extraversion,

Neuroticism, Conscientious, Agreeableness, and

Openness. After the sampling, we use neural

networks to learn from participant’s five personality

dimensions and the total number of their

backchannel feedbacks. In this way, we can obtain

the personality rules related to the number of

backchannel.

By establishing the personality rules, we can

easily select different thresholds of response level

for specific personality. The thresholds are used to

decide the timing of backchannel by filtering out the

feedbacks whose probabilities are low (Huang et al.,

2010a). To keep the agent listener’s backchannel fit

its personality, we prepare to make the number of

backchannel from parasocial consensus data closest

to that from the personality rules.

The personality rules module enables our agent

listener to be capable of different backchannel

frequency to show multiple personality and makes

our backchannel system more suitable for different

conversation occasions in the future.

3.3 Backchannel Generation

Recently, researches have shown that empathic

virtual agent enhance human-computer interaction

(Prendinger et al., 2005). In our improved

backchannel system, we hope that the agent listener

can express the same type of emotion to the speaker

through its backchannel feedback. For developing

empathy strategy, the speaker’s emotional states are

divided into positive affection and negative

affection, thus the virtual agent listener should show

positive or negative affection in order to be similar

with the speaker’s emotional state.

According to H2 Hypothesis, if we can detect the

speaker’s emotional state, the most important step is

to find out how one’s emotion influences the type of

backchannel behaviour. A context-free perceptual

study is introduced to understand how various types

of backchannel feedbacks are interpreted by users.

We intend to generate multimodal backchannel for

the agent listener as the combinations of visual and

acoustic behaviours such as smile, nod, frown, shake

head, say ‘yeah’, and say ‘hmm’. The participants

are asked to evaluate all the backchannel behaviours

to assess the emotion expressed by the virtual agent

EMPHASIZING ON THE TIMING AND TYPE - Enhancing the Backchannel Performance of Virtual Agent

261

positive affection or negative affection. After the

evaluation, we can build two emotional backchannel

lexicons. By detecting features of the speaker, the

system can analyze emotional state and randomly

select one type of backchannel from the similar

emotional backchannel lexicon of agent listener in

accordance with the empathy strategy.

With the empathy strategy and emotional

backchannel lexicon, the virtual agent listener can

‘feel’ the speaker’s emotional state and give

appropriate feedbacks.

It is obvious that detecting the speaker’s features

is the foundation of the system. Performance of the

speaker’s emotional state detection will greatly

influence the performance of our system. We

propose to combine the speaker’s facial expressions

and speeches detected by camera and microphone to

analyze his emotion. Although emotional facial

expression and emotional speech recognition have

developed for years, recognition results are not

perfect for various emotions especially in real-time

systems. We concern the implementation of our

system, so we only divide emotional states into two

kinds which are positive affection and negative

affection and build two corresponding emotional

backchannel lexicons. These two emotional states

can be recognized correctly in current real-time

experiments. Therefore we can apply the recognition

technology in the backchannel system.

4 CONCLUSIONS

In face-to-face interaction between a human speaker

and a virtual agent listener, addressing backchannel

feedbacks to virtual agent listener gives the agent

human-like conversation skills and creates rapport in

Human-Computer Interaction. In recent years,

researchers have made great efforts in predicting

backchannel for the agent listener. In this position

paper, we have argued the limitations of current

approaches. It is time for us to look for new methods

to improve the backchannel prediction and

generation.

Following the two hypotheses emphasizing on

the timing and type of backchannel, we introduce an

improved system to enhance the agent listener’s

performance. In the backchannel prediction part,

using Newcastle Personality Assessor before

parasocial consensus sampling and neural networks

will enable us to obtain the personality rules related

to the number of backchannel. Then we can easily

select different backchannel timing thresholds for

specific agent listener’s personality. In the

backchannel generation part, we intend to build two

emotional backchannel lexicons showing positive

affection and negative affection respectively after

conducting a context-free perceptual study. The

system will randomly select one type of backchannel

from the similar emotional backchannel lexicon in

accordance with the empathy strategy. Further steps

may include asking some volunteers to assess the

system and keeping on developing it according to

the evaluation results. These efforts will help to

make the system more suitable for different

conversation occasions. Implementation of the

proposed system will greatly increase the

naturalness between the human speaker and the

agent listener.

ACKNOWLEDGEMENTS

This work is supported by the National Nature

Science Foundation of China (No.61103097,

No.60873269), International Science and

Technology Cooperation Program of China

(No.2010DFA11990).

REFERENCES

Bavelas, J. B., Coates, L., Johnson, T. (2002). Listener

responses as a collaborative process: The role of gaze.

Journal of Communication, 52(3), 566-580.

Bavelas, J. B., Coates, L., Johnson, T. (2000). Listeners as

conarrators. Journal of Personality and Social

Psychology, 79(6), 941-952.

Cathcart, N., Carletta, J., Klein, E. (2003). A shallow

model of backchannel continuers in spoken dialogue.

Proceedings of the Conference of the European

chapter of the Association for Computational

Linguistics, 51-58.

Duncan, Starkey, Jr. and Fiske D. (1977). Face-to-face

Interaction. New York: Halsted Press.

Duncan, Starkey, Jr. (1974). On the structure of speaker-

auditor interaction during speaking turns. Language in

Society, 3(2): 161-180.

Gratch, J., Okhmatovskaia, A., Lamothe, F., Marsella, S.,

Morales, M., Werf, R. J. V. D., Morency, L. (2006).

Virtual Rapport. Proceedings of the International

Conference in Intelligent Virtual Agent, 14-27.

Huang L., Morency, L., Gratch, J. (2010a). Parasocial

Consensus Sampling: Combining Multiple

Perspectives to Learn Virtual Human Behaviour.

Proceedings of AAMAS 2010, 1265-1272.

Huang L., Morency, L., Gratch, J. (2010b). Learning

backchannel prediction model from parasocial

consensus sampling. Proceedings of the International

Conference in Intelligent Virtual Agent, 159-172.

ICAART 2012 - International Conference on Agents and Artificial Intelligence

262

Kendon, A. (1967). Some functions of gaze direction in

social interaction. Acta Psychologica, 26(1), 22-63.

Maatman, M., Gratch, J., Marsella, S. (2005). Natural

behaviour of a listening agent. Proceedings of the

International Conference in Intelligent Virtual Agent,

25-36.

McRorie, M., Sneddon, I., Sevin, E. D., Bevacqua, E., and

Pelachaud, C. (2009). A Model of Personality and

Emotional Traits. Proceedings of the International

Conference in Intelligent Virtual Agent, 27-33.

Morency, L., Kok, I. D., Gratch, J. (2008). Predicting

Listener Backchannels: A Probabilistic Multimodal

Approach. Proceedings of the International

Conference in Intelligent Virtual Agent, 176-190.

Poppe, R., Truong, K. P., Reidsma, D., et al. (2010).

Backchannel strategies for artificial listeners.

Proceedings of the International Conference in

Intelligent Virtual Agent, 146-158.

Prendinger, H., Mori, J., Ishizuka, M. (2005). Using

human physiology to evaluate subtle expressivity of a

virtual quizmaster in a mathematical game.

International Journal of Human Computer Studies,

62(2), 231-245.

Ward, N., Tsukahara, W. (2000). Prosodic features which

cue backchannel responses in English and Japanese.

Journal of Pragmatics, 32(8), 1177-1207.

Yngve, V. (1970). On getting a word in edgewise. Papers

from the Sixth Regional Meeting of the Chicago

Linguistic Society, 567-577.

EMPHASIZING ON THE TIMING AND TYPE - Enhancing the Backchannel Performance of Virtual Agent

263