Better Choice? Combining Speech and Touch in Multimodal
Interaction for Elderly Persons
Cui Jian¹, Hui Shi¹, Nadine Sasse², Carsten Rachuy¹, Frank Schafmeister², Holger Schmidt³ and Nicole von Steinbüchel²
¹SFB/TR8 Spatial Cognition, University of Bremen, Enrique-Schmidt-Straße 5, Bremen, Germany
²Medical Psychology and Medical Sociology, University Medical Center Göttingen, Waldweg 37, Göttingen, Germany
³Neurology, University Medical Center Göttingen, Robert-Koch-Str. 40, Göttingen, Germany
Keywords: ICT, Ageing and Disability, Elderly-centered Design, Multimodal Interaction, Elderly-friendly Interface,
Formal Methods, System Evaluation.
Abstract: This paper presents our work on developing, implementing and evaluating a multimodal interactive guidance system for elderly persons that features spoken language and touch-screen input. The development foundation of the system comprises two systematically designed and empirically improved aspects: a set of development guidelines for elderly-friendly multimodal interaction that addresses common ageing-related decline of important human abilities, and a hybrid dialogue modelling approach that combines formal-method-based specification with agent-based management for elderly-centered multimodal interaction. To evaluate the resulting system, an experimental study was conducted with thirty-three elderly persons, and the empirical data were analysed by applying an adapted version of a general evaluation framework. The analysis yielded overall positive results and validated our effort to develop effective, efficient and elderly-friendly multimodal interaction.
1 INTRODUCTION
As demographic developments show, the number of elderly people in modern societies is constantly growing (Lutz et al., 2008). These persons often suffer from age-related decline or impairment of sensory, perceptual, physical and cognitive abilities. This poses particular challenges for technical systems, which are becoming ever more common in the daily routines of elderly persons.
Meanwhile, attention is increasingly focused on technical systems with multimodal interfaces, which provide users with multiple modes of interacting with a system and thereby improve the quality of human-system communication in terms of effectiveness, efficiency and user-friendliness (cf. (Jaimes and Sebe, 2007)).
Thus, in order to maximize the usability and user experience of technical systems for elderly persons, research on multimodal interaction for this specific user group has gained increasing interest during the last decade. Various emerging technologies have been considered, such as advanced speech-enabled interfaces (Krajewski et al., 2008), brain-signal interfaces (Mandel et al., 2009) and visual input via digital cameras (Goetze et al., 2012); a large contribution is also being made to "Ambient Assisted Living", the concept of developing age-adjusted and care-friendly living environments (cf. (Rodríguez et al., 2011)).
This paper presents a multimodal interactive system that provides elderly persons with both spoken language and touch-screen input modalities. It has been developed and implemented particularly for the elderly, focusing on two important aspects: 1) a set of development guidelines for multimodal interactive systems with respect to the basic design principles of conventional interactive systems and the most common age-related characteristics; and 2) a hybrid dialogue modelling and management approach that combines an advanced finite-state-based generalized dialogue model with classic agent-based dialogue theory; it supports flexible and context-sensitive, yet tractable and controllable multimodal interaction within a formal-language-based
development framework. The resulting system has been continuously improved on the basis of a series of evaluative studies in our previous work concerning the development foundation and the individual modalities (cf. (Jian et al., 2011), (Jian et al., 2012)). In order to further evaluate the combined spoken language and touch-screen input modality, and to assess the complete multimodal interactive system concerning its effectiveness of task performance, efficiency of interface interaction and acceptance by elderly users, a supplementary experimental study was conducted with 33 elderly persons. The data were analysed by applying an adapted version of the general evaluation framework PARADISE (Walker et al., 1997). The results are briefly described in this paper.
The rest of the paper is organized as follows:
section 2 briefly introduces the design guidelines for
multimodal interactive systems for elderly persons
and the hybrid approach for multimodal interaction
management; section 3 presents the multimodal
interactive guidance system and section 4 describes
the experimental study on evaluating the modality
combining spoken language and touch-screen; the
results are analysed and discussed in section 5.
Finally, section 6 concludes the reported work and
outlines the direction of our future activities.
2 THE DEVELOPMENT
FOUNDATION
The theoretical and technical foundation of our work
comprises two aspects:
A set of design guidelines for an elderly-
centered multimodal interactive system;
A hybrid modelling approach for dialogue management based on formal methods and agents.
They are both systematically designed with
respect to their suitability for the application, and
continuously improved by our previous empirical
studies (cf. (Jian et al., 2012)).
2.1 Design Guidelines of Multimodal
Interactive Systems for Elderly
Persons
Physical and cognitive decline is almost universal in
the elderly. According to (Birdi et al., 1997), these
age-related characteristics should be considered
while developing interactive systems for the elderly.
Therefore, based on the common design principles
for conventional interactive systems and the ageing-
related empirical findings, we defined a set of design
guidelines for multimodal interaction with respect to
the decline of important human perceptual and
cognitive functions. These guidelines have been
implemented into our multimodal interactive
guidance system, evaluated by our previous
empirical studies, and then improved on the basis of
their results.
The final set of improved design guidelines was summarized in (Jian et al., 2012). For reasons of brevity, we report here only the empirical findings regarding the commonly observed decline of seven human abilities, each followed by the most important design elements implemented and improved during our system development:
Visual Perception worsens for most people with age (Fozard, 1990). The size of the visual field decreases and peripheral vision can be lost. It becomes more difficult to focus on close objects and to see fine details, so that rich colours and complex shapes can make images hard or even impossible to identify. Rapidly moving objects either cause too much distraction or become less noticeable. This decline mainly concerns the graphical user interface. Based on the suggested guidelines, only a simple and clear layout without overlapping items was constructed; sans-serif fonts of 12-14 point size were chosen for all displayed texts; simple, high-contrast colours without fancy visual effects were used, with items placed well apart; and regularly shaped rectangles and circles were selected for comfortable perception and easy identification.
Speech Ability declines with age: pronouncing complex words or longer sentences becomes less efficient, probably due to reduced motor control of tongue and lips (Mackay and Abrams, 1996). (Moeller et al., 2008) confirmed that elderly-centered adaptation of speech-enabled interactive components can improve the interaction quality to a satisfactory level. Therefore, the vocabulary and grammar for our speech recognizer and analyser were constructed with preferably short and easy wording from daily communication; dialogue strategies were also adjusted to many elderly-specific situations.
Auditory Perception declines to around 75% among 75 to 79 year olds (cf. Kline and Scialfa, 1996). High-pitched sounds are hard to perceive; complex sentences are difficult to follow (Schieber, 1992). Therefore, text and acoustic output were both provided as system responses. Style, vocabulary and structure of the sentences were intensively revised regarding brevity. A low-pitched yet vigorous male voice was used for the speech synthesis.
BetterChoice?CombiningSpeechandTouchinMultimodalInteractionforElderlyPersons
95
Motor Abilities generally decline with age due to reduced physical activity. Complex fine motor activities become more difficult to perform, e.g. grabbing small or irregular targets (cf. Charness and Bosman, 1990); conventional input devices such as a computer mouse are less preferred by elderly persons, as good hand-eye coordination is required (Walker, Philbin and Fisk, 1997). Taking these findings into account, a touch screen was chosen as the haptic interface; regularly shaped, sufficiently sized and well separated interface elements were constructed; and pressing, instead of clicking or dragging, was chosen as the only action in order to avoid otherwise frequently occurring errors.
Attention and Concentration decline with age. Elderly persons either become more easily distracted by details and noise, or find it harder to notice other things when concentrating on one thing (Kotary and Hoyer, 1995); they show great difficulty in situations where divided attention is needed (McDowd and Craik, 1988). Thus, fancy irrelevant images and decorations were removed. Unified fonts, colours and sizes of interface elements were used throughout the interaction. Simple animations for notifying changes were constructed, giving sufficiently clear feedback to the user.
Memory Functions decline in different ways. Short-term memory holds fewer items with age and working memory becomes less efficient (Salthouse, 1994), whereas semantic information is normally preserved in long-term memory (Craik and Jennings, 1992). Guided by these facts, the quantity of displayed items was restricted to no more than three, in accordance with the average short-term memory capacity of elderly persons; sequentially presented items were intensively revised to assist orientation during the interaction. Context-sensitive cues were presented with selected colours: green for items concerning persons, yellow for items concerning rooms, etc.
Intellectual Reasoning Ability does not decline much during the normal ageing process. (Hawthorn, 2000) argued that crystallized intelligence can help elderly persons perform better in a stable, well-known interface environment. Therefore, consistent layout, colours and interaction styles were used throughout the interaction; changes on the interface can only happen at the data level.
2.2 The Hybrid Approach for
Interaction Management
The hybrid dialogue modelling approach combines
the finite-state-based generalized dialogue models
with the classic agent-based dialogue management
theories. This section
introduces the basic theory of this approach,
presents the instance model, adapted by applying the hybrid approach, for multimodal interaction with elderly persons;
describes a formal language based
development toolkit, which is then used to
support the implementation of the instance
model and its integration into our multimodal
interactive guidance system for achieving an
effective, flexible, yet formally controllable
multimodal interaction management.
2.2.1 The Theory
The development of the hybrid dialogue modelling approach benefited from existing research on two important interaction management theories:
The generalized dialogue models, which are
constructed with recursive transition networks
(RTN) at the illocutionary level. These networks can
abstract dialogue models by describing discourse
patterns as illocutionary acts, without reference to
any direct surface indicators (cf. (Alston, 2000));
The classic agent-based management method: information state update based management theories (cf. (Traum and Larsson, 2003)), which focus on modelling the discourse context as the attitudinal state of an intelligent agent. This method provides a powerful way to handle dynamic information for context-sensitive dialogue management.
However, these two well-accepted methods have their own limitations. On the one hand, the generalized dialogue models are based on finite state transition models, which are criticized for their inflexibility in dealing with dynamic information exchange; on the other hand, the information state update models are usually very difficult to manage and extend. Therefore, we designed a hybrid dialogue modelling approach by extending the generalized dialogue model with conditions and information state update rules attached to the finite-state transitions.
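To make this combination concrete, the following minimal sketch (in Python, with hypothetical names; the actual system specifies such models in the formal language CSP, as described in Section 2.2.3) illustrates a finite-state transition that carries a guard condition and an information state update rule:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# The information state is a simple attribute store shared by all transitions.
InfoState = Dict[str, Any]

@dataclass
class Transition:
    """A finite-state transition extended with a guard and an update rule."""
    move: str                                  # illocutionary act, e.g. "S.request"
    source: str                                # source dialogue state
    target: str                                # target dialogue state
    condition: Callable[[InfoState], bool]     # guard over the information state
    update: Callable[[InfoState], None]        # information state update rule

@dataclass
class HybridDialogueModel:
    state: str
    info: InfoState = field(default_factory=dict)
    transitions: List[Transition] = field(default_factory=list)

    def step(self, move: str) -> bool:
        """Fire the first enabled transition matching the observed move."""
        for t in self.transitions:
            if t.source == self.state and t.move == move and t.condition(self.info):
                t.update(self.info)
                self.state = t.target
                return True
        return False                           # no enabled transition: the move is rejected

# Example: the initiating schema Dialogue(S, U) with its INIT update rule.
def init_rule(info: InfoState) -> None:
    info.clear()
    info["history"] = []                       # dialogue history, kept for later turns

model = HybridDialogueModel(
    state="Start",
    transitions=[Transition("S.request", "Start", "UserAction",
                            condition=lambda info: True, update=init_rule)],
)
model.step("S.request")                        # system greets; dialogue context initialized
```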
2.2.2 The Interaction Model
In order to manage multimodal interaction for
elderly persons, an adapted hybrid dialogue model
was constructed and evaluated by our previous work
(cf. (Shi et al., 2011)). The accordingly improved
version consists of four hybrid dialogue schemas:
the initiating schema, the user’s action schema, the
system’s response schema and the user’s response
schema, corresponding to the four general types of transition during an interaction.
HEALTHINF2013-InternationalConferenceonHealthInformatics
96
Each interaction is initiated with the schema
Dialogue(S, U) (cf. Figure 1), by the initialization of
the system’s start state and a greeting request. In
Dialogue(S, U) the system initiates a dialogue with a
request move (i.e. S.request), which causes the initialization of the dialogue context using the update rule INIT.
Figure 1: The initiating model.
The dialogue continues with the user's instruction, request for certain information, or restart action, leading to the system's further response or a dialogue restart, respectively, while the information state is updated with the attached update rules (cf. Dialogue(U, S) in Figure 2).
Figure 2: The user’s actions model.
After receiving user input, the system tries to generate an appropriate response according to its current knowledge base and information state (cf. Response(S, U) in Figure 3). This can mean informing the user of the requested data, rejecting an unacceptable request with or without giving reasons, providing choices when there are multiple options, or asking for confirmation before taking a critical action, each of which triggers transitions to other hybrid models.
Finally, the user can accept or reject the system’s
response, or even ignore it by simply providing new
instructions or requests, triggering further state
transitions as well as information state updates (cf.
Response(U, S) in Figure 4).
Besides the improvements made on the basis of the specific interaction data of the elderly subjects in our previous studies, the decline of physical and cognitive abilities of elderly persons, especially memory function, concentration and fluid reasoning, should be considered as well. Therefore, when improving the current hybrid dialogue model we also included the following features to assist the elderly during the multimodal interaction (a sketch of the resulting information state follows the list):
Figure 3: The system’s response model.
Figure 4: The user’s response model.
Relevant dialogue history information, such as
the latest utterance, was added into the current
information state and provided in case of
speech recognition problems.
Context-sensitive information, which is kept in the current information state, is designed to be either presented directly after each interaction step or included within dialogue utterances, in order to ease the common problems caused by declining memory function.
Additional context information is provided by specific information state update rules in extreme cases: e.g., if automatic speech recognition problems become too disruptive, messages containing the possibly recognized context are presented.
Instead of keeping rich transition alternatives at the illocutionary level, the hybrid model was kept as compact and intuitive as possible.
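A minimal sketch (Python; the field names and the concrete behaviour are our illustration, not the authors' implementation) of how such an elderly-adapted information state and its use in problematic speech recognition situations might be organised:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ElderlyInfoState:
    """Hypothetical information state fields supporting the features listed above."""
    history: List[str] = field(default_factory=list)  # latest utterances, reusable on ASR problems
    context_cue: Optional[str] = None                  # context shown after each interaction step
    asr_error_count: int = 0                           # consecutive recognition problems

    def on_user_utterance(self, utterance: str, recognized: bool) -> str:
        """Update the state for one user turn and return a system prompt."""
        if recognized:
            self.history.append(utterance)
            self.asr_error_count = 0
            return f"OK. Current context: {self.context_cue or 'none'}."
        self.asr_error_count += 1
        if self.asr_error_count >= 2 and self.history:
            # Extreme case: present the possibly recognized context from the history.
            return f"I did not understand. Did you mean: '{self.history[-1]}'?"
        return "Could you please repeat that?"
```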
2.2.3 A Development Framework to Support
the Hybrid Dialogue Modelling
Approach
The structure of a hybrid dialogue model is in fact a
typical finite state transition model. This feature
enables any hybrid dialogue model to be formally specified as machine-readable code, e.g.
BetterChoice?CombiningSpeechandTouchinMultimodalInteractionforElderlyPersons
97
using the mathematically well-founded formal language Communicating Sequential Processes (CSP) (cf. (Roscoe, 1997)), which is well established in the formal methods and computer science community. Furthermore, a CSP program is supported by mature model checkers, which provide rich possibilities for validating its concurrent aspects and increasing the tractability of the specified model (cf. (Hall and Chapman, 2002)).
Thus, in order to support the development of
hybrid dialogue models using the formal language
CSP and their integration into a practical multimodal
interactive system, we designed FormDia, the
Formal Dialogue Development Toolkit (cf. Figure
5).
Figure 5: The Structure of the FormDia Toolkit.
Theoretical and technical details about FormDia can be found in (Shi and Bateman, 2005). In general, the FormDia Toolkit supports the implementation and integration of a hybrid dialogue model into an interactive system through four components:
Validator: after a hybrid dialogue model is specified in CSP, it can be validated by an external model checker, the Failures-Divergence Refinement tool (FDR) (Broadfoot and Roscoe, 2000), for validating and verifying the concurrency of the state automata.
Generator: from the validated CSP specification, machine-readable finite state automata are generated by the Generator.
Simulator: with the generated finite state automata and the communication channels, dialogue scenarios can be simulated via a graphical interface, which visualizes dialogue states as a directed graph and provides a set of utilities for preliminary testing.
Dialogue Management Driver: finally, the dialogue model is integrated into an interactive system via the dialogue management driver.
Therefore, FormDia enables an intuitive design
of hybrid dialogue models with formal language,
automatic validation of the related functional
properties, easy simulation and verification of
specified interaction situations, and a
straightforward integration into a practical
interactive system.
3 SYSTEM DESCRIPTION
Based on the development foundation introduced in
the previous section, we developed a general
Multimodal Interactive Guidance System for Elderly
Persons (MIGSEP).
3.1 System Introduction
MIGSEP runs on a portable touch-screen tablet PC and serves as the interactive medium. It is intended to be used by an elderly or handicapped person seated in an autonomous electronic wheelchair that can automatically carry its user to desired locations within complex environments. The user interacts with MIGSEP through the combined spoken language and touch-screen input modality to find the desired target.
3.2 System Architecture
The architecture of MIGSEP is illustrated in figure
6. The Generalized Dialogue Manager was
developed using the introduced adapted hybrid
dialogue model and the FormDia toolkit. It functions
as the central processing unit of the entire system
and supports a formally controllable and extensible,
meantime flexible and context-sensitive multimodal
interaction management. An Input Manager receives
and interprets all incoming messages from the GUI
Action Recognizer for GUI input events, the Speech
Recognizer for natural language understanding and
the Sensing Manager for other possible sensor data.
An Output Manager, on the other hand, handles all outgoing commands and distributes them to the View Presenter for presenting visual feedback, the Speech Synthesizer for generating natural language responses, and the Action Actuator for performing necessary motor actions, such as sending a driving request to the autonomous electronic wheelchair.
The Knowledge Manager, constantly connected with
the Generalized Dialogue Manager, uses a Database
to keep the static data of certain environments and
the Context to process the dynamic information
exchanged with the users during the interaction.
All components of MIGSEP are closely connected via XML-based communication channels, and each component can be treated as an open black box that can be modified or extended
HEALTHINF2013-InternationalConferenceonHealthInformatics
98
for concrete domain-specific use without affecting the other components of the MIGSEP architecture. This provides a general open platform for both theoretical research and empirical studies on single- or multimodal interaction across different application domains and scenarios.
Figure 6: The architecture of MIGSEP.
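As an illustration of these XML-based communication channels, the following minimal sketch (Python; the message schema and field names are hypothetical, not the system's actual format) builds the kind of event the GUI Action Recognizer might send to the Input Manager:

```python
import xml.etree.ElementTree as ET

def make_gui_event(action: str, target: str, state: str) -> str:
    """Build a hypothetical GUI input event message for the Input Manager."""
    msg = ET.Element("message", source="GUIActionRecognizer", type="gui_event")
    ET.SubElement(msg, "action").text = action         # e.g. "press"
    ET.SubElement(msg, "target").text = target          # e.g. "card_0"
    ET.SubElement(msg, "dialogueState").text = state    # current dialogue state
    return ET.tostring(msg, encoding="unicode")

# For example, the user presses the first card while the person view is shown:
print(make_gui_event("press", "card_0", "PersonView"))
```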
3.3 Interaction with the System
The current instance of MIGSEP was implemented as a guidance system for elderly persons in the application domain of hospital environments. Figure 7 shows a user interacting with MIGSEP.
Figure 7: A user is interacting with MIGSEP.
This MIGSEP setup consists of a button device for triggering a "press to talk" signal, a green lamp that signals the "pressed and ready to talk" state, and the tablet PC on which the MIGSEP system runs and the interface is displayed. The MIGSEP interface consists of two areas:
The Function-area contains the function button "start" at the top left for returning to the start state, the function button "toilet" below it, which addresses the basic needs of elderly persons, and the text area in the middle for displaying the system responses.
The Choice-area displays information entities as single cards that can be selected, with a scrollbar indicating the position of the currently displayed cards and a context-sensitive coloured bar showing the currently relevant context if necessary.
Figure 8 shows a sample dialogue combining spoken language and touch-screen interaction between MIGSEP and a user who would like to go to the cardiology department, to a doctor named Wolf.
Figure 8: A sample interaction with MIGSEP.
4 EXPERIMENTAL STUDY
To evaluate how well the MIGSEP system can assist elderly persons using the combined spoken language and touch-screen modality, an experimental study was conducted together with the Department of Medical Psychology and Medical Sociology in Göttingen.
4.1 Participants
33 elderly persons (19 male, 14 female; mean age 70.7 years, standard deviation 3.1), all German native speakers, participated in the study. All of them had to pass the Mini-Mental State Examination (MMSE), a screening test for assessing cognitive mental status (cf. (Folstein et al., 1975)). A test score between 28 and 30 indicates normal cognitive functioning or at most a slight decline. Our participants showed an average score of 28.9 (SD = 0.83).
4.2 Stimuli and Apparatus
Visual stimuli were presented via a green lamp and a
graphical user interface on the screen of a portable
tablet PC; audio stimuli were also generated by the
MIGSEP system and played via two loudspeakers at
a well-perceivable volume. All tasks were given as
keywords on the pages of a calendar-like system.
There were two types of input, which could be freely chosen: spoken language, active while the button was being pressed and the green lamp was on; and touch-screen actions, performed directly on the touch-screen display. The same data set, containing fictitious yet sufficient information about personnel, rooms and departments of a typical hospital, was used throughout the experiment.
BetterChoice?CombiningSpeechandTouchinMultimodalInteractionforElderlyPersons
99
During the experiment, each participant was accompanied by the same investigator, who introduced the system and gave well-defined instructions at the beginning and provided help if necessary during the trial (which was very rare).
An automatic internal logger of the MIGSEP system collected real-time system-internal data, while the standard Windows audio recorder kept track of the whole dialogic interaction.
A questionnaire focusing on user satisfaction with MIGSEP, with respect to the combined spoken language and touch-screen input modality, was designed especially for this study. It contains six questions concerning the quality of the combined modality compared to a single modality, its feasibility, advantages, usability and appropriateness, and the participant's preference. Each participant answered this questionnaire on a five-point Likert scale.
4.3 Procedure
Each participant had to undergo four phases:
Introduction: a brief introduction was given to
the participants, so that they could get the basic idea
and an overview of the experiment.
Learning and Pre-tests: the participants were instructed how to interact with MIGSEP using spoken natural language and touch-screen input. In order to minimize learning or bias effects with respect to the use of one modality, we introduced a cross-over procedure: 16 of the 33 participants first used the touch-screen input and then spoken language, while the other 17 used spoken language first and then the touch-screen input. All participants had to perform 11 tasks concerning navigation in a hospital in order to reach a certain goal. For each modality, each task contained incomplete yet sufficient information about a destination the participant should select; for example, they had to drive to "room 2603", to "Sonja Friedrich", or to "room 1206 or room 2206 with the name OCT-Diagnostics". A task ended either when the goal was selected or when the participant gave up after six minutes.
Testing: after performing 22 tasks with both modalities, each participant was asked to freely choose between the spoken language and touch-screen input modalities to perform another 11 tasks; these contained similar information as in the pre-tests (varied only at the data level) and were performed under the same conditions.
Evaluation: after all tasks were completed, each participant was asked to fill in the evaluation questionnaire.
5 RESULTS AND ANALYSIS
According to the PARADISE framework (Walker et al., 1997), the performance of an interactive system can be measured via the effectiveness and efficiency of the system and the user satisfaction. Therefore, these three aspects were analysed.
5.1 Effectiveness of the System
To find out how effectively the elderly were assisted by the MIGSEP system with the combined modality, the statistical measure Kappa was used. In the classic PARADISE framework, however, the Kappa method was originally used to evaluate spoken dialogue systems.
Therefore, in order to calculate the Kappa coefficient for the multimodal interaction with the MIGSEP system, we first had to adapt the original attribute value matrix in such a way that it still contained all the information exchanged during the multimodal interaction between MIGSEP and the participants. For this purpose, we introduce the concept of an Attribute Value Tree (AVT) (cf. the example in Figure 9).
Figure 9: An Attribute Value Tree.
An AVT is defined as a finite state transition diagram that contains all expected correct ways, either touch-screen inputs or spoken language commands, as state transitions from the start state to the target state. As the AVT for the task "go to a person named Sonja Friedrich" in Figure 9 illustrates, any correct interaction starts from the state MainView and proceeds, for example, to PersonView by selecting the
HEALTHINF2013-InternationalConferenceonHealthInformatics
100
first card (MS: select 0) or by performing the spoken language command "I want to go to a person" (M: Person), or to the AllWomen state by simply saying "I want to go to a woman" from the MainView, and so on. An AVT contains the expected data set of a task and therefore functions similarly to the original attribute value matrix, yet with the possibility of recording multimodal interaction exchanges.
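The following minimal sketch (Python; the state and move labels follow the example above, while the data structure and matching procedure are our illustration, not the authors' implementation) shows how an AVT can be represented and matched against logged interaction data:

```python
from typing import Dict, List, Tuple

# An AVT as a finite state transition diagram: (state, observed move) -> next state.
# Moves may be touch-screen selections ("MS: ...") or spoken commands ("M: ...").
AVT = Dict[Tuple[str, str], str]

sonja_avt: AVT = {
    ("MainView", "MS: select 0"): "PersonView",
    ("MainView", "M: Person"): "PersonView",
    ("MainView", "M: Woman"): "AllWomen",
    # ... further transitions down to the target state "Sonja Friedrich"
}

def match_log(avt: AVT, log: List[Tuple[str, str]]) -> Dict[str, List[int]]:
    """Compare logged (state, move) pairs against the expected AVT transitions.

    Returns per-state [matched, not matched] counts, i.e. the rows of a confusion matrix.
    """
    counts: Dict[str, List[int]] = {}
    for state, move in log:
        row = counts.setdefault(state, [0, 0])
        if (state, move) in avt:
            row[0] += 1          # actual data match the expected attribute value
        else:
            row[1] += 1          # e.g. a misrecognized spoken language command
    return counts
```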
Thus, 11 AVTs were created, one for each of the 11 tasks, and by combining the actual data recorded during the experiment with the expected attribute values in the AVTs, we constructed the confusion matrices for all tasks. For example, Table 1 shows the confusion matrix for the task "drive to a person named Sonja Friedrich", where "M" and "N" denote whether or not the actual data match the expected attribute values in the AVTs. For instance, there were 25 correctly selected actions in the PersonSelect (PS) state, and the spoken language command concerning the first name (FN) was misrecognized by the system 6 times. Note that, because of the width of the text, not all attributes of this confusion matrix can be shown in this example.
Table 1: The confusion matrix for the task “drive to a
person named Sonja Friedrich”.
Data   PS (M / N)   MS (M / N)   ...   FN (M / N)   Sum
PS     25 / 0                                        25
MS                   14 / 0                          14
...                               ...                ...
FN                                     62 / 6        68
The data for all confusion matrices were merged
and a total confusion matrix for all the data of the 11
performed tasks was created.
Given the total confusion matrix, the Kappa
coefficient was calculated with
κ = (P(A) − P(E)) / (1 − P(E))   (Walker et al., 1997)

In our experiment, P(A) = (Σ_i M(i, M)) / T is the proportion of times that the actual data agree with the expected attribute values, and P(E) = Σ_i (M(i) / T)² is the proportion of times that such agreement is expected by chance, where M(i, M) is the value of the matched cell of row i, M(i) is the sum of the cells of row i, and T is the sum of all cells.
Thus, we could calculate the Kappa coefficient of the total confusion matrix as κ = 0.91, suggesting a highly successful degree of interaction between MIGSEP and the participants using the combined spoken language and touch-screen modality.
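A minimal sketch (Python) of this calculation; the example rows are placeholders in the style of Table 1, not the study's actual total confusion matrix:

```python
from typing import Dict, Tuple

def kappa(matrix: Dict[str, Tuple[int, int]]) -> float:
    """Kappa for a confusion matrix given as {row: (matched, not matched)}.

    P(A) = sum_i M(i, M) / T, P(E) = sum_i (M(i) / T)**2,
    kappa = (P(A) - P(E)) / (1 - P(E)).
    """
    total = sum(m + n for m, n in matrix.values())
    p_a = sum(m for m, _ in matrix.values()) / total
    p_e = sum(((m + n) / total) ** 2 for m, n in matrix.values())
    return (p_a - p_e) / (1 - p_e)

# Placeholder rows: state -> (matched, not matched)
example = {"PS": (25, 0), "MS": (14, 0), "FN": (62, 6)}
print(round(kappa(example), 2))
```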
5.2 Efficiency of the System
In order to find out how efficiently the participants were assisted by the combined modality, quantitative data of every single interaction during the testing phase were automatically logged. The results are summarized in Table 2 with respect to four aspects that are important for the efficiency analysis.
Table 2: Efficiency of the system for each participant and
each task.
                     Average   Standard deviation
User turns           7.4       3.6
System turns         7.4       3.6
ASR errors           0.3       0.4
Elapsed time (s)     48.7      20.0
The average of 7.4 user turns and 7.4 system turns per participant and task indicates a very satisfying efficiency of the system, because the basic turn numbers that can be inferred from the theoretically shortest solutions are 2.9 user turns and 2.9 system turns for spoken language input only, and 5.6 user turns and 5.3 system turns for touch-screen input only. The standard deviation of 3.6 even indicates that some of the participants solved tasks with approximately the shortest solutions.
Still, the average turn numbers are somewhat higher than the average for the shortest solution; a closer look at the detailed data suggests two reasons:
Four participants used only touch-screen input to interact with the system, which significantly increased the total turn numbers.
When combining spoken language and touch-screen inputs, many participants first used the touch-screen to narrow down the rough direction for each task and then used spoken language instructions to find the target. This inevitably increased the turn numbers, yet clearly indicates their intention to avoid possible problems caused by automatic speech recognition. This is also reflected by the very low average number of ASR errors of 0.3 per task, i.e. in most interactions no ASR error occurred at all.
Meanwhile, the average elapsed time per task and participant (48.7 seconds) is also considered very short: even measured against the shortest spoken-language solution, these 48.7 seconds cover 5.8 interaction steps (2.9 user turns plus 2.9 system turns), i.e. at most 8.4 seconds per turn on average, and this even includes long system utterances
BetterChoice?CombiningSpeechandTouchinMultimodalInteractionforElderlyPersons
101
of over 10 seconds. Although the standard deviation of 20.0 is somewhat high, it is caused by the same few participants, especially the one who used only touch-screen input, searched by brute force and needed 135.8 seconds per task on average.
5.3 User Satisfaction of the System
Regarding user satisfaction with the system, we analysed the subjective data from the evaluation questionnaire concerning the interaction with the system using the combined modality. The results are summarized in Table 3 and underline very good user experiences with the combined modality.
Table 3: Data concerning subjective user satisfaction.
                                  Mean   Standard deviation
Better than single modality?      4.4    1.1
Easier solving of tasks?          4.0    1.3
Showing advantages?               4.5    1.0
Combined modality usable?         4.1    1.5
Prefer the combined modality?     4.4    1.3
Not confusing?                    4.5    0.9
Overall                           4.3    1.0
However, the scores for easier task solving and for the usability of the combined modality were somewhat lower than the others, and the corresponding standard deviations were also higher. This is again mainly due to the extreme cases in which participants used only touch-screen input, were left with an unpleasant impression of using the touch-screen alone, and therefore gave comparably lower scores in the questionnaire.
6 CONCLUSIONS AND FUTURE
RESEARCH
In this paper we reported our work on multimodal
interaction for elderly persons by focusing on the
following two important aspects:
A summary of our systematically designed and empirically improved foundation for developing and implementing elderly-centered multimodal interaction;
An evaluation of the combined spoken language and touch-screen input modality of a multimodal interactive guidance system for the elderly, carried out by applying an adapted, well-established evaluation framework.
The results of the evaluation showed a very high degree of effectiveness, efficiency and user satisfaction for our system, specifically when using the combined spoken language and touch-screen input modality. This confirms our theoretical and technical foundation, approaches and frameworks for developing effective, efficient and elderly-friendly multimodal interactive systems.
The reported work continues our pursuit of building effective, efficient, adaptive and robust multimodal interactive systems and frameworks for the elderly in ambient assisted living environments. Further studies are needed to investigate the reported extreme cases. Corpus-based supervised and reinforcement learning techniques will be applied to support and improve the formal-language-driven and agent-based hybrid modelling and management approach. More research and experiments on assisting the elderly in navigating through complex buildings are also being conducted.
ACKNOWLEDGEMENTS
We gratefully acknowledge the support of the German Research Foundation through the Collaborative Research Center SFB/TR 8 Spatial Cognition, as well as of the Department of Medical Psychology and Medical Sociology and the Department of Neurology of the University Medical Center Göttingen.
REFERENCES
Alston, W. P., 2000. Illocutionary Acts and Sentence Meaning. Cornell University Press.
Birdi, K., Pennington, J., Zapf, D., 1997. Aging and errors
in computer based work: an observational field study.
In Journal of Occupational and Organizational
Psychology. pp. 35-74.
Broadfoot, P., Roscoe, B., 2000. Tutorial on FDR and Its
Applications. In K. Havelund, J. Penix and W. Visser
(eds.), SPIN model checking and software verification.
Springer-Verlag, London, UK, Volume 1885, pp. 322.
Charness, N., Bosman, E., 1990. Human Factors and
Design. In J.E. Birren and K.W. Schaie, (eds.),
Handbook of the Psychology of Aging. Academic
Press, Volume 3, pp. 446-463.
Craik, F., Jennings, J., 1992. Human memory. In F. Craik
and T.A. Salthouse, (eds.), The Handbook of Aging
and Cognition. Erlbaum, pp. 51-110.
Folstein, M., Folstein, S., McHugh, P., 1975. "Mini-mental state": a practical method for grading the cognitive state of patients for the clinician. In Journal of Psychiatric Research. Volume 12, 3, pp. 189-198.
Fozard, J. L., 1990. Vision and hearing in aging. In J.
Birren, R. Sloane and G. D. Cohen (eds), Handbook of
HEALTHINF2013-InternationalConferenceonHealthInformatics
102
Mental Health and Aging. Academic Press, Volume 3, pp. 18-21.
Goetze, S., Fischer, S., Moritz, N., Appell, J. E., Wallhoff, F., 2012. Multimodal Human-Machine Interaction for Service Robots in Home-Care Environments. In Proceedings of the 1st International Conference on Speech and Multimodal Interaction in Assistive Environments. The Association for Computational Linguistics, pp. 1-7.
Hall, A., Chapman, R., 2002. Correctness by construction:
Developing a commercial secure system. In IEEE
Software. Vol. 19, 1, pp. 18-25.
Hawthorn, D., 2000. Possible implications of ageing for interface designers. In Interacting with Computers. pp. 507-528.
Jaimes, A., Sebe N., 2007. Multimodal human-computer
interaction: A survey. In Computational Vision and
Image Understanding. Elsevier Science Inc., New
York, USA, pp. 116-134.
Jian, C., Schafmeister, F., Rachuy, C., Sasse, N., Shi, H., Schmidt, H., Steinbüchel-Rheinwll, N. v., 2011. Towards Effective, Efficient and Elderly-friendly Multimodal Interaction. In PETRA 2011: Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments. ACM, New York, USA.
Jian, C., Schafmeister, F., Rachuy, C., Sasse, N., Shi, H., Schmidt, H., Steinbüchel-Rheinwll, N. v., 2012. Evaluating a Spoken Language Interface of a Multimodal Interactive Guidance System for Elderly Persons. In HealthInf 2012: Proceedings of the International Conference on Health Informatics. SciTePress, Vilamoura, Algarve, Portugal.
Kline, D. W., Scialfa, C. T., 1996. Sensory and Perceptual
Functioning: basic research and human factors
implications. In A. D. Fisk and W. A. Rogers. (eds.),
Handbook of Human Factors and the Older Adult,
Academic Press.
Kotary, L., Hoyer, W. J., 1995. Age and the ability to
inhibit distractor information in visual selective
attention. In Experimental Aging Research. Volume
21, Issue 2.
Krajewski, J., Wieland, R., Batliner, A., 2008. An acoustic framework for detecting fatigue in speech based human computer interaction. In Proceedings of the 11th International Conference on Computers Helping People with Special Needs. Springer-Verlag, Berlin, Heidelberg, pp. 54-61.
Lutz, W., Sanderson, W., Scherbov, S., 2008. The coming
acceleration of global population ageing. In Nature.
pp. 716-719.
Mackay, D., Abrams, L., 1996. Language, memory and aging. In J. E. Birren and K. W. Schaie (eds.), Handbook of the Psychology of Aging. Academic Press, Volume 4, pp. 251-265.
Mandel, C., Lüth, T., Laue, T., Röfer, T., Gräser, A.,
Krieg-Brückner, B., 2009. Navigating a Smart
Wheelchair with a Brain-Computer Interface
Interpreting Steady-State Visual Evoked Potentials. In
Proceedings of the 2009 IEEE/RSJ International
Conference on Intelligent Robots and Systems. IEEE
Xplore, St. Louis, Missouri, United States, pp. 1118-
1125.
McDowd, J. M., Craik, F. 1988. Effects of aging and task
difficulty on divided attention performance. In Journal
of Experimental Psychology: Human Perception and
Performance 14. pp. 267-280.
Moeller, S., Goedde, F., Wolters, M., 2008. Corpus analysis of spoken smart-home interactions with older users. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, and D. Tapias (eds.), Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC). ELRA.
Rodríguez, M. D., García-Vázquez, J. P., Andrade, Á. G., 2011. Design dimensions of ambient information systems to facilitate the development of AAL environments. In Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments. ACM Press, New York, United States, pp. 4:1-4:7.
Roscoe, A. W., 1997. The Theory and Practice of Concurrency. Prentice Hall.
Salthouse, T. A., 1994. The aging of working memory. In
Neuropsychology 8, pp. 535-543.
Schieber, F., 1992. Aging and the senses. In J. E. Birren,
R. B. Sloane, and G. D. Cohen, (eds.) Handbook of
Mental Health and Aging, Academic Press, Volume 2.
Shi, H., Bateman, J., 2005. Developing human-robot
dialogue management formally. In Proceedings of
Symposium on Dialogue Modelling and Generation.
Amsterdam, Netherlands.
Shi, H., Jian, C., Rachuy, C. 2011. Evaluation of a Unified
Dialogue Model for Human-Computer Interaction. In
International Journal of Computational Linguistics
and Applications. Bahri Publications, Volume 2.
Traum, D., Larsson, S., 2003. The information state
approach to dialogue management. In J.v. Kuppevelt
and R. Smith (eds.), Current and New Directions in
Discourse and Dialogue. Kluwer, pp. 325-354.
Walker, N., Philbin, D. A., Fisk, A. D., 1997. Age-related differences in movement control: adjusting submovement structure to optimize performance. In Journal of Gerontology: Psychological Sciences 52B, pp. 40-52.
Walker, M. A., Litman, D. J., Kamm, C. A., Abella, A., 1997. PARADISE: a framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, NJ, USA, pp. 271-280.
BetterChoice?CombiningSpeechandTouchinMultimodalInteractionforElderlyPersons
103