the Ambient Assisted Living Joint Program (http://www.aal-europe.eu/), which
aims at providing multilingual, voice-controlled home care and communication services for seniors
who suffer from chronic diseases and/or (fine-)
motor skills restrictions. The goal of the project is to
develop an adaptive communication interface that
allows for a simple and efficient interaction with
elderly users. While vAssist’s main focus lies in providing predominantly speech-based access to
services, GUIs will also be available, in particular
for situations in which a rather traditional interaction
paradigm is more appropriate (e.g. interactions in
public places). Furthermore, in order to lower
potential adoption barriers as well as to reduce
service-delivery costs, vAssist aims at using
hardware platforms that often already exist in users’
homes (e.g. PCs, Smart TVs, mobile phones, tablet
PCs).
3 THE VOICE INTERFACE
Users who are not familiar with the small user interface of a smartphone and the amount of content it shows may be disoriented by the number of steps required to access a specific piece of information. Even if speaking to a machine may not seem natural, early user tests have shown great acceptance of the concept despite its still primitive capabilities. To cope with the imperfections of the automatic system, vAssist currently includes a 'Wizard of Oz'-type service in which a human stands in for the machine in order to train the automatic dialog manager for as long as possible.
3.1 vAssist Wizard of Oz
In the initial stages of the deployment of a cloud-based spoken dialog system, it is desirable to
experiment with a limited number of clients in order
to check the acceptance of the application, collect
speech data and make sure that the limitations of the
speech recognition technology will not interfere with
our goal. Therefore, the dialog system will be hosted
in a call center where operators will be available to
monitor the discussions with clients. These operators
will be able to listen to the spoken interactions and
have access to the output of the speech recognizer.
The operator is able to intervene if a misrecognition by the speech recognizer leads to improper responses.
Actions of these operators will direct the dialog
in ways that may not have been planned initially.
Spoken dialogs are recorded for further adaptation of the required speech recognition resources (acoustic and language models) (Schlögl et al., 2013a).
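As an illustration of this operator-in-the-loop cycle, the following Python sketch shows one way a single Wizard-of-Oz turn could be structured. It is a minimal sketch under assumed interfaces, not the vAssist implementation; the names (recognizer, console, dm, log) are hypothetical placeholders for the actual components.

    from dataclasses import dataclass

    # Sketch of one Wizard-of-Oz turn: the operator hears the audio, sees
    # the recognizer hypothesis and may override it before the dialog
    # manager acts. Every turn is logged for later model adaptation.
    @dataclass
    class Turn:
        audio: bytes        # recorded user utterance
        hypothesis: str     # raw speech recognizer output
        final_text: str     # text actually passed to the dialog manager

    def wizard_of_oz_turn(audio, recognizer, console, dm, log):
        hypothesis = recognizer.recognize(audio)
        # An empty correction means the operator accepts the hypothesis.
        correction = console.review(audio, hypothesis)
        final_text = correction or hypothesis
        log.store(Turn(audio, hypothesis, final_text))
        return dm.respond(final_text)

The point of such a design is that the operator's correction is recorded alongside the raw hypothesis, so the logged pairs can later serve as adaptation material for both the recognizer and the dialog manager.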
3.2 The Spoken Dialog System
A Spoken Dialog System (SDS) is the integration of
a dialog system whose core component is the Dialog
Manager (DM) connected to speech input and output
modalities (Milhorat et al., 2012). Such software engages the user in an exchange of information in
order to access back-end services, store user data in
an organized and reusable way or simulate a
conversational partner. One way to characterize a dialog system is by the input and output modalities available for interaction. These may include speech,
text, gestures, or possibly a touch screen. In the case
of speech, a further distinction can be made between
command-and-control, keyword-based or natural
language-based interaction. The first may not truly be considered “interacting” with an SDS, since it merely maps well-defined spoken commands to actions to be undertaken by the
system. A keyword-based system extends these associations with the interaction context maintained by the DM. The interpretation of user
utterances is based on single-word pattern detection.
Keywords need to be carefully selected to avoid
overlaps between intents. This paradigm therefore
does not allow for subtle nuances with respect to the
input that can be understood. Moreover, it is very sensitive to speech recognition errors.
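To make the keyword-based paradigm concrete, it can be reduced to a lookup of single-word patterns over the recognized word sequence, as in the following minimal sketch; the intents and keywords below are invented for illustration and are not taken from vAssist.

    from typing import Optional

    # Hypothetical keyword-based interpreter: each intent is triggered by
    # single-word patterns. No word may appear under two intents, and a
    # single misrecognized word selects the wrong action, which is the
    # fragility noted above.
    KEYWORDS = {
        "set_reminder": {"reminder", "remind", "appointment"},
        "call_contact": {"call", "phone"},
        "read_messages": {"messages", "mail"},
    }

    def interpret(utterance: str) -> Optional[str]:
        words = set(utterance.lower().split())
        for intent, keys in KEYWORDS.items():
            if words & keys:
                return intent
        return None  # no keyword detected: the system cannot react

    print(interpret("please remind me of my appointment"))  # -> set_reminder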
In the case of vAssist, the system (Milhorat et al., 2013a) communicates with the user via natural speech and thus requires the user to learn neither a set of available commands nor a set of keywords. Instead, the natural language
understanding components (Milhorat et al., 2013b)
attempt to extract a meaning representation out of
the word-level utterance using the external context,
the internal context and some previously learned
knowledge. The obtained representation (e.g. a semantic frame) is interpreted by the DM, which calls back-end services and produces outputs to the user, such as requests for more information or the results of service calls rendered in a meaningful form.
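By way of illustration, such a meaning representation can be sketched as a frame holding an intent and slot values, which the DM either completes through follow-up requests or resolves with a back-end call. The structure below is an assumption made for this sketch, not the actual vAssist data model; backend.execute stands for a hypothetical service interface.

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    # Hypothetical semantic frame produced by the NLU components, e.g.
    # "remind me to take my pills at eight" ->
    #   SemanticFrame("set_reminder", {"task": "take pills", "time": "8:00"})
    @dataclass
    class SemanticFrame:
        intent: str
        slots: Dict[str, Optional[str]] = field(default_factory=dict)

    REQUIRED_SLOTS = {"set_reminder": ("task", "time")}

    def dm_step(frame, backend):
        # Ask for any information that is still missing before acting.
        for slot in REQUIRED_SLOTS.get(frame.intent, ()):
            if not frame.slots.get(slot):
                return f"Could you tell me the {slot}?"
        # The frame is complete: call the back-end service and render
        # its result to the user.
        return backend.execute(frame.intent, frame.slots)

In this scheme the DM's follow-up requests and the final service result are both ordinary system outputs, matching the behavior described above.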
As mentioned previously, speaking to a piece of technology using natural language may be disturbing, even to some extent uncomfortable.
However, early trials of the simulated Wizard of Oz-
based system (Schlögl et al., 2013b) produced