
information shown by the GUI and the information accepted by the SR engine. Indeed, with current commercial ITSs, the vocal interface very often appears to be poorly integrated with the underlying graphical interface: the vocal commands are usually not related at all to the information shown in the GUI. Our approach, instead, was to make the GUI and the SR engine share the same context, i.e. a state change performed through the GUI is reflected in the SR state and vice versa. This allows users to mix visual/tactile and auditory inputs, making the interaction smarter and easier to learn.
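A minimal sketch of this shared-context design is given below; the class and method names are illustrative assumptions, not our actual implementation. Both the GUI and the SR grammar subscribe to a single interaction context, so a state change coming from either modality is propagated to the other.

```python
# Illustrative sketch (hypothetical names): a single interaction context
# shared by the GUI and the SR engine, so that a state change coming from
# either modality is reflected in the other.

class InteractionContext:
    def __init__(self):
        self.state = "main_menu"
        self.listeners = []                # GUI and SR grammar register here

    def subscribe(self, listener):
        self.listeners.append(listener)

    def set_state(self, new_state, source):
        self.state = new_state
        for listener in self.listeners:
            if listener is not source:     # do not notify the originator
                listener.on_state_changed(new_state)

class GuiMenu:
    def __init__(self, context):
        self.context = context
        context.subscribe(self)

    def on_state_changed(self, state):
        print("GUI now shows the '%s' screen" % state)

    def user_taps(self, item):             # visual/tactile input
        self.context.set_state(item, source=self)

class SpeechRecognizer:
    GRAMMARS = {"main_menu": ["navigation", "phone"],
                "phone": ["dial", "redial"]}

    def __init__(self, context):
        self.context = context
        context.subscribe(self)
        self.active_grammar = self.GRAMMARS[context.state]

    def on_state_changed(self, state):
        # the SR engine only accepts the commands valid in the current state
        self.active_grammar = self.GRAMMARS.get(state, [])

    def user_says(self, word):             # auditory input
        if word in self.active_grammar:
            self.context.set_state(word, source=self)

# Usage: tapping "phone" on the screen switches the SR grammar as well.
# ctx = InteractionContext()
# gui, sr = GuiMenu(ctx), SpeechRecognizer(ctx)
# gui.user_taps("phone")    # sr.active_grammar is now ["dial", "redial"]
```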
4.2 The auditory prompts
In vocal interaction, the auditory prompts are, in some sense, the basis of the “interface”, because they guide the user through the whole dialogue towards the intended task. Hence, it is fundamental that the prompts fit in with the ongoing dialogue (Krahmer, 1997) and that they are never ambiguous, so that the user is always aware of the state of the vocal interaction.
When dealing with vocal interfaces, designers can exploit two main kinds of prompts: “earcons” (Brewster, 1989) and machine-driven dialogue. In the former case, the system plays a tone to report an event of the interaction (such as the acceptance of a command or the completion of a task), while in the latter the system “says” one or more words to the same end. Each approach has its advantages and disadvantages. The machine-driven one is very useful for guiding novice users to their goals, because in each state the system lists the set of valid commands; however, the interaction is slowed down by this forced enumeration of options, most of which are presumably irrelevant to the user’s goals. The earcon-based approach, instead, leads to a very quick interaction, but it can prove hostile to novice users, who do not receive any kind of support.
In our proposal we mixed the two approaches: the interaction is mainly based on earcons, and in particular every system output terminates with an earcon, but, if the user appears to be in difficulty, the system starts to provide some kind of vocal support. As for the adopted earcons, a single earcon signals that the system is ready to accept a new input from the user, while a double earcon marks the successful end of a vocal interaction. This approach will be detailed in section 4.4.
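The following sketch illustrates, under assumed names and thresholds, how such a mixed prompting policy could be expressed: an earcon closes every system output, and spoken guidance is produced only when the user appears to be in difficulty (here, after two consecutive failures, a value chosen purely for illustration).

```python
# Hypothetical sketch of the mixed prompting policy: earcons by default,
# spoken guidance only when the user appears to be in difficulty.

READY_EARCON = "single beep"     # system ready to accept a new input
DONE_EARCON  = "double beep"     # vocal interaction successfully completed

def prompt_user(state, consecutive_failures, valid_commands, play, speak):
    """Emit the auditory prompt for the current dialogue state.

    `play` and `speak` stand in for the earcon and TTS back-ends; the
    failure threshold of 2 is an illustrative assumption.
    """
    if state == "task_completed":
        play(DONE_EARCON)                    # double earcon closes the task
        return
    if consecutive_failures >= 2:            # user seems to be in difficulty
        speak("You can say: " + ", ".join(valid_commands[state]))
    play(READY_EARCON)                       # every output ends with an earcon
```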
4.3 The error-recovery strategies
In the automotive field, the SR engine very often has to deal with errors in speech recognition. Concerning the cause of these errors, we noticed that, during a vocal interaction with the system, two kinds of fault situation can arise:
• The user does not utter any word.
• The word uttered by the user does not match any valid command.
In both cases, the system has to initiate some error-recovery strategy, but, in our opinion, these two situations imply two different kinds of problem. In the first case, most likely the user does not know what the accepted commands are for the specific state of the hierarchy, and thus immediate support is required. The second situation, on the other hand, can be caused either by the user uttering a wrong word or by a recognition error of the system. These two fault situations require different recovery strategies, but, surprisingly, most current commercial systems manage the two conditions in the same way. Moreover, regarding the second situation, we were interested in understanding how many of the unmatched inputs are caused by a user error and how many by the system. To this aim, we conducted some evaluations on medium-high class cars on the market equipped with some of the most advanced commercial ITSs. We found an average recognition rate of 95% when dealing with commands, and of almost 92% when inputting numbers in the cell-phone dialing task. Notice that, even though these results may seem a very good achievement, they mean that, when vocally composing a phone number, roughly one uttered digit in eleven is misrecognized! This quick survey led us to an important consideration: if a recognition error occurs, it is far more likely that the system failed to recognize the word than that the user uttered an invalid one. This consideration has deep implications for the definition of error-recovery strategies. The most obvious is that, if an error does occur, we should not “condemn” the user with something like “What you say is illegal”; instead, we should let the system take responsibility for the error and focus on recovering.
Moreover, when a recognition error is caused by the system, listing all the acceptable words only annoys and irritates the user. Finally, it is widely recognized that one of the worst (and most irritating) approaches in the presence of an error is to systematically ask the user to repeat the command (Gellatly, 1997). This is because, in the presence of a recognition error, users behave as they would with a human dialogue partner: they start speaking slowly, varying the volume, pitch and rate of their pronunciation, and syllabifying the words. All these actions work well with human counterparts, but unfortunately they severely degrade the results of the SR engine.
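A minimal sketch of how the two recovery paths could be kept separate is given below; it is a possible instantiation of the principles above, with assumed helper names, messages, and the use of difflib for guessing the closest command chosen purely for illustration.

```python
import difflib

def closest_command(word, commands):
    # crude stand-in for a confusability-aware matcher
    matches = difflib.get_close_matches(word, commands, n=1, cutoff=0.0)
    return matches[0] if matches else commands[0]

def recover(uttered, valid_commands, speak, play):
    """Choose the recovery action for one recognition attempt.

    `uttered` is None when no word was detected; `speak` and `play`
    stand in for the TTS and earcon back-ends.
    """
    if uttered is None:
        # Case 1: silence. The user probably does not know the commands
        # accepted in this state, so immediate vocal support is given.
        speak("You can say: " + ", ".join(valid_commands))
    elif uttered not in valid_commands:
        # Case 2: unmatched word. Statistically this is more often a system
        # misrecognition than a user mistake, so the system takes the blame
        # and proposes its best guess instead of demanding a repetition.
        speak("Sorry, did you mean '%s'?" % closest_command(uttered, valid_commands))
    # In either case (and after a valid command), a single earcon tells the
    # user that the system is ready for a new input.
    play("single beep")
```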