VOICE USER INTERFACE USING VOICEXML
Environment, Architecture and Dialogs Initiative
Alexandre M. A. Maciel and Edson C. B. Carvalho
Center of Informatics, Federal University of Pernambuco, Recife, Brazil
Keywords: Voice User Interface, Voice Technologies, Dialog Initiative and VoiceXML.
Abstract: In this work we present a set of applications for Internet with voice user interface using VoiceXML
language. Architecture, main platforms and dialog initiative ways were studied. Applicability and
limitations were determined.
1 INTRODUCTION
According to (Lévy, 1993) a man-machine interface
assigns to a set of programs and material devices
that allows the communication between an
information system and its human users. So, while
humans and machines were not able to speak the
same language, the interfaces will be necessary to
mediate the communication between them.
The technologies associated with the
construction of interfaces affect and guide our
perception and the way we interact with the
computational systems. It changes the way as we
create and communicate with them (Johnson, 2001).
If the metaphor of computer interface was another
one, probably we would think different.
At the moment, surrounded by technology and
machines anywhere, and, consequently of interfaces,
applications since Internet maintains the same
metaphor of the traditional computational systems,
which limits our interaction. Voice interfaces offer,
besides an easy and extremely intuitive
communication form, other advantages in the
machine interaction as, for example, the speed in the
input data, safety in the speaker identification and
accessibility of handicapped people.
(Gabriel, 2005) affirms that in almost all the
technological areas, the boundaries between medias,
technologies and concepts, are suffering a
dissolution and hybrid process, such as all analysis,
classification of processes and interactions become
much more complex than before. In that way, when
joining capacity and user-friendliness of voice
interfaces to dialogue, with capacity and web
diversity brings about a mean of communication
with an enormous potential.
2 VOICE USER INTERFACE
Construction of voice interface applications is a
challenge and the reason for this is that the language
is deeply related to human behaviour (Schinelle,
2005). As a consequence, the expectations related to
the interface become very high. This kind of
interface tries to lead the user to the sensation that he
could speak as if it he was talking with a human,
however it was not perfectly achieved.
The main objective of a voice user interface
project is to support the user navigation with
options, commands and available information in a
system to carry out a specifc task. Unfortunately,
access information through navigation is more
complex in the audio ambit.
A good voice interface projection can
successfully attenuate the effect of these deficiencies
using a user interaction structure that tries to carry
out the required tasks successfully. For this, some
factors must be considered in the voice interface
design: the application requirements, potentialities
and limitations of the technology and the population
characteristics (Kamm, 1995).
Once understood those factors, the voice
interface designer can anticipate some difficulties
and incompatibilities that will affect the success of
the application, minimizing its impacts.
The project and the execution of an interface are
most successful as an interactive process with the
interfaces tested empirically in groups of
representative users where the problems are
detected, corrected, and re-examined until the
system achieves a steady and satisfactory
performance.
380
M. A. Maciel A. and C. B. Carvalho E. (2007).
VOICE USER INTERFACE USING VOICEXML - Environment, Architecture and Dialogs Initiative.
In Proceedings of the Second International Conference on Signal Processing and Multimedia Applications, pages 370-374
DOI: 10.5220/0002138103700374
Copyright
c
SciTePress
2.1 Interaction
Voice interfaces supply the information systems
with an interesting alternative for input and output
data such as a voice-only interface (phone) or a
component of a multimodal and/or multimedia
system.
A voice-only interface in an information system
can become desirable for two reasons. First, the
application can require free hands in the interaction.
Second, the telephone system is a net technology
truly robust and universal. Then, it makes sense to
extend the information services from computer to
phone (Dey, 1997).
Multimodal interfaces are a human-machine
interaction for sequential or parallel applications of
input/output data. Speech recognition, keyboard,
mouse, mimic, gestures can be used as modality of
input data and to get a synthesized reply voice,
graphics or text message. These ways of interaction
can be combined dynamically to provide bigger
mobility to the user (Englert, 2006).
2.2 Dialogue Initiative
One of the fundamental aspects of the development
of applications with voice interface is the way the
dialogue initiative is taken. The strategy of
management dialogue can be by system, user or
mixed initiative (SPI Group, 2006).
In a system-initiative dialogue, the computer
asks the user and when the necessary information is
received, the solution is processed and the answer is
given. Dialogues with user-initiative assume that the
user knows what to do and how interact with the
system. Generally, the system waits for the user
input and answers it through operations.
Applications with mixed-initiative assume that the
initiative of the dialogue can be taken by the system
or the user.
3 VOICEXML
VoiceXML is a markup language and its main
objective is to bring the powerful Web development
and to give the content for applications with voice
interface. It allows the voice services integration
with data services giving access to information and
services in phone devices like the traditional Web
(VoiceXML Forum, 2000).
The advantage of using VoiceXML language to
construct voice services is that companies can create
voice automatized applications using a similar
technology used to create Web visual sites, reducing
significantly the construction cost of corporative
voice sites (Kondratova, 2004).
In VoiceXML, an application is composed of a
set of linked documents, all of them making
reference to a main document called root of the
application. All the applications begin from this
document when it is loaded onto the server. See
Figure1.
The content of a VoiceXML document is
normally divided in a series of dialogues and sub-
dialogues. The dialogs contains information for a
particular transaction processing, for example,
supply information considered in the next dialogue
after the complement of current transaction. The
sub-dialogs is generally treated as functions that are
used for specific tasks, such as processing, and is
called a dialogue “father” and returns it after
completing the requested task.
Figure 1: VoiceXML document flow.
Speech recognition systems enbale the
computers “to listen” the user’s speech and
recognize what was said. Voice synthesis systems
allow the “reading” of information. However to
obtain satisfactory performance and time-out, the
current systems limit what the user can speak inside
of a context through grammars.
A grammar is a language definition. It can be
used to describe natural languages, spoken and
written by people, and formal languages such as
programming languages, markup language
documents, mathematical language and many others
(Bringrt, 2005).
Grammars are based on a set of words and
sentences that define the possible ways of interaction
that can be used in an application. For example,
there are many ways of asking for some product.
“I´d like”, “Give me”, “I want”. The grammatical
rules can also specify kinds of user pronunciation,
depending on regional accent (Enden, 1998).
The main standard grammars are Java Speech
Grammar Format (JSGF), independent of platform
and speaker based on Java technology, and Nuance
Grammar Specification Language (NGSL) used in
Nuance systems.
VOICE USER INTERFACE USING VOICEXML - Environment, Architecture and Dialogs Initiative
381
3.1 Architecture
In the VoiceXML applications, in the same way as
in the Web applications, documents are stored on a
Web server. In addition to this server, the
VoiceXML architecture requires another server, the
voice server, which deals with all interaction
between the user and the Web server. The voice
server works as a browser in the voice applications,
interpreting all users input data and promoting
audible messages as reply. In the case of the voice
applications, the final user does not need to have a
last generation computer with any sophisticating
browser. It can access the voice application by a
fixed/mobile phone or by VOIP (Voice Over
Internet Protocol) software (Shukla, 2005). The cited
architecture is shown in Figure 2.
Figure 2: VoiceXML language architecture.
The voice server is an operational platform that
executes the VoiceXML language services. Also
called gateway, it works making a bridge between
the world of telephony and Internet and assures that
the development of the voice applications is
successful in operation and maintenance.
A communication between those servers is made
by means of HTTP (Hypertext Transfer Protocol),
protocol for the Internet or Intranet. Communication
between the voice server and the device customer is
carried through the net PSTN (Public Switched
Telephone Network).
To make this communication in an efficient way,
the voice server has diverse technological resources
(Beasley, 2001). The most important is the
interpreter who supplies advanced characteristics
similar to visual browsers (cache, favourites) and
analyzes the source code that uses speech
recognition and synthesis resources.
3.2 Platforms
In market exist many options of voice-based
applications platforms. The main informatics
companies possess production and publication tools
each one with special characteristics.
Table 1: Main Voice Platforms.
IBM Nuance Microsoft
Platform
WVS NVP MSS
Version
5.1 3.0 2004
Idiom
English,
Portuguese
English,
Portuguese
English
Source
Open Open Close
Hardware
Required
Proc. 1.0Ghz
Mem 2GB,
Disc 3GB
Proc. 1.0Ghz
Mem 1GB,
Disc 3GB
Proc 2.5Ghz
Mem. 4GB,
Disc 20GB
Software
Required
AIX, Linux,
WIN2003
Solaris,Linux
WIN2K
WIN 2003
Languages
VXML 1.0,
2.0
VXML 2.0 XML,SSML,
SRGS,SALT
The Websphere Voice Server (WVS), one of the
most known platforms, supports diverse standards of
voice interface, which gives more freedom in
relation to propriety technologies. The WVS has a
refined speech recognition and voice synthesis
resources due to the investments of IBM and offers a
great amount of idioms (IBM, 2005).
The Nuance Voice Platform (NVP) is a
VoiceXML open source platform optimized for the
development, debugging and monitoring of voice
solutions. An important characteristic of NVP is the
distributed architecture, easily scaled and managed,
specifically created to supply robustness and
flexibility to massive applications (Nuance, 2007).
The Microsoft Speech Server (MSS) combines
Web technology, voice processing services and
telephony facilities in one integrated system,
qualifying the companies to unify its infrastructure.
It was not used in this work because it has not
VoiceXML language support yet (Microsoft, 2007).
4 APPLICATIONS
Three applications have been developed with the
objective to test the main characteristics of the
VoiceXML language, its environment , architecture
and dialogue initiatives. The infrastructure used in
the applications is shown in Table 2.
Table 2: Development infrastructure
Interaction Voice-Only
Initiative System, User e Mixed
Grammar JSGF, NGSL
Platform NVP
Language VoiceXML 2.0
Idiom Portuguese
Input Device Skype Software
SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications
382
4.1 News On-Line
Taking care with the system-initiative specifications,
one application with voice interface accessed by
phone was developed to offer the regional last news
published in the Web sites by means of verbal
communication, in flexible and comfortable way.
System-initiative strategy reveals that those
applications are efficient due to its simplicity in the
interaction with the user (Lopéz, 2004).
The application starts the dialogue with a
respectful greeting to the user and later it asks using
voice commands, which canal the user wants to
access. After that, the system verifies if it is an
available option. In the positive case, the heading is
read and if the user wants to listen to it he only has
to answer yes or no; in the negative case, the system
repeats the available options until the user chooses a
valid one.
[System]:Bem vindo ao sistema de notícias.
Selecione um dos nossos canais:
Esportes, Economia ou Lazer.
[User]: Economina
[System]:PIB brasileiro tem queda de 3.4% no
primeiro trimestre. Você deseja
ouvir o conteúdo da notícia?
[User]:Não
[System]:Dólar acumula queda de 16% no ano.
Você ouvir o conteúdo da notícia?
[User]:Sim
[System]:O preço do dólar vem caindo devido
as compras do Banco Central...
[System]:Obrigado por usar nosso sistema.
Figure 3: News application simulation
4.2 Dictionary
Systems with user-initiative tend to be sufficiently
complex due to naturalness to recognize voice user´s
input, mainly because it treats the recognition of
concatenated words. In order to study the man-
machine interaction, one portuguese dictionary
application was developed achieving the
requirements of this strategy. It is an adaptation from
the book metaphor (visual) to the phone metaphor
(vocal). This kind of applications is common in the
industrial area executing control commands
(Yankelovich, 1997).
The dictionay starts with a respectful greeting
and waits for a user command. It is necessary that
the user takes the initiative of the dialogue to get the
interaction, or the call phone can be disconnected in
a determined time. From the input data, the
dictionary can give the translation or meaning of a
word by consulting its grammar. If the input given
by the user was understood, a desired answer will be
listened, otherwise it emits an error message.
[System]:Bem vindo ao sistema de dicionário.
O que deseja?
[User]: Quero o endereço da UFPE
[System]:Expressão não compreendida. O que
deseja?
[User]:Quero o significado de computador
[System]:Aparelho eletrônico concebido para
realização de cálculos e operações
lógicas...
[System]:Obrigado por usar nosso sistema.
Figure 4: Dictionary application simulation
4.3 Airlines Information
Taking care with the mixed-initiative specifications,
one application was developed to offer brazilian
airlines information with schedules and fees. Those
kinds of applications are common in dialogue
systems. Other applications offer trips consult
services, bookings and tickets sales (Lopéz, 2000).
The airlines information system starts with a
respectful greeting followed by a question.
Considering that the user has one brief knowledge of
the purpose of the application, it must say something
that demonstrates its interest in travel from one city
to another. As if there’s a mixed-initiative
application, a small intervention of the system is
necessary to guarantee the information security
during the interaction. Therefore, the second stage of
the dialogue is a confirmation of what was said by
the user. In case of positive reply the available
schedule is supplied by fees, otherwise the system
will turn back to the initial condition.
[System]:Bem vindo ao sistema de informações
aéreas. O que deseja?
[User]: Gostaria de viajar de Recife a São
Paulo
[System]:Você deseja viajar partindo de
Recife até São Paulo, sim ou não?
[User]:Sim
[System]:Foram encontrados 3 vôos saindo de
Recife até são Paulo.
Terça-Feira, 18:00h, R$389,00
Quinta-Feira, 15:30h R$389,00
Sábado, 0:15h, R$339,00
[System]:Fim dos horários. Obrigado por
usar nosso sistema.
Figure 5: Airline information application simulation
4.4 Evaluations
Each application was tested by four users of both
sex, with different skills, 20 times in different
environments (quiet or noisy). No differences of
performance in relation to the environment and sex
were determined. The experienced users had better
performance in 15% than others. The Skype
software had some transmission delays that harmed
the test. An average of 30% of calls suffered
unexpected interruptions.
VOICE USER INTERFACE USING VOICEXML - Environment, Architecture and Dialogs Initiative
383
A quantitative evaluation of the voice synthesis
system was not possible because it can not be
measured. The result was considered satisfactory by
the users. The effectiveness of the recognition
system was 100% for news online, 90% for airline
information and 86% for dictionary. The results
were obtained by a simple average of rigthness and
errors in the system. However, two factors made
difficult a correct analysis. The error can be in the
imperfection of transmission or in the grammars
especification.
5 CONCLUSIONS
The speech recognition based on the telephonic net
offers an enormous potential because it is extremely
spread out. It is also a difficult technique because it
is impossible to control use conditions. Problems
involve a great and unexpected population,
differences in the microphones of the devices, noise
and short band. The most succeed systems are those
which limit the vocabulary size.
The VoiceXML is an excellent language for
voice applications with well defined criteria.
However, it is not the perfect tool for all kind of
projects. Factors that influence the choice of the
VoiceXML are architecture, hardware, operational
system, idioms and the platform used in a particular
project. Then a correct assessment must be done in
order to decide for its use.
Regarding the dialogue initiative. Although the
system-initiative has an excellent performance in the
speech recognition, it allows a little interaction with
the user, this is ideal only for informative systems as
exchange, time forecast, etc. Nevertheless, the user-
initiative requires natural language and needs a deep
user´s knowledge about the application, and it is
recomended to corporative applications, like agenda
and email, for example. The mixed-initiative is the
interaction way which has a largest applicability
because its confirmation strategy allows a good
quality in the speech recognition and it has got a
good interaction with the user. This type of initiative
can be used as a substitution for traditional call-
centers.
ACKNOWLEDGEMENTS
We thank the Center of Informatics for infrastructure
and technical support on server’s installation.
REFERENCES
Beasley, Rick. et al. Voice Application Development with
VoiceXML. Sams Publishing, 2001.
Bringrt, Björn. Embedded Grammars . Master Thesis.
Göteborg University, Sweden, 2005.
Dey, Anind K., et al. Developing Voice-Only Applications
in Absence of Speech Recognition Technology. GVU
Technical Report, Submitted to DIS '97.
Enden, Jarkko. Java Speech API. Technical Report of
University of Helsink, 1998.
Englert, Roman, et al. Architecture of Multimodal Mobile
Application. 20th International Symposium on Human
Factors in Telecommunication. France, 2006.
Gabriel, M. C. Cruz. Entre a Máquina e o Homem.
Revista Eletrônica Cibercultura, N:1679-6756, 2005.
IBM Websphere Voice Server for Multiplataforms
V5.1.1/5.1.2 Handbook.IBM Coorporation. 2005.
Johnson, Steven. Cultura da Interface: Como o
Computador Transforma Nossa Maneira de Criar e
Comunicar. Rio de Janeiro, 2001.
Kamm, C. User Interfaces for Voice Applications.
Proceedings of the Natural Academy of Science of the
USA. PND:1995;92;10031-10037.
Kondratova, Irina. Performance and Usability of
VoiceXML Application. 8
th
World Multi-Conference
on Systemics, Cybernetics and Informatics, 2004.
Lévy, Pierry. As Tecnologias da Inteligência: O Futuro do
Pensamento na Era da Informática. São Paulo, 1993.
Microsoft, Speech Server Web Site.
http://www.microsoft.com/sppech, 2007.
Nuance Voice Plataform DataSheets.
http://www.nuance.com/voiceplatform/, 2006.
Schnelle, Dirk, et al. Audio Navigation Patterns.
EuroPLop, 2005.
Shukla, Charul. et al. VoiceXML 2.0 – Developers Guide.
Dreamtech Software Índia INC, 2000.
SPI - Speech-based & Pervasive Interaction Group.
University Tampere. Http://www.cs.uta.fi/hci/spi/ddsi
VoiceXML Fórum. Voice Extensible Markup Language.
Manual, versão 1.0, 2000.
Yankelovich, Nicole – Using Natural Dialogs as the Basis
for Speech Interface Design, Sun Microsystems
Laboratories, 1997.
SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications
384