MULTIMODAL INTERACTION WITH MOBILE DEVICES
Outline of a Semiotic Framework for Theory and Practice
Gustav Öquist
Department of Linguistics and Philology, Uppsala University, Sweden
Keywords: Mobile devices, Interaction, Multimodality, Framework, Semiotics.
Abstract: This paper explores how interfaces that fully use our ability to communicate through the visual, auditory,
and tactile senses may enhance mobile interaction. The first step is to look beyond the desktop. We do not
need to reinvent computing, but we need to see that mobile interaction does not benefit from desktop
metaphors alone. The next step is to look at what we have at hand, and as we will see, mobile devices are
already quite apt for multimodal interaction. The question is how we can coordinate information
communicated through several senses in a way that enhances interaction. By mapping information over
communication circuit, semiotic representation, and sense applied for interaction, a framework for
multimodal interaction is outlined that can offer some guidance for integration. By exemplifying how a wide
range of research prototypes fit into the framework today, it is shown how interfaces communicating
through several modalities may enhance mobile interaction tomorrow.
1 INTRODUCTION
Mobile devices have inherited a large body of
principles regarding how to interact with computers
from the desktop paradigm. This is natural since
most developers of mobile interfaces have a
background in desktop interface design, whereas the
users generally have thought of handheld computers
and cellular phones as an extension of the office
domain. Mobile devices do, however, differ in many
respects from desktop computers. As a result, we see
that although the computational power of mobile
devices is ever increasing, the two main constraints
that reduce usability remain. The first is the limited
input capabilities and the second is the limited
output capabilities, both caused by a combination of
the user's demand for small devices and the
developer's reuse of desktop interaction methods.
However, it is vital to see that small display and
keyboard sizes are not optional; they are decisive
form factors, since mobile devices have to be small
to be mobile.
Nonetheless, size is not all, and mobile devices
do have many beneficial properties that may
enhance interaction. For example, they usually have
good information processing capabilities. Yet the
opportunities offered by devices that can process and
display information in ways that better suit the small
screen, and with respect to who is using them and
where, have not been widely exploited. Moreover,
since mobile devices are strongly associated with
cellular phones, speaking to them is far more socially
acceptable than speaking to desktop computers. Yet
the opportunities offered by a device that you can
interact with verbally have not been widely put into
practice. Furthermore, mobile devices are not
confined to a desk, which is probably the foremost
reason for any customer to buy one in the first place.
Yet the opportunities offered by a device that you
can hold in your palm and interact with freely in
space, and in relation to other devices, have not been
widely put to use.
To be able to make more of these promising
properties, and combine them into useful interfaces
for multimodal interaction, we need models for
reasoning around how to interact through several
senses. A natural starting point is to look at how
humans use multimodal communication and this is
where we begin in the next section. Based on our
observations we proceed by outlining a model of
multimodal interaction based on three tiers:
Communication circuit, semiotic representation, and
sense applied for interaction. We then exemplify
how a range of research prototypes fit into the
framework. Finally, a discussion and a few
concluding remarks wrap up the paper.
2 MULTIMODAL
COMMUNICATION
Perception is both the key and keyhole for
communication since it both enables and restrains
acquisition and further interpretation of information.
From a human centred perspective, a modality can
essentially be seen as one of the senses we utilize to
make ourselves aware of the world around us. If we
stick with Aristotle’s traditional categorization, we
have vision, hearing, touch, smell, and taste. Each of
these senses can be used to perceive information that
is quite different in nature, but it is also often
possible to perceive the same information through
several senses at once.
The form of communication we are interested in
is the interactive one, in which the sender and receiver
intentionally and actively transfer intelligible
information between each other with the aim of
achieving a mutually understood goal. Information
can be seen as the raw material for message
construction and the exchange of meaning. Some
sort of coding is always a part of the creation of
information since meaning cannot be delivered
through any given medium in its pure form (that
would equal mind reading).
This is where semiotics, the theory and study
of signs, comes into play, since information is what
we decipher from signs (Chandler, 2001). However,
we do not really produce signs, we produce
stimuli; nor do we perceive signs, we perceive
stimuli. By encoding meaning into signs, the
sender shapes information into stimuli
that the receiver is assumed to perceive and decode
as the signs conveying the original meaning. This
means that the sender must be able to anticipate
what the receiver is going to recognize the stimuli
as, an anticipation that is based on situational, social,
and cultural conventions.
Since humans can communicate more efficiently
through several senses, it seems straightforward to
assume that they should also communicate more
efficiently with mobile devices through several
senses. Yet, there are
several questions that arise in the wake of this
assumption. Do mobile devices really have what it
takes? How should sight, hearing, and touch be
combined in a way that actually enhances
interaction? In order to find some answers to these
questions, we will now have a closer look at what
multimodal interaction implies by outlining a
framework. The framework in itself is not limited to
mobile devices but in the scope of this paper, we
will focus on how we can interact with mobile
devices using several senses.
3 A FRAMEWORK FOR MULTIMODAL INTERACTION
The challenge with multimodal interaction is to
channel the right information through the right sense
in the right way. We will attempt to offer some
guidance to this by structuring the communicative
situation where a user interacts multimodally with a
device into three tiers. Each tier can be thought of
as a level of reasoning that is interleaved with the
others; it is thus not a layer in the strict sense,
something that can be peeled off and viewed in
isolation. A tier is rather something that binds other
things together, and in this case the tiers bind our
model of multimodal communication together.
The first tier is the circuit of communication that
defines how the information can flow between the
sender and receiver through interaction. The second
tier is the form of information that governs how
meaning is represented as the signs used in the
interaction. The third, and last, tier is the mode of
interaction that categorizes the communication
depending on the modality that is used for transfer of
the information. Let us now examine each tier in
turn and then see how they fit together.
3.1 Circuit of Communication
The first tier has to do with the relation between the
participants in the interaction and how information is
communicated between them. Interaction implies at
least two communicative participants that we will
refer to as the sender and the receiver. Since we are
interested in interaction with mobile devices, we can
safely assume that either the sender or the receiver is
a mobile device. We can also assume that there is a
channel of communication established between them
based on a mutual understanding of the purpose of
the interaction.
Consider the following brief scenario: “A user
reads an e-mail on a mobile phone by paging down
with a joystick”. It is quite evident that the user
interacts with the device by pushing down the
joystick, whereas the device interacts with the user
by presenting more of the e-mail on the screen. Do
both the user and the device intentionally and
actively transfer intelligible information between
each other? Yes. Do both the user and the device
produce and perceive stimuli? Yes. Since both the
user and the device simultaneously produce and
perceive stimuli, we have two separate circuits of
communication (Table 1).
Table 1: Circuits of communication.
Circuit Sender Receiver
Forward Produces stimulus Perceives stimulus
Reverse Perceives stimulus Produces stimulus
The interaction in the reverse circuit is what we
think of as feedback. What the sender or receiver
perceives of its own stimulus production is also
feedback, but that has more to do with the sender's
innate communication skills than with interaction.
Throughout this paper, the mobile device will be
referred to as the receiver and the user referred to as
the sender. Input is thus when the device perceives
the user and output is when the device stimulates the
user. However, we also have a feedback loop, where
the input is when the user perceives the device and
the output is when the user stimulates the device. At
this point, it should be clear that the sender and
receiver reciprocally transfer information in one
forward and one reverse circuit during interaction.
Now that we have a frame for the communication,
we can turn to the form of information.
3.2 Form of Information
The second tier has to do with the properties of the
information that is transferred during interaction.
The constructs of information, or meaning
representations that can be communicated, are
usually called signs. Signs do not convey any
meaning in themselves; only when meaning is
attached to them do they become signs. Accordingly,
anything can be a sign as long as someone interprets
it as signifying something, i.e. referring to or
standing for something other than itself: "Nothing is
a sign unless it is interpreted as a sign" (Peirce, cited
in Chandler, 2001). The Swiss linguist Ferdinand de
Saussure and the American philosopher Charles
Sanders Peirce developed the two currently
dominant models of what constitutes a sign around a
century ago.
Saussure offered a two-part model of the sign.
He defined the sign as being composed of the
signifier and the signified, where the signifier is the
form a sign takes whereas the signified is the
concept it represents. The sign itself is the result of
an association between the signifier and the
signified. The association is purely arbitrary and
there is no one-to-one relation between the signifier
and the signified; signs have multiple rather than
single meanings and the meaning of a sign depends
on its context in relation to other signs. Peirce on the
other hand formulated a model of the sign composed
of three parts. He defined the sign as consisting of a
representamen, the form of the sign, an interpretant,
the sense made of the sign, and an object, what the
sign refers to.
Whereas Saussure did not offer any typology of
signs, Peirce offered several. Peirce’s categorization
of signs also provides a richer context for
understanding how representations convey meaning.
The most general categorization is based on three
kinds of signs. Firstly, there are indications, or
indices, which show something about things by
virtue of being physically connected with them.
Secondly, there are likenesses, or icons, which serve
to convey ideas of the things they represent simply
by imitating them. Thirdly, there are symbols, or
general signs, which have become associated with
their meanings by usage (Chandler, 2001) (Table 2).
Table 2: Forms of information.
Form Definition
Indexical Sign is directly connected to the object
Iconic Sign is analogously connected to the object
Symbolic Sign is arbitrarily connected to the object
Indexical signs can be thought of as all
representations and actions that directly connect the
mobile device with the user and the environment.
Examples of indexical signs are an alarm signal
indicating an alarm, pointing the device at
something, or tilting the device. For mobile
appliances, there is also a close relation between
indexical signs and instances of context awareness
(Kjeldskov, 2002). Iconic signs can be thought of as
all representations and actions that resemble
something else. Examples of iconic signs are a
battery icon indicating battery status, a picture of a
lifted phone indicating the call-connect function, or a
tone resembling a popular pop song. Symbolic signs
can be thought of as all representations and actions
that have to be learned, including all instances of
language used in an interface.
3.3 Mode of Interaction
The third tier has to do with how signs can be
stimulated and perceived in interaction. This is
where multimodality comes into play, as signs can be
expressed through several senses. As mentioned, we
mostly use the visual, auditory, and tactile
modalities for interaction. Each of these modalities
has unique properties for conveying information that
is very different in nature. There is also a difference
in how the same information can be expressed
through the different modalities. There is obviously
no point in designing interfaces that interact through
modalities in ways that the mobile devices cannot
perceive. However, most mobile devices actually do
have means to use all three modalities. Not every
device may have all means for input and output, but
it is likely that most devices will feature several of
them, if not primarily for multimodal interaction, at
least for multimedia content delivery (Table 3).
Table 3: Modes of interaction.
Mode Input Output
Visual Camera, IR sensor Screen, LEDs
Auditory Microphone Speaker, headphones
Tactile Buttons, Tilt sensor Buzzer, Gyro
The most commonly used input modality for
mobile devices, as for computers at large, is the
tactile, generally in the form of button presses.
One could argue that the auditory input channel is
more commonly used on mobile phones, given that
most people use them to talk into, but the information
transferred then is not really aimed at the mobile
device itself. It is a little harder to decide which
output modality is the most commonly used. For a
majority of users it is probably the visual, via the
screen, but for people who only use mobile devices
for telephony it may just as well be the auditory, for
call notification. Yet, when communicating actively
with the device, the main output modality is the
visual.
3.4 Bringing the Framework
Together
Interaction presupposes a forward and reverse circuit
of communication. Through each of these circuits,
information can be represented in the form of
indexical, iconic, or symbolic signs. Depending on
the modality that is used, the information can be
expressed in a visual, auditory, or tactile mode. If we
map these instances against each other, we get a
matrix of nine multimodal information types, each
corresponding to a particular combination of
semiotic form and interaction mode. In multimodal
communication
each information type can be independently and
concurrently communicated (Figure 1).
The labels that are used, e.g. image, sound, or
signal, should not be interpreted literally; they are
denotations for a certain combination of sign form
and interaction mode. The types are not unconditional,
nor are they unambiguous, since in most cases it is
hard to draw a sharp line between sign forms.
The intention with the categorization is to give us a
richer framework for reasoning around how different
types of information are used in multimodal
interaction. Furthermore, by distinguishing between
the different types we get a more specific view of
how information is handled in different
interfaces.
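To make the mapping concrete, the three tiers and the nine information types of Figure 1 can be expressed as a small data structure. The Python sketch below is purely illustrative: the class and label names are ours, chosen to mirror the framework, and do not belong to any existing toolkit.

```python
# A minimal sketch of the three-tier framework; names are illustrative only.
from dataclasses import dataclass
from enum import Enum

class Circuit(Enum):
    FORWARD = "forward"   # sender (user) produces stimuli, receiver (device) perceives them
    REVERSE = "reverse"   # receiver produces stimuli, sender perceives them (feedback)

class Form(Enum):
    INDEXICAL = "indexical"   # sign directly connected to the object
    ICONIC = "iconic"         # sign analogously connected to the object
    SYMBOLIC = "symbolic"     # sign arbitrarily connected to the object

class Mode(Enum):
    VISUAL = "visual"
    AUDITORY = "auditory"
    TACTILE = "tactile"

# The nine information types of Figure 1: (form, mode) -> label.
INFORMATION_TYPES = {
    (Form.INDEXICAL, Mode.VISUAL): "image",
    (Form.INDEXICAL, Mode.AUDITORY): "tone",
    (Form.INDEXICAL, Mode.TACTILE): "push",
    (Form.ICONIC, Mode.VISUAL): "picture",
    (Form.ICONIC, Mode.AUDITORY): "sound",
    (Form.ICONIC, Mode.TACTILE): "touch",
    (Form.SYMBOLIC, Mode.VISUAL): "text",
    (Form.SYMBOLIC, Mode.AUDITORY): "voice",
    (Form.SYMBOLIC, Mode.TACTILE): "signal",
}

@dataclass
class InteractionEvent:
    """One act of communication, placed in the framework's three tiers."""
    circuit: Circuit
    form: Form
    mode: Mode

    @property
    def information_type(self) -> str:
        return INFORMATION_TYPES[(self.form, self.mode)]

# Example: the user pages down an e-mail with the joystick (forward circuit,
# indexical tactile input), and the device shows more of the e-mail on the
# screen (reverse circuit, symbolic visual output).
press = InteractionEvent(Circuit.FORWARD, Form.INDEXICAL, Mode.TACTILE)
scroll = InteractionEvent(Circuit.REVERSE, Form.SYMBOLIC, Mode.VISUAL)
print(press.information_type, scroll.information_type)   # push text
```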
[Figure 1: Framework for multimodal interaction. The nine information types (image, tone, push, picture, sound, touch, text, voice, signal) correspond to the mapping of indexical, iconic, and symbolic forms over the visual, auditory, and tactile modes, communicated in a forward circuit (user produces, device perceives) and a reverse circuit (device produces, user perceives).]

There are other frameworks and typologies that
are similar to the one outlined here. Bernsen (1994)
presents a typology based on a generic approach to
the analysis of output modality types. There are two
main differences between Bernsen's typology and
this framework. Firstly, we consider input and
output inseparable in interaction, whereas Bernsen
mainly focuses on output. Secondly, our framework is
based on semiotic theory, whereas Bernsen
categorizes properties of multimodal interfaces,
resulting in no fewer than 48 more or less atomic
types. Our framework is less detailed, but also more
expressive. Nigay and Coutaz (1993) present a
design space for multimodal systems that is
complementary to the framework outlined here in
the sense that it primarily focuses on the distinction
between sequential and parallel use of modalities
and their combination. In our framework, we do not
make this distinction, although we do allow for both
concurrent processing and data fusion.
4 MULTIMODAL INTERACTION
WITH MOBILE DEVICES
We will now show how each information type can
be interacted with for both output and input by
providing examples from previous research. We
have chosen to group the examples according to
sign form, mostly to make it apparent how similar
the information is even though different interaction
modes are used. Although most examples use only a
single sign form or interaction mode, they show how
the different information types may be used for
interaction. When we have looked at all the
examples, we will turn to a concluding discussion
about how different interaction techniques may be
integrated into useful interfaces for multimodal
interaction.
4.1 Indexical Interaction: Image,
Tone and Push
The first information type is the indexical visual
image. By image we mean visual information that is
directly connected to a specific context. For input of
images, the digital camera, more or less standard on
mobile phones supporting MMS, is probably what
first comes to mind. However, the input image could
also be something only the device uses, for example
a sensor monitoring the light in the surrounding
environment, or a camera monitoring whether the
user is looking at the screen or not, as in the
SmartBailando browser (Öquist et al., 2002). For output
of images, almost all new mobile devices, such as PDAs
or multimedia-enabled phones, offer high-resolution
colour screens. If the screen is not large enough, one
solution may be to display images on a device with a
larger screen in the vicinity, as exemplified in the
Pick-and-Drop interface (Rekimoto, 1997); another
possibility is to use a head-mounted display
(www.virtualvision.com).
The indexical auditory type is referred to as the
tone. By tone we mean audible information that is
directly connected to a specific context. For input of
tones in the form of sounds we need a microphone.
One possible use of tones as auditory input is
exemplified in the Tuneserver (Prechelt and Typke,
2001), where sounds were transformed into an
indexical representation and matched against
templates of musical scores to find the name of a
song or melody. Another, more mobile-specific,
example is to monitor the loudness level in the
surroundings and adapt interfaces to it (Mäntyjärvi
and Seppänen, 2002). For output of tones we always
have the ring tone as an example, but there are
others that are more interesting. Earcons were
used by Brewster as a substitute for graphical
elements when navigating a hierarchy of nodes in an
interface. Earcons are abstract, synthetic tones
constructed from motives using timbre, register,
intensity, pitch, and rhythm (Brewster, 1998). By
using a pair of headphones, it is possible to position
sounds in three dimensions and, for example, create
audible interfaces for menu selection (Lorho et al.,
2002), or direct the user's attention to objects that
are outside the visual area of the screen, as
exemplified in the Fishears interface (McGookin and
Brewster, 2001).
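As a minimal illustration of tone output, the following sketch synthesizes a short two-note motive and writes it to a WAV file. The pitches, durations, and amplitude are assumptions made for the example and are not taken from Brewster's earcon guidelines.

```python
# Synthesize a simple two-note, earcon-like motive as a mono 16-bit WAV file.
import math
import struct
import wave

RATE = 22050  # samples per second

def tone(freq_hz, dur_s, amplitude=0.5):
    """Generate one note of the motive as a list of 16-bit samples."""
    n = int(RATE * dur_s)
    return [int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * t / RATE))
            for t in range(n)]

# A rising two-note motive: differences in pitch and rhythm like this are what
# let users tell one earcon (e.g. "new message") from another (e.g. "error").
motive = tone(440, 0.15) + tone(660, 0.30)

with wave.open("earcon.wav", "wb") as f:
    f.setnchannels(1)       # mono
    f.setsampwidth(2)       # 16-bit samples
    f.setframerate(RATE)
    f.writeframes(struct.pack("<%dh" % len(motive), *motive))
```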
The indexical tactile information type is referred
to as the push. By push we mean tactile information
that is directly connected to a specific context. For
input of push we need some form of tactile sensor: a
button, a touch screen, or an accelerometer (for
sensing the degree of tilt). Tilt has been used as
indexical tactile input for navigation, e.g. tilting
up/down or left/right, in very small interfaces, as in
the Hikari interfaces (Fishkin et al., 2000). Physically
pointing the device at objects as an interaction
method has been explored in the mobile Direct
Combination interaction technique (Holland et al.,
2002). For push output, the most common example
is the tactile feedback you get when pressing buttons.
For device-initiated tactile output we need some form
of tactile generator; most mobile phones already have
a vibrator for unobtrusive call notification. A more
elaborate, yet straightforward, example is the
TactGuide (Sokoler et al., 2002) that literally points
the user to a target location using subtle tactile
directional cues.
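The tilt-based navigation mentioned above can be sketched as a simple mapping from accelerometer readings to paging commands. The TiltSample structure, the threshold, and the command names are hypothetical; real devices expose their sensors through different APIs.

```python
# A minimal, hypothetical sketch of tilt-to-command mapping.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TiltSample:
    pitch_deg: float   # forward/backward tilt in degrees
    roll_deg: float    # left/right tilt in degrees

def tilt_to_command(sample: TiltSample, threshold_deg: float = 15.0) -> Optional[str]:
    """Interpret one tilt reading as an indexical navigation command."""
    if sample.pitch_deg > threshold_deg:
        return "page_down"
    if sample.pitch_deg < -threshold_deg:
        return "page_up"
    if sample.roll_deg > threshold_deg:
        return "next_item"
    if sample.roll_deg < -threshold_deg:
        return "previous_item"
    return None   # device held roughly level: no command issued

print(tilt_to_command(TiltSample(pitch_deg=25.0, roll_deg=2.0)))   # page_down
```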
4.2 Iconic Interaction: Picture,
Sound and Touch
The visual iconic information type is referred to as
the picture. By picture we mean visual information
that connects to an object or entity since it looks like
it. For input of new pictures we would need some
form of pad or touch screen to draw on, but it is more
common to have predefined pictures to choose
from, such as inserting a graphical smiley emoticon
in a text message, or to combine pictures with each
other, as in the direct manipulation paradigm
(Shneiderman, 1982). Another possibility is to
use something similar to the Bitpict program
(Furnas, 1991), where a matrix of pixels served as a
blackboard for picture production. However, the
most common use of pictures is for output, in the
form of the icons and metaphors used in the graphical
user interface, not only in miniaturized desktop
interfaces but also when content is viewed on
mobile devices. An example is the SmartView
browser (Milic-Frayling and Sommerer, 2002), which
displays geometrically sectioned, miniaturized
representations of web pages as they would appear
on a full-size screen; by selecting a
section, it is possible to view that portion of the page
in isolation.
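The sectioning idea can be sketched as dividing the page into a grid of regions, of which the selected one is shown in isolation. The uniform 3x3 grid below is an illustrative simplification of the geometric sectioning described above, not SmartView's actual partitioning.

```python
# Divide a full-size page into a grid of rectangles so that one section at a
# time can be shown on a small screen.
from typing import List, Tuple

Rect = Tuple[int, int, int, int]   # (x, y, width, height)

def section_page(page_w: int, page_h: int, rows: int = 3, cols: int = 3) -> List[Rect]:
    """Return the grid of sections covering the page, row by row."""
    w, h = page_w // cols, page_h // rows
    return [(c * w, r * h, w, h) for r in range(rows) for c in range(cols)]

sections = section_page(1024, 768)
selected = sections[4]             # the user taps the centre thumbnail
print("Show this region full screen:", selected)
```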
The iconic auditory type is referred to as the
sound. By sound we mean audible information that
connects to an object or entity since it sounds like it.
The most commonly used sound input is probably
the voice-activated calling function, where the user
has added a sound profile to a contact in the phone
book. By saying the "magic word", the contact is
called. This should not be confused with speech
recognition, which typically concerns continuous
speech (Gold and Morgan, 1999). Just as for pictures,
sounds are most commonly used for output on
mobile devices. The most common example on
mobile phones is probably turning pop songs into
monophonic ring tones, which then serve as a
metaphor for the actual song.
However, as more and more devices get polyphonic
sound playback capabilities, these sounds are likely
to be exchanged for sound effects instead. The
addition of nomic auditory icons (Gaver, 1986), e.g.
straight depictions like sound effects in a movie, to
self-paced reading of text on a mobile device has
been found to significantly increase the feeling of
immersion while reading (Goldstein et al., 2002).
The iconic tactile information type is referred to
as the touch. By touch we mean tactile information
that connects to an object or entity since it resembles
the feeling of it. Pirhonen et al. (2002) investigated
the use of metaphorical gestures to control an MP3
player. For example, the “next track” gesture was a
sweep of a finger across the screen left to right and a
“volume up” gesture was a sweep up the screen,
bottom to top. For output of iconic tactile
information there are, to the author's knowledge, no
stimuli generators for mobile devices yet, although
some are under development. Immersion Inc.
(www.immersion.com) claims that their engineers
have developed a device that makes it possible
to create tactile sensations resembling how
surfaces feel and how a certain action feels in three
dimensions. Research on touch output has otherwise
mostly concerned medical equipment and robotics;
however, a number of researchers have more recently
reported improvements in interaction with tactile
feedback (Oakley et al., 2002).
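As an illustration, sweep gestures of the kind described by Pirhonen et al. (2002) could be recognized by classifying the dominant direction of a stroke, as in the sketch below. The distance threshold and command names are our assumptions, not their implementation.

```python
# Classify a finger stroke by its dominant direction into a player command.
from typing import Optional, Tuple

Point = Tuple[float, float]   # (x, y) with the origin in the top-left corner

def classify_sweep(start: Point, end: Point, min_dist: float = 50.0) -> Optional[str]:
    """Map a stroke to one of four metaphorical MP3-player commands."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    if abs(dx) < min_dist and abs(dy) < min_dist:
        return None                                   # too short to count as a sweep
    if abs(dx) >= abs(dy):                            # mostly horizontal stroke
        return "next_track" if dx > 0 else "previous_track"
    return "volume_up" if dy < 0 else "volume_down"   # y grows downwards

print(classify_sweep((10, 100), (200, 110)))   # next_track
```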
4.3 Symbolic Interaction: Text,
Voice and Signal
The visual symbolic information type is referred to
as the text. By text, we mean visual information that
has a connection to an object or entity that has to be
learned. The most widely used input symbols are of
course the characters of language. Language is
extremely expressive, but you have to learn how to
use it. Text input on mobile devices is hard to make
efficient because of the devices' small form factors.
A multitude of solutions have been devised; among
those that use purely symbolic input we find different
forms of character recognition, where text is written
on a touch-sensitive screen (either as regular
characters or in short forms), or where characters are
entered on a soft keyboard on the screen (MacKenzie
and Soukoreff, 2002). Nonetheless, if we thought
entering text was cumbersome, output can be even
worse. Since text, and other figures such as graphs
or tables, in a document usually have a spatial layout,
problems arise when you attempt to read it on
a screen the size of your palm. It gets even
worse if you want to view additional content, such
as images, as well. A few different solutions have
been proposed; one, similar in spirit to predictive text
input, is Adaptive RSVP (Öquist and
Goldstein, 2003), where the text is broken up into
smaller units that are successively displayed on the
screen for durations assumed to match the reader's
processing time.
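The general idea can be sketched as splitting the text into small chunks and assigning each a display duration scaled to its length. The chunk size and timing formula below are illustrative assumptions only; the adaptation used in Adaptive RSVP is more elaborate.

```python
# Split text into small chunks and give each a display duration for
# RSVP-style serial presentation.

def rsvp_chunks(text: str, words_per_chunk: int = 2, wpm: float = 300.0):
    """Yield (chunk, duration_in_seconds) pairs for serial presentation."""
    base = 60.0 / wpm                      # seconds per word at the target rate
    words = text.split()
    for i in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[i:i + words_per_chunk])
        # Longer chunks get proportionally more time, plus a little extra
        # per character to approximate reading effort.
        duration = base * words_per_chunk + 0.01 * len(chunk)
        yield chunk, duration

for chunk, dur in rsvp_chunks("Reading text on a palm-sized screen is hard."):
    print(f"{chunk!r} shown for {dur:.2f} s")
```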
The auditory symbolic information type is
referred to as the voice. By voice we mean auditory
information that has a connection to an object or
entity that has to be learned. The prime form of
vocal input is naturally speech recognition, an input
method that offers great promise but is extremely
difficult, and even more so in mobile environments
because of additional sounds in the surroundings.
Recognition of fluent speech on
mobile clients is a major research topic and there are
many issues that need to be resolved before we can
rely on it for interaction. However, limited speech
recognition is not far-fetched, and there is work in
progress on how to define limited vocabularies and
at least attain limited speech interaction (von Niman
et al., 2002). For auditory symbolic output, there is of
course speech synthesis, somewhat easier to
accomplish than recognition, but similarly hard to
get natural. It is also hard to make interaction with
speech synthesis efficient, since listening to speech is
half as fast as reading text (Williams, 1998). In order
to achieve a conversational interface there are also
several other components, besides those for speech
recognition and synthesis, that must be integrated
into a system that can sustain a fruitful dialog
(McTear, 2002).
The tactile symbolic information type is referred
to as the signal. By signal we mean tactile
information that has a connection to an object or
entity that has to be learned. For entering text there
are several tactile interfaces. Typing on buttons is
probably the most commonly used, although most
mobile devices do not have a proper keyboard, and
several solutions exist for text entry without one.
The smarter methods are those similar to Tegic T9
(www.tegic.com) or LetterWise (MacKenzie et al.,
2001) that use linguistic knowledge to achieve
single-tap instead of multi-tap typing. There are also
a few interfaces for tilt-based typing, as exemplified
by the Unigesture prototype (Sazawal et al., 2002),
where different characters are added to words by
tilting the device in different directions. A quite
different solution to text entry is Dasher (Ward et
al., 2002), where characters slide across the screen
and are selected by indicating them through tilt
or gaze detection. The only form of symbolic tactile
output the author could think of is the
Braille display for blind users, which is based on six
pegs that are raised in different combinations that
can be interpreted as characters.
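The single-tap principle can be sketched by indexing a vocabulary on keypad digit sequences, so that one key press per letter suffices and the remaining ambiguity is resolved from the dictionary. The toy vocabulary below is an assumption for the example; real systems use large dictionaries and, in LetterWise's case, letter-sequence statistics rather than a word list.

```python
# Index a small vocabulary by keypad digit sequence for single-tap text entry.
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

VOCABULARY = ["call", "ball", "hello", "home", "good", "gone"]

def index_vocabulary(words):
    """Map each digit sequence to the words it could stand for."""
    index = {}
    for word in words:
        digits = "".join(LETTER_TO_DIGIT[ch] for ch in word)
        index.setdefault(digits, []).append(word)
    return index

INDEX = index_vocabulary(VOCABULARY)

print(INDEX["2255"])   # ['call', 'ball'] - same key sequence, the user picks one
print(INDEX["4663"])   # ['home', 'good', 'gone'] - linguistic ranking would order these
```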
5 DISCUSSION
The main contribution of this paper is that it offers a
framework for reasoning around multimodal
interaction with mobile devices in a structured
manner. We offer a design space that encapsulates
all of the interaction possibilities using a multimodal
interface. Empirical usability evaluations are always
necessary to validate hypotheses about usability,
but finding a common ground to compare and
discuss results is equally important. Allwood
(2002) has presented a framework for bodily
communication that is similar to the one presented
here in the sense that it also rests on Peirce’s
indexical, iconic, and symbolic signs. It does,
however, mainly concern bodily interaction, in
addition to voice and writing, and is thus not fully
comparable to the framework outlined here. Yet, it
raises one very interesting question. Will the
inclusion of more expressive descriptions of
communication support or complicate our
understanding? Allwood argues that this is not likely
to be without problems, but hopefully the reward
“will consist in an increased understanding of human
communication” (2002:20). The intention with the
framework outlined in this paper is similarly to
provide a richer context for understanding
multimodal interaction.
Interfaces in which users are able to choose
between using different modalities are already in
use. As more integrated interfaces appear, users will
not have to select the modality to use, they will be
able to switch seamlessly from one to another.
Multimodal interfaces will allow the mobile user to
interact through the modality that best suits them
and the environment they are in. Integrated
multimodal interfaces will allow users to make use
of their ability to work with multiple modes of
interaction in parallel. Eventually, multimodal
interfaces may let users interact with mobile devices
in the way humans normally do with each other: by
looking, talking, and touching, all at the same time.
As functionality gets more sophisticated, interaction
gets more natural. This represents a challenge today,
but it also represents the promise of multimodal
interaction with mobile devices for the future.
6 CONCLUSION
We have shown how interfaces that utilize our
ability to communicate through the visual, auditory,
and tactile senses may enhance mobile interaction.
As we have seen, mobile devices are apt for
multimodal interaction, and we have raised the
question of how we may coordinate information
communicated through several senses in a way that
promotes interaction. A framework for integration of
multimodal interaction has been outlined by
mapping information over communication circuit,
semiotic representation, and sense applied for
interaction. By exemplifying how different research
prototypes fit into the framework today, we have
shown how interfaces can be interacted with through
multiple modalities. The foremost benefit of the
framework is that it can support our reasoning
around how to make the best of these possibilities
tomorrow.
REFERENCES
Allwood, J. (2002). Bodily Communication - Dimensions
of expression and Content. In B. Granström, D. House,
and I. Karlsson (Eds). Multimodality in Language and
Speech Systems, 7-26. Kluwer Academic Publishers.
Bernsen, N. O. (1994). Foundations of multimodal
representations: A taxonomy of representational
modalities. Interacting with Computers, 6(4), 347-371.
Brewster, S.A. (1998). Using non-speech sounds to
provide navigation cues. ACM Transactions on
Computer-Human Interaction, 5(3), 224-259.
Chandler, D. (2001). Semiotics: The Basics. New York:
Routledge.
Fishkin, K. P., Gujar, A., Harrison, B. L., Moran, T. P.
and Want, R. (2000). Embodied user interfaces for
really direct manipulation. Communications of the
ACM, 43(9), 75-80.
Furnas, G.W. (1991). New graphical reasoning models for
understanding graphical interfaces. In Proceedings of
ACM CHI'91 Conference (New Orleans, LA), 71-78.
New York, NY: ACM Press.
Gaver, W. (1986). Auditory icons: Using sound in
computer interfaces. Human-Computer Interaction, 2,
167-177.
Gold, B., and Morgan, N. (1999). Speech and Audio Signal
Processing: Processing and Perception of Speech and
Music. New York, NY: John Wiley & Sons.
Holland, S., Morse, D.R., and Gedenryd, H. (2002). Direct
Combination: A new user interaction principle for
mobile and ubiquitous HCI. In Proceedings of Mobile
HCI 2002 (Pisa, Italy), 108-122. Berlin: Springer.
Kjeldskov, J. (2002). "Just-in-Place" information for
mobile device interfaces. In Proceedings of Mobile
HCI 2002 (Pisa, Italy), 271-275. Berlin: Springer.
Lorho, G., Hiipakka, J., and Marila, J. (2002). Structured
menu presentation using spatial sound separation. In
Proceedings of Mobile HCI 2002 (Pisa, Italy), 419-
424. Berlin: Springer.
MacKenzie, I. S., and Soukoreff, R. W. (2002). Text entry
for mobile computing: Models and methods, theory
and practice. Human-Computer Interaction, 17, 147-
198.
MacKenzie, I. S., Kober, H., Smith, D., Jones, T., and
Skepner, E. (2001). LetterWise: Prefix-based
disambiguation for mobile text input. In Proceedings
of UIST’01 (Orlando, FL), 111-120. New York, NY:
ACM Press.
McGookin, D.K., and Brewster, S.A. (2001). Fishears –
The design of a multimodal focus and context system.
In Proceedings of IHM-HCI’01, Vol. II (Lille,
France), 1-4. Toulouse: Cépaduès-Editions.
McTear, M.F. (2002). Spoken dialogue technology:
enabling the conversational interface. ACM
Computing Surveys, 34(1), 90 – 169.
Milic-Frayling, N., and Sommerer, R. (2002). SmartView:
Flexible viewing of web page contents. In Proceedings
of WWW’02 (Honolulu, USA).
Nigay, L., and Coutaz, J. (1993). A design space for
multimodal interfaces: concurrent processing and
data fusion. In Proceedings of InterCHI'93,
(Amsterdam, The Netherlands), 172-178.
Mäntyjärvi, J., and Seppänen, T. (2002). Adapting
applications in mobile terminals using fuzzy context
information. In Proceedings of Mobile HCI 2002
(Pisa, Italy), 95-107. Berlin: Springer.
Oakley, I., Adams, A., Brewster, S.A., and Gray, P.D.
(2002). Guidelines for the design of haptic widgets. In
Proceedings of BCS HCI 2002 (London, UK), 195-
212. London: Springer.
Öquist, G., Goldstein, M., and Björk, S. (2002). Utilizing
gaze detection to stimulate the affordances of paper in
the Rapid Serial Visual Presentation Format. In
Proceedings of Mobile HCI 2002 (Pisa, Italy), 378-
381. Berlin: Springer.
Öquist, G., and Goldstein, M. (2003). Towards an
improved readability on mobile devices: Evaluating
Adaptive Rapid Serial Visual Presentation. Interacting
with Computers, 15(4), 539-558.
Pirhonen, A., Brewster, S.A., and Holguin, C. (2002).
Gestural and audio metaphors as a means of control
for mobile devices. In Proceedings of ACM CHI’02
(Minneapolis, MN), 291-298. New York, NY: ACM
Press.
Prechelt, L., and Typke R. (2001). An interface for melody
input. ACM Transactions on Computer-Human
Interaction, 8(2), 133-194.
Rekimoto, J. (1997). Pick-and-Drop: A direct
manipulation technique for multiple computer
environments. In Proceedings of UIST’97, 31-39. New
York, NY: ACM Press.
Sazawal, V., Want, R., and Borriello, G. (2002). The
Unigesture approach. In Proceedings of Mobile HCI
2002 (Pisa, Italy), 256-270. Berlin: Springer.
Shneiderman, B. (1982). The Future of Interactive
Systems and the Emergence of Direct Manipulation.
Behaviour and Information Technology, 1, 237-256.
Sokoler, T., Nelson, L., and Pedersen, E.R. (2002). Low-
resolution supplementary tactile cues for navigational
assistance. In Proceedings of Mobile HCI 2002 (Pisa,
Italy), 369-372. Berlin: Springer.
Ward, D. J., Blackwell, A. F., and MacKay, D. J. C.
(2002). Dasher: A gesture-driven data entry interface
for mobile computing. Human-Computer Interaction,
17, 199-228.
Williams, J. R. (1998). Guidelines for the use of
multimedia in instruction. In Proceedings of the
Human Factors and Ergonomics Society 42nd Annual
Meeting (Chicago, IL), 1447-1451. Santa Monica,
CA: HFES.