SPATIALIZED AUDIO CONFERENCES

IMS Integration and Traffic Modelling

Christopher J. Reynolds, Martin J. Reed

University of Essex, Colchester, Essex, UK

Peter J. Hughes

Broadband Applications Research Centre, BT Group, Adastral Park, Ipswich, UK

Keywords:

Spatial audio, audio conferencing.

Abstract:

Existing monophonic multiparty VoIP conferencing applications are currently limited to supporting a single conversation floor, with limited numbers of simultaneous speakers. We discuss the additional requirements and benefits of delivering a spatially enhanced audio application via Head Related Transfer Function (HRTF) filtering, which may support many conversation floors. Several network delivery architectures are presented, including integration into the Next Generation Network (NGN) IP Multimedia Subsystem (IMS). The delivery architectures are compared using traffic models, and implications for the scope of such an application are discussed.

1 INTRODUCTION

Multiparty VoIP conferencing is among a range of advanced voice services to be offered by next generation networks (NGNs), with the IP multimedia subsystem (IMS) providing native support both for media delivery and for session management via the Session Initiation Protocol (SIP). We explore a proposal for the delivery of a new headphone based VoIP multiparty conferencing application, with Head Related Transfer Function (HRTF) (Cheng and Wakefield, 2001) filtering used to provide a spatially enhanced audio environment. Existing monophonic conferencing systems impose severe limits upon the participants' ability to converse naturally, particularly for large groups. Spatialized audio conferencing allows for a much more natural audio environment, and extends support for larger groups by allowing overlapping speech to be distinguished as separate perceptual streams (Bregman, 1994). Whilst the theoretical limit to the number of participants within a monophonic conference may be large, users are typically limited to interacting via a single conversation floor in which they align their turns of speech, with a suggestion that the maximum number of simultaneous speakers be set at 3 (Venkatesha et al., 2003). As such, support for multiple conversation floors is restricted and indeed not expected to occur, as many overlapping speakers presented monaurally are difficult to distinguish. However, the addition of spatial cues can extend support for multiple conversation floors. The ability to focus upon a particular talker in the presence of other conversations is greatly enhanced when the sources are spatially separated, a phenomenon well known as the cocktail party effect (Cherry, 1953). It is known that presenting multiple audio sources from different spatial locations aids the perceptual organization of sound streams (Bregman, 1994), and can enhance memory, comprehension and intelligibility (Baldis, 2001). Conferences with spatial cues more closely resemble face to face meetings and conversations, and represent a significant advance over existing monophonic conferencing applications.

The audio mixing and filtering process may be performed locally at a user's terminal, or centrally via a dedicated server. Whilst methods for mixing audio using these models have been discussed (Singh et al., 2001), spatialized audio conferences have not yet been covered. We discuss the additional requirements for delivering such an application, the relative merits of adding spatial cues, and how such an application may be integrated within the NGN/IMS model. We cover the limits imposed by both the psychoacoustic properties of such an audio environment and the network delivery architecture. Three network delivery architectures are then considered: Centralized, Unicast Full Mesh, and a brief outline of a Hybrid system. A traffic model is constructed with reference to NGN core/access partitioning, and comparisons of the resulting traffic are made for each architecture.

J. Reynolds C., J. Reed M. and J. Hughes P. (2007).
SPATIALIZED AUDIO CONFERENCES - IMS Integration and Traffic Modelling.
In Proceedings of the Second International Conference on Signal Processing and Multimedia Applications, pages 243-248
DOI: 10.5220/0002134002430248
SciTePress

2 CONFERENCE MODELS AND SPATIAL AUDIO

Spatialized or 3D audio for virtual multiparty conferencing has been implemented by Kilgore et al. (2003), with simple manipulation of interaural time differences (ITD) and interaural intensity differences (IID) in accordance with duplex theory (Cheng and Wakefield, 2001). HRTF based systems are known to produce effective spatial reproduction (Crispien and Ehrenberg, 1995; Evans et al., 2000) and have been integrated into a conferencing application under our development.

Using HRTF based spatial audio, a participant's mono voice stream may be convolved with an HRTF to give a binaural audio stream that has temporal and spectral effects that mimic a sound source from a given point in space. Convolving each participant's stream with a different HRTF (relating to a different azimuth and/or elevation), and then mixing the output for all participants, produces an audio space in which each speaker's utterances will appear to emanate from a different spatial location. As mentioned previously, this has many benefits for communication and, more importantly, allows multiple conversation floors to emerge through the process of schisming (Egbert, 1997), in which a large conversation floor involving many participants may fragment into several smaller floors. Users make use of the cocktail party effect to ignore other conversations within the audio space, and to align their speech turns to a conversation floor of their choosing. As a result many floors may exist within the space/conference. The floor control mechanism of limiting and choosing the number of simultaneous speakers is no longer required, as many participants may speak simultaneously without masking each other. Limits to the number of conferees are discussed later in relation to the delivery architecture.
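The convolve-then-mix step described above can be sketched in a few lines of Python. This is a minimal illustration only: `toy_hrir` is a hypothetical stand-in for a measured HRTF set and models nothing more than a delay and an attenuation at the far ear, and the sample rate, impulse response length and function names are assumptions rather than details of the application under development.

```python
import math

FS = 16000      # sample rate in Hz (assumed for this sketch)
IR_LEN = 32     # toy impulse response length in samples (assumed)

def toy_hrir(azimuth_deg):
    """Return (left, right) impulse responses for a source at azimuth_deg.

    Hypothetical stand-in for a measured HRIR pair: the far ear receives a
    delayed, attenuated impulse. Positive azimuth means a source to the right.
    """
    lateral = abs(math.sin(math.radians(azimuth_deg)))
    delay = int(round(lateral * 0.0007 * FS))   # time difference up to ~0.7 ms
    near = [0.0] * IR_LEN
    far = [0.0] * IR_LEN
    near[0] = 1.0
    far[delay] = 1.0 - 0.5 * lateral            # quieter at the far ear
    return (far, near) if azimuth_deg >= 0 else (near, far)

def convolve(x, h):
    """Direct-form linear convolution of two sample sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def spatialize_and_mix(streams, azimuths_deg):
    """Convolve each mono stream with its own HRIR pair and sum the results.

    Returns the (left, right) channels of a single binaural mix.
    """
    n = max(len(s) for s in streams) + IR_LEN - 1
    left, right = [0.0] * n, [0.0] * n
    for mono, az in zip(streams, azimuths_deg):
        h_left, h_right = toy_hrir(az)
        for k, v in enumerate(convolve(mono, h_left)):
            left[k] += v
        for k, v in enumerate(convolve(mono, h_right)):
            right[k] += v
    return left, right
```

With a real HRTF set, only `toy_hrir` would change; the mixing loop is independent of how the impulse responses are obtained.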

Where the mixing and HRTF filtering is performed has direct implications for both the scope of such an audio space, and the resulting network traffic. The next section introduces the possible architectures, with a brief discussion on NGN partitioning.

Figure 1: Core/Access Network Division. (Diagram labels: Core, MRFP, User A, User B, User, Access Link, Core Link, Server Mix Stream, User Stream.)

2.1 NGN and IMS: Centralized Conferencing

The NGN architecture provides logical division between service functions and the underlying transport technologies. The transport functions are further divided into access and core network functions, which perform a range of quality of service mechanisms including packet filtering, marking, shaping, buffer management, scheduling and queuing (Knightson et al., 2005). The core transport network and its associated control functions provide a platform to deliver traffic for services such as the IMS, and may be logically separated by technology, ownership or administrative boundaries. An IMS may be located within a core network partition, and can provide support for media services such as audio conferencing. An Application Server (AS) within the IMS can be used for conference control, with SIP based session control through call session control functions (CSCF). In the NGN/IMS model, ASs have control over audio mixing and filtering through the media resource function controller (MRFC), which directly controls the media resource function processor (MRFP) that is responsible for audio processing. The AS and MRFP may be physically separate, and thus it is the MRFP location that is critical, as the audio traffic dominates the signalling traffic.

2.1.1 Mixing

The MRFP allows for a centralized audio conferencing model, under the control of an application server. An outline for server based audio mixing for monophonic conferencing is described in (Singh et al., 2001), including a discussion of the decoding, jitter buffering and mixing procedure, as well as some performance statistics. Figure 2 shows the additional filtering process within the MRFP required to provide a spatially enhanced audio scene for a set of 3 participants. The controllers may act upon SIP messages as users leave or join the conference, to signal the MRFP to select the relevant HRTF and to determine the mixing configuration. The MRFP may convolve each stream with a different HRTF according to some pre-defined spatial arrangement (some preliminary suggestions have been made (Brungart and Simpson, 2003)), and deliver a mix back to each participant. As such, each participant will receive a custom mix consisting of all the other participants' binaural streams, with their own stream missing. As suggested by (Singh et al., 2001), local removal of a participant's stream from a mix may be difficult, hence a single (possibly multicast) stream for all participants is not possible, as the participant would hear their own voice. As participants leave or join the conference, the spatial arrangement may be altered by the server by applying different HRTFs to each stream. For example, a mix of x streams may be spatially arranged in the frontal hemisphere with equal θ degrees of separation. Should a participant leave, the angular separation θ may become 180/(x-1) degrees.
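A small Python sketch may help make the arrangement and the per-listener mixes concrete. It assumes one particular layout that is consistent with the example above: n streams are centred in n equal sectors of the frontal hemisphere, so that neighbouring sources are 180/n degrees apart. The function names are hypothetical, and the choice of layout is an assumption rather than the arrangement used by the MRFP.

```python
def frontal_azimuths(n):
    """Centre n sources in n equal sectors between -90 and +90 degrees.

    Adjacent sources are then 180/n degrees apart (assumed layout).
    """
    if n <= 0:
        return []
    step = 180.0 / n
    return [-90.0 + (i + 0.5) * step for i in range(n)]

def custom_mixes(talkers):
    """Give each talker one fixed azimuth, shared by the whole group, and
    list for every listener the streams in their mix (all except their own)."""
    position = dict(zip(talkers, frontal_azimuths(len(talkers))))
    return {
        listener: {t: position[t] for t in talkers if t != listener}
        for listener in talkers
    }

# Three participants: 60 degrees apart; after C leaves: 90 degrees apart.
print(custom_mixes(["A", "B", "C"]))
print(custom_mixes(["A", "B"]))
```

Calling `custom_mixes` again whenever the membership changes corresponds to the server applying different HRTFs as participants join or leave.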

As mixing is performed by the MRFP, clients that are not capable of mixing and spatializing audio, such as smart phones and PDAs, may still participate in conferences. Each mix may then be encoded with an arbitrary waveform codec and distributed to the appropriate participant.

Figure 2: Convolution and Mixing Arrangement. (Diagram labels: Application Server, SIP Interface and Logic, HRTF Controller, Mix Controller, Decoded Audio, HRTF, Mixer, Encode, Send to A, Send to B; the mixer sums the left and right binaural streams of the other participants, e.g. BR + AR.)

2.1.2 Centralized Conference Size Limits

Whilst the audio processing is handled by a server, thus reducing the processing load on the clients, in practice a limit to the number of conference participants may still be imposed. A limit may occur due to the degree of localization performance that allows perceptual voice streams to be spatially distinguished from each other, as the mix returned by the MRFP presents each talker at a fixed spatial location. Also, spatialization effects such as reversals (Begault and Wenzel, 1993) may limit the arrangement of audio sources to the frontal hemisphere only (a back to front reversal may lead to the perceptual overlap of two speakers if another source exists in a position mirrored in the interaural axis). Limits may also be imposed by the MRFP processing capabilities. However, network bandwidth limits should not be restrictive, as participants only ever send one audio stream upstream and receive a single audio mix from the server downstream. A comparison of traffic estimates is made in section 3. Next we consider an alternative delivery model.

2.2 Unicast Full Mesh

With the unicast full mesh model, participants send a copy of their own voice stream to every other conference participant. HRTF convolution and mixing is then performed locally at the participant's terminal. This method would be restricted to terminals with the capabilities to filter and mix an allocated number of streams. Since the spatial cues are added to each stream locally, users have full control over their own audio space. The filters may be adjusted to allow each user to fully customize where they hear other members of the conference. For example, a user may choose to group the voice streams of participants with whom they are not conversing at a similar azimuth (effectively merging the multiple streams into one perceptual stream), or make adjustments to the volume of each speaker.

2.2.1 Unicast Full Mesh Conference Size Limits

In the unicast full mesh model, limits to the number of conferees are imposed by the terminal resources and access network technologies, rather than the perceptual spatial arrangement. This may be restrictive for asymmetric technologies where upstream bandwidth is limited.
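As a rough illustration of the upstream constraint, each terminal in a full mesh of N users sends N − 1 copies of its voice stream. The sketch below assumes the 16 kb/s codec rate used in the traffic model of section 3, ignores packet header overhead, and uses a hypothetical uplink capacity; the function name is illustrative.

```python
def max_full_mesh_size(uplink_kbps, r_kbps=16):
    """Largest group size N such that r * (N - 1) fits in one user's uplink.

    Codec payload only; RTP/UDP/IP overhead is ignored in this sketch.
    """
    return uplink_kbps // r_kbps + 1

# Hypothetical asymmetric access link with 256 kb/s upstream.
print(max_full_mesh_size(256))
```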

2.3 Hybrid

With a centralized model, the spatial locations and the mix for each user are fixed, as any changes a participant makes to the HRTF set would be common for the group. However, a hybrid model may be implemented to give users control over their mix. We propose that the application server may respond to SIP INFO messages sent from participants, and adjust the user's mix upon request via the Mix Controller shown in Figure 2. This would allow each user to control the volume at which they hear other participants, perhaps at a reduced level for conversation floors they are not involved in. Traffic modelling for this architecture may be considered equal to the centralized model, on the assumption that SIP INFO signalling traffic may be ignored.
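One way to picture the hybrid model is as a per-listener gain table held by the Mix Controller, with spatial positions left unchanged. The sketch below is purely illustrative: the class and method names are hypothetical, and how a gain request would be carried in the body of a SIP INFO message is not specified here, so the request is modelled as a plain method call.

```python
class MixController:
    """Per-listener gains applied to the other talkers' binaural streams."""

    def __init__(self, participants):
        # Every listener starts with all other talkers at full level.
        self.gains = {
            listener: {t: 1.0 for t in participants if t != listener}
            for listener in participants
        }

    def set_gain(self, listener, talker, gain):
        """Handle one (hypothetical) volume request from a listener."""
        if talker not in self.gains[listener]:
            raise KeyError(f"{talker!r} is not in {listener!r}'s mix")
        self.gains[listener][talker] = min(1.0, max(0.0, gain))

    def mix(self, listener, frames):
        """Mix one sample instant.

        frames maps talker -> (left, right) sample of that talker's
        already HRTF-filtered stream.
        """
        left = right = 0.0
        for talker, gain in self.gains[listener].items():
            l, r = frames[talker]
            left += gain * l
            right += gain * r
        return left, right
```

Because only the gains differ between listeners, the HRTF filtering of each talker can still be performed once and shared by all mixes.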

3 TRAFFIC MODELLING

This section describes a comparison of network traffic generated for two suggested conference delivery models, centralized and unicast full mesh, across a network logically divided into core and access partitions. Figure 1 shows an example of the network partitioning with an MRFP (placed at a single node) and 3 users. Four conference group sizes were investigated with varying degrees of distribution across the network. Each group consisted of N users, and only one group was modelled at a time. A random core network was generated using the BRITE topology generator with the AS Waxman configuration. The network consisted of 130 nodes, based upon a hypothetical national sized NGN core, with node degree 3. Edge bandwidths were set to infinite on the assumption of an unconstrained capacity model. Edges were also assumed to be of unitary cost, i.e. hop count was considered more dominant in cost than distance, though for long distance topologies this may require revision. A fixed low bit rate voice codec rate r was set to 16 kb/s for user voice streams, whilst the MRFP return rate s was set to 128 kb/s, based upon an MPEG II layer 3 waveform codec to preserve stereo reproduction. The MRFP was positioned such that, in each scenario, the sum of all paths between users and the MRFP was at a minimum.

3.1 Access Network

The access network was modelled as a single link from each user to the core, as shown in Figure 1. Since access traffic is independent of how distributed the users within the group are, the traffic is trivial to calculate, but for completeness it is shown below. Unicast access upstream traffic $U_{uni}$ may be defined as:

$$U_{uni} = rN(N - 1) \qquad (1)$$

Unicast downstream access traffic $D_{uni}$ is defined as:

$$D_{uni} = rN(N - 1) \qquad (2)$$

Access upstream traffic with application servers, $U_{AS}$, may be defined by (3), whilst the corresponding downstream traffic $D_{AS}$ is defined by (4):

$$U_{AS} = Nr \qquad (3)$$

$$D_{AS} = Ns \qquad (4)$$

Figure 4 shows a comparison of downstream access traffic between the two models. Equating (2) and (4) gives r(N − 1) = s, so with r = 16 kb/s and s = 128 kb/s the traffic for both models is equal at group size 9, whilst traffic is reduced using a centralized model when group sizes grow beyond this. A significantly greater saving is made using a centralized model when considering upstream traffic, as illustrated in Figure 3, as less traffic is generated for all group sizes. This has significant implications for asymmetric access technologies.
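The access traffic comparison can be reproduced directly from equations (1) to (4). The following sketch uses the rates stated in this section (r = 16 kb/s, s = 128 kb/s); the function names are illustrative only.

```python
R_KBPS = 16     # user voice codec rate r
S_KBPS = 128    # MRFP stereo return rate s

def unicast_access(n, r=R_KBPS):
    """Total upstream (and equally downstream) access traffic for a
    unicast full mesh of n users, equations (1) and (2)."""
    return r * n * (n - 1)

def centralized_access(n, r=R_KBPS, s=S_KBPS):
    """(upstream, downstream) access traffic with an MRFP,
    equations (3) and (4)."""
    return n * r, n * s

for n in (4, 6, 8, 9, 10, 12):
    up_c, down_c = centralized_access(n)
    print(f"N={n:2d}  unicast={unicast_access(n):5d}  "
          f"centralized up={up_c:4d}  down={down_c:5d}  (kb/s)")
```

At N = 9 the unicast and centralized downstream totals coincide at 1152 kb/s, which is the crossover visible in Figure 4.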

Figure 3: Access Upstream Traffic. (Plot of traffic in kb/s against group size N for the Centralized and Unicast FM models.)

Figure 4: Access Downstream Traffic. (Plot of traffic in kb/s against group size N for the Unicast FM and Centralized models.)

3.2 Core Network

Traffic generated in the core is dependent on how distributed the group of users is. To measure this distribution a value of mean hop count (MHC) was used, and may be defined as follows. Let A be the set of N users, where each user is connected to one node V in the core network G(V,E). We define the mean hop count L between all users in A as follows:

$$L = \frac{1}{N} \sum_{\{s,t \,:\, s,t \in A,\ s \neq t\}} P(s,t) \qquad (5)$$

where P(s,t) is the length of the shortest path in G between users s and t, measured using unit length for each edge E. Note that this does not include any access cost.

Core traffic for unicast T may be summed as:

$$T = r \sum_{\{s,t \,:\, s,t \in A,\ s \neq t\}} P(s,t) \qquad (6)$$

Unicast traffic and mean hop count are trivially related. However it is possible to distribute a group with a fixed MHC (and hence fixed unicast traffic), and calculate the traffic saving with an MRFP.
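Equations (5) and (6) can be evaluated on any unit-cost graph with a breadth-first search. The sketch below uses a tiny hand-written topology rather than a BRITE-generated one, and it assumes the sum runs over ordered pairs (s,t), so that each direction of each unicast stream is counted once; that reading of the index set, like the function names, is an assumption.

```python
from collections import deque
from itertools import permutations

def hop_counts(adj, src):
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pair_hop_sum(adj, user_nodes):
    """Sum of P(s,t) over ordered pairs of distinct users (assumed reading)."""
    dist = {n: hop_counts(adj, n) for n in set(user_nodes)}
    return sum(dist[s][t] for s, t in permutations(user_nodes, 2))

def mean_hop_count(adj, user_nodes):
    """Equation (5): L = (1/N) * sum of P(s,t)."""
    return pair_hop_sum(adj, user_nodes) / len(user_nodes)

def unicast_core_traffic(adj, user_nodes, r=16):
    """Equation (6): T = r * sum of P(s,t), in kb/s."""
    return r * pair_hop_sum(adj, user_nodes)

# Toy core: a 4-node path 0-1-2-3, with users attached at nodes 0, 2 and 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
users = [0, 2, 3]
print(mean_hop_count(adj, users), unicast_core_traffic(adj, users))
```

Under this reading the two quantities are related by T = rNL, which is the trivial relation mentioned above.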

For each group size, 50 scenarios were simulated, each with the same MHC, and their results averaged. The traffic generated with the centralized model was then deducted from that of the unicast model to generate a value of traffic saving. Figure 5 shows the values for traffic saving for each group size, with varying distributions measured by MHC. For a group size of 6 no saving is made by using the centralized model, as indicated by the negative saving values across all distributions. For group size 8, traffic savings increase slightly as the group spreads out, though savings are always made for this group size. Significant savings are made for group sizes of 10 and 12, showing an increased traffic saving as the group becomes more distributed.
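The saving calculation for a single scenario can be sketched as follows. Two assumptions are made that the text does not state explicitly: centralized core traffic is taken to be (r + s) multiplied by the summed hop count between each user and the MRFP, and unicast traffic counts every ordered pair of users. The graph is assumed to be connected, and the toy topology and function names are illustrative only.

```python
from collections import deque
from itertools import permutations

def hop_counts(adj, src):
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def place_mrfp(adj, user_nodes):
    """Return (node, total_hops) minimising the summed path length between
    the MRFP and all users, as described in section 3."""
    best = None
    for node in adj:
        dist = hop_counts(adj, node)
        total = sum(dist[u] for u in user_nodes)
        if best is None or total < best[1]:
            best = (node, total)
    return best

def traffic_saving(adj, user_nodes, r=16, s=128):
    """Unicast core traffic minus centralized core traffic, in kb/s."""
    dist = {n: hop_counts(adj, n) for n in set(user_nodes)}
    unicast = r * sum(dist[a][b] for a, b in permutations(user_nodes, 2))
    _, hops_to_mrfp = place_mrfp(adj, user_nodes)
    centralized = (r + s) * hops_to_mrfp   # assumed: r up plus s down per hop
    return unicast - centralized

# Toy core: a 4-node path 0-1-2-3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(traffic_saving(adj, [0, 2, 3]))            # small group
print(traffic_saving(adj, [0] * 5 + [3] * 5))    # larger, spread-out group
```

On this toy graph the three-user group gives a negative saving and the ten-user group a positive one, which is the same qualitative pattern as in Figure 5, although the numbers themselves are not comparable with the simulated 130-node network.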

Thus for larger, more distributed groups, large savings are made when using the centralized model. For smaller, more concentrated distributions, audio filtering and mixing should not be done centrally, but rather locally at the participant's terminal. When considering the case for supporting terminals with no filtering and mixing capabilities, a centralized model is the only possible solution, though at a cost of increased network traffic for small groups.

4 CONCLUSIONS AND FUTURE WORK

We have presented three models for the delivery of a collaborative spatially enhanced audio space, and outlined the necessary modifications to existing conferencing mixing architectures to support such an environment. This included a discussion of changes related to floor control, in order to support many simultaneous conversation floors within a space/conference.

Figure 5: Core Traffic Saving. (Plot of traffic saving in kb/s against mean hop count (MHC) for group sizes N = 6, N = 8, N = 10 and N = 12.)

The centralized model fits naturally within an IMS infrastructure located within an NGN core network partition, in order to reduce core traffic for larger group sizes, as demonstrated by a traffic modelling investigation for a randomly generated core network. Access traffic has also been shown to drop for a centralized model, in particular upstream, where access network bandwidth for asymmetric technologies may be restricted. The advantages of convolution and mixing at the user's terminal have also been discussed, a process which allows a fully customizable audio environment for the user, and potentially larger conference sizes. Future work in this area needs to address the modelling of multiple groups and optimal server locations, some of which has been discussed by (Venkatesha et al., 2005), as well as investigation into the psychoacoustic limits for conferences with fixed spatial locations. The limit to the maximum number of simultaneous conversation floors, and hence simultaneous speakers, needs to be found. This may require an analysis of users conversing within a spatially enhanced environment, in order to determine how difficult they find it to communicate. However, early experiments point to spatialized audio conferencing as a highly attractive technology.

REFERENCES

Baldis, J. (2001). Effects of spatial audio on memory, comprehension, and preference during desktop conferences. In CHI '01: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 166–173. ACM Press.

Begault, D. and Wenzel, E. (1993). Headphone localization of speech. Human Factors, 35:361–376.

Bregman, A. S. (1994). Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press.

Brungart, D. and Simpson, B. (2003). Optimizing the spatial configuration of a seven-talker speech display. In Proceedings of the 2003 International Conference on Auditory Display.

Cheng, C. and Wakefield, G. H. (2001). Introduction to head-related transfer functions (HRTFs): Representations of HRTFs in time, frequency, and space. J. of the Aud. Eng. Soc, 49:231–249.

Cherry, E. C. (1953). Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am, 25(5):975–979.

Crispien, K. and Ehrenberg, T. (1995). Evaluation of the cocktail party effect for multiple speech stimuli within a spatial audio display. J. of the Aud. Eng. Soc, 43(11):932–941.

Egbert, M. (1997). Schisming: The collaborative transformation from a single conversation to multiple conversations. Research on Language and Social Interaction, 30(1):1–51.

Evans, M., Tew, A., and Angus, J. (2000). Perceived performance of loudspeaker-spatialized speech for teleconferencing. J. of the Aud. Eng. Soc, 48(9):771–785.

Kilgore, R., Chignell, M., and Smith, P. (2003). Spatialized audioconferencing: what are the benefits? In Conference of the Centre for Advanced Studies on Collaborative research, pages 135–144.

Knightson, K., Morita, N., and Towle, T. (2005). NGN architecture: generic principles, functional architecture, and implementation. IEEE Communications Magazine, 43(10):49–56.

Singh, K., Nair, G., and Schulzrinne, H. (2001). Centralized conferencing using SIP. In Proceedings of the 2nd IP-Telephony Workshop (IPTel).

Venkatesha, P., Jamadagni, H., and Shankar, H. (2003). On the problem of specifying the number of floors for a voice-only conference on packet networks. In ITRE2003: International Conference on Information Technology: Research and Education.

Venkatesha, P., Shankar, H., Jamadagni, H., and Vijay, S. (2005). Server allocation algorithms for VoIP conference. In Proceedings of the First International Conference on Distributed Frameworks for Multimedia Applications.

SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications
