Smart Video Orchestration for Immersive Communication
Alaeddine Mihoub
and Emmanuel Marilly
Alcatel-Lucent Bell Labs France, Multimedia Technologies Domain, Centre de Villarceaux, Route de Villejust, 91620,
Nozay, France
Keywords: Camera Orchestration, Hidden Markov Model, Learning, User, Immersive Communication.
Abstract: In the context of immersive communication, and in order to enrich attentional immersion for remote attendants of videoconferences, we address the problem of camera orchestration, i.e. selecting and displaying the most relevant view or camera. HMMs have been chosen to model the different video events and video orchestration models. A specific algorithm, taking high-level observations as input and enabling non-expert users to train the videoconferencing system, has been developed.
1 INTRODUCTION
A key challenge of the telecommunication industry
is to identify the future of communication.
Immersive communication has been defined as the
way to exploit video and multimedia technologies in
order to create new relevant and valuable usages.
But in a context where the objective is to improve distant communication, sensorial immersion (i.e. all the technical capabilities to mimic sensorial feelings) is not enough. Because communication is made of social interaction, narration and task-driven activities, we need to include a new aspect of immersion: attentional immersion. Attentional immersion concerns the cognitive experience of being immersed in a narration, in a task or in a social interaction.
Figure 1: Remote Immersive meeting use case.
In order to improve sensorial and attentional immersion, the remote immersive meeting & experience sharing (e-education, town hall meeting) use case (Figure 1) has been observed and several pain points were identified, such as keeping attention (e.g. interactivity, dynamicity, concentration, comprehension, boredom, diversion), remote audience feedback (e.g. reactions, questions, discussions) and video orchestration issues (e.g. how to switch between cameras? which camera to display in the main view? which metadata to use? how to model this metadata?).
In this paper we will focus mainly on the video
orchestration issues.
2 VIDEO ORCHESTRATION
Supporting attentional immersion in remote video presentation use cases (e.g. town hall meetings, e-learning) implies developing and implementing specific reasoning mechanisms. Such mechanisms make it possible, for instance, to identify which of the ongoing video events is the most relevant to display (Lavee, 2009), or to implement elements of Cognitive Load Theory (Mayer, 2001) in order to support better knowledge transfer (for instance when narration and visual information are complementary and presented simultaneously).
Our experimental video conference system has
been extended to enable video orchestration
supporting some of these attentional immersion
aspects.
Several solutions and systems have been proposed to solve the problem of camera selection/orchestration.
For instance, a remote control has been used to select the videos/cameras to display, or pre-defined orchestration templates have been used to show the participants of the meeting. Such existing systems are unable to manage a high number of video streams with a high level of detail, dynamicity in the rendering, adaptability to the user intent, or programmability and flexibility in the orchestration.
Video orchestration based on “audio events” is one step in this direction. Yet, as around 70% of all meaning is derived from nonverbal behavior/communication (Engleberg, 2006), useful information for video orchestration is missing (e.g. gestures, expressions, attention, etc.).
Al-Hames (Al-Hames and Dielmann, 2006) showed that audio information alone is not sufficient and that visual features are essential. Al-Hames (Al-Hames and Hörnler, 2006) then proposed a new approach based on rules applied to low-level features such as global motion, skin blobs and acoustic features. HMMs (Hidden Markov Models) have also been used (Hörnler, 2009) for video orchestration, combining low- and high-level features.
Based on these observations and inspired by (Al-Hames and Hornler, 2007) and (Ding, 2006), our video orchestration is based on HMMs taking as input only high-level features such as gesture (Fourati and Marilly, 2012), motion (Cheung and Kamath, 2004), face expression (Hromada et al., 2010) and audio (O’Gorman, 2010). The benefit of using high-level features is to solve the problem of programmability of the video orchestration during video conferences: basic users can define their own rules transparently, and such an approach improves the user experience, the immersion and the efficiency of video conferences.
3 PROGRAMMABILITY
Implicit or user intent-based programmability capabilities, enabling to model video orchestration and to smartly orchestrate the display of video/multimedia streams, have been implemented in our system. The data used by our HMM engine to model the video orchestration are captured through the combination of two approaches: visual programming and programming by example. In our HMM model, the transition matrix A contains the transition probabilities between the diverse camera views; the emission matrix B contains the emission probabilities of each observation given the current state or screen; the initialization matrix π contains the probability for each camera to be shown first.
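A minimal sketch of how such a model λ = (A, B, π) can be represented is given below (Python with NumPy); the camera views, event labels and probability values are hypothetical examples, not the exact configuration of our system.

```python
import numpy as np

# Hypothetical camera views (hidden states) and high-level events (observations).
STATES = ["tutor_screen", "class_screen", "learner_screen"]
OBSERVATIONS = ["speaker_tutor", "hand_raised", "keyword_question", "silence"]

# A[i, j]: probability of switching from camera i to camera j.
A = np.array([[0.80, 0.10, 0.10],
              [0.30, 0.60, 0.10],
              [0.30, 0.10, 0.60]])

# B[i, k]: probability of observing event k while camera i is on screen.
B = np.array([[0.50, 0.10, 0.20, 0.20],
              [0.20, 0.30, 0.20, 0.30],
              [0.10, 0.50, 0.30, 0.10]])

# pi[i]: probability that camera i is shown first.
pi = np.array([0.70, 0.20, 0.10])
```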
3.1 Solution Description
The “multimedia orchestrator” module, part of the videoconferencing system, has been augmented with the following three functionalities:
o Smart video orchestration capabilities thanks to HMMs.
o Learning/programmability capabilities: the system is able to automatically define new orchestration models through user intent capture and interactions.
o Smart template detection: the system is able to recognize the video orchestration model that best fits the video conference context/scenario and the user profile.
Figure 2 presents a basic scheme of the solution. The
engine of the “Multimedia Orchestrator” module is
based on specific mechanisms (e.g. learning
mechanisms, scenario recognition, etc.) integrating
HMMs.
Figure 2: Basic scheme of the solution.
The “MM orchestrator” module takes as inputs video streams and video/audio event metadata (coming for instance from the outputs of video/audio analyzers). Video analyzers detect high-level video events such as gestures, postures and faces, and audio analyzers detect audio events such as who is speaking, keywords, silence and noise level. Initially, based on the first received video and audio event metadata, such as “speaker metadata”, the classifier module selects the template that best fits the temporal sequence of events. By default, the user can select a model related to the current meeting scenario. During use, the classifier can change the model if another one better fits the temporal sequence of events.
The problem of selecting the right model is known as the recognition problem. Both the Forward algorithm (Huang et al., 1990) and the Backward algorithm can solve this issue. In our MM orchestrator we have used the Forward algorithm. The next step after the selection of the best template is to select the most
SmartVideoOrchestrationforImmersiveCommunication
753
relevant camera to display. This decoding step is performed by the Viterbi algorithm (Viterbi, 1967). Once the decoding is done, the HMM engine orchestrates the videos through a video mixer.
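The sketch below illustrates this two-step process, reusing the example matrices pi, A, B and the STATES list from the previous sketch: the Forward algorithm scores each candidate template against the observation sequence and Viterbi decoding then selects the camera to display. It is an illustrative implementation, not the code of our orchestrator.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Scaled Forward algorithm: log P(observation sequence | HMM template)."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    loglik = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()
        loglik += np.log(c)
        alpha = alpha / c          # rescale to avoid numerical underflow
    return loglik

def viterbi(obs, pi, A, B):
    """Most likely sequence of states (cameras) for the observation sequence."""
    T, n = len(obs), len(pi)
    log_A = np.log(A)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    psi = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A            # scores[i, j] = delta[i] + log A[i, j]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                  # backtrack through the stored pointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Recognition step: score each candidate template with the Forward algorithm,
# keep the best one, then decode the camera to display with Viterbi.
templates = {"normal_lecture": (pi, A, B)}          # hypothetical template set
obs_seq = [0, 2, 1, 3]                              # indices into OBSERVATIONS
best = max(templates, key=lambda name: forward_loglik(obs_seq, *templates[name]))
cameras = viterbi(obs_seq, *templates[best])
print(best, [STATES[i] for i in cameras])
```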
3.2 A New Learning Mechanism
In usual approaches (Al-Hames and Hornler, 2007); (Hörnler et al., 2009), the learning problem is known as an estimation problem. The EM algorithm (Dempster et al., 1977) (a.k.a. the Baum-Welch algorithm (Baum et al., 1970)) is used to re-estimate the parameters λ = (A, B, π). By default this process is done by experts and directly implemented in systems.
Figure 3 gives an overview of the proposed solution
enabling basic users to create and personalize their
own video orchestration models through the use of
learning mechanisms (e.g. intent-based
programming).
Figure 3: Video Orchestration Learning Module.
A visual programming interface is provided to the user (Figure 6). The interface displays the video streams and the detected video events. The user selects which video stream has to be displayed as the main stream by the orchestrator, depending on the detected video event. The learner module records the events and the corresponding chosen screens and generates a new template (or updates an existing one). From a technical point of view, the module records the observations and the corresponding selected states and generates a new HMM with the appropriate probabilities. The following section details the implemented learning process.
Learning Module Theory
The learning algorithm enables users to create and train video orchestration models based on their usage, without any technical skills in programming. It is composed of 3 modules: the user visual interface, the user activities recorder and the HMM generator. The three components (A, B, π) of the HMM are determined in the following manner:
1. Training of the Initialization Matrix
The initialization probability of the first state
selected by the user is set to 1 and the others to 0.
2. Training of the Transition Matrix
The training of this matrix is composed of 4 steps:
Step 1: Get the number of states of the HMM as input.
Step 2: Generate a comparison matrix containing all possible transitions.
Step 3: Browse the state sequence; each observed transition is compared to each transition in the comparison matrix. When a match is found, the corresponding entry of the occurrence matrix is incremented.
Step 4: Once the occurrence matrix is obtained, the transition matrix is estimated. Equation (1) gives the formula for the transition matrix estimation:

a_{ij} = \frac{Occ_{ij}}{\sum_{k=1}^{N} Occ_{ik}}    (1)

Where Occ is the occurrence matrix and N is the number of states.
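A minimal sketch of this counting procedure and of Equation (1) is given below (Python/NumPy); the recorded sequence of screens is a hypothetical example, not data from our system.

```python
import numpy as np

def estimate_transition_matrix(state_seq, n_states):
    """Steps 1-4: count the observed transitions into the occurrence matrix Occ,
    then normalize each row as in Equation (1)."""
    occ = np.zeros((n_states, n_states))
    for prev, cur in zip(state_seq[:-1], state_seq[1:]):
        occ[prev, cur] += 1
    row_sums = occ.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0        # avoid dividing by zero for unvisited states
    return occ / row_sums

# Hypothetical recorded sequence of screens selected by the user
# (0 = tutor, 1 = class room, 2 = learner).
state_seq = [0, 0, 1, 0, 2, 2, 0, 0, 1, 1]
A_hat = estimate_transition_matrix(state_seq, n_states=3)
```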
3. Training of the Emission Matrix
For each state, each type of observation is counted and then divided by the total number of observations of that state. Equation (2) gives the formula for estimating the emission matrix:

b_{j}(k) = \frac{occObs_{j}(k)}{\sum_{m=1}^{M} occObs_{j}(m)}    (2)

Where occObs represents the occurrence matrix for each type of observation knowing the state, and M is the number of observation types.
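A minimal sketch of the initialization and emission training, following Equation (2) and the rule for the initialization matrix (Python/NumPy); the recorded pairs of selected screens and detected events are hypothetical.

```python
import numpy as np

def estimate_initialization(first_state, n_states):
    """The first screen selected by the user gets probability 1, all others 0."""
    pi = np.zeros(n_states)
    pi[first_state] = 1.0
    return pi

def estimate_emission_matrix(state_seq, obs_seq, n_states, n_obs):
    """For each state, count each observation type (occObs), then divide by the
    total number of observations of that state, as in Equation (2)."""
    occ_obs = np.zeros((n_states, n_obs))
    for s, o in zip(state_seq, obs_seq):
        occ_obs[s, o] += 1
    row_sums = occ_obs.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return occ_obs / row_sums

# Hypothetical recording of (selected screen, detected event) pairs.
state_seq = [0, 0, 1, 0, 2, 2]
obs_seq   = [0, 3, 1, 2, 1, 1]
pi_hat = estimate_initialization(state_seq[0], n_states=3)
B_hat  = estimate_emission_matrix(state_seq, obs_seq, n_states=3, n_obs=4)
```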
3.3 HMM Model for e-Learning
The Video Orchestration Learning module has been applied and tested in the context of a basic e-learning videoconference scenario. The scenario consists of one video stream for the lecturer/tutor, one video stream for the virtual classroom and several individual video streams for the students/learners. Figure 4 gives a description.
Figure 4: e-Learning use case description.
The HMM model is configured as follows:
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
754
o 3 States: 1-Tutor Screen, 2-Virtual Class Room
Screen and 3-Learner Screen.
o 17 Observations. This number corresponds to
the number of video or audio events that can be
detected by our system. These observations are
split into 7 families: Gestures, Motion, Face Expressions, Keywords, Audio Cues, Slide Number, Sub-Events. Figure 5 gives a detailed representation of the observations used.
Figure 5: Model Observations.
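As an illustration, this configuration can be sketched as follows (Python); the three screens and the 7 family names come from the description above, while the individual event labels are hypothetical placeholders and do not enumerate the actual 17 observations.

```python
# Hidden states (screens) of the e-learning orchestration HMM.
STATES = ["tutor_screen", "class_room_screen", "learner_screen"]

# The 7 observation families of the model. The event labels listed here are
# illustrative placeholders only; the real system defines 17 observations.
OBSERVATION_FAMILIES = {
    "gestures":         ["hand_raised", "pointing"],
    "motion":           ["global_motion"],
    "face_expressions": ["smile"],
    "keywords":         ["question_keyword", "exercise_keyword"],
    "audio_cues":       ["tutor_speaking", "learner_speaking", "silence"],
    "slide_number":     ["slide_changed"],
    "sub_events":       ["learner_joined"],
}

# Flat observation alphabet used to index the emission matrix B.
OBSERVATIONS = [e for events in OBSERVATION_FAMILIES.values() for e in events]
```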
For this scenario, 5 basic use cases have been defined, corresponding to 5 initial video orchestration models: normal lecture, question/answer interactions, unsupervised question, exercise and learner presenting a work.
3.4 Evaluation
Figure 6 presents the graphical user interface of the
learning module used to capture the user interactions
and model the orchestration.
Figure 6: GUI of the learning module.
Once the learning module was implemented in the videoconference system, the performance of the HMM in correctly orchestrating the video streams was evaluated. Table 1 gives an overview of the video orchestration performance per state.
The evaluation was based on K-fold cross-validation (K=10). With 10 sequences of 24 observations each (240 observations in total), 209 observations were correctly decoded and assigned to the right state, so the global detection rate is 0.87 (209/240).
Table 1: Evaluation of the Video Orchestration.

State            Recall   Precision   F-measure
Tutor State      0.97     0.86        0.91
Class State      0.58     0.92        0.71
Learner State    0.94     0.86        0.90
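A minimal sketch of how such per-state scores can be derived from the decoded states and the user's reference choices is given below (Python/NumPy); the sequences are hypothetical examples, not our evaluation data.

```python
import numpy as np

def per_state_scores(true_states, decoded_states, n_states):
    """Recall, precision and F-measure for each state (screen),
    computed from the confusion matrix of decoded vs. reference states."""
    conf = np.zeros((n_states, n_states))
    for t, d in zip(true_states, decoded_states):
        conf[t, d] += 1
    scores = {}
    for s in range(n_states):
        tp = conf[s, s]
        recall = tp / conf[s, :].sum() if conf[s, :].sum() else 0.0
        precision = tp / conf[:, s].sum() if conf[:, s].sum() else 0.0
        f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        scores[s] = (recall, precision, f)
    return scores

# Hypothetical example: reference screens vs. screens decoded by the HMM.
true_states    = [0, 0, 1, 2, 2, 0, 1, 2]
decoded_states = [0, 0, 0, 2, 2, 0, 1, 2]
print(per_state_scores(true_states, decoded_states, n_states=3))
```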
4 CONCLUSIONS
The paper highlights the interest of a learning module in the context of video orchestration, with two main objectives: on the one hand, to enable user intent-based programming in order to enhance interactivity and attentional immersion; on the other hand, to maintain good technical results. In addition to the learning module, the orchestration system was enhanced with a classification module enabling automatic detection of the appropriate scenario, to make the orchestrator more flexible and more dynamic.
The next important step will consist in the usability evaluation. Qualitatively, the capability offered to the user to create or modify the video orchestration has to be evaluated in terms of acceptance and interest. Many questions have to be considered, for instance: Did the user interact at ease with the module? Did he appreciate the use? Can we give the user total freedom in video orchestration? A whole session of user testing will be organized in order to study these usability issues.
REFERENCES
Lavee G., Rivlin E., and Rudzsky M., 2009, “Understanding Video Events: A Survey of Methods for Automatic Interpretation of Semantic Occurrences in Video,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 39, no. 5, pp. 489–504, Sep. 2009.
Mayer, R. E., 2001, “Multimedia learning.” Cambridge
University Press.
Engleberg, I. N. and Wynn D. R., 2006, Working in
Groups: Communication Principles and Strategies.
Al-Hames M., Dielmann A., Gatica-Perez D., Reiter S.,
Renals S., Rigoll G., and Zhang D., 2006,
SmartVideoOrchestrationforImmersiveCommunication
755
“Multimodal Integration for Meeting Group Action
Segmentation and Recognition,” in Machine Learning
for Multimodal Interaction, vol. 3869, Springer Berlin
Heidelberg, 2006, pp. 52–63.
Al-Hames M., Hörnler B., Scheuermann C., and Rigoll G.,
2006, “Using Audio, Visual, and Lexical Features in a
Multi-modal Virtual Meeting Director,” in Machine
Learning for Multimodal Interaction, vol. 4299, S.
Renals, S. Bengio, and J. G. Fiscus, Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2006, pp.
63–74.
Hörnler B., Arsic D., Schuller B., and Rigoll G., 2009,
“Boosting multi-modal camera selection with semantic
features,” in Proceedings of the 2009 IEEE
international conference on Multimedia and Expo,
Piscataway, NJ, USA, 2009, pp. 1298–1301.
Al-Hames M., Hornler B., Muller R., Schenk J., and
Rigoll G., 2007, “Automatic Multi-Modal Meeting
Camera Selection for Video-Conferences and Meeting
Browsers,” in Multimedia and Expo, 2007 IEEE
International Conference on, 2007, pp. 2074 –2077.
Ding Y. and Fan G., 2006, “Camera View-Based
American Football Video Analysis,” in Multimedia,
2006. ISM’06. Eighth IEEE International Symposium
on, 2006, pp. 317 –322.
Fourati N., Marilly E., 2012, “Gestures for natural
interaction with video”, Electronic Imaging 2012, Jan.
2012, Proceedings of SPIE Vol. 8305.
Cheung S. and Kamath C., 2004, “Robust techniques for background subtraction in urban traffic video,” in Electronic Imaging: Visual Communications and Image Processing, San Jose, California, January 20-22, 2004.
Hromada D., Tijus C., Poitrenaud S., Nadel J., 2010,
"Zygomatic Smile Detection: The Semi-Supervised
Haar Training of a Fast and Frugal System" in IEEE
International Conference on Research, Innovation and
Vision for the Future - RIVF , 2010.
O'Gorman L., 2010, “Latency in Speech Feature Analysis for Telepresence Event Coding,” in 20th International Conference on Pattern Recognition (ICPR), Aug. 2010.
Huang X. D., Ariki Y., and Jack M. A., 1990, “Hidden Markov Model for Speech Recognition.” Edinburgh Univ. Press, 1990.
Viterbi A., “Error bounds for convolutional codes and an
asymptotically optimum decoding algorithm,”
Information Theory, IEEE Transactions on, vol. 13,
no. 2, pp. 260 –269, Apr. 1967.
Dempster A. P., Laird N. M., and Rubin D. B., 1977,
“Maximum likelihood from incomplete data via the
EM algorithm,” Journal of the Royal Statistical
Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
Baum L. E., Petrie T., Soules G., and Weiss N., 1970, “A
Maximization Technique Occurring in the Statistical
Analysis of Probabilistic Functions of Markov
Chains,” The Annals of Mathematical Statistics, vol.
41, no. 1, pp. 164–171, Feb. 1970.
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
756