For instance, a remote control has been used to select the videos/cameras to display, or pre-defined orchestration templates have been applied to show the participants of the meeting. Such existing systems cannot manage a large number of video streams with a high level of detail, nor do they offer dynamic rendering, adaptability to user intent, or programmability and flexibility in the orchestration.
Video orchestration based on “audio events” is one step in this direction. Yet, since around 70% of all meaning is derived from nonverbal behavior/communication (Engleberg, 2006), information that is useful for video orchestration is missing (e.g. gestures, expressions, attention).
Al-Hames (Al-Hames and Dielmann, 2006) showed that audio information alone is not sufficient and that visual features are essential. Later, Al-Hames (Al-Hames and Hörnler, 2006) proposed a new approach based on rules applied to low-level features such as global motion, skin blobs and acoustic features. HMMs (Hidden Markov Models) have also been used for video orchestration (Hörnler, 2009) by combining low- and high-level features.
Based on these observations and inspired by (Al-Hames and Hörnler, 2007) and (Ding, 2006), our video orchestration relies on a system based on HMMs that takes as input only high-level features such as gesture (Fourati and Marilly, 2012), motion (Cheung and Kamath, 2004), facial expression (Hromada et al., 2010) and audio (O’Gorman, 2010). The benefit of using high-level features is that it solves the problem of programmability of the video orchestration during video conferences: non-expert users can define their own rules transparently, and such an approach improves the user experience, immersion and efficiency of video conferences.
3 PROGRAMMABILITY
Implicit, user intent-based programmability capabilities that make it possible to model video orchestration and to smartly orchestrate the display of video/multimedia streams have been implemented in our system. The data used by our HMM engine to model the video orchestration are captured through the combination of two approaches: visual programming and programming by example. In our HMM model, the transition matrix A contains the transition probabilities between the different camera views; the emission matrix B contains the emission probabilities of each observation given the current state or screen; the initialization vector contains the probability of each camera being shown first.
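As an illustrative sketch, such a model could be parameterized as follows (the camera views, events and probability values here are hypothetical; in our system these parameters are learned from user interactions):

```python
import numpy as np

# Hypothetical three-camera setup: hidden states are the camera views.
states = ["speaker_cam", "audience_cam", "whiteboard_cam"]

# High-level events produced by the audio/video analyzers.
observations = ["speech", "gesture", "silence"]

# Transition matrix A: A[i, j] is the probability of switching from
# camera view i to camera view j between two orchestration steps.
A = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])

# Emission matrix B: B[i, k] is the probability of observing event k
# while camera view i is displayed.
B = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# Initialization vector pi: probability of each camera being shown first.
pi = np.array([0.6, 0.2, 0.2])

# Each row of A and B, and pi itself, must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```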
3.1 Solution Description
To this end, the “multimedia orchestrator” module, which is part of the videoconferencing system, has been augmented with the following three functionalities:
o Smart video orchestration capabilities thanks to HMMs.
o Learning/programmability capabilities: the system is able to automatically define new orchestration models through user intent capture and interactions.
o Smart template detection: the system is able to recognize the video orchestration model that best fits the videoconference context/scenario and the user profile.
Figure 2 presents a basic scheme of the solution. The engine of the “Multimedia Orchestrator” module relies on dedicated mechanisms (e.g. learning mechanisms, scenario recognition) that integrate HMMs.
Figure 2: Basic scheme of the solution.
The “MM orchestrator” module takes as inputs video streams and video/audio event metadata (coming, for instance, from the outputs of video/audio analyzers). Video analyzers detect high-level video events such as gestures, postures and faces, while audio analyzers detect audio events such as who is speaking, keywords, silence and noise level.
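The format of this metadata is not prescribed here; a minimal sketch of what one analyzer event record could look like (all field names are hypothetical) is:

```python
from dataclasses import dataclass

# Hypothetical metadata record for one analyzer event; the actual wire
# format exchanged with the orchestrator may differ.
@dataclass
class AnalyzerEvent:
    source: str        # e.g. "video_analyzer_1" or "audio_analyzer"
    event_type: str    # e.g. "gesture", "posture", "speaker", "silence"
    participant: str   # identifier of the participant concerned
    timestamp: float   # capture time in seconds
    confidence: float  # detection confidence in [0, 1]

# Example: the audio analyzer reports who is speaking.
event = AnalyzerEvent("audio_analyzer", "speaker", "participant_3", 12.4, 0.92)
```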
Initially, based on the first received video and audio event metadata, such as “speaker metadata”, the classifier module selects the template that best fits the temporal sequence of events. By default, the user can select a model related to the current meeting scenario. During use, the classifier can change the model if another one fits the temporal sequence of events better.
This problem of selecting the right model is known as the recognition problem. Both the Forward algorithm (Huang et al., 1990) and the Backward algorithm can solve it; in our MM orchestrator we use the Forward algorithm (a minimal sketch of this selection step is given below). The next step after the selection of the best template is to select the most
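As an illustration of this recognition step, the following sketch (with hypothetical templates and probability values) computes the likelihood of the observed event sequence under each candidate template with the Forward algorithm and keeps the best one:

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Forward algorithm: returns P(obs | model) for an HMM (A, B, pi)
    and a sequence of observation indices."""
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction
    return alpha.sum()                 # termination

# Two hypothetical templates ("lecture" vs "discussion"), each a small
# two-state HMM over the same observation alphabet.
templates = {
    "lecture": (np.array([[0.9, 0.1], [0.3, 0.7]]),
                np.array([[0.8, 0.2], [0.3, 0.7]]),
                np.array([0.7, 0.3])),
    "discussion": (np.array([[0.5, 0.5], [0.5, 0.5]]),
                   np.array([[0.4, 0.6], [0.6, 0.4]]),
                   np.array([0.5, 0.5])),
}

obs = [0, 0, 1, 0]  # observed event indices from the analyzers

# The classifier keeps the template whose HMM best explains the events;
# re-running this on a sliding window of events allows the model to be
# switched during the meeting, as described above.
best = max(templates, key=lambda t: forward_likelihood(*templates[t], obs))
print(best)
```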