Data Fusion in Multimodal Interface
Alexander Alfimtsev
Department of Information Systems and Telecommunications, BMSTU, 2nd Baumanskaya St., Moscow, Russia
Keywords: Multimodal Interface, Data Fusion, Fuzzy Aggregation.
Abstract: A method of data fusion in a multimodal interface using fuzzy fusion operators is described in this paper.
Scenes are defined by dynamic and static patterns for which the sources of information are the
results of pattern recognition by low-level algorithms.
1 INTRODUCTION
In the general case, recognition tasks in multimodal
interfaces are not limited to a single specific
pattern at each instant. Research is currently widely
conducted in the field of so-called multimodal
recognition. A modality is usually understood as a
form of human influence on another human being or on a
personal computer using speech, gestures, touch,
facial expressions, appearance, etc. It is now accepted
(Averkin, 1986); (Higuchi, 2004); (Sharma, 2003)
that even within the bounds of a single interaction
form (with the PC), for example, the control of an
interface via hand gestures, different modalities can
be used. In this case there arises the task of
combining, or, as it is often called, fusion of different
modalities. The intention to combine several sources
of information is explained by the fact that each
source separately may have high uncertainty or
inaccuracy of data, which are decreased by their
fusion (Wang and Chen, 2004). The goals of multimodal
recognition may differ. One of them is the
control of physical objects based on the analysis of scenes,
which are in fact relations of dynamic and static
objects.
The fusion can be performed at two levels
(Grabisch and Roubens, 2000): low level and high level. Let us suppose that
each signal, i.e. a sequence of $n+1$ counts
$Y_i[t_0, t_n] = \{y_i(t_0), y_i(t_1), y_i(t_2), \ldots, y_i(t_n)\}$, is connected with
its modality. Fusion that deals with signals is
usually referred to the low level. Signals and
the modalities that correspond to them are synchronized
at the low level, the interconnection and interaction
of the signals can be clearly seen, and the modalities often
belong to the same interaction form. Fusion at the
high level is usually done after the work of the
recognition algorithms at the low level. Each of
them performs the recognition of a group of signals
belonging to the same form of interaction or even to
the same modality. Forms of modalities can be
independent of time.
Functions used for fusion are usually called
fusion operators (Grabisch and Roubens, 2000).
The max-operator is one of the best-known fusion
operators. The use of the max-operator for multimodal
recognition provides a high level of reliability
but can be ineffective if used at the high level.
The weighted arithmetic operator is another popular
fusion operator, but fusion using the weighted
arithmetic operator may lead to insufficient
recognition accuracy, where accuracy is
understood as the percentage of successful recognitions
out of the total number of attempts. This can be a
consequence of the empirical choice of the weighting
coefficients and also of the difficulty of taking into account
a possible interconnection of the membership functions
that characterize the pattern recognition result.
This paper describes a method that uses fuzzy fusion operators
(the Sugeno and Choquet fuzzy integrals) for data fusion
and multimodal recognition of video scenes defined
by dynamic and static patterns, for which the
recognition results of the low-level algorithms are the sources of
information (secondary attributes (Devyatkov and
Alfimtsev, 2008)).
The Sugeno and Choquet fuzzy fusion operators are
considered in Section 2. The procedures required for
data fusion based on fusion operators are described
in Section 3. Recognition of video scenes is
considered in Section 4.
2 FUZZY FUSION OPERATORS
Fuzzy fusion operators that allow us to take into account the
interconnection of membership functions use a fuzzy
measure. A fuzzy measure is a function
$g: 2^R \to [0,1]$, where $R$ is a set of
parameters that characterize some object. The fuzzy
measure $g(Q_i)$ characterizes the total significance of the
parameters that are included in the set $Q_i$. The
fuzzy measure satisfies the following conditions
(Averkin, 1986): $g(\varnothing) = 0$, $g(Y) = 1$; if
$P, Q \subseteq Y$ and $P \subseteq Q$, then $g(P) \le g(Q)$.
If $R$ is the set of all subsets of the set of modalities
$Y = \{Y_1, \ldots, Y_m\}$, then the fusion operators can be written in
the following way.
Fuzzy Sugeno operator:
$$A_k = A_k^C = \max_{i=1,\ldots,m} \min\bigl(\mu_{ki}(y_i),\, g(Q_i)\bigr), \qquad (1)$$
where $\mu_{k1}(y_1) \ge \mu_{k2}(y_2) \ge \ldots \ge \mu_{km}(y_m)$, $Q_i = \{Y_1, \ldots, Y_i\}$, $i = 1, \ldots, m$.
Fuzzy Choquet operator:
$$A_k = A_k^{Ш} = \sum_{i=1}^{m} \bigl(\mu_{ki}(y_i) - \mu_{k,i+1}(y_{i+1})\bigr)\, g(Q_i), \qquad (2)$$
where $\mu_{k1}(y_1) \ge \mu_{k2}(y_2) \ge \ldots \ge \mu_{km}(y_m)$, $Q_i = \{Y_1, \ldots, Y_i\}$, $i = 1, \ldots, m$, and $\mu_{k,m+1}(y_{m+1}) = 0$.
The fuzzy Choquet operator is usually interpreted as a
generalization of the weighted arithmetic mean,
and the Sugeno operator as a generalization of the
weighted median (when fusing no fewer
than three modalities).
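Both operators reduce to a simple computation once the membership degrees are sorted in descending order and the cumulative fuzzy measures $g(Q_i)$ are known. The following Python sketch illustrates equations (1) and (2); the function names and the list-based representation of $g(Q_1), \ldots, g(Q_m)$ are our own assumptions, not notation from the method itself.

def sugeno(mu, g):
    # Equation (1): mu is sorted descending, g[i] = g(Q_{i+1})
    return max(min(m_i, g_i) for m_i, g_i in zip(mu, g))

def choquet(mu, g):
    # Equation (2): mu is sorted descending, g[i] = g(Q_{i+1}),
    # with the convention mu_{m+1} = 0
    mu_ext = list(mu) + [0.0]
    return sum((mu_ext[i] - mu_ext[i + 1]) * g[i] for i in range(len(mu)))

# Example with three modalities, memberships already sorted descending
mu = [0.9, 0.6, 0.3]   # mu_{k1}(y_1) >= mu_{k2}(y_2) >= mu_{k3}(y_3)
g = [0.4, 0.7, 1.0]    # g(Q_1), g(Q_2), g(Q_3) = g(Y) = 1
print(sugeno(mu, g))   # 0.6
print(choquet(mu, g))  # (0.9-0.6)*0.4 + (0.6-0.3)*0.7 + 0.3*1.0 = 0.63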
The most popular methods for calculating the fuzzy measure,
owing to their simplicity, are based
on the concept of the $g_\lambda$-fuzzy measure introduced by
Sugeno. A fuzzy measure is called a $g_\lambda$-fuzzy measure
if the following condition is true for it: for all
$P, Q \subseteq Y$ such that $P \cap Q = \varnothing$,
$$g(Q \cup P) = g(Q) + g(P) + \lambda\, g(Q)\, g(P)$$
for some $\lambda > -1$.
Let us consider the most popular procedure for
calculating the $g_\lambda$-fuzzy measure (Averkin, 1986);
(Devyatkov and Alfimtsev, 2008); (Grabisch and
Roubens, 2000); (Marichal, 2000), still
labeling it as $g$.
Step 1. For each signal (modality)
$Y_i$, $i = 1, \ldots, m$,
select the value of the fuzzy measure
$g(Y_i) \in [0,1]$ as an importance degree of the
modality $Y_i$. The values $g(Y_i)$ can be set by an expert,
can be the result of an experiment, or can be obtained in
another way.
Step 2. Find the value $\lambda$ using equation (3):
$$\lambda + 1 = \prod_{i=1}^{m} \bigl(1 + \lambda\, g(Y_i)\bigr). \qquad (3)$$
Step 3. For all
$Q_i = \{Y_1, \ldots, Y_i\}$, $i = 1, \ldots, m$,
find the fuzzy measures $g(Q_i)$ recursively using the
following expressions:
$$g(Q_1) = g(Y_1), \quad
g(Q_i) = g(Y_i) + g(Q_{i-1}) + \lambda\, g(Y_i)\, g(Q_{i-1}), \quad i = 2, \ldots, m. \qquad (4)$$
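Equation (3) has a single root $\lambda > -1$, $\lambda \ne 0$, whenever $\sum_i g(Y_i) \ne 1$, so it can be found numerically. The sketch below solves (3) by bisection and then applies the recursion (4); the solver, its tolerances and the example densities are our assumptions.

def solve_lambda(g_single, eps=1e-9, tol=1e-12):
    # Nonzero root of equation (3): prod(1 + lam*g_i) - lam - 1 = 0
    def f(lam):
        prod = 1.0
        for gi in g_single:
            prod *= 1.0 + lam * gi
        return prod - lam - 1.0
    s = sum(g_single)
    if abs(s - 1.0) < tol:
        return 0.0  # additive case: the measure is a probability measure
    # the root lies in (-1, 0) if the densities sum above 1, else in (0, inf)
    lo, hi = (-1.0 + eps, -eps) if s > 1.0 else (eps, 1e6)
    for _ in range(200):  # bisection
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def recursive_measures(g_single, lam):
    # Equation (4), applied in the given order of modalities
    gq = [g_single[0]]
    for gi in g_single[1:]:
        gq.append(gi + gq[-1] + lam * gi * gq[-1])
    return gq

g_single = [0.4, 0.5, 0.6]  # expert importance degrees g(Y_i), an assumption
lam = solve_lambda(g_single)
gq = recursive_measures(g_single, lam)  # gq[-1] is close to 1, since g(Y) = 1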
3 MULTIMODAL RECOGNITION
Before considering the common procedure of data
fusion, let us formalize the procedure of composition
of the set $Y_i$ and the procedure of recognition using
algorithm $i$ with function $\mu_{ij}(y_{ij})$. In the general
case the sources for fusion are $m$ algorithms, $i = 1, \ldots, m$,
that use hidden modalities. Here the
hidden modalities and the methods for their fusion
are not considered. What is of interest is
the result of the work of each algorithm as the source
of a new separate signal (modality) $Y_i$, $i = 1, \ldots, m$,
and a membership function
$\mu_{ij}(y_{ij})$, $y_{ij} \in Y_i$, $i = 1, \ldots, m$, $j = 0, \ldots, n_i$.
The main task of the procedure is the fusion of the
modalities $Y_i$, $i = 1, \ldots, m$. Each algorithm passes a
preprocessing step according to the following procedure 1 in
order to form a set $Y_i$ and membership functions
$\mu_{ij}(y_{ij})$, $y_{ij} \in Y_i$, $i = 1, \ldots, m$, $j = 0, \ldots, n_i$.
Step 1. A collection of empty sets
$Y_i^k$, $k = 1, \ldots, K$, is specified.
Step 2. For each reference object $k$, $k = 1, \ldots, K$, a
reference model $G_i^k$ is formed using
the hidden modalities.
Step 3. A model $G$ is formed for the recognizable
object based on the same principles and modalities.
Step 4. The model $G$ is compared with each model
$G_i^k$, $k = 1, \ldots, K$, resulting in the calculation of the set
of counts $\{y_i^1, y_i^2, \ldots, y_i^K\}$, which characterize the
proximity of the model $G$ to the models $G_i^k$, $k = 1, \ldots, K$,
respectively.
Step 5. The sets $Y_i^k = Y_i^k \cup \{y_i^k\}$, $k = 1, \ldots, K$, are formed,
which are considered as the new sets $Y_i^k$. If the sets $Y_i^k$
stop changing, then go to step 6 (other criteria can
be used to go to step 6); otherwise the procedure is restarted
from step 2.
Step 6. The sets $Y_i^k$ are joined, resulting in the set
$Y_i = \bigcup_{k=1}^{K} Y_i^k$, which is sorted (if it is numeric, the
ascending sort is done) and its elements are indexed
$i = 1, \ldots, m$, $j = 0, \ldots, n_i$, resulting in the set
$Y_i = \{y_{ij} \mid i = 1, \ldots, m,\ j = 0, \ldots, n_i\}$. A
membership function
$\mu_{ij}(y_{ij})$, $y_{ij} \in Y_i$, $i = 1, \ldots, m$, $j = 0, \ldots, n_i$, is
specified on the set $Y_i$.
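As an illustration, procedure 1 can be sketched in Python as follows; the compare function, the stopping rule and the normalization used to define the membership function are our own assumptions, since the paper leaves them to the concrete low-level algorithm.

def build_modality(compare, n_refs, max_iters=100):
    # Step 1: a collection of empty sets Y_i^k
    sets = [set() for _ in range(n_refs)]
    for _ in range(max_iters):
        # Steps 2-4: compare the current model G with each reference
        # model G_i^k, obtaining proximity counts y_i^k
        counts = [compare(k) for k in range(n_refs)]
        changed = False
        for k, y in enumerate(counts):  # Step 5: extend the sets
            if y not in sets[k]:
                sets[k].add(y)
                changed = True
        if not changed:  # Step 5 stopping criterion: the sets stabilized
            break
    y_all = sorted(set().union(*sets))  # Step 6: join, sort, index
    top = max(y_all) if max(y_all) > 0 else 1.0
    mu = {y: y / top for y in y_all}  # one possible membership function
    return y_all, mu

# Usage with a stub comparison that returns quantized proximities
import random
y_all, mu = build_modality(lambda k: round(random.random(), 1), n_refs=3)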
Recognition based on the separate algorithm $i$ with
function $\mu_{ij}(y_{ij})$ can be done according to the
following procedure 2.
Step 0. The set $Y_i$ and the membership
function $\mu_{ij}(y_{ij})$, $y_{ij} \in Y_i$, $i = 1, \ldots, m$, $j = 0, \ldots, n_i$,
are formed with procedure 1.
Step 1. For each reference object $k$, $k = 1, \ldots, K$,
its own reference model $G_i^k$ is formed using the hidden
modalities.
Step 2. A model $G$ is formed for the recognizable
object based on the same principles and modalities.
Step 3. The model $G$ is compared with each model
$G_i^k$, $k = 1, \ldots, K$, resulting in the calculation of the set of
counts $\{y_i^1, y_i^2, \ldots, y_i^K\} \subseteq Y_i$, which characterize
the proximity of the model $G$ to the models $G_i^k$, $k = 1, \ldots, K$,
respectively.
Step 4. The model $G$ is considered to match the
reference model $G_i^k$ for which the value
$\mu(y_i^k)$, $y_i^k \in Y_i$, is maximal.
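Step 4 is a simple argmax over the membership degrees of the proximity counts. A minimal sketch, assuming the dictionary-based membership function from the previous sketch:

def match_reference(counts, mu):
    # counts[k] is the proximity y_i^k of G to the reference model G_i^k
    best_k = max(range(len(counts)), key=lambda k: mu[counts[k]])
    return best_k, mu[counts[best_k]]

# Example: the recognizable model is closest to reference 1
counts = [0.2, 0.8, 0.5]
mu = {0.2: 0.25, 0.5: 0.625, 0.8: 1.0}
print(match_reference(counts, mu))  # (1, 1.0)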
Thus the membership function
$\mu(y_i^k)$, $y_i^k \in Y_i$, estimates the proximity of the
recognizable model to the corresponding reference
model. The main task is the fusion of the modalities
$Y_i$, $i = 1, \ldots, m$, to increase the reliability of
recognition.
Thus the common method for data fusion in a
multimodal interface using the Sugeno and Choquet
operators is the following.
Step 1. For each modality (signal)
$Y_i$, $i = 1, \ldots, m$, choose the value $g(Y_i) \in [0,1]$ as
an importance degree of the modality $Y_i$.
Step 2. Find the value $\lambda$ using equation (3).
Step 3. Calculate the set of membership functions
$\mu_i(y_i^k)$, $y_i^k \in Y_i$, $i = 1, \ldots, m$, using procedure 2 for
the recognizable object, for each algorithm
$i = 1, \ldots, m$ and for each $k = 1, \ldots, K$.
Step 4. For each $k = 1, \ldots, K$, sort the set of
functions $\mu_i(y_i^k)$ so that
$\mu_{j_1}(y_1^k) \ge \mu_{j_2}(y_2^k) \ge \ldots \ge \mu_{j_m}(y_m^k)$, $j_i \in \{1, \ldots, m\}$.
Step 5. For each $k = 1, \ldots, K$, calculate the fuzzy
measure values $g(Q_i^k)$ recursively, where
$Q_i^k = \{Y_{j_1}, \ldots, Y_{j_i}\}$, $i = 1, \ldots, m$, using
equation (4).
Step 6. Calculate the operator values $A_k = A_k^C$ (or
$A_k = A_k^{Ш}$) for all $k = 1, \ldots, K$. The
recognizable object is considered to match
the reference object for which the value $A_k$
is maximal.
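Putting the pieces together, steps 1-6 can be sketched as one small pipeline; it reuses the sugeno/choquet, solve_lambda and recursive_measures sketches given above, and the membership values in the example are invented for illustration.

def fuse(mu_per_ref, g_single, operator):
    # mu_per_ref[k][i]: membership degree mu_i(y_i^k) of modality i
    # for reference object k; g_single[i]: importance degree g(Y_i)
    lam = solve_lambda(g_single)  # Step 2
    best_k, best_val = None, -1.0
    for k, mu in enumerate(mu_per_ref):  # Steps 3-6 for each k
        order = sorted(range(len(mu)), key=lambda i: mu[i], reverse=True)
        mu_sorted = [mu[i] for i in order]  # Step 4: descending order
        g_sorted = [g_single[i] for i in order]
        gq = recursive_measures(g_sorted, lam)  # Step 5: g(Q_i^k)
        val = operator(mu_sorted, gq)  # Step 6: A_k
        if val > best_val:
            best_k, best_val = k, val
    return best_k, best_val

# Two reference objects, three modalities
mu_per_ref = [[0.7, 0.2, 0.9], [0.4, 0.8, 0.5]]
print(fuse(mu_per_ref, [0.4, 0.5, 0.6], choquet))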
BIODEVICES2013-InternationalConferenceonBiomedicalElectronicsandDevices
308
4 RECOGNITION OF VIDEO SCENES
In the general case each frame can contain $L$
objects $\Omega_l$, $l = 1, \ldots, L$, which are subject to
recognition. Objects from different sets $\Omega_l$ can be
in defined, in the general case $r$-ary, relations
$\Omega_{l_1} \times \Omega_{l_2} \times \ldots \times \Omega_{l_r}$, $\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$.
Each of these relations $\bar{\omega}$ we will call a reference
scene. By analogy with the recognizable object we will
identify the recognizable scene
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$,
$\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$,
where $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$ are the recognizable objects,
and the process of recognizing similarity with
some reference scene $\bar{\omega}$ we will call the recognition
of the scene $\omega$. The similarity of the recognized
scene $\omega$ with the reference scene $\bar{\omega}$, characterized
by a nonzero value of the similarity criterion,
we will write as $\omega \approx \bar{\omega}$. In case the value of the
similarity criterion is equal to zero, the scene $\omega$ is
not similar to the scene $\bar{\omega}$. This non-similarity is
written as $\omega \not\approx \bar{\omega}$.
The operator
$A_{\omega_j} = A[\mu_1(y_1), \mu_2(y_2), \ldots, \mu_m(y_m)]$ is
used for the calculation of the similarity of the
recognizable object $\omega$ and the reference object $\bar{\omega}_j$.
The membership functions
$\mu_1(y_1), \mu_2(y_2), \ldots, \mu_m(y_m)$
with values in the
interval $[0,1]$ are the arguments of this operator.
The operator values also lie in the interval $[0,1]$. Thus
the operator is a function $[0,1]^m \to [0,1]$. Then, if
for each object of the recognizable scene
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$ a set of criterion
values $A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}$ of its similarity with the
objects $\bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}}$ of the reference scene $\bar{\omega}$ is known,
then using some fusion operator $A$ we can calculate
the similarity measure between the recognizable
scene and the reference scene as the value of the function
$A_\omega = A[A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}]$.
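This two-stage use of the operator, first over modalities per object and then over objects per scene, can be sketched as follows; the per-object scores and the scene-level importance degrees are invented for illustration, and the choquet, solve_lambda and recursive_measures sketches from above are assumed.

def scene_similarity(object_scores, g_single, operator):
    # Fuse per-object criterion values A_{omega_l} into A_omega,
    # sorting them descending as the operators require
    lam = solve_lambda(g_single)
    order = sorted(range(len(object_scores)),
                   key=lambda i: object_scores[i], reverse=True)
    scores = [object_scores[i] for i in order]
    gq = recursive_measures([g_single[i] for i in order], lam)
    return operator(scores, gq)

# A 3-ary relation, e.g. a hand, a face and a background object
print(scene_similarity([0.8, 0.6, 0.9], [0.5, 0.3, 0.4], choquet))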
Procedure 4 of recognition of a separate scene
(relation)
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$, $\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$,
that uses this idea looks as follows.
Step 1. Each object $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$ is recognized
separately by comparison with the reference objects
$\bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}}$,
$k_{l_r} = 1, \ldots, K_{l_r}$,
$\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$, using the fusion
operators $A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}$. If for all recognizable
objects $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$ similar reference objects
$\bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}}$,
$k_{l_r} = 1, \ldots, K_{l_r}$,
were found, such that
$\omega_{l_1} \approx \bar{\omega}_{l_1}^{k_{l_1}},\ \omega_{l_2} \approx \bar{\omega}_{l_2}^{k_{l_2}},\ \ldots,\ \omega_{l_r} \approx \bar{\omega}_{l_r}^{k_{l_r}}$, then go to
step 2. If no similar reference object was found
for at least one of the objects $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$, then go to step 3.
Step 2. The scene
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$ is
considered recognized and similar to the scene
$\bar{\omega} = \langle \bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}} \rangle$, and the value of the
criterion of scene similarity is equal to
$A_\omega = A[A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}]$.
Step 3. The scene
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$ is
not recognized.
We will call the scenes
$\Omega_{l_1} \times \Omega_{l_2} \times \ldots \times \Omega_{l_r}$, $\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$,
the scenes of the 1st level and denote them $\Omega^1$.
We will call $\Omega^s = \Omega_1^{s-1} \times \Omega_2^{s-1} \times \ldots \times \Omega_v^{s-1}$ the
scenes of the $s$-th level, where $\Omega_1^{s-1}, \Omega_2^{s-1}, \ldots, \Omega_v^{s-1}$ are
scenes of the $(s-1)$-th level. Thus the scenes of the 1st
level are relations of objects, and scenes of the $s$-th
level, where $s > 1$, are relations of scenes of the $(s-1)$-th
level. In order to recognize $(s-j)$-level scenes
($j = 0, 1, \ldots, s-2$) it is necessary to recognize the scenes of
level $(s-j-1)$, the relations of which form the $(s-j)$-level
scenes. If during the recognition of any $(s-j)$-level
scene it is found that at least one $(s-j-1)$-level scene
included in the relation of this $(s-j)$-level scene cannot
be recognized, then the recognition process of the
latter is stopped.
The method of recognition of $s$-level scenes
$\Omega^s = \Omega_1^{s-1} \times \Omega_2^{s-1} \times \ldots \times \Omega_v^{s-1}$ can be built as a
development of procedure 4 for the recognition of the
1st-level scenes, as follows.
Step 1. Each object $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$ that is
included in at least one 1st-level scene $\omega^1$ is
recognized by separate comparison with the reference
objects $\bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}}$,
$k_{l_r} = 1, \ldots, K_{l_r}$,
$\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$, using the fusion
operators $A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}$.
Step 2. Each 1st-level scene $\omega^1$, for all objects
of which similar reference objects are found, is
considered recognized, and a similarity criterion
(the value of the fusion operator) $A_{\omega^1}$ is calculated for it.
After this, go to step 3. If no
such scenes were found, then there are no recognized scenes of the
first level and higher, and the execution is stopped.
Step 3. The value of the level is set to $s = 2$ and we go to
step 4.
Step 4. If there were found $s$-level scenes $\omega^s$ for
all of whose $(s-1)$-level scenes nonzero values of the
similarity criterion had been found, then these
scenes $\omega^s$ are considered recognized and the similarity
criteria (fusion operator values) $A_{\omega^s}$ are calculated
for them. If there are any $(s+1)$-level scenes, then
step 4 is executed once again with the value $s = s+1$;
otherwise the execution is stopped.
If there were found no $s$-level scenes $\omega^s$ for all of
whose $(s-1)$-level scenes nonzero values
of the similarity criterion had been found, then there are no
recognized $s$-level scenes and the execution is
stopped.
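The level-by-level method is essentially a bottom-up pass in which a zero similarity value prunes every scene built on it. A minimal Python sketch, assuming scenes at each level are given as tuples of indices into the previous level and reusing the scene_similarity sketch from above:

def recognize_levels(object_scores, levels, g_single, operator):
    # object_scores: per-object criterion values A_{omega_l} (level 0);
    # levels[s]: list of scenes, each a tuple of indices into level s-1
    scores = object_scores
    for level in levels:
        next_scores = []
        for scene in level:
            parts = [scores[i] for i in scene]
            if any(p == 0.0 for p in parts):
                next_scores.append(0.0)  # an unrecognized part prunes the scene
            else:
                next_scores.append(
                    scene_similarity(parts, g_single[:len(parts)], operator))
        if all(v == 0.0 for v in next_scores):
            return None  # no recognized scenes at this level or higher
        scores = next_scores
    return scores  # similarity criteria of the top-level scenes

# Two 1st-level scenes over four objects, one 2nd-level scene over both
object_scores = [0.9, 0.7, 0.8, 0.6]
levels = [[(0, 1), (2, 3)], [(0, 1)]]
print(recognize_levels(object_scores, levels, [0.5, 0.4, 0.3], choquet))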
5 CONCLUSIONS
A method that uses
fuzzy fusion operators with a fuzzy measure for data
fusion in a multimodal interface has been considered. The main advantages
of the method over the well-known analogs are the
following:
The ability of hierarchical multimodal recognition of
scenes that consist of static and dynamic (moving)
objects.
The ability to take into account the measure of importance
of each modality during the process of hierarchical
scene recognition, due to the use of fusion
operators that use a fuzzy measure.
The ability to be the base for the development of control
systems for different objects (robots, computers, TV sets, etc.)
with the help of dynamic patterns.
The ability to increase the reliability of recognition
of separate objects (for example, a human being)
in the scene using relations between these objects
and other objects of the scene (background
objects).
Promising opportunities for the development of
intelligent and intuitive human-machine interfaces
by using more modalities and relations.
REFERENCES
Alatan, A. A., 2001. Automatic multimodal dialogue scene indexing. In Proc. of the Int. Conf. on Image Processing, Vol. 3, pp. 374-377.
Averkin, A. A., Batyrshin, I. Z., Blishun, A. F., 1986. Fuzzy Sets in Control Models and Artificial Intelligence Systems. Nauka, Moscow.
Devyatkov, V., Alfimtsev, A., 2008. Optimal fuzzy aggregation of secondary attributes in recognition problems. In Proc. of the 16th Int. Conf. in Central Europe on Computer Graphics, Visualization and Computer Vision, Plzen, pp. 78-85.
Grabisch, M., Roubens, M., 2000. Application of the Choquet integral in multicriteria decision making. In Fuzzy Measures and Integrals: Theory and Applications, Physica-Verlag, pp. 415-434.
Higuchi, M. et al., 2004. Scene recognition based on relationship between human actions and objects. In Proc. of the 17th Int. Conf. on Pattern Recognition, Vol. 3, pp. 73-78.
Liu, F., Lin, X., 2003. Multimodal face tracking using Bayesian network. In IEEE Int. Workshop on Analysis and Modeling of Faces and Gestures, Nice, pp. 135-142.
Marichal, J., 2000. On Choquet and Sugeno integrals as aggregation functions. In Fuzzy Measures and Integrals, Vol. 40, pp. 247-272.
Ronshin, A. L., Karpov, A. A., Li, I. V., 2006. Speech and Multimodal Interfaces. Nauka, Moscow.
Sharma, R., 2003. Speech-gesture driven multimodal interfaces for crisis management. Proceedings of the IEEE, Vol. 91, No. 9, pp. 1327-1354.
Wang, X., Chen, J., 2004. Multiple neural networks fusion model based on Choquet fuzzy integral. In Proc. of the Third Int. Conf. on Machine Learning and Cybernetics, Vol. 4, pp. 2024-2027.
BIODEVICES2013-InternationalConferenceonBiomedicalElectronicsandDevices
310