Data Fusion in Multimodal Interface
Alexander Alfimtsev
Department of Information Systems and Telecommunications, BMSTU, 2nd Baumanskaya St., Moscow, Russia
Keywords: Multimodal Interface, Data Fusion, Fuzzy Aggregation.
Abstract: A method of data fusion in a multimodal interface using fuzzy fusion operators is described in this paper.
Scenes are defined by dynamic and static patterns for which the sources of information are the
results of pattern recognition by low-level algorithms.
1 INTRODUCTION
In the general case, recognition tasks in multimodal
interfaces are not limited to a single specific
pattern at each instant. Research is currently widely
conducted in the field of so-called multimodal
recognition. A modality is usually understood as a
form of human influence on another human being or on a
personal computer using speech, gestures, touch,
facial expressions, appearance, etc. It is now accepted
(Averkin, 1986); (Higuchi, 2004); (Sharma, 2003)
that even within the bounds of a single interaction
form (with the PC), for example, the control of an
interface via hand gestures, different modalities can
be used. In this case there arises the task of
combining, or, as it is often called, fusion of different
modalities. The intention to combine several sources
of information is explained by the fact that each
source separately may have high uncertainty or
inaccuracy of data, which are decreased by their
fusion (Wang and Chen, 2004). The goals of multimodal
recognition may differ. One of them is the
control of physical objects based on the analysis of scenes,
which are in fact relations of dynamic and static
objects.
The fusion can be performed at two levels
(Grabisch and Roubens, 2000): low level and high level. Let us suppose that
each signal, i.e. a sequence of $n+1$ counts
$Y_i[t_0, t_n] = \{y_i(t_0), y_i(t_1), y_i(t_2), \ldots, y_i(t_n)\}$, is connected with
its modality. Fusion that deals with signals is
usually referred to the low level. Signals and
the modalities that correspond to them are synchronized
at the low level, the interconnection and interaction
of the signals can be clearly seen, and the modalities often
belong to the same interaction form. Fusion at the
high level is usually done after the work of the
recognition algorithms at the low level. Each of
them performs the recognition of a group of signals
belonging to the same form of interaction or even to
the same modality. Forms of modalities can be
independent of time.
Functions used for fusion are usually called
fusion operators (Grabisch and Roubens, 2000).
The max-operator is one of the best-known fusion
operators. The use of the max-operator for multimodal
recognition provides a high level of reliability
but can be ineffective if used at the high level.
The weighted arithmetic operator is another popular
fusion operator, but fusion using the weighted
arithmetic operator may lead to insufficient
recognition accuracy, where accuracy is
understood as the percentage of successful recognitions
out of the total number of attempts. This can be a
consequence of the empirical choice of the weighting
coefficients and also of the difficulty of taking into account
a possible interconnection of the membership functions
that characterize the pattern recognition result.
This paper describes a method that uses fuzzy fusion operators
(the Sugeno and Choquet fuzzy integrals) for data fusion
and multimodal recognition of video scenes defined
by dynamic and static patterns, for which the
recognition results of the low-level algorithms are the sources of
information (secondary attributes (Devyatkov and
Alfimtsev, 2008)).
The Sugeno and Choquet fuzzy fusion operators are
considered in Section 2. The procedures required for
data fusion based on fusion operators are described
in Section 3. Recognition of video scenes is
considered in Section 4.
2 FUZZY FUSION OPERATORS
Fuzzy fusion operators that allow us to take into account the
interconnection of membership functions use a fuzzy
measure. A fuzzy measure is a function
$g: 2^R \to [0,1]$, where $R$ is a set of
parameters that characterize some object. The fuzzy
measure $g(Q_i)$ characterizes the total significance of the
parameters that are included in the set $Q_i$. The
fuzzy measure satisfies the following conditions
(Averkin, 1986): $g(\varnothing) = 0$, $g(Y) = 1$; if
$P, Q \subseteq Y$ and $P \subseteq Q$, then $g(P) \le g(Q)$.
If $R$ is the set of all subsets of the set of modalities
$Y = \{Y_1, \ldots, Y_m\}$, then the fusion operators can be written in
the following way.
Fuzzy Sugeno operator:
$$A_k = A_k^C = \max_{i=1,\ldots,m} \min\bigl(\mu_{ki}(y_i),\, g(Q_i)\bigr), \qquad (1)$$
where $\mu_{k1}(y_1) \ge \mu_{k2}(y_2) \ge \ldots \ge \mu_{km}(y_m)$, $Q_i = \{Y_1, \ldots, Y_i\}$, $i = 1, \ldots, m$.
Fuzzy Choquet operator:
$$A_k = A_k^{Ш} = \sum_{i=1}^{m} \bigl(\mu_{ki}(y_i) - \mu_{k,i+1}(y_{i+1})\bigr)\, g(Q_i), \qquad (2)$$
where $\mu_{k1}(y_1) \ge \mu_{k2}(y_2) \ge \ldots \ge \mu_{km}(y_m)$, $Q_i = \{Y_1, \ldots, Y_i\}$, $i = 1, \ldots, m$, and $\mu_{k,m+1}(y_{m+1}) = 0$.
The fuzzy Choquet operator is usually interpreted as a
generalization of the weighted arithmetic mean,
and the Sugeno operator as a generalization of the
weighted median (when fusing no fewer
than three modalities).
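Both operators reduce to a simple computation once the membership degrees are sorted in descending order and the cumulative fuzzy measures $g(Q_i)$ are known. The following Python sketch illustrates equations (1) and (2); the function names and the list-based representation of $g(Q_1), \ldots, g(Q_m)$ are our own assumptions, not notation from the method itself.

def sugeno(mu, g):
    # Equation (1): mu is sorted descending, g[i] = g(Q_{i+1})
    return max(min(m_i, g_i) for m_i, g_i in zip(mu, g))

def choquet(mu, g):
    # Equation (2): mu is sorted descending, g[i] = g(Q_{i+1}),
    # with the convention mu_{m+1} = 0
    mu_ext = list(mu) + [0.0]
    return sum((mu_ext[i] - mu_ext[i + 1]) * g[i] for i in range(len(mu)))

# Example with three modalities, memberships already sorted descending
mu = [0.9, 0.6, 0.3]   # mu_{k1}(y_1) >= mu_{k2}(y_2) >= mu_{k3}(y_3)
g = [0.4, 0.7, 1.0]    # g(Q_1), g(Q_2), g(Q_3) = g(Y) = 1
print(sugeno(mu, g))   # 0.6
print(choquet(mu, g))  # (0.9-0.6)*0.4 + (0.6-0.3)*0.7 + 0.3*1.0 = 0.63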
The most popular methods for calculating the fuzzy measure,
owing to their simplicity, are based
on the concept of the $g_\lambda$-fuzzy measure introduced by
Sugeno. A fuzzy measure is called a $g_\lambda$-fuzzy measure
if the following condition is true for it: for all
$P, Q \subseteq Y$ such that $P \cap Q = \varnothing$,
$$g(Q \cup P) = g(Q) + g(P) + \lambda\, g(Q)\, g(P)$$
for some $\lambda > -1$.
Let us consider the most popular procedure for
calculating the $g_\lambda$-fuzzy measure (Averkin, 1986);
(Devyatkov and Alfimtsev, 2008); (Grabisch and
Roubens, 2000); (Marichal, 2000), still
labeling it as $g$.
Step 1. For each signal (modality)
$Y_i$, $i = 1, \ldots, m$,
select the value of the fuzzy measure
$g(Y_i) \in [0,1]$ as an importance degree of the
modality $Y_i$. The values $g(Y_i)$ can be set by an expert,
can be the result of an experiment, or can be obtained in
another way.
Step 2. Find the value $\lambda$ using equation (3):
$$\lambda + 1 = \prod_{i=1}^{m} \bigl(1 + \lambda\, g(Y_i)\bigr). \qquad (3)$$
Step 3. For all
$Q_i = \{Y_1, \ldots, Y_i\}$, $i = 1, \ldots, m$,
find the fuzzy measures $g(Q_i)$ recursively using the
following expressions:
$$g(Q_1) = g(Y_1), \quad
g(Q_i) = g(Y_i) + g(Q_{i-1}) + \lambda\, g(Y_i)\, g(Q_{i-1}), \quad i = 2, \ldots, m. \qquad (4)$$
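Equation (3) has a single root $\lambda > -1$, $\lambda \ne 0$, whenever $\sum_i g(Y_i) \ne 1$, so it can be found numerically. The sketch below solves (3) by bisection and then applies the recursion (4); the solver, its tolerances and the example densities are our assumptions.

def solve_lambda(g_single, eps=1e-9, tol=1e-12):
    # Nonzero root of equation (3): prod(1 + lam*g_i) - lam - 1 = 0
    def f(lam):
        prod = 1.0
        for gi in g_single:
            prod *= 1.0 + lam * gi
        return prod - lam - 1.0
    s = sum(g_single)
    if abs(s - 1.0) < tol:
        return 0.0  # additive case: the measure is a probability measure
    # the root lies in (-1, 0) if the densities sum above 1, else in (0, inf)
    lo, hi = (-1.0 + eps, -eps) if s > 1.0 else (eps, 1e6)
    for _ in range(200):  # bisection
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def recursive_measures(g_single, lam):
    # Equation (4), applied in the given order of modalities
    gq = [g_single[0]]
    for gi in g_single[1:]:
        gq.append(gi + gq[-1] + lam * gi * gq[-1])
    return gq

g_single = [0.4, 0.5, 0.6]  # expert importance degrees g(Y_i), an assumption
lam = solve_lambda(g_single)
gq = recursive_measures(g_single, lam)  # gq[-1] is close to 1, since g(Y) = 1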
3 MULTIMODAL RECOGNITION
Before considering the common procedure of data
fusion, let us formalize the procedure of composition
of the set $Y_i$ and the procedure of recognition using
algorithm $i$ with function $\mu_{ij}(y_{ij})$. In the general
case the sources for fusion are $m$ algorithms, $i = 1, \ldots, m$,
that use hidden modalities. Here the
hidden modalities and the methods for their fusion
are not considered. What is of interest is
the result of the work of each algorithm as the source
of a new separate signal (modality) $Y_i$, $i = 1, \ldots, m$,
and a membership function
$\mu_{ij}(y_{ij})$, $y_{ij} \in Y_i$, $i = 1, \ldots, m$, $j = 0, \ldots, n_i$.
The main task of the procedure is the fusion of the
modalities $Y_i$, $i = 1, \ldots, m$. Each algorithm passes a
preprocessing step according to the following procedure 1 in
order to form a set $Y_i$ and membership functions
$\mu_{ij}(y_{ij})$, $y_{ij} \in Y_i$, $i = 1, \ldots, m$, $j = 0, \ldots, n_i$.
Step 1. A collection of empty sets
$Y_i^k$, $k = 1, \ldots, K$, is specified.
Step 2. For each reference object $k$, $k = 1, \ldots, K$, a
reference model $G_i^k$ is formed using
the hidden modalities.
Step 3. A model $G$ is formed for the recognizable
object based on the same principles and modalities.
Step 4. The model $G$ is compared with each model
$G_i^k$, $k = 1, \ldots, K$, resulting in the calculation of the set
of counts $\{y_i^1, y_i^2, \ldots, y_i^K\}$, which characterize the
proximity of the model $G$ to the models $G_i^k$, $k = 1, \ldots, K$,
respectively.
Step 5. The sets $Y_i^k = Y_i^k \cup \{y_i^k\}$, $k = 1, \ldots, K$, are formed,
which are considered as the new sets $Y_i^k$. If the sets $Y_i^k$
stop changing, then go to step 6 (other criteria can
be used to go to step 6); otherwise the procedure is restarted
from step 2.
Step 6. The sets $Y_i^k$ are joined, resulting in the set
$Y_i = \bigcup_{k=1}^{K} Y_i^k$, which is sorted (if it is numeric, the
ascending sort is done) and its elements are indexed
$i = 1, \ldots, m$, $j = 0, \ldots, n_i$, resulting in the set
$Y_i = \{y_{ij} \mid i = 1, \ldots, m,\ j = 0, \ldots, n_i\}$. A
membership function
$\mu_{ij}(y_{ij})$, $y_{ij} \in Y_i$, $i = 1, \ldots, m$, $j = 0, \ldots, n_i$, is
specified on the set $Y_i$.
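As an illustration, procedure 1 can be sketched in Python as follows; the compare function, the stopping rule and the normalization used to define the membership function are our own assumptions, since the paper leaves them to the concrete low-level algorithm.

def build_modality(compare, n_refs, max_iters=100):
    # Step 1: a collection of empty sets Y_i^k
    sets = [set() for _ in range(n_refs)]
    for _ in range(max_iters):
        # Steps 2-4: compare the current model G with each reference
        # model G_i^k, obtaining proximity counts y_i^k
        counts = [compare(k) for k in range(n_refs)]
        changed = False
        for k, y in enumerate(counts):  # Step 5: extend the sets
            if y not in sets[k]:
                sets[k].add(y)
                changed = True
        if not changed:  # Step 5 stopping criterion: the sets stabilized
            break
    y_all = sorted(set().union(*sets))  # Step 6: join, sort, index
    top = max(y_all) if max(y_all) > 0 else 1.0
    mu = {y: y / top for y in y_all}  # one possible membership function
    return y_all, mu

# Usage with a stub comparison that returns quantized proximities
import random
y_all, mu = build_modality(lambda k: round(random.random(), 1), n_refs=3)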
Recognition based on the separate algorithm $i$ with
function $\mu_{ij}(y_{ij})$ can be done according to the
following procedure 2.
Step 0. The set $Y_i$ and the membership
function $\mu_{ij}(y_{ij})$, $y_{ij} \in Y_i$, $i = 1, \ldots, m$, $j = 0, \ldots, n_i$,
are formed with procedure 1.
Step 1. For each reference object $k$, $k = 1, \ldots, K$,
its own reference model $G_i^k$ is formed using the hidden
modalities.
Step 2. A model $G$ is formed for the recognizable
object based on the same principles and modalities.
Step 3. The model $G$ is compared with each model
$G_i^k$, $k = 1, \ldots, K$, resulting in the calculation of the set of
counts $\{y_i^1, y_i^2, \ldots, y_i^K\} \subseteq Y_i$, which characterize
the proximity of the model $G$ to the models $G_i^k$, $k = 1, \ldots, K$,
respectively.
Step 4. The model $G$ is considered to match the
reference model $G_i^k$ for which the value
$\mu(y_i^k)$, $y_i^k \in Y_i$, is maximal.
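Step 4 is a simple argmax over the membership degrees of the proximity counts. A minimal sketch, assuming the dictionary-based membership function from the previous sketch:

def match_reference(counts, mu):
    # counts[k] is the proximity y_i^k of G to the reference model G_i^k
    best_k = max(range(len(counts)), key=lambda k: mu[counts[k]])
    return best_k, mu[counts[best_k]]

# Example: the recognizable model is closest to reference 1
counts = [0.2, 0.8, 0.5]
mu = {0.2: 0.25, 0.5: 0.625, 0.8: 1.0}
print(match_reference(counts, mu))  # (1, 1.0)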
Thus the membership function
$\mu(y_i^k)$, $y_i^k \in Y_i$, estimates the proximity of the
recognizable model to the corresponding reference
model. The main task is the fusion of the modalities
$Y_i$, $i = 1, \ldots, m$, to increase the reliability of
recognition.
Thus the common method for data fusion in a
multimodal interface using the Sugeno and Choquet
operators is the following.
Step 1. For each modality (signal)
$Y_i$, $i = 1, \ldots, m$, choose the value $g(Y_i) \in [0,1]$ as
an importance degree of the modality $Y_i$.
Step 2. Find the value $\lambda$ using equation (3).
Step 3. Calculate the set of membership functions
$\mu_i(y_i^k)$, $y_i^k \in Y_i$, $i = 1, \ldots, m$, using procedure 2 for
the recognizable object, for each algorithm
$i = 1, \ldots, m$ and for each $k = 1, \ldots, K$.
Step 4. For each $k = 1, \ldots, K$, sort the set of
functions $\mu_i(y_i^k)$ so that
$\mu_{j_1}(y_1^k) \ge \mu_{j_2}(y_2^k) \ge \ldots \ge \mu_{j_m}(y_m^k)$, $j_i \in \{1, \ldots, m\}$.
Step 5. For each $k = 1, \ldots, K$, calculate the fuzzy
measure values $g(Q_i^k)$ recursively, where
$Q_i^k = \{Y_{j_1}, \ldots, Y_{j_i}\}$, $i = 1, \ldots, m$, using
equation (4).
Step 6. Calculate the operator values $A_k = A_k^C$ (or
$A_k = A_k^{Ш}$) for all $k = 1, \ldots, K$. The
recognizable object is considered to match
the reference object for which the value $A_k$
is maximal.
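Putting the pieces together, steps 1-6 can be sketched as one small pipeline; it reuses the sugeno/choquet, solve_lambda and recursive_measures sketches given above, and the membership values in the example are invented for illustration.

def fuse(mu_per_ref, g_single, operator):
    # mu_per_ref[k][i]: membership degree mu_i(y_i^k) of modality i
    # for reference object k; g_single[i]: importance degree g(Y_i)
    lam = solve_lambda(g_single)  # Step 2
    best_k, best_val = None, -1.0
    for k, mu in enumerate(mu_per_ref):  # Steps 3-6 for each k
        order = sorted(range(len(mu)), key=lambda i: mu[i], reverse=True)
        mu_sorted = [mu[i] for i in order]  # Step 4: descending order
        g_sorted = [g_single[i] for i in order]
        gq = recursive_measures(g_sorted, lam)  # Step 5: g(Q_i^k)
        val = operator(mu_sorted, gq)  # Step 6: A_k
        if val > best_val:
            best_k, best_val = k, val
    return best_k, best_val

# Two reference objects, three modalities
mu_per_ref = [[0.7, 0.2, 0.9], [0.4, 0.8, 0.5]]
print(fuse(mu_per_ref, [0.4, 0.5, 0.6], choquet))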
BIODEVICES2013-InternationalConferenceonBiomedicalElectronicsandDevices
308
4 RECOGNITION OF VIDEO SCENES
In the general case each frame can contain $L$
objects $\Omega_l$, $l = 1, \ldots, L$, which are subject to
recognition. Objects from different sets $\Omega_l$ can be
in defined, in the general case $r$-ary, relations
$\Omega_{l_1} \times \Omega_{l_2} \times \ldots \times \Omega_{l_r}$, $\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$.
Each of these relations $\bar{\omega}$ we will call a reference
scene. By analogy with the recognizable object we will
identify the recognizable scene
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$,
$\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$,
where $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$ are the recognizable objects,
and the process of recognizing similarity with
some reference scene $\bar{\omega}$ we will call the recognition
of the scene $\omega$. The similarity of the recognized
scene $\omega$ with the reference scene $\bar{\omega}$, characterized
by a nonzero value of the similarity criterion,
we will write as $\omega \approx \bar{\omega}$. In case the value of the
similarity criterion is equal to zero, the scene $\omega$ is
not similar to the scene $\bar{\omega}$. This non-similarity is
written as $\omega \not\approx \bar{\omega}$.
The operator
$A_{\omega_j} = A[\mu_1(y_1), \mu_2(y_2), \ldots, \mu_m(y_m)]$ is
used for the calculation of the similarity of the
recognizable object $\omega$ and the reference object $\bar{\omega}_j$.
The membership functions
$\mu_1(y_1), \mu_2(y_2), \ldots, \mu_m(y_m)$
with values in the
interval $[0,1]$ are the arguments of this operator.
The operator values also lie in the interval $[0,1]$. Thus
the operator is a function $[0,1]^m \to [0,1]$. Then, if
for each object of the recognizable scene
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$ a set of criterion
values $A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}$ of its similarity with the
objects $\bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}}$ of the reference scene $\bar{\omega}$ is known,
then using some fusion operator $A$ we can calculate
the similarity measure between the recognizable
scene and the reference scene as the value of the function
$A_\omega = A[A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}]$.
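This two-stage use of the operator, first over modalities per object and then over objects per scene, can be sketched as follows; the per-object scores and the scene-level importance degrees are invented for illustration, and the choquet, solve_lambda and recursive_measures sketches from above are assumed.

def scene_similarity(object_scores, g_single, operator):
    # Fuse per-object criterion values A_{omega_l} into A_omega,
    # sorting them descending as the operators require
    lam = solve_lambda(g_single)
    order = sorted(range(len(object_scores)),
                   key=lambda i: object_scores[i], reverse=True)
    scores = [object_scores[i] for i in order]
    gq = recursive_measures([g_single[i] for i in order], lam)
    return operator(scores, gq)

# A 3-ary relation, e.g. a hand, a face and a background object
print(scene_similarity([0.8, 0.6, 0.9], [0.5, 0.3, 0.4], choquet))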
Procedure 4 of recognition of a separate scene
(relation)
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$, $\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$,
that uses this idea looks as follows.
Step 1. Each object $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$ is recognized
separately by comparison with the reference objects
$\bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}}$,
$k_{l_r} = 1, \ldots, K_{l_r}$,
$\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$, using the fusion
operators $A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}$. If for all recognizable
objects $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$ similar reference objects
$\bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}}$,
$k_{l_r} = 1, \ldots, K_{l_r}$,
were found, such that
$\omega_{l_1} \approx \bar{\omega}_{l_1}^{k_{l_1}},\ \omega_{l_2} \approx \bar{\omega}_{l_2}^{k_{l_2}},\ \ldots,\ \omega_{l_r} \approx \bar{\omega}_{l_r}^{k_{l_r}}$, then go to
step 2. If no similar reference object was found
for at least one of the objects $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$, then go to step 3.
Step 2. The scene
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$ is
considered recognized and similar to the scene
$\bar{\omega} = \langle \bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}} \rangle$, and the value of the
criterion of scene similarity is equal to
$A_\omega = A[A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}]$.
Step 3. The scene
$\omega = \langle \omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r} \rangle$ is
not recognized.
We will call the scenes
$\Omega_{l_1} \times \Omega_{l_2} \times \ldots \times \Omega_{l_r}$, $\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$,
the scenes of the 1st level and denote them $\Omega^1$.
We will call $\Omega^s = \Omega_1^{s-1} \times \Omega_2^{s-1} \times \ldots \times \Omega_v^{s-1}$ the
scenes of the $s$-th level, where $\Omega_1^{s-1}, \Omega_2^{s-1}, \ldots, \Omega_v^{s-1}$ are
scenes of the $(s-1)$-th level. Thus the scenes of the 1st
level are relations of objects, and scenes of the $s$-th
level, where $s > 1$, are relations of scenes of the $(s-1)$-th
level. In order to recognize $(s-j)$-level scenes
($j = 0, 1, \ldots, s-2$) it is necessary to recognize the scenes of
level $(s-j-1)$, the relations of which form the $(s-j)$-level
scenes. If during the recognition of any $(s-j)$-level
scene it is found that at least one $(s-j-1)$-level scene
included in the relation of this $(s-j)$-level scene cannot
be recognized, then the recognition process of the
latter is stopped.
The method of recognition of $s$-level scenes
$\Omega^s = \Omega_1^{s-1} \times \Omega_2^{s-1} \times \ldots \times \Omega_v^{s-1}$ can be built as a
development of procedure 4 for the recognition of the
1st-level scenes, as follows.
Step 1. Each object $\omega_{l_1}, \omega_{l_2}, \ldots, \omega_{l_r}$ that is
included in at least one 1st-level scene $\omega^1$ is
recognized by separate comparison with the reference
objects $\bar{\omega}_{l_1}^{k_{l_1}}, \bar{\omega}_{l_2}^{k_{l_2}}, \ldots, \bar{\omega}_{l_r}^{k_{l_r}}$,
$k_{l_r} = 1, \ldots, K_{l_r}$,
$\{l_1, l_2, \ldots, l_r\} \subseteq \{1, \ldots, L\}$, using the fusion
operators $A_{\omega_{l_1}}, A_{\omega_{l_2}}, \ldots, A_{\omega_{l_r}}$.
Step 2. Each 1st-level scene $\omega^1$, for all objects
of which similar reference objects are found, is
considered recognized, and a similarity criterion
(the value of the fusion operator) $A_{\omega^1}$ is calculated for it.
After this, go to step 3. If no
such scenes were found, then there are no recognized scenes of the
first level and higher, and the execution is stopped.
Step 3. The value of the level is set to $s = 2$ and we go to
step 4.
Step 4. If there were found $s$-level scenes $\omega^s$ for
all of whose $(s-1)$-level scenes nonzero values of the
similarity criterion had been found, then these
scenes $\omega^s$ are considered recognized and the similarity
criteria (fusion operator values) $A_{\omega^s}$ are calculated
for them. If there are any $(s+1)$-level scenes, then
step 4 is executed once again with the value $s = s+1$;
otherwise the execution is stopped.
If there were found no $s$-level scenes $\omega^s$ for all of
whose $(s-1)$-level scenes nonzero values
of the similarity criterion had been found, then there are no
recognized $s$-level scenes and the execution is
stopped.
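The level-by-level method is essentially a bottom-up pass in which a zero similarity value prunes every scene built on it. A minimal Python sketch, assuming scenes at each level are given as tuples of indices into the previous level and reusing the scene_similarity sketch from above:

def recognize_levels(object_scores, levels, g_single, operator):
    # object_scores: per-object criterion values A_{omega_l} (level 0);
    # levels[s]: list of scenes, each a tuple of indices into level s-1
    scores = object_scores
    for level in levels:
        next_scores = []
        for scene in level:
            parts = [scores[i] for i in scene]
            if any(p == 0.0 for p in parts):
                next_scores.append(0.0)  # an unrecognized part prunes the scene
            else:
                next_scores.append(
                    scene_similarity(parts, g_single[:len(parts)], operator))
        if all(v == 0.0 for v in next_scores):
            return None  # no recognized scenes at this level or higher
        scores = next_scores
    return scores  # similarity criteria of the top-level scenes

# Two 1st-level scenes over four objects, one 2nd-level scene over both
object_scores = [0.9, 0.7, 0.8, 0.6]
levels = [[(0, 1), (2, 3)], [(0, 1)]]
print(recognize_levels(object_scores, levels, [0.5, 0.4, 0.3], choquet))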
5 CONCLUSIONS
A method that uses
fuzzy fusion operators with a fuzzy measure for data
fusion in a multimodal interface has been considered. The main advantages
of the method over the well-known analogs are the
following:
The ability of hierarchical multimodal recognition of
scenes that consist of static and dynamic (moving)
objects.
The ability to take into account the measure of importance
of each modality during the process of hierarchical
scene recognition, due to the use of fusion
operators that use a fuzzy measure.
The ability to be the base for the development of control
systems for different objects (robots, computers, TV sets, etc.)
with the help of dynamic patterns.
The ability to increase the reliability of recognition
of separate objects (for example, a human being)
in the scene using relations between these objects
and other objects of the scene (background
objects).
Promising opportunities for the development of
intelligent and intuitive human-machine interfaces
by using more modalities and relations.
REFERENCES
Alatan, A. A., 2001. Automatic multimodal dialogue scene indexing. In Proc. of the Int. Conf. on Image Processing, Vol. 3, pp. 374-377.
Averkin, A. A., Batyrshin, I. Z., Blishun, A. F., 1986. Fuzzy Sets in Control Models and Artificial Intelligence Systems. Nauka, Moscow.
Devyatkov, V., Alfimtsev, A., 2008. Optimal fuzzy aggregation of secondary attributes in recognition problems. In Proc. of the 16th Int. Conf. in Central Europe on Computer Graphics, Visualization and Computer Vision, Plzen, pp. 78-85.
Grabisch, M., Roubens, M., 2000. Application of the Choquet integral in multicriteria decision making. In Fuzzy Measures and Integrals: Theory and Applications, Physica-Verlag, pp. 415-434.
Higuchi, M. et al., 2004. Scene recognition based on relationship between human actions and objects. In Proc. of the 17th Int. Conf. on Pattern Recognition, Vol. 3, pp. 73-78.
Liu, F., Lin, X., 2003. Multimodal face tracking using Bayesian network. In IEEE Int. Workshop on Analysis and Modeling of Faces and Gestures, Nice, pp. 135-142.
Marichal, J., 2000. On Choquet and Sugeno integrals as aggregation functions. In Fuzzy Measures and Integrals, Vol. 40, pp. 247-272.
Ronshin, A. L., Karpov, A. A., Li, I. V., 2006. Speech and Multimodal Interfaces. Nauka, Moscow.
Sharma, R., 2003. Speech-gesture driven multimodal interfaces for crisis management. Proceedings of the IEEE, Vol. 91, No. 9, pp. 1327-1354.
Wang, X., Chen, J., 2004. Multiple neural networks fusion model based on Choquet fuzzy integral. In Proc. of the Third Int. Conf. on Machine Learning and Cybernetics, Vol. 4, pp. 2024-2027.
BIODEVICES2013-InternationalConferenceonBiomedicalElectronicsandDevices
310