CONTEXT ANALYSIS FOR SEMANTIC MAPPING OF DATA

SOURCES USING A MULTI-STRATEGY MACHINE LEARNING

APPROACH

Youssef Bououlid Idrissi and Julie Vachon

DIRO, University of Montreal

Montreal (Quebec), Canada

Keywords:

context analysis, semantic mapping, data sources alignment, machine learning, multi-strategy, semantic web.

Abstract:

Be it on a webwide or inter-entreprise scale, data integration has become a major necessity urged by the

expansion of the Internet and of its widespread use for communication between business actors. However,

since data sources are often heterogeneous, their integration remains an expensive procedure. Indeed, this

task requires prior semantic alignment of all the data sources concepts. Doing this alignment manually is quite

laborious especially if there is a large number of concepts to be matched. Various solutions have been proposed

attempting to automatize this step. This paper introduces a new framework for data sources alignment which

integrates context analysis to multi-strategy machine learning. Although their adaptability and extensibility

are appreciated, actual machine learning systems often suffer from the low quality and the lack of diversity of

training data sets. To overcome this limitation, we introduce a new notion called “informational context” of

data sources. We therefore brieﬂy explain the architecture of a context analyser to be integrated into a learning

system combining multiple strategies to achieve data source mapping.

1 INTRODUCTION

Machine learning systems using a multi-strategy ap-

proach are composed of a set of basic learners. Al-

though independent, these learners are coordinated

by a special unit called meta-learner. These ma-

chine learning systems are adaptive since they can de-

ploy learners able to specialize in the processing of a

speciﬁc type of information (e.g. ﬁeld names, data

types, etc.). These systems are also easily extensi-

ble since new learners, developed independently, can

naturally be integrated under the control of a meta-

learner. Each basic learner is responsible for auto-

matically mapping the elements of two given data sets

(source data onto target data) according to its speciﬁc

knowledge. The main tasks of basic learners are the

following:

• learning data mappings from training examples

provided by a user.

• generating mappings between new data sets by us-

ing the classiﬁcation model settled during the train-

ing stage.

As for the meta-learner, its task is to combine all the

mapping proposals issued by the basic learners and to

compute a ﬁnal matching for each concept present in

the source data set.

It is well-known that the precision of the mapping

directly depends on the quantity and quality of the in-

formation used in the training set. Most of the time,

this training set is composed of data and their corre-

sponding data scheme (XSD, DTD, RDFS, etc.). This

selection appears to be too thin and restrictive if one

hopes to unveil semantic ambiguities which makes it

hard to identify implicit relations between concepts.

Achieving accurate semantic analysis is thus a major

challenge. Here are some examples of problems one

can meet:

• The attribute date1 of an entity Command does

not indicate if it represents the invoicing date, the

delivery date or the reception date.

• The attribute name of an entity Employee does

not specify if the content is about the ﬁrst name or

the full name of the employee.

• The attribute amount of an entity Invoice does

not allow one to know if indicated values include

tax or not, neither does it specify the used currency

type.

• The attribute cl of a entity Command uses an ab-

445

Bououlid Idrissi Y. and Vachon J. (2005).

CONTEXT ANALYSIS FOR SEMANTIC MAPPING OF DATA SOURCES USING A MULTI-STRATEGY MACHINE LEARNING APPROACH.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 445-448

DOI: 10.5220/0002539804450448

 SciTePress

breviated form which makes it hard to automati-

cally identify that it denotes a c

ommand line.

Taken alone, a specialized learner can prove inef-

ﬁcient or inadequate. For example, a learner special-

ized in the mapping of ﬁeld names does not prove par-

ticularly outstanding when it comes to match names

which are not well-known synonyms (e.g. "‘com-

ment"’ and "‘outline"’), or names which abbrevi-

ate a concept (e.g. the name "‘home"’ used instead

of "‘telephone at home"’) or names whose broad

meaning would allow them to be matched with al-

most everything (e.g. "‘thing"’ or "‘entity"’). Sim-

ilarly, a content learner, that bases its semantic de-

duction on the frequency of lexical units appearing

in ﬁelds, would prove quite inadequate to analyze

numerical ﬁelds! Moreover, a learner relying on a

naive bayesian approach (Berlin and Motro, 2002; Pe-

dro Domingos, 1997; Kohavi, 1996) would not be

proﬁtable for analyzing ﬁelds accepting values of nu-

merical or enumerated types (e.g. "‘gender"’).

Hence, this article proposes to broaden and diver-

sify both data sets and training sets by extending them

with documents coming from, what we call, the in-

formational context of data sources. Indeed, isolat-

ing a data source from its context (as it is the case

when solely considering its XML schema) reduces

beforehand the set of usable cognitive information

which underlies the conceptualization of the repre-

sented data. In practice, the context within which the

data source lies, constitutes a precious fount of infor-

mation calling for a more systematic exploration so as

to better deﬁne the semantics of concepts.

In the sequel, the notion of informational context is

deﬁned and the architecture of a context analyzer is

presented.

2 INFORMATIONAL CONTEXT

The informational context of a data source is com-

posed of all the information, saved in electronic for-

mat, which belongs to the data source’s environment

and shares the same domain.

1. The descriptive context of a data source gathers all

the speciﬁcation ﬁles describing the data or their

application environment. These ﬁles document the

data according to various abstraction levels. For

example, if the data source is a database the de-

scriptive context could be composed of the follow-

ing documents:

• A requirements document describing data and

services which the user calls for in applications

using the database. A test plan for instance,

might be practical to establish the link between

input and output data what could hide relevant

complex concepts.

• Analysis and design speciﬁcations including the

various formal and semi-formal models elabo-

rated for applications relying on the database.

Data dictionnaries are worth citing under this

category. It describes in a formal fashion, among

others, data ﬂows, data structures et data de-

posits. As an example, consider a structure de-

scription of the concept "Order", using regular

expressions:

Order = O_Header + O_Item

∗

+ O_F ooter

O_Header = O_Number + Date + CustAdress

O_Item = ItemN um + Descr + Qty + P rice

O_F ooter = T axAmount + T otalAmount

This provides relevant information about compo-

sitions and dependencies of "Order" and "Items"

concepts. Furthermore, detailed description of

each data element can be obtained from a data

description deposit.

• User manuals. In the same way as for dictio-

naries, one can think of the numerous formula

linking concepts present in a such resource.

2. The operational context of a data source is com-

posed of all the data management and processing

ﬁles. Among others, these ﬁles can be

• programs written in any known programming

paradigm and language. The way concepts

are manipulated could hide valuable information

about how they are linked to each other.

• Files containing SQL-type requests.

For each data source, the important is to list all the

documents which may compose the descriptive and

operational contexts of this data set. The analysis of

these documents (in addition to the analysis of the

data themselves and their schema deﬁnition) will help

enhancing the knowledge required by the learners to

deduce the best semantic mapping between the given

data sources.

3 CONTEXT ANALYSIS

The main objective of context analysis consists in ex-

panding data sources with semantic information and

hints drawn from the context. Among other, this in-

formation is intented to be used by learners during

their training stage to increase the precision of the

mapping they are asked to compute.

In particular, context analysis offers an interesting

opportunity for resolving complex mappings which

are pairing combinations of concepts (e.g.(street,

zip code, city) → employee_address).

To the best of our knowledge, there is still no

satisfactory solution addressing this problem al-

though it is frequently encountered. For example,

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

446

we can imagine a context analyzer detecting the

presence of a function call "‘concat(address,

zip_code, city)"’ in a program ﬁle. The ana-

lyzer should therefore be able to add the deﬁnition

of a new concept complete_address to the con-

cerned data source and present it as a new candidate

to be mapped (hence if complete_address →

employee_address then our problem is solved

by transitivity). As mentioned earlier, learners spe-

cialized in ﬁeld name analysis may often prove in-

effective when processing, for example, abbreviated

names or names with a very broad meaning. Mak-

ing the most of the context, it is hence possible to

tag ﬁeld names with additional information to make

them more signiﬁcant. From this viewpoint, a simple

documentation text ﬁle may turn out to be a valuable

source of information to analyze if it contains a com-

plete data description table (with two columns: one

for data names, the other for their description).

Furthermore, the analysis of formal annotations in

models may also contribute to reﬁne mapping solu-

tions. Among others, OCL is a formal language (Bot-

ting, 2004) which allows the expression of constraints

in object-oriented models. In fact, OCL is a standard

of the OMG

and can be used together with the UML

to constrain models. Let’s consider the UML class

diagram shown on Figure 1.

University Student

1...*1

context

University

inv:

Student.allInstances ->forAll(s1,s2 |

s1<>s2 implies s1.name <> s2.name

Figure 1: OCL invariant

Using OCL invariant constraints, one can easily

specify that there cannot be two students with the

same name within the same university. This OCL in-

variant (speciﬁed in the attached note) must be veri-

ﬁed by all data sets constructed on this model. One

must therefore ensure that no computed mapping pro-

duces target data violating the invariant. The analysis

of such invariants contributes de facto to the accuracy

of computed mappings.

To summarize, context analysis (c.f. Figure 2) is a

process applied to a data source and its context and

producing:

• an enriched data source, which can be used by

learners as a training set or as a candidate data set

for mapping.

Object Management Group

Context

analyser

data source

informational

context

enriched

data source

integrity

constraints

Figure 2: Context analysis

• a set of integrity constraints, that computed map-

pings must ensure.

Data sources enhancement consists, among others,

in replacing ambiguous or vaguely deﬁned concept

names with more signiﬁcant ones. It may also en-

rich the source with new names to deﬁne composite

concepts.

4 ARCHITECTURE OF THE

CONTEXT ANALYZER

The informational context of a data source may in-

clude a wide range of document types. As mentioned,

they can differ in their style, which can either be de-

scriptive or operational. They can also differ in the

formality degree of their contents. Some may have

a formal content (e.g. formal models such as tran-

sition systems, programs, etc.), some may rely on a

semi-formal notation(e.g. decision tables, UML dia-

grams, etc.), while others are written without any for-

mality consideration (e.g. textual documents, draw-

ings, etc.). Of course, informal documents are the

most demanding regarding analysis. For efﬁcient re-

sults, it is important that the analysis addresses each

type of contents individually, taking into account its

speciﬁcities.

Program

Analyzer

C++ program

Analyzer

Java program

Analyzer

Fortran program

Analyzer

SQL

Analyzer

Table

Analyzer

Main

Meta-Analyzer

Figure 3: Architecture of a context analyzer

Hence, each type of contents is treated by a ana-

lyzer trained for this speciﬁc kind of information. Fol-

lowing a multi-strategy approach, we advocate an ar-

chitecture (c.f. Figure 3) composed of many special-

ized analyzers coordinated by a main meta-analyzer.

CONTEXT ANALYSIS FOR SEMANTIC MAPPING OF DATA SOURCES USING A MULTI-STRATEGY MACHINE

LEARNING APPROACH

447

The role of the meta-analyzer consists in running

through the context documents, identifying the var-

ious contents types and determining which special-

ized analyzer is the most appropriate to take care of it.

Results returned by specialized analyzers are then ﬁl-

tered (redundancy elimination) and combined by the

meta-analyzer. The context analyzer has a hierarchi-

cal architecture of arbitrary depth. Indeed, a child an-

alyzer can itself supervise, as a meta-analyzer, a set

of more specialized analyzers. As illustrated on Fig-

ure 3, a program analyzer can itself coordinate the

processing of many specialized analyzers, each one

being an expert in the parsing of a speciﬁc program-

ming language.

Finally, it is possible for analyzers to collaborate

with each other without being related in the coordi-

nation hierarchy. For example, a program analyzer

could resort to a SQL analyzer for the analysis of a

request inserted in the program it is parsing.

5 CONCLUSIONS

Exploitation of contextual resources is a promising

approach to solve the problem of semantic alignment

of data sources (Tierney and Jackson, 2004). Docu-

ments composing the context of a data source offer

valuable information which can help resolving major

problems such as the identiﬁcation of complex map-

pings (e.g. mapping a combinination of source con-

cepts onto a single target concept).

As an extension to machine learning architectures

using a multi-strategy approach for semanctic map-

ping (Doan et al., 2003; Berlin and Motro, 2002;

Doan et al., 2002; Kurgan et al., 2002), we propose

a context analyzer based on multiple specialized an-

alyzers which are themselves coordinated by a meta-

analyzer. This architecture has both the advantage of

being adaptative and extensible. The hierarchical or-

ganization of analyzers allows a more efﬁcient repar-

tition of tasks between units. It also facilitates ana-

lyzing the context according to different abstraction

levels. Hence, the context analyzer can enhance data

sources with a range of details going from the most

general (with low time cost) to the most speciﬁc (with

higher time cost).

Moreover, this way of exploiting the context can

prove to pay off for the effort put into data documen-

tation. It also reduces the importance of user inter-

vention in the mapping, which can be tedious, costly

and error prone.

Although only two enrichment types have been

brought up in this paper (i.e. expliciting abbrevi-

ated ﬁeld names and identifying composite concepts),

many other types of data source enrichment can be

achieved from context analysis. Among others, our

project aims at identifying and experimenting various

enrichment strategies (based on context information)

which could help improve the quality of data source

mappings.

REFERENCES

Berlin, J. and Motro, A. (2002). Database schema matching

using machine learning with feature selection. In Pro-

ceedings of the 14th International Conference on Ad-

vanced Information Systems Engineering, pages 452–

466.

Botting, R. J. (2004). The object

constraint language. Web site.

http://www.csci.csusb.edu/dick/samples/ocl.html.

Doan, A., Domingos, P., and Halevy, A. (2003). Learning

to match the schemas of databases- a multistrategy ap-

proach. Journal of Machine Learning, 50(3):279–301.

Doan, A., Madhavan, J., Domingos, P., and Halevy, A.

(2002). Learning to map between ontologies on the se-

mantic web. In Proceedings of the 11th international

conference on World Wide Web, pages 662–673.

Kohavi, R. (1996). Scaling up the accuracy of naive-bayes

classiﬁers: A decision-tree hybrid. In Press, A., ed-

itor, Proceedings of the Second International Con-

ference on Knowledge Discovery and Data Mining,

pages 202–207, Portland, OR.

Kurgan, L., Swiercz, W., and Cios, K. J. (2002). Se-

mantic mapping of xml tags using inductive machine

learning. In Proceedings of the 2002 International

Conference on Machine Learning and Applications

(ICMLA’02), pages 99–109, Las Vegas, NV.

Pedro Domingos, M. P. (1997). On the optimality of the

simple bayesian classiﬁer under zero-one loss. Ma-

chine Learning, 29:103–130.

Tierney, B. and Jackson, M. (2004). Contextual semantic

integration for ontologies. In Doctoral Consortium

of the 21rst Annual British National Conference on

Databases.

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

448