AUTOMATIC DISCOVERY OF SEMANTIC RELATIONSHIPS BETWEEN SCHEMA ELEMENTS

Nikos Rizopoulos

Imperial College

London, England

Keywords:

Automatic schema matching, semantic relationships, data integration.

Abstract:

The identification of semantic relationships between schema elements, or schema matching, is the initial step in the integration of data sources. Existing approaches to automatic schema matching have mainly been concerned with discovering equivalence relationships between elements. In this paper, we present an approach that automatically discovers richer and more expressive semantic relationships based on a bidirectional comparison of the elements' data and metadata. The experiments that we have performed on real-world data sources from several domains show promising results, considering that we do not rely on any user or external knowledge.

1 INTRODUCTION

The integration of heterogeneous data sources is a well-known research subject. Its key issue is the identification of semantic relationships between schema elements (Kashyap and Sheth, 1996), which is a labor-intensive and time-consuming process when performed manually. Automatic schema matching resolves this problem by automatically discovering semantic relationships between schema elements.

Several approaches can be found in the literature concerned with automatic schema matching. Most of them are focused on discovering equivalence relationships between elements. However, in many cases more expressive relationships exist. For example, element person subsumes student and the elements postgraduate and undergraduate can be merged into student. Such matches are called indirect (Xu and Embley, 2003).

Example: Figure 1 illustrates cut-down representations of two databases of the Computing Department at Imperial College, London. Schema $S_1$ shows that staff members tutor undergraduate students and supervise PhD students. In $S_1$ the element course represents all the non-laboratory courses, i.e. all the courses that have lectures in theatres. Undergraduate students register on these courses and members of staff teach them. In the college, PhD students assist in both tutorials and lab demonstrations. This is depicted in schema $S_2$, where course describes both laboratory courses and courses that have tutorials. Element staff in $S_2$ represents the members of staff that supervise tutorials and laboratories, who can be either lecturers or teaching assistants.

We have asserted the constraints that each PhD student has to assist in at least one course and that each lecturer has to teach at least one non-laboratory course and supervise one laboratory course. Also, non-laboratory courses might not have any tutorials. These constraints implicitly express that the phd elements in $S_1$ and $S_2$ represent identical sets of PhD students (direct match), that the concept of staff in $S_2$ subsumes the concept of staff in $S_1$ (indirect match), and that the two course elements intersect, since both include those non-laboratory courses that have tutorials (indirect match).

In this paper, we describe a framework to automatically discover matches like the ones in the example. Our goal is to identify semantic relationships between elements without relying on external knowledge, like ontologies, user knowledge or schema structure. We adopt a composite approach that exploits several types of information (element names, data instances, statistical information on the data) to discover incompatible, disjoint, intersecting, subsuming and equivalent elements. Our methodology performs a bidirectional comparison of the elements, which proves to be indicative of these types of semantic relationships.

Rizopoulos N. (2004).

AUTOMATIC DISCOVERY OF SEMANTIC RELATIONSHIPS BETWEEN SCHEMA ELEMENTS.

In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 3-8

DOI: 10.5220/0002611900030008

 SciTePress

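Figure 1: Source Schemas $S_1$ and $S_2$. (Diagram not reproduced. Labels in the first schema: ug_login, ugname, phd, phd_login, pname, course, course_id, cname, tch, staff, staff_login, sname, tut, sup, reg. Labels in the second schema: staff, staff_login, sname, phd, phd_login, assist, course, course_id, cname, hours, date, activity.)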

This paper contributes a formal definition of semantic relationships between schema elements and a method for the automatic discovery of these relationships. As far as we know, no other automatic schema matching approach discovers disjointness and intersection relationships without relying on external knowledge. In addition, this paper proposes an innovative composite architecture that differentiates between modules that identify schema matches and modules that clarify the type of the relationship in each match.

The structure of this paper is as follows. In Section 2, we define the five types of semantic relationships that our methodology identifies. Section 3 explains the way a bidirectional comparison of schema elements can assist in the clarification of their semantic relationship. Section 4 shows the architecture of our framework and describes the implemented relationship identification and relationship clarification modules. In Section 5 the results of the experiments we have conducted to evaluate our approach are presented. Section 6 describes related schema matching approaches and Section 7 gives our concluding remarks and directions for further work.

2 SEMANTIC RELATIONSHIPS

Various types of semantic relationships between schema elements have been defined in the literature. In (Larson et al., 1989), a non-automated approach for schema integration is proposed, based on manually identified semantic relationships between elements. We adopt similar relationship definitions, except for disjointness and incompatibility. We define $Inst_{ext}(x)$ to be the instances of an element $x$ that are currently stored in the data source, and $Dom_{ext}(x)$ the extensional domain of the element, i.e. all its possible valid instances. We also define $Ent_{int}(x)$ as the intensional entities of $x$, i.e. the real-world entities that map to the instances of $Inst_{ext}(x)$, and $Dom_{int}(x)$ as the intensional domain of $x$, i.e. the real-world entities that map to $Dom_{ext}(x)$.

Five types of semantic relationship between schema elements are identified based on the comparison of their intensional domains. These relationships are:

1. equivalence: Two schema elements A and B are equivalent, $A = B$, iff
   $Dom_{int}(A) = Dom_{int}(B)$

2. subsumption: Schema element A subsumes schema element B, $B \subset A$, iff
   $Dom_{int}(B) \subset Dom_{int}(A)$

3. intersection: Two schema elements A and B are intersecting, $A \cap B$, iff
   $Dom_{int}(A) \cap Dom_{int}(B) \neq \emptyset$,
   $\exists C : Dom_{int}(A) \cap Dom_{int}(B) = Dom_{int}(C)$

4. disjointness: Two schema elements A and B are disjoint, $A \not\cap B$, iff
   $Dom_{int}(A) \cap Dom_{int}(B) = \emptyset$,
   $\exists C : Dom_{int}(A) \cup Dom_{int}(B) \subseteq Dom_{int}(C)$

5. incompatibility: Two schema elements A and B are incompatible, $A \neq B$, iff
   $Dom_{int}(A) \cap Dom_{int}(B) = \emptyset$,
   $\neg\exists C : Dom_{int}(A) \cup Dom_{int}(B) \subseteq Dom_{int}(C)$

The notation $\exists C : condition$ means that there is a real-world concept that can be represented by an existing or non-existing schema element C that satisfies the condition. The notation $\neg\exists C : condition$ in the definition of incompatibility means that there is no real-world concept that would be represented by a schema element C to satisfy the specified condition. Throughout this paper, we are going to use the term semantically compatible schema elements when elements are related with a semantic relationship other than incompatibility.
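
As an illustration of the five definitions, the following Python sketch classifies a pair of elements whose intensional domains are given explicitly as sets. It is not part of the framework, which cannot access intensional domains directly. The function name `classify`, the returned labels and the `common_super_concept` flag are assumptions made for this example; the flag stands in for the condition $\exists C$ of definitions 4 and 5, which cannot be derived from the sets alone, and the $\exists C$ condition of definition 3 is assumed to hold whenever the domains overlap.

```python
def classify(dom_a: set, dom_b: set, common_super_concept: bool) -> str:
    """Classify elements A and B by comparing their intensional domains.

    common_super_concept is a hypothetical stand-in for the condition
    'there is a concept C whose domain contains both domains'.
    """
    if dom_a == dom_b:
        return "equivalence"
    if dom_b < dom_a:            # proper subset: A subsumes B
        return "A subsumes B"
    if dom_a < dom_b:            # proper subset: B subsumes A
        return "B subsumes A"
    if dom_a & dom_b:            # non-empty overlap, neither contains the other
        return "intersection"
    # Empty overlap: disjoint if a common super concept exists, else incompatible.
    return "disjointness" if common_super_concept else "incompatibility"
```

For instance, with made-up entity identifiers, `classify({"u1"}, {"p1"}, True)` corresponds to the ug and phd elements, which are related through the concept student.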

There are four pairs of compatible elements in the example of the previous section. The phd elements in $S_1$ and $S_2$ are equivalent because they represent identical sets of PhD students, based on the constraint that each PhD student in $S_1$ must assist in at least one course, i.e. each entity of $S_1$.phd belongs to the set of entities represented by $S_2$.phd and vice versa. The elements ug and phd in $S_1$ and $S_2$ respectively are disjoint because each student in this particular example can be either an undergraduate or a PhD student. Thus, there is a concept (student) that subsumes the union of ug and phd. Element staff in $S_2$ subsumes staff in $S_1$, because $S_2$.staff represents all the lecturers, i.e. all the entities of $S_1$.staff, in addition to teaching assistants. The two course elements intersect because they have a common set of entities that can be represented by the concept courses_with_tutorials;

ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

some courses of $S_1$ have tutorials, i.e. they belong to the entities represented by $S_2$.course, and the lab courses of $S_2$.course are not included in $S_1$.course.

3 DISCOVERING SEMANTIC RELATIONSHIPS

In order to discover the semantic relationships defined in the previous section, we perform a bidirectional comparison of the schema elements. Given two schema elements $X$ and $Y$, we define $d(X, Y)$ as the similarity degree produced by the comparison of element $X$ against $Y$, and $d(Y, X)$ as the similarity degree produced by the comparison of $Y$ against $X$. We call $d(X, Y)$ and $d(Y, X)$ bidirectional similarity degrees. Intuitively, the more similar $X$ is to $Y$, the higher the similarity degree $d(X, Y)$ will be. Essentially, $d(X, Y)$ indicates to what extent $X$, with $Dom_{int}(X) \neq \emptyset$, is a subset of $Y$, ranging from 0, if none of the entities of $X$ are entities of $Y$, to 1, if the set of entities of $X$ is a subset of the entities of $Y$. This can also be described by the following formula:

$$d(X, Y) = \frac{|Dom_{int}(X) \cap Dom_{int}(Y)|}{|Dom_{int}(X)|}$$

where $|S|$ denotes the number of entities in set $S$.
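
The formula can be made concrete with a minimal Python sketch over explicitly listed entity sets. The function name `degree` and the sample entities are assumptions made for illustration; in practice the intensional domains are not available in this form.

```python
def degree(dom_x: set, dom_y: set) -> float:
    """Idealized d(X, Y): the fraction of X's entities that are also entities of Y."""
    if not dom_x:
        raise ValueError("Dom_int(X) must be non-empty")
    return len(dom_x & dom_y) / len(dom_x)


# Hypothetical entities: lecturers are a proper subset of all staff.
lecturers = {"l1", "l2"}
all_staff = {"l1", "l2", "ta1", "ta2"}

d_xy = degree(lecturers, all_staff)   # every lecturer is a member of staff
d_yx = degree(all_staff, lecturers)   # only half of the staff are lecturers
```

Here `d_xy` is 1.0 and `d_yx` is 0.5, the asymmetric pattern expected when one element subsumes the other.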

The above formula gives high bidirectional similarity degrees for equivalent elements, high $d(X, Y)$ and low $d(Y, X)$ when $Y$ subsumes $X$, and medium-to-high bidirectional degrees when the elements intersect. Two problems arise when using this formula: (a) it cannot be computed automatically, since it requires a comparison of the elements' real-world entities ($Dom_{int}(X)$), and (b) it cannot differentiate between incompatibility and disjointness, since in both of these cases the intensional domains of the elements are disjoint, producing bidirectional similarity degrees equal to 0. A more detailed description of the problems that arise when an idealized approach, like the above formula, is used to discover semantic relationships between schema elements can be found in (Rizopoulos, 2003).

In our framework, we attempt to approximate this formula by examining the elements' instances ($Inst_{ext}(X)$) and their metadata, e.g. data types, names, lengths, etc. Based on this information, even disjoint elements exhibit similarity, which arises from their relationship with the same super element $C$ (see definition in the previous section). For example, usernames of PhD (phd_login) and undergraduate (ug_login) students follow the same format as any student username (login). Therefore, our approach resolves the problems mentioned previously: (a) automatic computation of the similarity degrees is feasible because element instances and metadata are freely available and (b) disjointness and incompatibility can be distinguished because disjoint pairs of elements will have higher bidirectional similarity degrees than incompatible pairs. In the case of intersecting elements, their common set of entities in their intensional domains defines a common set of instances in their extensional domains, i.e. the relationship is preserved across the domains. Thus, intersecting pairs of elements are going to display higher similarity degrees than disjoint pairs. Relationship preservation also applies to subsumption and equivalence, which suggests that the same bidirectional similarity degrees should be expected whether the intensional or the extensional domains of the elements are examined.

Figure 2: Bidirectional Similarity Comparison Graph. (Plot not reproduced. Axes: $d(X, Y)$ and $d(Y, X)$, each marked with disjointness, intersection and equivalence thresholds. Labelled areas: $X \neq Y$, $X \not\cap Y$, $X \cap Y$, $X \subset Y$, $Y \subset X$ and $X = Y$. Legend: equivalence, subsumption, intersection, disjointness and incompatibility pairs.)

Based on these observations, our insight on the way the bidirectional comparison can be applied in the identification of semantic relationships between elements is illustrated in Figure 2, the bidirectional comparison graph. The graph shows the areas where we expect the bidirectional degrees to position each pair of elements based on their semantic relationship. It is important to notice that the defined areas in the graph are fuzzy areas because they only represent an estimation of the expected results. For the same reason some areas of the graph are not covered.
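
One simplified way to read the comparison graph is sketched below in Python. This is only an illustration under strong assumptions: the areas are treated as crisp rather than fuzzy, the three threshold values are hypothetical defaults that do not come from the experiments, the uncovered areas of the graph are assigned arbitrarily, and the use of the smaller degree to separate intersection, disjointness and incompatibility is a choice made for this example.

```python
def region(d_xy: float, d_yx: float,
           disjoint_t: float = 0.2,     # hypothetical disjointness threshold
           intersect_t: float = 0.5,    # hypothetical intersection threshold
           equiv_t: float = 0.8) -> str:  # hypothetical equivalence threshold
    """Map a pair of bidirectional degrees to an area of the comparison graph."""
    high_xy = d_xy >= equiv_t
    high_yx = d_yx >= equiv_t
    if high_xy and high_yx:
        return "X = Y"
    if high_xy:                  # X is mostly contained in Y, but not vice versa
        return "X subset of Y"
    if high_yx:                  # Y is mostly contained in X, but not vice versa
        return "Y subset of X"
    low = min(d_xy, d_yx)        # assumption: the weaker direction decides
    if low >= intersect_t:
        return "X intersects Y"
    if low >= disjoint_t:
        return "X disjoint from Y"
    return "X incompatible with Y"
```

With these assumed thresholds, `region(0.2, 1.0)` falls in a subsumption area and `region(0.3, 0.25)` in the disjointness area.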

4 ARCHITECTURE

In this section, we describe our framework’s architec-

ture (Figure 3) and present each implemented compo-

nent.

Our framework consists of several comparison modules that exploit different types of information to determine the similarity of schema elements. These modules take as input the source schemas and their data instances and work independently to produce partial bidirectional similarity degrees. The degrees are partial in the sense that they are produced by comparing only part of the available information, e.g. element names only, and are therefore only partially correct. These degrees are later combined to provide the final bidirectional similarity degrees, which indicate (according to the discussion in the previous section) the semantic relationships between the schema elements.

Figure 3: Architecture. (Diagram not reproduced. Labels: relationship identification modules, partial bidirectional similarity degrees, Filter, bidirectional similarity degrees of compatible elements, Aggregator, aggregated bidirectional similarity degrees, relationship clarification modules, clarification similarity degrees, thresholds, Degree Combinator, semantic relationships.)

In our framework, there are two types of modules: relationship identification and relationship clarification modules. Relationship identification modules attempt to discover compatible pairs of elements and relationship clarification modules attempt to specify the type of the semantic relationship in each compatible pair.

Initially, the bidirectional similarity degrees produced by the modules are examined by the Filter component to separate the compatible from the incompatible pairs of elements. Then, the Aggregator combines the similarity degrees of the compatible elements and indicates their semantic relationships. It achieves this by mapping the compatible pairs onto the bidirectional graph. The output of the Aggregator becomes the input of the Degree Combinator, which, based on the relationship clarification modules and the fuzzy areas defined in the bidirectional graph, outputs the discovered semantic relationships. The user is then able to validate or reject these relationships and proceed to the data integration process.

All the components of the architecture that have been implemented in the prototype tool are now briefly described. More details can be found in (Rizopoulos, 2003).

The Element Name Module performs a case-insensitive comparison of element names. When element $X$ has exactly the same name as $Y$, then $d(X, Y) = d(Y, X) = 1$, and if $X$'s name is a substring of $Y$'s, then $d(X, Y) = 0.2$ and $d(Y, X) = 1$. The similarity degrees have been chosen so that when two elements have the same name, their bidirectional similarity degrees map them to the equivalence area of the graph. If one element's name is a substring of another's, e.g. login and phd_login, this is an indication of a subsumption relationship, and the pair is therefore mapped to the subsumption areas of the graph.
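
The name comparison described above can be sketched in a few lines of Python. The function name `name_degrees` is an assumption, and so is the value returned for unrelated names, which the description above does not specify; the degrees 1 and 0.2 are the ones stated in the text.

```python
def name_degrees(x: str, y: str) -> tuple[float, float]:
    """Return (d(X, Y), d(Y, X)) from a case-insensitive name comparison."""
    a, b = x.lower(), y.lower()
    if a == b:
        return 1.0, 1.0      # identical names: equivalence area
    if a in b:
        return 0.2, 1.0      # X's name is a substring of Y's
    if b in a:
        return 1.0, 0.2      # symmetric case: Y's name is a substring of X's
    return 0.0, 0.0          # assumption: unrelated names give no similarity
```

For example, `name_degrees("login", "phd_login")` yields `(0.2, 1.0)`, which places the pair in the area where phd_login is subsumed by login.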

The rest of the modules operate in a similar way. The Data Type Module compares element data types. The Numerical Statistics Module compares numerical elements on the average value, median value and standard deviation of their instances. The Non-numerical Statistics Module compares non-numerical elements based on the average number of appearances of special characters (@, $, -, etc.) in their instances. The Instances Module uses a Naive Bayes classifier to identify similarities between elements by comparing their instances. The Number of Instances Module is a simple module that is used for relationship clarification and compares the number of distinct instances of the elements. The Precision Module is a relationship clarification module that compares the range of each element's instances, and the Length Module compares the range of the elements' lengths. The Existence Module is a relationship clarification module that examines the existence of instances in elements.

The Filter component separates the compatible from the incompatible pairs of elements. For each pair, it computes the average of the bidirectional similarity degrees produced by the modules and compares it to a user-defined threshold. The Aggregator component indicates the type of the semantic relationship for each pair of compatible elements by computing the product of the bidirectional similarity degrees. Modules in both the Filter and the Aggregator can have auxiliary roles, i.e. they can only increase the similarity degrees produced by other modules. The Degree Combinator uses the output of the relationship clarification modules and the Aggregator to determine the semantic relationships between the elements.
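
The two combination steps can be sketched in Python as follows. This is an illustrative reading rather than the prototype's code: the names `filter_pair` and `aggregate` are assumptions, auxiliary modules are ignored, and the rule that a pair passes the Filter when the larger of its two averaged degrees reaches the threshold is also an assumption (it keeps subsumption pairs, whose degree is low in one direction).

```python
from math import prod
from statistics import mean

# One partial result per module: (d(X, Y), d(Y, X)).
Degrees = tuple[float, float]


def filter_pair(partials: list[Degrees], threshold: float) -> bool:
    """Keep a pair if its averaged degrees reach the user-defined threshold."""
    avg_xy = mean(d[0] for d in partials)
    avg_yx = mean(d[1] for d in partials)
    # Assumption: one sufficiently high direction is enough to keep the pair.
    return max(avg_xy, avg_yx) >= threshold


def aggregate(partials: list[Degrees]) -> Degrees:
    """Combine the partial degrees of a compatible pair by multiplication."""
    return prod(d[0] for d in partials), prod(d[1] for d in partials)
```

For two hypothetical module outputs `[(1.0, 0.5), (0.5, 0.5)]`, the averages are 0.75 and 0.5, so the pair passes a threshold of 0.6, and the aggregated degrees are `(0.5, 0.25)`.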

5 EXPERIMENTS

We have evaluated our prototype tool on three schema

matching tasks of real-world data sources that come

from three different domains.

The first task, called Pop&Geo, is between two data sources with geography and population data. The second task, University, is on two relational databases that store information about tutorials and students, and the third task, Real-Estate, is on two real-estate data sources that list houses for sale.

Table 1: Problem Size and Schema Similarity in each Integration Task

                  Pop&Geo   University   Real-Estate
#pairs                897         4389           868
#equivalents            0            7             0
#subsets               12            6             0
#intersections          8           46             0
#disjoints             18           57            18
#incompatibles        859         4273           850

Figure 4: Filter and Aggregator Results. (Plots not reproduced. Panels: (a) Filter, with the threshold marked, and (b) Aggregator; both axes run from .1 to 1. Legend: equivalence, subsumption, intersection, disjointness and incompatibility pairs.)

Table 1 presents the problem size and the schema similarity in each matching task. It shows the number of all possible pairs of elements between the schemas and, for each type of semantic relationship, the number of its appearances detected manually. The semantic relationships discovered by the tool are compared to these manually identified relationships.

Figure 4 illustrates the results of the Filter and Aggregator components for the University task. Each pair of elements is plotted using the appropriate symbol based on the manually identified relationship of the pair. The position where each pair is plotted depends on the bidirectional degrees produced by the tool. In Figure 4(a), the pairs of elements below the user-defined threshold have been omitted for clarity. As can be seen, the Filter discards most of the incompatible pairs and the Aggregator attempts to map the compatible ones onto the correct areas of the comparison graph.

To examine the reliability and the coverage of the tool, i.e. how many of the identified relationships are correct and how many of the existing relationships are identified, we have used precision and recall. If $C$ is the number of correctly identified relationships, $F$ the number of incorrectly identified relationships and $A$ the number of all manually identified relationships, then precision is the fraction $C/(C + F)$ and recall is $C/A$. Figure 5 shows the precision and recall of the tool for each type of semantic relationship in each task. In the first task, the precision and recall bars for the equivalence relationship are missing because there are no equivalent elements between the schemas. The same applies in the third task, where there are no equivalent, subsuming or intersecting elements.

Figure 5: Precision and Recall. (Bar chart not reproduced. Vertical scale from 0.1 to 1.0; precision and recall bars per relationship type: subsumption, intersection and disjointness for Pop&Geo; equivalence, subsumption, intersection and disjointness for University; disjointness for Real-Estate.)
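
The two measures can be computed as in the short Python sketch below. The function name and the counts in the usage note are hypothetical and are not taken from the experiments.

```python
def precision_recall(correct: int, incorrect: int, manual: int) -> tuple[float, float]:
    """Return (precision, recall) for one relationship type in one task.

    correct   -- C, relationships identified correctly by the tool
    incorrect -- F, relationships identified incorrectly by the tool
    manual    -- A, relationships identified manually
    """
    precision = correct / (correct + incorrect)   # C / (C + F)
    recall = correct / manual                     # C / A
    return precision, recall
```

With the hypothetical counts C = 3, F = 1 and A = 6, `precision_recall(3, 1, 6)` gives `(0.75, 0.5)`. When a relationship type does not occur in a task, the divisions are undefined, which is why the corresponding bars are missing from Figure 5.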

In the first experiment, the low precision for intersection and disjointness comes from incompatible pairs of elements with intersecting ranges. The high recall in the first experiment means that very few relationships are lost and therefore the user will only have to reject the wrongly identified ones.

In the second experiment the main problems are caused by (a) elements that are sequences of numbers, in particular automatically incremented primary key attributes, (b) elements that have a small number of distinct instances, and (c) character elements whose instances are small strings.

In the third experiment, the precision for disjointness is affected by incompatible pairs of elements that should have been discarded by the Filter. The problems are caused by composite elements that contain multiple values in each one of their instances, and elements with small numerical instances.

Overall in the three experiments, the average precision for each semantic relationship is 100% for equivalence, 75% for subsumption, 46% for intersection and 39% for disjointness. The average recall is 77% for equivalence, 58% for subsumption, 71% for intersection and 96% for disjointness. (The average values for the equivalence relationship come from the University task, but in general we expect high precision and recall for equivalence in all tasks.) These results are encouraging, considering that there was no user intervention on the data or the data sources. In addition, no external knowledge was used, such as user-supplied training data, user-defined concept hierarchies, synonym tables, online ontologies or dictionaries, nor were any assumptions made about the data instances or the sources.

6 RELATED WORK

Several approaches concerned with automatic schema matching exist in the literature. Most of them focus on discovering equivalence relationships (Doan et al., 2002; Madhavan et al., 2001), some also identify subsumption relationships (Bergamaschi et al., 1998) and some intersection (Hakimpour and Geppert, 2002). However, subsumption and intersection are discovered using external knowledge, like ontologies and thesauri, or user knowledge. Our approach identifies equivalence, subsumption, intersection and disjointness relationships by examining only element metadata and data instances, without any user intervention.

The work most related to ours is the one presented in (Xu and Embley, 2003), where direct and indirect matches between elements are discovered. Direct matches are identified between equivalent elements and indirect matches are identified between (a) subsuming elements, (b) boolean elements and elements whose instances contain the boolean elements' names, and (c) elements whose instances can be merged or split. These relationships are discovered based on schema information, ontologies and regular expressions defined to match the instances of elements.

Our framework covers all the relationships of (Xu and Embley, 2003), except for the last one (c), which in some cases is similar to our disjointness relationship. In the case of boolean elements, our methodology replaces their true and false instances with the elements' names and the concatenation of not and their names, respectively, since the actual instances do not provide much information. Therefore, if one element contains the name of a boolean element in its instances, this relationship will be identified. In our framework, we also identify intersecting elements, which are not considered in (Xu and Embley, 2003).

GLUE (Doan et al., 2002) is also similar to our work. It proposes a bidirectional comparison of schema elements, but it produces a single similarity degree, which takes the lowest value when the elements do not have any common instances and the highest when the elements are equivalent. Therefore, the semantic relationships described in this paper cannot be discovered by this approach.

7 CONCLUSIONS

In this paper, we have presented our approach to automatically discover semantic relationships between schema elements. Based on a bidirectional comparison of the elements' metadata and instances and without any user or external knowledge, we are able to discover equivalence, subsumption, intersection, disjointness and incompatibility relationships. We have shown our framework's architecture and described the components that we have implemented in the prototype tool. Our experimental results are promising, with a 66% average precision and 75% average recall.

In the future, we are going to focus on the filtering process, since low precision has been mainly caused by incompatible pairs of elements that have not been discarded. We can consider assigning weights to modules based on their importance and reliability. Precision can also be improved by detecting automatically incremented elements and elements with small domains. A brute-force module can assist in this process, and it would impose only a small overhead to exhaustively compare a small number of instances.

Additionally, in the future we are going to extend our prototype tool with a graphical user interface, which will permit the user to validate or reject the semantic relationships identified by our methodology, and a component which will integrate the input data sources based on the validated relationships.

REFERENCES

Bergamaschi, S., Castano, S., di Vimercati, S., Montanari, S., and Vincini, M. (1998). An intelligent approach to information integration. In International Conference on Formal Ontology in Information Systems (FOIS'98), Italy, pages 253–267.

Doan, A., Madhavan, J., Domingos, P., and Halevy, A. (2002). Learning to map between ontologies on the Semantic Web. In Proceedings of the World-Wide Web Conference (WWW-02), pages 662–673.

Hakimpour, F. and Geppert, A. (2002). Global schema generation using formal ontologies. In Proceedings of ER02, volume 2503 of LNCS, pages 307–321. Springer-Verlag.

Kashyap, V. and Sheth, A. (1996). Semantic and schematic similarities between database objects: a context-based approach. VLDB Journal, 5(4):276–304.

Larson, J., Navathe, S., and Elmasri, R. (1989). A theory of attribute equivalence in databases with application to schema integration. IEEE Transactions on Software Engineering, 15(4):449–463.

Madhavan, J., Bernstein, P. A., and Rahm, E. (2001). Generic schema matching with Cupid. In Proc. 27th VLDB Conference, pages 49–58.

Rizopoulos, N. (2003). Discovery of semantic relationships between schema elements. Technical report, AutoMed Project.

Xu, L. and Embley, D. W. (2003). Discovering direct and indirect matches for schema elements. In 8th International Conference on Database Systems for Advanced Applications (DASFAA '03), Kyoto, Japan, March 26–28, 2003, pages 39–46.
