RELEVANT VALUES: NEW METADATA TO PROVIDE INSIGHT ON

ATTRIBUTE VALUES AT SCHEMA LEVEL

Sonia Bergamaschi, Mirko Orsini

Dipartimento di Ingegneria dell’Informazione, Università di Modena e Reggio Emilia, Italy

Francesco Guerra

Dipartimento di Economia Aziendale, Università di Modena e Reggio Emilia, Italy

Claudio Sartori

Dipartimento di Elettronica, Informatica e Sistemistica, Università di Bologna, Italy

Keywords:

Metadata, querying, data integration, data mining.

Abstract:

Research on data integration has provided languages and systems able to guarantee an integrated intensional

representation of a given set of data sources. A signiﬁcant limitation common to most proposals is that only

intensional knowledge is considered, with little or no consideration for extensional knowledge.

In this paper we propose a technique to enrich the intension of an attribute with a new sort of metadata: the

“relevant values”, extracted from the attribute values. Relevant values enrich schemata with domain knowl-

edge; moreover they can be exploited by a user in the interactive process of creating/reﬁning a query. The

technique, fully implemented in a prototype, is automatic, independent of the attribute domain and it is based

on data mining clustering techniques and emerging semantics from data values. It is parametrized with vari-

ous metrics for similarity measures and is a viable tool for dealing with frequently changing sources, as in the

Semantic Web context.

1 INTRODUCTION

Integration of data from multiple sources is one of the

main issues facing the database and artiﬁcial intelli-

gence research communities. A common approach

for integrating information sources is to build a me-

diated schema as a synthesis of them. By managing

all the collected data in a common way, a mediated

schema allows the user to pose a query according to

a global perception of the handled information. A

query over the mediated schema is translated into a set

of sub-queries for the involved sources by means of

automatic unfolding-rewriting operations taking into

account the mediated and the sources schemata. Re-

sults from sub-queries are ﬁnally uniﬁed by data rec-

onciliation techniques (see (Lenzerini, 2002; Ben-

eventano and Bergamaschi, ) for an overview).

Research on data integration has provided lan-

guages and systems able to guarantee an integrated

representation of a given set of data sources. A sig-

niﬁcant limitation common to most proposals is that

only intensional knowledge is considered, with little

or no consideration for extensional knowledge.

In this paper, we describe a technique for provid-

ing metadata related to attribute values. Such meta-

data represent a synthesized extensional knowledge

emerging from the attribute values. We call these metadata “relevant values” as they provide the user with a synthetic description of the values of the attribute they refer to, representing its domain with a reduced number of values. We claim that such metadata are

useful when querying an integrated database, since

integration puts together in the same global class a

number of local semantically similar classes coming

from different sources and a set of global attributes

which generalize the local classes. Consequently, the

name/description of a global class/global attribute is

often generic, which significantly limits the effectiveness of querying. Let us suppose, for instance,

that the user has a good knowledge of a single source,

say “S”, and that she/he is interested in items whose

global attribute “A” contains the word “x”, according to the terminology of source “S”. The user could completely


Bergamaschi S., Orsini M., Guerra F. and Sartori C. (2007).

RELEVANT VALUES: NEW METADATA TO PROVIDE INSIGHT ON ATTRIBUTE VALUES AT SCHEMA LEVEL.

In Proceedings of the Ninth International Conference on Enterprise Information Systems - DISI, pages 274-279

DOI: 10.5220/0002376202740279

 SciTePress

miss the fact that in source “T” the word “y” refers

to a very similar attribute value, and therefore a query

with target “x” would return only a partial result w.r.t.

the contents of the global class. Moreover, ignoring

the values assumed by a global attribute may gener-

ate meaningless, too selective or empty queries. On

the other hand, knowing all the data collected from a

global class is infeasible for a user: databases contain

large amounts of data which a user cannot deal with. A

metadata structure derived from an analysis of the at-

tribute extension could be of great help in overcoming

such limitation.

This work is done in the context of the MOMIS

(Mediator envirOnment for Multiple Information

Sources) project

(Bergamaschi et al., 2001), a frame-

work to perform information extraction and integra-

tion from both structured and semi-structured data

sources, plus a query management environment able

to process incoming queries through the navigation of

the mediated schema. The MOMIS integration pro-

cess gives rise to a Global Virtual View (GVV) in the

form of Global Classes and global attributes of a set of data sources.

In (Beneventano et al., 2003), we proposed a par-

tial solution to the semantic enrichment of a GVV

by providing a semantic annotation of all the Global

Classes of the GVV with respect to the WordNet lexical database, and thus providing each term with a

well-understood meaning. Relevant Values will se-

mantically enrich a GVV, since they provide semantic

information about the data sources the GVV refers to.

Moreover, in (Beneventano et al., 2006) a ﬁrst heuris-

tic for calculating relevant values was described.

In this paper we improve the approach proposed

in (Beneventano et al., 2006), by providing a ﬂexi-

ble parametric technique to deal with string attributes.

It is not a severe limitation, as: (1) data coming

from web-site wrappers are generally represented as

strings; (2) several techniques have been developed

in literature for clustering numeric values where it

is easy to deﬁne element orderings (see (Jain et al.,

1999) for a survey). The method was implemented

in a prototype called RELEVANT (RELEvant VAlue

geNeraTor), which we describe in section 3.

The outline of the paper is the following: the next section defines the technique to elicit relevant values for a selected attribute, section 3 describes the implemented prototype and section 4 describes how relevant values may be exploited for querying data sources. Finally, section 5 sketches out some conclusions and future work.

See http://www.dbgroup.unimo.it for more publica-

tions about the project.

http://wordnet.princeton.edu/

2 ELICITING RELEVANT

VALUES FROM DATA

In the context of data integration, there are several models for representing ontologies. Without loss

of generality, let us refer to the concepts of MOMIS.

The GVV built with MOMIS is composed of Global

Classes (i.e. OWL Classes), with Global Attributes

(GA), i.e. OWL datatype attributes. Our goal is to extract the relevant values of a GA. Each relevant value

is described by a relevant value name and a set of val-

ues of the attribute domain.

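Definition 1 Given a class $C$ and one of its attributes $At$, a relevant value for it, $rv_{At}$, is a pair $rv_{At} = \langle rvn_{At}, values_{At} \rangle$. $rvn_{At}$ is the name of the relevant value, while $values_{At}$ is the set of values referring to it, i.e. $values_{At} \subseteq \pi_{At}(C)$.

The set of the relevant values of $At$, $RVS_{At} = \{rv_{1,At}, rv_{2,At}, \ldots, rv_{n,At}\}$, has two properties: 1) $\bigcup_{i=1}^{n} values_{i,At} = \pi_{At}(C)$; 2) $rvn_{i,At} \neq rvn_{j,At}\ \forall i \neq j$.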

For simplicity, in the following we will drop the At

index.

Now we should answer two questions: how can

we cluster the values of the domain in order to put to-

gether in a relevant value a set of values which are

strongly related? How can we choose the relevant

value names? The ﬁrst question will be answered by

means of data mining clustering techniques, adapted

to the problem at hand; the second will require the

intervention of the integration designer, but we will

provide an effective assistant.

2.1 Adding Semantics: The Root

Elements

Nomina sunt consequentia rerum

Giustiniano, Institutiones, Liber II, 7, 3

On the basis of an analysis of manufacturing

databases, we observed that it is frequent to have

string domains with values composed of many words, possibly abbreviated. We also observed that the

same word, or group of words, may be further qual-

iﬁed with multiple words in many ways. For exam-

ple, the attribute describing a kind of production for a

mechanical enterprise may contain the value “Mould”

and the values “Mould ejectors, Mould engineering,

...”. Thus, it is possible to devise a lattice of similar

values. The lattice semantics has to be veriﬁed by the

integration designer for the speciﬁc attribute domain,

but, in our experience, it is sound.

Names are consequences, or, one might say, the expres-

sions, of things.


Our idea is to exploit the semantics expressed by

such containment for building clusters of values. The

proposed technique ﬁnds out the “relevant” words in-

cluded in attribute values and uses them as roots to

build a lattice over the ﬂat set of values. The lattice

will be used together with the usual syntactic cluster-

ing methods, in order to effectively group values.

Definition 2 A root element $re = \langle ren, values \rangle$ is a relevant value computed by means of a function based on string containment: $Contains(X, Y) = true$ iff $stem(X) \supseteq stem(Y)$, where $X$ and $Y$ are sets of words.

A root element name $ren$ is defined as a value of the domain such that it does not contain any other value of the domain, formally: $ren \in \pi_{At}(C),\ \nexists x \in \pi_{At}(C) \mid Contains(ren, x),\ ren \neq x$.

The values of a root element are the values of the domain containing the root element name, i.e. $values = \{v \in \pi_{At}(C) \mid Contains(v, ren) = true\}$.

The definition implies that the set of all the root elements is closed w.r.t. the domain, i.e., if there exist $m$ root elements, $\bigcup_{j=1}^{m} values_j = \pi_{At}(C)$. $RES$ is the set of all the root elements of a property.

Our claim, supported by experience, is that root

elements do bear semantics, since they derive directly

from the linguistic choices of people who populated

the data source.
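
As an illustration of Definition 2, the following Python sketch computes root elements for a toy domain. It is only a sketch under stated assumptions: `stem_word`, the stop-word list and the sample values are hypothetical stand-ins (the paper does not specify the stemmer), and two values with the same stem set are treated as the same value.

```python
STOP_WORDS = {"and", "of", "the", "for"}  # assumed toy stop-word list


def stem_word(word):
    """Toy stemmer (an assumption): lowercase and drop a trailing 's'."""
    w = word.lower()
    return w[:-1] if len(w) > 3 and w.endswith("s") else w


def stem(value):
    """Set of stems of the non-stop-words in a string value."""
    return {stem_word(w) for w in value.split() if w.lower() not in STOP_WORDS}


def contains(x, y):
    """Contains(X, Y) is true iff stem(X) is a superset of stem(Y)."""
    return stem(x) >= stem(y)


def root_elements(domain):
    """Map each root element name to the domain values that contain it."""
    roots = [
        r for r in domain
        # A root element name contains no other value of the domain.
        if not any(contains(r, x) for x in domain if stem(x) != stem(r))
    ]
    return {r: [v for v in domain if contains(v, r)] for r in roots}


domain = ["Mould", "Mould ejectors", "Mould engineering", "Gear", "Gear cutting"]
res = root_elements(domain)
for name, values in res.items():
    print(name, "->", values)
```

On this toy domain the root element names are "Mould" and "Gear", and every value falls under at least one root element, which is the closure property stated above.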

2.2 Relevant Value Sets

As in most clustering tasks with non-numeric attributes, the problems are to find an effective representation of the points (i.e. the attribute values) in a space, and to devise a suitable similarity function to be exploited by the clustering algorithm. The technique we propose builds a binary representation of the

attribute values, by mapping each value into the uni-

verse of words generated considering all the values of

the property.

At present, we implemented two ways to build

some structure upon the ﬂat set binary representation:

1) the semantics of containment, expressed by the root

elements deﬁned above; and 2) the semantics of clus-

ters, mapping the words of attribute values in an ab-

stract space, and deﬁning a syntactic similarity func-

tion in such space.

The similarity measures computed with both the

semantics are then used by a clustering algorithm (in

RELEVANT the user may generate both partitions and

overlapped clusters). The algorithm produces clusters

of values which are the Relevant Values set.

2.3 Relevant Value Names

The name of a relevant value $rvn$ has to be representative of the elements belonging to the associated set of values. Consequently, let us consider the set of relevant values $RVS$ associated with a specific attribute $At$. The set of names $RVN = \{rvn_1, \ldots, rvn_n\}$ provides the integration designer with synthesized knowledge about the contents of $At$.

A relevant value name is typically the most general value among the values, i.e. given a generic $rv_i = \langle rvn_i, values_i \rangle$, $rvn_i$ is the most general value of $values_i$. The simplest way to detect a list of $rvn_i$ candidates is to use the Contains function: the integration designer may select the most appropriate among them.

In particular, let us consider a $rv_i \in RVS$. The list of $rvn_i$ candidates is calculated as follows: let $\bar{s} = \{v \in values_i \mid Contains(v', v) = true,\ v \neq v',\ \text{and}\ \nexists v'' \in values_i \mid Contains(v, v'') = true,\ v \neq v''\}$. If $\bar{s}$ is a singleton, then $rvn_i = \bar{s}$, otherwise the integration designer has to select the name among the elements of $\bar{s}$, or introduce another name.
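
The candidate list can be sketched in a few lines of Python. This is a hedged illustration, not the prototype's code: the stemmer is a lowercase-only placeholder, the sample clusters are invented, and the condition on v′ is read here as "some other value of the cluster contains v", which is one possible interpretation of the formula.

```python
def stem(value):
    """Placeholder stemming (an assumption): lowercase word set."""
    return {w.lower() for w in value.split()}


def contains(x, y):
    """Contains(X, Y) is true iff stem(X) is a superset of stem(Y)."""
    return stem(x) >= stem(y)


def name_candidates(values):
    """Values contained in another value that contain no other value."""
    candidates = []
    for v in values:
        others = [o for o in values if o != v]
        contained_in_other = any(contains(o, v) for o in others)
        contains_other = any(contains(v, o) for o in others)
        if contained_in_other and not contains_other:
            candidates.append(v)
    return candidates


cluster_a = ["Mould", "Mould ejectors", "Mould engineering"]
cluster_b = ["Gear cutting", "Gear grinding"]
print(name_candidates(cluster_a))  # a single candidate: usable as the name
print(name_candidates(cluster_b))  # no candidate: the designer supplies a name
```

When the list has exactly one element it can be taken as the name; an empty or longer list corresponds to the cases where the integration designer has to choose or introduce a name.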

3 THE RELEVANT PROTOTYPE

Figure 1 shows the RELEVANT functional architec-

ture, which is organized into four blocks:

1. Data pre-processing: a binary representation of

the values of an attribute is obtained with a ma-

trix where the columns represent the universe of

words (the extension of the property) and each

row represents the mapping of an attribute value.

2. Similarity Computation: two tasks are enabled:

the selection of the metrics for computing the sim-

ilarity between pairs of attribute values and the

selection of the semantics to be used: the seman-

tics of containment, the semantics of clusters or a

combination of both.

3. Relevant Values elicitation: this module imple-

ments some clustering algorithms to compute the

set of relevant values on the basis of the choices

made at step 2.

4. Validation: we implemented some standard techniques to evaluate cluster quality. At present, additional work and experiments are necessary to go beyond this simple evaluation, so as to provide effective assistance to the designer in the critical task of parameter configuration.

ICEIS 2007 - International Conference on Enterprise Information Systems


[Figure: RELEVANT (RELEvant VAlues geNeraTor), four numbered blocks. 1. Data pre-processing: stemming, stop words, binary representation. 2. Similarity computation: similarity measure selection; method selection (containment / clustering / combination of both). 3. Relevant Values elicitation: algorithm selection (hierarchical / overlapping); relevant name selection. 4. Validation: validation method selection (variance / count single / silhouette / ...). The diagram also carries the label "tuning".]

Figure 1: The RELEVANT functional architecture.

3.1 Step 1: Binary Representation of

Attribute Values

The starting point of the process is the Matching Table MT, that is a binary representation of the values of a property, obtained as follows:

1. Let $W = \pi_{At}(C) = \{w_1, \ldots, w_n\}$, the extension of $At$. Each string $w_i$ is found in at least one value and is a set of elementary words, say $w_i = \{w_{i1}, \ldots, w_{ik}\}$.

2. From information retrieval we import the concept of stemming, and we use a stemming operator for words, such that $stem(w) = a$, where $a$ is either a stem form of $w$ or $\emptyset$, if $w$ is a stop-word. For strings and sets of strings the definition of $stem()$ is easily extended: $stem(s) = \{stem(w_j)\}$ for $s = \{w_j\}$ and $stem(S) = \{stem(s_i)\}$ for $S = \{s_i\}$.

3. The universe of words considered is then $U = \bigcup_{w_i \in W} stem(w_i) = \{a_1, \ldots, a_m\}$, where the $a_j$ are numbered in lexicographic order.

4. The binary representation of $At$ is described by means of the rows of MT, where $MT = \|mt_{ij}\|_{|W| \times |U|}$, $mt_{ij} = \begin{cases} 1 & \text{if } a_j \in stem(w_i) \\ 0 & \text{otherwise} \end{cases}$

MT is typically sparse: the number of non-zero elements in each row equals the number of distinct stems in the associated attribute value, stop-words excluded. Our approach

applies clustering techniques to MT to obtain sets of

similar values and gather them in a single relevant

value.
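
The four steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the stemmer and the stop-word list are toy placeholders, and the sample values are invented.

```python
STOP_WORDS = {"and", "of", "the", "for"}  # assumed toy stop-word list


def stem_word(word):
    """Toy stemmer (an assumption): lowercase and drop a trailing 's'."""
    w = word.lower()
    return w[:-1] if len(w) > 3 and w.endswith("s") else w


def stem(value):
    """Stems of the words of a value; stop-words contribute nothing."""
    return {stem_word(w) for w in value.split() if w.lower() not in STOP_WORDS}


def matching_table(values):
    """Return (U, MT): the sorted universe of stems and the binary matrix."""
    stems = [stem(v) for v in values]
    universe = sorted(set().union(*stems))  # lexicographic order
    mt = [[1 if a in s else 0 for a in universe] for s in stems]
    return universe, mt


values = ["Mould", "Mould ejectors", "Gear cutting"]
universe, mt = matching_table(values)
print(universe)
for value, row in zip(values, mt):
    print(row, value)
```

Each row has as many ones as the value has distinct non-stop-word stems, which is why the table is sparse for large vocabularies.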

3.2 Step 2: Similarity Computation

The binary representation is used to calculate similar-

ity between attribute values. The computed similarity

is then exploited by a clustering algorithm to create

sets of values. Two tasks are executed in this step: the selection of the similarity measure and the selection of the kind of “exploited” semantics.

Concerning the ﬁrst task, we consider some well-

known similarity measures collected in (Luke, ) (e.g.:

Simple matching, Jaccard,...). The choice of the simi-

larity measure is orthogonal w.r.t. the clustering algo-

rithm, and RELEVANT sees it as a parameter, which

can be chosen by the integration designer and changed

to compare different settings. We build the “Affinity Matrix” $AM = \|am_{ij}\|_{|W| \times |W|}$, where $am_{ij}$ is the similarity between the $mt_i$ and $mt_j$ rows of the matrix MT, binary representation of the original values $a_i$ and $a_j \in At$.
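
A minimal Python sketch of the affinity matrix follows, with the similarity measure passed as a parameter as described above. The two measures shown are the standard Jaccard and simple matching coefficients for binary vectors; the sample matching table and the convention for two all-zero rows are assumptions made for the example.

```python
def jaccard(r1, r2):
    """Jaccard coefficient: shared ones over positions where either row has a one."""
    both = sum(1 for x, y in zip(r1, r2) if x and y)
    either = sum(1 for x, y in zip(r1, r2) if x or y)
    return both / either if either else 0.0  # assumed convention for empty rows


def simple_matching(r1, r2):
    """Simple matching coefficient: fraction of positions where the rows agree."""
    return sum(1 for x, y in zip(r1, r2) if x == y) / len(r1)


def affinity_matrix(mt, sim=jaccard):
    """AM[i][j] is the similarity between rows i and j of the matching table."""
    return [[sim(ri, rj) for rj in mt] for ri in mt]


# Toy matching table: rows are attribute values, columns are stems.
mt = [
    [0, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
am = affinity_matrix(mt)
am_sm = affinity_matrix(mt, sim=simple_matching)
print(am[0][1], am_sm[0][1])
```

Swapping `sim` changes the numbers but not the rest of the pipeline, which is the sense in which the measure is orthogonal to the clustering algorithm.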

Concerning the second task, it is trivial to show

that syntactically similar values may refer to very dif-

ferent objects: consequently the use of the semantics

of clusters may produce improper clusters. On the

other hand, adding semantics to each attribute value

is a time-consuming and error-prone operation: the

high number of values and the frequency of update

discourage a manual processing as required by the

usual techniques (e.g.: by annotating each value ac-

cording to a reference ontology).

Our choice is to introduce a new technique that grasps the semantics of value containment (expressed by the Root Elements), working in conjunction with the approach based on the semantics of clusters defined above, which exploits syntactic similarity. The designer may balance the relevance of the two kinds of semantics by means of specific coefficients.

To exploit such semantics, we introduce the Matching Table for Root elements MTR, a binary matrix that shows the membership of the attribute values to the Root Elements. In particular, $MTR = \|mtr_{ij}\|_{|W| \times |RES|}$, where $mtr_{ij} = \begin{cases} 1 & \text{if } a_i \in values_j \\ 0 & \text{otherwise} \end{cases}$

Therefore, a new matrix, Affinity Matrix for Root elements (AMR), is built. $AMR = \|amr_{ij}\|_{|W| \times |W|}$ represents the similarity measure between two elements $mtr_i$ and $mtr_j$. This measure is computed by means of the metric selected by the integration designer.

Finally the results of AM and AMR are linearly combined into the Global Affinity Matrix $GAM = \|gam_{ij}\|_{|W| \times |W|}$. In particular, $gam_{ij} = lc_{am} \times am_{ij} + lc_{amr} \times amr_{ij}$, where the values of $lc_{am}$ and $lc_{amr}$ are chosen by the designer such that $lc_{am}, lc_{amr} \in [0, 1]$ and $lc_{am} + lc_{amr} = 1$.
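
The linear combination can be illustrated with a short Python sketch. The parameter name `lc_am` and the sample matrices are hypothetical; the second coefficient is derived so that the two always sum to 1, as required above.

```python
def global_affinity(am, amr, lc_am=0.5):
    """Combine AM and AMR entry by entry with weights lc_am and 1 - lc_am."""
    if not 0.0 <= lc_am <= 1.0:
        raise ValueError("lc_am must lie in [0, 1]")
    lc_amr = 1.0 - lc_am
    return [
        [lc_am * a + lc_amr * r for a, r in zip(row_am, row_amr)]
        for row_am, row_amr in zip(am, amr)
    ]


# Toy input: two values that are syntactically half-similar (AM)
# but belong to exactly the same root elements (AMR).
am = [[1.0, 0.5], [0.5, 1.0]]
amr = [[1.0, 1.0], [1.0, 1.0]]
gam = global_affinity(am, amr, lc_am=0.5)
print(gam)
```

Setting `lc_am` to 1 ignores the containment semantics entirely, while 0 ignores syntactic similarity; intermediate values balance the two.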

3.3 Step 3: Relevant Values Elicitation

The prototype implements two different clustering algorithms: a classical agglomerative hierarchical clustering algorithm, which performs a partition of the values set, and a second algorithm, which generates overlapping clusters (a variation of the algorithm in (Cleuziou et al., 2004) is implemented).

The hierarchical clustering algorithm. A hi-

erarchical clustering algorithm classiﬁes elements

into groups at different levels of afﬁnity, forming a

tree (Everitt, 1993). The hierarchical clustering pro-

cedure is applied to the matrix GAM. Once the afﬁn-

ity tree has been built, clusters are interactively com-

puted on the basis of the numerical afﬁnity values in

the afﬁnity tree and a threshold-based mechanism for

cluster selection speciﬁed by the designer. High val-

ues of threshold return small, highly fragmented clus-

ters. By decreasing the threshold value, bigger clus-

ters are generated.
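
The effect of the threshold can be seen in the following Python sketch of agglomerative clustering over an affinity matrix. It is an assumption-laden illustration: average linkage is chosen here because the text does not name a linkage criterion, and the matrix is invented.

```python
def agglomerative(gam, threshold):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(gam))]

    def affinity(c1, c2):
        # Average linkage (an assumption): mean pairwise affinity.
        return sum(gam[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = affinity(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # no remaining pair is similar enough
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


gam = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
]
for t in (0.9, 0.5, 0.05):
    print(t, agglomerative(gam, t))
```

With this matrix a threshold of 0.9 leaves four singleton clusters, 0.5 yields two clusters of two values, and 0.05 merges everything, matching the behaviour described above.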

The overlapping clustering algorithm. The algorithm is based on the technique described in (Cleuziou et al., 2004), whose idea is to extend some sets of values given as input with other elements of the data set. In particular, the algorithm starts from a set of poles $P = \{P_1, \ldots, P_l\}$, where each $P_i$ is a subset of the considered values set and $P_i \cap P_j = \{\}\ \forall i \neq j$. Then, a membership degree is calculated for each element of the values set with respect to each pole. Finally, by means of a specific similarity measure evaluating the membership degrees, each element is assigned to one or more poles similar to it.

It is trivial to show that the results are highly de-

pendent on the heuristic used for calculating the ini-

tial set of poles. Using the semantics available in our

speciﬁc model, we implemented two techniques for

calculating poles: the ﬁrst one considers the results

of the hierarchical clustering as poles, the second one

considers the root elements as poles. The results are

different, as shown below. In the ﬁrst case the simi-

larity measures assume a key role; in the second case,

no similarity measure is computed since the algorithm

exploits only the containment property.
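
To make the pole-based idea concrete, here is a deliberately simplified Python sketch. It is not the assignment rule of (Cleuziou et al., 2004): the membership degree is assumed to be the mean affinity to the pole members, and the `ratio` parameter, which admits every pole whose degree is close to the best one, is a hypothetical stand-in for the actual criterion. Poles are assumed to be non-empty and disjoint.

```python
def assign_to_poles(gam, poles, ratio=0.9):
    """Extend each pole with the outside elements whose membership is near-best."""
    clusters = [set(p) for p in poles]
    in_pole = set().union(*poles)
    for x in range(len(gam)):
        if x in in_pole:
            continue  # pole members stay in their own pole
        # Membership degree (assumed): mean affinity to the members of the pole.
        degrees = [sum(gam[x][m] for m in p) / len(p) for p in poles]
        best = max(degrees)
        for k, d in enumerate(degrees):
            if best > 0 and d >= ratio * best:
                clusters[k].add(x)
    return clusters


gam = [
    [1.0, 0.9, 0.1, 0.1, 0.6],
    [0.9, 1.0, 0.1, 0.1, 0.6],
    [0.1, 0.1, 1.0, 0.8, 0.6],
    [0.1, 0.1, 0.8, 1.0, 0.5],
    [0.6, 0.6, 0.6, 0.5, 1.0],
]
poles = [{0, 1}, {2, 3}]
print(assign_to_poles(gam, poles, ratio=0.9))
print(assign_to_poles(gam, poles, ratio=1.0))
```

Element 4 is almost equally close to both poles, so a tolerant `ratio` places it in both clusters, while `ratio=1.0` degenerates to a partition. The poles themselves could come from either heuristic described above, the hierarchical clusters or the root elements.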

3.4 Step 4: Validation

We implemented a set of standard quality measures,

to support the designer in the tuning activity. In par-

ticular, the tool provides:

• countRV: number of relevant values obtained for

the conﬁguration. This value depends on the

threshold set by the integration designer;

• average, max_elements, variance: the descriptive statistics over the number of elements. In particular, average expresses the average number of values belonging to a relevant value, max_elements indicates the size of the largest cluster and variance shows the variance of the cluster sizes. For value sets equally distributed over the domain, max_elements is close to the average value and variance is low;

• count single: number of relevant values with a

single element. Count single indicates a low num-

ber if the values set is equally distributed on the

domain;

• silhouette (Rousseeuw, 1987) (only if the hierarchical clustering algorithm is used): calculates a width for each cluster based on the comparison of its tightness and separation. If the silhouette value is close to 1, it means that the object is well-clustered and has been assigned to an appropriate cluster. If the silhouette value is close to -1, it means that the object is not well-clustered;

• overlapping degree (only if the overlapping clustering algorithm is used): indicates the number of elements which are in more than one relevant value.
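
Most of the measures in this list are simple statistics over cluster sizes, as the following Python sketch shows. The dictionary keys mirror the names used above; the use of the population variance and the sample clusters are assumptions, and the silhouette is left out because it also needs the affinity matrix.

```python
from statistics import mean, pvariance


def validation_measures(clusters):
    """Size-based quality measures for a list of clusters (sets of values)."""
    sizes = [len(c) for c in clusters]
    seen, shared = set(), set()
    for cluster in clusters:
        for value in cluster:
            if value in seen:
                shared.add(value)  # value already met in an earlier cluster
            seen.add(value)
    return {
        "countRV": len(clusters),
        "average": mean(sizes),
        "max_elements": max(sizes),
        "variance": pvariance(sizes),  # population variance (an assumption)
        "count_single": sum(1 for s in sizes if s == 1),
        "overlapping_degree": len(shared),
    }


clusters = [
    {"Mould", "Mould ejectors", "Mould engineering"},
    {"Gear"},
    {"Gear cutting", "Mould engineering"},
]
measures = validation_measures(clusters)
for name, value in measures.items():
    print(name, value)
```

A designer comparing two parameter settings could run this on both results and prefer, for example, the one with fewer singleton clusters and lower variance.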

4 QUERYING WITH RELEVANT

VALUES

Thanks to the knowledge provided by relevant values,

the user has two new ways of formulating queries, ac-

cording to two scenarios.

1. The user has only a general idea of what she/he is

searching for and composes a query predicate for

instance by selecting a value x among the relevant

value names. Note that instead of using the classical equality or LIKE operator, we should consider a new one, say RELATED TO, taking into account the mapping between relevant value names and values. It is beyond the scope of this paper to discuss such an operator, but a naive implementation could be to substitute At RELATED TO x, where x is a relevant value name, with At IN (SELECT values FROM METADATA.At WHERE rvn='x').

To give a ﬂavor of the novelty of the approach,

we should observe that: (a) The user seldom has

a deep knowledge of all the integrated data, so the

list of the relevant value names, elicited from data,

is of great help in providing insight on the value

domain, and in assisting query formulation; (b)

w.r.t. the base SQL predicate At LIKE '%x%', we propose a rewriting of the query which is guided by the semantics of clustering and string containment, and which also uses, as base tools, the information retrieval techniques of stemming and stop words.

2. The user knows that the result must include tu-

ples satisfying the predicate At = v, but she/he is

aware that, due to the integration process, tuples

with values v′ similar to v might also be relevant. In this case the query could be transformed into a query of type 1 above by substituting At = v with At RELATED TO rvn, where v ∈ values(rvn), or possibly with a disjunction of such predicates, if overlapping clustering is used.
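
The naive rewriting of RELATED TO can be tried out with Python's built-in sqlite3 module. Everything concrete in this sketch is hypothetical: the tables `items` and `metadata_at`, the column names (`value` is used because VALUES is an SQL keyword) and the toy rows, including the choice to place "Die casting" under the relevant value "Mould".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, attr TEXT);
CREATE TABLE metadata_at (rvn TEXT, value TEXT);
INSERT INTO items (attr) VALUES
  ('Mould'), ('Mould ejectors'), ('Die casting'), ('Gear cutting');
INSERT INTO metadata_at VALUES
  ('Mould', 'Mould'), ('Mould', 'Mould ejectors'),
  ('Mould', 'Die casting'), ('Gear', 'Gear cutting');
""")


def related_to(conn, rvn):
    """attr RELATED TO rvn, rewritten as an IN sub-query on the metadata."""
    rows = conn.execute(
        "SELECT attr FROM items WHERE attr IN "
        "(SELECT value FROM metadata_at WHERE rvn = ?) ORDER BY id",
        (rvn,),
    )
    return [r[0] for r in rows]


def like_match(conn, x):
    """Baseline predicate attr LIKE '%x%'."""
    rows = conn.execute(
        "SELECT attr FROM items WHERE attr LIKE ? ORDER BY id", (f"%{x}%",)
    )
    return [r[0] for r in rows]


def relevant_names_of(conn, v):
    """Scenario 2: the relevant value names whose value set includes v."""
    rows = conn.execute(
        "SELECT rvn FROM metadata_at WHERE value = ? ORDER BY rvn", (v,)
    )
    return [r[0] for r in rows]


print(related_to(conn, "Mould"))
print(like_match(conn, "Mould"))
print(relevant_names_of(conn, "Gear cutting"))
```

In this toy data the RELATED TO rewriting also returns "Die casting", which the LIKE predicate misses because the string does not contain the word "Mould". With overlapping clusters, `relevant_names_of` may return several names, giving the disjunction mentioned in scenario 2.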

5 CONCLUSIONS AND FUTURE

WORK

In this paper we deﬁned a new type of metadata, the

relevant values of an attribute. These values are provided to the user in order to increase her/his knowledge of the sources. We addressed several critical issues with

the aim of efﬁciency and effectiveness, in different

domains with different updating frequencies. In par-

ticular, the method is based on data analysis: if data

change, the relevant values have to be updated. As

usual in data analysis, the startup phase requires the

setting of several critical parameters. Nevertheless,

for a given parameter setting, the developed technique

is able to calculate the relevant value set without any

human intervention. Moreover the parameters and

similarity metric selection determine the quality of the

relevant value set. Therefore, the integration designer has to carefully evaluate the results and, if necessary, change some parameter settings in order to improve the quality of the result.

The experimental results evaluated by means of

RELEVANT show that the developed technique may

produce results close to the relevant values provided

by a domain expert. The best results are obtained by

applying the overlapping clustering algorithms.

Future work will address improving the relevant values selection by automatically calculating some indicators for evaluating the quality of the relevant values. In this way, the designer may be supported in the selection of parameters. Moreover, we will

study the problem of the generation of the relevant

value set for multiple attributes and that of quantita-

tive evaluation of cluster quality in the overlapping

case.

ACKNOWLEDGEMENTS

This work was partially supported by the Italian Min-

istry of University and Research within the projects

WISDOM and NeP4B.

REFERENCES

Beneventano, D. and Bergamaschi, S. Semantic Search Engines based on Data Integration Systems. In Semantic Web: Theory, Tools and Applications (Ed. Jorge Cardoso). Idea Group Publishing, forthcoming (see http://www.dbgroup.unimo.it).

Beneventano, D., Bergamaschi, S., Bruschi, S., Guerra, F., Orsini, M., and Vincini, M. (2006). Instances navigation for querying integrated data from web-sites. In International Conference on Web Information Systems and Technologies (WEBIST 2006), Setubal, Portugal.

Beneventano, D., Bergamaschi, S., Guerra, F., and Vincini,

M. (2003). Synthesizing an integrated ontology. IEEE

Internet Computing, pages 42–51.

Bergamaschi, S., Castano, S., Beneventano, D., and

Vincini, M. (2001). Semantic integration of hetero-

geneous information sources. Data & Knowledge En-

gineering, Special Issue on Intelligent Information In-

tegration, 36(1):215–249.

Cleuziou, G., Martin, L., and Vrain, C. (2004). PoBOC: An

overlapping clustering algorithm, application to rule-

based classiﬁcation and textual data. In Proceedings

of the 16th ECAI conference, pages 440–444.

Everitt, B. S. (1993). Cluster Analysis. Edward Arnold and

Halsted Press.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data

clustering: A review. ACM Comput. Surv., 31(3):264–

323.

Lenzerini, M. (2002). Data integration: A theoretical per-

spective. In Popa, L., editor, PODS, pages 233–246.

ACM.

Luke, B. T. Clustering binary objects. In Technical Report,

http://fconyx.ncifcrf.gov/ lukeb/binclus.html.

Rousseeuw, P. (1987). Silhouettes: a graphical aid to the in-

terpretation and validation of cluster analysis. J. Com-

put. Appl. Math., 20:53–65.
