count the mapping between relevant value names
and values. It is beyond the scope of this paper to
discuss such an operator, but a naive implementation
could substitute At RELATED TO x, where x is
a relevant value name, with At IN (SELECT values
FROM METADATA.At WHERE rvn='x').
To give a flavor of the novelty of the approach,
we should observe that: (a) The user seldom has
a deep knowledge of all the integrated data, so the
list of the relevant value names, elicited from data,
is of great help in providing insight on the value
domain, and in assisting query formulation; (b)
w.r.t. the base SQL predicate At LIKE '%x%', we
propose a rewriting of the query which is guided
by the semantics of clustering and string containment,
and which also uses, as base tools, the information
retrieval techniques of stemming and stop words.
2. The user knows that the result must include tu-
ples satisfying the predicate At = v, but she/he is
aware that, due to the integration process, tuples
with values v′ similar to v might also be relevant.
In this case the query can be transformed into a
query of type 1 above by substituting At = v with
At RELATED TO rvn, where v ∈ values(rvn), or
possibly with a disjunction of such predicates
if overlapping clustering is used, since v may then
belong to the value sets of several relevant value names.
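The two rewritings described above can be illustrated with a minimal sketch, which is not the implementation used in RELEVANT. It assumes that the metadata for an attribute At is stored in a table with one row per pair (relevant value name, value). The table name metadata_<At>, the column name val (used because VALUES is a reserved word in SQLite), the sample products table, and the convention of writing the relevant value name as a quoted literal are all hypothetical choices made for illustration.

```python
import re
import sqlite3

# Matches `At RELATED TO 'x'`; the relevant value name is assumed to be
# written as a SQL string literal (an illustrative convention).
RELATED_TO = re.compile(r"(\w+)\s+RELATED\s+TO\s+'((?:[^']|'')*)'", re.IGNORECASE)


def rewrite_related_to(sql: str) -> str:
    """Naive substitution of `At RELATED TO 'x'` with an IN subquery.

    The hypothetical table metadata_<At> plays the role of METADATA.At.
    """
    def repl(match):
        attr, rvn = match.group(1), match.group(2)
        return f"{attr} IN (SELECT val FROM metadata_{attr} WHERE rvn = '{rvn}')"
    return RELATED_TO.sub(repl, sql)


def equality_to_related(conn, attr: str, value: str) -> str:
    """Turn `attr = value` into a disjunction of RELATED TO predicates.

    One predicate is produced per relevant value name whose value set
    contains `value`; with overlapping clustering there may be several.
    """
    rows = conn.execute(
        f"SELECT DISTINCT rvn FROM metadata_{attr} WHERE val = ? ORDER BY rvn",
        (value,),
    ).fetchall()
    if not rows:
        # The value belongs to no relevant value: keep the plain equality.
        escaped = value.replace("'", "''")
        return f"{attr} = '{escaped}'"
    names = [row[0].replace("'", "''") for row in rows]
    return " OR ".join(f"{attr} RELATED TO '{name}'" for name in names)


# Toy data: 'ski boots' belongs to two overlapping relevant values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (name TEXT, category TEXT);
CREATE TABLE metadata_category (rvn TEXT, val TEXT);
INSERT INTO products VALUES
    ('p1', 'ski boots'), ('p2', 'ski poles'), ('p3', 'tennis racket');
INSERT INTO metadata_category VALUES
    ('ski', 'ski boots'), ('ski', 'ski poles'),
    ('boots', 'ski boots'), ('tennis', 'tennis racket');
""")

# Query of type 1: the user picks a relevant value name directly.
type1_sql = rewrite_related_to(
    "SELECT name FROM products WHERE category RELATED TO 'ski' ORDER BY name"
)
type1_names = [row[0] for row in conn.execute(type1_sql)]

# Query of type 2: the user starts from category = 'ski boots'.
disjunction = equality_to_related(conn, "category", "ski boots")
type2_sql = rewrite_related_to(
    f"SELECT name FROM products WHERE ({disjunction}) ORDER BY name"
)
type2_names = [row[0] for row in conn.execute(type2_sql)]

print(type1_names)
print(disjunction)
print(type2_names)
```

The textual substitution mirrors the naive strategy mentioned in the text. A real operator would more plausibly be handled inside the query processor, and would bind the relevant value name as a parameter rather than interpolating it into the SQL string.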
5 CONCLUSIONS AND FUTURE WORK
In this paper we defined a new type of metadata, the
relevant values of an attribute. These values are
provided to the user in order to increase their knowledge
of the sources. We addressed several critical issues with
the aim of efficiency and effectiveness, in different
domains with different updating frequencies. In particular,
the method is based on data analysis: if the data
change, the relevant values have to be updated. As
usual in data analysis, the startup phase requires the
setting of several critical parameters. Nevertheless,
for a given parameter setting, the developed technique
is able to calculate the relevant value set without any
human intervention. Moreover, the choice of parameters and
similarity metric determines the quality of the
relevant value set. Therefore, the integration designer
has to carefully evaluate the results and, if necessary,
change some parameter settings in order to improve
the quality of the result.
The experimental results, evaluated by means of
RELEVANT, show that the developed technique may
produce results close to the relevant values provided
by a domain expert. The best results are obtained by
applying the overlapping clustering algorithms.
Future work will focus on improving the
relevant values selection by automatically calculating
some indicators for evaluating the quality of the relevant
values. In this way, the designer may be supported
in the parameter selection. Moreover, we will
study the problem of generating the relevant
value set for multiple attributes and that of the quantitative
evaluation of cluster quality in the overlapping
case.
ACKNOWLEDGEMENTS
This work was partially supported by the Italian Min-
istry of University and Research within the projects
WISDOM and NeP4B.
RELEVANT VALUES: NEW METADATA TO PROVIDE INSIGHT ON ATTRIBUTE VALUES AT SCHEMA LEVEL