user queries, formulated in terms of a mediated
schema, into queries that can be handled by local
databases. The mediator must therefore match each
export schema with the mediated schema. The
problem of query mediation becomes a challenge in
the context of the Web, where the number of local
databases may be enormous and, moreover, the
mediator does not have much control over the local
databases, which may join or leave the mediated
environment at will.
In general, the match operation takes two
schemas as input and produces a mapping between
elements of the two schemas that correspond to each
other. Many techniques for schema and ontology
matching have been proposed to automate the match
operation. For a survey of several schema matching
approaches, we refer the reader to (Rahm &
Bernstein, 2001).
Schema matching approaches may be classified
as syntactic vs. semantic and, orthogonally, as a
priori vs. a posteriori (Casanova et al., 2007). The
syntactic approach consists of matching two
schemas based on syntactical hints, such as attribute
data types and naming similarities. The semantic
approach uses semantic clues to generate hypotheses
about schema matching. It generally tries to detect
how real-world objects are represented in
different databases and leverages the information
obtained to match the schemas. Both the syntactic
and the semantic approaches work a posteriori, in
the sense that they start with pre-existing databases
and try to match their schemas. The a priori
approach emphasizes that, whenever specifying
databases that will interact with each other, the
designer should start by selecting an appropriate
standard (a common schema), if one exists, to guide
the design of the export schemas.
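As an illustration of the syntactic approach only, the following minimal sketch matches two schemas using attribute data types and naming similarity. It is not taken from any of the cited works: the function names, the threshold value, and the sample schemas are all assumptions made for the example, and name similarity is approximated with Python's standard difflib module.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def syntactic_match(schema_a, schema_b, threshold=0.6):
    """Map each attribute of schema_a to its best candidate in schema_b.

    Schemas are dicts {attribute_name: data_type}. A pair is accepted only
    if the data types agree and the name similarity reaches the threshold
    (an arbitrary value chosen for this sketch).
    """
    mapping = {}
    for name_a, type_a in schema_a.items():
        best_name, best_score = None, threshold
        for name_b, type_b in schema_b.items():
            if type_a != type_b:
                continue  # syntactic hint 1: data types must agree
            score = name_similarity(name_a, name_b)
            if score >= best_score:  # syntactic hint 2: similar names
                best_name, best_score = name_b, score
        if best_name is not None:
            mapping[name_a] = best_name
    return mapping


# Hypothetical export schema and mediated schema.
local = {"book_title": "string", "num_pages": "int", "author_name": "string"}
mediated = {"title": "string", "pages": "int", "author": "string"}
print(syntactic_match(local, mediated))
```

Such a matcher works a posteriori and relies on naming conventions alone, so it may fail when attribute names are unrelated to their meaning.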
An implementation of a mediator for
heterogeneous gazetteers is presented in (Gazola et
al., 2007). Gazetteers are catalogues of geographic
objects, typically classified using terms taken from a
thesaurus. Mediated access to several gazetteers
requires the use of a technique to deal with the
heterogeneity of different thesauri. The mediator
incorporates an instance-based technique to align
thesauri that uses the results of user queries as
evidence (Brauner et al., 2006).
A semantic approach for matching export
schemas of geographical database Web services is
described in (Brauner et al., 2007). The approach is
based on the use of a small set of global instances
and on an ISO-compliant predefined global schema.
An instance-based schema matching technique,
based on domain-specific query probing, applied to
Web databases, is proposed in (Wang et al., 2004). A
Web database is a backend database available on the
Web and accessible through a Web site query
interface. In particular, the interface exports query
results as HTML pages. Thus, a Web
database has two different schemas, the interface
schema (IS) and the result schema (RS). The
interface schema of an individual Web database
consists of data attributes over which users can
query, while the result schema consists of data
attributes that describe the query results that users
receive.
The instance-based schema matching technique
described in (Wang et al., 2004) is based on three
observations about Web databases:
1. Improper queries often cause search failure, that
is, return no results. For the authors,
improperness means that the query keywords
submitted to a particular interface schema
element are not applicable values of the database
attribute with which the element is associated. For
instance, submitting a string to a search element
that is defined as an integer produces an error.
As an example, consider submitting a title value
to the search element for the number of pages.
2. The keywords of proper queries that return
results very likely reappear in the returned result
pages.
3. There is a global schema (GS) for Web
databases of the same domain (He & Chang,
2003). The global schema consists of the
representative attributes of the data objects in a
specific domain.
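The first two observations can be read as a simple rule for interpreting the outcome of a single probe. The sketch below is a hypothetical illustration, not the procedure of (Wang et al., 2004): the function name, the three labels, and the representation of a result page as a list of text rows are assumptions.

```python
def classify_probe(keyword, result_rows):
    """Classify one probe from its keyword and the text of the returned rows.

    result_rows is assumed to be a list of strings, one per result entry
    extracted from the returned page.
    """
    if not result_rows:
        # Observation 1: an improper query tends to return no results.
        return "failure"
    hits = sum(keyword.lower() in row.lower() for row in result_rows)
    if hits > 0:
        # Observation 2: keywords of a proper query tend to reappear.
        return "proper"
    # Results came back but the keyword is absent; no conclusion is drawn.
    return "inconclusive"


print(classify_probe("Dune", []))
print(classify_probe("Dune", ["Dune, Frank Herbert, 412 pages"]))
```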
The query probing technique consists of
exhaustively sending keyword queries to the query
interface of different Web databases, and collecting
their results for further analysis. Based on the third
observation, the authors assume, for a specific domain, the
existence of a pre-defined global schema and a
number of sample data objects under the global
schema, called global instances. For Web databases,
they deal with two kinds of schema matching: intra-
site schema matching (that is, matching global with
interface schemas, global with result schemas, and
interface with result schemas) and inter-site schema
matching (that is, matching two interface schemas or
two result schemas).
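The probing step described above can be sketched as a nested loop over global instances, their attribute values, and the elements of an interface schema. This is a minimal sketch under stated assumptions: the Web database is replaced by an in-memory stand-in, and the names RECORDS, INTERFACE, search, and probe are hypothetical and do not come from (Wang et al., 2004).

```python
# Hypothetical stand-in for a Web database: a few records plus a query
# interface whose elements each search one underlying attribute.
RECORDS = [
    {"title": "Dune", "author": "Frank Herbert", "pages": "412"},
    {"title": "Emma", "author": "Jane Austen", "pages": "474"},
]
INTERFACE = {"q_name": "title", "q_writer": "author"}  # element -> attribute


def search(element, keyword):
    """Simulate submitting a keyword to one interface element."""
    attribute = INTERFACE[element]
    return [r for r in RECORDS if keyword.lower() in r[attribute].lower()]


def probe(global_instances, elements, search_fn):
    """Send every global-instance value to every interface element.

    Returns {(global_attribute, element): [result list per submitted value]},
    kept for later analysis.
    """
    collected = {}
    for instance in global_instances:
        for g_attr, value in instance.items():
            for element in elements:
                collected.setdefault((g_attr, element), []).append(
                    search_fn(element, value))
    return collected


# Hypothetical global instance under a global schema {Title, Author}.
GLOBAL_INSTANCES = [{"Title": "Dune", "Author": "Jane Austen"}]
results = probe(GLOBAL_INSTANCES, list(INTERFACE), search)
for (g_attr, element), result_lists in sorted(results.items()):
    non_empty = sum(1 for rows in result_lists if rows)
    print(g_attr, element, non_empty, "of", len(result_lists))
```

Because every value is sent to every element, the number of probes grows with the product of global instances, global attributes, and interface elements, which is why a small set of global instances is desirable.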
The data analysis is based on the second
observation. Given a proper query, the results will
probably contain the submitted value again (that is,
a value of some attribute of a global instance). The
results are collected from the HTML pages sent to
the Web browser. Thus, the
reoccurrence of the query keywords in the returned
results can be used as an indicator of which query
submission is appropriate (i.e., to discover associated
elements in the interface schema). In addition, the
position of the submitted query keywords in the
ICEIS 2008 - International Conference on Enterprise Information Systems