can produce a set of mediated schemas. The above
procedures correspond to step 2 in Figure 2.
5 QUERY FORWARDING
At schema clustering time, we produce a
mediated schema forest over the various data sources.
At query time, users can pose queries using the
terminology of any of the mediated schemas,
although most of the time the higher-level schemas
are used. The query is reformulated and delivered
downward to the source schemas according to the
mapping between the mediated schema and its lower
schemas. This mapping, called the schema mapping
in our system, consists of two parts.
The first part is the one-to-many schema name
mapping: the mediated schema name is mapped
downward to the names of the lower schemas. We
assign each schema an ID that uniquely identifies it,
and the name mappings store schema names along
with their IDs. Because the IDs are unique, each
mapping identifies its schemas unambiguously.
Consider the following example of three source
schemas.
Example 1:
(Paper: Paper_No., Title, Presentation,
Author_info, Nation, Contact, Company)
(Proceedings: Year, Conference, Title, Author1,
Company1, Author2, Company2)
(Publication: Ref_No., Type, Title, Author_list,
Reference, Volume, Issue_section, Pages, Year,
Month, Day, Conference_notes)
In the above example, the schema name and the
schema attributes are separated by a colon, and the
attributes are separated by commas. After processing
by our system, we obtain the following mediated
schema.
Example 2:
(Academic_publication: Title, No., Author,
Conference, Year)
The name mapping between the above mediated
schema and the source schemas is as follows.
Example 3:
((4, Academic_publication) → ((1, Paper), (2,
Proceedings), (3, Publication)))
This mapping maps schema (4,
Academic_publication) to three lower schemas (1,
Paper), (2, Proceedings), and (3, Publication). The
first element in (4, Academic_publication) is the
schema ID and the other one is the schema name.
One-to-many attribute mappings form the other
part of the schema mapping. The attribute mapping
format is the same as that of the name mapping, but
the second element within the parentheses is an
attribute, such as “Title” in (4, Title) of Example 4.
A schema mapping usually contains several attribute
mappings, since a mediated schema usually has
several mediated attributes. Consider the following
mappings.
Example 4:
((4, Title) → ((1, Title), (2, Title), (3, Title)))
((4, No.) → ((1, Paper_No.), (3, Ref_No.)))
((4, Author) → ((1, Author_info), (2, Author1), (2,
Author2), (3, Author_list)))
((4, Conference) → ((2, Conference), (3,
Conference_notes)))
((4, Year) → ((2, Year), (3, Year)))
Example 4 lists the attribute mappings between the
source schemas of Example 1 and the mediated
schema of Example 2.
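The two parts of the schema mapping can be pictured
as lookup tables keyed by (ID, name) pairs. The
following Python fragment is a minimal sketch of
Examples 3 and 4 under that assumption. It is not the
data structure of our Java implementation, and the
names NAME_MAPPING, ATTRIBUTE_MAPPINGS and
lower_attributes are hypothetical.

```python
# Illustrative sketch only: all names here are assumptions, not the system's API.

# Name mapping of Example 3: (mediated ID, name) -> [(source ID, name), ...]
NAME_MAPPING = {
    (4, "Academic_publication"): [(1, "Paper"), (2, "Proceedings"), (3, "Publication")],
}

# Attribute mappings of Example 4: (mediated ID, attribute) -> [(source ID, attribute), ...]
ATTRIBUTE_MAPPINGS = {
    (4, "Title"): [(1, "Title"), (2, "Title"), (3, "Title")],
    (4, "No."): [(1, "Paper_No."), (3, "Ref_No.")],
    (4, "Author"): [(1, "Author_info"), (2, "Author1"), (2, "Author2"), (3, "Author_list")],
    (4, "Conference"): [(2, "Conference"), (3, "Conference_notes")],
    (4, "Year"): [(2, "Year"), (3, "Year")],
}


def lower_attributes(mediated_id, attribute, source_id):
    """Return the attributes of one source schema that a mediated attribute maps to."""
    targets = ATTRIBUTE_MAPPINGS.get((mediated_id, attribute), [])
    return [name for sid, name in targets if sid == source_id]
```

For instance, lower_attributes(4, "Author", 2) returns
both author attributes of Proceedings, whereas
lower_attributes(4, "Year", 1) returns an empty list
because Example 4 maps Year to no attribute of Paper.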
When users pose a query on the mediated schema
in Example 2, the system reformulates the query for
the source schemas according to the mappings in
Example 3 and Example 4. Several queries are
generated, each matching a source schema in
Example 1. Finally, the databases return the requested
answers to the users. To illustrate the forwarding
process, consider the following query, which is posed
on the mediated schema in Example 2.
SELECT Title, Author, Year
FROM Academic_publication
WHERE Year > 2006 AND Author = 'Strehl'
According to the mappings in Examples 3 and 4,
the following two queries are posed on the
corresponding data sources automatically. No query
is generated for Paper, because Example 4 maps Year
to no attribute of that schema.
(1) SELECT Year, Title, Author1, Author2
FROM Proceedings
WHERE Year > 2006 AND (Author1 = 'Strehl'
OR Author2 = 'Strehl')
(2) SELECT Title, Author_list, Year
FROM Publication
WHERE Year > 2006 AND Author_list = 'Strehl'
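The forwarding step above can be sketched in a few
lines of Python. This is an illustration under stated
assumptions, not our Java implementation: the names
SOURCES, ATTRIBUTE_MAPPINGS and reformulate are
hypothetical, conditions are assumed to be joined by
AND, a mediated attribute that maps to several source
attributes is assumed to expand into an OR group, and
a source is assumed to be skipped when any referenced
attribute has no mapping to it.

```python
# Illustrative sketch only: names and rules below are assumptions.

SOURCES = {1: "Paper", 2: "Proceedings", 3: "Publication"}  # Example 3

# Example 4: mediated attribute -> [(source ID, source attribute), ...]
ATTRIBUTE_MAPPINGS = {
    "Title": [(1, "Title"), (2, "Title"), (3, "Title")],
    "No.": [(1, "Paper_No."), (3, "Ref_No.")],
    "Author": [(1, "Author_info"), (2, "Author1"), (2, "Author2"), (3, "Author_list")],
    "Conference": [(2, "Conference"), (3, "Conference_notes")],
    "Year": [(2, "Year"), (3, "Year")],
}


def reformulate(select, conditions):
    """Rewrite a mediated query into one SQL string per source schema.

    select: list of mediated attribute names.
    conditions: list of (attribute, operator, literal) triples, joined with AND.
    """
    queries = {}
    for source_id, table in SOURCES.items():
        def lower(attribute):
            # Source attributes of this schema mapped from a mediated attribute.
            pairs = ATTRIBUTE_MAPPINGS.get(attribute, [])
            return [name for sid, name in pairs if sid == source_id]

        referenced = list(select) + [attr for attr, _, _ in conditions]
        if any(not lower(attr) for attr in referenced):
            continue  # assumption: skip sources that lack a referenced attribute

        columns = [col for attr in select for col in lower(attr)]
        predicates = []
        for attr, op, literal in conditions:
            alternatives = [f"{col} {op} {literal}" for col in lower(attr)]
            joined = " OR ".join(alternatives)
            predicates.append(f"({joined})" if len(alternatives) > 1 else joined)

        where = " WHERE " + " AND ".join(predicates) if predicates else ""
        queries[table] = f"SELECT {', '.join(columns)} FROM {table}{where}"
    return queries


if __name__ == "__main__":
    result = reformulate(
        ["Title", "Author", "Year"],
        [("Year", ">", "2006"), ("Author", "=", "'Strehl'")],
    )
    for sql in result.values():
        print(sql)
```

With the query above, the sketch yields statements for
Proceedings and Publication only. The column order
follows the SELECT list of the mediated query, so it
may differ from the order shown in query (1). Literals
are spliced into the SQL text for readability; a real
system would pass them as bound parameters.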
6 EXPERIMENTS
Our algorithms were implemented in Java. We ran
the experiments on a Windows 7 machine with a
2.60 GHz Intel(R) i5 processor and 8 GB of memory.
The goal of our experiments is to demonstrate that
(1) our schema clustering algorithm is effective in
clustering the data sources of multiple domains,
(2) queries on the mediated schemas achieve answers
with good accuracy, and (3) the cost of writing query
clauses for users is reduced without losing query
accuracy.
For our query evaluation, we used
MySQL to store the data. Two string similarity
measures are used to compute the schema
similarity, since two strings may be semantically