TOWARDS THE SCHEMA HETEROGENEITY IN DISTRIBUTED DIGITAL LIBRARIES

Hao Ding, Ingeborg T. Sølvberg

Dept. of Computer and Information Science, Norwegian Univ. of Science and Technology

Keywords: Digital Library, Schema, Metadata, Agent, Ontology

Abstract: This paper discusses the problems caused by schema heterogeneity in digital libraries (DLs), especially those encountered when applying the OAI-PMH protocol. We study the problem from two perspectives: the schema and the architecture. We present a preliminary architecture that integrates ontologies, agents and P2P technology to support schema mapping, and we describe a semantic negotiation strategy between heterogeneous agents.

1 INTRODUCTION

With the rapid growth of research on the Semantic Web, many people believe that the Semantic Web may first emerge in controlled communities such as DLs, because the reliability of metadata can be guaranteed there. Meanwhile, because DLs can be accessed over the Internet inexpensively and conveniently, the number of DLs is increasing sharply, and they cover many topics, such as science, history and culture. Moreover, more and more libraries use Web resources to populate their collections. As a result, different DL schemas and metadata formats are found not only among cooperative DLs but also among open-access web-based collections, which increases the difficulty of finding appropriate information on a specific topic or requirement.

Over the past decade, many approaches have been proposed to weave distributed DLs together (to improve Recall) and to alleviate the problems caused by schema variety (to improve Precision). From the schema perspective, facilitating the federation of distributed DLs or content providers on the Web requires a protocol that can ‘harvest’ the metadata in different collections. In the DL community, two well-known protocols are Z39.50 (Z39.50 protocol) and the Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH (Carl, OAI 2002). The former addresses a number of issues in a more complete manner, but it is expensive to adopt. Generally speaking, no matter how great its functionality, an approach with a high cost of adoption will not be widely used. Z39.50 has rich mechanisms but has seen limited deployment, in contrast to the rapid and broad acceptance of basic web components such as HTTP and HTML (Carl, 2002). OAI-PMH therefore aims to establish a low-entry, well-defined interoperability framework applicable across domains (Carl, 2001). It provides an application-independent interoperability framework based on metadata harvesting. Two roles are involved in OAI-PMH: Data Provider and Service Provider. The requirement for metadata (schema) interoperability is addressed by requiring all OAI Data Providers to supply a common metadata set, (unqualified) Dublin Core (DCMES, 2003). However, current metadata harvesting approaches suffer from incorrect metadata (e.g. XML encoding or syntax errors), poor metadata quality, and metadata inconsistency (MARTIN, 2003). Because unqualified DC elements can be used flexibly, some elements, e.g. ‘type’, ‘format’ and ‘language’, may not share a controlled vocabulary, which would otherwise improve consistency and hence the quality of service (Hyunki, 2003). Furthermore, the simplicity of DC reduces Precision in searching because of its limited descriptive capability. Qualified Dublin Core (QDC, 2001) would solve some of these problems, but, like Z39.50, it is expensive to create and deploy. From the DL infrastructure perspective, many federated DLs have been implemented with a centralized architecture, which requires a supporting organization to maintain them.


Ding H. and T. Sølvberg I. (2004).

TOWARDS THE SCHEMA HETEROGENEITY IN DISTRIBUTED DIGITAL LIBRARIES.

In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 307-312

 SciTePress

These approaches work well within a controllable organization. For example, the BIBSYS library system (BIBSYS) federates 92 sub-libraries distributed throughout Norway in different colleges and universities. Although the sub-libraries are geographically distributed, BIBSYS mandates that all of them adopt the BIBSYS-MARC (BIBSYS-MARC, 2001) metadata format. The National Library of Norway administers the centralized library, and each sub-library is allowed to have additional metadata standards for its own specific uses.
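
The OAI-PMH harvesting model discussed earlier in this section can be made concrete with a small sketch. It shows how a Service Provider might build a `ListRecords` request for unqualified Dublin Core (`oai_dc`) and read the `dc:` fields from a response. The base URL, the sample response and the helper names `list_records_url` and `extract_dc_fields` are assumptions made for illustration only; a real harvester would also need to handle resumption tokens and protocol errors.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_DC_NS = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for the given repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hypothetical, heavily abbreviated response used instead of a network call.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:language>en</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_dc_fields(xml_text):
    """Return one dict per record, mapping DC element names to lists of values."""
    root = ET.fromstring(xml_text)
    records = []
    for dc_block in root.iter(OAI_DC_NS + "dc"):
        fields = {}
        for element in dc_block:
            if element.tag.startswith(DC_NS):
                name = element.tag[len(DC_NS):]
                fields.setdefault(name, []).append((element.text or "").strip())
        records.append(fields)
    return records


url = list_records_url("http://example.org/oai")  # hypothetical repository
records = extract_dc_fields(SAMPLE_RESPONSE)
```

Note that values such as `language` arrive as free text; nothing in the request or response forces a controlled vocabulary, which is the source of the inconsistency problems mentioned above.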

As argued above, we believe it is impractical, if not impossible, to create globally applicable, unique identifiers (names) for all the kinds of objects that we intend to search, browse or exchange. We also believe that future DLs will consist of many small or medium-sized libraries that provide specific services for users. Additionally, users should be able to access not only cooperative (federated) DLs but also non-cooperating DLs at the same time.

In this paper, we propose to integrate DL systems in a new manner that combines semantic negotiation, agent and Peer-to-Peer (P2P) technologies. Our goal is to let the agent components embedded in different libraries communicate semantically. Agents that can understand one another will help improve data quality when metadata is harvested between libraries.

The rest of the paper is organized as follows: Section 2 introduces related work on schema interoperability; Section 3 presents a multi-agent based P2P architecture for distributed DLs, in which heterogeneous agents negotiate, on the basis of ontologies, to understand the meaning of differing schemas where they exist. Discussion and conclusions come in the final section.

2 RELATED WORK

Mappings between heterogeneous schemas have

been studied for quite a while.

A framework for dealing with heterogeneous OSM schemas is presented in (Biskup, 2003). OSM models contain objects, their relationships and a predicate calculus for expressing constraints. The global schema is defined ontologically and independently of the source schemas. Interaction with an administrator is assumed (although not required) for setting up deterministic mappings between objects (and relations, respectively).

TSIMMIS (Chawathe, 1994) is one of the early

systems integrating heterogeneous digital libraries.

Schema mappings are defined in a textual format

with actions which are executed when a

corresponding template matches a query.

With the growing popularity of XML, mappings between different DTDs have also been investigated. Due to the deterministic nature of XML, uncertainty is not supported by any of these approaches. A tree-grammar-based approach for inducing integrated views (XML-QL templates which can be used for stating user queries) for XML data with heterogeneous DTDs is presented in (Jeong, 1995). Type trees derived from the source DTDs are converted into a tree automaton. States belonging to similar types are merged to obtain a minimized integrated view.

MIND (Henrik, 2003) uses probabilistic logics for uncertain schema mapping. It maps DAML+OIL into probabilistic Datalog (Norbert, 2000) and uses XSLT for actually transforming queries and documents.

The National Science Digital Library (NSDL) adopts eight native metadata standards. The collections selected for inclusion in the NSDL have metadata conforming to common or well-established standards, if they have metadata at all. If they do, the system automatically crosswalks the native metadata to qualified Dublin Core (QDC, 2001), which provides a lingua franca for interoperability. If not, the system processes the content and generates metadata automatically (Carl, 2002).

3 SEMANTIC NEGOTIATION IN AGENT P2P-BASED DISTRIBUTED DLS

3.1 Architecture

Based on the arguments above, we propose an agent P2P-based platform in which ‘harvesting agents’ harvest metadata from other libraries through a semantic negotiation mechanism. Agents are autonomous program units capable of working towards a set of goals. In a multi-agent system, cooperating agents need a shared set of conventions (Wooldridge, 1995). The traditional approach is to agree upon a set of conventions,

ICEIS 2004 - HUMAN-COMPUTER INTERACTION


in particular a set of domain ontologies, beforehand, and then to embed them into the agent communication protocols. In constructing agent-based distributed DLs, several open problems remain:

It is hard to reach a world-wide consensus on an ontology base, as mentioned above, and hence it is unrealistic to expect an associated language for every possible domain of multi-agent application.

Agent P2P-based DL systems are open systems because they comprise not only cooperative DLs but also non-cooperating DLs. This means that the conventions cannot be defined once and for all but are expected to expand as new needs arise.

Agent P2P-based DLs are typically distributed systems. There is no central control server.

Therefore, the agents involved need a shared lexicon in order to communicate a description. We believe that a co-evolutionary coupling of ontology and agent communication language will help improve coordination in distributed DLs.

Figure 1 illustrates a general sketch of the

architecture we propose.

The agents involved are autonomous and may or may not be cooperative (e.g., Library B and Library D), which suits real-world situations. According to the figure, Library A and Library D share a common metadata format, say DC (DCMES, 2003), so A can directly harvest metadata records from D. Library B, however, does not support the DC format but rather the Encoded Archival Description (EAD, 2002). A schema mapping from EAD to DC is therefore needed if A also wants to harvest metadata records from B. The agent in A can then be activated to negotiate with the agent in B for the schema mapping between them (the negotiation is described in detail in Section 3.3).
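
To illustrate what a schema mapping from EAD to DC amounts to once it has been agreed, the following minimal sketch applies a term-to-term crosswalk to a flat record. The table `EAD_TO_DC`, the function `crosswalk` and the sample record are assumptions made for this illustration; they are not an official crosswalk and are not taken from the architecture itself.

```python
# Illustrative EAD-to-DC element mapping (an assumption, not an official crosswalk).
EAD_TO_DC = {
    "unittitle": "title",
    "origination": "creator",
    "unitdate": "date",
    "langmaterial": "language",
}


def crosswalk(record, mapping):
    """Translate a flat source record into the target schema.

    Returns the mapped record together with the source elements that have
    no known mapping; those are the candidates for negotiation.
    """
    mapped, unmapped = {}, []
    for element, value in record.items():
        target = mapping.get(element)
        if target is None:
            unmapped.append(element)
        else:
            mapped[target] = value
    return mapped, unmapped


# Hypothetical record held by Library B.
ead_record = {
    "unittitle": "Sample correspondence",
    "origination": "Example, Author",
    "physdesc": "2 boxes",
}
dc_record, needs_negotiation = crosswalk(ead_record, EAD_TO_DC)
```

In this sketch `physdesc` has no DC counterpart in the table, so it is reported back rather than silently dropped; in the proposed architecture such leftover terms would be what the two agents negotiate about.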

In the architecture, semantics-based negotiation takes place between two heterogeneous agents embedded in different library systems. We have not chosen a pure P2P infrastructure because current search methods, such as the JXTA search protocol, assume that all providers are cooperative and therefore provide complete, reliable resource descriptions. This is impractical in some federated DL environments, where many libraries consider their rich metadata an important asset and permit only ‘privileged’ users to access their collections (Carl, 2002). We therefore propose to bring in agent technology, because it can support communication between two libraries regardless of whether they are cooperative. Furthermore, agent-based communication mechanisms and technology are fairly mature and are especially suitable for explaining a specific schema (negotiation). The major overhead may come from the negotiation itself.

On the other hand, unlike the classical use of multi-agent systems in DLs, e.g., the UMDL agents at the University of Michigan (William, 1995), which rely on a mediator to facilitate communication between agents, we plan to integrate the mediating functionality into each agent’s own capabilities. This lets an agent keep track of its neighbourhood and cache the locations of other agents. In this way an agent P2P network is formed and the central bottleneck of the system is alleviated.
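
One way to picture an agent that tracks its own neighbourhood and caches the locations of other agents, without any central mediator, is sketched below. The class `LibraryAgent`, the hop-limited forwarding in `locate`, and the three-library topology (including a Library C using MARC) are assumptions chosen for illustration; the paper does not prescribe a particular lookup algorithm.

```python
class LibraryAgent:
    """A peer that knows only its direct neighbours and what it has cached."""

    def __init__(self, name, schema):
        self.name = name
        self.schema = schema
        self.neighbours = {}      # neighbour name -> LibraryAgent
        self.location_cache = {}  # schema name -> name of an agent supporting it

    def connect(self, other):
        """Register a symmetric neighbour relationship."""
        self.neighbours[other.name] = other
        other.neighbours[self.name] = self

    def locate(self, schema, max_hops=3, visited=None):
        """Find an agent supporting `schema` by asking neighbours, hop by hop."""
        visited = set() if visited is None else visited
        visited.add(self.name)
        if self.schema == schema:
            return self.name
        if schema in self.location_cache:
            return self.location_cache[schema]
        if max_hops == 0:
            return None
        for neighbour in self.neighbours.values():
            if neighbour.name in visited:
                continue
            found = neighbour.locate(schema, max_hops - 1, visited)
            if found is not None:
                self.location_cache[schema] = found  # remember for next time
                return found
        return None


# Hypothetical topology: A -- C -- B
a = LibraryAgent("A", "DC")
b = LibraryAgent("B", "EAD")
c = LibraryAgent("C", "MARC")
a.connect(c)
c.connect(b)
ead_holder = a.locate("EAD")
```

After the first lookup both A and C hold a cached location for EAD, so later requests need not be forwarded again. The hop limit is a simple guard against unbounded forwarding in a larger network.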

The major characteristics of the proposed approach are:

• No central control server. The agents have to coordinate among themselves in a self-adaptive fashion.

• The ontology remains adaptive. A newly joining DL system that uses different metadata, or no metadata at all, may require the meaning of terms in a specific schema to be induced.

• Library systems can join and leave freely, as in a P2P network.

Currently, little has been done in the DL community to bring together P2P networks and agents for semantics-based interoperability. Combining P2P, agents and semantics is thus an unexplored research topic, and we believe it is worth pursuing further.

Figure 1: Semantics-based Interoperability in Distributed DLs


3.2 The Role of Ontologies in DLs

Before describing the semantic negotiation strategy between two heterogeneous agents, we need to revisit the role of ontologies in DLs. Following the discussion above, we believe that in the development of future digital libraries, deploying carefully generated ontologies or thesauri will offer higher reliability and quality for DL services. Furthermore, adopting ontologies will help map among related schemas or integrate various schemas into a repository to support content-based retrieval. In fact, DL researchers have implicitly applied the idea of ontologies in DLs, for example in the classification of digital records. But there is still a long way to go to realize ontology-based harvesting, searching, browsing, etc. in DLs.

As for ontologies themselves, James Hendler states that the Semantic Web will contain a great number of small, possibly mutually inconsistent ontological components that consist largely of pointers to each other, instead of a few large and consistent ontologies (James, 2001). Currently, the most promising approach to comparably ‘large’ standard ontologies is the effort to clean up, refine, validate and merge existing resources, e.g. WordNet (http://www.cogsci.princeton.edu/wn), HowNet (http://www.keenage.com/zhiwang/ezhiwang.html), CoreLex (http://www.cs.brandies.edu/~paulb/CoreLex/overview.html), the publicly accessible part of Cyc (http://www.cyc.com/), etc., for practical applications such as ontology/metadata mapping in DLs. Programs are available to help validate designed ontologies (Nicola, 2002).

According to the well-known ‘Five Papers on WordNet’ (Miller, 1990), the essential relations between concepts are:

• Synonymy (similar concept): <creator, maker>

• Hyponymy (narrower-broader/ISA):

• Meronymy (part-of/HASA): <creator has personality>

• Derivationally related terms/concepts:

A number of papers in the DL and IR communities have described the considerable improvement obtained by adopting synonymy and hyponymy, for example in query expansion. This paper is not yet another endeavour to propose new approaches for performance improvement. Rather, it concentrates on how these relations can be incorporated into distributed DLs to alleviate the problems caused by schema heterogeneity. The following section concentrates on the semantic negotiation strategy.
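
As a reminder of how synonymy and hyponymy are typically used in query expansion, the sketch below expands a list of query terms with a tiny hand-written lexicon. The tables `TOY_SYNONYMS` and `TOY_HYPONYMS` are assumptions standing in for a real lexical base; apart from the <creator, maker> pair given above, their entries are invented for illustration and are not taken from WordNet.

```python
# Hand-written toy lexicon (an assumption; not extracted from WordNet).
TOY_SYNONYMS = {"creator": {"maker"}, "maker": {"creator"}}
TOY_HYPONYMS = {"document": {"report", "thesis"}}  # broader term -> narrower terms


def expand_query(terms, synonyms=TOY_SYNONYMS, hyponyms=TOY_HYPONYMS):
    """Return the query terms followed by their synonyms and narrower terms.

    Order is preserved and duplicates are dropped, so the original terms
    always come before the terms they introduce.
    """
    expanded = []
    for term in terms:
        candidates = [term]
        candidates += sorted(synonyms.get(term, ()))
        candidates += sorted(hyponyms.get(term, ()))
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_query(["creator", "document"])
```

The same two relations are what an agent can fall back on when it meets an unfamiliar schema term: a synonym or a broader term that it already knows is a candidate mapping.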

3.3 Semantic Negotiation Strategy

Semantic negotiation is a general-purpose mechanism that can be used in many different contexts for exchanging schema information and descriptions. During negotiation, the agent on the Service Provider (SP, with the same meaning as in OAI-PMH) is expected to interpret the schema formats of the heterogeneous Data Provider (DP, also from OAI-PMH). The process is as follows:

1) When agent sp(i) asks agent dp(j) for its schema format information, agent dp(j) sends agent sp(i) a list of terms, described using a lexical base such as WordNet. Agent sp(i) should also support this kind of lexical base, because it is almost impossible for two agents to comprehend each other and exchange data without any shared vocabulary or thesaurus.

2) If agent sp(i) does not understand the description, it responds with an error code indicating that the description cannot be understood, and lists the particular terms that were not understood. Based on this feedback, agent dp(j) can try to provide a description that the Service Provider is more likely to understand.

3) If agent sp(i) partially understands the description, that is, there are some mismatching terms, it returns an error code saying so. It can optionally also tell agent dp(j) which part of the description was not satisfied by any of the terms.

4) If agent sp(i) understands the description, it returns a confirmation to agent dp(j). In the case where the answer is a list of resources, the answer may include additional data about each resource, which agent sp(i) may cache in anticipation of future queries about these resources.

The sequence diagram is illustrated in Figure 2.
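
The three possible replies of agent sp(i) in steps 2 to 4 can be sketched as a small classification function. The `Reply` values and the function `evaluate_description` are hypothetical names introduced here; the process above speaks only of an ‘error code’ and a ‘confirmation’ and does not define concrete codes. The vocabulary used is the fifteen-element unqualified Dublin Core set.

```python
from enum import Enum


class Reply(Enum):
    """Hypothetical reply codes for the Service Provider agent."""
    UNDERSTOOD = "understood"
    PARTIAL = "partially understood"
    NOT_UNDERSTOOD = "not understood"


# The fifteen elements of unqualified Dublin Core.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def evaluate_description(terms, known_terms=DC_ELEMENTS):
    """Classify a Data Provider's term list from the Service Provider's side.

    Returns the reply code and the list of terms that were not understood,
    which the Data Provider can use to refine its description.
    """
    unknown = [term for term in terms if term not in known_terms]
    if not unknown:
        return Reply.UNDERSTOOD, []
    if len(unknown) == len(terms):
        return Reply.NOT_UNDERSTOOD, unknown
    return Reply.PARTIAL, unknown


reply, unknown = evaluate_description(["title", "author"])
```

Returning the unknown terms together with the code corresponds to the feedback in steps 2 and 3, which lets agent dp(j) refine only the part of the description that failed.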

Let us take a simple example. Suppose agent sp(1) on Library A queries agent dp(2) on Library B for its metadata schema, and agent dp(2) responds with its metadata format, in which there is one term, ‘author’, that agent sp(1) does not understand. Agent sp(1) then sends feedback to agent dp(2) reporting the unknown term. Based on the feedback, agent dp(2) provides a description (see below) that is generated from the prerequisite query on WordNet.

From the fragment of the description, agent sp(1) finds that ‘creator’ is one of the DC elements that Library A supports. It therefore responds that it has understood the term and caches the mapping for later use between Library A and Library B. The mapping thus focuses on specific relationships among specific libraries.
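
The exchange in this example can be sketched end to end as follows. The classes `ServiceProviderAgent` and `DataProviderAgent`, the function `negotiate`, and the hand-written `TOY_LEXICON` are all assumptions for illustration; in particular, the lexicon entry for ‘author’ merely stands in for the result of a WordNet query and is not copied from WordNet.

```python
# Toy lexical base standing in for WordNet (hand-written assumption).
TOY_LEXICON = {"author": ["writer", "creator"]}


class DataProviderAgent:
    def __init__(self, name, schema_terms, lexicon):
        self.name = name
        self.schema_terms = list(schema_terms)
        self.lexicon = lexicon

    def describe_schema(self):
        """Step 1: send the list of schema terms."""
        return list(self.schema_terms)

    def explain(self, term):
        """Refine a term that was not understood with related terms."""
        return list(self.lexicon.get(term, []))


class ServiceProviderAgent:
    def __init__(self, known_terms):
        self.known_terms = set(known_terms)
        self.mapping_cache = {}  # (provider name, foreign term) -> local term

    def resolve(self, provider_name, term, related_terms):
        """Pick the first related term this agent knows and cache the mapping."""
        for candidate in related_terms:
            if candidate in self.known_terms:
                self.mapping_cache[(provider_name, term)] = candidate
                return candidate
        return None


def negotiate(sp, dp):
    """Run one negotiation round; return the terms that remain unresolved."""
    unresolved = []
    for term in dp.describe_schema():
        if term in sp.known_terms:
            continue
        if sp.resolve(dp.name, term, dp.explain(term)) is None:
            unresolved.append(term)
    return unresolved


sp1 = ServiceProviderAgent({"title", "creator", "date"})           # Library A (DC subset)
dp2 = DataProviderAgent("Library B", ["title", "author"], TOY_LEXICON)
leftover = negotiate(sp1, dp2)
```

The cache is keyed by the Data Provider's name as well as the term, reflecting the point that a mapping holds between specific libraries rather than globally.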

4 DISCUSSION AND CONCLUSION

Even though the WWW contains more information than any single traditional library, it cannot substitute for the traditional library because it lacks the library’s services (particularly organization and sophisticated search support) (William, DLib1995). No one is dismantling their libraries because of the WWW yet. On the other hand, as webpage and media editing tools become better and access to networks becomes easier and cheaper, there will be millions of content suppliers. The sharp increase in public DLs available on the Web is good evidence of this. However, people also have difficulty finding appropriate information because of the voluminous collections, and hence difficulty locating the proper repositories. A key issue in this problem is schema heterogeneity.

Many approaches have been proposed in the DL community to address the problem, and many practical DL systems have appeared. Some of the solutions create an integrated, global schema set that may include exactly one metadata format (e.g. MARC21) or several (e.g. Dublin Core, Encoded Archival Description, etc.). Each individual library then maps its local metadata format to the global one. If the global metadata set contains just one format, such as BIBSYS-MARC in BIBSYS, all of the cooperative DLs must abide by that format, although they can extend some items locally. A metadata set may instead hold several schema formats, as in NSDL, which adopts eight metadata standards; the collections selected for inclusion in the NSDL have metadata conforming to common or well-established standards.

Such approaches inevitably face scalability problems, specifically when libraries join and leave. The cooperative libraries must take pains to adjust the global view of the metadata set or to reformat their local metadata standards. The UMDL adopts agent technology in DL development, with the intention of creating a flexible software architecture that can federate as many content suppliers, information-organizational schemas and service providers as possible, and yet scale to the extremely large size needed to support the DLs of the future (William, DLib1995).

However, UMDL has not utilized the emerging Semantic Web technology, which is widely accepted as being able to offer semantic grounding. In distributed DLs, a promising direction is to embed semantic negotiation strategies into the agent communication policies.

In this paper, we first discussed the problems caused by schema heterogeneity in DLs. Many implementations of the OAI-PMH protocol have also reported findings on this issue. We believe that future DLs cannot be accomplished without carefully designed ontologies. The essential types of ontological relations that could improve schema mapping were also presented. To provide a platform for semantics-based agent communication in a distributed DL environment, we proposed a preliminary architecture that integrates ontologies, agents and P2P to support schema mapping. The semantic negotiation strategy has also been described. We are aware that many open questions remain, so this work should be considered a stepping stone, and it is worth pursuing further.

Figure 2: The Sequence Diagram for the Semantic Negotiation

REFERENCES

BIBSYS, the Norwegian library automation network,

http://www.bibsys.no

BIBSYS-MARC : Bibliografisk format. BIBSYS, 2001.

http://www.bibsys.no/handbok/marc/marc.pdf

Biskup,J. and Embley, D. W., Extracting information from

heterogeneous information sources using

ontologically specified target views. Information

Systems, 28(3):169–212, 2003.

Carl Lagoze, Herbert Van de Sompel, 2002, The Open Archives Initiative Protocol for Metadata Harvesting, http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm

Carl Lagoze, Herbert Van de Sompel: The open archives

initiative: building a low-barrier interoperability

framework. JCDL 2001: 54-62

Carl Lagoze, William Y. Arms, Stoney Gan, Diane

Hillmann, Christopher Ingram, Dean B. Krafft,

Richard J. Marisa, Jon Phipps, John Saylor, Carol

Terrizzi, Walter Hoehn, David Millman, James Allan,

Sergio Guzman-Lara, Tom Kalt: Core services in the

architecture of the national science digital library

(NSDL). JCDL 2002: 201-209.

Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K.,

Papakonstantinou, Y., Ullman, J., and Widom, J.. The

TSIMMIS project: Integration of heterogeneous

information sources. In 16th Meeting of the

Information Processing Society of Japan, pages 7–18.

Tokyo, Japan, 1994.

Dublin Core Metadata Element Set (DCMES), 2003,

Version 1.1: Reference Description,

http://dublincore.org/documents/dces/

Dublin Core Qualifiers (QDC), 2001, http://dublincore.org/documents/2000/07/11/dcmes-qualifiers/

Encoded Archival Description (EAD), Version 2002,

http://lcweb.loc.gov/ead/

Henrik Nottelmann, Norbert Fuhr: Combining

DAML+OIL, XSLT, and Probabilistic Logics for

Uncertain Schema Mappings in MIND. ECDL 2003:

194-206, Trondheim, Norway.

Hyunki Kim, Chee-Yoong Choo, Su-Shing Chen: An Integrated Digital Library Server with OAI and Self-Organizing Capabilities. ECDL 2003: 164-175, Trondheim, Norway.

James Hendler, Agents and the Semantic Web. IEEE

Intelligent Systems, 16(2):30-37, 2001.

Jeong, E. and Hsu, C.-N.. Induction of integrated view

for XML data with heterogeneous DTDs. In Paques

et al. [17], pages 151–158.

Martin Halbert, Joanne Kaczmarek, and Kat Hagedorn,

Findings from the Mellon Metadata Harvesting

Initiative. ECDL2003, pp.58-69, Trondheim,

Norway.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D.,

Miller, K.: Five Papers on WordNet. Special Issue of

International Journal of Lexicography, 3 (4). (1990)

National Science Digital Library (NSDL), http://nsdl.org

Nicola Guarino, Christopher Welty, Evaluating Ontological Decisions with ONTOCLEAN, Communications of the ACM, Feb. 2002, Vol. 45, No. 2.

Norbert Fuhr: Probabilistic datalog: Implementing logical

information retrieval for advanced applications.

JASIS 51(2): 95-110 (2000)

William P. Birmingham, Edmund H. Durfee, Tracy

Mullen, Michael P. Wellman, The Distributed Agent

Architecture of the University of Michigan Digital

Library (Extended Abstract), AAAI Spring

Symposium on Information Gathering, 1995.

William P. Birmingham, An Agent-Based Architecture for

Digital Libraries, DLib Magazine, July 1995.

Wooldridge, M. and N.R. Jennings, Intelligent Agents:

Theory and Practice. Knowledge Engineering

Review, 1995, 10(2).

Z39.50 protocol, http://www.loc.gov/z3950/
