HARVESTING AND AGGREGATION OF DIGITAL LIBRARIES

USING THE OAI FRAMEWORK

Alon Kadury and Ariel J. Frank

Department of ComputerScience, Bar-Ilan University, Ramat Gan, Israel

Keywords: Digital library, Harvesting, Aggregation, SDL, FDL, HDL, OAI, OAI-PMH

Abstract: Digital Libraries (DLs) are an important tool for quality information retrieval over the Internet. However,

with the information explosion on the Internet and the increase in the number of DLs, users might need to

search several DLs before finding the relevant information looked for. Federated DLs (FDLs) and Harvested

DLs (HDLs) can solve this problem. To overcome the lack of uniformity and interoperability problems

between DLs and to develop standards that can ease the distribution of information between them, an

initiative called OAI established the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)

protocol. While this protocol enables the construction of FDLs, HDLs are hardly found. This paper presents

the RIDDLE model for construction and aggregation of HDLs using the OAI-PMH protocol with some

modifications. The use of RIDDLE can provide for high quality information retrieval from digital libraries

on the Internet.

1 INTRODUCTION

The rapid growth nowadays of available information

on the Internet increases the risk of “Information

Explosion” and the difficulties in retrieving relevant

information rise. Ad hoc and unsound use of Web

Search Engines (SEs) is akin to searching for a

needle in a haystack, with similar outcomes.

On the other hand, in appropriate circumstances,

wise use of relevant Digital Libraries (DLs) provides

high quality information retrieval of authoritative

results, since DLs are quality oriented. However, the

availability, Awareness and discovery of DLs on the

Internet are still lacking (Yom Tov & Frank, 2006).

Moreover, even if DLs are used, users might

need to search several DLs before finding the

relevant information looked for. Here we investigate

some of the arisen questions, and propose a

promising solution for increasing the availability and

discovery of DLs using harvesting and aggregation

techniques.

In section 2 we elaborate more on digital

libraries, their types and use. Section 3 discusses the

OAI-PMH protocol and presents the problems

encountered in construction of HDLs using the OAI-

PMH protocol. Section 4 introduces the proposed

RIDDLE model for construction and aggregations of

HDLs, and describes an initial prototype. Section 5

concludes with a discussion and future directions.

2 DLS ON THE INTERNET

Digital Libraries (Arms, 2000) are both a direct

extension and complement of classical (analogical)

libraries. Here we define a DL as having six major

characteristics (Hanani & Frank, 2000; Sharon &

Frank, 2000). DLs exist on the Internet for over a

decade, and their number is growing. The

advantages of DLs holding digital collections are

that they provide quality, up-to-date materials,

offered with rich library services.

2.1 Types of Digital Libraries

There are various classifications of digital library

types in the literature. Here we classify DLs into the

following three types (Sharon & Frank, 2000):

1. Stand-alone Digital Library (SDL) –

a regular library implemented locally in a fully

computerized fashion, with networked access.

2. Federated Digital Library (FDL) –

a logical federation of entire autonomous

3. Libraries, based on common focus and topic, on

the network.

4. Harvested Digital Library (HDL) –

a virtual library providing mainly metadata

based access to relevant items that are

distributed over the network.

441

Kadury A. and J. Frank A. (2007).

HARVESTING AND AGGREGATION OF DIGITAL LIBRARIES USING THE OAI FRAMEWORK.

In Proceedings of the Third International Conference on Web Information Systems and Technologies - Web Interfaces and Applications, pages 441-446

DOI: 10.5220/0001284704410446

 SciTePress

As can be expected, SDLs are the most prevalent

type of DLs. However, with the ever increasing

number of SDLs, users looking for certain

information may be compelled to look into several

SDLs before they can fully locate relevant

information on the topic they are looking for.

In order to alleviate this difficulty, several

solutions have been proposed in order to generate a

single entry point that would transparently provide

coverage for relevant SDLs. Such solutions include:

1. Use of vertical search engines.

2. Access to a FDL library.

3. Search through a HDL library.

Use of vertical search engines can lead to finding

relevant results but they also suffer from the

aforementioned SE disadvantages (Xiaoming, 2001;

Yom Tov & Frank, 2006).

An FDL federates several autonomous SDLs by

logically constructing a flat composition of all their

contents to form a unified library. Example FDLs

are the Networked Computer Science Technical

Reference Library NCSTRL (www.ncstrl.org) and

the National Science Digital Library NSDL

(nsdl.org).

The FDL provides a transparent, uniform

interface to all the underlying SDLs' contents, while

overcoming lack of uniformity and any

interoperability problems between them. Since

different SDLs are involved, FDLs tend to have

relatively coarse granularities. Moreover, each

additional SDL that is federated increases the

granularity of the FDL even more.

An HDL, on the other hand, filters SDL

resources that are relevant to its focused library topic

by harvesting only their metadata into the HDL. A

framework for generating HDLs is the Katsir system

described in (Hanani & Frank, 2000). Katsir was

based on the Harvest system that used the SOIF

format for summarizing varied resources into

metadata records that were kept in the constructed

HDLs. Example HDLs are the National

Nanotechnology Initiative (www.nano.gov/) and

SourceBank (www.sourcebank.com/).

HDLs tend to have relatively fine topic

granularities, which isn't increased as additional

relevant resources on the same topic are harvested.

Retrieving quality information on a focused topic is

an advantage of HDLs over FDLs which tend to

have coarser topic granularities. Moreover, as need

arises, HDLs can be easily composed to form

aggregated HDLs with any coarser topic granularity

required. In principle, this aggregation can continue

for as much as needed, forming a hierarchy of DLs

represented and accessed as a relevant topics tree.

3 HDLS AND THE OAI-PMH

PROTOCOL

When developers attempt to generate FDLs and

HDLs they are usually confronted with lack of

uniformity and interoperability problems between

the different SDLs involved. Major problems are

different metadata formats used by DLs, lack of

consistency in the way requests are sent and

responses received in the varied user interfaces and

in the sharing of library resources (Suleman & Fox,

2002).

3.1 The OAI-PMH Protocol

To solve the lack of uniformity and interoperability

problems and to develop standards that can ease the

distribution of information between varied

repositories, OAI (Open Archives Initiative) was

initiated. OAI established the OAI-PMH (Open

Archives Initiative Protocol for Metadata

Harvesting) protocol (OAI-PMH, 2002). The

protocol attempts to solve these problems by setting

a standard for sharing information between many

repositories by use of a single metadata format.

OAI-PMH standardized the way repositories,

known as data providers, are queried (by defining 6

HTTP queries with different parameters) and the

way the answers are received (based on pre-defined

XML schemes) by the service providers. In addition,

the protocol specified DC (Dublin Core) as the basic

metadata standard that all repositories need to adhere

to (DC, 1998).

OAI-PMH clearly supports the construction of

FDLs (as service providers) based on gathering

information from SDLs (as data providers).

Moreover, the protocol was designed to provide

selective harvesting by the use of Sets (OAI-PMH,

2002) which seems to also provide for the

construction of HDLs (as service providers).

The idea of Sets is as follows. Instead of filtering

metadata records one by one, it would be efficient to

filter a group of related items at once. This is made

possible in the protocol through the use of Sets. A

set is an optional construct for grouping items for the

purpose of selective harvesting. SDLs may organize

items into sets. Set organization may be flat, i.e., a

simple list, or hierarchical. The number of sets and

their organization is up to the SDL developers. An

item record need not be affiliated with a set, even if

sets are defined, or it can be affiliated with one or

more of them.

WEBIST 2007 - International Conference on Web Information Systems and Technologies

442

3.2 Problems Constructing HDLs using

OAI-PMH

The use of the OAI-PMH protocol is accelerating

and it is employed by many SDLs and FDLs that

have implemented it (Van de Sompel & Lagoze,

2002). A few hundred OAI-based FDLs exist on the

Internet; these can be found in the OAI service

providers list

(www.openarchives.org/service/listproviders.html).

Considering the advantages of HDLs, and the

availability and benefits of OAI-PMH, we would

have expected an increasing number of HDLs on the

Internet. However, it is interesting to note that there

are nearly no HDLs that employ OAI-PMH. Even

those found, turn out to make limited use of the

protocol capabilities. The arisen question of course

is why that is so.

The major difference between FDLs and HDLs

is in the filtering and the selective harvesting done

by HDLs. Consequently, this leads to investigation

of the OAI-PMH protocol capabilities that can serve

in these goals. The outcome of this investigation

should lead to a potential solution for the harvesting

and aggregation of HDLs in the OAI-PMH

framework.

As part of our analysis, we found that OAI-PMH

enables filtering of metadata at three levels: item, set

and library, as described below.

Item-level Metadata

The item-level metadata is a summarized record of

each resource of the repository. The protocol

compels SDLs to expose their item-level metadata as

DC format, at the least, to the service providers. The

DC contains 15 elements (DC, 1998). All elements

are optional, and all elements may be repeated. This

DC metadata, comprised of fields with known

meanings, is of course an excellent source for

filtering the resources that get selected for the HDL.

Previous research (Dushay & Hillmann, 2003;

Lagoze et al., 2006; Tennant, 2004) has already

pointed out varied DC related problems that exist in

OAI repositories DC.

Group-level Metadata

Instead of filtering metadata records one by one, it

would be efficient to filter a group of related records

at once. As aforesaid, this is made possible in the

protocol through the use of Sets (Dushay &

Hillmann, 2003; Tennant, 2004).

In an inspection regarding Sets (Kadury, 2006), a

total of 164 sample OAI libraries were checked for

their use of sets and the structure and name

compatibility between them. However, nearly no

SDLs that make use of sets were found. The

investigation exposed that many SDLs do not use

sets at all. Of those that used sets, the use was low

and there was nearly no compatibility between the

extant sets. Even libraries on similar topics defined

sets in different ways using different names.

It is clear from this that attempting to use

selective harvesting using OAI-PMH for

constructing HDLs will currently be ineffective.

The difficulties and incompatibility in the use of

sets entail mainly from the following reasons:

1. Sets are defined in the protocol in a very general

manner.

2. No standard way to name sets and to describe

them by use of the optional Description field.

3. No instructions on organizing sets and

constructing a hierarchy between them.

4. It is not clear when items should be associated

just with a single set or with several sets.

The above can explain, for example, why we

have found several SDL libraries that have only one

set that is just named after the library itself.

Library-level Metadata

Having metadata about the DL itself can also be

useful when attempting to filter entire SDLs. Based

on this library-level metadata, we could initially

check if the SDL itself is at all relevant for us.

To achieve this, we could enhance the Identify

query to receive also the library metadata. The

returned information includes several mandatory

fields about the protocol parameters used and on the

identified library itself.

In addition, there is an optional Description field

that can contain a textual description of the library

and any additional details. Note that there is no fixed

syntax for this field. However, there are a few

suggestions regarding the information it should

include based on tags that describe the repository

(OAIGuide, 2002).

From the inspection done, there are libraries that

make use of this field but usually the information

contained in it is general – not very detailed and

henceforth not very meaningful. This means that the

Description field, as currently used, isn't useful in

HDL construction.

HARVESTING AND AGGREGATION OF DIGITAL LIBRARIES USING THE OAI FRAMEWORK

443

4 THE RIDDLE MODEL

Based on this investigation, we can see that OAI-

PMH has several features that could be used for

construction of HDLs. However, as they are defined,

they are too weak and need to be enhanced for

effective filtering and harvesting of SDLs as part of

the construction of HDLs.

To reach this goal, we propose here the RIDDLE

(Resource Inquiry and Discovery in a DL

Environment) model for the construction and

aggregation of HDLs (see Figure 1).

It is noteworthy that the OAI-PMH designers

envisioned an open protocol with a low entry level

to describing resources by metadata records. They

tried not to pre-enforce division of records into sets,

for example. The idea was to have a light and

flexible protocol that could be suited to a wide range

of applications. Henceforth, the RIDDLE model

preserves this spirit while suggesting enhancements

that enable better construction and aggregation of

HDLs, as described below.

4.1 Model for HDL Construction

The RIDDLE model (see Figure 1) supports the

construction of HDLs from SDLs (at layer 3) and

their further aggregation (at layer 4). The model is

based on the following proposed OAI-PMH

enhancements, introduced at the previously

described 3 levels: item, set and library.

Item-level Metadata

The solution to the aforementioned item-level

metadata problems is the mandated use of extended

DC (DCMI, 2005) that can provide a better detailed

description of the SDL resources. That is, extended

DC needs to be defined, at the least, for each

resource; DC itself isn't enough. Exemplary

extended DC fields are abstract and audience. Thus

extended DC metadata will enable better filtering at

the item-level of SDLs.

Group-level Metadata

The solution to the aforementioned group-level

metadata problems is to expand the protocol

definition of Sets through use of a naming standard

for uniform naming of the topics of the sets. The

naming standard proposed to be used here is DDC

(Dewey Decimal Classification) (DDC, 2003) that is

used for classification of topics in classical libraries

and within topic trees.

The DDC standard was chosen for this purpose

since it is in popular use in the library system, is

easy to use and provides for a detailed topics tree

(DDCServices, 2006). An example DDC topics

subtree is given in Figure 2. We propose to expand

the Sets definition with an additional DDC field that

indicates its DDC mapping. This field will enable

efficient filtering at the group-level of SDL sets.

Figure 2: An example of a DDC topics subtree.

Library-level Metadata

The solution to the aforementioned library-level

metadata problems is to mandate the use of extended

DC also in the Description field. This will enable

better filtering at the library-level itself.

It is noteworthy that with this proposed scheme,

the library itself is described in the same way as the

library resources are. Henceforth, we can use the

item-level filtering tools also for the library-level

itself. This enables the aggregation of HDLs as

suggested below.

Natural Science & Math

Physics Mathematics

Atomic nuclear physics

Heat

Life science

Magnetism

Radiations

Modern physics

Enhanced OAI-PMH

Layer 4 –

Aggregated Service

Providers

HDL

Layer 1 –

Internet

SDL SDL SDL

Layer 2 –

Data Providers

Layer 5 –

Presentation

Layer 3 –

Service Providers

Web interfaces

Figure 1: The RIDDLE model for construction and

aggregation of HDLs on the Internet.

Aggregated HDLs

Web

HDL

OAI

PMH

WEBIST 2007 - International Conference on Web Information Systems and Technologies

444

4.2 Model for HDL Aggregation

Once we have HDLs, it now becomes possible to

aggregate HDLs into coarser granularity HDLs, as

needed. The idea is to take relatively fine granularity

HDLs and construct from them relatively coarser

HDLs, in a hierarchical manner. The aggregated

HDL need not construct any new libraries but just

represent the logical composition of extant ones.

As envisioned, the aggregated HDL relies on a

hierarchical structure that is composed based on

extant hierarchies. An example of an aggregated

HDL can be seen in Figure 3. The resultant structure

enables hierarchical browse and search that is based

on the extant topics tree of libraries.

Figure 3: An example of an aggregated HDL.

To provide for this hierarchical structure we

propose to use DDC, as this standard was already

used for the Sets definition in HDL construction.

This enables each library to designate its DDC topic

and so denote its correct location in the DDC topics

tree (see Figure 2).

If need be, the DDC topics tree structure can be

expanded (Kadury, 2006), so this solves the problem

of a library whose topic name doesn't fit an existing

DDC name, or the inclusion of several libraries with

the same DDC topics name.

Each HDL will have a library-level metadata

record, like SDLs have. This record should be

delivered then in response to an Identify query sent

to an HDL. This will provide the needed library-

level metadata on the libraries that need be

aggregated beneath it.

If the HDLs would be enhanced with OAI-PMH

interfaces (as data providers), they could also enable

the harvesting of their data by the aggregated HDLs.

While OAI-PMH enables offline harvesting, with

few modifications (ibid) online harvesting could be

made possible and beneficial in certain cases.

4.3 The RIDDLE Prototype

The initial prototype implementation realizes the

functionalities of the RIDDLE model (Kadury,

2006). The prototype supports the construction of

both FDLs and HDLs, and the further aggregation of

HDLs. The aggregated HDLs can be searched by

browsing a DDC topics tree or by free textual

search, which searches the HDLs' library-level

metadata.

The prototype supports several user interfaces

used for displaying the search results: a common list

oriented interface, and a Google-like interface that

lists HDLs that answer the search criteria, instead of

showing the list of sponsored links.

A suitable test collection of FDLs and HDLs was

generated using the OAI-PMH protocol with needed

manual cleansing of the metadata records and their

augmentation with extended DC fields (ibid).

Several tests where carried out on the RIDDLE

prototype, in order to check the quality of

information retrieval from HDLs versus using

several FDLs, by checking precision, relative recall

and efficiency (by F-measure which is the weighted

harmonic average of precision and recall). Tests on

the ease of discovering and using aggregated HDLs

were also done.

Comparisons were made between tasks that were

first carried out on FDLs and then on HDLs (ibid).

The summarized results show that HDLs received

better marks on all measures compared to the FDLs

(see Figure 4).

In addition, several experiments were carried out

with a group of users that tried to separately locate

information using several FDLs and several HDLs.

These experiments also exhibited that the users

preferred HDLs over FDLs. The users preferred

using a single HDL interface rather than several

FDL interfaces and liked the ease of locating the

needed information from the HDL results.

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

PrecisionRecallF-measur e

HDLs

FDLs

Figure 4: Comparison of precision, recall and F-measure.

University

Life Sci

Exact Sci

Social Sci

Chemistry Comp Sci

Hardware Software

HARVESTING AND AGGREGATION OF DIGITAL LIBRARIES USING THE OAI FRAMEWORK

445

5 DISCUSSION AND FUTURE

DIRECTIONS

The major problems pointed out in this paper are the

lack of uniformity in SDLs using OAI-PMH that

prevents effective creation of HDLs, thus motivating

the need for the RIDDLE framework. The RIDDLE

model suggests the changes needed in the OAI-PMH

protocol to enable the efficient construction of HDLs

and their aggregation as needed. The major changes

promote the repeat use of both the extended DC and

DDC standards to provide for better metadata at all

library levels that enables selective harvesting and

aggregation.

Initial testing indicates that use of an HDL is

more efficient when compared to the use of several

separate FDLs. RIDDLE received high scores from

the end users. In addition, the users found the

process of locating aggregated HDLs in the DDC

topics tree easy to use. In general, the users preferred

RIDDLE due to its ease and effectiveness in

searching and retrieving results. The initial

experiments are promising but the current prototype

needs to be enhanced for wider experiments and for

public use.

In this paper we presented the changes required

in the OAI-PMH protocol to enable the construction

and aggregation of HDLs using RIDDLE.

Introducing these required extensions into the OAI-

PMH protocol can enable efficient construction and

aggregation of HDLs, and consequently better

information retrieval from digital libraries on the

Internet.

ACKNOWLEDGEMENTS

We kindly thank Dr. Yaffa Aharoni and Taly Sharon

for insightful contributions and support, and Nir

Yom Tov and Elina Gershenovich for their active

participation in the related digital libraries research.

REFERENCES

Arms, W. Y. (2000). Digital Libraries. MIT Press,

Cambridge.

DC, Dublin Core Metadata for Resource Discovery.

(1998, September). The Internet Engineering Task

Force (IETF). RFC2413. Retrieved August 2006, from

www.ietf.org/rfc/rfc2413.txt

DCMI, DCMI Metadata Terms. (2005, June). Dublin Core

Metadata Initiative. Retrieved August 2006, from

dublincore.org/documents/dcmi-terms/

DDC, Introduction to Dewey Decimal Classification.

(2003). OCLC, Forest Press. Retrieved August 2006,

www.oclc.org/dewey/versions/ddc22print/intro.pdf

DDCServices, Dewey services. (2006). OCLC, Cataloging

and Metadata. Retrieved October 2006, from

www.oclc.org/dewey/

Dushay, N., Hillmann, I. D. (2003, October). Analyzing

Metadata for Effective Use and Re-use. Dublin Core

Conference, Seattle.

Hanani, U., Frank, A. J. (2000, July). Katsir: A

Framework for Harvesting Digital Libraries on the

Web. ECIS 8th European Conference on

Information Systems, Vol. 1, 306-312, Vienna,

Austria.

Hanani, U., Frank, A. J. (2000, November). The Parallel

Evolution of Search Engines and Digital Libraries:

their Convergence to the Mega-Portal. ICDL'00,

Kyoto International Conference on Digital Libraries:

Research and Practice, Kyoto, Japan, 269-276.

Kadury, A. (2006). Harvesting and Aggregation of

Digital Libraries in the OAI framework.

Master Thesis, Department of Computer Science,

Bar-Ilan University.

Lagoze, C., Krafft, D., Comwell, T., Dushay, N.,

Eckstrom, D., Sayloy, J. (2006, June). Metadata

aggregation and "automated digital libraries": A

retrospective on the NSDL experience. JCDL Joint

Conference on Digital Libraries, North Carolina, USA,

230-239.

OAI-PMH, The Open Archives Initiative Protocol for

Metadata Harvesting, Protocol Version 2.0.

(2002). Retrieved August 2006, from

www.openarchives.org/OAI/openarchivesprotocol.html

OAIGuide, Implementation Guidelines for the Open

Archives Initiative Protocol for Metadata Harvesting.

(2002), Open Archives Initiative. Retrieved August

2006, from

www.openarchives.org/OAI/2.0/guidelines.html

Sharon, T., Frank, A. J. (2000, August). Digital Libraries

on the Internet. 66th IFLA Council and General

Conference, Jerusalem, Israel, 13-18.

Suleman, H., Fox, E. (2002). The Open Archives

Initiative: Realizing Simple and Effective Digital

Library Interoperability. Journal of Library

Administration, Vol. 35, Issue 1, 125-145.

Tennant, R. (2004, July). Metadata Bitter Harvest. Library

Journal, Vol. 32. Retrieved August 2006,

www.libraryjournal.com/article/ca43444322

Van de Sompel, H., Lagoze, C. (2002). Notes from the

Interoperability Front: A Progress Report on the Open

Archives Initiative. ECDL 6th European Conference

on Digital Libraries, 144-157.

Yom Tov, N., Frank, A. J. (2006, October). Harnessing

Search Engine Technologies to Increase Awareness

and Discovery of Digital Libraries. 4th IEEE

International Conference on Information Technology:

Research and Education, Israel.

Xiaoming, L. (2001, December). DP9 Service Providers

for Web Crawlers. D-Lib Magazine, Vol. 7, No. 12.

Retrieved October 2005, from

www.dlib.org/dlib/december01/12inbrief.html

WEBIST 2007 - International Conference on Web Information Systems and Technologies

446