that are considered to be the most suitable candidates
for representation. The frequency with which the
characteristics appear on the reference sites, and the
possible interest that a digital library might have in
considering them, are some of the factors taken into
account when making the choice.
The websites used as references are Europeana,
Internet Culturale, Cultura Italia, Internet Archive,
Open Library, and Project Gutenberg. These
websites offer an overview of the objects that a digital
library is interested in representing, making it
possible to examine and compare the classification of
those same objects across the portals. The first step was
to list the different types of analysed objects, based
on the name assigned to them by each website. Each
type of element was associated with one of the following
macro-categories: “Image”, “Text”, “Audio”,
“Video”, “Ebook”, “Other”. The macro-category
“Other” groups together metadata belonging to
elements that do not fit any of the other labels (such
as metadata belonging to the legal documents group
from the previous sections). Once the nature of the
elements was defined, each group of metadata
describing an element became part of the group of
metadata associated with the nature of that same
element. The importance of this phase lies in
understanding how objects are classified and which
information was chosen to represent them. A list of
tags, divided by macro-category, is a good starting
point, but it is then useful to create a list of
tags that distinguishes them by their semantics,
regardless of their name. In order to avoid duplicates,
each tag is assigned a provisional name that recalls its
semantics, while the choice of the most suitable name
is postponed to a later phase. With a list of metadata
organised by macro-category, the remaining task was to
decide which tags to keep and which ones to reject,
considering the frequency of their use on the chosen
websites and the importance of each piece of
information for a digital library.
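The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observations, the tag names, the `min_sites` threshold and the `always_keep` set are hypothetical and do not reproduce data or criteria from the study, which also weighed the importance of each tag by judgement rather than by a fixed rule.

```python
from collections import defaultdict

# Hypothetical observations: (website, macro_category, semantic_tag).
# Tag names and their distribution are illustrative, not data from the study.
OBSERVATIONS = [
    ("Europeana", "Image", "creator"),
    ("Internet Archive", "Image", "creator"),
    ("Cultura Italia", "Image", "creator"),
    ("Europeana", "Image", "colour_depth"),
    ("Open Library", "Ebook", "isbn"),
    ("Project Gutenberg", "Ebook", "isbn"),
    ("Internet Archive", "Ebook", "scanner_model"),
]


def select_tags(observations, min_sites=2, always_keep=frozenset()):
    """Keep tags used by at least `min_sites` distinct websites, plus any
    tag explicitly judged important for a digital library."""
    sites_per_tag = defaultdict(set)
    for site, category, tag in observations:
        sites_per_tag[(category, tag)].add(site)

    kept = defaultdict(list)
    for (category, tag), sites in sorted(sites_per_tag.items()):
        if len(sites) >= min_sites or tag in always_keep:
            kept[category].append(tag)
    return dict(kept)


# "scanner_model" appears on one site only but is kept by explicit choice.
selected = select_tags(OBSERVATIONS, min_sites=2, always_keep={"scanner_model"})
```

Counting distinct websites rather than raw occurrences keeps a single portal with many records from dominating the frequency criterion.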
4.3 Refining Phase
This phase involved comparing metadata taken from
the standards analysed during the first phase with the
data collected during the second phase. The purpose
of the comparison was to verify whether all the
characteristics studied during the second phase were
represented by the metadata retrieved during the first
phase. If they were not, new metadata would be
created, either as an extension of the chosen metadata
(DC allows semantic extensions through
qualifiers) or as entirely new metadata, placed in a new
namespace. The process began with
a mapping phase, followed by the creation of new
metadata. Once the refining phase was completed,
and all available metadata were selected, we started
to design the taxonomy schema.
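As a rough illustration of the mapping step, the comparison can be seen as a coverage check between the semantics collected in the second phase and those already represented by first-phase metadata. The coverage table and tag names below are hypothetical placeholders, not the mapping actually produced in this work.

```python
# Hypothetical mapping from first-phase metadata to the semantics they cover.
PHASE_ONE_COVERAGE = {
    "dc.title": {"title"},
    "dc.creator": {"author", "creator"},
    "dc.identifier": {"identifier"},
}

# Hypothetical semantic tags collected during the second phase.
PHASE_TWO_TAGS = ["title", "author", "isbn", "publisher_city"]


def find_uncovered(tags, coverage):
    """Return, in input order, the tags that no first-phase metadata covers."""
    covered = set().union(*coverage.values())
    return [tag for tag in tags if tag not in covered]


# Tags left over here are the candidates for new qualifiers or new metadata.
uncovered = find_uncovered(PHASE_TWO_TAGS, PHASE_ONE_COVERAGE)
```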
The first step was to compare the list of tags with
the metadata selected during the previous two phases,
based on their semantics. Thus, tags whose semantics
was not covered by any metadata were identified,
with the aim of creating new metadata specifically
designed for them. Tags with semantics similar to DC
elements, but more precise, were described via new
qualifiers, while tags that could not be encompassed
by the DC standard were placed in a new
namespace called “multimediatype”. For example,
the following new qualifiers were created for the DC
element “dc.identifier”: “isbn”, “LoC”, “dewey”,
“iccd”, each of which represents a specific code
associated with the digital resource. It is not necessary
to create one metadata element for each code type, but it
was considered wiser to create four qualifiers of
“dc.identifier” for the most relevant codes: ISBN,
LoC, dewey, iccd. For the other codes, the general
“dc.identifier” can be used, and the type of code has
to be specified during insertion. The namespace
“multimediatype”, instead, includes metadata
describing federal documents, publishing
information, institutions (for example, museums and
libraries), and User-Generated Content (UGC). After
creating the metadata derived from the second phase,
we investigated whether the available metadata could
represent certain fundamental concepts. These
concepts include, for example, the categorization of
ebooks, the definition of “grey literature”
documents, UGC, and rights management. The
results of our research showed that no existing
metadata were suitable for suggesting the best software
or hardware device for using a resource,
e.g. an ebook. To overcome this, two new DC
qualifiers were created: “dc.format.testedSoftware”
and “dc.format.testedDevice”. These metadata define
the most suitable software and device with which
the resource can be used. Grey literature can be
defined by the level of education of its target users
(thus identifying the group of users that
typically use a specific kind of resource), and by the
type of document, selected from a list of types
belonging to that category (for example, papers,
theses and scientific research documents). The
corresponding metadata are “dc.audience.instructionLevel”
and “multimediatype.documentCategory”.
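The naming rule described in this section can be summarised with a small sketch: a tag that refines a DC element becomes a qualifier of that element, and anything else goes into the “multimediatype” namespace. The helper `metadata_name` is hypothetical, and the dotted form of the identifier qualifiers (for example “dc.identifier.isbn”) is an assumption modelled on “dc.format.testedSoftware”; the record values are placeholders, not real data.

```python
from typing import Optional


def metadata_name(tag: str, refines: Optional[str] = None) -> str:
    """Return a qualified DC name if the tag refines a DC element,
    otherwise a name in the "multimediatype" namespace."""
    if refines is not None:
        return f"dc.{refines}.{tag}"
    return f"multimediatype.{tag}"


# Qualifier names taken from the text; the dotted pattern is assumed.
IDENTIFIER_QUALIFIERS = ["isbn", "LoC", "dewey", "iccd"]
identifier_fields = [
    metadata_name(q, refines="identifier") for q in IDENTIFIER_QUALIFIERS
]

# A hypothetical record for a grey-literature ebook; values are placeholders.
record = {
    metadata_name("isbn", refines="identifier"): "<ISBN of the resource>",
    metadata_name("testedSoftware", refines="format"): "<reader application>",
    metadata_name("testedDevice", refines="format"): "<reading device>",
    metadata_name("instructionLevel", refines="audience"): "<education level>",
    metadata_name("documentCategory"): "<type of grey-literature document>",
}
```

For codes without a dedicated qualifier, the plain “dc.identifier” field would be used, with the code type stated at insertion time as the text describes.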
The integration of UGC metadata was performed
by focusing on those that featured a single semantics
during the first phase, and selecting the most suitable
metadata for the context.
DATA 2015 - 4th International Conference on Data Management Technologies and Applications