Knowledge Formalization and Management in KMS
Filippo Eros Pani, Maria Ilaria Lunesu, Giulio Concas, Carlo Stara and Maria Pia Tilocca
Department of Electrics and Electronics Engineering, University of Cagliari, Piazza d'Armi, Cagliari, Italy
Keywords: Knowledge Management, Open Archive, Institutional Repository, Multimedia content, Metadata.
Abstract: Organization and availability of contents in Knowledge Management System (KMS) basically depend on
two factors: one is that KMS have effective tools for information indexing and retrieval; the other is how the
tools are actually understood and used by users. This work proposes a new approach for formalization and
management of knowledge, in this case a group of audio recordings in a corpus and linguistic information
added to that corpus with annotations. The formalization level of this approach allows for effective text
retrievals through a metadata schema and easy, quick corpus interrogations, by formalizing linguistic
annotation as a structured metadata schema. The proposed approach was experimented upon and validated
during a project that aimed to create the Analytical Sound Archive of Sardinia. The archive has an
electronic corpus of spoken texts, linguistically annotated at various levels.
1 INTRODUCTION
Archive and publishing tools that facilitate the
circulation of resources on the Internet must be able
to gather information in an organized, reliable
manner, describe it, store it, and retrieve it, all with a
minimum level of interoperability for tools and
description parameters. In the field of scientific
communication, that is what happened with the
creation of knowledge management systems like
DSpace, Eprints, etc., which are institutional
archives modelled on the Open Access Initiative that
allow structuring information and adding
standardized metadata to it (Tansley, 2003) (Linch
2003) (Swan and Carr, 2008).
In this context, the “Analytic Sound Archive of
Sardinia” project aims to create an institutional
archive with a linguistically annotated electronic
corpus. An electronic corpus is generally a
homogeneous collection of written or oral texts in
digital format, processed with coherent criteria in
order to build an empirical basis for language
analysis. Its advantage is that it can be annotated by
adding linguistic information in a specific portion of
text.
The electronic corpus in the studied Institutional
Repositories (IR) will be formed by a collection of
audio recordings from poetry contests and singing
performances in Sardinian language, stored and
annotated on different linguistic levels. The purpose
of the project is the preservation, appreciation and
knowledge of Sardinian oral traditions, especially
improvised poetry.
In accordance with Open Access Initiative
(OAI), the corpus will be included in an open IR,
being therefore available for Sardinian language
scholars and everyone who wishes to use it.
Linguists and musicologists, creators of the
corpus, needed to study and research the documents
in it, and they asked for the possibility to save their
work in an readily available digital archive to store,
index and manage it for both access and
communication inside the scientific community.
The purpose of this study is to offer an original
way to associate linguistic annotations (information
associated to specific text portions) to the corpus by
treating them as metadata, so as to insert and
manage them in the archive of choice after
formalizing them in XML, the universally used
markup language for representing metainformation.
In particular, an application profile was created
for the Dublin Core metadata schema, which is
suitable to the nature of the audio recordings in the
Analytic Sound Archive of Sardinia.
In the second section of this paper we recall
some aspects about the Knowledge Management. In
the third and fourth, we present our proposed
approach for knowledge formalization and
management, and the case study. The fifth section
includes the conclusion and reasoning about the
future evolution of the project.
132
Pani F., Lunesu M., Concas G., Stara C. and Tilocca M..
Knowledge Formalization and Management in KMS.
DOI: 10.5220/0004126001320138
In Proceedings of the International Conference on Knowledge Management and Information Sharing (KMIS-2012), pages 132-138
ISBN: 978-989-8565-31-0
Copyright
c
2012 SCITEPRESS (Science and Technology Publications, Lda.)
2 KNOWLEDGE MANAGEMENT
Organization and availability of contents in KMSs
basically depend on two factors: one is whether
KMS systems have effective tools for information
indexing and retrieval; the other is how those tools
are actually understood and used by users.
The solution to this issue was found in the
experience of the library and archive industry, which
have been dealing with the issues related to
organization and collection of information since way
before the digital revolution. This experience
suggested using metainformation, i.e. data used to
describe and classify information, as a possible
solution. The tools used to enter and manage
contents on the Internet must allow for entering and
retrieving organized and relevant metainformation,
as metadata.
2.1 Metadata
Metadata have thus a fundamental role in organizing
and managing digital resources, especially when
there is a great quantity of available information that
must be indexed and catalogued to facilitate search
and retrieval, as shown by Hillman and Westbrooks
(2004), Strintzis, Bloehdom, Handschuh et al.
(2004), Chopey (2005), Dunsire (2008), Solodovnik
(2011).
The selection of which metadata to use in
describing a resource depends on a thorough
observation of the characteristics, properties,
common features, and differences in the
informational environment the source belongs to.
A metadata schema is a set of structured
metadata, developed for specific purposes in order to
establish a standard of metadata structure and
terminology, and to associate different types of
metadata. Every metadata schema includes a definite
number of elements, called metadata elements, each
with its own meaning and purpose, i.e. describing
the information resource, as shown by Heery and
Patel (2000), and by Lagoze and Van de Sompel
(2003).
However, since standardization is the purpose, it
is always advisable to use largely used metadata
schemas rather than creating new ones. Application
profiles are made of metadata sets derived from
different schemas, and are aimed to create tools for
particular applications while keeping interoperability
with the original base schema. This procedure and
the application of common rules can make different
systems interoperable, like those in libraries,
museums and archives, making them able to share a
part of common metadata.
2.2 The Dublin Core Standard
A support to content management is offered by the
Dublin Core metadata schema, which easily pairs up
with other metadata schemas in the OAI
architecture, improving granularity and refinement
of their structures (Hutt and Riley, 2005).
The rapid spreading of DC as metadata schema
was doubtlessly favoured by its remarkable
simplicity, thanks to which it could adapt to many
kinds of resources and usage environments. It is
important, for a semantic model used in resource
discovery not to be dependent on the format of the
resource it needs to describe.
In the latest years, DC was increasingly used in
many fields to describe, organize, manage, resources
in possession of institutions and international
organizations, and also to support and provide added
value services, assuring a base format for
aggregation and exchange of metadata collections,
such as in the Open Archive Initiative, or as
indispensable search tools in portals (Hillman 2005)
(Jackson, Han, Groetsch and Mustafoff, 2008). The
use of a standardized general classification system
allows for metadata in such collections to be
combined and for knowledge inside each collection
to be shared, as proven by Lunesu, Pani and Concas
(2011).
2.3 Linguistic Annotations and Corpus
The so-called corpus linguistics studies great
quantities of linguistic productions, either spoken or
written, by observing their characteristics: lexicon,
syntax, collocations, phonic chain, morphologic
structures, etc. Computational linguistics, in order to
aid this study, developed the first automated or semi-
automated text analysis information tools, avoiding
manual analysis and data research.
A corpus is any complete and orderly collection
of written texts, by one or more authors, on a certain
topic, or, linguistically speaking, the sample of a
language as examined in the description of the same
language.
In order to exploit the wealth of information
stored in a corpus as linguistic data, the corpus must
be enriched with additional information: linguistic
annotations, i.e. the adding of linguistic or
metalinguistic information to different portions of a
text, as shown by Llisterri (1996) and in the
EAGLES Project.
KnowledgeFormalizationandManagementinKMS
133
3 PROPOSED APPROACH
Our proposed approach for knowledge formalization
and management, gathered in an annotated
electronic corpus in an IR based on the OAI model,
will be described below.
3.1 Formalization of Metadata
Schemas
In order to manage and organize the information that
makes up the corpus, KMSs associate organized and
relevant information to a text when it is entered.
Metadata schemas mirror the complex nature of data
and are often strongly structured and hierarchical,
including many kinds of metadata, with many
different functions.
Building an effective system of structured
metadata means creating a conceptual model to
formalize and model the essential semantic
characteristics of a knowledge domain.
After designing the conceptual model of the
knowledge domain, a top-down approach can be
used for structuring the metadata schema.
If the knowledge domain is made of an electronic
corpus and its objects are its texts, essential metadata
(author, title, language, publishing date, etc.) must
be deducted and formalized from their semantic
characteristics. Some of those metadata may be
further specified according to a hierarchical
structure: for example, the metadata "author" maybe
further refined as "main author", "illustrator",
"curator", etc.
3.2 Formalization of Linguistic
Annotations
The need to interrogate the corpus once entered in
the KMS makes it necessary to formalize
annotations in a way that permits the extraction of
linguistic information without using other software
agents, whose syntax may be obscure and
complicated.
Since KMS are based on metadata for the
organization and collection of resources, the most
efficient way to use their information is formalizing
them through metadata schemas. In this way, not
only annotations can be associated to their texts, but
they can also be used as search parameters for
finding texts.
Linguistic annotations created with special
software, like PRAAT for audio files, are generally
stored in a semi-structured manner. In fact, each
annotation is distinctly represented inside the file,
according to a defined, repetitive structure where the
annotation texts is paired with the instant or the time
interval it refers to. Moreover, the belonging of each
annotation to a certain linguistic level is clearly
stated in the file.
The formalization of annotations in a metadata
schema can be achieved using a bottom-up or
inductive reasoning. Starting with the analysis of the
structure of each annotations in the file and applying
inductive logic, a "category" is abstracted from
every linguistic level. This formalization allows for
easily coding and representing of annotations though
markup languages like XML, because their structure
can be described with tags or markers, for metadata
and their qualifiers, inside which a linguistic label is
found. All annotations in the same linguistic level,
e.g. phonetics, can be formalized in the XML as
different occurrences of the same metadata called
"phoneme", whose value can be made up of two
terms: linguistic label and eventually time interval.
3.3 Choosing a Metadata Schema
The use of both a deductive and an inductive
approach allows metainformation and linguistic
annotations to be formalized in a single structured
metadata schema.
Entering metadata in a knowledge management
system requires the selection of an operational
criterion based on the particular needs the system
has to work with.
Most archives use Qualified Dublin Core as main
schema for indexing and displaying metadata and
Simple Dublin Core to show them through the OAI-
PMH standard.
There are four main criteria for choosing a
metadata schema, with different approaches in
metadata organization: 1) mapping of native
metadata on existing DC elements; 2) mapping of
native metadata on DC elements and creation of new
customized qualifiers for DC elements; 3) creation
of a customized metadata schema, identical to the
native metadata set; 4) creation of DC metadata
records as abstraction of native metadata records and
entering of the latter as attachments to the resource.
Out of the criteria mentioned above, the first one
is the least satisfactory for preservation and reuse of
descriptive metadata of resources, while the third
one is the most preserving of the integrity and
granularity of original metadata but needs great
efforts for the creation of a customized metadata
schema, together with high maintenance costs for
the archive. The second and fourth criteria combine
preservation and granularity needs with archive
KMIS2012-InternationalConferenceonKnowledgeManagementandInformationSharing
134
management costs better than the other two.
Choosing between them depends solely upon the
particular requirements of the archive.
3.4 Update of Knowledge Management
System
Once the decision on which criterion to use is
settled, the archive must be configured so that it is
compatible with the approach of choice for metadata
management. In particular, if the second criterion is
adopted, the DC schema must be updated with new,
customized qualifiers; if the third criterion is chosen,
the entire metadata schema created ad hoc must be
entered into the system. In this way, customized
metadata and qualifiers can be used to describe texts
of the corpus inside the archive.
Generally, metadata schemas can be configured
through the user interface of the archive. However,
schemas rich in elements and qualifiers are better
configured with the import tools provided by
management systems, after having encoded them
with the XML markup language. XML is used by
archives to manage the import-export of metadata.
Compilation of metadata records associated to
texts in the corpus may be usually done with either a
user interface or with batch import tools. Instead,
when big quantities of metadata need to be
associated to one resource, like with linguistic
annotations, there are specific batch import tools that
require the specification of all metadata as attribute-
value pairs, coded in an XML file.
4 CASE STUDY
The "Analytic Sound Archive of Sardinia" project
(http://asas.flosslab.it) aims to create an IR with an
annotated spoken language electronic corpus that
could become a platform for the preservation, study,
communication and appreciation of oral traditions of
the Sardinian language, especially improvised
poetry.
The approach described in the previous section
was applied to knowledge formalization and
management, gathered in an annotated electronic
corpus, in a IR based on the OAI model.
4.1 Annotations through PRAAT
The electronic corpus was annotated by linguists and
musicologists through the PRAAT software, which,
besides performing spoken language analysis, allows
for multilevel segmentation and linguistic
annotations of audio files. The software has a
graphic interface with waveforms and voice
spectrum that make annotators' work easier and
make visible those acoustic phenomena that can be
found by an accurate spectrum analysis, followed by
annotation levels.
Linguists and musicologists working on the
Sardinian Linguistic Sound Archive chose a list of
possible annotation levels (syllable, tone,
morpheme, syntagm, accents, etc.), useful for both
linguistic and musical analysis of audio recordings.
4.2 Metainformation Associated to
Audio Recordings
Musicologists and Linguists, other than with
annotations, wanted to complete every audio
recording by describing it with a number of
information, chosen among the most relevant
features of the recordings. The information could be
used to manage recordings in the archive, because
by describing them they allow for selection and
organization, facilitating efficient retrieval and
usage.
Metainformation range from something closely
related to cataloguing, like author, title, object,
recording date, etc., up to more technical
information like the different singing types, speech
types, accompaniment or instruments.
Linguists and musicologists selected 38
metainformation associated to audio recordings:
title, author, object, description, performer,
language, format, etc.
4.3 Formalization of Semantic
Characteristics: Top-down
Approach
After designing the conceptual model of the
knowledge domain, a top-down or deductive
approach can be used for formalizing the semantic
characteristics of texts.
Through a continuous dialogue with the scholars,
audio recordings were analysed for their essential
and basic properties, needed to organize and retrieve
texts in the corpus.
Twelve general metadata were found: title,
author, publisher, object, contributor, date, place,
occasion, document accessibility, language,
description and format. Those metadata outlined the
necessary information to describe spoken texts in the
corpus, conveying in particular singing or speech
type, the occasion in which the audio was recorded,
and the linguistic variety it belongs to.
KnowledgeFormalizationandManagementinKMS
135
The top-down approach proceeds to further
specialize the metadata.
More specific, or qualified, metadata are
represented by adding a qualifier to the name of the
more general metadata and using the common
syntax metadata.qualifier.
Lastly, "relational" metadata are defined as well,
in order to define a certain relation among two or
more different objects belonging to the corpus. An
inclusion relation must be specified in order to
describe the belonging of one or more objects to the
same recording set, for example different songs in a
singing contest.
All descriptive metainformation were analysed
and formalized in a "basic" structured metadata
schema.
4.4 Formalization of Linguistic
Annotations: Bottom-up Approach
The formalization of annotations in a metadata
schema can be achieved using a bottom-up or
inductive reasoning, as explained in the previous
section.
The structure of annotations is analysed with the
PRAAT software. Annotations are organized with a
precise structure: each annotation is made of a time
interval and a text label or by an instant and a
marker with its text.
All annotations in the same linguistic category
are collected in the same tier (or annotation level),
which can be considered as the category they belong
to, giving its name to the corresponding metadata. In
this way, a repeatable metadata is found in each
annotation level of the TextGrid (the text file where
PRAAT stores all Tier with their own segmentations
and annotations) and each annotation can be
represented as multiple occurrences of that metadata.
All annotations are thus formalized in a structured
metadata schema.
4.5 Choosing a Metadata Schema for
KMS Entering
Depending on the interoperability needs that must be
met, importing the metadata schema that was just
created into the knowledge management system may
not be appropriate or convenient. It could be
necessary instead to map it, partially or totally, on
another schema.
Most archives use Qualified Dublin Core as main
schema for indexing and displaying metadata and
Simple Dublin Core to show them through the OAI-
PMH standard. Therefore, the adoption of Dublin
Core must be thoroughly evaluated when an archive
is needed to be compliant with the interoperability
principles required by OAI.
Our of the four criteria listed in section 3.3, the
most suitable technique for the case study is an
hybrid model between the second (mapping of
native metadata on DC elements and creation of new
customized qualifiers for DC elements) and the third
one (creation of a customized metadata schema,
identical to the native metadata set) The third
criterion is more convenient for linguistic
annotations, so that a dedicated metadata schema
can be created to preserve their granularity; while
the second criterion is best suited for all other
metadata, because it combines the advantages of
granularity as provided by qualifiers to
interoperability provided by DC metadata.
4.6 Application Profile for the
Analytical Sound Archive of
Sardinia
In creating a specific application profile for the
Analytical Sound Archive of Sardinia, a
"conservative" approach was used towards the
original Qualified DC elements and qualifiers in
order to use as many of them as possible for the
formalization of descriptive and relational metadata.
A special schema, identified by the prefix "asas",
was created instead for annotations. Its metadata
were entered into the DC application profile as
outlined below (customized qualifiers are in italics).
Table 1: Application profile for the Sound Archive of
Sardinia.
Metainformation
or ASAS
Annotation
DC Application Profile Metadata
Title dc.title
Author dc.creator
Publisher dc.publisher
Object dc.type
Description dc.type.category
Contributor dc.contributor
Annotator dc.contributor.annotatore
Location dc.coverage.spatial
Date dc.date.created
Occasion dc.subject
Source dc.relation.isbasedon
Document
Accessibility
dc.rights
Performer dc.contributor.sperakerPerformer
Performer's Age dc.description.speakerPerformer
KMIS2012-InternationalConferenceonKnowledgeManagementandInformationSharing
136
Table 1: Application profile for the Sound Archive of
Sardinia(cont.).
Performer's Place of
Origin
dc.description.speakerPerformer
Language dc.language
Source Completeness dc.description.integrità
Source No. dc.relation.ispartofseries
Source Section No. dc.relation.ispartofseries
Document Type dc.format.audioVideo
Format dc.format.medium
Acquisition Method dc.format.modoAcquisizione
Reading Type dc.type.lettura
Interview Type dc.type.intervista
Monody Type dc.type.monodia
Unison / Heterophony dc.type.unisonoEterofonia
Accompaniment Type dc.type.monodiaAccompagnamento
Polyphony Type dc.type.polifonia
Instrumental dc.type.strumentale
Instrument dc.type.strumento
Singing Type dc.type.tipoCanto
Other dc.description
Syllable asas.annotazione.sillaba
Tone asas.annotazione.toni
Morpheme asas.annotazione.morfema
Phone asas.annotazione.fono
Word asas.annotazione.parola
Part of Speech asas.annotazione.pos
Syntagm asas.annotazione.sintagma
Sentence asas.annotazione.frase
Information Structure asas.annotazione.strutturaInformativa
TurnPerf asas.annotazione.turnPerf
Musical Syllable asas.annotazione.sillabaMusicale
Metric Segment asas.annotazione.segmentoMetrico
Musical Segment asas.annotazione.segmentoMusicale
Tonal Centre asas.annotazione.centroTonale
Notation asas.annotazione.notazione
Ornamentation asas.annotazione.ornamentazione
Accents asas.annotazione.accenti
Melismatic Syllable asas.annotazione.sillabaMelismatica
ADD1 asas.annotazione.annotazioneLibera
The last step is to enter metadata in the
knowledge management system: once
metainformation have been organized and
structured, the knowledge management system is
configured so that it can be adapted to the selected
metadata schema.
5 CONCLUSIONS
The purpose of this work was to offer a new
approach to formalization and management of
knowledge represented by a set of audio recordings
belonging to a corpus plus the linguistic information
added to the same corpus with annotations. The
approach was applied to formalize knowledge in the
Analytical Sound Archive of Sardinia, a joint project
by linguists and musicologists at University of
Cagliari. The project aimed to present a study on
improvised poetry in Sardinian language, using an
electronic corpus they created and annotated.
In order to make the resources openly accessible
through the Internet, as per our aim, we entered the
annotated corpus in a knowledge management
system, compatible with OAI standards and
protocols for metadata sharing and knowledge
circulation.
The formalization of a structured metadata
schema was reached through the creation of an
application profile for the Qualified Dublin Core
metadata schema, where customized qualifiers were
added to the standard elements and qualifiers.
Metadata in non-standard schemas could then be
better represented.
Linguistic annotations were formalized as well
through a metadata schema. Corpus interrogation
was thus made easier and quicker, since it used the
knowledge management system's search tool.
This work leaves space for future research on
ways to improve the service. A dedicated website or
the integration of this system in an institutional
portal through an exploration interface would be
particularly interesting. Another feature that could
be implemented may be a virtual map where
recordings can be explored by geographic location.
REFERENCES
Chopey, M. A. (2005). Planning and Implementing a
Metadata-Driven Digital Repository. Haworth Press
Inc.
DSpace, http://www.dspace.org
Dunsire, G. (2008). Collecting metadata from institutional
repositories. OCLC Systems & Services, Vol. 24, No.
1, pp. 51-58.
EPrints, http://www.eprints.org
Heery, R. and Patel, M. (2000). Application profiles:
mixing and matching metadata schemas. Ariadne.
http://www.ariadne.ac.uk/issue25/app-profiles/
Hillmann, D. I. (2005). Using Dublin Core. Dublin Core
Metadata Initiative Recommendation. Retrieved from:
http://dublincore.org/documents/usageguide
KnowledgeFormalizationandManagementinKMS
137
Hillman, D. I. and Westbrooks, E. L. (2004). Metadata in
practice. American Library Association.
Hutt, A. and Riley, J. (2005). Semantics and Syntax of
Dublin Core Usage in Open Archives Initiative Data.
Joint Conference on Digital LibrariesACM Press.
Jackson, A. S., Han, M. J., Groetsch, K. and Mustafoff, M.
(2008). Dublin Core Metadata Harvested Through
OAI-PMH. In Proceedings of the 5th ACM/IEEE-CS
joint conference on Digital libraries.
Lagoze, C. and Van de Sompel, H. (2003). The making of
the Open Archives Initiative protocol for metadata
harvesting. Library Hi Tech.
Llisterri, J. (1996). Text Corpora Working Group Reading
Guide. EAGLES (Expert Advisory Group on language
Engineering Standards) Document EAG-TCWG-FR-
2. CNR, Istituto di Linguistica computazionale.
Lunesu, M. I., Pani, F. E. and Concas, G. (2011). An
approach to manage semantic informations from
UGC. International Conference on Knowledge
Engineering and Ontology Development (KEOD).
Lunesu, M. I., Pani, F. E. and Concas, G. (2011). Using a
standards-based approach for a multimedia
knowledge-base. International Conference on
Knowledge Management and Information Sharing
(KMIS).
Lynch, C. (2003). Institutional repositories: essential
infrastructure for scholarship in the digital age.
Association of Research Libraries: a bimonthly report,
no. 226.
PRAAT, http://www.fon.hum.uva.nl/praat/
Solodovnik, I. (2011). Metadata issues in Digital
Libraries: key concepts and perspectives. Italian
Journal of Library and Information Science, Vol. 2,
No. 2.
Strintzis, J., Bloehdom, S., Handschuh, S., Staab, S.,
Simou, N., Tzouvatras, V., Petridis, K., Kompatsiaris,
I. and Avrithis, Y. (2004). Knowledge representation
for semantic multimedia content analysis and
reasoning. In Proceedings of the European Workshop
on the Integration of Knowledge, Semantics and
Digital Media technology .
Swan, A. and Carr, L. (2008). Institutions, their
repositories and the Web. Serials Review, 34/1 (2008),
p. 31, http://eprints.ecs.soton.ac.uk/14965.
Tansley R., Bass, M., Stuve, D., Branschofsky, M.,
Chudnov, D., McClellan, G. and Smith, M. (2003).
The DSpace Institutional Digital Repository System:
Current Functionality. JCDL '03 Proceedings of the
3rd ACM/IEEE-CS joint conference on Digital
libraries.
The OAI Executive (2008). The Open Archives Initiative
Protocol for Metadata Harvesting. Document Version
2008-12-07T20:42:00Z. Retrieved from: http://www.
openarchives.org/OAI/openarchivesprotocol.html
KMIS2012-InternationalConferenceonKnowledgeManagementandInformationSharing
138