Managing Metadata Variability within a Hierarchy of
Annotation Schemas
Ionuţ Cristian Pistol
1
and Dan Cristea
1,2
1
Faculty of Computer Science, University “Al. I. Cuza” of Iaşi, Romania
2
Institute for Computer Science, Romanian Academy, Iaşi, Romania
Abstract. The paper describes the theoretical basis of the ALPE
1
model, a hie-
rarchy of annotation formats used to guide the automatic computation of
processing flows capable of performing complex linguistic processing tasks.
The hierarchy is comprised of a core, which is a direct acyclic graph whose
nodes represent XML annotation formats, and a halo which contains additional
annotation formats. The core hierarchy also serves as a standardization hub for
annotated documents. The focus of the paper is the description of the new addi-
tions to the model, allowing the integration and usage of non-XML formats in
processing flows and new equivalence relations between XML formats.
1 Introduction
In the latter years, the field of Natural Language Processing witnessed the emergence
of a significant effort concerning the standardization and usability aspects of devel-
oped processing tools and resources. Projects such as CLARIN
2
and FLaReNet
3
,
among others, intend to offer both developers and users of language resources and
tools a management solution for the growing set of resources available. The primary
objectives of these projects are to provide reusability in new contexts for existing
resources and to guarantee maximum visibility and reusability for newly developed
resources. An easy widening of the original setting of usage means a multiplication of
the visibility of a tool and, finally, of the productivity of the research activity. In terms
of managing linguistic processing tools, previous efforts lead to the development of
linguistic processing meta-systems, the most significant ones being GATE
4
[3] and
UIMA
5
[4].
ALPE [1, 2] is another such system, intended to define a framework which facili-
tates the integration of processing tools of different origins. ALPE offers several ad-
vantages over existent systems with a similar goal, as it is able to identify the anno-
tated format of the input file, then to automatically compute the processing steps re-
1
Automated Linguistic Processing Environment
2
http://www.clarin.eu/
3
http://www.flarenet.eu/
4
http://www.gate.ac.uk/
5
www.research.ibm.com/UIMA/
Pistol I. and Cristea D. (2009).
Managing Metadata Variability within a Hierarchy of Annotation Schemas .
In Proceedings of the 6th International Workshop on Natural Language Processing and Cognitive Science , pages 111-116
DOI: 10.5220/0002171501110116
Copyright
c
SciTePress
quired to bring an input file to the required output format, and, eventually to run this
chain if costs/IPR conditions are fulfilled.
Section two of this paper briefly describes the base ALPE hierarchy and section
three describes the enhancement of the base hierarchy with processing power and
“clouds” of equivalent formats. The conclusions, as well as the further planned devel-
opments are described in section four.
2 The Base Hierarchy
2.1 The Core Hierarchy
The basis of the core hierarchy is a directed acyclic graph which configures the meta-
data of linguistic annotation in a hierarchy of XML schemas. Nodes in this graph are
called core nodes. Each core node corresponds to a single XML annotation format.
We note as T(A) , where A is a core node, the set of elements (tags) defined in the
XML annotation format corresponding to the core node A. We note as t
a
(A), where A
is a core node and t
a
T(A) is the set of attributes belonging to the element t as it
appears in the core node A. Edges connecting core nodes are called core edges. If
there is a core edge linking a core node A with a core node B (we will say also that A
is formally subsuming B, noted as AsB) then the following conditions holds simulta-
neously:
any element (tag-name) of A is also in B: T(A)T(B);
any attribute in the list of attributes of a tag-name in A is also in the list of
attributes of the same tag-name of B: ta(A) ta(B) for all tT(A).
The direction of the core edge connecting nodes A and B is given by the subsuming
relation, with the subsuming node being the origin of the core edge and the subsumed
node the destination.
2.2 The Haloed Hierarchy
In addition to the core nodes and edges, which strictly observe the specified restric-
tions, we can include in an extension to the core hierarchy, called a halo, other types
of nodes and edges such as:
Nodes representing other annotation format than XML. We can consider each node
in the core hierarchy representing not just a specific format, but rather a class of
annotation formats, whose representative is an XML format. These formats can be
represented in the hierarchy as halo nodes;
Edges originating or ending in halo nodes. These edges can either originate or end
in the core hierarchy, or they can be completely outside the core hierarchy. The
semantic value of these edges is to mark the semantic subsumption between the
source and destination nodes, relation considered at an abstract level as opposed to
the formalized subsumption relation. Semantic subsumption means that the infor-
mation encoded in the origin node’s format is part of the information encoded in
the destination node’s format, but this inclusion cannot be strictly formalized using
XML elements and attributes.
112
Fig. 1. A full ALPE hierarchy (core and halo).
The core hierarchy The halo nodes and
edges expand the core hierarchy outside the limits of XML, allowing other types of
an
the full hierarchy.
st
ent. The definition of this node
le
nta-
om:
g the orientation of the edges).
be included as a node in the hierarchy it
has to have at least one other node already in the hierarchy which it subsumes (or is
A
B
C
D
F
G
E
H
and the halo form a full ALPE hierarchy.
notation formats to be represented in the full hierarchy. In Figure 1 is shown an
example of a full hierarchy. With continuous lines are marked core nodes and edges
(and the core hierarchy), and with interrupted lines halo nodes and edges. Nodes A,
B, C and D are core nodes and nodes E, F, G and H are halo nodes.
As for the core edges, the following hold true:
Ax o halo nodes. iom: There exists at most one edge between tw
Axiom: There exists at most one edge between any two nodes in
These axioms say that between any two nodes in the full hierarchy there is at mo
one subsumption relation, either formal or semantic.
In all core hierarchies we introduce an obligatory root core node. This node corres-
ponds to the basic XML format, with only a root elem
ads to the following theorem:
Theorem: The root core node of a hierarchy subsumes all nodes in the hierarchy.
Theorem: The core hierarchy is a connected graph (disregarding the edges orie
tion).
In order to guarantee the connectivity of the full hierarchy, we introduce the follow-
ing axi
Axiom: From each halo node there is at least one core node which can be reached
(disregardin
This axiom basically says that no halo nodes are disconnected from the core hie-
rarchy: in order for an annotation format to
113
su
o node all core codes can be reached. Since all halo nodes can be reached
fr
rocessing Power
If there is an edge (either core or halo) between nodes A and B, there should be a
trictions imposed by the node A
and produces as output a file observing the restrictions imposed by the node B. While
ing their input and
ou
ersion Edges and Synclouds
An edge connected a halo node with another node of the full hierarchy is called a
e least one attached processing module.
That module is basically a wrapper capable of converting a format into another. Part
bsumed by) - either formal subsumption (introduced in 2.1) or semantic subsump-
tion. The previous theorem and this axiom lead to the next theorem:
Theorem: The full hierarchy is a connected graph (disregarding orientation of the
edges).
The proof is direct: the core nodes are connected (as proven by the previous theo-
rem) and each halo node is connected to at least one core node. This means that from
each hal
om at least one core node, as a corollary to the previous axiom, this means that from
each core node all other nodes can be reached. Also, the axiom and the following
conclusion lead to the fact that from each halo node all other nodes can be reached.
Thus, the full hierarchy is a connected graph.
3 The Hierarchy Augmented with P
3.1 Adding Processing Power to the Hierarchy
process which takes as input a file observing the res
doing this type of processing the module might make use also of some additional
resources outside the hierarchy, such as language models and lexicons. A graph of
annotation schemas on which processing modules have been marked on edges is
called augmented with processing power (or simply, augmented).
An edge to which there is at least one processing module attached will be called a
processing edge. A single edge in the graph can have multiple processing modules
attached to it, if those modules observe the same restrictions regard
tput formats. If there is no known processing module attached to an edge, that edge
is called a carrier edge. For a more detailed and commented description, please con-
sult [1, 2].
3.2 Conv
conv rsion edge. All conversion edges have at
of the encoded information in the source node is rewritten in another format in the
destination node. No new information is added. If one of the connecting nodes is in
the halo, the module attached to the conversion edge rewrites either XML into a non-
XML format or the other way around, depending on the direction of the edge.
All nodes that can be reached from the same node using only conversion edges (dis-
regarding the direction of the edges) form a syncloud (synonymy clouds). All nodes
in the same syncloud contain the same information, but encoded differently.
114
3.
el is the
accommodation of various available processing modules using non-XML documents
as either input or output. In previous papers [1,2] we introduce processing flows as
s on an ALPE augmented hierarchy. These flows are
computed with regards to specified input and output nodes and generate actual
(a) (b) c)
Fig. 2. Flows in synclouds.
A simple computed flow, on the left of figure 2, shows only the general picture of
an ALPE erarchy, where a e identifies a ct annotation for . Figure
depicts th case of the proces modules cons ed on the left bein ual mod-
ules between the core nodes of ows that a flow can include
processing edges between XML and non-XML formats in a syncloud. These modules
ca
3 Flows and Synclouds
The main benefit of the introduction of synclouds in the ALPE hierarchy mod
combinations of basic operation
processing sequences (workflows) capable of transforming an input document corres-
ponding to the input node to the output format by applying individual processing
modules attached to edges in the hierarchy.
hi
e
nod
sing
distin
ider
mat
g act
2a
the synclouds. Figure 2b sh
n be pipelined with XML processing modules using wrappers attached to edges
between the core nodes and other nodes in the syncloud. Also, as in 2c, flows can
actually exist outside core nodes and include only processing modules using non-
XML intermediate formats, allowing the straightforward integration in ALPE of flows
produced by other systems, such as GATE or UIMA, with the only change being the
addition of the input/output wrappers.
115
4 Conclusions and Further Work
Adopting ALPE as a management and access environment for the resources employed
an veloped in a computational linguist project proposing the development of mul-
tilingual resources and tools, such as CLARIN [6], has the potential of benefiting
important further development of ALPE
, configure and use ALPE hierarchies on
the web, either as a limited password-protected resource or a global linguistic re-
s appear. The first step would be the clear defi-
nit
Bernadette Sharp (Ed.): Proceedings of the 3rd International
Workshop on Natural Language Understanding and Cognitive Science - NLUCS 2006, in
conjunction with ICEIS 2006, Cyprus, Paphos, May 2006. INSTICC Press, Portugal.
: 972-8865-50-3. (2006).
. : Managing Language Resources and Tools Using a Hierarchy of
f LREC-2008,
Marakesh (2008).
d de
both the project and the interested user. One
will be a web-service allowing users to build
sources collection. Since UIMA is the prominent system comparable to ALPE and
since both GATE and UIMA are now open-source, we also study the possibility of
integrating ALPE in either system.
Standards usually appear late. In order for an annotation convention to become a
standard it should be adopted by a community of people. Therefore, there is a strong
need for accepting new formats, which should work together with well accepted ones.
We need a mechanism able to “understand” the notations, to detect the semantics
beyond the notations, to infer the meaning of notations and to establish semantic links
between new formats before standard
ion of the semantics of a standard. A promising new model of describing annota-
tion semantics, the Linguistic Annotation Format [5], has the potential of clearly de-
fining semantic links between various annotation formats. We are currently in the
process of integrating a version of this model as a way to formally describe nodes in
the ALPE core hierarchy and as a possible base for an automatic detection of seman-
tic links between formats.
References
1. Cristea D., Forăscu C., Pistol I. : Requirements-Driven Automatic Configuration of Natural
Language Applications. In
92006) ISBN
2. Cristea, D., Pistol, I
Annotation Schemas. Proceedings of the Workshop on Sustainability of Language Re-
sources, LREC-2008, Marakesh. (2008).
3. Cunningham H., Maynard D., Bontcheva K., Tablan V. : GATE: A framework and graphi-
cal development environment for robust NLP tools and applications. In Proceedings of the
40th Anniversary Meeting of the ACL (ACL’02). Philadelphia, US. (2002).
Ferrucci D. and Lally A. : UIMA: an4. architectural approach to unstructured information
processing in the corporate research environment, Natural Language Engineering 10, No. 3-
4, 327-348. (2004).
Romary L., Ide N. : International Standard 5. for a Linguistic Annotation Framework, Natural
Language Engineering 10, 3-4 (09/2004) 211-225 (2007).
6. Váradi T., Krauwer S., Wittenburg P., Wynne M. and Koskenniemi K. : CLARIN:
Common Language Resources and Technology Infrastructure, Proceedings o
6
http://www.mpi.nl/clarin/
116