4 FULFILLING THE REQUIREMENTS
The four mechanisms work closely together to meet
all of the text-mining annotation requirements.
Expressiveness: The requirement of expressive
power has two aspects: the need to express complex
hierarchical structures and the need to express
arbitrarily distributed surface strings. For the former,
the object-oriented mechanism enables modeling of
complex hierarchical structures, similar to the use of
object-oriented programming languages in modeling
software application environments. For the latter, the
monad-based mapping mechanism enables listing
the tokens anywhere in a document.
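As a rough illustration of these two aspects, the following Python sketch models a small annotation hierarchy together with a gapped annotation. The class names, the sample sentence, and the whitespace tokenization are assumptions made for this example; they are not SOOML syntax.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical base class: an annotation lists the monads it covers."""
    monads: list[int] = field(default_factory=list)

    def surface(self, tokens: list[str]) -> str:
        # Recover the surface string by looking up each monad in the token list.
        return " ".join(tokens[m] for m in self.monads)


@dataclass
class Entity(Annotation):
    """Hypothetical subclass for named entities."""


@dataclass
class Protein(Entity):
    """Hypothetical subclass; inherits monads and surface() from Annotation."""


@dataclass
class Interaction(Annotation):
    """Hypothetical composite annotation that refers to other annotations."""
    agent: Entity | None = None
    target: Entity | None = None


# Monads are 0-based token positions in this sketch.
tokens = "IL-2 and IL-4 receptors bind JAK3".split()

il2_receptors = Protein(monads=[0, 3])  # gapped: "IL-2 ... receptors"
il4_receptors = Protein(monads=[2, 3])  # shares monad 3 with the annotation above
jak3 = Protein(monads=[5])
binding = Interaction(monads=[4], agent=il2_receptors, target=jak3)

print(il2_receptors.surface(tokens))
print(binding.agent.surface(tokens), "->", binding.target.surface(tokens))
```

The hierarchy carries the structural part (an interaction refers to two entities), while the monad lists carry the surface part, including strings that are not contiguous in the text.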
Resolution of annotation mapping: Monad-
based mapping also enables annotation mapping at
various resolutions. Monads are the atomic units in
SOOML. Their sizes, and hence the mapping
resolutions, are determined by the set of delimiting
characters. For example, “insulin-induced” can be
treated as a single monad (an adjective) in linguistic
part-of-speech tagging, so it need not be split at the
character “-”. This resolution is too coarse for
protein name recognition, however, where “insulin”
must be addressable on its own; there, “-” is added
to the delimiter set and the string is split into two
monads. The finest resolution is single-character
mapping (using a “null” delimiting character).
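The effect of the delimiter set can be sketched in a few lines of Python. The function name monadize, the use of None to stand for the “null” delimiter, and the choice to discard delimiter characters are assumptions of this sketch, not part of the SOOML definition.

```python
def monadize(text, delimiters=" "):
    """Split text into monads.

    delimiters is a string of delimiting characters. None stands in for
    the "null" delimiter and yields one monad per character.
    """
    if delimiters is None:
        return list(text)

    monads, current = [], []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the monad being built (if any).
            if current:
                monads.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        monads.append("".join(current))
    return monads


phrase = "insulin-induced expression"
print(monadize(phrase))         # whitespace only: coarse resolution
print(monadize(phrase, " -"))   # "-" added as a delimiter: finer resolution
print(monadize("insulin", None))  # null delimiter: one monad per character
```

Changing the resolution only changes the argument passed for the delimiter set; the annotations themselves keep referring to monad positions.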
Reusability: The inclusion mechanism enables
annotation reuse, facilitating modular design of
complex text-mining systems in accordance with
software engineering principles. It is not necessary
to design a single powerful super-module to extract
the information all at once. Specialized modules can
target particular aspects of a complicated task and
create annotations on top of each other. The standoff
mechanism leaves the original documents un-
changed, thereby avoiding interference among dif-
ferent modules and/or applications.
Independence of availability: The standoff
mechanism separates annotations from the original
documents, and the monad-based mapping mecha-
nism avoids copying any contents from the original
text. Therefore, the availability of the annotations is
independent of the originals.
8 http://www.w3.org/Addressing
Flexibility: SOOML’s mechanisms enable
annotating files of various formats in a consistent
way (“one shoe fits all”). First, the standoff
mechanism separates annotations from the original
documents; therefore, the formats and the
organization of the annotations are not restricted by
those of the original documents. In contrast, in-line
markup methods have to follow the formats of the
original texts. Second, although SOOML’s mapping
mechanisms (especially node-level mapping using
XPath) are designed for XML-based original
documents, they can be extended easily to any
document with well-defined fields or sections,
because such fields or sections are conceptually
equivalent to the nodes in XML documents. The
worst case is a file without any apparent internal
structure. SOOML can still treat such a file as a
single large text node, and monadize it from the
first token to the last.
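The two cases, node-level mapping into an XML document and the fallback to a single text node, might look as follows in Python. This is only a sketch: the function text_nodes, the sample documents, and the whitespace tokenization are assumptions, and the standard-library ElementTree module supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def text_nodes(document, path=None):
    """Return the text units that would be monadized.

    With a path, the document is parsed as XML and the text of every
    matching node is returned. Without a path, the whole document is
    treated as one large text node.
    """
    if path is None:
        return [document]
    root = ET.fromstring(document)
    return ["".join(node.itertext()) for node in root.findall(path)]


xml_doc = (
    "<article><title>Insulin signaling</title>"
    "<abstract>Insulin binds its receptor.</abstract></article>"
)
plain_doc = "Insulin binds its receptor."

# Both inputs end up as plain text units that are monadized the same way.
for unit in text_nodes(xml_doc, "./abstract") + text_nodes(plain_doc):
    print(unit.split())
```

Once a text unit has been selected, the monad-level mapping is the same regardless of the format the unit came from.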
Efficiency: The monad-based mapping mecha-
nism is space efficient. Instead of copying the con-
tent from the original documents, it uses monads
(equivalent to pointers) pointing to the sources. For
example, it takes only five integers to mark up
entity[1] in Fig. 1, while XPointer needs over 200
characters. This makes SOOML an ideal format for
annotation storage and exchange, as well as for
serving as an intermediate data-flow format among
the modules/applications.
The monad-based mapping mechanism also
greatly reduces the complexity of annotation proc-
essing. First, gapped and overlapped annotations are
handled in exactly the same way as continuous and
non-overlapped ones. Second, monad-based map-
ping does not create any ambiguities, which are in-
evitable in string matching-based processing.
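Both points can be illustrated with a short Python sketch in which an annotation is simply a list of monad positions. The helper names, the sample sentence, and the 0-based numbering are assumptions of this example.

```python
def overlaps(a, b):
    """Two annotations overlap if they share at least one monad."""
    return bool(set(a) & set(b))


def surface(monads, tokens):
    """Recover the surface string of an annotation from its monads."""
    return " ".join(tokens[m] for m in sorted(monads))


tokens = "the cell line and the cell cycle".split()

first_cell = [1]     # first occurrence of "cell"
second_cell = [5]    # second occurrence of "cell"
cell_cycle = [5, 6]  # contiguous annotation
gapped = [0, 2]      # gapped annotation: "the ... line"

# The same functions handle contiguous, gapped, and overlapping annotations.
print(surface(cell_cycle, tokens), "|", surface(gapped, tokens))
print(overlaps(second_cell, cell_cycle), overlaps(first_cell, cell_cycle))

# String matching alone cannot tell the two occurrences of "cell" apart.
print(tokens.count("cell"))
```

Because each annotation names its monads explicitly, the two occurrences of "cell" remain distinct even though their surface strings are identical.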
Extensibility: Because it is object oriented,
SOOML can be integrated readily with other ontolo-
gies. Ontologies typically already have well-defined
hierarchical structures. All we need to do is define
the main ontology entries as subclasses of the
annotation class (or one of its subclasses). The rest of the
ontology is automatically included in the hierarchy.
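A minimal Python sketch of this idea is given below. The ontology fragment, its child-to-parent representation, and the helper build_classes are all hypothetical; the point is only that attaching the top-level entries to an annotation class brings every descendant into the hierarchy.

```python
class Annotation:
    """Hypothetical stand-in for the annotation class."""

    def __init__(self, monads):
        self.monads = list(monads)


# Hypothetical ontology fragment as a child -> parent map.
# None marks a main (top-level) entry.
ontology = {
    "molecular_entity": None,
    "protein": "molecular_entity",
    "kinase": "protein",
}


def build_classes(ontology, root_base):
    """Create one class per ontology entry, rooted at root_base."""
    classes = {}

    def get(name):
        if name not in classes:
            parent = ontology[name]
            # Only main entries are attached to root_base directly.
            base = root_base if parent is None else get(parent)
            classes[name] = type(name, (base,), {})
        return classes[name]

    for name in ontology:
        get(name)
    return classes


classes = build_classes(ontology, Annotation)
kinase = classes["kinase"]([3])
print(isinstance(kinase, Annotation), isinstance(kinase, classes["protein"]))
```

Only "molecular_entity" is linked to Annotation explicitly; "protein" and "kinase" become annotation classes through inheritance.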
In conclusion, we have presented the design of a
standoff object-oriented markup language (SOOML),
which provides an expressive, efficient, flexible, and
extensible framework for text annotation in
bioinformatics as well as in other similar applications.
DESIGN OF A STANDOFF OBJECT-ORIENTED MARKUP LANGUAGE (SOOML) FOR ANNOTATING
BIOMEDICAL LITERATURE