a minimum Threshold as shown in Figure 2. With this
procedure we remove concepts found by the seman-
tic annotator that may introduce noise in the resulting
ontology.
3.2 Ontology Extractor
The Ontology Extractor task consists of retrieving
from the knowledge resource an application ontology
(i.e., a set of concepts along with their taxonomic
relationships) that fulfils the user's needs. The application
ontology has to provide the necessary semantics
demanded by the user query while keeping a reduced
size in order to achieve scalability. In order to ac-
complish both requirements we have developed and
tested an indexing mechanism over the “is-a” rela-
tionships of the knowledge resource that enables us to
efficiently retrieve a small subset that satisfies the user
query. The next section describes the indexing process of
the knowledge resource; we then present an ontology
retrieval strategy based on the resulting indexes.
3.2.1 Knowledge Resource Indexing Process
In most knowledge resources, as is the case with UMLS,
concepts are organized into “is-a” hierarchies, which
constitute the backbone of the repository. This leads
to an underlying graph-like structure. In order to effi-
ciently retrieve a sub-ontology from this graph struc-
ture guided by the signature, we need some kind of
indexing scheme over the graph that encodes descen-
dant and ancestor relationships in a compressed and
efficient way. We have adopted a labeling scheme,
which assigns to each node in the graph some iden-
tifier that allows the computation of relationships
between nodes using simple arithmetic operations.
For our purposes, we have adopted and extended
Agrawal’s interval scheme (Agrawal et al., 1989), but
with a labeling variation from Schubert et al. (1983)
that uses the preorder numbers of nodes instead of the
postorder numbers used in Agrawal’s technique. The
approach can be applied to directed trees and Directed
Acyclic Graphs (DAGs), which are the underlying
structure of most ontologies (Christophides et al.,
2003). In our application scenario, we
have preprocessed UMLS to remove cycles in
the “is-a” hierarchy and obtain a DAG.
Figure 3 (left graph) shows a labeled DAG. The
process is as follows: in an initial step, disjoint com-
ponents can be hooked together by creating a virtual
root node. The compression scheme first finds a span-
ning tree T for the given graph (solid edges). Then it
assigns an interval to each node based on the preorder
traversal of T. That is, the interval associated with a
node v is [pre(v), maxpre(v)], where pre(v) is the preorder
number of v and maxpre(v) is the highest preorder number among
v’s descendants in T (pre(v) itself when v is a leaf). Note that
the preorder number of each node also serves as its unique
identifier. Next, all
nodes of the graph are examined in the reverse topo-
logical order so that for every edge from node p to
q, all the intervals associated with node q are added
to the intervals associated with node p, taking into
account that if one interval is subsumed by another,
the subsumed interval is not added. In the figure, the
interval [2, 7] is associated with node d when labeling
the spanning tree. Then, during the reverse topological
traversal, node d inherits the intervals [6, 6], [4, 6] and
[5, 5], corresponding to nodes g, e and h, through
the dashed edges that do not belong to the spanning
tree. Since these intervals are already subsumed by
d’s interval [2, 7], they are not added to d; had they not
been subsumed, they would have been added.
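The labeling steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the graph is given as a hypothetical adjacency dictionary, the spanning tree is simply the one produced by a depth-first traversal (not the optimum spanning tree of Agrawal's approach), and the names label_dag and add_interval are ours. Dropping previously stored intervals that a new interval subsumes is also our assumption, in the spirit of the subsumption rule.

```python
def add_interval(intervals, new):
    """Add `new` to `intervals` unless an existing interval subsumes it."""
    lo, hi = new
    if any(a <= lo and hi <= b for a, b in intervals):
        return  # subsumed: nothing to add
    # Assumption: also drop stored intervals that `new` itself subsumes.
    intervals[:] = [(a, b) for a, b in intervals if not (lo <= a and b <= hi)]
    intervals.append(new)


def label_dag(children, root):
    """Label a DAG given as {node: [child, ...]} starting from `root`.

    Returns (pre, intervals): the preorder number of each node and the
    list of [pre, maxpre] intervals that encode its descendants.
    """
    pre, intervals, postorder = {}, {}, []
    counter = 0

    def visit(v):
        nonlocal counter
        counter += 1
        pre[v] = counter
        for c in children.get(v, []):
            if c not in pre:      # first visit: (v, c) is a spanning-tree edge
                visit(c)
        intervals[v] = [(pre[v], counter)]   # [pre(v), maxpre(v)]
        postorder.append(v)       # DFS postorder is a reverse topological order

    visit(root)
    # Children are finished before their parents, so each child's interval
    # list is final by the time it is propagated upwards.
    for p in postorder:
        for q in children.get(p, []):
            for iv in list(intervals[q]):
                add_interval(intervals[p], iv)
    return pre, intervals


# Hypothetical five-node DAG (not the graph of Figure 3).
children = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
pre, intervals = label_dag(children, "a")
```

In this small example, node c keeps its tree interval (4, 5) and additionally inherits (3, 3) from d through the non-tree edge (c, d), whereas node a needs only (1, 5) because every inherited interval is subsumed.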
The storage requirement for trees labeled with
this interval scheme is O(n), since one interval per
node is enough. For DAGs, the worst case requires
O(n^2) space. However, this situation is unlikely because
Agrawal’s approach for DAGs finds the optimum
spanning tree, that is, the spanning tree that
leads to the minimum number of intervals per node and
thus to minimum storage requirements.
The next step consists of obtaining analogous information
about the ancestors of each node. The strategy applied
is as follows. First, we reverse the edges of the
original structure so that each node now points to its
parent(s) (see right graph of Figure 3). Then, a virtual
root node is created to hook together the nodes that are
leaves in the original structure. Finally, the same labeling
scheme described previously is applied to the
reversed structure. Since the edges now denote ancestor
relationships, the labeling scheme encodes
ancestor nodes. Note that each node identifier is its
preorder number and that the original structure and
the reversed one each have their own preorder numbering.
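The construction of the reversed structure might look as follows. This is a minimal sketch, assuming the graph is stored as a hypothetical adjacency dictionary with sortable node names; the function name reverse_with_virtual_root and the root label "VIRTUAL_ROOT" are illustrative choices, not part of the original system.

```python
def reverse_with_virtual_root(children, virtual_root="VIRTUAL_ROOT"):
    """Reverse every edge of {node: [child, ...]} and hook the original
    leaves under a new virtual root, so that the reversed structure can
    be traversed from a single entry point."""
    # Collect every node, including those that appear only as children.
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)

    # Each node now points to its parent(s).
    parents = {v: [] for v in nodes}
    for p, kids in children.items():
        for k in kids:
            parents[k].append(p)

    # Leaves of the original structure become children of the virtual root.
    leaves = sorted(v for v in nodes if not children.get(v))
    parents[virtual_root] = leaves
    return parents


# Hypothetical five-node DAG (not the graph of Figure 3).
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
reversed_dag = reverse_with_virtual_root(dag)
```

The resulting dictionary has the same shape as the input, so whatever labeling routine is used for the original structure can be applied to it unchanged, starting from the virtual root.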
We finally define the descriptor function of a node
v as follows:
descriptor(v) = < descpre(v), descintervals(v),
                  ancpre(v), ancintervals(v),
                  topo(v) >
where descpre(v) denotes the preorder number of v in
the original structure, descintervals(v) denotes the set
of intervals encoding v’s descendants, ancpre(v) de-
notes the preorder number of v in the reversed struc-
ture, ancintervals(v) denotes the set of intervals en-
coding v’s ancestors and topo(v) denotes the topolog-
ical order of v.
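To make the role of the descriptor concrete, the sketch below models it as a small Python record and shows how descendant and ancestor tests reduce to integer comparisons. The field names follow the definition above; the helper functions (covers, is_descendant, is_ancestor) and the sample values are hypothetical, worked out by hand for a toy DAG rather than taken from UMLS or Figure 3.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[int, int]


@dataclass(frozen=True)
class Descriptor:
    descpre: int                          # preorder number in the original structure
    descintervals: Tuple[Interval, ...]   # intervals encoding descendants
    ancpre: int                           # preorder number in the reversed structure
    ancintervals: Tuple[Interval, ...]    # intervals encoding ancestors
    topo: int                             # topological order


def covers(intervals, n):
    """True if the number n falls inside any of the intervals."""
    return any(lo <= n <= hi for lo, hi in intervals)


def is_descendant(u, v):
    """True if u is a proper descendant of v."""
    return u.descpre != v.descpre and covers(v.descintervals, u.descpre)


def is_ancestor(u, v):
    """True if u is a proper ancestor of v."""
    return u.ancpre != v.ancpre and covers(v.ancintervals, u.ancpre)


# Illustrative values for three nodes of a toy DAG with edges
# a->b, a->c, b->d, c->d, c->e (all numbers are assumptions).
b = Descriptor(2, ((2, 3),), 3, ((3, 4),), 2)
c = Descriptor(4, ((4, 5), (3, 3)), 5, ((5, 5), (4, 4)), 3)
d = Descriptor(3, ((3, 3),), 2, ((2, 5),), 4)
```

With these values, is_descendant(d, c) and is_ancestor(c, d) both hold, while is_descendant(c, b) does not, and each check touches only the two descriptors involved rather than the graph itself.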
Gathering all together, we have designed an en-
coding mechanism for concepts in a knowledge re-
BUILDING TAILORED ONTOLOGIES FROM VERY LARGE KNOWLEDGE RESOURCES