GENERATION AND USE OF ONE ONTOLOGY FOR

INTELLIGENT INFORMATION RETRIEVAL FROM

ELECTRONIC RECORD HISTORIES

Miguel A. Prados de Reyes, M. Carmen Peña Yañez, M. Amparo Vila Miranda,

M. Belen Prados Suarez

Computer Science and A.I. Department. University of Granada,Computer Science Department. San Cecilio Hospital,

Granada,Computer Science and A.I. Department. University of Granada,Computer Science Department.University of Jaen

Keywords: Information Retrieval System, Semantic Information, Semantic Organization, Ontology, Electronic Health-

care Record.

Abstract: This paper analyzes the terminology used in the diagnosis, treatment, exploration, and operation descrip-

tions entered by doctors in the electronic healthcare record. From this, expression stability (and the use of a

sufficiently limited and controlled language) is shown, which is therefore reasonably valid for a conceptu-

alization process to be employed on it. This conceptualization process is performed by the generation of an

ontology which proposes semantic classes according to the different medical concepts to be used on data-

base query profiles. By way of summary, we shall propose a semantic organizational method so that

classes, attributes and properties in the ontology may act as links between the database and the users, both

in information incorporation processes and in queries. It offers a wide range of benefits by extending and

making information management possibilities more flexible, and enabling the application of traditional data

mining techniques.

1 INTRODUCTION, STARTING

HYPOTESYS

Existing Medical Information Retrieval Systems

(MIRS) present certain problems and drawbacks as

they usually require deterministic query profiles

which use structured data sets that do not always

meet the real requirements of the staff accessing the

medical information database (Prados, 2003). This is

why documentary searches are not very thorough

and have a lot of noise (Noy, 2003). In the majority

of cases, the most interesting medical information is

to be found in the doctor's own notes, but queries on

these are not always successful as the information is

unstructured and there are the usual problems of

natural language and its formalization.

An observation of the expressions describing di-

agnoses, treatments, interventions and explorations

reveals the following characteristics: They are short;

They contain the most relevant medical aspects;

They are repetitive and have a limited language.

These characteristics suggest that it is possible to

analyze the terminology used, the composition and

the structure of these phrases in greater depth,

thereby demonstrating that: its terminological do-

main is limited; the conceptual domain is suffi-

ciently strict and categorized; phrase composition

and structure is stable.

This leads us to approach the generation of an

ontology by establishing a conceptualization based

on an in-depth analysis of the language and the

terms used, and this can be used in the data capture

and information retrieval process. The ontology

must be constructed in such a way that it can charac-

terize the diverse semantic contents that an expres-

sion entails.

2 BACKGROUND

The background deals with two closely related top-

ics. Traditional "documentary techniques" generate

an entire set of techniques and tools in their digital

management which is the object of documentary

computer science (Maniez, 1992).

Nowadays, there is a wider opinion about the

problem which is based on so-called "Text Mining"

(TM), specially motivated by interest in the "seman-

565

A. Prados de Reyes M., Carmen Peña Yañez M., Amparo Vila Miranda M. and Belen Prados Suarez M. (2006).

GENERATION AND USE OF ONE ONTOLOGY FOR INTELLIGENT INFORMATION RETRIEVAL FROM ELECTRONIC RECORD HISTORIES.

In Proceedings of the Eighth International Conference on Enterprise Information Systems - ISAS, pages 565-571

DOI: 10.5220/0002465205650571

 SciTePress

tic web". Like the data mining definition, text min-

ing may be defined as: "The entire process of ex-

tracting relevant information that is not explicitly

present in a document collection". There is a clear

distinction between this and documentary computer

science, which only attempts to show explicitly

present concepts (Delgado, 2002).

In text mining, it should be remembered that we

are dealing with information which is not particu-

larly structured, and therefore traditional data min-

ing techniques cannot be applied. This lack of struc-

ture is the biggest problem for TMs and requires the

texts to be preprocessed and converted into an "in-

termediate form" so that the algorithms and methods

(classification, association, etc.) being used may be

applied to them.

Current study of ontologies and their design and

development tools (Lambrix,2003) has produced

new methods and techniques, but mainly a wider,

more ambitious vision in textual data processing,

since they make action possible on previously con-

ceptualized domains. This is particularly interesting

as here there is an intermediate element that can be

used as a link between the "gross" textual informa-

tion and its preprocessing and subsequent treatment

using usual data mining techniques. More specifi-

cally, the ontology is particularly important in in-

formation retrieval systems since they provide the

connection between final user applications and data-

bases and, contrariwise, in visualization processes

(Guarino,1998 ).

In short, it is clear that this is an interdisciplinary

field which includes elements, methods and tech-

niques from documentary computer science, linguis-

tics, and data mining, with their contributions relat-

ing to information retrieval, information extraction,

clustering, categorization and automatic learning

(among others) (Prados,2004).

3 WORK DESCRIPTION

The basic idea behind this work is that the medical

language used in diagnostic expressions and deter-

minations (based on natural language) is sufficiently

controlled and strict with an adequately formal

grammar as to be worth using in data processing and

documentary searches which could be applied di-

rectly on the medical language without the need for

code systems. This would make documentary

searches more powerful by enabling descriptive and

qualifying aspects to be included in the pathology or

treatment, and not merely unqualified searches on

the diagnosis (Shankar, 2002). A diagnostic expres-

sion comprises terms with a semantic content which

are relevant for diagnosis. It is first necessary to

analyze the semantic typology of these terms and to

establishing the different classes to which a lexical

unit can belong.

Our aim in this paper is to show that this is pos-

sible. In order to do so, we have considered four

classes of term sets: identifiers, locators, qualifiers,

and etiologicals. From this first semantic structure,

we can extract an ontology which conceptually

describes the domain of the medical diagnosis, and

propose a methodology for information storage and

retrieval, extending the restrictive possibilities of

traditional searches on the electronic healthcare

record using the defined ontology.

Our first objective is to define what we call the

"semantic classes", each of which represents a kind

of concept used by doctors in their descriptions. Our

second objective is to check that the medical expres-

sions in natural language are in keeping with the

element coordination of each semantic class. Our

third objective is to confirm the hypothesis that the

constituent elements of each semantic class repre-

sent a perfectly determinable and controllable set. A

class may have superclasses which offer a medical

meaning membership from one given concept to

another with a more general meaning (Dameron,

2004).

We also define what we call the "semantic class

sequence" (SCS) as the sequence obtained in the

identification of the semantic classes from an ex-

pression, and their sequence of presentation. By way

of example, the SCS “I-E-L” means "identifier-

etiological-locator". Although the purpose of these

SCSs is to check stability in expression composition,

they will also be useful in both the data acquisition

process and the documentary search process from

the data structures of the semantic classes.

The system is enriched by extracting the corre-

sponding ontology (ONTOARCHINET), its hierar-

chies and properties, with it having a welldefined

conceptual environment and making its use avail-

able for the documentary search.

We shall also discuss whether this proposal is

worthwhile and if it is possible to construct a set of

data structures representing the members of each

semantic class and their properties (associations,

frequency, medical environment, etc.) so that under

the analyzed (cleaned and filtered) documentary

search, the SCS can be recognized. Its components

are then extracted in order to perform the search,

focusing or extending it according to the semantic

components in the ontology.

ICEIS 2006 - INFORMATION SYSTEMS ANALYSIS AND SPECIFICATION

566

4 PREVIOUS MEDICAL

TERMINOLOGY RESULTS:

EXAMPLE OF CARDIOLOGY

A search was performed on our electronic healthcare

record system (relating to admissions between 2000

and 2004) for medical records listing cardiology in

their documentary profile. The number total of ex-

pressions found in the period of time is, 433.492;

Related to cardiology 39.781; and syntactically

different 1.656.

Two expressions are considered syntactically

different when there is some variability in the terms

they comprise or in their order. At this point, we

should introduce the concept of "medically different

expressions". By syntactically grouping different

expressions according to this criteria, we obtain 627

medically different expressions. On examination of

the results obtained, we find that the number of high

frequency expressions is greatly reduced, and that

the frequencies of the others are less relevant. This

suggests that the language is unified to a large ex-

tent.

In order to provide our system with a value rep-

resenting the differentiation degree of expressions,

we define an index (which we call the “diversifica-

tion index”) as the quotient between the number of

syntactically different expressions and the number

of different medical expressions. By analyzing the

results, we can extract the following fundamental

considerations:

The number of medically different expressions

does not obviously differ from the number of syn-

tactically different expressions. This confirms the

previously indicated language unification.

As the number of expressions increases, diversi-

fication tends to decrease or to stabilize. This sug-

gests that the system is stable: if we increase the size

of the study period, the obtained results regarding

differentiation would not be significantly different

from the ones presented here.

In the second stage, the semantic composition of

the expressions is analyzed by studying the terms

comprising them. This analysis suggests the exis-

tence of a small number of the following specific

semantic categories:

Identifiers: Description terms with maximum

rank within the expression, absence of which deter-

mines its lack of logical sense. Such a descriptor

necessarily refers to a pathological process, opera-

tion or exploration.

Qualifiers: Secondary terms that specify the

meaning of the identifier. On their own, they have

no clinical meaning but they indicate a medical

qualification: “serious", "acute", "spontaneous", etc.

These qualifiers can, in turn, have different typolo-

gies such as those qualifying severity, time, com-

pensation or location.

Etiologicals: Terms representing the origin or

causes of the pathological process (when it exists).

Some examples we can mention are accidental poi-

soning, infections or age, and these appear in the

main diagnosis as the cause.

Locators: These represent the anatomical place

of the human organism where the damage or injury

appears and the identifier takes place (such as auri-

cle, ventricle, or myocardium, etc. for the specialty

that this work focuses on).

The number of descriptors found for each cate-

gory is: 115 for identifiers, 167 for qualifiers, 29 for

etiologiclas, and 95 for locators.

The vocabulary is clearly small enough for data

management not to be excessively difficult. Al-

though we have considered the possibility of ignor-

ing very low frequency descriptors, we do not be-

lieve this to be appropriate since there are patholo-

gies with a very low frequency but with a very high

medical interest. In addition, a kernel of high fre-

quency members can be seen, with a considerably

different frequency regarding the classes outside this

kernel. The border comprises a much reduced num-

ber of members.

5 “ONTOARCHINET”

ONTOLOGY EXTRACTION

According to the terminological study undertaken, it

is necessary to establish which the classes of the

ontology and their possible hierarchizations are in

accordance with the generality character. For this

purpose, utility character must prevail in order to

offer access to the information depending on the

elements in each class and their possible priorities.

The ontology focuses on documentary search using

the concepts contained in the free-text diagnosis

descriptions used by doctors.

The stages of the ontology construction process

are: Determination of domain and scope; Definition

of domain concepts (classes); Hierarchical classifi-

cation of concepts (subclasses–superclasses hierar-

chy) Definition of class properties (slots) and their

value restrictions;Definition of instances, according

to the defined properties.

5.1 Scope and Domain

Determination

As seen before, the domain is limited to diagnosis

expressions: summary representations relating to a

diagnosis, treatment, operation or exploration de-

GENERATION AND USE OF ONE ONTOLOGY FOR INTELLIGENT INFORMATION RETRIEVAL FROM

ELECTRONIC RECORD HISTORIES

567

termined by doctors involved in a health care proc-

ess (consultation, clinical admission, urgent assis-

tance, etc.). There may be several expressions within

any process. The domain of this ontology reaches

expressions such as "aborted myocardial infarction"

or "previous acute extensive myocardial infarction".

Several queries may be done in this field like:

• Could we determine the characteristics for a

given identifier?

• Is a given descriptor cause or effect?

• Which patients have suffered an identifier

for a severity qualifier?

• Does a given semantic class have the etio-

logical as a secondary associated cause?

The role of the ontology is two-fold: firstly, it is

a place to store expressions which it has classified

terminologically and semantically, according to the

established classes and hierarchies; and secondly, it

enables documentary search in accordance with the

established semantic classes and their corresponding

hierarchies and properties.

Figure 2: General query scheme.

Figure 1: Relationship general scheme.

ICEIS 2006 - INFORMATION SYSTEMS ANALYSIS AND SPECIFICATION

568

5.2 Concept Definition in the

Domain (classes)

These concepts must be those used in diagnosis

expressions and also those delimiting the scope of

the medical specialty. In addition, the concept of

descriptor type according to the established catego-

ries mentioned in the previous section must also be

introduced.

It is clearly of clinical interest to consider essen-

tial factors describing a pathology which is accom-

panied by relevant descriptions. In this regard, the

categories established for terminological study seem

appropriate: identifiers, qualifiers, etiologicals and

locators (Figure 1).

The target of these main classes is to define se-

mantic concepts. However, it is possible to establish

other classes which are useful for the documentary

search and acquisition process in addition to others

which are directed towards the structuring of differ-

ent medical fields such as: Medical specialty , Sys-

tem (human ana-tomy) and Infraconcepts (some

terms that become representative when others are

present.

5.3 Hierarchical Classification of

Concepts

Although, in principle, hierarchies are established

according to the generality levels of each descrip-

tor’s semantic concept, the generalist nature we may

encounter might require character “subtype” sub-

classes to be defined.

For each class, we have established the levels we

consider strictly necessary, thereby avoiding their

proliferation which could make ontology manage-

ment uncomfortable. In the following sections, we

shall present the hierarchization for each of the main

classes considered.

Identifiers: In this case, the hierarchy is strictly

set acording to a sense of generalitation or speciali-

zation from a strictly clinical point of view.

Qualifiers: This class suggests the existence of

“subtype” members: qualifier classes established

according to the character they qualify. We can then

find qualifiers such as: Severity: qualifies how the

pathology affects the patient; Relevance: qualifies

the degree to which the pathology is present; Com-

pensation: stability in development and presentation;

Clinical: qualifies the existence of associated pa-

thologies such as “obstructive ischemia” ; Timing:

qualifies the chronicity level.

From this first level including different qualifier

sub-types, hierarchies are established according to

the level of generality.

Etiologicals.- Similarly, we should examine the

different types of causes producing the illness. We

should consider three initial subtypes: external

causes, secondary (to other pathologies) causes, and

internal causes (metabolic, physiologic). In order to

avoid an excessive number of levels in the ontology,

internal causes have been included in the secondary

causes. External causes have a wide typology de-

pending on whether they refer to habits, accidents,

external agents, etc.

Locator: The purpose of this class is to define

the affected anatomical area; the anatomical areas

must then be classified. It is then necessary to decide

which of the commonly used hierarchization criteria

to follow. From the point of view of the ontology,

this is a secondary topic since what we want to iden-

tify is an area rather than its function (this might be

a property in the ontology).

Infraconcepts: Because of the usefulness and

ease of use of the ontology, we consider this class in

order to clarify the meaning of the concepts repre-

sented by other classes in the ontology. Initially, we

only considered one kind of infraconcept (the so

called infralocator) that enables us to precisely de-

termine an anatomical location (e.g. the “internal

side” of a given anatomical locator). The infraloca-

tor name is given by the ambiguous meaning, and

needs to be joined to a locator class member in order

to have a specific meaning, such as in “ventricle

internal side”.

The existence of more infraconcepts is an open

topic in the ontology, since we have only considered

the Infralocator infraconcept.

Once concepts have been defined according to

the class structure outlined above, we propose an

implementation in "Protégé", the tool used to define

our ontology.

5.4 Definition of Class Attributes

and Properties (slots), and

Restrictions on Their Values

The properties (intrinsic properties) represent value

relationships with other classes (Figure 1). There

may also be extrinsic properties which provide

added values for the corresponding instances which

are meaningful from the clinical or medical man-

agement point of view.

Looking at the intrinsic properties, the slots in

the class definition describe the class instance attrib-

utes and relationships with other instances. By way

of example, each identifier is then related to a cause

and anatomical area. An anatomical location may

also be linked to a certain qualification. The purpose

of these slots is to establish relationships and restric-

tions between the different class instances. This

GENERATION AND USE OF ONE ONTOLOGY FOR INTELLIGENT INFORMATION RETRIEVAL FROM

ELECTRONIC RECORD HISTORIES

569

enables class hierarchy navigation in the documen-

tary search but assuming restrictions which give

coherence to the possible profiles designed.

Designing these properties and relationships is

laborious and complex since they refer to objects

from other classes. So far, we have been referring to

intrinsic or extrinsic properties.

5.5 Instance Definition

Instances are each of the terms extracted during

previous linguistic analysis once the class they be-

long to has been typified. As mentioned previously,

the set is small enough for it to be managed easily.

Class-subclass hierarchization facilitates this by

reducing the number of direct final instances ob-

tained.

Two problems arise when these instances are de-

fined. The first problem is the need for medical

knowledge in order to define an instance member-

ship to a given class-subclass, particularly in the

identifier class. This is important since a wrong

definition would imply loss of exhaustiveness in the

documentary search. In such a case, the only solu-

tion is technical criteria. According to the method

we have developed, however, instantiation comes

from the very terminology and phrases used by

doctors and their classification according to ICD-9-

OMS. All of this reduces the error possibilities to a

minimum, and these are no greater than the differ-

ences that we would find if two specialists were to

construct the class-subclass instance structure. The

second problem is the treatment of documentary

synonymy. In order to solve this problem, before

defining the ontology, a previous and exhaustive

lexical analysis is needed (as we have done),

whereby this synonym is established using medical

supervision criteria.

6 USAGE METHODOLOGY

The use of the ONTOARCHINET ontology has

three levels of interest: acquisition process and data

storage; documentary query; moving through the

ontology

6.1 Acquisition Process and Data

Storage

Using the expression in natural languages, once it

has been filtered and cleaned, the terms with seman-

tic meaning can be matched to concepts in the on-

tology. In this process, the expression is expressed

using the concepts in the ontology and adding the

additional information stored in the ontology.

In the matching process, we also obtain a se-

quence of semantic SCS classes. These can be stored

in the database and used to calculate concept rela-

tion frequencies and the depth of each concept in the

ontology. By way of example, let us suppose the

expression in natural language: "Congestive cardiac

insufficiency due to acute lung edema". We would

obtain two expressions from this in the lexico-

graphic analysis: "Congestive cardiac insufficiency

due to acute lung edema" and “acute lung edema",

which would be described by means of the classes:

I=Insufficiency, L=Heart, C=Congestive, E=Edema,

C=Acute, L=Lung with SCS = ICLECL; and

I=Edema, C=Acute, L=Lung, with SCS=ICL, for the

second expression. Both the structured data and the

expression in natural language would be stored in

the database.

6.2 Documentary Query

The documentary query acquires the capacity of

multiple flexible queries which are very useful for

users. The process attempts to establish the concepts

contained in a query profile. Once these concepts

have been determined, selection on the database will

be made according to the profiles generated using

the hierarchical structure of classes and properties in

the ontology. This provides the query process with

flexibility and exhaustiveness and also frees it of

noise. The general scheme of user access is shown

in Figure 2, together with the different agents in-

volved in the query. Rather than exploring the read-

dressing possibilities in the query, or profiles of

forced sequences, exact or generic searches, we

shall simply say that all of them are perfectly feasi-

ble.

6.3 Moving through the Ontology

The possibility of moving through the structure of

the ontology is the first objective of the link with the

database, but it can also contribute to knowledge

about the relations and properties contained in the

ontology, which represents a way to access the data

structure and therefore build profiles by selecting

members and properties from the ontology. Depend-

ing on the extrinsic properties included, the ontology

will be able to answer a certain type of query, and

not only act as a link with the database. The great

advantage of the proposed method is that it does not

require a totally built or static initial ontology, and

therefore it can be improved as new expressions,

terms, concepts and relations arise, and this requires

ICEIS 2006 - INFORMATION SYSTEMS ANALYSIS AND SPECIFICATION

570

there to be an interface for the user specializing in

the maintenance work.

7 CONCLUSIONS

Our main objective was to establish the foundations

for developments that would allow information to be

processed using natural language. In our opinion,

this could then be used for terminology in the area

of cardiology:

• The structured sample is representative of the

system since the diversification index tends to

become stabilized as the number of cases in-

creases, and diversification is not excessively

high for the heads of series studied.

• By classifying the terms, the size of the classes

allows comfortable indexation of the terms.

Structures of classes-subclasses can be gener-

ated

• The physical data model can easily be generated

in the computer system by means of the corre-

sponding class tables.

• The classes and instances are enriched by means

of properties offering a greater richness of se-

mantic interpretation and more possibilities to

focus on the queries or extend them.

• The classes are sufficiently short and controlla-

ble, thereby ensuring process execution effi-

ciency

• The system allows an appreciably more powerful

documentary search than the traditional access

system using ICD-OMS codes. The tedious

process of codifying each expression is not nec-

essary

• Redundant fields in the tables are not needed to

codify contents which are already in expres-

sions provided by the doctor.

• Class and member properties can be added so that

the search process can be enriched by howing

these properties or using then in decision proc-

esses.

• The system can be enhanced by the incorporation

of new terms and expressions. In order to con-

sider more classes than those presented in this

paper, the semantic richness of the system

would be increased.

REFERENCES

Dameron O., Gibaud B., & Musen M. A. 2004. Using

semantic dependencies for consistency management of

an ontology of brain-cortex anatomy. First Interna-

tional Workshop on Formal Biomedical Knowledge

Representation KRMED04, Whistler, Canada.

Delgado M., Martin-Bautista M.J., Serrano J.M., Sanchez

D., Vila M.A., 2002. Association Rules extraction for

Text Minino. Lectures Notes in Artificial Intelligence

V. 2522 pp.154-162

Guarino N., 1998. Formal Ontology and Information

Systems. Formal Ontology in Information Systems.

pp. 3-15. (FOIS 98). Trento (Italia). IOS Press Weav-

ing the Biomedical Semantic Web with the Protégé

OWL Plugin. First International Workshop on Formal

Biomedical Knowledge Representation, Whistler,

Canada.

Lambrix P., Habbouche M., Perez M., 2003. Evaluation of

ontology development tools for bioinformatics.

Bioinformatic Vol. 19, n.12 pag 1564-1571.

Maniez J., 1992. Los lenguajes documentales y de

indización. Fundación Germán Sánchez Ruperez.

Pirámide, Madrid.

Martin-Bautista M.J., Sanchez D, Chamorro-Martinez J.,

Serrano J.M., Vila M.A. 2004. Mining Web Docu-

ments To Find Additional Query Terms Using Fuzzy

Association Rules. Fuzzy Sets and Systems V.148 pp.

85-1104.

Noy N., Tu S., 2003. Developing Medical Informatics

Ontologies Using Protégé Tutorial materials for AMIA

2003 tutorial.

Prados M., Peña M.C. 2003, Sistemas de Información

hospitalarios. Organización y gestión de Proyectos.

EASP (Escuela Andaluza de Salud Pública), Granada,

edition.

Prados M., Peña M.C., Prados M.B., Garrido J.M. 2004,

Gestión de conocimiento en el ámbito hospitalario.

EASP (Escuela Andaluza de Salud Pública), Granada,

edition.

Shankar R. D., Tu S. W., & Musen M. A.,2002. Use of

Protégé-2000 to Encode Clinical Guidelines. SMI-

2002-0944.(http://protégé.stanford.edu/doc)

GENERATION AND USE OF ONE ONTOLOGY FOR INTELLIGENT INFORMATION RETRIEVAL FROM

ELECTRONIC RECORD HISTORIES

571