MINING OF HEALTH INFORMATION FROM ONTOLOGIES
Maja Hadzic
Research Lab for Digital Health Ecosystems, Curtin Universityof Technology, Austria
Fedja Hadzic, Tharam Dillon
Digital Eciossytems and Business Intelligence Institute (DEBII), Curtin Universityof Technology, Austria
Keywords: Data mining, Tree mining, Ontology, Ontology mining, Human disease ontology, Human disease study,
Health information system.
Abstract: Data mining techniques can be used to efficiently analyze semi-structured data. Semi-structured data are
predominantly used within the health domain as they enable meaningful representations of the health
information. Tree mining algorithms can efficiently extract frequent substructures from semi-structured
knowledge representations. In this paper, we demonstrate application of the tree mining algorithms on the
health information. We illustrate this on an example of Human Disease Ontology (HDO) which represents
information about diseases in 4 ‘dimensions’: (1) disease types, (2) phenotype (observable characteristics of
an organism) or symptoms (3) causes related to the disease, namely genetic causes, environmental causes or
micro-organisms, and (4) treatments available for the disease. The extracted data patterns can provide
useful information to help in disease prevention, and assist in delivery of effective and efficient health
services.
1 INTRODUCTION
New modern techniques are providing huge, rapidly
accumulating amounts of information. To extract
and analyze the data poses a much bigger challenge
for researchers than to generate the data (Holloway
et al., 2002). Experienced scientists and doctors are
overwhelmed with this situation.
There is a need for an intelligent and efficient
system to make use of all the available information.
The true value of this information can be
significantly increased through smart information
processing and analysis. Such systems could play a
crucial role in filtering the flood of data to the point
where human experts could apply their knowledge
sensibly.
Information technologies must be effectively
implemented within health domain. In their paper,
Horvitz-Lennon et al. (2006) state that we need to
fully embrace information technology and its
potential for improving service efficiency and
develop a better information infrastructure for the
patient’s care.
Data mining is a set of processes that is based on
automated searching for actionable knowledge
buried within a huge body of data. Data mining
techniques extract information and find hidden
patterns and behaviors, and support making
predictive models for decision making and new
discoveries. Within the biomedical and health field,
data mining techniques have been predominately
used for tasks such as text mining, gene expression
analysis, drug design, genomics and proteomics
(Zaki et al., 2003). The data analysis necessary for
microarrays has necessitated data mining (Piatetsky-
Shapiro & Tamayo, 2003). Recently, use of data
mining methods has been proposed for the purpose
of mapping and identification of complex disease
loci (Onkamo & Toivonen, 2006). However, the
proposed methods are yet to be implemented.
Frequent pattern analysis has been a focused theme
of study in data mining. A lot of algorithms and
methods have been developed for mining frequent
sequential and structural patterns (Han & Kamber,
2001; Agrawal & Srikant, 1994; Tan et al., 2006a).
Implementation of data mining techniques within
health domain could help in disease prevention and
155
Hadzic M., Hadzic F. and Dillon T. (2008).
MINING OF HEALTH INFORMATION FROM ONTOLOGIES.
In Proceedings of the First International Conference on Health Informatics, pages 155-160
Copyright
c
SciTePress
assist in delivery of effective and efficient health
services.
Within the data mining field, tree mining has
recently attracted lots of interest. Our work in the
tree mining field is characterized by a Tree Model
Guided (TMG) (Tan et al., 2005; Tan et al., 2006b)
candidate generation approach. This non-redundant
systematic enumeration model uses the underlying
tree structure of the data to generate only valid
candidates. Using the general TMG framework a
number of algorithms were developed as follows.
MB3-Miner (Tan et al., 2005) mines ordered
embedded subtrees while IMB3-Miner (Tan et al.,
2006b) can mine both, induced or embedded ordered
subtrees by using the level of embedding constraint.
Razor algorithm (Tan et al., 2006c) was developed
for mining of embedded subtrees where the
distances of nodes relative to the root of the original
tree need to be considered. UNI3 algorithm (Hadzic
et al., 2007a) mines induced unordered subtrees.
Our algorithms were applied on large and complex
tree structures and their scalability was
experimentally demonstrated (Tan et al., 2005; Tan
et al., 2006b). From the application perspective, in
(Hadzic et al., 2006) we have applied our tree
mining algorithm for the analysis of Protein
Ontology (Sidhu et al., 2004) database for Human
Prion proteins which was represented in XML
format. In this paper, we explain how tree mining
techniques can be applied within the health domain
for deriving useful knowledge patterns that can help
disease prevention and management.
2 HUMAN DISEASE ONTOLOGY
We designed Human Disease Ontology (GHDO)
(Hadzic & Chang, 2005) to have the following four
branches or subontologies:
disease types, describing different types of a
disease;
phenotype, describing disease symptoms;
causes responsible for that disease that can be
genetic, environmental and/or microorganism;
treatments, providing an overview of all
treatments possible for a particular disease;
Top-level hierarchy of the HDO is illustrated in
Figure 1.
The information presented in Figure 1 state that a
disease may have different types, and associated
subtypes and sub-subtypes. For each disease, there
is a corresponding phenotype (or observable
characteristics of an ill individual), namely
symptoms of a disease. Disease cause can be genetic
(genotype), environmental or a microorganism.
Genetic causes can be a mutated gene, a complex of
genes or DNA region of interest. DNA region of
interest is a region in the DNA sequence that
potentially contains a gene responsible for the
disease. This region needs to be further examined in
order to correctly locate the mutated gene.
Environmental causes of a disease can be stress,
climate, drugs misuse, family conditions or
economic condition. Microorganisms that may cause
a disease may be virus or bacteria. Possible
treatments for a disease can be drug therapy,
chemotherapy, surgery, psychotherapy or
physiotherapy.
Figure 1: Top-level hierarchy of the HDO.
Researchers in the medical ontology-design
field have developed different terminologies and
ontologies in many different areas of medical
domain. In order to obtain some uniformity across
different ontologies, definitions from other
published and consensual ontologies can be reused
(Noy & Musen, 2000). We can use other ontologies
such as LinkBase (Montyne, 2001) and UMLS
(Bodenreider, 2004). These ontologies contain over
million concepts, and we do not require all of them.
The four different branches (subontologies) of the
GHDO ontology can serve as a reference point
against which the concepts from the existing
ontologies can be reorganized, aligned and merged.
Bimolecular ontologies such as TAMBIS ontology
(Stevens et al., 2002) that represents general
knowledge in regard to proteins and associated
genes may be suitable to cover the ontology part in
regard to genetical disease causes. So, we can use
terminology from existing ontologies but select and
organize the concepts in a way that can be used in
our application.
HEALTHINF 2008 - International Conference on Health Informatics
156
Note that the HDO has a tree (hierarchy)
structure which allows it to be analyzed using some
available tree mining techniques. The current
ontology languages allow the use of graph structures
to represent the domain knowledge in an ontology.
A large portion of current ontologies have
predominantly hierarchical structures. Furthermore,
it is often the case that the graph-structured
knowledge representation can be modelled using
tree structures without losing too much semantics.
The root of the complexity of mining graph
structures is in the existence of cycles, and in many
cases the number of cycles in graph instances is
limited. The complexity of processing tree structures
tends to be more manageable and is one of the
promising directions towards automatic analysis of
ontologies.
We assume that the available health information
can be represented according to the four HDO
‘dimensions’ or subontologies. In the rest of our
paper, we will use the term ‘HDO instance’ to refer
to a specific record found within a given
database/application that can be represented using
the HDO structure. In our previous publications
such as (Hadzic & Chang, 2005), this ‘HDO
instance’ corresponds to Specific Human Disease
Ontology.
3 TREE MINING CONCEPTS
The aim of this section is to provide the definitions
of some basic tree concepts necessary for
understanding the current work. Please refer to (Tan
et al., 2005; Tan et al., 2006b; Hadzic et al., 2007a)
for a more extensive overview of the tree mining
area including the discussion of implementation
issues and algorithm comparisons. A tree is a special
type of graph where no cycles are allowed. It
consists of a set of nodes (or vertices) that are
connected by edges. Each edge has two nodes
associated with it. A path is defined as a finite
sequence of edges and in a tree there is a single
unique path between any two nodes. The length of a
path p is the number of edges in p. A rooted tree has
its top-most node defined as the root that has no
incoming edges and for every other node there is a
path between the root and that node. A node u is
said to be a parent of node v, if there is a directed
edge from u to v. Node v is then said to be a child of
node u. Nodes with no children are referred to as
leaf nodes and otherwise they are called internal
nodes. If for each internal node, all the children are
ordered, then the tree is an ordered tree. The
problem of frequent subtree mining can be generally
stated as: Given a tree database T
db
and minimum
support threshold (σ), find all subtrees that occur at
least σ times in T
db
.
A HDO instance can be captured by an OWL
document analogous to the one shown in Figure 2.
In our example, the actual information content can
be viewed as an ordered labeled tree. This specific
instance aims to capture the information about
causal factors of a human disease and corresponds
with the Causes subontology shown in Figure 1.
Figure 2: OWL representation of part of the HDO
subontology i.e. Causes subontology.
A tree can be denoted as T(V, L, E), where:
(1) V is the set of vertices or nodes;
(2) L is the set of labels of vertices, for any vertex
vV, L(v) denotes the label of v; and
(3) E = {(x,y)| x,y
V } is the set of edges in the
tree.
Figure 3: Causes subontology of HDO (viewed in
Protégé).
MINING OF HEALTH INFORMATION FROM ONTOLOGIES
157
In Figure 3, we represent the OWL document
from Figure 2 as a tree with “Cause” as the root
node.
Most of the current tree mining algorithms are
mainly focused on extracting induced and embedded
subtrees. An induced subtree preserves the parent-
child relationships of each node in the original tree.
In addition to this, an embedded subtree allows a
parent in the subtree to be an ancestor in the original
tree and hence ancestor-descendant relationships are
preserved over several levels. Formal definitions
follow:
A tree T’(V’, L’, E’) is an induced subtree of a
tree T (V, L, E) iff (1) V’V, (2) E’E, (3) L’L and
L’(v)=L(v), (4) v’V’, vV, v’ is not the root
node, and v’ has a parent in T’, then
parent(v’)=parent(v).
A tree T’(V’, L’, E’) is an embedded subtree of a
tree T(V, L, E) iff (1) VV, (2) if (v
1
,v
2
) E’ then
parent(v
2
) = v
1
in T’, only if v
1
is ancestor of v
2
in T
and (3) LL and L’(v)=L(v).
The subtrees can further be distinguished
depending on the order of sibling nodes. An
ordered subtree preserves the left-to-right ordering
among the sibling nodes in the original tree. In an
unordered subtree the left-to-right ordering among
the sibling nodes does not need to be preserved. The
order of the sibling nodes (and the subtrees rooted at
sibling nodes) can be exchanged and the resulting
subtree would be considered the same. Examples of
different subtree types related to the Figure 3 are
given in Figure 4.
The available support definitions are transaction
based, occurrence-match, and hybrid support (Tan et
al., 2006b; Hadzic et al., 2007b). Transaction
based support only checks for the existence of
items in a transaction while occurrence-match
support takes the repetition into account and counts
the item occurrences in the database as a whole.
Using hybrid support with a threshold set to x|y, a
subtree will be considered as frequent iff it occurs at
least y times in x number of transactions.
Figure 4: Example of different subtree types.
4 MINING OF HEALTH
INFORMATION
In order to apply tree mining to the problem, the
following five phases take place: data selection and
cleaning, data formatting, tree mining, pattern
discovery, and knowledge testing and evaluation.
In the first phase, we focus on the Data Selection
and Cleaning. Most of the databases also contain
information that is not needed by the application.
The irrelevant information, as well as noise and
inconsistent data, should be removed from the data
set to be mined. Extra care should be taken during
this process since some data may appear to be noisy
but in fact represents a true exceptional case.
In the second phase, Data Formatting, all the
collected data needs to be represented using the
same format that is understandable by the tree
mining algorithm used.
The third phase is concerned with applying the
tree mining algorithms for extracting interesting
patterns from now a clean and correctly formatted
dataset. One needs to consider what particular type
of subtree is the most suitable one to mine for
satisfying the application needs. When the data to be
mined comes from one organization, the format and
ordering of the presented information is expected to
be the same, and hence mining of ordered subtrees
will be sufficient. On the other hand, if the collected
data originate from separate organizations, then
mining of unordered trees would be more suitable.
The organizations could order their concepts
differently, but since an unordered subtree with
different order among the sibling nodes is still
considered as the same candidate, the common
characteristics of a particular illness would still be
found. Another choice to be made is whether
induced or embedded subtrees should be mined.
When mining health related data it is important that
particular information stays in the context where it
occurred. With respect to tree patterns this implies
that the relationship of nodes in the extracted
subtrees should be limited to parent-child
relationships. Allowing ancestor-descendant
relationships would result in information loss about
the context where a particular disease characteristic
occurred. Some attributes in the dataset may have a
similar set of values and by mining induced subtrees
there the attribute to which each value belongs will
be indicated in the extracted pattern.
As a final consideration the support definition
chosen should be dependant on the way that the data
is organized. Next we provide three common ways
HEALTHINF 2008 - International Conference on Health Informatics
158
in which the data could be presented, and indicate
the suitable support definition that should be used.
Case 1: Each HDO instance is stored as a separate
subtree in the OWL document and HDO instances
describing different diseases are stored in separate
documents. In this case both, occurrence match or
transaction based support would be suitable.
Case 2: Each HDO instance is stored as a separate
subtree in the OWL document but now one OWL
document contains all HDO instances for all
investigated diseases. Here the transactional support
would be more appropriate.
Case 3: A collection of HDO instances related to
one particular disease is always contained in a
separate subtree of an OWL document. Hybrid
support definition is most suitable in this case.
In cases 1 and 2, the minimum support threshold
should be close to the number of HDO instances that
the dataset contains about a particular disease, in
order to find the commonality among all the records
related to that particular disease. However, since
noise is often present in the data, the support can
lowered but not too much so that irrelevant factors
are not picked up as important. In case 3, the
number of diseases described would be used as the
transactional part of the hybrid support, while the
minimum occurrence of a subtree within each
transaction should reflect the number of HDO
instances.
As a common data mining practice, the data set
at hand could be split into two subsets. One is used
for deriving the knowledge model (‘internal data’
from Figure 5) and while the second one is used for
testing the derived knowledge model (‘external data’
from Figure 5). When possible the data collected by
another organization can be used as external data.
During Pattern Discovery phase, new knowledge
about specific disease(s) emerges. For example,
these results may help to associate precise
combinations of genetic and environmental factors
with a specific disease type. The results could
increase the understanding of the disease under
study and make a breakthrough in the research,
control and prevention of this disease. Knowledge
Testing and Evaluation is illustrated in Figure 5. The
‘external data’ is used to verify a formed hypothesis
before it can extend the current body of knowledge.
The choice of the tree mining parameters often
affects the nature and granularity of the obtained
results. In cases where the hypothesis is not
supported by the ‘external data’, the parameters will
be adjusted and the previous steps alternated.
Figure 5: Testing and evaluation of the derived
knowledge.
5 CONCLUSIONS
Data mining systems in general could play a crucial
role in deriving knowledge and assisting in the
prevention, diagnosis, treatments and control of
human diseases. In this paper, we have illustrated
application of the data mining technology within the
health domain. We have explained how the tree
mining algorithms can be effectively used for
mining of information about human diseases. We
aim to apply those techniques on the real-word data.
The newly derived knowledge could help in
prevention of human diseases and assist in delivery
of effective and efficient health services. Various
community groups would greatly benefit from
application of data mining techniques on a large
scale, e.g. physician would receive support in early
diagnosis and treatment of diseases; patients would
receive support in dealing with, managing and
treating disease; accurate, reliable and up-to-date
information would be available for general public,
and medical researchers would receive support in
advancing their research. Furthermore, the cost of
the health budget would be significantly reduced by
providing better information use (Garber, 2006).
Such systems go some way to delivering what Patel
et al (2006) say is necessary to transform the quality
of health care. They improve the infrastructure for
evidence-based interventions and provide innovation
for improvement in health care.
REFERENCES
Agrawal, R, Srikant, R., 1994. Fast algorithms for mining
association rules. Very Large Databases, Chile.
MINING OF HEALTH INFORMATION FROM ONTOLOGIES
159
Bodenreider, O., 2004. The Unified Medical language
System (UMLS): integrating biomedical terminology,
Nucleic Acids Res, vol. 32, no. 1, pp. 267-270.
Garber, A.M., 2006. PERSPECTIVE: To Use Technology
Better. Health Affairs, pp. W51-W53.
Hadzic, F., Dillon, T.S., Sidhu, A., Chang, E., Tan, H.,
2006. Mining Substructures in Protein Data. ICDM’06
Wkshp on Data Mining in Bioinformatics, China.
Hadzic, F., Tan, H., Dillon, T.S., 2007a. UNI3 – Efficient
Algorithm for Mining Unordered Induced Subtrees
Using TMG Candidate Generation. IEEE Symposium
on Computational Intelligence and Data Mining, USA.
Hadzic, F., Tan, H., Dillon, T.S., Chang, E., 2007b.
Implications of frequent subtree mining using hybrid
support definition, Data Mining & Information
Engineering, UK.
Hadzic, M., Chang, E., 2005. Ontology-based Support for
Human Disease Study. Proc. of the 38th Hawaii Int’l
Conf. on System Sciences (HICSS-38), USA.
Han, J., Kamber, M., 2001. Data Mining: Concepts and
Techniques. San Francisco: Morgan Kaufmann.
Holloway, A., Van Laar, R.K., Tothill, R.W., and Bowtell
D., 2002. Options available - from start to finish - for
obtaining data from DNA microarrays II. Nature
Genetics Supplement, vol. 32, pp. 481-489.
Horvitz-Lennon, M., Kilbourne, A.M., Pincus, H.A.,
2006. From Silos To Bridges: Meeting The General
Health Care Needs Of Adults With Severe Mental
Illnesses. Health Affairs, vol. 25, no. 3, pp. 659-669.
Montyne, F., 2001. The importance of formal ontologies:
a case study in occupational health. Proc. of Int’l
Wkshp on Open Enterprise Solutions: Systems,
Experiences, and Organizations (OES-SEO2001), Italy.
Noy, N.F., Musen, M.A., 2000. PROMPT: Algorithm and
Tool for Automated Ontology Merging and
Alignment, Proceedings of the seventeenth national
conference on Artificial Intelligence (AAAI-2000), USA.
Onkamo, P., Toivonen, H., 2006. A survey of data mining
methods for linkage disequilibrium mapping. Human
genomics, vol. 2, no. 5, pp. 336-340.
Patel, K.K., Butler, B., Wells, K.B., 2006. What Is
Necessary To Transform The Quality Of Mental
Health Care. Health Affairs, vol. 25, no. 3, pp. 681-693.
Piatetsky-Shapiro, G., Tamayo, P., 2003. Microarray Data
Mining: Facing the Challenges. SIGKDD
Explorations, vol. 5, no. 2, pp. 1-6.
Sidhu, A.S., Dillon, T.S., Sidhu, B.S., Setiawan, H., 2004.
A Unified Representation of Protein Structure
Databases. Biotechnological Approaches for
Sustainable Development, pp. 396-408.
Stevens, R., Baker, P., Bechhofer, S., Ng, G., Jacoby, A.,
Paton, N.W., Goble, C.A., Brass, A., 2002. TAMBIS:
Transparent Access to Multiple Bioinformatics
Information Sources. Bioinformatics, vol. 16, no. 2,
pp.184-86.
Tan, H., Dillon, T.S., Hadzic, F., Chang, E., Feng, L.,
2005. MB3-Miner: mining eMBedded sub-TREEs
using Tree Model Guided candidate generation.
Workshop on Mining Complex Data, USA.
Tan, H., Dillon, T.S., Hadzic, F., Chang, E., 2006a.
SEQUEST: mining frequent subsequences using DMA
Strips. Data Mining and Information Engineering,
Czech Republic.
Tan, H., Dillon, T.S., Hadzic, F., Feng, L, Chang, E.,
2006b. IMB3-Miner: Mining Induced/Embedded
subtrees by constraining the level of embedding.
Pacific-Asia Conf. on Knowledge Discovery and Data
Mining, Singapore.
Tan, H., Dillon, T.S., Hadzic, F., Chang, E., 2006c. Razor:
mining distance constrained embedded subtrees. IEEE
ICDM’06 Wkshp on Ontology Mining and Knowledge
Discovery from Semistructured documents, China.
Zaki, M.J., Wang, J.T.L., Toivonen, H.T.T., 2003. Data
Mining in Bioinformatics: report on BIOKDD’03’.
SIGKDD Explorations, vol. 5, no. 2, pp. 198-200.
HEALTHINF 2008 - International Conference on Health Informatics
160