A NEW GENERATION OF DIGITAL LIBRARY TO SUPPORT
DRUG DISCOVERY RESEARCH
Edy S. Liongosari, Anatole V. Gershman, Mitu Singh
Accenture Technology Labs, 161 N. Clark Street, Chicago, Illinois 60601, USA
Keywords: Data Integration, Visualization, Link Analysis, Drug Discovery.
Abstract: The recent explosion of publicly available biomedical information has given drug discovery researchers
unprecedented access to a wide variety of online repositories, but the sheer volume of the available data
diminishes its utility. This is compounded by the fact that these repositories suffer from a silo effect: data
from one cannot be easily linked to data in another. This is true for both publicly available sources and
internal sources such as project reports. The ability to explore all aspects of biological data and to link data
across sources is valuable, as it allows researchers to discover new knowledge and to identify new
collaboration opportunities. This paper presents an approach to solving this problem and
an application that allows researchers to browse and analyze disparate biomedical repositories as one
semantically integrated knowledge space.
1 INTRODUCTION
In the last two years, we have interviewed eighteen
drug discovery scientists from several
pharmaceutical companies and research institutes in
Europe and North America to better understand their
research tasks and information needs. These
scientists are responsible for identifying new
chemical compounds that have therapeutic purposes.
They spend between 20 and 90 percent of their time
reading scientific articles that might be pertinent to
their projects. In some cases, they scan over 900
abstracts and read 200 articles in a month just to
keep up. This translates to 45 abstracts and 10 full
articles a day – a very time-consuming activity.
Until very recently, their primary external source
of information was MEDLINE (Katcher, 1999),
which contains over twelve million bibliographic
citations and abstracts from articles published in
over 4,600 biomedical journals. However, with recent
advances in computational molecular biology fields
such as genomics and proteomics,
these scientists find an increasing number of new,
more structured information sources indispensable
(Elmasri & Navathe, 1999; National Library of
Medicine, 2003). Some of these sources include
GenBank (gene sequence), KEGG (biological
pathways) and OMIM (genetic disorders).
Furthermore, as pharmaceutical companies introduce
their own corporate intranets, highly valuable
internal information such as project reports, lab
notes and screening results becomes available
enterprise-wide and may be relevant to these
scientists.
As a result, these scientists have to face dozens
of information sources, each with its own intricacies,
access methods and nomenclature. Even a simple
question such as “Is there any internal expert on T4
Polynucleotide Kinase?” can pose significant
challenges as no single source may contain the
answer. The answer may have to be constructed by
examining the authors of various articles, reports
and patent documents that are indirectly related to
this particular kinase, such as through its proteins
and biological pathways. While doing this is
certainly possible, it could be very time-consuming
as it requires access to multiple heterogeneous
internal and external sources.
Existing systems such as GeneLynx (Lambrix &
Jakoniene, 2003), DiscoveryLink (Haas et al., 2001)
and SRS (Etzold, Ulyanov & Argos, 1996) view this
as a distributed database problem and approach it by
integrating the underlying schemas of the sources
into a global schema and by providing a query
language over that schema. Few of them deal
with the nomenclature issues and the fact that
scientists may pose questions at a different level of
abstraction than what is available in the underlying
sources (Geffner et al., 1999).
Developing a system that appeals to these
scientists is further complicated by their limited
computer skills, which make direct exposure to a
formal query language impractical. Form-based user
interfaces are also ineffective, as users with diverse
backgrounds fill out the forms differently.
In this paper, we introduce our approach to
addressing these issues and the Knowledge
Discovery Tool, an application that embodies our
answers. Our approach is based on the creation of a
semantic index to information contained in the
underlying heterogeneous sources. The basis for this
index is a knowledge model that relates all major
concepts in the domain. Domain- and application-
specific rules utilize this index to infer relationships
of potential value to researchers. The tool’s
visualization framework enables intelligent
browsing to support research and investigation tasks
by providing the ability to uncover indirect or
hidden linkages among pieces of information.
2 KNOWLEDGE MODELING
Central to our approach is a technique for modeling
the kind of knowledge a scientist needs to perform
his or her job (Brody et al., 1999). This model
contains a representation of biomedical concepts:
entities and the relationships between them. This
model is an ontology designed specifically for
scientists performing a predefined set of tasks.
The model, shown in Figure 1, contains entities
and relationships that depict treatments for a disease,
how those treatments are related to a set of drugs,
the chemical compounds that comprise those drugs,
how the compounds are related to various target
proteins, and so on. This knowledge model is a
representation of how the scientists think about the
information they need in order to perform their
tasks.
Figure 1: A knowledge model to support drug discovery scientists.
The model enhances the accessibility of
knowledge in three major ways. First, it creates a
layer that is independent of the location of the
underlying information. Second, the instantiated
model allows the user to search and browse the
entire body of knowledge as one homogeneous
space of related entities while maintaining links back
to the original sources.
Third, when used with classification hierarchies,
the model becomes a powerful abstraction
mechanism (Geffner et al., 1999). For example,
when a scientist inquires about the role of G-protein
in Central Nervous System (CNS) diseases, a
straight text search might not reveal anything as the
underlying sources may not contain the phrase
“Central Nervous System” or its synonyms.
However, these sources may contain references to
the relationship of G-protein and Parkinson’s or
Alzheimer’s diseases. To answer the above question,
the system must utilize a disease classification
hierarchy such as the one in Medical Subject
Headings (Lowe & Barnett, 1994), which classifies
Parkinson’s and Alzheimer’s diseases under CNS
diseases. As a result, the system can exploit these
linkages and provide information about the
relationship between G-protein and higher level
diseases.
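A minimal Java sketch of this hierarchy-based query expansion is shown below; the two-level hierarchy, the index contents, and all class and method names are illustrative assumptions, not KDT’s actual implementation.

```java
import java.util.*;

// Sketch of hierarchy-based query expansion: a question about a high-level
// disease category is answered via links recorded against its sub-diseases.
// The hierarchy, the index contents, and all names are illustrative.
public class HierarchyExpansion {
    // parent category -> direct sub-diseases, as in a MeSH-style hierarchy
    static Map<String, List<String>> children = Map.of(
        "CNS Diseases", List.of("Parkinson's Disease", "Alzheimer's Disease"));

    // links recorded in the semantic index: entity -> related entities
    static Map<String, List<String>> links = Map.of(
        "Parkinson's Disease", List.of("G-protein"));

    // Collect the category itself plus all of its descendants.
    static Set<String> expand(String disease) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(disease));
        while (!queue.isEmpty()) {
            String d = queue.poll();
            if (result.add(d)) queue.addAll(children.getOrDefault(d, List.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The CNS query succeeds through the Parkinson's link.
        for (String d : expand("CNS Diseases"))
            for (String related : links.getOrDefault(d, List.of()))
                System.out.println(d + " -> " + related);
    }
}
```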
In addition to structural relationships, our
knowledge model contains dozens of domain-
specific functional rules such as “Any person who
has authored more than X documents on a protein is
an expert in that protein.” Such rules constrain and
guide the automatic inference process.
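As an illustration, the expertise rule above could be encoded roughly as follows; the Authorship record, the threshold value, and all other names are our own assumptions rather than KDT’s actual rule engine.

```java
import java.util.*;

// Sketch of the expertise rule: "any person who has authored more than X
// documents on a protein is an expert in that protein." All names and the
// threshold are illustrative assumptions.
public class ExpertRule {
    record Authorship(String person, String protein) {}

    static final int X = 5; // threshold; the paper leaves X unspecified

    static Set<String> expertsOn(String protein, List<Authorship> index) {
        Map<String, Integer> counts = new HashMap<>();
        for (Authorship a : index)
            if (a.protein().equals(protein))
                counts.merge(a.person(), 1, Integer::sum);
        Set<String> experts = new TreeSet<>();
        counts.forEach((person, n) -> { if (n > X) experts.add(person); });
        return experts;
    }

    public static void main(String[] args) {
        List<Authorship> index = new ArrayList<>();
        for (int i = 0; i < 6; i++)
            index.add(new Authorship("A. Smith", "T4 Polynucleotide Kinase"));
        System.out.println(expertsOn("T4 Polynucleotide Kinase", index)); // [A. Smith]
    }
}
```

A rule like this, run over the semantic index, is what would let the tool answer the “internal expert on T4 Polynucleotide Kinase” question from the introduction.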
3 KNOWLEDGE DISCOVERY TOOL
When the model is instantiated with existing data
sources as described in section 4, it becomes a giant
semantic index into the underlying sources and can
support a variety of applications. One could easily
build an SQL interface, for example, to query the
index. However, given that our target users have
little or no SQL or information retrieval skills, we
built a web-based intelligent browser called the
Knowledge Discovery Tool (KDT) that uses our
semantic index and the model’s inference rules to
select and present the most likely relationships
among the data of interest to the user.
To demonstrate some of the KDT features,
consider the following interaction: A drug discovery
scientist who specializes in Oncology is researching
a potential link between Parkinson’s disease and
cancer. She needs to bring herself up to speed on
Parkinson’s disease in a short time, and she needs to
identify experts in the area and any past work that
might be relevant.
She starts up KDT, selects the search type
“Disease” and types “Parkinson’s” into the search
box. KDT then displays all diseases whose names
include “Parkinson’s”. She clicks on “Parkinson’s
disease” and is brought to the Landscape View for
that disease (Figure 2). The view divides the screen
into 9 (3x3) panes. The center pane shows her
current focus (Parkinson’s Disease). Each outer pane
displays information related to the current focus as
filtered by the knowledge model. She sees recent
literature related to the disease in the top left pane, a
list of experts on the disease in the top center pane,
and a list of organizations that have published
articles or own patents related to the disease in the
top right pane. The center row contains a list of
related chemical compounds; the disease itself; and a
list of biological target classes used for treatments of
the disease.
Figure 2: Landscape View of “Parkinson’s disease”.
The researcher can see linkages among the items
in the panes, indicated by light grey lines that are
visible without cluttering the screen. The lines let
users visually maintain the connectivity among
entities, which seems to be intuitive to many users.
Studies have shown that exposing a large number of
relationships stimulates fresh thoughts and breaks
through creative blocks (Shneiderman, 2000).
The view facilitates fast browsing by minimizing
the need for mouse clicking. For example, if the user
hovers the mouse over an item, a tool-tip pops up
displaying the item’s full title and attributes, and the
item’s links are highlighted. Since each item
represents an index to the underlying source, the
user can view the source by double-clicking on the
item.
Single-clicking on an item re-orients the
knowledge model, moves the item to the center pane,
and refreshes the contents of the outer panes
accordingly. Figure 3a shows how the view looks
after the user has clicked on PRKN, the third gene
(locus) in the bottom middle pane of Figure 2.
Figure 3: (a) Landscape view of gene “PRKN”; (b) a window showing a path that substantiates the link between PRKN and UBE2L1.
Each outer pane of the browser presents the
results of a query run against the underlying
semantic index, filtered and prioritized according to
the rules in the knowledge model. The user can
easily customize the view in each pane by picking
from a large library of predefined views. For
example, the user can change the list of disease
experts shown in the top center pane of Figure 3a to
show only those people who have published about
PRKN in the past three years, and sort the list by
their most recent publication date. A user who is an
oncologist can tailor a pane to display only genes
that relate to PRKN and cancer-related diseases.
KDT provides utilities to explore potential
linkages among entities that are depicted as grey
lines. For example, one of the items in the left center
pane in Figure 3a, UBE2L1, is a gene that has been
associated with several forms of cancer including
Leukemia and Breast Cancer (Ardley et al., 2000).
The link between PRKN and UBE2L1 is drawn as a
dashed line indicating there are multiple degrees of
separation between them. Using the right-click
menu over the line, the user can ask the tool to
show how this link was derived. The tool found two
paths that substantiate the link. Figure 3b shows one
of them: a path with six items as follows. Gene
PRKN has been renamed to PARK2 and this gene
produces the Parkin protein. Parkin is linked to
another protein UbcR7 through two articles. The
bottom half of figure 3b shows one of the articles
that describes an interaction between Parkin and
UbcR7 in rat brain (Wang et al., 2001). UbcR7 in
turn, is produced by the gene UBE2L1. Deriving links
in this manner, exposing hidden or indirect links that
users might not have thought of, is a powerful
capability (Shneiderman, 2000).
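One plausible way to derive such substantiating paths is a breadth-first search over the semantic index. The sketch below rebuilds the chain between PRKN and UBE2L1 from the text as a toy graph; the graph representation and method names are our assumptions, not KDT’s actual algorithm.

```java
import java.util.*;

// Sketch of link substantiation: a breadth-first search over the semantic
// index for a shortest path between two entities. The toy graph rebuilds
// the chain described in the text; names are illustrative.
public class LinkPaths {
    static Map<String, List<String>> graph = new HashMap<>();

    static void link(String a, String b) {
        graph.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        graph.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    static List<String> shortestPath(String from, String to) {
        Map<String, String> parent = new HashMap<>();
        parent.put(from, null);
        Deque<String> queue = new ArrayDeque<>(List.of(from));
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (node.equals(to)) break;
            for (String next : graph.getOrDefault(node, List.of()))
                if (!parent.containsKey(next)) { parent.put(next, node); queue.add(next); }
        }
        if (!parent.containsKey(to)) return List.of(); // no connecting path
        LinkedList<String> path = new LinkedList<>();
        for (String n = to; n != null; n = parent.get(n)) path.addFirst(n);
        return path;
    }

    public static void main(String[] args) {
        link("PRKN", "PARK2");   // renamed-to
        link("PARK2", "Parkin"); // gene produces protein
        link("Parkin", "UbcR7"); // interaction reported in two articles
        link("UbcR7", "UBE2L1"); // protein produced by gene
        System.out.println(shortestPath("PRKN", "UBE2L1"));
    }
}
```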
KDT also serves as a powerful collaboration
tool. By annotating items or links, users can
informally share their opinions and enrich the
existing contents. Furthermore, by continuously
monitoring common bookmarks and users’
navigation paths, the tool can also match users with
similar interests.
Because the index is continuously updated, the
researchers can use KDT to monitor new items and
links related to a topic of interest. We also
developed wizards, similar to those used for setting
up printers in Microsoft Windows, to automate the
repetitive tasks that users frequently perform with
KDT.
4 INSTANTIATING THE MODEL
The process used to instantiate the model is very
similar to the one used for data warehousing
(Fayyad et al., 1996). The data from the selected
data sources is extracted and transformed into its
relational form. It is then cleansed of errors using a
semi-automated approach. Thesauri are created in
the process. The cleansed data is then integrated. This
process is described in more detail below.
4.1 Data Selection, Mapping and
Extraction
The 13 sources used in the current implementation
of KDT are: Enzyme, GeneOntology, NCBI
Genome, Interpro, KEGG, LocusLink, MeSH, NLM
Taxonomy, OMIM, MEDLINE, NCBI RefSeq,
Swissprot, and Unigene.
First, data fields in each source are mapped onto
the attributes of entities in our model. For example,
NLMTaxonomy’s TaxID, LocusLink’s OrganismID
and RefSeq’s OrganismID are all mapped to the
OrganismID in the model. We map all attributes of
all entities and relationships defined in our model.
As the above example indicates, most attributes in
the knowledge model are instantiated from multiple
sources. This, of course, generates conflicts; these
are addressed in section 4.3.
Second, we create a representation of each
source in a local database. Once the information is in
the database, some of the fields have to be parsed
further. For example, the pathway description field
from KEGG may contain a piece of text like
“Glycolysis/Gluconeogenesis – Aquifex aeolicus”
which would be parsed into two subfields:
“Glycolysis/Gluconeogenesis” as the pathway’s title
and “Aquifex aeolicus” as the associated organism
name. This secondary parsing could also result in
building new hierarchies. For example, by parsing
ExPASy’s EnzymeID, one can determine that
the protein family with EnzymeID “EC.2.7.1.12”
is a child of EnzymeID “EC.2.7.1”. We use dozens
of rules to guide data extraction.
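Two of those extraction rules could be sketched in Java as follows; the method names and the exact delimiter handling are illustrative assumptions, not the actual extraction code.

```java
// Sketch of two secondary-parsing rules from this section; the method names
// and the exact delimiter handling are illustrative assumptions.
public class SecondaryParsing {
    // "Glycolysis/Gluconeogenesis - Aquifex aeolicus" -> title + organism name
    static String[] parseKeggDescription(String field) {
        String[] parts = field.split(" [–-] ", 2); // assumes "title <dash> organism"
        return new String[] { parts[0].trim(),
                              parts.length > 1 ? parts[1].trim() : "" };
    }

    // "EC.2.7.1.12" -> parent protein family "EC.2.7.1"
    static String parentEnzymeId(String enzymeId) {
        int lastDot = enzymeId.lastIndexOf('.');
        return lastDot > 0 ? enzymeId.substring(0, lastDot) : enzymeId;
    }

    public static void main(String[] args) {
        String[] p = parseKeggDescription("Glycolysis/Gluconeogenesis - Aquifex aeolicus");
        System.out.println("title=" + p[0] + ", organism=" + p[1]);
        System.out.println("parent of EC.2.7.1.12: " + parentEnzymeId("EC.2.7.1.12"));
    }
}
```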
4.2 Schema Integration
This phase reconciles diverse schemas in the local
database with the schema as defined in the
knowledge model. Lenzerini (2002) describes this as
the local-as-view (LAV) process. It is done in two
steps. The first step handles the fact that a data
source could be mapped into multiple entities in the
knowledge model. This is accomplished by creating
multiple database views. The second step handles
ICEIS 2004 - HUMAN-COMPUTER INTERACTION
304
the fact that an attribute of an entity can be
assembled from multiple sources. Thus we need to
combine multiple database views from step one.
This is accomplished by writing SQL scripts to
insert the contents from the appropriate views to the
target table for each entity.
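The two steps could look roughly like the following JDBC sketch; the staging table, view, and column names (locuslink, gene_from_locuslink, gene) are invented for illustration, and the connection URL is a placeholder for the local staging database.

```java
import java.sql.*;

// Rough sketch of the two schema-integration steps, run as SQL over JDBC.
// All table, view, and column names are invented; the connection URL is a
// placeholder for the local staging database.
public class SchemaIntegration {
    public static void main(String[] args) throws SQLException {
        try (Connection db = DriverManager.getConnection("jdbc:<local-db-url>");
             Statement s = db.createStatement()) {
            // Step 1: one view per (source, model entity) pair.
            s.execute("CREATE VIEW gene_from_locuslink AS "
                    + "SELECT LocusID AS gene_id, OrganismID AS organism_id "
                    + "FROM locuslink");
            // Step 2: assemble the entity table from all contributing views.
            s.execute("INSERT INTO gene (gene_id, organism_id) "
                    + "SELECT gene_id, organism_id FROM gene_from_locuslink");
        }
    }
}
```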
4.3 Instance Integration
Schema integration produces a large number of
redundant instances for each entity primarily due to
the fact that different sources use different
nomenclatures or provide different sets of attributes.
The objective of instance integration is to remove
conflicts and merge redundant data of an instance by
comparing one or more of its attributes.
We start instance integration by identifying
redundant records. We employ the vector space
model (Fasulo, 1999) to determine the similarity
among attributes through the use of two broad
classes of heuristics: ID-based and text-based. In ID-
based heuristics, we assume that there are one or
more IDs that uniquely identify an instance. While
this is straightforward in most cases, the
process is complicated by the fact that some of these
IDs may not be consistent.
The heuristics to resolve ID inconsistencies were
developed manually by domain experts after visually
inspecting the origins of the inconsistencies. For
example, we found that the combination of
SwissprotID and OrganismID can be used to
uniquely identify a gene for our purposes. This class
of heuristics is applicable to genes, proteins,
pathways, protein families and genomes, where the
use of IDs is quite pervasive.
The text-based heuristics are applicable to
organisms, phenotypes, organizations and people
where IDs are not readily available. We employ
many techniques and heuristics to resolve the name
similarity problem. Many of these techniques are
also used in WHIRL (Cohen, 2000). They range
from simple removal of punctuation (e.g.,
“Legionnaires' Disease” vs. “Legionnaires
Disease”), to comparing last name and first initial
(e.g., “A. aeolicus” vs. “Aquifex aeolicus”), to using
dictionaries and synonyms to match “zebrafish” to
“zebra danio”.
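As a concrete, hedged example of the text-based heuristics, the sketch below normalizes names (punctuation removal, synonym substitution) and compares them as token vectors with cosine similarity; the normalization pipeline and the tiny synonym table are simplified assumptions, not the WHIRL or KDT implementation.

```java
import java.util.*;

// Sketch of a text-based matching heuristic: normalize names, then compare
// them as token vectors with cosine similarity. All data is illustrative.
public class NameSimilarity {
    static Map<String, String> synonyms = Map.of("zebra danio", "zebrafish");

    static List<String> tokens(String name) {
        String lower = name.toLowerCase();
        String normalized = synonyms.getOrDefault(lower, lower)
                                    .replaceAll("[\\p{Punct}]", "");
        return List.of(normalized.split("\\s+"));
    }

    static double cosine(String a, String b) {
        Map<String, Integer> va = count(tokens(a)), vb = count(tokens(b));
        double dot = 0;
        for (var e : va.entrySet()) dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
        return dot / (norm(va) * norm(vb));
    }

    static Map<String, Integer> count(List<String> ts) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : ts) v.merge(t, 1, Integer::sum);
        return v;
    }

    static double norm(Map<String, Integer> v) {
        double s = 0;
        for (int n : v.values()) s += (double) n * n;
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // Punctuation variants and synonyms both score as exact matches.
        System.out.println(cosine("Legionnaires' Disease", "Legionnaires Disease")); // 1.0
        System.out.println(cosine("zebrafish", "Zebra danio"));                      // 1.0
    }
}
```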
Our synonym tables are created in three ways.
First, there are several sources such as
NLMTaxonomy that contain synonym information
explicitly and we simply import them. Second, some
sources contain implicit synonym information that
we have to extract. For example, Swissprot’s protein
name may contain text like “Alzheimer's disease
amyloid A4 protein precursor (Fragment) (Protease
nexin-II) (PN-II) (APPI)”. From this piece of text we
extract “PN-II” and “APPI” as the synonyms of
“Protease nexin-II”. The third way is described in
conflict resolution below.
In conflict resolution, we tag each attribute with the
source from which the value was extracted. Each
source is associated with a confidence value between
0 and 10 by a domain expert. An attribute with a
higher confidence value can overwrite those with
lower confidence values. However, for the attributes
that signify the names of an object, we mark them as
potentially synonymous instead of overwriting them.
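A minimal sketch of this conflict-resolution logic might look as follows; the confidence numbers, sources, and attribute names are invented for illustration.

```java
import java.util.*;

// Sketch of confidence-based conflict resolution: the highest-confidence
// value wins, but losing name variants are kept as synonym candidates
// rather than discarded. All data below is invented for illustration.
public class ConflictResolution {
    record TaggedValue(String value, String source, int confidence) {}

    static final Set<String> NAME_ATTRIBUTES = Set.of("name");

    static TaggedValue resolve(String attribute, List<TaggedValue> candidates,
                               List<String> synonymsOut) {
        TaggedValue best = Collections.max(candidates,
                Comparator.comparingInt(TaggedValue::confidence));
        if (NAME_ATTRIBUTES.contains(attribute))
            for (TaggedValue t : candidates)
                if (!t.value().equals(best.value())) synonymsOut.add(t.value());
        return best;
    }

    public static void main(String[] args) {
        List<String> synonyms = new ArrayList<>();
        TaggedValue winner = resolve("name",
                List.of(new TaggedValue("Protease nexin-II", "Swissprot", 9),
                        new TaggedValue("PN-II", "LocusLink", 6)),
                synonyms);
        System.out.println(winner.value() + "; synonyms: " + synonyms);
    }
}
```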
The merge step involves a domain expert, who
confirms that the identified redundant records are
indeed redundant and that the suggested merged
record is correct. We developed a small application
called Thesaurus Builder to assist a domain expert in
this task. This application also allows the domain
expert to manually identify instances to merge.
While this manual step is time-consuming, it is
important to have an expert validate it, as
the quality of the entire integration result is highly
dependent on it. The decisions made by the domain
expert are captured so that they can be automatically
re-applied in the future.
4.4 Post-Integration
While published articles in MEDLINE form one of
the richest sources of biomedical information to
date, automatically extracting semantic information
from them is hard (Jacquemin, 2001). Currently, we
use these articles to create weak unlabelled links
between the entities in the knowledge model through
the co-occurrence of terms in the articles. The more
articles that link two entities, the stronger the
link. The strength value is then used to rank and
sort the query results.
Another factor in determining the strength value
is the length of inferred relationships. For example, a
gene can be related to a disease through its proteins
and variants. As each intermediate step introduces
further uncertainty, we assign lower strength to
relationships inferred through longer paths.
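Putting the two factors together, one possible strength function is sketched below; the multiplicative form and the halving per extra hop are our assumptions, as the paper does not give the exact formula.

```java
// Sketch of a possible strength function for inferred links: strength grows
// with co-occurrence support and decays with path length. The formula and
// the 0.5 decay factor are assumptions, not the paper's actual weighting.
public class LinkStrength {
    static double strength(int supportingArticles, int pathLength) {
        double decay = Math.pow(0.5, pathLength - 1); // halve per extra hop
        return supportingArticles * decay;
    }

    public static void main(String[] args) {
        // A direct co-occurrence link outranks an equally supported two-hop link.
        System.out.println(strength(10, 1)); // 10.0
        System.out.println(strength(10, 2)); // 5.0
    }
}
```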
5 OUTCOME AND BENEFIT
As shown in section 3, a KDT user is able to explore
thematically related information across multiple data
sources as if it were a single richly connected
knowledge space. The user does not have to be
concerned with the details of the underlying
databases, their formats, or their access methods.
Exploring a potential link between two entities,
as shown in Figure 3b, takes minutes with KDT but
would require hours or perhaps days using the
access methods provided by the 13 knowledge
sources we cover.
We developed the components used to
instantiate the knowledge model using Microsoft-
based technologies including Microsoft Windows
2000 Server and Microsoft SQL Server. KDT was
written entirely in Java as a Java applet. This eases
the deployment of the tool as it can run in most
Internet browsers that support Java.
From the thirteen sources we have selected, the
instantiated knowledge model contains over 1.1
million genes, 1.6 million proteins, 200,000
organisms, 12,000 pathways, 6,000 diseases, 12
million articles and over 4 billion relationships
across these biological entities. Prior to the post-
integration phase, the instantiated knowledge model
is about 70GB in size. Once the indices and pre-
calculated relationships are created, the size of the
database grows to 400GB.
While we have not conducted a formal
evaluation of KDT’s effectiveness as compared to
the current access methods, biomedical researchers
who have tried KDT have given us very positive
feedback and expressed a strong desire to use the
tool in their daily activities. In one particularly
gratifying case, the senior leader of a heart failure
research project tried it during a ten-minute
demonstration and discovered a valuable
relationship between heart failure and leukemia of
which he had not been aware. In another case, a
post-doctoral biologist discovered a new link
between Lou Gehrig’s disease and genital diseases,
suggesting that a drug for one disease could be
modified to treat the other.
Currently, we are working with several
pharmaceutical companies, the University of
Colorado Health Sciences Center in Denver and the
Integrative Neuroscience Initiative on Alcoholism –
an initiative sponsored by the National Institute of
Alcohol Abuse and Alcoholism – to formally
evaluate the effectiveness of KDT. Over 30
people have signed up for our initial pilot. We
expect the formal evaluation to be completed in
Spring 2004.
REFERENCES
Ardley, H.C., Moynihan, T.P., Markham, A.F. &
Robinson, P.A., 2000. ‘Promoter analysis of the human
ubiquitin-conjugating enzyme gene family UBE2L1-4,
including UBE2L3 which encodes UbcH7’, Biochim.
Biophys. Acta, vol. 1491, no. 1-3, pp. 57-64.
Brody, A.B., Dempski, K.L., Kaplan, J.E., Kurth, S.W.,
Liongosari, E.S. & Swaminathan, K. S., 1999.
‘Integrating Disparate Knowledge Sources’, Proc.
Second Int. Conf. on Practical Application of
Knowledge Management, pp. 77-82.
Cohen, W.W., 2000. ‘Data integration using similarity
joins and a word-based information representation
language’, ACM Trans. Info. Systems, vol. 18, no. 3,
pp. 288-321.
Elmasri, R. & Navathe, S., 1999. ‘Genome Data
Management’, in Fundamentals of Database Systems,
Pearson Addison Wesley, 3rd edition, pp. 898-905.
Etzold, T., Ulyanov, A. & Argos, P., 1996. ‘SRS:
information retrieval system for molecular biology
data banks’, Methods in Enzymology, vol. 266, pp.
114-128.
Fasulo, D., 1999. An analysis of recent work on
clustering algorithms, Technical Report #01-03-02,
Dept. of Computer Science and Eng., U. of
Washington, Seattle.
Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P., 1996.
‘The KDD Process for Extracting Useful Knowledge
from Volumes of Data’, Comm. ACM, vol. 39, no. 11,
pp. 27-34.
Geffner, S., Agrawal, D., El Abbadi, A., & Smith, T.,
1999. ‘Browsing large digital library collections using
classification hierarchies’, Proc. Eighth Int. Conf. on
Info. and Knowledge Management, pp. 195-201.
Haas, L., Schwarz, P., Kodali, P., Kotlar, E., Rice, J. &
Swope, W., 2001. ‘DiscoveryLink: A system for
integrated access to life sciences data sources’, IBM
Systems Journal, vol. 40, no. 2, pp. 489-511.
Jacquemin, C., 2001. Spotting and discovering terms
through NLP, MIT Press, Cambridge, MA.
Katcher, B.S., 1999. MEDLINE: A Guide to Effective
Searching, Ashbury Press, San Francisco, CA.
Lambrix, P. & Jakoniene, V., 2003. ‘Towards transparent
access to multiple biological databanks’, Proc. First
Asia-Pacific Bioinformatics Conf., vol. 19, pp. 53-60.
Lenzerini, M., 2002. ‘Data Integration: A Theoretical
Perspective’, Proc. 21st ACM Symp. on Principles of
Database Systems, pp. 233-246.
Lowe, H. & Barnett, G., 1994. ‘Understanding and using
the medical subject headings (MeSH) vocabulary to
perform literature searches’, JAMA, vol. 271, pp.
1103-1108.
National Library of Medicine, 2003 (updated 7 Feb 2003).
Growth of GenBank. Retrieved 28 Jan 2004 from
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Shneiderman, B., 2000. ‘Creating Creativity: User
Interfaces for Supporting Innovation’, ACM Trans. on
Computer-Human Inter., vol. 7, no. 1, pp. 114-138.
Wang, M., Suzuki, T., Kitada, T., Asakawa, S.,
Minoshima, S., Shimizu, N., Tanaka, K., Mizuno, Y.
& Hattori, N., 2001. ‘Developmental changes in the
expression of parkin and UbcR7, a parkin-interacting
and ubiquitin-conjugating enzyme, in rat brain’, J.
Neurochemistry, vol. 77, no. 6, pp. 1561-1568.