access methods provided by the 13 knowledge
sources we cover.
We developed the components used to
instantiate the knowledge model using Microsoft-
based technologies including Microsoft Windows
2000 Server and Microsoft SQL Database. KDT was
written entirely in Java and runs as an applet, which
eases deployment because the tool works in most
Internet browsers that support Java.
Drawing on the 13 sources we selected, the
instantiated knowledge model contains over 1.1
million genes, 1.6 million proteins, 200,000
organisms, 12,000 pathways, 6,000 diseases, 12
million articles and over 4 billion relationships
among these biological entities. Before the
post-integration phase, the instantiated knowledge
model is about 70 GB in size. Once the indices and
pre-calculated relationships are created, the
database grows to 400 GB.
While we have not conducted a formal
evaluation of KDT’s effectiveness compared with
the current access methods, biomedical researchers
who have tried KDT have given us very positive
feedback and expressed a strong desire to use the
tool in their daily activities. In one particularly
gratifying case, the senior leader of a heart failure
research project tried it during a ten-minute
demonstration and discovered a valuable
relationship between heart failure and leukemia of
which he had not been aware. In another case, a
post-doctoral biologist discovered a new link
between Lou Gehrig’s disease and genital diseases,
suggesting that a drug for one disease could be
modified for the treatment of the other.
Currently, we are working with several
pharmaceutical companies, the University of
Colorado Health Sciences Center in Denver and the
Integrative Neuroscience Initiative on Alcoholism
(an initiative sponsored by the National Institute on
Alcohol Abuse and Alcoholism) to formally
evaluate the effectiveness of KDT. More than 30
people have signed up for our initial pilot. We
expect the formal evaluation to be completed in
spring 2004.
ICEIS 2004 - HUMAN-COMPUTER INTERACTION