OPTIMIZED DATA MIGRATION WITHIN A MEDICAL GRID
Jared Christopherson and Chun-Hsi Huang
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, U.S.A.
Keywords: Caching, Database Optimization, HIV Databases, Data Grid.
Abstract: This paper focuses on creating an intelligent, scalable system that greatly improves the speed and efficiency of looking up medical data. The system automatically and meaningfully organizes distributed medical data to allow the fastest possible access. Additionally, this research seeks to improve on the concept of a distributed database service by introducing caching across servers as a means to reduce data retrieval time. Instead of querying many individual sources, researchers would be able to access data from a single source, optimized on a per-region basis to ensure the shortest access times.
1 INTRODUCTION
The focal point of this research is to connect
multiple databases across a grid in such a way that
they appear as a single data repository while
optimizing data flow on the back end. More
specifically, the goal in this case deals with
simplifying the process of acquiring data for health
researchers. For example, when a researcher needs
to find data on cancer statistics or HIV drug
resistance, that researcher needs to spend time
connecting individually to many different data
sources and then must manually compile the data.
The possible data sources include information from
hospitals, other researchers’ laboratories, and any
institution hosting related databases. This process is
clearly time-consuming and would benefit from a
centralized location to acquire all the research data.
This project will provide a web application for
researchers to search a collection of databases from
participating institutions and present the data as if
they came from a single, specific source. Our initial research targets the HIV sequence, resistance, immunology and vaccine trials databases (www.hiv.lanl.gov/content/index).
Back-end. The initial problem with linking databases is that each source usually stores its data in a completely different fashion. Related technical articles may be found in (Huang et al., 2005).
Though some of the basic data supplied will be the
same, table names and field names within each
database will almost all be different – there is no
published standard for how to store this type of data,
so each individual site decides their preferred
method. Getting cooperation from each hospital to
restructure their database to a standard format would
be impractical. To solve this problem, this project
introduces the concept of Master Templates, which
form the basis for the entire interface. Refer to articles included in (Huang et al., 2008).
Master Templates. Each linked database is given a unique ID, and its login information is stored in the administrative database. A Master Template
keeps each database’s ID as a reference and maps
the desired fields from that particular database into a
“virtual” field created by the administrator. For
example, across several databases, a field with data
about the researcher might be known as source, site,
source_data, etc. Additionally, these fields are likely
in tables with all different names, such as hiv_data,
gene_info, etc. This makes it very difficult for a grid
system to link the data unless it has the information
on how the fields should be linked. The
administrator can simply choose to call this field in
the master template “Source” and then regardless of
all the different table and field names, the
information will be correctly linked and displayed to
the user. Thus, the grid-level administrator is
responsible for appropriately linking the database
fields as each new database is added to the system,
but this is a one-time process that saves a huge
amount of time for researchers in the future. Another
benefit of Master Templates is that different
templates can be provided to focus on different
research subjects. The idea is that some medical
databases will contain information that pertains to
more than one research area, so a Master Template
would help organize the data into the different
subjects. In this manner, multiple templates could
be set up for a single database source to pull
information pertaining to cancer, HIV, drug
interactions, etc.
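To make the mapping concrete, the following is a minimal sketch of how a Master Template might be represented, with each virtual field mapped to the table and column holding the equivalent data in each linked database; all database IDs, table names, and field names here are hypothetical:

```python
# Minimal sketch of a Master Template: each virtual field maps to the
# table/column that holds the equivalent data in each linked database.
# All database IDs, table names, and field names are illustrative.
MASTER_TEMPLATE_HIV = {
    "Source": {
        "db1": ("hiv_data", "source"),
        "db2": ("gene_info", "site"),
        "db3": ("sequences", "source_data"),
    },
    "Subtype": {
        "db1": ("hiv_data", "subtype"),
        "db2": ("gene_info", "hiv_subtype"),
        "db3": ("sequences", "clade"),
    },
}

def build_select(template, db_id):
    """Build the per-database SELECT that realizes the virtual fields."""
    clauses, tables = [], set()
    for virtual_field, mappings in template.items():
        if db_id in mappings:
            table, column = mappings[db_id]
            clauses.append(f"{table}.{column} AS `{virtual_field}`")
            tables.add(table)
    return f"SELECT {', '.join(clauses)} FROM {', '.join(sorted(tables))}"

print(build_select(MASTER_TEMPLATE_HIV, "db2"))
# SELECT gene_info.site AS `Source`, gene_info.hiv_subtype AS `Subtype` FROM gene_info
```

In practice, the grid-level administrator would populate such a structure as each new database is registered with the system.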

Display Templates. This project also introduces the
Display Templates as a means to simplify data
retrieval. A Master Template would contain every
bit of information possible on a certain research
subject, which is undoubtedly more than necessary
for most users. Oftentimes research focuses just on
a specific area of interest rather than simply the
subject of “vaccine trials” as a whole. Thus, users
may choose Display Templates, which act as a
subset of Master Templates, to only retrieve needed
information. All of the extraneous information
would be ignored and users now have a convenient
way to get information from a wide variety of
sources pertaining only to his or her interests.
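Continuing the hypothetical structure above, a Display Template can be represented as nothing more than a named subset of a Master Template's virtual fields:

```python
# A Display Template is a named subset of a Master Template's virtual
# fields; the template name and field list below are illustrative.
DISPLAY_TEMPLATE_RESISTANCE = ["Source", "Subtype"]

def apply_display_template(master_template, display_fields):
    """Restrict a Master Template to only the fields the user needs."""
    return {f: master_template[f] for f in display_fields if f in master_template}
```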
Interface. The front end provides a basic search
system that returns records from all the databases
matching the search parameters. Users have the
ability to choose a Display Template to only focus
on the specific data needed. The user has the option
of viewing the datasets individually; otherwise the
program will attempt to compile the data into a
single table. The interfaces use AJAX, a JavaScript technique that allows a web page to update data without refreshing the entire page. Thus, AJAX allows individual database results to be pulled in as soon as they are returned, so that a user does not have to wait for results from slower servers before the page loads.
2 OPTIMIZATION
This research seeks to address the issue of data
optimization so that researchers have the fastest
access to data that they seek. Inter-region accesses to
high-resolution image data such as MRI or CT-Scan
images via a medical grid could incur a prolonged
response time. The process would be much slower
than if the data were stored on a more local server
with sufficient bandwidth. In order to solve this
problem, the web application tracks statistics on
usage patterns and decides where to move the data
so that it is best optimized for each grid-level data
user. This project investigates the following options:
Moving to a Central Server - This option is
impractical because it would require maintaining a
single ultra-powerful system with prohibitive
bandwidth costs. Additionally, this doesn’t solve the
problem of data access from different regions,
because people in regions outside of the location of
this main computer would still be at a disadvantage.
Moving Records around as they are Accessed - This
option could run into legal issues with the actual
deletion and moving of data between systems, since
it could put the original owner of the data at a
disadvantage in terms of access to the data.
Complete Caching - Clearly the perfect situation
would be complete caching, where every institution
has a complete cache of every other system.
Unfortunately, this would be impractical because it
assumes that every institution has the available space
and bandwidth to host all the records.
Caching based on Usage - The most realistic solution is to monitor usage patterns of each system and cache only the most highly requested databases; this is the compromise adopted in this work.
What to Cache? Every time a query is performed,
the system converts that user’s IP address into a
region ID to store statistics on searches from that
region. The program keeps track of the region ID as
well as the result count for each database that returns
results from the query. In this manner, the system
can build a list of the most highly accessed databases
for each region. It is important to point out that the
system is only concerned with databases that are
outside of the region from which the search is being
performed, since searches within the same region
should already be relatively fast and are therefore
considered already “optimized.”
When the caching script is run, the program
looks at each region individually and considers the
result count for each database outside of that region.
The script then converts each result count to a percentage of the region's overall external results and inserts the database into a Database Caching Queue in order of decreasing result percentage. Thus, when the process is
complete, every region has an ordered list of the
most heavily accessed databases outside of that
region.
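A minimal sketch of this bookkeeping follows, assuming an in-memory structure rather than the administrative database the real system would use, and hypothetical region and database identifiers:

```python
from collections import defaultdict

# region -> database -> cumulative result count (illustrative structure).
region_db_results = defaultdict(lambda: defaultdict(int))
# Home region of each database (hypothetical values).
db_region = {"DB1": "US", "DB3": "JAPAN", "DB8": "JAPAN"}

def record_query(user_region, results_per_db):
    """Called after each query; user_region comes from the IP-to-region lookup."""
    for db_id, count in results_per_db.items():
        region_db_results[user_region][db_id] += count

def build_caching_queue(region):
    """Order external databases by their share of the region's external results."""
    external = {db: n for db, n in region_db_results[region].items()
                if db_region.get(db) != region}
    total = sum(external.values()) or 1
    return sorted(((db, n / total) for db, n in external.items()),
                  key=lambda item: item[1], reverse=True)

record_query("US", {"DB1": 1229, "DB3": 856, "DB8": 1023})
print(build_caching_queue("US"))
# [('DB8', 0.544...), ('DB3', 0.455...)]  -- DB1 is local, so it is excluded
```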
Where to Cache? With the question of what data to cache answered, the remainder of the problem is deciding where to cache it. Several constraints govern this decision, and the program must abide by them while attempting to place data on the server that yields the fastest response time for the users who need it. The basic constraints are:
allow_cache – Not every hospital or institution will volunteer its server for data caching, so this binary setting takes a server out of the caching determination entirely.
supersite – This is a binary setting that gives special
preference to a particular server that may be
regarded as a more important institution or research
facility. A supersite has more data cached locally
and thus will have the fastest possible access.
bandwidth – This is a score of bandwidth available
for a particular server.
cache_size – This setting tells the system how much
space to allow for database and file caching.
All these settings are maintained by the grid-level administrator at the request of each institution. The variables above are ordered in terms of importance for caching consideration; thus allow_cache and supersite trump any initial consideration of bandwidth or cache size on the other servers. cache_size comes after bandwidth in importance because a server with a large amount of space but poor bandwidth would offer no speed improvement if data were placed on it.
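As a sketch of this ordering, eligible servers can be ranked lexicographically on (supersite, bandwidth, cache_size) after filtering on allow_cache; the Server structure here is an assumption that mirrors the settings described above:

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    allow_cache: bool  # server volunteers for caching at all
    supersite: bool    # preferred institution, cached most aggressively
    bandwidth: int     # relative bandwidth score
    cache_size: int    # space allowed for caching, e.g. in MB

def rank_cache_targets(servers):
    """Filter on allow_cache, then rank supersites first, then by
    bandwidth, then by cache_size, per the ordering described above."""
    eligible = [s for s in servers if s.allow_cache]
    return sorted(eligible,
                  key=lambda s: (s.supersite, s.bandwidth, s.cache_size),
                  reverse=True)
```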
Caching Script Conclusion. The script runs at an automatic interval determined by the system administrator, or may be run manually. The caching process continues for each region, with the program assigning data to servers with progressively lower bandwidth and cache_size scores until all the server space in that region is exhausted. Thus, when the process is complete, each region holds as many local copies of the most frequently requested databases as possible, and users should see a significant improvement in data retrieval speeds, especially when retrieving locally cached high-resolution image files. Finally, to avoid data consistency problems, all cached copies are read-only so that they always reflect the exact data on the original server.
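A greedy sketch of this allocation, continuing the hypothetical rank_cache_targets ranking above; db_size, mapping each database ID to its size, is an assumed input:

```python
def assign_caches(queue, servers, db_size):
    """Walk a region's Database Caching Queue (most-requested first) and
    place each database on the best-ranked server that still has room."""
    ranked = rank_cache_targets(servers)
    free = {s.name: s.cache_size for s in ranked}
    placements = {}
    for db_id, _pct in queue:
        for server in ranked:
            if free[server.name] >= db_size[db_id]:
                placements[db_id] = server.name
                free[server.name] -= db_size[db_id]
                break  # placed; move on to the next database in the queue
    return placements
```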
3 EXPERIMENTAL PROCEDURE
We initially ran a simulation locally and assigned
realistic values for data transfer speed. These values
reflected the slow transfer rate from/to servers that
were more distantly away. The values assumed the
originating site was within the US region and ranged
from transfer speeds of 450KB/s to a local server in
the US to 25KB/s to a server in the Japan region.
With the values in place, the script generated
random requests to each of the 9 different HIV
databases. The requests and transfer speeds from
each database are shown in Figure 1 as follows.
Database ID   Region   Transfer Speed from US   Number of Results   External Region Result Percentage
DB1           US       400KB/s                  1229                0.00%
DB2           US       450KB/s                  1105                0.00%
DB3           JAPAN    25KB/s                   856                 19.10%
DB4           FRANCE   75KB/s                   473                 10.55%
DB5           SPAIN    85KB/s                   764                 17.03%
DB6           US       380KB/s                  1439                0.00%
DB7           SPAIN    90KB/s                   620                 13.82%
DB8           JAPAN    30KB/s                   1023                22.81%
DB9           JAPAN    20KB/s                   749                 16.70%

Figure 1: Pre-caching transfer speeds and simulated result counts.
Figure 2 illustrates the average transfer speed
overall, average transfer speed to servers outside the
US region, and transfer speed to the single most
heavily trafficked database outside the US:
                                            Average Transfer Speed   Time to Transfer 3MB File
Overall System                              172.77KB/s               17.36 sec
External Regions                            54.17KB/s                55.38 sec
Most Heavily Trafficked External DB (DB8)   30KB/s                   100 sec

Figure 2: Speed averages and time indications for transferring large files.
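These aggregates follow directly from the per-database speeds in Figure 1; a short check, assuming 1MB = 1000KB for the 3MB file:

```python
# Reproduce the Figure 2 aggregates from the Figure 1 speeds (KB/s).
speeds = {"DB1": 400, "DB2": 450, "DB3": 25, "DB4": 75, "DB5": 85,
          "DB6": 380, "DB7": 90, "DB8": 30, "DB9": 20}
external = ["DB3", "DB4", "DB5", "DB7", "DB8", "DB9"]

overall = sum(speeds.values()) / len(speeds)                 # ~172.77 KB/s
ext_avg = sum(speeds[d] for d in external) / len(external)   # ~54.17 KB/s
for rate in (overall, ext_avg, speeds["DB8"]):
    print(f"{3000 / rate:.2f} sec")  # 17.36, 55.38, 100.00
```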
4 RESULTS AND CONCLUSIONS
The caching script ran as previously described,
determined the most heavily used databases in order
of usage, and cached them on the servers identified
by their IDs below. Note that for the purpose of the simulation, DB2 was ranked highest based on bandwidth and could cache one database, while DB1 and DB6 could cache three and two databases, respectively.
Database to be Cached   Caching Location
DB8                     DB2
DB3                     DB1
DB5                     DB1
DB9                     DB1
DB7                     DB6
DB4                     DB6

Figure 3: Database cache queue results.
The post-caching transfer speeds are shown below
with the same result pattern as before.
Database ID   Region         Transfer Speed from US   Number of Results   External Region Result Percentage
DB1           US             400KB/s                  1229                0.00%
DB2           US             450KB/s                  1105                0.00%
DB3           CACHED (DB1)   400KB/s                  856                 0.00%
DB4           CACHED (DB6)   380KB/s                  473                 0.00%
DB5           CACHED (DB1)   400KB/s                  764                 0.00%
DB6           US             380KB/s                  1439                0.00%
DB7           CACHED (DB6)   380KB/s                  620                 0.00%
DB8           CACHED (DB2)   450KB/s                  1023                0.00%
DB9           CACHED (DB1)   400KB/s                  749                 0.00%

Figure 4: Post-caching transfer speeds and simulated result counts.
Since there was enough space on the local servers to cache the external databases, the external region result percentages have all dropped to 0.00%. Examining the new system speeds with the same metrics as before gives the following:
                                            Average Transfer Speed   Time to Transfer 3MB File   Comparison to Pre-Cache Time
Overall System                              404.44KB/s               7.42 sec                    2.34x faster
External Regions                            401.67KB/s               7.47 sec                    7.41x faster
Most Heavily Trafficked External DB (DB8)   450KB/s                  6.67 sec                    15.00x faster

Figure 5: Post-caching speed averages and time indications for transferring large files.
Comparison of the results shows that the caching process in this simulation has greatly reduced data access times. On average, transfer speed is over twice as fast, and results from databases outside of the US are accessible over seven times faster. The most remarkable difference appears for the most heavily trafficked database outside of the US. Since this database received the most requests, it was clearly of significant research importance at the time the caching script ran. After caching is completed, researchers in this simulation would be able to pull data from this database fifteen times faster than before.
5 FUTURE WORK
Additional work on the current system spans several areas. The current system has been designed to connect only to MySQL databases. However, institutions participating in a medical grid VO (Virtual Organization) may use a wide variety of database technologies to maintain their local databases. To account for this, ongoing development adds a Database Abstraction Layer (DAL) to the Master Template system, allowing it to map to any supported database, such as MySQL, MS SQL, Access, etc.
The caching system may also be improved: currently it re-caches an entire database at whatever interval the caching script is set to run. While the grid-level administrator may manually tell the system to run the caching script, there is currently no way to detect when significant changes have been made to a source database, so users might not have access to all of the newest data until the end of the interval. With a script that tracks major changes, all caches could be updated automatically as changes are made, avoiding the bandwidth wasted re-caching unchanged databases.
Finally, as the system compiles data from many different sources, a user is likely to encounter overlapping copies of the same records. The system may be extended to automatically remove redundant records as they are presented.
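A minimal sketch of such a clean-up step, assuming each record is a dict of virtual-field values and that exact duplicates are the target:

```python
def dedupe(records):
    """Drop exact duplicate records while preserving order."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```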
REFERENCES
Huang, C.-H., Konagaya, A., Lanza, V. and Sloot, P., 2008. Biomedical Computations on the Grid. IEEE Transactions on Information Technology in Biomedicine, Vol. 12, No. 2, pp. 133-137.
Huang, C.-H., Lanza, V., Rajasekaran, S. and Dubitzky, W., 2005. HealthGrid – Bridging Life Science and Information Technology. Journal of Clinical Monitoring and Computing, Springer, Vol. 19, No. 4-5, pp. 259-262.