databases will contain information that pertains to
more than one research area, so a Master Template
would help organize the data into the different
subjects. In this manner, multiple templates could
be set up for a single database source to pull
information pertaining to cancer, HIV, drug
interactions, etc.
Display Templates. This project also introduces Display Templates as a means to simplify data
retrieval. A Master Template would contain every piece of information available on a given
research subject, which is undoubtedly more than most users need. Research often focuses on a
specific area of interest rather than on a broad subject such as “vaccine trials” as a whole.
Users may therefore choose Display Templates, which act as subsets of Master Templates, to
retrieve only the information they need. Extraneous information is ignored, giving users a
convenient way to gather information from a wide variety of sources that pertains only to
their interests.
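As a rough illustration of this relationship, the sketch below models a Display Template as a named subset of a Master Template's fields and projects each retrieved record down to that subset. The TypeScript shapes and field names are assumptions made for illustration, not the system's actual schema.

interface MasterTemplate {
  subject: string;      // e.g. "vaccine trials"
  fields: string[];     // every field the Master Template exposes for the subject
}

interface DisplayTemplate {
  name: string;
  master: MasterTemplate;
  fields: string[];     // the subset of master.fields the user actually needs
}

type DataRecord = { [field: string]: string };

// Keep only the fields named by the Display Template; extraneous data is dropped.
function applyDisplayTemplate(template: DisplayTemplate, record: DataRecord): DataRecord {
  const projected: DataRecord = {};
  for (const field of template.fields) {
    if (field in record) {
      projected[field] = record[field];
    }
  }
  return projected;
}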
Interface. The front end provides a basic search system that returns records matching the
search parameters from all of the databases. Users can choose a Display Template to focus
only on the specific data they need. The user has the option of viewing the datasets
individually; otherwise the program attempts to compile the data into a single table. The
interfaces use AJAX, a JavaScript technique that allows a web page to update data without
refreshing the entire page. AJAX thus allows individual database results to be pulled in as
soon as they are returned, so a user does not have to wait for results from slow servers
before the page loads.
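A minimal sketch of this incremental loading follows, written with the modern fetch API rather than the XMLHttpRequest calls typical of AJAX at the time; the endpoint URL, JSON response shape, and element id are assumptions, not the paper's actual interface. Each database is queried independently, and its rows are appended to the page as soon as they arrive, so one slow server never delays the others.

// Query every database concurrently and append each result set as it returns.
function searchAllDatabases(query: string, databases: string[]): void {
  const resultsDiv = document.getElementById("results")!;

  databases.forEach(async (db) => {
    const response = await fetch(
      `/search?db=${encodeURIComponent(db)}&q=${encodeURIComponent(query)}`
    );
    const rows: string[] = await response.json();

    // Append this database's rows without refreshing the rest of the page.
    const section = document.createElement("div");
    section.innerHTML =
      `<h3>${db}</h3><ul>${rows.map(r => `<li>${r}</li>`).join("")}</ul>`;
    resultsDiv.appendChild(section);
  });
}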
2 OPTIMIZATION
This research seeks to address the issue of data optimization so that researchers have the
fastest possible access to the data they seek. Inter-region accesses to high-resolution image
data, such as MRI or CT scan images, over a medical grid can incur a prolonged response time.
The process would be much slower than if the data were stored on a more local server with
sufficient bandwidth. To solve this problem, the web application tracks statistics on usage
patterns and decides where to move the data so that access is best optimized for each
grid-level data user. This project investigates the following options:
Moving to a Central Server - This option is impractical because it would require maintaining
a single, extremely powerful system with prohibitive bandwidth costs. It also fails to solve
the problem of data access from different regions, because users in regions far from this
central machine would still be at a disadvantage.
Moving Records around as they are Accessed - This option could run into legal issues with the
deletion and movement of data between systems, since it could put the original owner of the
data at a disadvantage in accessing it.
Complete Caching - The ideal situation would be complete caching, where every institution
holds a complete cache of every other system's data. Unfortunately, this is impractical
because it assumes that every institution has the space and bandwidth available to host all
the records.
Caching based on Usage - The most realistic solution is to monitor the usage patterns of each
system and cache only the most heavily requested databases. This approach provides the most
practical compromise to the problem.
What to Cache? Every time a query is performed, the system converts the user's IP address
into a region ID so that it can store statistics on searches from that region. The program
records the region ID along with the result count of each database that returns results for
the query. In this manner, the system builds a list of the most heavily accessed databases
for each region. Note that the system is concerned only with databases outside the region
from which the search is performed, since searches within the same region should already be
relatively fast and are therefore considered already “optimized.”
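The per-query bookkeeping this implies might look like the following sketch, in which ipToRegionId, databaseRegion, and the in-memory statistics map are assumed helpers rather than the system's real components.

type UsageStats = Map<string, Map<string, number>>; // regionId -> databaseId -> result count

const stats: UsageStats = new Map();

function recordQuery(
  clientIp: string,
  resultCounts: Map<string, number>,       // databaseId -> number of results returned
  databaseRegion: (db: string) => string,  // region each database resides in
  ipToRegionId: (ip: string) => string     // assumed IP-to-region lookup
): void {
  const region = ipToRegionId(clientIp);
  const regionStats = stats.get(region) ?? new Map<string, number>();

  for (const [db, count] of resultCounts) {
    // Same-region databases are already "optimized", so track only outside databases.
    if (databaseRegion(db) === region) continue;
    regionStats.set(db, (regionStats.get(db) ?? 0) + count);
  }
  stats.set(region, regionStats);
}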
When the caching script is run, the program examines each region individually and considers
the result count of each database outside that region. The script converts each result count
to a percentage of the region's overall relevant results and inserts the database into a
Database Caching Queue in order of descending result percentage. When the process is complete,
every region has an ordered list of the most heavily accessed databases outside that region.
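A sketch of that conversion step follows, assuming the same region-to-database count map as in the previous sketch and a simple sorted array per region as the Database Caching Queue; the queue's exact shape is an assumption.

type UsageStats = Map<string, Map<string, number>>; // regionId -> databaseId -> result count

interface QueueEntry {
  databaseId: string;
  percentOfResults: number; // share of the region's out-of-region results
}

// Build, for each region, a queue ordered from the most heavily accessed
// outside database downward.
function buildCachingQueues(stats: UsageStats): Map<string, QueueEntry[]> {
  const queues = new Map<string, QueueEntry[]>();
  for (const [region, counts] of stats) {
    const total = [...counts.values()].reduce((sum, c) => sum + c, 0);
    if (total === 0) continue;

    const queue = [...counts.entries()]
      .map(([databaseId, count]) => ({
        databaseId,
        percentOfResults: (count / total) * 100,
      }))
      .sort((a, b) => b.percentOfResults - a.percentOfResults);

    queues.set(region, queue);
  }
  return queues;
}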
Where to Cache? With the question of what to cache answered, the remaining problem is where
to cache the data. Several constraints govern this decision. The program must abide by these
constraints while attempting to place data on a server that yields the fastest response time
for the particular set of users who need it. The basic constraints are: