
which data were updated through batches of
records that included codes to add, change, or
delete from the master flat file. Whenever coding
changes were required, the IT group, then called
"data management," had to be methodical enough
to make sure all occurrences of the offending
code were replaced with the new value.
Even though the old flat files were difficult to
quality assure, they were very easy for most users
to query. Reports generally represented little more
than sums across records of data with some
additional programming to convert cryptic codes
into their more legible legends. Conversely, the
OLTP model is perceived as a nightmare for most
users who must now write multi-table joins across
seven or eight tables. Suddenly, the old flat file
legacy systems seemed much more desirable. To
compound matters, while information technology groups spent their resources designing and deploying elegant data models and touting the ease of use of SQL (Structured Query Language), the user community, confronted with highly normalized designs, was unable to master the complexities of SQL and reported to its own management that its data were no longer accessible.
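To illustrate the contrast, a report that was once a single pass over one flat file can require a join across several tables in a normalized design. The query below is a simplified sketch; the table and column names are purely illustrative:

SELECT r.region_name, p.product_name,
       SUM(oi.quantity * oi.unit_price) AS total_sales
FROM   regions r, customers c, orders o, order_items oi, products p
WHERE  c.region_id   = r.region_id
AND    o.customer_id = c.customer_id
AND    oi.order_id   = o.order_id
AND    p.product_id  = oi.product_id
GROUP BY r.region_name, p.product_name;

Even this reduced example joins five tables; real reporting queries against a fully normalized schema routinely involve more.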
Clearly, there is a need for some compromise
between the elegance and efficiency of OLTP systems and the needs of the end users to analyze
the data that cost so many dollars to capture and
ingest. The solution to this dilemma is the data
warehouse. Although the idea may raise some eyebrows, the data warehouse can be created
using the same RDBMS used to house the OLTP
database. In fact, a data warehouse is nothing more
than a database that has been optimized for
retrieval. The main concern of its architects is the
delivery of data in a consistent, easy-to-access
format. The emphasis of the design is speed of
retrieval.
3 FACTORS THAT MAKE A DATA
WAREHOUSE EFFICIENT
As mentioned earlier, a data warehouse must provide good performance, manage large amounts of data, and offer quick ways to load large volumes of OLTP data. To support these operations, the vendors of data warehousing software have developed the following strategies:
Range, Hash, and Composite partitioning:
Partitioning involves separating objects into pieces
so that they can be managed more easily.
Although to the database the object is logically the
same, each piece is physically stored separately.
Range partitioning is useful when separating data into ranges, such as by year or by state. On the other hand, when the data cannot be separated into discrete, meaningful groups, the data can be divided using a hashing algorithm that ensures an even distribution. The combination of the two methodologies is called composite partitioning: it applies range partitioning first and then divides each range into sub-partitions using hash partitioning (Scherer, 2000, p. 145). With any of these techniques, the database designer must analyze both the kinds of data that will be stored in the warehouse and the ways in which the data will be used.
Partitioning data based on year may not be
effective if the organization typically analyzes data
aggregated by state, crossing over multiple years.
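To make the idea concrete, the following Oracle-style DDL sketches a composite-partitioned fact table; the table, column, and partition names are purely illustrative, and the exact syntax varies by product and version:

CREATE TABLE sales_fact (
  sale_date  DATE,
  state_code CHAR(2),
  product_id NUMBER,
  amount     NUMBER(12,2)
)
PARTITION BY RANGE (sale_date)       -- range: one partition per year
SUBPARTITION BY HASH (product_id)    -- hash: spread each year's rows evenly
SUBPARTITIONS 4
( PARTITION sales_2002 VALUES LESS THAN (TO_DATE('01-JAN-2003', 'DD-MON-YYYY')),
  PARTITION sales_2003 VALUES LESS THAN (TO_DATE('01-JAN-2004', 'DD-MON-YYYY')),
  PARTITION sales_max  VALUES LESS THAN (MAXVALUE) );

With such a layout, queries that restrict on the partitioning key can be confined to a single partition, and loads or purges can be performed one partition at a time.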
Transportable tablespaces: Tablespaces are
logical units that hold database objects. Each
tablespace may be composed of one or many
physical data files. Since data warehouses are used to manage large amounts of data, it may be necessary to move these tablespaces to supply data to a different database or data mart, to archive data, or to share data with other databases.
Because the tablespaces are based on physical
files, there are certain limitations to this
capability. These limitations include the requirements that the tablespaces be transported only between databases on the same platform, that the databases use the same block size, and that they use the same character set (Scherer, 2000, p. 162).
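As a sketch of how this works in an Oracle-style environment (the tablespace name, schema, and file paths below are assumptions chosen for illustration):

-- 1. Make the tablespace read-only on the source database
ALTER TABLESPACE sales_2003 READ ONLY;

-- 2. Export only the tablespace metadata from the operating system shell,
--    e.g. exp TRANSPORT_TABLESPACE=y TABLESPACES=sales_2003 FILE=sales_2003.dmp
-- 3. Copy the tablespace's data files to the target server with an OS-level copy.
-- 4. Plug the tablespace into the target database,
--    e.g. imp TRANSPORT_TABLESPACE=y FILE=sales_2003.dmp
--         DATAFILES='/u02/oradata/dw/sales_2003.dbf'

-- 5. Reopen the tablespace for writing if updates are required
ALTER TABLESPACE sales_2003 READ WRITE;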
The star schema: Data warehouses are
frequently designed using a dimensional data
model called the star schema. In this type of
design, there is a central table called a fact table
that is related to several look-up tables called
dimensions (Silverston, 1997). Fact tables contain quantitative data and tend to be very large, with millions of records and gigabytes to terabytes of data. On the other hand, dimension
tables tend to be much smaller and contain
descriptive data. Both must have a primary key; for fact tables, the primary key is usually a composite of the foreign keys to the dimension tables. To access tables arranged in this way, a star query is used, which in turn relies on what is referred to as a star join (Scherer, 2000, p. 164).
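A minimal star schema might therefore look like the following Oracle-style DDL, where all table and column names are illustrative only:

CREATE TABLE time_dim (
  time_key      NUMBER PRIMARY KEY,
  calendar_date DATE,
  fiscal_year   NUMBER );

CREATE TABLE store_dim (
  store_key  NUMBER PRIMARY KEY,
  store_name VARCHAR2(50),
  state_code CHAR(2) );

CREATE TABLE product_dim (
  product_key  NUMBER PRIMARY KEY,
  product_name VARCHAR2(50) );

-- The fact table holds the quantitative measures; its primary key is a
-- composite of the foreign keys pointing to the dimension tables.
CREATE TABLE sales_fact (
  time_key     NUMBER REFERENCES time_dim,
  store_key    NUMBER REFERENCES store_dim,
  product_key  NUMBER REFERENCES product_dim,
  units_sold   NUMBER,
  sales_amount NUMBER(12,2),
  PRIMARY KEY (time_key, store_key, product_key) );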
Most RDBMSs (Relational Database Management Systems) allow developers to rely on a mechanism called the cost-based optimizer to tune query performance. This is done by regularly compiling statistics on the data and then specifying the "star" hint, which, if possible, will cause the database engine to position the largest table, or fact table, last in the join order.
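In an Oracle-style environment, for example, the statistics can be compiled with the DBMS_STATS package and the hint embedded directly in the query; the sketch below reuses the illustrative tables from the previous example:

-- Compile optimizer statistics on the fact and dimension tables
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DW', tabname => 'SALES_FACT');
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DW', tabname => 'TIME_DIM');

-- Ask the cost-based optimizer for a star plan
SELECT /*+ STAR */ t.fiscal_year, s.state_code, SUM(f.sales_amount)
FROM   sales_fact f, time_dim t, store_dim s
WHERE  f.time_key    = t.time_key
AND    f.store_key   = s.store_key
AND    t.fiscal_year = 2003
GROUP BY t.fiscal_year, s.state_code;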