that are fully main memory based. This structure is
designed for low cardinality of the domain
dictionaries, with an architecture that stores codes
rather than the original data values in tuples. The
main pitfall of this method is that the structure
cannot handle large databases, because it depends
entirely on main memory and the amount of main
memory available is limited.
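As a minimal sketch of dictionary encoding in general, and not the implementation of any system cited here, the following Python fragment stores a small integer code per tuple in place of each value and reports how many bits one code needs. It illustrates why low cardinality matters: the code width grows with the number of distinct values. The function names and the sample column are hypothetical.

```python
def encode_column(values):
    """Dictionary-encode a column: distinct values get codes 0, 1, 2, ...

    Returns (dictionary, vector) where `vector` holds one code per tuple.
    """
    dictionary = {}
    vector = []
    for value in values:
        # len(dictionary) is evaluated before insertion, so a new value
        # receives the next unused code.
        code = dictionary.setdefault(value, len(dictionary))
        vector.append(code)
    return dictionary, vector


def bits_per_code(cardinality):
    """Bits needed to distinguish `cardinality` distinct codes (at least 1)."""
    return max(1, (cardinality - 1).bit_length())


# Hypothetical low-cardinality column.
states = ["NSW", "VIC", "NSW", "QLD", "VIC", "NSW"]
dictionary, vector = encode_column(states)
width = bits_per_code(len(dictionary))
print(vector, width)
```

With three distinct values, each tuple stores a 2-bit code instead of a string; the strings themselves are kept once, in the dictionary.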
Investigation of main memory database systems is
well described in (Garcia-Molina and Salem 1992),
with a major focus on fidelity of main memory
content compared to conventional disk based
database systems. Memory resident database
systems (MMDBs) store their data in main physical
memory, providing high data access speeds and
direct accessibility. As semiconductor memory
becomes less expensive, it is increasingly feasible to
store databases in main memory and for MMDBs to
become a reality.
The unique graph-based data structure called
DBGraph may be used as a database representation
tool that fully exploits the direct access capability of
main memory. Additionally, the rapidly decreasing
cost of RAM makes main memory database systems
a cost effective solution to high performance data
management (Pucheral, Thevnin et al. 1990). This
overcomes the problem that disk-based database
systems have their performance limited by I/O.
A compressed 1-ary vertical representation is
used to represent high dimensional sparsely
populated data where database size grows linearly
(Hoque 2002). Queries can be processed on the
compressed form without decompression;
decompression is done only when the result is
necessary.
Different kinds of problems, such as access control
and transaction management, may arise for distributed
and replicated data in distributed database systems
(DDBMS) (Alkhatib and Labban 1995). Oracle, a
leading commercial DBMS, defines a way to maintain
a consistent state for the database using a
distributed two phase commit protocol. Alkhatib and
Labban (1995) address issues such as the advantages,
disadvantages, and system failures of distributed
database systems. Since organizations tend to be
geographically dispersed, a DDBMS fits the
organizational structure better than a traditional
centralized DBMS. One advantage of a DDBMS is
that failure of a server at one site will not
necessarily render the distributed database system
inaccessible.
A general architecture for
archiving and retrieving real-time, scientific data is
described in (Lawrence and Kruger 2005). The
basis of the architecture is a data warehouse that
stores metadata on the raw data to allow for its
efficient retrieval. A transparent data distribution
system uses the data warehouse to dynamically
distribute the data across multiple machines.
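Returning to the point made above that queries can be processed on the compressed form, with decompression deferred until a result is needed: the following Python sketch shows one way this can work for a dictionary-coded column. It is an illustration under stated assumptions, not the method of Hoque (2002); the function names and sample data are hypothetical.

```python
def encode_column(values):
    """Dictionary-encode a column into (dictionary, code vector)."""
    dictionary = {}
    vector = []
    for value in values:
        vector.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, vector


def select_equal(dictionary, vector, value):
    """Return row positions where the column equals `value`.

    The predicate value is translated to its code once; the scan then
    compares integer codes only and never decodes the column.
    """
    code = dictionary.get(value)
    if code is None:  # value absent from the dictionary: no row can match
        return []
    return [row for row, c in enumerate(vector) if c == code]


def decode_rows(dictionary, vector, rows):
    """Decompress only the requested rows of a column."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[vector[row]] for row in rows]


# Hypothetical two-column relation stored column by column.
suburbs = ["Carlton", "Kew", "Carlton", "Richmond", "Kew"]
names = ["Smith", "Jones", "Brown", "Smith", "Lee"]
suburb_dict, suburb_vec = encode_column(suburbs)
name_dict, name_vec = encode_column(names)

rows = select_equal(suburb_dict, suburb_vec, "Kew")
result = decode_rows(name_dict, name_vec, rows)
print(rows, result)
```

Only the two matching rows of the name column are decoded; the suburb column is scanned entirely in its coded form.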
A single dictionary based compression technique
to manage large scale databases is described by
Oracle Corporation (Poess and Potapov 2003). The
authors also address an innovative table compression
technique that is very attractive for large relational
data warehouses. This technique is used to
compress and partition tables. The status of a table
can be changed from non-compressed to compressed
at any time by simply adding the keyword
COMPRESS to the table’s meta-data.
The LH*RS scheme defines a way of storing
highly available distributed data (Litwin, Moussa et al.
2004). It belongs to the family of scalable distributed
data structures (SDDS) that are intended for computers
over fast networks, usually local networks. This
architecture is a promising way to store distributed
data and is gaining in popularity.
A distributed storage system for structured data
called Bigtable is presented in (Chang, Dean et al.
2006). The system is designed to manage structured
data at very large scale, distributed across
thousands of commodity servers. Bigtable has
successfully provided a flexible, high performance
solution for a number of Google products.
2.1 Existing HIBASE Compression Technique
The basic HIBASE model, as described in
(Cockshott, Mcgregor et al. 1998), represents tables
as a set of columns (rather than as a set of rows as
used in a traditional relational database). This
structure is dictionary based, and designed for low
cardinality of domain values. The architecture
stores codes, rather than the original data values, in
tuples. The main pitfall of this method is that it
cannot handle large databases because it depends
entirely on main memory. HIBASE uses a single
block column vector; each attribute is associated
with a domain dictionary and a column vector. The
columns are organized as a linked list, each of which
points into the dictionary. Figure 1 shows a
HIBASE structure together with its domain
dictionary. There are 7 distinct last names,
represented by identifiers numbered 0 to 6, which
can be encoded in 3 bits; similarly suburb, state
and marital status are encoded in 2, 1 and 1 bits
respectively. Hence, in the compressed representation
3 + 2 + 1 + 1 = 7 bits are required to represent one
tuple using the HIBASE method. For the set of 8
tuples, 8 × 7 = 56 bits would be required. In the
uncompressed relation
ICSOFT 2008 - International Conference on Software and Data Technologies