table compressors. Unfortunately, from the point of
view of building compression tools, it is hard, if not
impossible, to automatically detect and take advan-
tage of such ad hoc structures in data. Nonethe-
less, this points out that, in each particular circum-
stance, optimum compression performance may re-
quire the ability to form arbitrary compositions of
diverse data transformations, both generic and data-
specific. Indeed, this ability is implicitly needed even
in implementing general purpose compressors such as
Gzip and Bzip2 that combine distinct techniques deal-
ing with different aspects of information redundancy.
Thus, a major challenge to compression research as
well as to practical use of compression on large scale
data is to provide a way to construct, reuse and in-
tegrate different data transformation and compression
techniques to suit particular data semantics. This is
the problem that the Vcodex platform addresses.
Central to Vcodex is the notion of data transforms,
software components that alter data in some invertible
way. Although it is beyond the scope of this paper, we
note that this general approach to data transformation
also lets Vcodex accommodate techniques beyond
compression, such as encryption, translation between
character sets or simple coding of binary data for
portability. For maximum usability, two main issues
must be addressed:
• Defining a standard software interface for data
transforms: A standardized software interface
helps transform users as it eases implementation
of applications and ensures that application code
remains stable when algorithms change. Trans-
form developers also benefit as a standard inter-
face simplifies and encourages reuse of existing
techniques in implementing new ones.
• Defining a self-describing and portable standard
data format: As new data transforms may be
continually developed by independent parties, it
is important to have a common data format that
can accommodate arbitrary composition of trans-
forms. Such a format should allow receivers of
data to decode them without having to know how
they were encoded. In addition, encoded data
should be independent of OS and hardware plat-
forms so that they can be easily transported and
shared.
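To make the two requirements above concrete, the following is a minimal sketch in C, not the Vcodex interface or data format. All names (transform_t, registry, find_transform, pack, unpack) and the two toy transforms are assumptions introduced for illustration. Every transform exposes the same encode/decode pair, and the encoded stream starts with a byte-oriented header naming the transform, so a receiver can decode it without outside knowledge of how it was encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical uniform interface: every transform offers the same two calls. */
typedef struct {
    const char *name;  /* identifier written into the encoded stream */
    size_t (*encode)(const unsigned char *in, size_t n, unsigned char *out);
    size_t (*decode)(const unsigned char *in, size_t n, unsigned char *out);
} transform_t;

/* Toy transform 1: XOR with a fixed byte; it is its own inverse. */
static size_t xor_code(const unsigned char *in, size_t n, unsigned char *out) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] ^ 0x5A;
    return n;
}

/* Toy transform 2: byte-wise differences (modulo 256) and their running sum. */
static size_t delta_enc(const unsigned char *in, size_t n, unsigned char *out) {
    unsigned char prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (unsigned char)(in[i] - prev);
        prev = in[i];
    }
    return n;
}

static size_t delta_dec(const unsigned char *in, size_t n, unsigned char *out) {
    unsigned char prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = (unsigned char)(prev + in[i]);
        out[i] = prev;
    }
    return n;
}

/* The predefined set of transforms known to both sender and receiver. */
static const transform_t registry[] = {
    { "xor",   xor_code,  xor_code  },
    { "delta", delta_enc, delta_dec },
};

/* Look up a transform by a name of the given length; NULL if unknown. */
const transform_t *find_transform(const char *name, size_t len) {
    for (size_t i = 0; i < sizeof registry / sizeof registry[0]; i++)
        if (strlen(registry[i].name) == len &&
            memcmp(registry[i].name, name, len) == 0)
            return &registry[i];
    return NULL;
}

/* Write [1-byte name length][name][payload]; out must hold
   1 + strlen(t->name) + n bytes. Only single bytes are written,
   so the layout does not depend on byte order or word size. */
size_t pack(const transform_t *t, const unsigned char *in, size_t n,
            unsigned char *out) {
    size_t len = strlen(t->name);
    assert(len < 256);
    out[0] = (unsigned char)len;
    memcpy(out + 1, t->name, len);
    return 1 + len + t->encode(in, n, out + 1 + len);
}

/* Decode using only what the stream says about itself.
   Returns the decoded size, or (size_t)-1 on a malformed or unknown header. */
size_t unpack(const unsigned char *in, size_t n, unsigned char *out) {
    if (n < 1 || (size_t)in[0] + 1 > n) return (size_t)-1;
    size_t len = in[0];
    const transform_t *t = find_transform((const char *)in + 1, len);
    if (t == NULL) return (size_t)-1;
    return t->decode(in + 1 + len, n - 1 - len, out);
}
```

A caller would select a transform with find_transform("delta", 5), call pack() on the sender side and unpack() on the receiver side. A header that lists several names in order would extend the same idea to compositions of transforms.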
The rest of the paper gives an overview of the
Vcodex platform and how it addresses the above
software and data issues. Experiments based on
the well-known Canterbury Corpus (Bell and Powell,
2001) for testing data compressors show that Vcodex
could far outperform conventional compressors such
as Gzip or Bzip2.
2 SOFTWARE ARCHITECTURE
Vcodex is written in the C language in a style com-
patible with C++. The platform is divided into two
main layers. The base layer is a software library
defining and providing a standard software interface
for data transforms. This layer assumes that data are
processed in segments small enough to fit entirely in
memory. The library can be used directly by applica-
tions written in C or C++ in the same way that other C
and C++ libraries are used. The second layer is a
command tool, Vczip, written on top of the library
to enable file compression without low-level
programming. Large files are handled by breaking them
into chunks of suitable size for in-memory processing.
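The chunking strategy can be sketched as follows. This is an illustrative sketch only, not Vczip's code: the names process_in_chunks, chunk_fn and count_bytes are hypothetical, and the callback stands in for whatever in-memory transform would be applied to each segment.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical callback type: processes one in-memory segment.
   Returns 0 on success, nonzero to abort. */
typedef int (*chunk_fn)(const unsigned char *data, size_t n, void *state);

/* Read fp in pieces of at most chunk_size bytes and hand each piece to fn.
   Returns the number of chunks processed, or -1 on error. */
long process_in_chunks(FILE *fp, size_t chunk_size, chunk_fn fn, void *state) {
    assert(fp != NULL && fn != NULL);
    if (chunk_size == 0) return -1;

    unsigned char *buf = malloc(chunk_size);
    if (buf == NULL) return -1;

    long chunks = 0;
    size_t n;
    while ((n = fread(buf, 1, chunk_size, fp)) > 0) {
        if (fn(buf, n, state) != 0) {   /* the callback asked to stop */
            chunks = -1;
            break;
        }
        chunks++;
    }
    if (chunks >= 0 && ferror(fp)) chunks = -1;  /* read error, not end of file */

    free(buf);
    return chunks;
}

/* Example callback: a placeholder for a real transform that only
   adds up how many bytes it has been given. */
int count_bytes(const unsigned char *data, size_t n, void *state) {
    (void)data;
    *(size_t *)state += n;
    return 0;
}
```

Only one buffer of chunk_size bytes is live at any time, so memory use is bounded by the chunk size rather than by the file size.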
2.1 Library Design
[Figure 1 diagram. Top box: Transformation handle and operations:
Vcodex_t (maintaining contexts and states); vcopen(), vcclose(),
vcapply(), ... Lower left box: Discipline structure: Vcdisc_t
(parameters for data processing, event handler). Lower right box:
Data transforms: Vcmethod_t (Burrows-Wheeler, Huffman, delta
compression, table transform, etc.)]
Figure 1: A discipline and method library architecture for
Vcodex.
Figure 1 summarizes the design of the base
Vcodex library, which was built in the style of the
Discipline and Method library architecture (Vo, 2000).
The top part shows that a transformation handle of
type Vcodex_t provides a holding place for contexts
used in transforming different data types as well as
states retained between data transformation calls. A
variety of functions can be performed on such han-
dles. The major ones are vcopen() and vcclose()
for handle opening and closing and vcapply() for
data encoding or decoding. A transformation handle
is parameterized by an optional discipline structure
of type Vcdisc_t and a required data transform
of type Vcmethod_t. Each discipline structure is supplied
by the application and provides additional information
about the data to be processed. On the other
hand, a data transform is selected from a predefined
set and specifies the transformation technique. Sec-
tion 2.2 describes a few common data transforms such
as Burrows-Wheeler, Huffman, etc. Complex data
transformations can be composed from simpler ones
by passing existing handles into newly opened han-
dles for additional processing.
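The handle design described above can be illustrated with a small self-contained sketch. It deliberately does not use the Vcodex types or functions: handle_t, disc_t, method_t, handle_open(), handle_apply(), handle_close() and the two toy methods are all hypothetical names, and the sketch assumes that every method preserves the data size. It shows a handle that combines an optional application-supplied discipline with a required method, keeps state between calls, and forwards its output to an existing handle passed in at open time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct handle handle_t;

/* Method: chosen from a predefined set; specifies the technique. */
typedef struct {
    const char *name;
    size_t (*apply)(handle_t *h, const unsigned char *in, size_t n,
                    unsigned char *out);
} method_t;

/* Discipline: supplied by the application; extra information about the data.
   The single "key" parameter is a made-up example. */
typedef struct {
    unsigned char key;
} disc_t;

struct handle {
    const disc_t   *disc;   /* optional, may be NULL */
    const method_t *meth;   /* required */
    handle_t       *next;   /* optional existing handle that continues the work */
    size_t          calls;  /* state retained between calls */
};

/* Toy method 1: add the discipline's key (default 1) to every byte. */
static size_t shift_apply(handle_t *h, const unsigned char *in, size_t n,
                          unsigned char *out) {
    unsigned char k = h->disc ? h->disc->key : 1;
    for (size_t i = 0; i < n; i++) out[i] = (unsigned char)(in[i] + k);
    return n;
}

/* Toy method 2: reverse the segment. */
static size_t reverse_apply(handle_t *h, const unsigned char *in, size_t n,
                            unsigned char *out) {
    (void)h;
    for (size_t i = 0; i < n; i++) out[i] = in[n - 1 - i];
    return n;
}

const method_t METH_SHIFT   = { "shift",   shift_apply   };
const method_t METH_REVERSE = { "reverse", reverse_apply };

/* Open a handle; passing an existing handle as next composes the two. */
handle_t *handle_open(const disc_t *disc, const method_t *meth, handle_t *next) {
    assert(meth != NULL);
    handle_t *h = malloc(sizeof *h);
    if (h == NULL) return NULL;
    h->disc = disc;
    h->meth = meth;
    h->next = next;
    h->calls = 0;
    return h;
}

/* Close this handle only; handles passed in as next are closed by their owner. */
void handle_close(handle_t *h) { free(h); }

/* Run this handle's method, then let the next handle process the result.
   out must hold n bytes. Returns the output size or (size_t)-1 on failure. */
size_t handle_apply(handle_t *h, const unsigned char *in, size_t n,
                    unsigned char *out) {
    h->calls++;
    if (h->next == NULL) return h->meth->apply(h, in, n, out);

    unsigned char *tmp = malloc(n ? n : 1);  /* scratch for the intermediate data */
    if (tmp == NULL) return (size_t)-1;
    size_t m = h->meth->apply(h, in, n, tmp);
    m = handle_apply(h->next, tmp, m, out);
    free(tmp);
    return m;
}
```

Opening a reverse handle first and then passing it to a newly opened shift handle yields a composite that shifts and then reverses. Only the forward direction is sketched; decoding would run the inverse methods in the opposite order.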
Figure 2 shows an example of constructing and using a
transformation handle for delta compression by composing
two data transforms: Vcdelta and Vchuffman
(Section 2.2). Here, delta compression (Hunt et al.,
1998) is a general technique to compress a target
ICSOFT 2007 - International Conference on Software and Data Technologies