the information type. A matrix representation builds on the vector representation of its rows or columns, so it is useful to describe storage in terms of rows and columns.
The following are some options for compressing and representing one-dimensional data sets (vectors).
Run-length Encoding: Consecutive runs of data with the same value are stored as a pair (count, value), in which value is the repeated value and count is the number of times it occurs within the run.
There are variations of this representation: if runs of the same value are repeated at different positions of the vector, the value is stored once and, in addition, the starting position and the number of elements of each run are recorded (Elgohary et al., 2016).
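The two run-length representations described above can be sketched as follows. This is a minimal illustration, not the implementation from the cited work; the function names (`rle_encode`, `rle_decode`, `rle_by_value`) and the sample vector are assumptions made for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Return one (count, value) pair per run of consecutive equal values."""
    return [(len(list(run)), value) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in pairs:
        out.extend([value] * count)
    return out


def rle_by_value(values):
    """Variant: store each value once, with the (start, length) of every run."""
    runs = {}
    start = 0
    for value, run in groupby(values):
        length = len(list(run))
        runs.setdefault(value, []).append((start, length))
        start += length
    return runs


# Hypothetical sample vector used only for illustration.
data = [3, 3, 3, 1, 1, 2, 2, 2, 2, 3]
encoded = rle_encode(data)
by_value = rle_by_value(data)
```

In the variant, the value 3 appears in two separate runs but is stored only once as a key, together with the start and length of each run.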
Offset-list Encoding: For each distinct value in the data set, a new list is generated that contains the indexes at which that value appears. When there are two correlated data sets, an (x, y) pair is generated and the indexes at which that pair appears are stored in the new list.
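A small sketch of offset-list encoding, under the assumption that offsets are zero-based positions in the vector; the helper names `ole_encode` and `ole_decode` and the sample data are hypothetical. The correlated case is handled by zipping the two vectors so that each (x, y) pair acts as a single value.

```python
from collections import defaultdict


def ole_encode(values):
    """Map each distinct value to the list of indexes where it appears."""
    offsets = defaultdict(list)
    for index, value in enumerate(values):
        offsets[value].append(index)
    return dict(offsets)


def ole_decode(offsets, length):
    """Rebuild the original vector from the per-value index lists."""
    out = [None] * length
    for value, indexes in offsets.items():
        for index in indexes:
            out[index] = value
    return out


# Single vector (hypothetical sample data).
data = ["a", "b", "a", "c", "b", "a"]
encoded = ole_encode(data)

# Two correlated vectors: each (x, y) pair is treated as one value.
xs = [1, 1, 2, 1]
ys = [7, 8, 7, 7]
pair_offsets = ole_encode(list(zip(xs, ys)))
```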
Figure 1 shows the run-length encoding (RLE) and offset-list encoding (OLE) compression schemes.
Figure 1: Compression examples (Elgohary et al., 2017).
GZIP: Compression is based on the DEFLATE algorithm¹, which consists of two parts: LZ77 and Huffman coding. The LZ77 algorithm compresses the data by removing redundant parts, and Huffman coding encodes the result generated by LZ77 (Ouyang et al., 2010).
Classical compression methods such as GZIP place a considerable load on the CPU, which offsets the performance gained by reducing read/write operations. This makes them impractical options for use in databases (Chen et al., 2001).
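For reference, a GZIP round trip can be tried with Python's standard `gzip` module, which applies DEFLATE internally. This is only a sketch: the categorical vector below (10,000 observations over 4 categories, one byte each) is a hypothetical example, and it does not measure the CPU cost discussed above.

```python
import gzip

# Hypothetical categorical vector: 10,000 observations over 4 categories,
# stored with one byte per observation.
raw = bytes([0, 1, 2, 3] * 2500)

# Compress with GZIP (DEFLATE = LZ77 + Huffman coding) and restore.
compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

print(len(raw), "bytes before,", len(compressed), "bytes after")
```

The whole stream has to be decompressed before any single observation can be read, which is part of why such general-purpose methods fit database workloads poorly.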
Bit Level Compression: The REDATAM software² uses a distinct data compression scheme based on 4-byte blocks. Each block stores one or more values, depending on the maximum size in bits required to store the values (De Grande, 2016).
¹ https://tools.ietf.org/html/rfc1952
² http://www.redatam.org
This compression format is the most viable option when working with categorical data, because in most cases the information to be represented has a small number of distinct categories. The method uses all the available bits in each block, so a value can be split across two different blocks of compressed data, as shown in Figure 2.
Figure 2: REDATAM compression.
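The idea of packing values back to back so that one value may straddle two 4-byte blocks can be sketched as below. This is not REDATAM's actual file layout: the bit order (most significant bit first), the zero padding of the last block, and the names `pack_spanning` and `unpack_spanning` are assumptions made for illustration.

```python
def pack_spanning(values, bits):
    """Pack each value into `bits` bits, back to back, across 32-bit blocks."""
    stream = 0
    for value in values:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
        stream = (stream << bits) | value
    total_bits = bits * len(values)
    padding = (-total_bits) % 32          # zero-fill the last block
    stream <<= padding
    n_blocks = (total_bits + padding) // 32
    return [(stream >> (32 * (n_blocks - 1 - i))) & 0xFFFFFFFF
            for i in range(n_blocks)]


def unpack_spanning(blocks, bits, count):
    """Recover `count` values of `bits` bits each from the packed blocks."""
    stream = 0
    for block in blocks:
        stream = (stream << 32) | block
    total = 32 * len(blocks)
    mask = (1 << bits) - 1
    return [(stream >> (total - bits * (i + 1))) & mask for i in range(count)]


# Hypothetical data: 12 values of 3 bits each = 36 bits, so the 11th value
# occupies bits 30-32 and is split between the first and second block.
values = [5, 0, 7, 3, 1, 6, 2, 4, 7, 5, 0, 1]
blocks = pack_spanning(values, 3)
```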
3 COMPRESSION APPROACH
In this section we propose a new mechanism for compressing categorical data. The proposed method is a variation of the bit level compression method described in Section 2. It does not use all the available bits of each block, because 32 may not be a multiple of the number of bits needed to represent the categories.
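One way to read the statement above is that each 4-byte block holds only whole values, leaving the remaining bits unused whenever 32 is not a multiple of the value width. The sketch below follows that reading; the layout (values never crossing a block boundary, first value in the low-order bits) and the names `pack_aligned` and `unpack_aligned` are assumptions, not details given in the text.

```python
BLOCK_BITS = 32


def pack_aligned(values, bits):
    """Pack floor(32 / bits) whole values per block; leftover bits stay unused."""
    if not 1 <= bits <= BLOCK_BITS:
        raise ValueError("bits must be between 1 and 32")
    per_block = BLOCK_BITS // bits
    blocks = []
    for start in range(0, len(values), per_block):
        block = 0
        for slot, value in enumerate(values[start:start + per_block]):
            if not 0 <= value < (1 << bits):
                raise ValueError(f"{value} does not fit in {bits} bits")
            block |= value << (slot * bits)
        blocks.append(block)
    return blocks


def unpack_aligned(blocks, bits, count):
    """Read value i directly from block i // per_block, slot i % per_block."""
    per_block = BLOCK_BITS // bits
    mask = (1 << bits) - 1
    return [(blocks[i // per_block] >> ((i % per_block) * bits)) & mask
            for i in range(count)]


# Hypothetical data: 3-bit categories give 10 values per block, 2 bits unused.
values = [5, 0, 7, 3, 1, 6, 2, 4, 7, 5, 0, 1]
blocks = pack_aligned(values, 3)
unused_bits_per_block = BLOCK_BITS % 3
```

Under this assumed layout, any observation can be located with one division and one shift, at the cost of a few unused bits per block.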
The numerical information of categorical variables is traditionally represented as signed integer values of 32, 16 or 8 bits (4, 2 or 1 bytes). This implies that storing a single numerical value requires 32 bits (or 16 or 8 bits when 2 or 1 bytes are used). We will consider the case in which the information is represented as a set of 4-byte integer values.
Figure 3 represents the bit distribution of an integer value composed of 4 bytes.
Figure 3: Representation of an integer value - 4 bytes.
There are variations of the representation shown in Figure 3, because integer values can be represented in little-endian or big-endian format.
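The difference between the two byte orders can be seen with Python's standard `struct` module, where `<i` and `>i` denote a 4-byte signed integer in little-endian and big-endian order respectively. The value 1 is an arbitrary example.

```python
import struct

value = 1

# Same 4-byte signed integer, two byte orders.
little = struct.pack("<i", value)   # least significant byte first
big = struct.pack(">i", value)      # most significant byte first

# Reading the bytes back with the matching byte order recovers the value.
from_little = struct.unpack("<i", little)[0]
from_big = struct.unpack(">i", big)[0]
```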
If the original variable has $m$ observations, the size in bytes needed to represent all the observations (without compression) is:
$$T_b = m \times 4$$
where $T_b$ is the total number of bytes.
If we consider that not all 32 bits are used to represent the values, a lot of space is wasted. In Figure 4, the gray area corresponds to unused space. Out of a total of $4 \times 32 = 128$ bits, only 16 are used, which represents 12.5% of the total storage.
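The arithmetic above can be checked with a short calculation. The sketch assumes that the example in Figure 4 consists of 4 observations that each need 4 bits (which gives the 16 used bits mentioned in the text); the function names are hypothetical.

```python
BLOCK_BITS = 32  # one 4-byte integer per observation


def uncompressed_bytes(m):
    """Total bytes needed for m observations without compression."""
    return m * 4


def utilization(m, bits_per_value):
    """Fraction of the stored bits that actually carry information."""
    used = m * bits_per_value
    total = m * BLOCK_BITS
    return used / total


# Assumed reading of Figure 4: 4 observations, 4 significant bits each.
total_bytes = uncompressed_bytes(4)
share_used = utilization(4, 4)
print(f"{total_bytes} bytes stored, {share_used:.1%} of the bits used")
```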
KDIR 2018 - 10th International Conference on Knowledge Discovery and Information Retrieval