arise from the large and rapidly increasing amount of available DNA sequences. Moreover, data compression now plays an even more important role in reducing the costs of data transmission, since DNA files are typically shared and distributed over the Internet in heterogeneous databases. A space-efficient representation of the data reduces the load on FTP service providers such as GenBank. Consequently, file transmissions complete faster, saving costs for the clients who access those files (Korodi and Tabus, 2005). Furthermore, modeling and analyzing DNA sequences may yield significant results; in particular, a good relatedness measure between sequences may lead to effective alignment and phylogenetic tree construction. In addition, the statistical significance of DNA sequences shows how sensitive the genome is to random changes, such as crossover and mutation, what the average composition of a sequence is, and where the important composition changes occur (Korodi and Tabus, 2005).
Additionally, DNA compression has been used to
distinguish between coding and non-coding regions
of a DNA sequence, to evaluate the “distance”
between DNA sequences and to quantify how “close” two organisms are in the evolutionary tree, and in other biological applications. On the other hand, standard text compression tools, such as compress, gzip and bzip2, cannot compress DNA sequences effectively, since the files they produce cost more than two bits per symbol, the rate of the trivial fixed-length encoding of the four-letter alphabet.
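As a point of reference, the two-bits-per-symbol baseline is simply the fixed-length encoding of the alphabet {A, C, G, T}. A minimal Python sketch (the function name is ours, for illustration only):

    # Naive baseline: fixed 2-bit code per base, four bases per byte.
    CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def pack_2bit(seq):
        out, byte = bytearray(), 0
        for i, base in enumerate(seq):
            byte = (byte << 2) | CODE[base]
            if i % 4 == 3:              # byte full after four bases
                out.append(byte)
                byte = 0
        if len(seq) % 4:                # flush a partial final byte
            out.append(byte << 2 * (4 - len(seq) % 4))
        return bytes(out)

    # 'ACGT' packs into the single byte 00 01 10 11 = 0x1B.
    assert pack_2bit("ACGT") == b'\x1b'

A DNA-specific compressor is therefore only worthwhile if it encodes at strictly fewer than two bits per symbol on average.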
Data compression is a process that reduces the size of data. One of the most crucial criteria for classifying compression algorithms is whether the algorithm removes parts of the data that cannot be recovered during decompression. Algorithms that permanently remove some parts of the data are called lossy, while the others are called lossless (Deorowicz, 2003).
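The lossless property means a round trip through the codec recovers the input exactly, which can be checked with a general-purpose compressor such as zlib (a minimal sketch):

    import zlib

    data = b"ACGTACGTACGT"
    packed = zlib.compress(data)
    assert zlib.decompress(packed) == data   # lossless: exact recovery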
In this paper, we propose a Lossless Compression Algorithm (LCA) consisting of three phases, which are discussed in detail in section 4. We use PatternHunter as part of the first phase to find approximate repeats and complementary palindromes. LCA introduces a new encoding methodology for DNA compression. The rest of this
paper is organized as follows: in section 2, we
survey different algorithms for DNA sequence
compression. In section 3, the difference between
BLAST and PatternHunter is explained, illustrating
why PatternHunter is used to detect approximate
repeats and complementary palindromes in the
proposed algorithm. In section 4, our proposed
algorithm is explained. Section 5 compares our results on a standard set of DNA sequences with published results for the most recent DNA compression algorithms. Finally, section 6 presents the conclusion and future work.
2 RELATED WORK
Several compression algorithms have been
developed, such as Biocompress (Grumbach and
Tahi, 1993), Biocompress-2 (Grumbach and Tahi,
Cfact (Rivals, Delahaye, Dauchet and Delgrange, 1996), GenCompress (Chen, Kwong and Li, 1999; Chen, Kwong and Li, 2001), CTW+LZ (Matsumoto, Sadakane and Imai, 2000), DNASequitur (Cherniavsky and Ladner, 2004), DNAPack (Behzadi and Fessant, 2004), and
LUT+LZ (Bao, Chen and Jing, 2005). The first two algorithms developed for compressing DNA sequences were Biocompress and its second version, Biocompress-2. They are similar to the Lempel-Ziv data compression method in the way they search the previously processed part of the sequence for repeats. Biocompress-2 performs compression in two steps: 1) detecting exact repeats and complementary palindromes located in the already encoded part of the sequence, and 2) encoding them by the repeat length and the position of a previous occurrence. If no significant repetition is found, Biocompress-2 falls back to order-2 arithmetic coding.
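To make the notion of a complementary palindrome concrete, the following minimal Python sketch (names are ours, not from Biocompress-2) shows the reverse-complement operation on which it rests:

    COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

    def reverse_complement(s):
        # e.g. reverse_complement('AACG') == 'CGTT'
        return ''.join(COMPLEMENT[b] for b in reversed(s))

    def is_complementary_palindrome(s):
        # A string equal to its own reverse complement,
        # e.g. the EcoRI restriction site 'GAATTC'.
        return s == reverse_complement(s)

    assert reverse_complement("AACG") == "CGTT"
    assert is_complementary_palindrome("GAATTC")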
Another algorithm is Cfact, which searches for the longest exact matching repeat
using a suffix tree on the entire sequence. By
performing two passes, repetitions are encoded when
the gain is guaranteed; otherwise the two-bits-per-base (2-Bits) encoding is used.
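The core subproblem, finding the longest exact repeat, can be sketched in a few lines of Python (a quadratic illustration with names of our choosing; Cfact itself uses a suffix tree over the whole sequence):

    def longest_repeat(s):
        # Longest substring occurring at least twice, found by sorting
        # all suffixes and comparing lexicographic neighbours.
        suffixes = sorted(s[i:] for i in range(len(s)))
        best = ""
        for a, b in zip(suffixes, suffixes[1:]):
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            if k > len(best):
                best = a[:k]
        return best

    assert longest_repeat("ATCGATCGA") == "ATCGA"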
The GenCompress algorithm is a one-pass algorithm based on approximate matching, of which two variants exist: GenCompress-1 and GenCompress-2. GenCompress-1 uses the Hamming distance (replacements only) to model the repeats, while GenCompress-2 uses the edit distance (deletion, insertion and replacement) for the encoding of the repeats.
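The difference between the two repeat models can be illustrated with a minimal Python sketch (function names are ours): the Hamming distance counts substitutions between equal-length strings, while the edit distance also allows insertions and deletions.

    def hamming(a, b):
        # Substitutions only (the GenCompress-1 repeat model).
        assert len(a) == len(b)
        return sum(x != y for x, y in zip(a, b))

    def edit_distance(a, b):
        # Deletion, insertion and replacement (the GenCompress-2 model),
        # computed with the standard one-row dynamic program.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (x != y)))   # replacement
            prev = cur
        return prev[-1]

    assert hamming("ACGT", "ACCT") == 1
    assert edit_distance("ACGT", "AGT") == 1   # one deletion suffices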
The CTW+LZ algorithm is based on the context tree weighting
method. It combines an LZ77-type method, similar to GenCompress, with the CTW algorithm: long exact/approximate repeats are encoded by the LZ77-type method, while short repeats are encoded by CTW. Although good compression ratios are obtained, the execution time is too high for long sequences. DNASequitur is a grammar-based
compression algorithm for DNA sequences which
infers a context-free grammar to represent the input
data but fails to achieve better compression than