
 
arise from the large and rapidly increasing amount of available DNA sequences. Moreover, data compression now plays an even more important role in reducing the cost of data transmission, since DNA files are typically shared and distributed over the Internet in heterogeneous databases. A space-efficient representation of the data reduces the load on FTP service providers such as GenBank. Consequently, file transmissions complete faster, saving costs for the clients who access those files
(Korodi and Tabus, 2005). Furthermore, modeling and analyzing DNA sequences can yield significant results: a good relatedness measure between sequences can lead to effective alignment and phylogenetic tree construction. In addition, the statistical significance of DNA sequences shows how sensitive the genome is to random changes, such as crossover and mutation, what the average composition of a sequence is, and where the important composition changes occur (Korodi and Tabus, 2005).
Additionally, DNA compression has been used to distinguish between coding and non-coding regions of a DNA sequence, to evaluate the “distance” between DNA sequences, to quantify how “close” two organisms are in the evolutionary tree, and in other biological applications. On the other hand, standard text compression tools, such as compress, gzip and bzip2, cannot compress DNA sequences effectively, since the files they produce cost more than two bits per symbol.
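Because a DNA sequence is drawn from the four-letter alphabet {A, C, G, T}, a trivial fixed-length code already achieves exactly two bits per base, so this is the baseline any useful DNA compressor must beat. The following minimal sketch (our own, purely for illustration) implements that baseline:

# Trivial 2-bits-per-base encoding of DNA: the baseline that any
# useful DNA compressor has to beat.
CODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
BASE = {v: k for k, v in CODE.items()}

def pack(seq):
    """Pack a DNA string into 2 bits per base (length kept separately)."""
    bits = 0
    for ch in seq:
        bits = (bits << 2) | CODE[ch]
    return bits.to_bytes((2 * len(seq) + 7) // 8, 'big')

def unpack(data, n):
    """Recover the original string of n bases."""
    bits = int.from_bytes(data, 'big')
    return ''.join(BASE[(bits >> (2 * i)) & 0b11] for i in reversed(range(n)))

seq = "ACGTACGTAACC"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(packed), "bytes for", len(seq), "bases")  # 3 bytes = 2 bits/base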
Data compression is a process that reduces the size of data. A crucial classification criterion is whether the compression algorithm discards parts of the data that cannot be recovered during decompression. Algorithms that permanently remove parts of the data are called lossy, while the others are called lossless (Deorowicz, 2003).
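As a concrete illustration (our own sketch, not taken from the cited work), losslessness means that decompression must reproduce the input exactly, byte for byte. The same check also lets one measure how many bits per base a general-purpose compressor such as zlib, the library behind gzip, actually spends on DNA-like data:

import random
import zlib

# Lossless round trip: the decoder must reconstruct the input exactly.
random.seed(0)
seq = ''.join(random.choice('ACGT') for _ in range(100_000)).encode()

compressed = zlib.compress(seq, level=9)
assert zlib.decompress(compressed) == seq  # nothing is lost

# General-purpose tools typically stay at or above two bits per base
# on DNA, i.e. no better than the trivial 2-bit encoding.
print(8 * len(compressed) / len(seq), 'bits per base')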
In this paper, we propose a Lossless Compression Algorithm (LCA) that consists of three phases, discussed in detail in section 4. We use PatternHunter as part of the first phase to find approximate repeats and complementary palindromes. LCA proposes a new encoding methodology for DNA compression. The rest of this
paper is organized as follows: in section 2, we 
survey different algorithms for DNA sequence 
compression. In section 3, the difference between 
BLAST and PatternHunter is explained, illustrating 
why PatternHunter is used to detect approximate 
repeats and complementary palindromes in the 
proposed algorithm. In section 4, our proposed 
algorithm is explained. Section 5 compares our results on a standard set of DNA sequences with published results for the most recent DNA compression algorithms. Finally, section 6
presents the conclusion and future work. 
2 RELATED WORK 
Several compression algorithms have been developed, such as Biocompress (Grumbach and Tahi, 1993), Biocompress-2 (Grumbach and Tahi, 1994), Cfact (Rivals, Delahaye, Dauchet and Delgrange, 1996), GenCompress (Chen, Kwong and Li, 1999; Chen, Kwong and Li, 2001), CTW+LZ (Matsumoto, Sadakane and Imai, 2000), DNASequitur (Cherniavsky and Ladner, 2004), DNAPack (Behzadi and Le Fessant, 2004), and LUT+LZ (Bao, Chen and Jing, 2005). The first two algorithms developed for compressing DNA sequences were Biocompress and its second version, Biocompress-2. They are similar to the Lempel-Ziv
data compression method in the way they search the 
previously processed part of the sequence for 
repeats. Biocompress-2 performs compression by 
the following steps: 1) Detecting exact repeats and 
complementary palindromes located in the already 
encoded sequence and 2) Encoding them by the 
repeat length and the position of a previous repeat 
occurrence. If no significant repetition is found, Biocompress-2 falls back to order-2 arithmetic coding; the repeat-and-palindrome lookup behind this scheme is sketched below.
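The following minimal sketch (our own illustration, not the authors' implementation) shows the two ingredients this family of LZ-style DNA compressors relies on: finding a previous occurrence of a factor, either verbatim or as a complementary (reverse-complement) palindrome, and describing it by its position and length:

# Illustrative LZ-style repeat lookup for DNA: a factor is replaced by
# a (kind, position, length) token when it reoccurs in the already
# encoded prefix, either exactly or as a complementary palindrome.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(s):
    """Complementary palindrome partner of s, e.g. 'AACG' -> 'CGTT'."""
    return s.translate(COMPLEMENT)[::-1]

def find_repeat(history, factor):
    """Return ('repeat'|'palindrome', position, length) or None."""
    pos = history.find(factor)
    if pos >= 0:
        return ('repeat', pos, len(factor))
    pos = history.find(reverse_complement(factor))
    if pos >= 0:
        return ('palindrome', pos, len(factor))
    return None  # no repetition found: fall back to literal coding

history = "ACGTTGCAAC"
print(find_repeat(history, "GTTG"))  # ('repeat', 2, 4)
print(find_repeat(history, "AACG"))  # ('palindrome', 1, 4): 'CGTT' occurs
print(find_repeat(history, "AAAA"))  # None -> order-2 arithmetic coding

Another algorithm is Cfact,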
which searches for the longest exact matching repeat 
using a suffix tree built over the entire sequence. It performs two passes: repetitions are encoded only when a gain is guaranteed; otherwise the two-bits-per-base (2-Bits) encoding is used. The GenCompress algorithm is a one-pass algorithm based on approximate matching, of which two variants exist: GenCompress-1 and GenCompress-2.
GenCompress-1 uses the Hamming distance (replacements only) for the repeats, while GenCompress-2 uses the edit distance (deletion, insertion and replacement) when encoding the repeats; the two distances are contrasted in the sketch below.
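The distinction matters because an approximate repeat that differs only in a few substituted bases can be described under the Hamming distance, whereas a repeat that has drifted through insertions or deletions requires the edit distance. A minimal sketch of our own (not the GenCompress code):

# Hamming distance: replacements only, defined for equal-length strings.
def hamming(a, b):
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Edit (Levenshtein) distance: deletions, insertions and replacements.
def edit(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # replacement/match
        prev = cur
    return prev[-1]

# A single inserted base keeps the edit distance at 1, a case the
# Hamming view (equal lengths, substitutions only) cannot describe.
print(hamming("ACGTACGT", "ACGAACGT"))  # 1: one replacement
print(edit("ACGTACGT", "ACGGTACGT"))    # 1: one insertion

The CTW+LZ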
algorithm is based on the context tree weighting 
method. It combines an LZ77-type method, as in GenCompress, with the CTW algorithm: long exact/approximate repeats are encoded by the LZ77-type method, while short repeats are encoded by CTW. Although it achieves good compression ratios, its execution time is too high for long sequences. DNASequitur is a grammar-based
compression algorithm for DNA sequences which 
infers a context-free grammar to represent the input 
data but fails to achieve better compression than 