arise from the large and rapidly increasing amount of available DNA sequences. Moreover, data compression now plays an even more important role in reducing the costs of data transmission, since DNA files are typically shared and distributed over the Internet in heterogeneous databases. A space-efficient representation of the data reduces the load on FTP service providers such as GenBank. Consequently, file transmissions complete faster, saving costs for the clients who access those files (Korodi and Tabus, 2005). Furthermore, modeling and analyzing DNA sequences may yield significant results; in particular, a good relatedness measure between sequences may lead to effective alignment and phylogenetic tree construction. In addition, the statistical significance of DNA sequences shows how sensitive the genome is to random changes, such as crossover and mutation, what the average composition of a sequence is, and where the important composition changes occur (Korodi and Tabus, 2005).
Additionally, DNA compression has been used to
distinguish between coding and non-coding regions
of a DNA sequence, to evaluate the “distance”
between DNA sequences and to quantify how “close” two organisms are in the evolutionary tree, and in other biological applications. On the other hand, standard text compression tools, such as compress, gzip and bzip2, cannot compress DNA sequences effectively, since the files they produce cost more than two bits per symbol, the rate of the trivial fixed-length encoding of the four-letter alphabet.
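As a point of reference, the two-bits-per-symbol baseline is simply the fixed-length encoding of the alphabet {A, C, G, T}. A minimal Python sketch (the function name is ours, for illustration only):

    # Naive baseline: fixed 2-bit code per base, four bases per byte.
    CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def pack_2bit(seq):
        out, byte = bytearray(), 0
        for i, base in enumerate(seq):
            byte = (byte << 2) | CODE[base]
            if i % 4 == 3:              # byte full after four bases
                out.append(byte)
                byte = 0
        if len(seq) % 4:                # flush a partial final byte
            out.append(byte << 2 * (4 - len(seq) % 4))
        return bytes(out)

    # 'ACGT' packs into the single byte 00 01 10 11 = 0x1B.
    assert pack_2bit("ACGT") == b'\x1b'

A DNA-specific compressor is therefore only worthwhile if it encodes at strictly fewer than two bits per symbol on average.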
Data compression is a process that reduces the size of data. One of the most crucial criteria for classifying compression algorithms is whether the algorithm removes parts of the data that cannot be recovered during decompression. Algorithms that permanently remove some parts of the data are called lossy, while the others are called lossless (Deorowicz, 2003).
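The lossless property means a round trip through the codec recovers the input exactly, which can be checked with a general-purpose compressor such as zlib (a minimal sketch):

    import zlib

    data = b"ACGTACGTACGT"
    packed = zlib.compress(data)
    assert zlib.decompress(packed) == data   # lossless: exact recovery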
In this paper, we propose a Lossless Compression Algorithm (LCA) consisting of three phases, which are discussed in detail in section 4. We use PatternHunter as part of the first phase to find approximate repeats and complementary palindromes. LCA introduces a new encoding methodology for DNA compression. The rest of this
paper is organized as follows: in section 2, we
survey different algorithms for DNA sequence
compression. In section 3, the difference between
BLAST and PatternHunter is explained, illustrating
why PatternHunter is used to detect approximate
repeats and complementary palindromes in the
proposed algorithm. In section 4, our proposed
algorithm is explained. Section 5 compares our results on a standard set of DNA sequences with published results for the most recent DNA compression algorithms. Finally, section 6 presents the conclusion and future work.
2 RELATED WORK
Several compression algorithms have been
developed, such as Biocompress (Grumbach and
Tahi, 1993), Biocompress-2 (Grumbach and Tahi,
Cfact (Rivals, Delahaye, Dauchet and Delgrange, 1996), GenCompress (Chen, Kwong and Li, 1999; Chen, Kwong and Li, 2001), CTW+LZ (Matsumoto, Sadakane and Imai, 2000), DNASequitur (Cherniavsky and Ladner, 2004), DNAPack (Behzadi and Fessant, 2004), and
LUT+LZ (Bao, Chen and Jing, 2005). The first two algorithms developed for compressing DNA sequences were Biocompress and its second version, Biocompress-2. They are similar to the Lempel-Ziv data compression method in the way they search the previously processed part of the sequence for repeats. Biocompress-2 performs compression in two steps: 1) detecting exact repeats and complementary palindromes located in the already encoded part of the sequence, and 2) encoding them by the repeat length and the position of a previous occurrence. If no significant repetition is found, Biocompress-2 falls back to order-2 arithmetic coding.
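To make the notion of a complementary palindrome concrete, the following minimal Python sketch (names are ours, not from Biocompress-2) shows the reverse-complement operation on which it rests:

    COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

    def reverse_complement(s):
        # e.g. reverse_complement('AACG') == 'CGTT'
        return ''.join(COMPLEMENT[b] for b in reversed(s))

    def is_complementary_palindrome(s):
        # A string equal to its own reverse complement,
        # e.g. the EcoRI restriction site 'GAATTC'.
        return s == reverse_complement(s)

    assert reverse_complement("AACG") == "CGTT"
    assert is_complementary_palindrome("GAATTC")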
Another algorithm is Cfact, which searches for the longest exact matching repeat
using a suffix tree on the entire sequence. By
performing two passes, repetitions are encoded when
the gain is guaranteed; otherwise the two-bits-per-base (2-Bits) encoding is used.
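The core subproblem, finding the longest exact repeat, can be sketched in a few lines of Python (a quadratic illustration with names of our choosing; Cfact itself uses a suffix tree over the whole sequence):

    def longest_repeat(s):
        # Longest substring occurring at least twice, found by sorting
        # all suffixes and comparing lexicographic neighbours.
        suffixes = sorted(s[i:] for i in range(len(s)))
        best = ""
        for a, b in zip(suffixes, suffixes[1:]):
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            if k > len(best):
                best = a[:k]
        return best

    assert longest_repeat("ATCGATCGA") == "ATCGA"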
The GenCompress algorithm is a one-pass algorithm based on approximate matching, of which two variants exist: GenCompress-1 and GenCompress-2. GenCompress-1 uses the Hamming distance (replacements only) to model the repeats, while GenCompress-2 uses the edit distance (deletion, insertion and replacement) for the encoding of the repeats.
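The difference between the two repeat models can be illustrated with a minimal Python sketch (function names are ours): the Hamming distance counts substitutions between equal-length strings, while the edit distance also allows insertions and deletions.

    def hamming(a, b):
        # Substitutions only (the GenCompress-1 repeat model).
        assert len(a) == len(b)
        return sum(x != y for x, y in zip(a, b))

    def edit_distance(a, b):
        # Deletion, insertion and replacement (the GenCompress-2 model),
        # computed with the standard one-row dynamic program.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (x != y)))   # replacement
            prev = cur
        return prev[-1]

    assert hamming("ACGT", "ACCT") == 1
    assert edit_distance("ACGT", "AGT") == 1   # one deletion suffices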
The CTW+LZ algorithm is based on the context tree weighting
method. It combines an LZ77-type method, similar to GenCompress, with the CTW algorithm: long exact/approximate repeats are encoded by the LZ77-type method, while short repeats are encoded by CTW. Although good compression ratios are obtained, the execution time is too high for long sequences. DNASequitur is a grammar-based
compression algorithm for DNA sequences which
infers a context-free grammar to represent the input
data but fails to achieve better compression than