7 CONCLUSIONS
In this paper, we propose an algorithm to compress the quality
score information stored in a FASTQ file. We investigate the
characteristics of a typical FASTQ file and, based on our
observations, search for a model that gives the lowest entropy on
quality score data. We combine this statistical model with
arithmetic coding to compress the quality score data as compactly
as possible. We compare its performance to that of text
compression utilities such as bzip2, gzip and ppmd, and to
existing compression algorithms for quality scores. We show that
our compression algorithm outperforms these utilities and
algorithms.
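
To illustrate what "lowest entropy" means when comparing candidate models, the following is a minimal sketch, not the model used in this paper. It estimates the empirical conditional entropy of quality symbols given the preceding `order` symbols of the same read. The function name and the choice of context are assumptions made for illustration. The value it returns is an idealized figure for a static model; an adaptive model driving an arithmetic coder also has to pay for learning the statistics.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(quality_strings, order):
    """Empirical entropy, in bits per symbol, of quality symbols
    conditioned on the previous `order` symbols of the same read.

    Hypothetical helper for illustration; the context definition
    (preceding symbols only) is an assumption, not the paper's model.
    """
    # Count how often each symbol follows each context.
    context_counts = defaultdict(Counter)
    for quality in quality_strings:
        for i, symbol in enumerate(quality):
            context = quality[max(0, i - order):i]
            context_counts[context][symbol] += 1

    total = sum(sum(c.values()) for c in context_counts.values())
    if total == 0:
        return 0.0

    # Sum -log2 p(symbol | context) over all observed symbols.
    bits = 0.0
    for counts in context_counts.values():
        n = sum(counts.values())
        for k in counts.values():
            bits -= k * math.log2(k / n)
    return bits / total
```

Evaluating this function for several values of `order` on the same set of quality strings shows how much a longer context reduces the estimated cost per symbol, which is the kind of comparison a model search relies on.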
Our study provides a lossless compression method for quality
scores. Researchers have claimed that lossy compression is
suitable for quality scores because most bioinformatics methods
require nucleotide sequences with high quality scores. However,
this claim does not hold under all conditions. For example, lossy
compression may cause incorrect SNP detection in low-coverage
parts of the genome.
While investigating FASTQ files from the Illumina HiSeq 2000, we
observed that nucleotide sequences produced from the same lane and
swath have similar quality score characteristics. This means that
reads located close to each other have a similar quality score
structure. In future work, we will exploit this observation for
quality score compression. Furthermore, we will consider the
compression of the other fields of FASTQ files and present a
complete compression system for FASTQ files.
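
As a rough sketch of how the lane and swath observation could be explored, the code below groups quality strings by the lane and swath of each read. It is not part of the proposed algorithm. It assumes read identifiers in the Casava 1.8 layout (`@instrument:run:flowcell:lane:tile:x:y`) and a four-digit HiSeq tile number whose second digit is the swath; files with other identifier layouts would need a different parser. Both function names are hypothetical.

```python
from collections import defaultdict


def lane_and_swath(header):
    """Return (lane, swath) parsed from a read identifier line.

    Assumed layout: @instrument:run:flowcell:lane:tile:x:y [comment]
    Assumed tile number: 4 digits = surface, swath, two-digit tile.
    """
    name = header.lstrip("@").split()[0]
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError("unexpected identifier layout: " + header)
    lane = int(fields[3])
    tile = fields[4]
    if len(tile) != 4 or not tile.isdigit():
        raise ValueError("unexpected tile number: " + tile)
    swath = int(tile[1])
    return lane, swath


def group_quality_by_lane_swath(records):
    """Group quality strings by (lane, swath).

    `records` is an iterable of (header, quality_string) pairs.
    """
    groups = defaultdict(list)
    for header, quality in records:
        groups[lane_and_swath(header)].append(quality)
    return dict(groups)
```

Each group could then be given its own statistics, or the reads could be reordered by group before coding, to check whether the observed similarity reduces the compressed size.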
BIOINFORMATICS 2013 - International Conference on Bioinformatics Models, Methods and Algorithms