Alternative PPM Model for Quality Score Compression

Mete Akgün, Mahmut Şamil Sağıroğlu

Abstract

Next Generation Sequencing (NGS) platforms generate header data and quality information for each nucleotide sequence. These platforms may produce gigabyte-scale datasets, and storing these datasets is one of the major bottlenecks of NGS technology. The information produced by NGS platforms is stored in FASTQ format. In this paper, we propose an algorithm to compress the quality score information stored in a FASTQ file. We search for a model that gives the lowest entropy on quality score data, and we combine this statistical model with arithmetic coding to compress the quality score data as compactly as possible. We compare its performance with general-purpose text compression utilities such as bzip2, gzip, and ppmd, and with existing compression algorithms designed for quality scores. We show that our compression algorithm outperforms both groups of tools.
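
To make the idea of "a model that gives the lowest entropy" concrete, the sketch below estimates the empirical conditional entropy of quality characters given the k preceding characters on the same line. It is only an illustration under stated assumptions, not the model proposed in the paper: the function name order_k_entropy and the sample quality lines are hypothetical, contexts are simply truncated at the start of each line, and the estimate is a static one that ignores the cost an adaptive coder pays while learning its statistics. An ideal arithmetic coder driven by such a model would need roughly this many bits per quality score.

```python
import math
from collections import Counter, defaultdict


def order_k_entropy(quality_lines, k):
    """Empirical conditional entropy, in bits per symbol, of each quality
    character given the k characters that precede it on the same line.

    Hypothetical helper for illustration; not the paper's model.
    """
    # Map each context string to a histogram of the symbols that follow it.
    context_counts = defaultdict(Counter)
    for line in quality_lines:
        for i, symbol in enumerate(line):
            # Contexts near the start of a line are shorter than k.
            context = line[max(0, i - k):i]
            context_counts[context][symbol] += 1

    total = sum(sum(counts.values()) for counts in context_counts.values())
    if total == 0:
        return 0.0

    bits = 0.0
    for counts in context_counts.values():
        n = sum(counts.values())
        for count in counts.values():
            # Each occurrence costs -log2 of its conditional probability.
            bits -= count * math.log2(count / n)
    return bits / total


if __name__ == "__main__":
    # Made-up quality strings in Phred+33 style, for demonstration only.
    sample = ["IIIIHHHGGF", "IIIHHHGGFF", "IIIIIHHGGF"]
    for k in range(4):
        print(k, round(order_k_entropy(sample, k), 3))
```

Comparing the estimate across context lengths, or across different choices of context, is one simple way to rank candidate models before attaching an arithmetic coder. On small inputs, longer contexts make this estimate look better than an adaptive coder could achieve, because the model statistics themselves are not charged for.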



Paper Citation


in Harvard Style

Akgün M. and Şamil Sağıroğlu M. (2013). Alternative PPM Model for Quality Score Compression. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013) ISBN 978-989-8565-35-8, pages 122-126. DOI: 10.5220/0004221601220126


in Bibtex Style

@conference{bioinformatics13,
author={Mete Akgün and Mahmut Şamil Sağıroğlu},
title={Alternative PPM Model for Quality Score Compression},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)},
year={2013},
pages={122-126},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004221601220126},
isbn={978-989-8565-35-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)
TI - Alternative PPM Model for Quality Score Compression
SN - 978-989-8565-35-8
AU - Akgün M.
AU - Şamil Sağıroğlu M.
PY - 2013
SP - 122
EP - 126
DO - 10.5220/0004221601220126