COMPUTATION OF THE NORMALIZED COMPRESSION DISTANCE OF DNA SEQUENCES USING A MIXTURE OF FINITE-CONTEXT MODELS

Diogo Pratas, Armando J. Pinho, Sara P. Garcia

Abstract

A compression-based similarity measure assesses the similarity between two objects using the number of bits needed to describe one of them when a description of the other is available. To be effective, these measures must rely on “normal” compression algorithms, which roughly means that the algorithms must be able to build an internal model of the data being compressed. In practice, good “normal” compression methods are often slow, while those that are fast do not provide acceptable results. In this paper, we propose a method for measuring the similarity of DNA sequences that balances these two goals, speed and compression quality. The method relies on a mixture of finite-context models and is compared with other methods, including XM, the state-of-the-art DNA compression technique. Moreover, we present a comprehensive study of the inter-chromosomal similarity of the human genome.
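As a rough illustration of the measure named in the title, the normalized compression distance is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size and xy is the concatenation of x and y. The sketch below is not the method proposed in the paper: it substitutes the general-purpose LZMA compressor from the Python standard library for the mixture of finite-context models, and the helper names (compressed_size, ncd, random_dna, mutate) and the synthetic test sequences are assumptions made only for this example.

```python
import lzma
import random


def compressed_size(data: bytes) -> int:
    """Size in bytes of the LZMA-compressed data; a stand-in for C(x)."""
    return len(lzma.compress(data, preset=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between x and y.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(n: int, seed: int) -> bytes:
    """Hypothetical helper: a uniformly random sequence over A, C, G, T."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") for _ in range(n))


def mutate(seq: bytes, rate: float, seed: int) -> bytes:
    """Hypothetical helper: redraw each base with probability `rate`.

    A redrawn base may happen to equal the original one.
    """
    rng = random.Random(seed)
    out = bytearray(seq)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.choice(b"ACGT")
    return bytes(out)


if __name__ == "__main__":
    a = random_dna(20000, seed=1)
    b = random_dna(20000, seed=2)       # unrelated to a
    c = mutate(a, rate=0.01, seed=3)    # close relative of a
    print("NCD(a, mutated a) =", round(ncd(a, c), 3))
    print("NCD(a, unrelated) =", round(ncd(a, b), 3))
```

With a compressor that models the data well, related sequences should give a value close to 0 and unrelated ones a value close to 1. A general-purpose compressor such as LZMA is only a convenient placeholder here; the quality of the resulting distance depends on how well the compressor captures the structure of DNA, which is the issue the abstract raises.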

Paper Citation


in Harvard Style

Pratas D., Pinho A. J. and Garcia S. P. (2012). COMPUTATION OF THE NORMALIZED COMPRESSION DISTANCE OF DNA SEQUENCES USING A MIXTURE OF FINITE-CONTEXT MODELS. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012), ISBN 978-989-8425-90-4, pages 308-311. DOI: 10.5220/0003780203080311


in Bibtex Style

@conference{bioinformatics12,
author={Diogo Pratas and Armando J. Pinho and Sara P. Garcia},
title={COMPUTATION OF THE NORMALIZED COMPRESSION DISTANCE OF DNA SEQUENCES USING A MIXTURE OF FINITE-CONTEXT MODELS},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)},
year={2012},
pages={308-311},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003780203080311},
isbn={978-989-8425-90-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)
TI - COMPUTATION OF THE NORMALIZED COMPRESSION DISTANCE OF DNA SEQUENCES USING A MIXTURE OF FINITE-CONTEXT MODELS
SN - 978-989-8425-90-4
AU - Pratas D.
AU - Pinho A. J.
AU - Garcia S. P.
PY - 2012
SP - 308
EP - 311
DO - 10.5220/0003780203080311