PROTEIN DOMAIN PHYLOGENIES - Information Theory and Evolutionary Dynamics

K. Hamacher


The ever-increasing wealth of whole-genome information calls for phylogenies based on entire genomes. Finding a good distance measure, however, is a major challenge, for example because of large-scale evolutionary events such as genomic rearrangements or inversions. We introduce an information-theoretic measure for the protein domain composition encoded in genomes, since protein domains are key evolutionary entities. The new method thus focuses on selectively advantageous events. Because evolving a different protein domain composition is more complex than accumulating single point mutations, the method makes longer evolutionary time scales accessible. To illustrate the methodology, we extract several phylogenetic trees for some 700 genomes, among them the separation of the three kingdoms of life, trees for mammals and Bacillales, and a speculative result for plants (monocotyledons and dicotyledons). The method is shown to be robust against incomplete genome sampling. It has a consistent interpretation both in information space, at the sequence/information level, and at the level of stochastic evolutionary dynamics. In contrast to established protocols, it becomes more accurate as more organisms are taken into account. Finally, we show its equivalence to a (simplified) model of the evolutionary dynamics of proteomes.
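
As a rough illustration of what a distance on protein domain composition could look like, the sketch below treats each genome as a list of domain labels, turns it into a frequency distribution, and compares two genomes by the square root of the Jensen-Shannon divergence. This is an assumption made for illustration only: the abstract does not state which information-theoretic measure the paper uses, and the function names and domain labels here are hypothetical.

```python
import math
from collections import Counter


def domain_distribution(domains):
    """Relative frequency of each domain label in one genome (hypothetical input format)."""
    counts = Counter(domains)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions given as dicts."""
    jsd = 0.0
    for label in set(p) | set(q):
        p_i = p.get(label, 0.0)
        q_i = q.get(label, 0.0)
        m_i = 0.5 * (p_i + q_i)  # mixture distribution
        if p_i > 0.0:
            jsd += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0.0:
            jsd += 0.5 * q_i * math.log2(q_i / m_i)
    return jsd


def domain_distance(genome_a, genome_b):
    """Square root of the JSD; an assumed stand-in for the paper's measure."""
    jsd = jensen_shannon_divergence(
        domain_distribution(genome_a), domain_distribution(genome_b)
    )
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding errors


# Made-up domain annotations for two toy genomes.
genome_a = ["kinase", "kinase", "SH3", "zinc_finger"]
genome_b = ["kinase", "SH3", "SH3", "helicase"]
print(domain_distance(genome_a, genome_b))
```

With base-2 logarithms this distance lies between 0 (identical domain compositions) and 1 (no shared domains). A matrix of such pairwise distances could then be passed to a standard distance-based tree-building method.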



Paper Citation

in Harvard Style

Hamacher K. (2010). PROTEIN DOMAIN PHYLOGENIES - Information Theory and Evolutionary Dynamics. In Proceedings of the First International Conference on Bioinformatics - Volume 1: BIOINFORMATICS, (BIOSTEC 2010) ISBN 978-989-674-019-1, pages 114-122. DOI: 10.5220/0002710101140122

in Bibtex Style

author={K. Hamacher},
title={PROTEIN DOMAIN PHYLOGENIES - Information Theory and Evolutionary Dynamics},
booktitle={Proceedings of the First International Conference on Bioinformatics - Volume 1: BIOINFORMATICS, (BIOSTEC 2010)},

in EndNote Style

JO - Proceedings of the First International Conference on Bioinformatics - Volume 1: BIOINFORMATICS, (BIOSTEC 2010)
TI - PROTEIN DOMAIN PHYLOGENIES - Information Theory and Evolutionary Dynamics
SN - 978-989-674-019-1
AU - Hamacher K.
PY - 2010
SP - 114
EP - 122
DO - 10.5220/0002710101140122