Authors:
Morihiro Hayashida 1; Hitoshi Koyano 2 and Jose C. Nacher 3
Affiliations:
1 Department of Electrical Engineering and Computer Science, National Institute of Technology, Matsue College, Matsue, Shimane, Japan
2 School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
3 Department of Information Science, Faculty of Science, Toho University, Funabashi, Chiba, Japan
Keyword(s):
Grammar-based Compression, Kolmogorov Complexity, Protein Domain Combination.
Abstract:
Revealing the evolution of organisms is an important topic in biological research and is also useful for understanding the origin of organisms. To this end, genomic sequences have been compared and aligned to find conserved and functional regions. A protein can contain several domains, which are known as structural and functional units. In previous work, a proteome, that is, the whole set of proteins in an organism, was regarded as a set of sequences of protein domains, and a grammar-based compression algorithm was developed for proteomes, in which the production rules of the grammar represent the evolutionary processes of mutation and duplication. In this paper, we propose a similarity measure based on this grammar-based compression and apply it to hierarchical clustering of seven organisms: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, and Escherichia coli. The results suggest that our similarity measure can classify the organisms very well.
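The general idea of a compression-based similarity measure followed by hierarchical clustering can be sketched as below. This is only an illustration under stated assumptions and not the measure proposed in the paper: the grammar-based compressor is replaced by zlib, the similarity is the normalized compression distance (NCD), the clustering uses average linkage, and the domain identifiers and the proteomes A, B, and C are made-up toy data. The helper names (compressed_size, ncd, encode_proteome, average_linkage) are hypothetical.

```python
import zlib
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input (stand-in for a grammar-based compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def encode_proteome(proteome):
    """Serialize a proteome (list of proteins, each a list of domain IDs) to bytes.

    Proteins are sorted so that the encoding does not depend on input order.
    """
    return "\n".join(" ".join(p) for p in sorted(proteome)).encode()


def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    dist maps frozenset({a, b}) to the distance between items a and b.
    Returns the merges in order as (cluster_1, cluster_2, distance).
    """
    def avg(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], avg(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Toy proteomes with invented domain IDs (assumption, not real data).
# B shares most proteins with A; C uses a disjoint set of domain IDs.
A = [[f"PF{(i * 7 + j) % 40:05d}" for j in range(1 + i % 3)] for i in range(40)]
B = A[:36] + [[f"PF{100 + i:05d}"] for i in range(4)]
C = [[f"PF{(i * 11 + j) % 40 + 500:05d}" for j in range(1 + i % 3)] for i in range(40)]

encoded = {"A": encode_proteome(A), "B": encode_proteome(B), "C": encode_proteome(C)}
dist = {
    frozenset((a, b)): ncd(encoded[a], encoded[b])
    for a, b in combinations(encoded, 2)
}
merges = average_linkage(list(encoded), dist)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

In this toy setting the two proteomes that share most of their domain sequences should be merged before the unrelated one. A general-purpose compressor such as zlib does not model domain-level mutation and duplication, which is what the production rules of the grammar in the paper are designed to capture.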