Figure 2: Heat map of the inter-chromosomal similarity of
the human genome. The bar on the right indicates the
strength of similarity (highest intensity at the top). The
x and y axes represent chromosomes.
out: the sex chromosomes (X-Y) have the largest
similarity among all chromosomes; among the autosomes,
the largest similarity is between chromosomes 18 and 21;
chromosomes 12, 18 and X have the overall chromosomal
relation; there are relevant similarities in the following
pairs: 3/4, 5/6, 11/12, 17/20 and 18/21.
4 CONCLUSIONS
We have developed a method for computing the normalized
compression distance based on a mixture of finite-context
models. We have shown that this method is, on average,
better than the state-of-the-art XM on large and not very
similar sequences (the human genome, for example).
Moreover, the time required to accomplish the task is much
lower than with the XM approach. Using the proposed method,
we have also studied the similarity between chromosomes of
the human genome, revealing several notable similarities
among these chromosomes.
In the future, we intend to create a hybrid solution
combining the copy expert and the mixture of finite-context
models, since these two methods proved to be effective and
complementary.
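As a rough illustration of the quantity discussed in these conclusions, the sketch below computes the normalized compression distance in its usual form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size. It is only a sketch under stated assumptions: zlib is used as a stand-in compressor for brevity and is not the mixture of finite-context models developed in this paper, and the names compressed_size, ncd and pairwise_ncd are hypothetical helpers introduced for this example.

```python
import zlib
from itertools import combinations


def compressed_size(data):
    """Return the size in bytes of `data` after zlib compression.

    Assumption: zlib stands in for the compressor; the paper itself
    uses a mixture of finite-context models instead.
    """
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


def pairwise_ncd(seqs):
    """Map each unordered pair of sequence names to its NCD.

    `seqs` is a dict from a name (e.g. a chromosome label) to bytes.
    """
    return {
        (a, b): ncd(seqs[a], seqs[b])
        for a, b in combinations(sorted(seqs), 2)
    }
```

Values near 0 suggest that one sequence is largely predictable from the other, while values near 1 suggest little shared information. A table such as the one returned by pairwise_ncd is the kind of input a similarity heat map can be drawn from, although a general-purpose compressor like zlib is not expected to be competitive on chromosome-sized DNA sequences.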
ACKNOWLEDGEMENTS
This work was supported in part by the grant
with the COMPETE reference FCOMP-01-0124-
FEDER-010099 (FCT reference PTDC/EIA-EIA/
103099/2008). Sara P. Garcia acknowledges funding
from the European Social Fund and the Portuguese
Ministry of Education.
REFERENCES
Bennett, C. H., Gács, P., Li, M., Vitányi, P. M. B., and Zurek,
W. H. (1998). Information distance. IEEE Trans. on
Information Theory, 44(4):1407–1423.
Cao, M. D., Dix, T. I., Allison, L., and Mears, C. (2007).
A simple statistical algorithm for biological sequence
compression. In Proc. of DCC-2007, pages 43–52,
Snowbird, Utah.
Chaitin, G. J. (1966). On the length of programs for com-
puting finite binary sequences. Journal of the ACM,
13:547–569.
Cilibrasi, R. and Vitányi, P. M. B. (2005). Clustering by
compression. IEEE Trans. on Information Theory,
51(4):1523–1545.
Dix, T. I., Powell, D. R., Allison, L., Bernal, J., Jaeger,
S., and Stern, L. (2007). Comparative analysis of
long DNA sequences by per element information con-
tent using different contexts. BMC Bioinformatics,
8(Suppl. 2):S10.
Gordon, G. (2003). Multi-dimensional linguistic complex-
ity. Journal of Biomolecular Structure & Dynamics,
20(6):747–750.
Kolmogorov, A. N. (1965). Three approaches to the quanti-
tative definition of information. Problems of Informa-
tion Transmission, 1(1):1–7.
Lempel, A. and Ziv, J. (1976). On the complexity of fi-
nite sequences. IEEE Trans. on Information Theory,
22(1):75–81.
Li, M., Chen, X., Li, X., Ma, B., and Vitányi, P. M. B.
(2004). The similarity metric. IEEE Trans. on Infor-
mation Theory, 50(12):3250–3264.
Pinho, A. J., Pratas, D., and Ferreira, P. J. S. G. (2011a).
Bacteria DNA sequence compression using a mixture
of finite-context models. In Proc. of the IEEE Work-
shop on SSP, Nice.
Pinho, A. J., Pratas, D., Ferreira, P. J. S. G., and Garcia,
S. P. (2011b). Symbolic to numerical conversion of
DNA sequences using finite-context models. In Proc.
of EUSIPCO-2011, Barcelona.
Pratas, D. and Pinho, A. J. (2011). Compressing the human
genome using exclusively Markov models. In PACBB
2011, vol 93, pages 213–220.
Solomonoff, R. J. (1964). A formal theory of inductive in-
ference. Part I and II. Information and Control, 7(1
and 2):1–22 and 224–254.
Zhao, G., Perepelov, A. V., Senchenkova, et al. (2007).
Structural relation of the antigenic polysaccharides of
E. coli O40, S. dysenteriae type 9, and E. coli K47.
Carbohydrate Research, 342(9):1275–1279.