# The Distribution of Short Word Match Counts between Markovian Sequences

### Conrad J. Burden, Paul Leopardi, Sylvain Forêt

#### Abstract

The D2 statistic, which counts the number of word matches between two given sequences, has long been proposed as a measure of similarity for biological sequences. Much of the mathematically rigorous work carried out to date on the properties of the D2 statistic has been restricted to the case of ‘Bernoulli’ sequences composed of independent and identically distributed letters. Here the properties of the distribution of this statistic are studied for the biologically more realistic case of Markovian sequences. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, which enables exact analytic formulae for the mean and variance to be derived. The formulae are confirmed using numerical simulations, and asymptotic approximations to the full distribution are tested.

#### Paper Citation

#### in Harvard Style

Burden C. J., Leopardi P. and Forêt S. (2013). **The Distribution of Short Word Match Counts between Markovian Sequences**. In *Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)*, ISBN 978-989-8565-35-8, pages 25-33. DOI: 10.5220/0004203700250033

#### in Bibtex Style

@conference{bioinformatics13,

author={Conrad J. Burden and Paul Leopardi and Sylvain Forêt},

title={The Distribution of Short Word Match Counts between Markovian Sequences},

booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)},

year={2013},

pages={25-33},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0004203700250033},

isbn={978-989-8565-35-8},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)

TI - The Distribution of Short Word Match Counts between Markovian Sequences

SN - 978-989-8565-35-8

AU - Burden C. J.

AU - Leopardi P.

AU - Forêt S.

PY - 2013

SP - 25

EP - 33

DO - 10.5220/0004203700250033