5 CONCLUSIONS
This paper introduces the concept of periodic bound-
ary conditions for Markovian sequences as an ele-
gant mathematical construct which avoids the incon-
venience of boundary effects in analytic calculations.
We have demonstrated that the mean and variance of
the D2 word match statistic can be calculated analytically
and readily computed to any desired accuracy
through formulae involving only traces of products
of matrices. Calculation of the mean and variance
is fast as powers of Hadamard products need only
be calculated once for a given Markovian model, and
only need to be calculated up to the point of conver-
gence. For biological applications such as measuring
sequence similarity or identifying regions of regulatory
motifs, sequence lengths tend to be at least
a few hundred letters. In these cases the loss of information
about boundary effects is unlikely to be a serious
impediment. For instance, in previous studies
a database of cis-regulatory modules modelled as a set of
i.i.d. sequences was successfully studied using the D2
statistic simply by imposing PBCs on the sequences
prior to calculating D2 (Forêt et al., 2009a; Burden
et al., 2012).
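As an illustration of what imposing PBCs prior to calculating D2 can look like in practice, the following is a minimal sketch in Python. It assumes the usual definition of D2 as the number of k-word matches between two sequences, that is, the sum over all words of the product of the two word counts. The function names cyclic_word_counts and d2_pbc are hypothetical and are not taken from the studies cited above.

```python
from collections import Counter


def cyclic_word_counts(seq, k):
    """Count the k-words of seq, treating seq as circular (PBCs).

    With periodic boundary conditions every one of the len(seq)
    positions starts a k-word, so the counts always sum to len(seq).
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    # Append the first k - 1 letters so that words wrap around the end.
    extended = seq + seq[:k - 1]
    return Counter(extended[i:i + k] for i in range(n))


def d2_pbc(seq_a, seq_b, k):
    """D2 with PBCs: sum over words w of count_a(w) * count_b(w)."""
    counts_a = cyclic_word_counts(seq_a, k)
    counts_b = cyclic_word_counts(seq_b, k)
    # Words absent from seq_b contribute zero (Counter returns 0).
    return sum(c * counts_b[w] for w, c in counts_a.items())
```

For example, d2_pbc("ACGT", "ACGT", 2) counts the four cyclic 2-words AC, CG, GT and TA once each in both sequences and so returns 4.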
The current work is a preliminary study designed
to illustrate the computational effectiveness of imposing
periodic boundary conditions when calculating
the D2 statistic. In ongoing work we are testing
the agreement between the theoretical Markovian
distributions studied herein and empirical distribu-
tions from genomic DNA. In general, we find that the
empirical distribution tends to have heavier left and
right tails, suggesting the existence of a subset of k-
mers which are over- or under-represented within the
genomes studied.
Further work also needs to be done on extending
the results to more viable variants of the D2 statistic.
It has been argued that a potential shortcoming of the
D2 statistic is that the signal of sequence similarity
one is trying to detect may be hidden by its variability
due to noise in each of the single sequences, and that
to overcome this problem one should instead calculate
a ‘centred’ version of D2 in which word count
vectors are replaced with those centred about their
mean (Lippert et al., 2002; Reinert et al., 2009). There
also exist ‘standardised’ versions of D2 (Liu et al.,
2011; Göke et al., 2012) designed to account for biases
arising from the fact that some words are naturally
over-represented, and ‘weighted’ versions (Jing
et al., 2011) designed to account for higher substitution
rates of chemically similar amino acids in protein
sequences. Extension of the mathematical formalisms
developed herein to these D2 variants, as well as a
more complete study of the accuracy of approximating
p-values with asymptotic distributions, will be the
subject of future work.
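To make the idea of a ‘centred’ statistic concrete, the sketch below subtracts from each word count its expected value before forming the sum of products. It is an illustration only, under two loudly stated assumptions: letters are i.i.d. with known probabilities (not the Markovian model of this paper), and PBCs are imposed, so that the expected count of a word w in a sequence of length n is n times the product of its letter probabilities. The names centred_d2_pbc and letter_probs are hypothetical, and the code is not claimed to reproduce the exact definitions used in the cited works.

```python
from collections import Counter
from itertools import product
from math import prod  # available from Python 3.8


def cyclic_word_counts(seq, k):
    """Count the k-words of seq, treating seq as circular (PBCs)."""
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    extended = seq + seq[:k - 1]  # wrap around the end of the sequence
    return Counter(extended[i:i + k] for i in range(n))


def centred_d2_pbc(seq_a, seq_b, k, letter_probs):
    """Sum over all k-words of (X_w - E[X_w]) * (Y_w - E[Y_w]).

    Assumes i.i.d. letters with probabilities letter_probs, so that
    under PBCs E[X_w] = len(seq) * product of the letter probabilities.
    Words containing letters outside letter_probs are ignored.
    """
    counts_a = cyclic_word_counts(seq_a, k)
    counts_b = cyclic_word_counts(seq_b, k)
    n_a, n_b = len(seq_a), len(seq_b)
    total = 0.0
    for letters in product(sorted(letter_probs), repeat=k):
        word = "".join(letters)
        p_word = prod(letter_probs[c] for c in letters)
        total += (counts_a[word] - n_a * p_word) * (counts_b[word] - n_b * p_word)
    return total
```

The loop visits every one of the |alphabet|^k possible words, including words that occur in neither sequence, because such words still contribute the product of their two expected counts. This is practical only for small k.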
ACKNOWLEDGEMENTS
This work was funded in part by ARC Discovery
grants DP0987298 and DP120101422.
REFERENCES
Burden, C. J., Jing, J., and Wilson, S. R. (2012). Alignment-
free sequence comparison for biologically realistic se-
quences of moderate length. Statistical Applications
in Genetics and Molecular Biology, 11(1):Article 3.
Burden, C. J., Kantorovitz, M. R., and Wilson, S. R. (2008).
Approximate word matches between two random se-
quences. Annals of Applied Probability, 18(1):1–21.
Chor, B., Horn, D., Goldman, N., Levy, Y., and Massing-
ham, T. (2009). Genomic DNA k-mer spectra: models
and modalities. Genome Biology, 10:R108.
Forêt, S., Kantorovitz, M. R., and Burden, C. J. (2006).
Asymptotic behaviour and optimal word size for ex-
act and approximate word matches between random
sequences. BMC Bioinformatics, 7 Suppl 5:S21.
Forêt, S., Wilson, S. R., and Burden, C. J. (2009a). Characterizing
the D2 statistic: Word matches in biological
sequences. Stat. Appl. Genet. Mo. B., 8(1):Article 43.
Forêt, S., Wilson, S. R., and Burden, C. J. (2009b). Empirical
distribution of k-word matches in biological sequences.
Pattern Recogn., 42:539–548.
Göke, J., Schulz, M., Lasserre, J., and Vingron, M. (2012).
Estimation of pairwise sequence similarity of mam-
malian enhancers with word neighbourhood counts.
Bioinformatics, 28(5):656–663.
Jing, J., Wilson, S. R., and Burden, C. J. (2011). Weighted
k-word matches: A sequence comparison tool for proteins.
ANZIAM J., to appear.
Kantorovitz, M. R., Booth, H. S., Burden, C. J., and Wilson,
S. R. (2006). Asymptotic behavior of k-word matches
between two uniformly distributed sequences. J. Appl.
Probab., 44:788–805.
Kantorovitz, M. R., Robinson, G. E., and Sinha, S. (2007).
A statistical method for alignment-free comparison of
regulatory sequences. Bioinformatics, 23(13):i249–
55.
Lippert, R. A., Huang, H., and Waterman, M. S. (2002).
Distributional regimes for the number of k-word
matches between two random sequences. Proc. Natl.
Acad. Sci. USA, 99(22):13980–9.
Liu, X., Wan, L., Li, J., Reinert, G., Waterman, M. S., and
Sun, F. (2011). New powerful statistics for alignment-
free sequence comparison under a pattern transfer
model. J. Theoret. Biol., 284:106–116.
BIOINFORMATICS 2013 - International Conference on Bioinformatics Models, Methods and Algorithms