This paper introduces the concept of periodic bound-
ary conditions for Markovian sequences as an ele-
gant mathematical construct which avoids the incon-
venience of boundary effects in analytic calculations.
We have demonstrated that the mean and variance of
the D2 word match statistic can be calculated analytically
and readily computed to any desired accuracy
through formulae involving only traces of products
of matrices. Calculation of the mean and variance
is fast as powers of Hadamard products need only
be calculated once for a given Markovian model, and
only need to be calculated up to the point of conver-
gence. For biological applications such as measuring
sequence similarity or identifying regions of regulatory
motifs, sequence lengths tend to be at least
a few hundred letters. In these cases loss of information
about boundary effects is unlikely to be a serious
impediment. For instance, in previous studies a database
of cis-regulatory modules modelled as a set of i.i.d.
sequences was successfully studied using the D2
statistic simply by imposing PBCs on the sequences
prior to calculating the D2 statistic (Forêt et al., 2009a;
Burden et al., 2012).
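As a minimal sketch of what imposing PBCs means in practice, the Python
below counts k-words on a sequence treated as circular, so that a sequence
of length n contributes exactly n words, and then forms the D2 word match
count as the inner product of the two word count vectors. The function
names cyclic_kmer_counts and d2_periodic are illustrative assumptions and
are not taken from the cited studies.

```python
from collections import Counter


def cyclic_kmer_counts(seq, k):
    """Count k-words in seq, treating the sequence as circular (PBCs)."""
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    # Append the first k-1 letters so that words starting near the end wrap around.
    wrapped = seq + seq[:k - 1]
    return Counter(wrapped[i:i + k] for i in range(n))


def d2_periodic(seq_a, seq_b, k):
    """D2 = sum over words w of (count of w in A) * (count of w in B)."""
    counts_a = cyclic_kmer_counts(seq_a, k)
    counts_b = cyclic_kmer_counts(seq_b, k)
    # Words absent from B contribute zero, since a Counter returns 0 for missing keys.
    return sum(c * counts_b[w] for w, c in counts_a.items())
```

One consequence of the wrap-around is that the count is unchanged when
either sequence is cyclically rotated: d2_periodic("ACGT", "TACG", 2)
gives the same value as d2_periodic("ACGT", "ACGT", 2).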
The current work is a preliminary study designed
to illustrate the computational effectiveness of imposing
periodic boundary conditions when calculating
the D2 statistic. In ongoing work we are testing
the agreement between the theoretical Markovian
distributions studied herein and empirical distribu-
tions from genomic DNA. In general, we find that the
empirical distribution tends to have heavier left and
right tails, suggesting the existence of a subset of k-
mers which are over- or under-represented within the
genomes studied.
Further work also needs to be done on extending
the results to more viable variants of the D2 statistic.
It has been argued that a potential shortcoming of the
D2 statistic is that the signal of sequence similarity
one is trying to detect may be hidden by its variability
due to noise in each of the single sequences, and that
to overcome this problem one should instead calculate
a ‘centred’ version of D2 in which word count
vectors are replaced with those centred about their
mean (Lippert et al., 2002; Reinert et al., 2009). There
also exist ‘standardised’ versions of D2 (Liu et al.,
2011; Göke et al., 2012) designed to account for biases
arising from the fact that some words are naturally
over-represented, and ‘weighted’ versions (Jing
et al., 2011) designed to account for higher substitu-
tion rates of chemically similar amino acids in protein
sequences. Extension of the mathematical formalisms
developed herein to these D2 variants, as well as a
more complete study of the accuracy of approximating
p-values with asymptotic distributions, will be the
subject of future work.
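To make the idea of a centred variant concrete, the following Python is a
rough sketch only. It assumes an i.i.d. letter model (not the Markovian
model treated in this paper) together with PBCs, under which the expected
count of a word w in a sequence of length n is n times the product of its
letter probabilities. The name centred_d2_periodic and the letter_probs
argument are hypothetical, and the cited papers may define or normalise
their centred and standardised statistics differently.

```python
from collections import Counter
from itertools import product
from math import prod


def cyclic_kmer_counts(seq, k):
    """Count k-words in seq, treating the sequence as circular (PBCs)."""
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    wrapped = seq + seq[:k - 1]
    return Counter(wrapped[i:i + k] for i in range(n))


def centred_d2_periodic(seq_a, seq_b, k, letter_probs):
    """Sum over all k-words w of (X_w - m*p_w) * (Y_w - n*p_w).

    Assumption: letters are i.i.d. with probabilities letter_probs, so
    that p_w is the product of the probabilities of the letters of w.
    """
    counts_a = cyclic_kmer_counts(seq_a, k)
    counts_b = cyclic_kmer_counts(seq_b, k)
    m, n = len(seq_a), len(seq_b)
    total = 0.0
    # Every word over the alphabet contributes, including words with zero count.
    for letters in product(letter_probs, repeat=k):
        word = "".join(letters)
        p_w = prod(letter_probs[c] for c in letters)
        total += (counts_a[word] - m * p_w) * (counts_b[word] - n * p_w)
    return total
```

Because the sum runs over every word in the alphabet, the cost grows as the
alphabet size to the power k, so this direct form is only practical for
short words.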
This work was funded in part by ARC Discovery
grants DP0987298 and DP120101422.
