with the diagonal elements set so that each row sums
to zero. This is a 5 parameter model, which, for sim-
plicity we will reduce to a 3-parameter model by as-
suming symmetry between the nucleotides A and T
and between the nucleotides C and G, that is, we set
π
A
= π
T
and π
C
= π
G
.
The stationary distribution of the matrix P
ij
de-
fined by Equation (3) for the HKY matrix with param-
eters α = 0.2, β = 0.5, π
A
= π
T
= 0.2 and π
C
= π
G
=
0.3 for an effective population size M = 30 is illus-
trated in Figure 2. Equation (1) implies that the argu-
ment of the stationary distribution φ(i
A
,i
C
,i
G
,i
T
;∞)
lies on a simplex which, for a 4-letter alphabet, can
be represented as a tetrahedron. In Figure 2 the distri-
bution is represented by a small sphere at each set of
integer coordinates, the volume of each sphere being
proportional to the probability mass function at that
coordinate.
Figure 2: Stationary distribution of allele frequencies for the
HKY model with parameters α = 0.2, β = 0.5, π
A
= π
T
=
0.2 and π
C
= π
G
= 0.3. The corners labelled A, C, G and
T correspond to the coordinates i = (1,0,0,0), (0,1,0, 0),
(0,0, 1,0) and (0,0, 0,1) respectively, and the volume of the
sphere at each coordinate point is proportional to the prob-
ability mass function.
The distribution is dominated by the corners of the
tetrahedron, indicating that the majority of genomic
sites are not SNPs. Edges of the tetrahedron corre-
spond to 2-allele SNPs, the interiors of the four faces
correspond to 3-allele SNPs and the interior volume
of the tetrahedron corresponds to 4-allele SNPs. As
is observed in real genomes, 3- and 4-allele SNPs are
extremely rare. For illustrative purposes the parame-
ter α has been chosen larger than what one might ex-
pect in nature by a couple of orders of magnitude to
ensure the SNP probabilities are visible in the figure.
The stationary distribution along the edges for each
of the six possible 2-allele Figure 3 SNPs is plotted in
Figure 3.
In the above example, the small effective popu-
lation size M = 30 was chosen to enable a numeri-
cally tractable solution. For more realistic values of M
the size of the matrix P
ij
grows asymptotically as M
4
.
However, this does not present a problem as is should
be feasible to develop a continuum Fokker-Planck
equation analogous to Equation (5) with a tactible so-
lution analogous to Equations (6) to (8) from which
to calculate a log-likelihood.
6 DISCUSSION AND
CONCLUSIONS
We have proposed an approach to estimating mutation
rates from observed allele frequencies across a popu-
lation at any set of independent genomic sites which
are believed not to be susceptible to the effects of nat-
ural selection. Our numerical estimates based on the
canonical textbook Wright-Fisher model of popula-
tion genetics suggest that reasonable estimates of mu-
tation rates can be obtained from as few as ∼ 10
4
such
sites.
The next step in this analysis is the technical prob-
lem of extending the continuum Fokker-Planck equa-
tion from the 2-letter genomic alphabet described in
Section 3 to an analogous equation defined over the
higher dimensional simplex relevant to the 4-letter
genomic alphabet described in Section 5, and solv-
ing to find the steady state solution. This should be
straightforward, at least for the Wright-Fisher model,
and will enable maximum likelihood estimates of evo-
lutionary rate matrices specified by parameterisations
such as the HKY model. Going beyond Wright-Fisher
to deal with species which are not diploid and mo-
noecious will presumably not present insurmountable
challenges provided the appropriate functions a(θ)
and b(θ) analogous to those occurring in Equation (5)
can be modelled.
An issue not addressed here at any level of rigour
is that of context-dependence. As mentioned in the
introduction, neutral mutation rates at a given site are
known to be subject to the nucleotide content of flank-
ing bases. We have assumed that the set of indepen-
dent sites used to estimate rates are simply chosen to
share the same context. However, this ignores the fact
that the flanking bases may themselves mutate, and
do so over timescales similar to the mutation rates we
seek to estimate. We end on a note of caution that
developing a non-local model which takes this into
account may prove to be a formidable mathematical
problem.
CanEvolutionaryRateMatricesbeEstimatedfromAlleleFrequencies?
187