reelin, which is an essential protein involved in the
development of the six-layer cortex of the human
brain. Fractal analysis was applied to the HAR1
nucleotide sequence and the homologous sequence
in the chimpanzee genome. Analysis shows that the
differences in fractal dimension can be used as a
marker of evolution. The 118-bp HAR1 region contains
18 point substitutions between human and chimpanzee,
an evolutionary span of 5 million years. However, the
same 118-bp region contains only two point
substitutions between chicken and chimpanzee, a span
of 300 million years. The implications of evolution and
positive selection have been discussed in recent
literature (Pollard, et al, 2006b).
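The contrast between the two substitution counts quoted above can be put on a common scale with a little arithmetic. The sketch below is only a rough illustration: it treats each quoted span as simple elapsed time and applies no correction for multiple substitutions at one site or for change along both lineages. The function name and the per-site, per-million-year unit are assumptions made for this example.

```python
# Figures quoted in the text for the 118-bp HAR1 region.
REGION_LENGTH_BP = 118


def substitutions_per_site_per_myr(substitutions, span_myr,
                                   length_bp=REGION_LENGTH_BP):
    """Crude rate: substitutions / (sites x millions of years)."""
    return substitutions / (length_bp * span_myr)


# Human vs. chimpanzee: 18 substitutions over 5 million years.
human_chimp = substitutions_per_site_per_myr(18, 5)

# Chicken vs. chimpanzee: 2 substitutions over 300 million years.
chicken_chimp = substitutions_per_site_per_myr(2, 300)

# Ratio of the two crude rates (the region length cancels out).
fold_change = human_chimp / chicken_chimp

print(f"human-chimp rate:   {human_chimp:.3e} per site per Myr")
print(f"chicken-chimp rate: {chicken_chimp:.3e} per site per Myr")
print(f"fold change:        {fold_change:.0f}")
```

Under these simplifying assumptions the two rates differ by a factor of (18/5)/(2/300) = 540.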
2 MATERIALS & METHODS
The nucleotide sequences were downloaded from
Genbank. The accession numbers of the mtDNA
database are listed in the Appendix. The studied
primates are human, Neanderthal, chimp (chimp and
pygmy chimp), gorilla (western and western
lowland), and orangutan (Bornean and Sumatran).
The ATCG sequence was converted to a
numerical sequence by assigning to each nucleotide
the total atomic number (the total number of protons)
of its base: A(70), T(66), C(58), G(78). The assigned
number is roughly proportional to the nucleotide mass.
This assignment is consistent with the recently reported
mass fractal analysis of a ribosome sequence (Lee
2006). The A-T and C-G pairs in double-stranded
DNA thus have the same value of 136.
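The conversion described above can be sketched in a few lines. The mapping values are the ones given in the text; the names `PROTON_COUNT` and `encode_sequence` are hypothetical, and the handling of ambiguous bases (an error is raised) is an assumption, since the text does not say how such symbols were treated.

```python
# Values from the text: total proton count assigned to each nucleotide.
PROTON_COUNT = {"A": 70, "T": 66, "C": 58, "G": 78}


def encode_sequence(sequence):
    """Convert an ATCG string into a list of integers.

    Assumption: any symbol other than A, T, C or G (for example N)
    raises a ValueError rather than being skipped or imputed.
    """
    encoded = []
    for position, base in enumerate(sequence.upper()):
        if base not in PROTON_COUNT:
            raise ValueError(f"unexpected symbol {base!r} at {position}")
        encoded.append(PROTON_COUNT[base])
    return encoded


signal = encode_sequence("ATCGGA")
print(signal)  # [70, 66, 58, 78, 78, 70]

# Both Watson-Crick pairs sum to the same value, as noted in the text.
print(PROTON_COUNT["A"] + PROTON_COUNT["T"],
      PROTON_COUNT["C"] + PROTON_COUNT["G"])  # 136 136
```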
Fractal dimension analysis can be used in the
study of correlated randomness. Among the various
fractal dimension methods, the Higuchi fractal
method is well suited for studying signal fluctuation
(Higuchi, 1998). The signal from the sequence
represents a random spatial intensity series. The
spatial intensity (Int) random series with equal
intervals could be used to generate a difference
series (Int(j)-Int(i)) for different lags in the spatial
variable. The non-normalized apparent length of the
spatial series curve is simply L(k) = Σ |Int(j) - Int(i)|,
summed over all pairs of points separated by the lag
(j - i) = k, with the starting offset running from 1 to k.
The number of terms in a k-series varies, and
normalization must be used to get the series length.
If Int(i) is a fractal function, then a plot of log(L(k))
versus log(1/k) should be a straight line with the
slope equal to the fractal dimension. Higuchi
incorporated a calibration division step (divide by k)
such that the maximum theoretical value is
calibrated to the topological value of 2. The detailed
calculation is given in the literature (Higuchi, 1998).
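As an illustration of the procedure just outlined, the following is a minimal sketch of a Higuchi-style estimator written with NumPy. It is not the code used in this study: the function name `higuchi_fd`, the default `k_max=10`, and the use of a least-squares fit over all lags are assumptions made for the example.

```python
import numpy as np


def higuchi_fd(series, k_max=10):
    """Estimate the Higuchi fractal dimension of a 1-D series.

    Assumptions: k_max is a free parameter, and the series is not
    constant (a constant series gives zero length and log(0)).
    """
    x = np.asarray(series, dtype=float)
    n = x.size
    log_inv_k, log_length = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):                  # starting offsets 0 .. k-1
            idx = np.arange(m, n, k)        # positions m, m+k, m+2k, ...
            n_steps = idx.size - 1          # number of lag-k differences
            if n_steps < 1:
                continue
            raw = np.abs(np.diff(x[idx])).sum()
            # Normalization for the varying number of terms,
            # followed by the division by k described in the text.
            norm = (n - 1) / (n_steps * k)
            lengths.append(raw * norm / k)
        log_inv_k.append(np.log(1.0 / k))
        log_length.append(np.log(np.mean(lengths)))
    # Slope of log L(k) versus log(1/k) is the dimension estimate.
    slope, _intercept = np.polyfit(log_inv_k, log_length, 1)
    return float(slope)


# Sanity checks: a straight line is a smooth curve (dimension 1),
# while uncorrelated noise approaches the upper limit of 2.
line_fd = higuchi_fd(np.arange(500.0))
noise_fd = higuchi_fd(np.random.default_rng(0).normal(size=5000))
print(round(line_fd, 3), round(noise_fd, 2))
```

The two checks bracket the range of the estimator: values near 1 indicate a smooth signal and values near 2 an essentially uncorrelated one.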
When comparing the dimension of two fractal
forms, the popular method of taking the difference
of the two Higuchi fractal dimension values is valid
to within a constant regardless of the calibration
division step. The Higuchi fractal algorithm used in
this project was calibrated with the Weierstrass
function. This function has the form
$W(x) = \sum_{n} a^{-nh} \cos(2\pi a^{n} x)$
for all the n values 0, 1, 2, 3… The
fractal dimension of the Weierstrass function was
given by (2 - h) where h takes on an arbitrary value
between zero and one.
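A calibration signal of this kind can be generated as in the sketch below. The base `a = 2.0`, the truncation of the infinite sum at `n_terms = 30`, and the sampling grid are all assumptions made for illustration; the text does not state the values that were used.

```python
import numpy as np


def weierstrass(x, h, a=2.0, n_terms=30):
    """Truncated Weierstrass sum: sum over n of a**(-n*h) * cos(2*pi*a**n*x).

    Assumptions: a > 1 (here 2.0) and the infinite sum is cut off
    after n_terms terms (n = 0 .. n_terms - 1).
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n = np.arange(n_terms, dtype=float)[:, None]   # column of exponents
    terms = a ** (-n * h) * np.cos(2.0 * np.pi * a ** n * x)
    return terms.sum(axis=0)


h = 0.5
grid = np.linspace(0.0, 1.0, 4096, endpoint=False)
w = weierstrass(grid, h)

# The theoretical dimension quoted in the text is 2 - h.
print("theoretical fractal dimension:", 2.0 - h)
```

Passing `w` to a fractal dimension estimator for several values of h and comparing the estimates with 2 - h gives a calibration curve; the agreement obtained in practice depends on the sampling density and on the truncation chosen here.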
The Shannon entropy of a sequence can be used
to monitor the level of functional constraints acting
on the gene (Parkhomchuk, 2006). A sequence with
relatively low nucleotide variety would have a low
Shannon entropy (more constraint) in terms of the
set of 16 possible di-nucleotide pairs. A sequence’s
entropy can be computed as $H = -\sum_i p_i \log_2 p_i$
over all states i, where the probability $p_i$ can be
obtained from the empirical histogram of the 16
di-nucleotide pairs. The maximum entropy is 4 binary
bits per pair for 16 possibilities ($2^4$). The maximum
entropy is 2 bits per mono-nucleotide with 4
possibilities ($2^2$).
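The entropy calculation described above can be sketched as follows. The function name `shannon_entropy` is hypothetical, and counting di-nucleotides with an overlapping sliding window is an assumption, since the text does not say whether overlapping or non-overlapping pairs were used.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence, word=1):
    """Shannon entropy in bits of the length-`word` subsequences.

    word=1 gives the mono-nucleotide entropy (maximum 2 bits);
    word=2 gives the di-nucleotide entropy (maximum 4 bits).
    Assumption: words are counted with an overlapping window.
    """
    seq = sequence.upper()
    words = [seq[i:i + word] for i in range(len(seq) - word + 1)]
    if not words:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(words)
    total = len(words)
    # Minus sign makes the entropy non-negative.
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Equal base frequencies reach the 2-bit maximum.
mono_max = shannon_entropy("ACGT" * 10, word=1)

# This 17-base string contains each of the 16 pairs exactly once,
# so it reaches the 4-bit maximum.
di_max = shannon_entropy("AACAGATCCGCTGGTTA", word=2)

# A homopolymer has no variety, hence zero entropy.
mono_min = shannon_entropy("AAAAAAAA", word=1)

print(mono_max, di_max, mono_min)
```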
3 RESULTS & DISCUSSION
For the 16S rRNA sequences, the C+G percent
correlates with the fractal dimension (FD), with an
R-square value of 0.88 for N = 8 (Figure 1). Dropping
the human and Neanderthal data would increase the
R-square value to ~0.91, because these two points lie
in the middle of the range as small outliers.
[Figure 1 plot: linear fit y = 0.7981x + 1.6265, R² = 0.881; x-axis from 0.42 to 0.44, y-axis from 1.96 to 1.98]
Figure 1: The C+G percent (x-axis) versus FD (y-axis) for
the studied 16S rRNA sequences.
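The regression behind a plot of this kind can be reproduced with a short script. The sketch below is an illustration only: the function names are hypothetical, and the sample points are synthetic values placed exactly on the fitted line reported in Figure 1, not the measured data for the eight sequences.

```python
import numpy as np


def gc_fraction(sequence):
    """Fraction of C and G bases in a nucleotide string."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def linear_fit_r_squared(x, y):
    """Least-squares line y = slope*x + intercept and its R-square."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return float(slope), float(intercept), float(1.0 - ss_res / ss_tot)


# Synthetic points lying exactly on the reported line (NOT real data).
gc_values = np.array([0.420, 0.425, 0.430, 0.435, 0.440])
fd_values = 0.7981 * gc_values + 1.6265

slope, intercept, r2 = linear_fit_r_squared(gc_values, fd_values)
print(f"slope={slope:.4f} intercept={intercept:.4f} R2={r2:.3f}")
```

With real data, `gc_fraction` would be applied to each 16S rRNA sequence and the resulting values paired with the corresponding fractal dimension estimates before fitting.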
The mono-nucleotide entropy correlates with the
di-nucleotide entropy in the 16S rRNA sequences, with
an R-square value of ~0.88 for N = 8 (Figure 2).
Dropping human and Neanderthal increases R-
BIOINFORMATICS 2010 - International Conference on Bioinformatics