SHANNON ENTROPY AND FRACTAL ANALYSIS FOR THE 16S

RIBOSOMAL RNA AND COX2 MT-DNA SEQUENCES IN

PRIMATES INCLUDING NEANDERTHAL

N. Gadura*, Todd Holden**, G. Tremberger, Jr**, E. Cheung**

P. Schneider*, D. Lieberman** and T. Cheung**

Biology* and Physics** Departments, CUNY Queensborough Community College, Bayside, NY 11364 U.S.A.

Keywords: Mono-nucleotide entropy, Di-nucleotide entropy, Fractal dimension, Neanderthal mt-DNA COX2,

Neanderthal mt-DNA 16S rRNA.

Abstract: The primate mt-DNA 16S rRNA and COX2 sequences, including Neanderthal sequences, were studied

using nucleotide frequency, mono- and di-nucleotide entropy, and fractal dimension. The fractal dimension

was computed with the Higuchi method when a nucleotide sequence is expressed as a numerical sequence

where each nucleotide is assigned its proton number. The results show that the C+G percent correlates

with the fractal dimension with an R-square value of around 0.88 (N = 8) for both gene sequences. The di- and

mono-nucleotide entropies are also well correlated, with similar R-square values. For the COX2 gene, the

clustering of human and Neanderthal at high entropy suggests that chimp, gorilla, and orangutan were subjected to

a higher selection pressure for this gene. The human COX2 has less entropy than the Neanderthal COX2,

consistent with the presence of some selection pressure.

1 INTRODUCTION

Closely related species can be classified by a

collection of phylogenetic markers with numbers

determined from mathematical operations on DNA

sequence. In cases where such markers are

significantly different from what would be expected

for random mutations, we can get an idea of how

much natural selection has influenced the evolution

of a particular gene. Here, we analyze primate mt-

DNA 16S ribosomal RNA (rRNA) and COX2

sequences, including Neanderthal sequences, using

nucleotide frequency, mono- and di-nucleotide

entropy, and fractal dimension. Among the proteins encoded by the mt-DNA genome, COX2 has experienced four amino acid substitutions on the modern human mtDNA lineage (Green et al., 2008). COX2 encodes a protein that maintains a proton gradient across the mitochondrial inner membrane, which drives the phosphorylation of ADP to ATP. Because the substitutions lie at non-functional sites in the crystal structure, it has been concluded that the evolution of human COX2 from Neanderthal conferred only minor adaptive advantages, without significant consequences for mitochondrial function. The relationships among the above-mentioned phylogenetic markers could shed light on whether there is selection pressure corresponding to some significant function yet to be discovered. The 16S rRNA is an important gene in classification and is included in this study.

The nucleotide base pair changes over a gene

sequence can be viewed as a fluctuation and,

consequently, can be investigated with standard

tools that include correlation and fractal dimension

analysis. For this study, the numerical sequence

representing the fluctuation of nucleotides in a gene

sequence was generated using the proton number of

each nucleotide. This numerical series can then be processed further using numerical methods such as a moving average, which is often used in stock market time series analysis. The fractal dimension of the original proton-number sequence, or of a series derived from it, can then be computed. A recent comparison of the human and chimpanzee genomes showed that it is possible to measure the rate of change in the accelerated regions of the human genome (Pollard et al., 2006a). The most accelerated region, HAR1, was shown by a gene expression experiment in the human embryo to be transcriptionally active and co-expressed with


Gadura N., Holden T., Tremberger Jr G., Cheung E., Schneider P., Lieberman D. and Cheung T. (2010).

SHANNON ENTROPY AND FRACTAL ANALYSIS FOR THE 16S RIBOSOMAL RNA AND COX2 MT-DNA SEQUENCES IN PRIMATES INCLUDING

NEANDERTHAL.

In Proceedings of the First International Conference on Bioinformatics, pages 257-260

DOI: 10.5220/0002753002570260

Copyright © SciTePress

reelin, which is an essential protein involved in the

development of the six-layer cortex of the human

brain. Fractal analysis was applied to the HAR1

nucleotide sequence and the homologous sequence

in the chimpanzee genome. Analysis shows that the

differences in fractal dimension can be used as a

marker of evolution. The 118-bp HAR1 region contains 18 point substitutions over an evolutionary span of 5 million years when comparing the human to the chimpanzee. However, the same 118-bp region contains only two point substitutions over a span of 300 million years when comparing the chicken to the chimpanzee. The implications for evolution and positive selection have been discussed in recent literature (Pollard et al., 2006b).

2 MATERIALS & METHODS

The nucleotide sequences were downloaded from GenBank. The accession numbers from the mtDNA database are listed in the Appendix. The studied primates are human, Neanderthal, chimp (chimp and pygmy chimp), gorilla (western and western lowland), and orangutan (Bornean and Sumatran).

The ATCG sequence was converted to a numerical sequence by assigning to each nucleotide its proton number, that is, the total number of protons in the base (the sum of the atomic numbers of its atoms): A(70), T(66), C(58), G(78). The assigned number is roughly proportional to the nucleotide mass. This assignment is consistent with the recently reported mass fractal analysis of a ribosome sequence (Lee, 2006). The A-T and C-G pairs in double-stranded DNA would have the same value of 136.
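The proton-number assignment can be checked with a short sketch. The values below are derived from the standard molecular formulas of the free bases (adenine C5H5N5, thymine C5H6N2O2, cytosine C4H5N3O, guanine C5H5N5O); the names `BASE_FORMULA`, `proton_number`, and `encode` are illustrative and not taken from the original study, and ambiguous symbols such as N are assumed to be absent.

```python
# Atomic numbers of the elements that occur in the four bases.
ATOMIC_NUMBER = {"C": 6, "H": 1, "N": 7, "O": 8}

# Molecular formulas of the free bases (atom counts per element).
BASE_FORMULA = {
    "A": {"C": 5, "H": 5, "N": 5},          # adenine
    "T": {"C": 5, "H": 6, "N": 2, "O": 2},  # thymine
    "C": {"C": 4, "H": 5, "N": 3, "O": 1},  # cytosine
    "G": {"C": 5, "H": 5, "N": 5, "O": 1},  # guanine
}


def proton_number(base):
    """Total number of protons in one base (sum of atomic numbers)."""
    return sum(ATOMIC_NUMBER[el] * n for el, n in BASE_FORMULA[base].items())


PROTON_NUMBER = {base: proton_number(base) for base in "ATCG"}


def encode(seq):
    """Map a nucleotide string to its proton-number series.

    Assumption: the sequence contains only A, T, C, G (any case);
    other symbols raise KeyError rather than being silently skipped.
    """
    return [PROTON_NUMBER[base] for base in seq.upper()]
```

For example, `encode("ATCG")` gives `[70, 66, 58, 78]`, and both Watson-Crick pairs sum to 136, as stated in the text.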

Fractal dimension analysis can be used in the study of correlated randomness. Among the various fractal dimension methods, the Higuchi fractal method is well suited for studying signal fluctuation (Higuchi, 1988). The signal from the sequence represents a random spatial intensity series. The spatial intensity (Int) random series with equal intervals can be used to generate a difference series (Int(j) - Int(i)) for different lags k = j - i in the spatial variable. The non-normalized apparent length of the spatial series curve at lag k is simply L(k) = Σ |Int(j) - Int(i)|, summed over all pairs separated by the lag k, for k = 1 up to a maximum lag. The number of terms in a k-series varies, so normalization must be used to get the series length. If Int(i) is a fractal function, then log(L(k)) versus log(1/k) should be a straight line with the slope equal to the fractal dimension. Higuchi incorporated a calibration division step (divide by k) such that the maximum theoretical value is calibrated to the topological value of 2. The detailed calculation is given in the literature (Higuchi, 1988).
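The procedure described above can be sketched as follows. This is a minimal pure-Python version of the Higuchi estimate, not the code used in the study; the maximum lag `kmax` is an assumed parameter (the text does not state the value used), and the series is assumed to be a list longer than `kmax` that is not constant.

```python
import math


def higuchi_fd(series, kmax=8):
    """Estimate the Higuchi fractal dimension of a numeric list.

    kmax (largest lag) is an assumption of this sketch.
    """
    n = len(series)
    log_inv_k, log_length = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):                 # k starting offsets per lag
            sub = series[m::k]             # points m, m+k, m+2k, ...
            n_int = len(sub) - 1           # number of lag-k differences
            if n_int < 1:
                continue
            dist = sum(abs(b - a) for a, b in zip(sub, sub[1:]))
            # Normalize for the varying number of terms, then apply
            # Higuchi's division by k.
            lengths.append(dist * (n - 1) / (n_int * k) / k)
        log_inv_k.append(math.log(1.0 / k))
        log_length.append(math.log(sum(lengths) / len(lengths)))
    # Least-squares slope of log L(k) versus log(1/k).
    mx = sum(log_inv_k) / len(log_inv_k)
    my = sum(log_length) / len(log_length)
    sxx = sum((u - mx) ** 2 for u in log_inv_k)
    sxy = sum((u - mx) * (v - my) for u, v in zip(log_inv_k, log_length))
    return sxy / sxx
```

A straight line gives a slope of 1 and uncorrelated noise gives a value close to 2, which are the two limiting cases referred to in the text. The function can be applied directly to a proton-number series of a gene sequence.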

When comparing the dimension of two fractal

forms, the popular method of taking the difference

of the two Higuchi fractal dimension values is valid

to within a constant regardless of the calibration

division step. The Higuchi fractal algorithm used in this project was calibrated with the Weierstrass function. This function has the form $W(x) = \sum_{n} a^{-nh} \cos(2\pi a^{n} x)$ for all the n values 0, 1, 2, 3, … The fractal dimension of the Weierstrass function was given by (2 - h) where h takes on an arbitrary value between zero and one.
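A calibration series of this kind can be generated with a truncated sum, as in the hedged sketch below. The base `a = 1.5`, the truncation `n_terms = 40`, and the sampling on the unit interval are assumptions of this example; the paper does not state the values it used. The generated series can then be passed to any fractal dimension estimator and the result compared with the theoretical value 2 - h.

```python
import math


def weierstrass(x, a=1.5, h=0.5, n_terms=40):
    """Truncated Weierstrass sum: sum over n of a**(-n*h) * cos(2*pi*a**n*x).

    a, h and n_terms are illustrative defaults, not values from the paper.
    """
    return sum(
        a ** (-n * h) * math.cos(2.0 * math.pi * a ** n * x)
        for n in range(n_terms)
    )


def weierstrass_series(n_points=2048, a=1.5, h=0.5, n_terms=40):
    """Sample the truncated function at n_points equally spaced x in [0, 1)."""
    return [weierstrass(i / n_points, a, h, n_terms) for i in range(n_points)]


def theoretical_dimension(h):
    """Fractal dimension of the (untruncated) Weierstrass function, 2 - h."""
    return 2.0 - h
```

Because the sum is truncated and the sampling is finite, an estimate obtained from such a series will only approximate 2 - h; the size of that deviation is what a calibration run would measure.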

The Shannon entropy of a sequence can be used

to monitor the level of functional constraints acting

on the gene (Parkhomchuk, 2006). A sequence with

relatively low nucleotide variety would have a low

Shannon entropy (more constraint) in terms of the

set of 16 possible di-nucleotide pairs. A sequence’s entropy can be computed as $H = -\sum_{i} p_i \log_2 p_i$ over all states i, where the probability $p_i$ can be obtained from the empirical histogram of the 16 di-nucleotide pairs. The maximum entropy is 4 binary bits per pair for 16 possibilities ($2^4$). The maximum entropy is 2 bits per mono-nucleotide with 4 possibilities ($2^2$).
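The entropy calculation can be illustrated with a short sketch. Whether the di-nucleotide pairs are counted with an overlapping or a non-overlapping window is not stated in the text, so the `overlapping` switch below is an assumption of this example, as are the function names; sequences are assumed to be non-empty.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mono_entropy(seq):
    """Mono-nucleotide entropy; at most 2 bits for 4 symbols."""
    return shannon_entropy(seq.upper())


def di_entropy(seq, overlapping=True):
    """Di-nucleotide entropy; at most 4 bits for 16 pairs.

    overlapping=True slides the window one base at a time;
    overlapping=False reads consecutive, non-overlapping pairs.
    Which convention the study used is not stated (assumption).
    """
    s = seq.upper()
    step = 1 if overlapping else 2
    pairs = [s[i:i + 2] for i in range(0, len(s) - 1, step)]
    return shannon_entropy(pairs)
```

A sequence in which all four bases are equally frequent reaches the 2-bit maximum, and one in which all 16 pairs are equally frequent reaches the 4-bit maximum; any compositional bias lowers both values.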

3 RESULTS & DISCUSSION

For the 16S rRNA sequences, the C+G percent correlates with the fractal dimension (FD) with an R-square value of 0.88 (N = 8), as shown in Figure 1. Dropping the human and Neanderthal data would increase the R-square value to ~0.91 because these two points sit in the middle of the range as small outliers.
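The C+G fraction and the R-square value of a straight-line fit, as quoted for these correlations, can be computed with the sketch below. The function names are illustrative, and the ordinary least-squares fit is an assumption about how the trend lines were obtained; no data from the study are reproduced here.

```python
def gc_fraction(seq):
    """Fraction of C and G bases in a non-empty nucleotide string."""
    s = seq.upper()
    return (s.count("C") + s.count("G")) / len(s)


def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, with R-square.

    Assumes at least two distinct x values and non-constant y values.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_square = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_square
```

The effect of dropping particular species can be examined by calling `linear_fit` again on the lists with those points removed and comparing the two R-square values.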

Figure 1: The C+G percent (x-axis) versus FD (y-axis) for the studied 16S rRNA sequences. Linear fit: y = 0.7981x + 1.6265, R² = 0.881.

The mono-nucleotide entropy correlates with the di-nucleotide entropy in the 16S rRNA sequences with an R-square value of ~0.88, N = 8 (Figure 2). Dropping human and Neanderthal increases the R-square value to ~0.99 because they are in the middle as moderate outliers.

Figure 2: The mono-nucleotide entropy (x-axis) versus the di-nucleotide entropy (y-axis) for the studied 16S rRNA sequences. Linear fit: y = 3.3227x - 2.5987, R² = 0.8838.

A similar correlation of the C+G percent with FD is observed for the COX2 sequences, with an R-square value of ~0.8756, N = 8 (Figure 3). Dropping the human and Neanderthal data would increase the R-square value to ~0.93 because they are in the middle as moderate outliers.

Figure 3: The C+G percent (x-axis) versus FD (y-axis) for the studied COX2 sequences. Linear fit: y = 1.3163x + 1.3876, R² = 0.8756.

Figure 4: The mono-nucleotide entropy (x-axis) versus the di-nucleotide entropy (y-axis) for the studied COX2 sequences. Linear fit: y = 2.5412x - 1.0639, R² = 0.8979.

A similar correlation of the mono-nucleotide entropy with the di-nucleotide entropy is observed for the COX2 sequences, with R-square ~0.8979 (Figure 4). The interesting point is that if human and Neanderthal are removed, the R-square value drops to 0.7601 because they are the two end points at large values (mono-nucleotide entropy ~1.953 bits for Neanderthal and 1.951 bits for human).

Correlation would suggest a similar selection pressure mechanism. Conserved regions would imply less random nucleotide fluctuation across species, so the fractal dimension would differ from 2 and the di-nucleotide Shannon entropy would be smaller than 4 bits. Our previous results have associated fractal dimension with functionality (Tremberger, Jr., et al., 2009). The observed correlation of fractal dimension with the C+G percent (which in turn is correlated with the A+T percent) shows that human and Neanderthal occupy the mid-range region, which would indicate a mild selection pressure on the functionality of the 16S ribosomal RNA and COX2 sequences as compared to the other primates. Our previous results have associated entropy with constraints (Holden et al., 2008). The high-value positions of the COX2 sequences in the correlation of mono-nucleotide entropy with di-nucleotide entropy suggest a weaker constraint in human and Neanderthal as compared to the other primates. Di-nucleotide frequency distributions confirm the closeness of relations among the various primates. For the COX2 genes, lowered entropy is largely a manifestation of high cytosine (~30 %) and low guanine (~15 %) content. Di- and mono-nucleotide entropies are well correlated. That human and Neanderthal cluster at high entropy indicates that chimp, gorilla, and orangutan were subjected to a higher selection pressure for this gene. The human COX2 has less entropy than the Neanderthal COX2, consistent with the presence of some selection pressure.

Whether the mild selection pressure and the weak constraint observed in the mtDNA 16S rRNA and COX2 sequences have activated other selection responses, such as those observed in the brain-function-related HAR1 sequence, is an interesting consideration. The HAR1 region hardly evolved in the 300 million years from chicken to chimp and then experienced accelerated selection pressure. The fractal analysis shows that the human HAR1 has a fractal dimension of 2.02 while the chimp is at 1.97; the di-nucleotide entropy is 3.86 bits for human and 3.64 bits for chimp; and the mono-nucleotide entropy is 1.97 bits for human and 1.86 bits for chimp. If a HAR1-like activation exists such that the selection pressure on the mt-DNA would be mild, it would suggest that the Neanderthal had a similar HAR1-like response, as their 16S rRNA and COX2 sequences are very similar to the human’s in terms of the parameters studied above. The activation would be an indication of cooperation in a multi-cellular organism. In this regard the CNV (copy number variants) strategy


should also be considered. It is possible that less

demand on the Cs and Gs with multiple copies could

be more advantageous than higher demand on Cs

and Gs bias with fewer copies for responding to

selection pressure. The correlation analysis method

can be a good supplemental tool to the relative

comparison method inherent in the BLAST method.

4 CONCLUSIONS

The nucleotide fluctuation of the mtDNA 16S rRNA and COX2 gene sequences in primates, including Neanderthal, was studied. The tools were fractal dimension, Shannon mono- and di-nucleotide entropy, and C+G content. We found that C+G content correlates with the fractal dimension. The correlation of the mono- and di-nucleotide entropy shows that the human COX2 gene has experienced some selection pressure. Future studies include the extension to other species.

ACKNOWLEDGEMENTS

The project was partially supported by several

CUNY PSC and Collaborative grants. N.G.

received partial support from CUNY Collaborative

CCIRG and Perkins Grant. E.C. thanks QCC for its hospitality. We thank the research groups for posting their gene sequence data in GenBank.

REFERENCES

Green, R. E., Malaspinas, A. S., et al., 2008, "A Complete Neandertal Mitochondrial Genome Sequence Determined by High-Throughput Sequencing", Cell, 134, 416-426.

Holden, T., Tremberger, Jr., G., Marchese, P., Cheung, E., Subramaniam, R., Sullivan, R., Schneider, P., Flamholz, A., Lieberman, D., & Cheung, T., 2008, "DNA sequence based comparative studies between non-extremophile and extremophile organisms with implications in exobiology", SPIE Proceedings, 70970Q, (12 pages) invited.

Higuchi, T., 1988, "Approach to an irregular time series on the basis of fractal theory", Physica D, vol. 31, 277-283.

Lee, Chang-Yong, 2006, "Mass Fractal Dimension of the Ribosome and Implication of its Dynamic Characteristics", Physical Review E, 73, 042901.

Parkhomchuk, D. V., 2006, "Di-nucleotide Entropy as a Measure of Genomic Sequence Functionality", arXiv:q-bio/0611059.

Pollard, K. S., Salama, S. R., Lambert, N., Coppens, S., Pedersen, J. S., et al., 2006a, "An RNA gene expressed during cortical development evolved rapidly in humans", Nature, 443, 167-172.

Pollard, K. S., Salama, S. R., King, B., Kern, A. D., Dreszer, T., et al., 2006b, "Forces shaping the fastest evolving regions in the human genome", PLoS Genet 2(10): e168. DOI: 10.1371/journal.pgen.0020168

Tremberger, Jr., G., Cheung, E., Gadura, N., Holden, T., Subramaniam, R., Sullivan, R., Schneider, P., Flamholz, A., Lieberman, D., & Cheung, T., 2009, "Multi-fractal property of perchlorate reductase gene sequences and DNA photonics application to UV fluorescence detection on Mars-like surface", SPIE Proceedings Vol. 7441, 74410G, (10 pages) invited.

APPENDIX

The studied sequences were downloaded from the GenBank mtDNA database. The accession numbers are:

NC_001644 Pan paniscus mitochondrion (pygmy chimpanzee).

NC_001643 Pan troglodytes mitochondrion (chimpanzee).

NC_011120 Gorilla gorilla gorilla mitochondrion (Western lowland gorilla).

NC_001645 Gorilla gorilla mitochondrion (Western gorilla).

AC_000021 Homo sapiens mitochondrion (human).

NC_011137 Homo sapiens neanderthalensis mitochondrion (Neanderthal).

NC_001646 Pongo pygmaeus mitochondrion (Bornean orangutan).

NC_002083 Pongo abelii mitochondrion (Sumatran orangutan).

BIOINFORMATICS 2010 - International Conference on Bioinformatics
