amino acid composition and are particularly useful
in comparing codon usage among genes, or sets of
genes that differ in their size and amino acid
composition. The formula for RSCU is given by:
$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}}$$
where $X_{ij}$ is the number of occurrences of the
$j$-th codon for the $i$-th amino acid, and $n_i$ is
the number (from one to six) of alternative codons
for the $i$-th amino acid. The relative adaptiveness
of a codon, $w_{ij}$, is the frequency of use of that
codon compared to the frequency of the optimal codon
for that amino acid, and it is given by:
$$w_{ij} = \frac{\mathrm{RSCU}_{ij}}{\mathrm{RSCU}_{i\max}} = \frac{X_{ij}}{X_{i\max}}$$
where $\mathrm{RSCU}_{i\max}$ and $X_{i\max}$ are the
RSCU and X values for the codon that is used most
frequently for the $i$-th amino acid.
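The two definitions above can be illustrated with a minimal Python
sketch for a single amino acid. The function names, the alanine codon
family, and the counts are hypothetical and chosen only for
illustration; they are not taken from the data set analyzed here.

```python
from collections import Counter

# Hypothetical synonymous family used for illustration: the four alanine codons.
ALA = ["GCT", "GCC", "GCA", "GCG"]

def rscu(counts, family):
    """RSCU_ij = X_ij / ((1 / n_i) * sum_j X_ij) for one amino acid."""
    total = sum(counts.get(c, 0) for c in family)
    if total == 0:
        # Amino acid absent from the gene: no usage to report.
        return {c: 0.0 for c in family}
    mean = total / len(family)  # expected count under equal usage
    return {c: counts.get(c, 0) / mean for c in family}

def relative_adaptiveness(counts, family):
    """w_ij = X_ij / X_imax for one amino acid."""
    x_max = max(counts.get(c, 0) for c in family)
    if x_max == 0:
        return {c: 0.0 for c in family}
    return {c: counts.get(c, 0) / x_max for c in family}

# Made-up codon counts for one gene.
counts = Counter({"GCT": 6, "GCC": 2, "GCA": 3, "GCG": 1})
rscu_ala = rscu(counts, ALA)
w_ala = relative_adaptiveness(counts, ALA)
```

With these counts the mean usage is 3 per codon, so GCT has an RSCU of
2.0 and GCA an RSCU of 1.0, and the RSCU values of the family sum to
its size (4).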
The Codon Adaptation Index (CAI) measures the
relative adaptation of a gene to the codon usage of
highly expressed genes. CAI uses a reference set of
highly expressed genes from a species to assess the
relative merits of each codon and identifies the role
of selective pressure in shaping the patterns of codon
usage (Sharp and Li, 1987). To calculate CAI, the
first step is to construct a reference table of relative
synonymous codon usage (RSCU) values from very
highly expressed genes of the organism in question.
The CAI values are calculated in relation to the
psbA gene of the same genome.
The psbA gene demonstrates atypical codon
usage and its codon bias is a remnant of the ancestral
bias degrading toward the compositional bias
(Morton and Levin, 1997). A CAI value close to
1.0 reflects strong bias in codon usage and
potentially high-expression level of the considered
gene (Sharp and Li, 1987).
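As a rough illustration, CAI as defined by Sharp and Li (1987) is the
geometric mean of the relative adaptiveness values of the codons in a
gene. The sketch below assumes a precomputed reference table `w` with
strictly positive values. The table shown is hypothetical, and skipping
codons that are missing from the table (for example Met, Trp, and stop
codons) is an assumption of this sketch rather than the procedure used
in this study.

```python
import math

def cai(cds, w):
    """Geometric mean of w over the codons of a coding sequence.

    cds: in-frame coding sequence (DNA string).
    w:   dict mapping codon -> relative adaptiveness (> 0) from the reference set.
    Codons not present in w are skipped (assumption of this sketch).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]
    if not logs:
        raise ValueError("no codon of the sequence is in the reference table")
    return math.exp(sum(logs) / len(logs))

# Hypothetical reference table covering only Ala and Lys codons.
w_ref = {"GCT": 1.0, "GCC": 0.5, "GCA": 0.25, "GCG": 0.25,
         "AAA": 1.0, "AAG": 0.5}

cai_mixed = cai("ATGGCTGCCAAAAAG", w_ref)    # two optimal and two non-optimal codons
cai_optimal = cai("ATGGCTGCTAAAAAA", w_ref)  # only optimal codons
```

A gene built only from the most frequent codon of each amino acid
reaches the maximum value of 1.0, which matches the interpretation of
CAI values close to 1.0 given above.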
The most commonly used characteristic is the
pattern of codon usage itself, defined in terms of
optimal codons. An optimal codon is any codon
whose frequency of usage is significantly higher
than its synonymous codons in putatively highly
expressed genes. Significance is estimated using a
two-way chi-squared contingency test, with a cut-off
at p<0.01. Codon usage was compared between the
groups using a chi-square contingency test, and codons
whose frequency of usage was significantly higher
(p < 0.01) in highly expressed genes than in
genes with a low level of expression were defined
as the optimal codons.
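The test for a single codon can be sketched as a 2x2 contingency
table: rows are the high- and low-expression gene groups, columns are
the counts of the codon of interest and of its synonymous alternatives.
The table layout, the absence of a continuity correction, the function
names, and the counts in the example are assumptions of this sketch.
The p-value uses the closed form for one degree of freedom,
p = erfc(sqrt(chi2 / 2)).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column carries no evidence
    return n * (a * d - b * c) ** 2 / denom

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def is_optimal(high_codon, high_other, low_codon, low_other, alpha=0.01):
    """True if the codon is used significantly more often in the high group."""
    high_total = high_codon + high_other
    low_total = low_codon + low_other
    if high_total == 0 or low_total == 0:
        return False
    higher = high_codon / high_total > low_codon / low_total
    chi2 = chi2_2x2(high_codon, high_other, low_codon, low_other)
    return higher and p_value_1df(chi2) < alpha

# Made-up counts: the codon is 80/100 in the high group and 40/100 in the low group.
strong = is_optimal(80, 20, 40, 60)
# Made-up counts with almost identical usage in both groups.
weak = is_optimal(50, 50, 48, 52)
```

The direction check matters: a significant chi-square value alone would
also flag codons that are used less often in highly expressed genes.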
GC content is calculated as the fraction of
nucleotides in a sequence that are guanine or
cytosine. The index GC3s is the frequency of G or C
nucleotides present at the third position of
synonymous codons (i.e. excluding Met, Trp and
termination codons).
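Both indices can be computed directly from a coding sequence. The
sketch below assumes the standard genetic code, so the excluded codons
are ATG (Met), TGG (Trp), and the stop codons TAA, TAG, and TGA. The
function names and the example sequence are hypothetical.

```python
# Codons without synonymous alternatives, plus termination codons
# (standard genetic code assumed).
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}

def gc_content(seq):
    """Fraction of nucleotides in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in ("G", "C")) / len(seq)

def gc3s(cds):
    """Fraction of G or C at the third position of synonymous codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    thirds = [c[2] for c in codons if c not in EXCLUDED]
    if not thirds:
        return 0.0
    return sum(1 for base in thirds if base in ("G", "C")) / len(thirds)

# Made-up sequence: ATG GCT GCC AAA TGG TAA
example = "ATGGCTGCCAAATGGTAA"
gc_value = gc_content(example)
gc3s_value = gc3s(example)
```

In the example only GCT, GCC, and AAA contribute to GC3s, and one of
their three third positions is G or C.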
Hydrophobicity is measured in terms of the GRAVY
(grand average of hydropathy) score, while
aromaticity denotes the frequency of aromatic amino
acids (Phe, Tyr and Trp) in the translated sequences
(Kyte and Doolittle, 1982).
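A minimal sketch of both protein-level indices follows. GRAVY is taken
here as the mean Kyte-Doolittle hydropathy value over all residues, and
aromaticity as the fraction of Phe, Tyr, and Trp residues. The function
names and the example peptide are hypothetical, and ignoring residues
outside the 20 standard amino acids is an assumption of this sketch.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = {"F", "Y", "W"}

def gravy(protein):
    """Mean hydropathy over the standard residues of a protein sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        return 0.0
    return sum(values) / len(values)

def aromaticity(protein):
    """Fraction of standard residues that are Phe, Tyr, or Trp."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        return 0.0
    return sum(1 for aa in residues if aa in AROMATIC) / len(residues)

# Made-up four-residue peptide.
gravy_value = gravy("FWYA")
aroma_value = aromaticity("FWYA")
```

Positive GRAVY values indicate an overall hydrophobic protein and
negative values an overall hydrophilic one.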
To normalize for differing amino acid compositions
and identify intra-genomic variation, relative
synonymous codon usage (RSCU) was analyzed by
correspondence analysis (COA) for the 59 informative
codons (excluding Met, Trp, and the three stop
codons) (Greenacre, 1984). This analysis partitions
the variation along 59 orthogonal axes, with 41
degrees of freedom. The first axis is the one that
captures most of the variation in the codon usage,
with each subsequent axis explaining a diminishing
amount of the variance. The
correspondence analysis also reflects the
corresponding distribution of synonymous codons.
RSCU values are close to 1.0 when all synonymous
codons are used equally without any bias. In the
subsequent part of this work, the terms axis 1 RSCU
and axis 2 RSCU will be used to represent the first
and second major axes of COA.
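The way correspondence analysis splits variation across axes can be
sketched with a singular value decomposition of the standardized
residuals of a genes-by-codons table. This is a generic textbook
formulation and not necessarily the exact procedure of the software
used in this study. It assumes NumPy is available, that every row and
column has a positive total, and it uses a small made-up table in place
of the 59-column RSCU matrix.

```python
import numpy as np

def correspondence_analysis(table):
    """Return (row coordinates, fraction of inertia per axis) for a 2-D table."""
    n = np.asarray(table, dtype=float)
    p = n / n.sum()                      # correspondence matrix
    r = p.sum(axis=1)                    # row masses (genes)
    c = p.sum(axis=0)                    # column masses (codons)
    expected = np.outer(r, c)
    s = (p - expected) / np.sqrt(expected)   # standardized residuals
    u, sv, _ = np.linalg.svd(s, full_matrices=False)
    inertia = sv ** 2                    # variation captured by each axis
    explained = inertia / inertia.sum()
    row_coords = (u * sv) / np.sqrt(r)[:, None]  # principal coordinates of the rows
    return row_coords, explained

# Made-up table: 4 genes (rows) by 3 codons (columns).
toy = [[2.0, 0.5, 0.5],
       [1.8, 0.7, 0.5],
       [0.4, 1.6, 1.0],
       [0.5, 1.0, 1.5]]
coords, explained = correspondence_analysis(toy)
axis1 = coords[:, 0]  # position of each gene on the first major axis
axis2 = coords[:, 1]  # position of each gene on the second major axis
```

Because the singular values are returned in descending order, the
first column of `coords` corresponds to the axis that captures the
largest share of the variation.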
3 RESULTS
3.1 Detection of Codon Usage Patterns
As described in methods, the pattern of synonymous
codon usage in each genome was investigated with
the Nc-plot of the ENc value against the GC3s
value. ENc values range from 20 (extremely biased)
to 61 (no bias) (Wright, 1990), and the respective
plots are shown in Figure 1 for the basal
angiosperms, and Figure 2 for magnoliids. Nc-plots
of basal angiosperm chloroplast genomes follow the
expected trajectory, i.e. the majority of points lie
on or just below the expected Nc curve.
Table 2 lists the Nc and GC3s values for all
species investigated. Basal angiosperms have very
low GC3s values, and their Nc values range from
about 38 to 61, the lowest being 38.39
(GC3s = 0.232) for the rps18 gene of
Amborella trichopoda.
Overall, the majority of genes follow a parabolic
line of trajectory indicating G+C mutational bias as
the predominant factor for variation in codon usage,
although some genes lie well below the expected
curve, hinting at additional factors responsible for
codon bias in basal angiosperms.
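The expected curve of the Nc-plot can be written out explicitly. Under
the null hypothesis that codon usage is shaped by G+C composition
alone, Wright (1990) gives the expected value as
Nc = 2 + s + 29 / (s^2 + (1 - s)^2), where s is GC3s. The sketch below
evaluates this curve and the gap to an observed value; the function
names are hypothetical, and the rps18 figures are the ones quoted
above.

```python
def expected_nc(gc3s):
    """Expected Nc under G+C compositional constraint alone (Wright, 1990)."""
    s = gc3s
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)

def nc_deficit(observed_nc, gc3s):
    """How far an observed Nc value lies below the expected curve."""
    return expected_nc(gc3s) - observed_nc

# rps18 of Amborella trichopoda, values as quoted in the text.
rps18_expected = expected_nc(0.232)
rps18_deficit = nc_deficit(38.39, 0.232)
```

For GC3s = 0.232 the curve gives an expected Nc of about 47.3, so the
observed value of 38.39 lies roughly 9 units below it, which is the
kind of departure from the curve that points to factors other than
mutational bias.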
BIOINFORMATICS 2015 - International Conference on Bioinformatics Models, Methods and Algorithms