each ORF sequence, only those editing substitutions
are generated such that a significant homology with
some of the known proteins is reached, thus avoid-
ing an exponential growth of the sequences to ana-
lyze. Finally, in order to understand if the produced
amino acid sequences can be considered indicative of
gene activity, a further filtering step is carried out by
searching for the presence of possible transcripts in
DBEST (Boguski et al., 1993).
Figure 1 graphically illustrates the main steps of
our method and the associated supporting software
tools. Below we explain in detail each specific step
of our prediction approach.
2.1 Editing on the Start/Stop Codons
In order to extract novel ORF sequences from the
genome of a given organism, edited nucleotide triplets
corresponding to the start and stop of an amino acid
sequence have to be intercepted on the DNA se-
quence. Such triplets are called start codons and stop
codons, respectively. There is one start codon, atg,
and three stop codons: tag, tga and taa.
ORF sequences can easily be searched for in a
genomic sequence with one of the existing software
tools, such as ORF FINDER and STARORF. However,
these tools do not take the editing mechanism into
account. As a result, in plants, several proteins
cannot be found among the ORF sequences returned
by such tools.
To address this, we start from the mtDNA of a
specific plant and hypothesize that some editing
substitutions may have occurred, generating new
start/stop codons. Among all such possible new
codons, only those corresponding to significant
potential ORF sequences are taken into account. In
particular, only ORF sequences corresponding to amino
acid sequences of length at least 100 are considered
to correspond to potential proteins. Thus, between a
start and a stop at least 300 nucleotides have to occur
for potential novel ORF sequences to be singled out.
Furthermore, the most frequent nucleotide substitu-
tion caused by editing is c → u at the RNA level, that
is, c → t if we refer to mtDNA. Thus we consider only
this kind of nucleotide substitution in our analysis.
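The restriction to c → t substitutions determines which triplets can become a start or a stop codon after editing. The following is a minimal sketch, not the authors' implementation, that enumerates those triplets and restates the length threshold; the function and constant names (`edited_forms`, `editable_sources`, `MIN_AA`, `MIN_NT`) are hypothetical.

```python
from itertools import product

START_CODONS = {"atg"}
STOP_CODONS = {"tag", "tga", "taa"}

# Threshold from the text: at least 100 amino acids, i.e. 300 nucleotides.
MIN_AA = 100
MIN_NT = 3 * MIN_AA


def edited_forms(codon):
    """Return every triplet obtained by changing one or more c to t."""
    c_positions = [i for i, base in enumerate(codon) if base == "c"]
    forms = set()
    for mask in product([False, True], repeat=len(c_positions)):
        if not any(mask):
            continue  # skip the unedited triplet itself
        bases = list(codon)
        for pos, edit in zip(c_positions, mask):
            if edit:
                bases[pos] = "t"
        forms.add("".join(bases))
    return forms


def editable_sources(targets):
    """Return the triplets that c -> t editing can turn into a target codon."""
    all_codons = ("".join(p) for p in product("acgt", repeat=3))
    return {codon for codon in all_codons if edited_forms(codon) & targets}


if __name__ == "__main__":
    print(sorted(editable_sources(START_CODONS)))
    print(sorted(editable_sources(STOP_CODONS)))
```

Under this rule a single triplet (acg) can yield an edited start, and three triplets (caa, cag, cga) can yield an edited stop, so the scan only needs to look for four additional patterns.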
The following example illustrates how new candi-
date ORF sequences can be generated from the origi-
nal nucleotide sequence, by simulating possible edit-
ing substitutions.
Example 1 In Figure 2 a portion of the rice mtDNA
is shown. The considered sequence contains two stop
codons, taa and tag, highlighted by a widehat. Since
no start codon occurs between the two stops, no
candidate ORF sequence would be extracted without
editing simulation. By contrast, if we consider
possible c → t substitutions leading to the generation
of new codons, the start codon atg resulting from the
triplet acg (in italics) can be intercepted. Since
there are 102 nucleotide triplets between this start
codon and the stop tag, the subsequence highlighted in
bold is worth considering as a candidate ORF sequence
and can be extracted this way.
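The situation in Example 1 can be reproduced on a toy sequence. The sketch below uses a short synthetic string, not the rice mtDNA of Figure 2, and is far shorter than the 102 triplets of the example; the names `codons`, `start_positions` and `toy` are hypothetical.

```python
def codons(seq, frame=0):
    """Split seq into consecutive in-frame triplets, dropping a partial tail."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def start_positions(triplets, allow_editing):
    """Indices of triplets usable as a start, with or without c -> t editing."""
    starts = {"atg", "acg"} if allow_editing else {"atg"}
    return [i for i, codon in enumerate(triplets) if codon in starts]


# Synthetic stretch: stop taa, filler, an acg triplet, filler, stop tag.
toy = "taa" + "gct" * 2 + "acg" + "gct" * 4 + "tag"
triplets = codons(toy)

if __name__ == "__main__":
    print(start_positions(triplets, allow_editing=False))  # no start found
    print(start_positions(triplets, allow_editing=True))   # acg is a candidate
```

Without editing simulation no start is found between the two stops; once acg is accepted as a potential edited atg, the triplet at index 3 becomes a candidate start.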
The method starts from an input nucleotide sequence
(in the case presented in this paper, the mtDNA of a
plant). Such a nucleotide sequence $s_n$ is then
scanned in all three of its possible reading frames
(in both the forward and the reverse directions),
considering all the c → t substitutions that can
generate new start/stop codons (we call them edited
codons, while original codons are those already
occurring in $s_n$). Then, the nucleotide subsequences
of minimum length 300 between a start and a stop codon
are extracted, taking care that only maximal
subsequences are considered. In fact, if several
useful start codons occur before the same stop codon,
only the first start codon is considered for the
purpose of extracting the corresponding ORF sequence.
All the other start codons are translated as the
corresponding amino acid Methionine (M) in the
resulting amino acid sequence. This avoids
intercepting all the possible subsequences. As for the
stop codons, the first one after the chosen start
$c_{START}$ is considered, if such a $c_{STOP}$ is an
original codon. In this case, the subsequence so
identified is discarded if its length is less than 300
bases, and we look for another $c_{START}$. If,
instead, $c_{STOP}$ is an edited stop, it is taken
into account only if there are at least 300
nucleotides between $c_{START}$ and $c_{STOP}$;
otherwise the edited stop is discarded and the next
$c_{STOP}$ is searched for, using the same rule. In
this way we avoid subdividing a potentially
significant sequence into several smaller meaningless
subsequences.
Figure 3 summarizes the editing ORF simulation
method as described above.
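The scanning rules described above can be sketched as follows. This is an illustrative reading of the text, not the authors' software, and it rests on several assumptions: length is counted from the start codon up to, but excluding, the stop codon; the reverse case is taken to mean the reverse complement; after a discarded short ORF the search resumes after the original stop. All function and variable names are hypothetical.

```python
ORIGINAL_STARTS = {"atg"}
EDITED_STARTS = {"acg"}                 # acg -> atg by c -> t
ORIGINAL_STOPS = {"tag", "tga", "taa"}
EDITED_STOPS = {"cag", "cga", "caa"}    # one c -> t yields a stop
STARTS = ORIGINAL_STARTS | EDITED_STARTS

COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan_frame(triplets, min_nt=300):
    """Return (start_index, stop_index) pairs of candidate ORFs in one frame."""
    orfs = []
    i, n = 0, len(triplets)
    while i < n:
        if triplets[i] not in STARTS:
            i += 1
            continue
        start, end = i, None
        j = start + 1
        while j < n:
            codon = triplets[j]
            length = 3 * (j - start)  # assumption: start included, stop excluded
            if codon in ORIGINAL_STOPS:
                # An original stop always closes the ORF; keep it only if long enough.
                if length >= min_nt:
                    end = j
                break
            if codon in EDITED_STOPS and length >= min_nt:
                # An edited stop is accepted only beyond the length threshold.
                end = j
                break
            j += 1  # later starts and too-early edited stops are read through
        if end is not None:
            orfs.append((start, end))
        i = j + 1  # look for the next start after the stop
    return orfs


def find_candidate_orfs(seq, min_nt=300):
    """Scan the three frames of both strands; return (strand, frame, orf) tuples."""
    seq = seq.lower()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            triplets = [s[k:k + 3] for k in range(frame, len(s) - 2, 3)]
            for start, end in scan_frame(triplets, min_nt):
                lo, hi = frame + 3 * start, frame + 3 * (end + 1)
                results.append((strand, frame, s[lo:hi]))
    return results
```

The `min_nt` parameter defaults to the 300-nucleotide threshold of the text; lowering it is only meant for trying the sketch on short toy sequences.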
2.2 Editing on the Amino Acid Sequences
Let $P_{ORF}$ be the set of extracted amino acid
sequences: a first question is to what extent possible
RNA editings occurring in each sequence of $P_{ORF}$
may influence the prediction process (it is worth
recalling that the only editing we focus on here is
the c → u one). Note that, if we simulate editing on
the sequences in $P_{ORF}$, we should take into
account all the c → u editing configurations that
might occur, the number of which is $2^k$, where $k$
is the
BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms