Discovering New Proteins in Plant Mitochondria by RNA Editing Simulation

Fabio Fassetti, Claudia Giallombardo, Ofelia Leone, Luigi Palopoli, Simona E. Rombo (2,∗) and Adolfo Saiardi

DIMES - Università della Calabria, Rende (CS), Italy

Dipartimento di Matematica e Informatica, Università degli Studi di Palermo, Palermo, Italy

LMCB, MRC, Cell Biology Unit & Department of Developmental Biology, University College, London, U.K.

Keywords: Sequence Analysis, Editing Simulation, ORF Sequences, Plant mtDNA, Protein Prediction.

Abstract: In plant mitochondria, RNA editing is an essential mechanism for gene expression and often influences the synthesis of functional proteins. RNA editing alters the linearity of genetic information transfer: it causes differences between RNAs and their coding DNA sequences that hinder both experimental and computational gene discovery. As a result, common gene-finding software tools, which are successfully applied to find canonical genes, often fail to discover genes encrypted in plant genomes.

Here we propose a novel strategy to identify candidate coding sequences resulting from possible editing substitutions. In particular, we consider c → u substitutions leading to the creation of new start and stop codons in the mitochondrial DNA of a given input organism. We try to mimic the natural RNA editing mechanism in order to generate candidate Open Reading Frame sequences that could code for novel, uncharacterized proteins. Results obtained by analyzing the mtDNA of Oryza sativa support this approach: we identified thirteen Open Reading Frame sequences transcribed in Oryza that do not correspond to already known proteins. Five of the corresponding amino acid sequences show high homology with proteins already discovered in other organisms, whereas no such homology was detected for the remaining ones.

1 INTRODUCTION

In mitochondria and chloroplasts of flowering plants, the linearity of genetic information is interrupted by mechanisms that increase protein variability. Such mechanisms can alter RNA transcripts so that their final primary nucleotide sequence is quite different from the corresponding DNA sequence. The most common among these mechanisms is post-transcriptional mRNA editing, which consists in the enzymatic modification of nitrogenous bases, almost exclusively Cytidine to Uridine transformation (Takenaka et al., 2008). Most RNA editing events are found in the coding regions of mRNAs, usually at the first and second codon positions, so that the resulting amino acid is often different from that specified by the corresponding unedited codon (Gray et al., 1992). Editing can also create new start and stop codons (Hoch et al., 1991; Wintz and Hanson, 1991), and it can occur in introns (Brennicke et al., 1999) and other non-translated regions (Schuster et al., 1990). The use of editing to generate aug start codons might represent another level of regulatory control of gene expression: introducing a translational start codon could make an mRNA accessible for protein synthesis (Takenaka et al., 2008).

∗ Corresponding author

Specifically, in plant mitochondria, RNA editing is essential for gene expression. In many cases this mechanism completes the genomic information and is essential to the creation of a functional open reading frame (Regina et al., 2002). Given the physiological importance of RNA editing, the identification of editing sites is essential for molecular, biochemical and phylogenetic studies in plant mitochondria. Experimental analysis, carried out by comparing RNA transcripts with genomic DNA sequences, is the most exhaustive way, but it is also expensive and time consuming. A collection of sequences post-transcriptionally modified by RNA editing from many organisms, recovered from primary databases and the literature, is available in the RNA editing database REDIdb (Picardi et al., 2007). Computational approaches have been used to predict sites of RNA editing, based either on statistical methods (Bundschuh, 2004) or on evolutionary considerations. The latter rely on the observation that the final effect of editing events is often to make mitochondrially encoded proteins more similar in sequence to their homologs in other species (Gualberto et al., 1989). For instance, PREP-MT (Mower, 2005) and EDIPY (Picardi and Quagliariello, 2005) both exploit this tendency of RNA editing to “correct” codons that specify unconserved amino acids. A more recent approach has been proposed in (Lenz and Knoop, 2013).

Fassetti F., Giallombardo C., Leone O., Palopoli L., Rombo S. and Saiardi A. Discovering New Proteins in Plant Mitochondria by RNA Editing Simulation. DOI: 10.5220/0005664901820189. In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOINFORMATICS 2016), pages 182-189. ISBN: 978-989-758-170-0

The simplest way to find genes in a genome is to scan the nucleotide sequence in all three possible reading frames, searching for DNA stretches that do not contain any stop codon in a given reading frame. The sequence comprised between a start and a stop codon is an Open Reading Frame (we call them ORF sequences in the rest of this paper), and it can be considered a potential protein-encoding segment if its length is at least 300 nucleotides. The alternative to this “ab initio” gene discovery is comparative gene finding, based on sequence similarity. It consists in comparing translated sequences with known proteins, so that homology criteria allow for the identification of new proteins in the organism under analysis. The number of known mitochondrial genes varies across organisms, from only 5 genes in Plasmodium to nearly 100 genes in jakobid flagellates, with the average across eukaryotes being 40-50 genes (Burger et al., 2003). Despite the difference in number, mitochondrial genes are involved in five basic processes: invariantly in respiration and/or oxidative phosphorylation and translation, and occasionally also in transcription, RNA maturation and protein import. However, because of the mechanisms that increase gene complexity in plant mitochondria, it is possible that a certain number of mitochondrial proteins remain unknown. Indeed, the RNA editing mechanism alters the linearity of genetic information transfer, introducing differences between RNAs and their coding DNA sequences that hinder both experimental and computational gene discovery. In fact, common gene-search software tools are helpful in finding canonical genes, but they fail to discover genes encrypted in this way in the genome. Accordingly, complete sequencing of the mtDNA of many organisms has allowed the identification of canonical genes, but much of the informational content of plant mitochondrial genomes remains undiscovered. Finding plant mitochondrial proteins and understanding how they integrate into pathways represent major challenges in cell biology.

In order to identify new proteins in plant mitochondria, we propose a method for mining ORF sequences from genomes, based on editing simulation, as illustrated in Section 2. Our approach aims at identifying ORFs that could potentially be coding regions for proteins but that, due to RNA editing, cannot be detected by classical gene-finding techniques. The method is based on the observation that plant mitochondria use the editing mechanism on crucial sites, for example to generate the start codon aug from acg. The main idea we pursue is to simulate such an editing process by exploiting a suitable metric to compute the distance between sequences, in such a way as to take editing directly into account. We applied our method to the mtDNA of Oryza sativa (rice), obtaining encouraging preliminary results that are described in Section 3. First, our method was able to single out amino acid sequences corresponding to rice proteins for which start codon editing is known to occur, thereby validating our approach. Second, a number of protein sequences were predicted, some of which are homologous to proteins expressed in other organisms, while others are completely novel.

2 METHODS

The idea exploited in this work is to automatically mimic those editing mechanisms that may give rise to proteins not attributable to ORF sequences obtained by traditional methods (e.g., ORF FINDER, http://www.ncbi.nlm.nih.gov/projects/gorf/, and STARORF, http://web.mit.edu/star/orf/). This is particularly meaningful in plants, where editing mechanisms can often involve nucleotide triplets leading to start and stop codons. Our approach is based on the simulation of such a process, in order to generate novel potential proteins not yet discovered in a given input organism. By far the most frequent nucleotide substitution caused by editing is c → u at the RNA level, that is, c → t if we refer to mtDNA. Thus, we consider only this kind of nucleotide substitution in our analysis. Since RNA editing might also occur inside the simulated ORF sequences, we also handle a further editing simulation step. In particular, when an amino acid sequence is intercepted for a specific organism, a first criterion to assess its biological relevance is searching for significant homologies. Thus, we generate those editing substitutions on the ORF sequences that allow possible new homologies with known proteins of other organisms to be detected. To this aim, a suitable sequence distance measure is considered and, for each ORF sequence, only those editing substitutions are generated for which a significant homology with some of the known proteins is reached, thus avoiding an exponential growth of the sequences to analyze. Finally, in order to understand whether the produced amino acid sequences can be considered indicative of gene activity, a further filtering step is carried out by searching for possible transcripts in DBEST (Boguski et al., 1993).

Figure 1 graphically illustrates the main steps of our method and the associated supporting software tools. Below we explain in detail each specific step of our prediction approach.

2.1 Editing on the Start/Stop Codons

In order to extract novel ORF sequences from the genome of a given organism, edited nucleotide triplets corresponding to the start and stop of an amino acid sequence have to be intercepted on the DNA sequence. Such triplets are called start codons and stop codons, respectively. There is one start codon, atg, and three stop codons, tag, tga and taa. ORF sequences can be easily searched for in a genomic sequence by exploiting one of the existing software tools, such as ORF FINDER and STARORF; however, these tools do not take the editing mechanism into account. Therefore, in plants, several proteins cannot be found from the ORF sequences returned by such tools.

For this reason, we start from the mtDNA of a specific plant and assume that some editing substitutions might have occurred, generating new start/stop codons. Among all such possible new codons, only those corresponding to significant potential ORF sequences are taken into account. In particular, only ORF sequences corresponding to amino acid sequences of length at least 100 are considered to correspond to potential proteins. Thus, at least 300 nucleotides have to occur between a start and a stop codon for a potential novel ORF sequence to be singled out. Furthermore, the most frequent nucleotide substitution caused by editing is c → u at the RNA level, that is, c → t if we refer to mtDNA. Thus, we consider only this kind of nucleotide substitution in our analysis.
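The codons that c → t editing can turn into a start or a stop codon follow directly from the codon sets given above. The short Python sketch below enumerates them; the function names (`edited_variants`, `classify`) are illustrative choices and are not taken from the authors' software.

```python
from itertools import product

START_CODONS = {"atg"}
STOP_CODONS = {"tag", "tga", "taa"}


def edited_variants(codon):
    """Codons reachable from `codon` by changing one or more c into t."""
    options = [("c", "t") if base == "c" else (base,) for base in codon]
    variants = {"".join(choice) for choice in product(*options)}
    variants.discard(codon)  # the unedited codon itself is not an edit
    return variants


def classify(codon):
    """Label a codon as an original/edited start or stop, or 'other'."""
    if codon in START_CODONS:
        return "original start"
    if codon in STOP_CODONS:
        return "original stop"
    variants = edited_variants(codon)
    if variants & START_CODONS:
        return "edited start"
    if variants & STOP_CODONS:
        return "edited stop"
    return "other"


ALL_CODONS = ["".join(c) for c in product("acgt", repeat=3)]
EDITED_STARTS = sorted(c for c in ALL_CODONS if classify(c) == "edited start")
EDITED_STOPS = sorted(c for c in ALL_CODONS if classify(c) == "edited stop")
print(EDITED_STARTS, EDITED_STOPS)  # ['acg'] ['caa', 'cag', 'cga']
```

Under the c → t restriction, acg is therefore the only triplet that can become a start codon, while caa, cag and cga are the only triplets that can become stop codons.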

The following example illustrates how new candidate ORF sequences can be generated from the original nucleotide sequence by simulating possible editing substitutions.

Example 1. Figure 2 shows a portion of the rice mtDNA. The sequence contains two stop codons, taa and tag (marked with a wide hat in the figure). Since no start codon occurs between the two stops, no candidate ORF sequence would be extracted without editing simulation. If, instead, we consider possible c → t substitutions leading to the generation of new codons, the start codon atg resulting from the triplet acg (in italics) can be intercepted. Since there are 102 nucleotide triplets between this start codon and the stop codon tag, the subsequence highlighted in bold is worth considering as a candidate ORF sequence and can be extracted this way.
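The count in Example 1 can be checked directly against the triplets printed in Figure 2. The Python sketch below copies that sequence verbatim; the variable names are illustrative.

```python
# Triplets of the rice mtDNA portion shown in Figure 2.
FIGURE_2 = """
atc gga tca tca tgc ata atc gaa caa agc tta tcc gca tgg
taa agt agt tta cca cac aag tcg aca aaa aag acg ttc ggc
ttt aga aat cat ttt ttt gct ccc tca tcc tcg gtt gtt cgt att tca ttt tct tca aag gca cat gca cta
ggt tac tta cgg aat ctc aaa gaa aga gtc gtc cag gag cac ttc gtt aga ttt gca tgt gtt aag cat ata
gct gaa gtt gcc tat gcg ctt caa cct gct ctt aca aga cga atc tct ttc tat acg caa ttt caa cta gag
tct act cct ttc tgg tct gaa atc tca gta gag acg ata aag att agg tgc ctt tct ttc tat agg gat agg
tgc ttc tct cta
tag aaa gaa agg aga tcc agt tta cca ttg aga gta gag aag ggg aag
"""

triplets = FIGURE_2.split()

first_stop = triplets.index("taa")                 # first original stop codon
last_stop = triplets.index("tag", first_stop + 1)  # next original stop codon

# Without editing there is no start codon between the two stops.
has_original_start = "atg" in triplets[first_stop + 1:last_stop]

# With c -> t editing, the first acg after the first stop becomes atg.
edited_start = triplets.index("acg", first_stop + 1)
codons_between = last_stop - edited_start - 1

print(has_original_start, codons_between)  # False 102
```

The 102 triplets correspond to 306 nucleotides, which is above the 300-nucleotide minimum used by the method.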

The method starts from an input nucleotide sequence (in the case presented in this paper, the mtDNA of a plant). This nucleotide sequence $s$ is scanned in all three possible reading frames (for both the forward and the reverse strand), considering all the c → t substitutions that can generate new start/stop codons (we call them edited codons, while original codons are those already occurring in $s$). Then, the nucleotide subsequences of minimum length 300 between a start and a stop codon are extracted, taking care that only maximal subsequences are considered. In fact, if several useful start codons occur before the same stop codon, only the first start codon is considered for the purpose of extracting the corresponding ORF sequence. All the other start codons are translated as the corresponding amino acid Methionine (M) in the resulting amino acid sequence. This avoids intercepting all the possible subsequences. As for the stop codons, the first one after the chosen start $c_{START}$ is considered if such a $c_{STOP}$ is an original codon. In this case, the subsequence so identified is discarded if its length is less than 300 bases, and we look for another $c_{START}$. If, instead, $c_{STOP}$ is an edited stop, it is taken into account only if there are at least 300 nucleotides between $c_{START}$ and $c_{STOP}$; otherwise the edited stop is discarded and the next $c_{STOP}$ is searched for, using the same rule. In this way we avoid subdividing a potentially significant sequence into several smaller, meaningless subsequences.

Figure 3 summarizes the editing ORF simulation method described above.
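The scanning rules described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the edited codons are the ones implied by the c → t rule (acg for the start; caa, cag and cga for the stops), and the length n is assumed to be counted from the first base of the start codon up to, but not including, the stop codon.

```python
START = "atg"
STOPS = {"tag", "tga", "taa"}
EDITED_START = {"acg"}                # acg -> atg
EDITED_STOPS = {"caa", "cag", "cga"}  # -> taa, tag, tga


def reverse_complement(seq):
    """Reverse strand of a lower-case DNA sequence."""
    return seq.translate(str.maketrans("acgt", "tgca"))[::-1]


def scan_frame(seq, frame, min_len=300):
    """Return (start, stop, start_is_edited, stop_is_edited) tuples for one
    reading frame; start and stop are nucleotide offsets into `seq`."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    found = []
    i = 0
    while i < len(codons):
        # Look for the first original or edited start codon.
        if codons[i] != START and codons[i] not in EDITED_START:
            i += 1
            continue
        start = i
        stop = None
        for j in range(start + 1, len(codons)):
            n = 3 * (j - start)  # assumed counting convention, see lead-in
            if codons[j] in STOPS:
                stop = j         # an original stop always closes the ORF
                break
            if codons[j] in EDITED_STOPS and n >= min_len:
                stop = j         # an edited stop counts only if far enough
                break
        if stop is None:
            break                # end of the frame reached
        if 3 * (stop - start) >= min_len:
            found.append((frame + 3 * start, frame + 3 * stop,
                          codons[start] != START, codons[stop] not in STOPS))
        i = stop + 1             # resume scanning after the stop codon
    return found


def find_candidate_orfs(seq, min_len=300):
    """Scan the three frames of both strands."""
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            hits += [(strand,) + h for h in scan_frame(s, frame, min_len)]
    return hits
```

Because scanning resumes after each accepted or rejected stop codon, only the first start codon before a given stop is used, which keeps the extracted subsequences maximal as required above. Translation of the extracted subsequences into amino acids is left out of this sketch.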

2.2 Editing on the Amino Acid Sequences

Let $P_{ORF}$ be the set of extracted amino acid sequences. A first question is to what extent possible RNA editing events occurring in each sequence of $P_{ORF}$ may influence the prediction process (recall that the only editing we focus on here is c → u). Note that, if we simulate editing on the sequences in $P_{ORF}$, we should take into account all the c → u editing configurations that might occur, the number of which is $2^k$, where $k$ is the number of c occurrences in the ORF sequence under consideration. However, for the purposes of our analysis, two or more such configurations are to be considered equivalent as long as they produce the same amino acid. Note also that, since more than one c can occur within a single triplet, that triplet can yield different amino acids via editing. This is the case, for instance, of the amino acid P (Proline), which corresponds to four triplets including ccc and from which, by editing, three amino acids, namely L (Leucine), S (Serine) and F (Phenylalanine), can be obtained. Therefore, a quantitative analysis is useful here.

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

Figure 1: The protein prediction method based on editing simulation.

atc gga tca tca tgc ata atc gaa caa agc tta tcc gca tgg

taa agt agt tta cca cac aag tcg aca aaa aag acg ttc ggc

ttt aga aat cat ttt ttt gct ccc tca tcc tcg gtt gtt cgt att tca ttt tct tca aag gca cat gca cta

ggt tac tta cgg aat ctc aaa gaa aga gtc gtc cag gag cac ttc gtt aga ttt gca tgt gtt aag cat ata

gct gaa gtt gcc tat gcg ctt caa cct gct ctt aca aga cga atc tct ttc tat acg caa ttt caa cta gag

tct act cct ttc tgg tct gaa atc tca gta gag acg ata aag att agg tgc ctt tct ttc tat agg gat agg

tgc ttc tct cta

tag aaa gaa agg aga tcc agt tta cca ttg aga gta gag aag ggg aag

Figure 2: Editing of the start codon acg → atg.
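The $2^k$ count of c → u configurations and the Proline example can be reproduced with a few lines of Python. The sketch below assumes the standard genetic code, written here as a compact lookup table; the names `CODON_TABLE` and `editing_configurations` are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def editing_configurations(seq):
    """All sequences obtained by editing any subset of the c's into t
    (the unedited sequence is included)."""
    options = [("c", "t") if base == "c" else (base,) for base in seq]
    return ["".join(choice) for choice in product(*options)]


configs = editing_configurations("ccc")
k = "ccc".count("c")
print(len(configs) == 2 ** k)  # True: 8 configurations for k = 3

# Configurations are equivalent when they yield the same amino acid.
new_amino_acids = {CODON_TABLE[c] for c in configs} - {CODON_TABLE["ccc"]}
print(sorted(new_amino_acids))  # ['F', 'L', 'S']
```

Grouping the eight configurations of ccc by translated amino acid leaves only four classes (P, L, S and F), which is why equivalent configurations need not be analyzed separately.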

Thus, let $a$ be an amino acid whose codon contains a c such that a c → u substitution leads to the generation of an amino acid $a' \neq a$. We say that $a$ is an editable amino acid. Analogously, we call editable c each c that may cause the generation of a new amino acid after a c → u substitution. We use the term editing substitutions to refer to both c → u substitutions and the corresponding $a \to a'$ substitutions, according to the case under analysis (nucleotide sequences or amino acid sequences, respectively).

In the following we report an analysis performed in order to evaluate the effect of editing on the amino acid sequences. Figure 4 shows the distribution of the number of c, editable c and editable amino acids per unit of length, with respect to all the amino acid sequences generated from rice mtDNA using the technique illustrated in the previous section. A Gaussian fit has been performed for each distribution: the abscissa corresponding to the peak of each fitted curve has been found to agree with the corresponding calculated average value. Moreover, the coverage expected for normal distributions has been observed: about 64%, 66% and 67% of the set lie within one standard deviation of the mean for the fraction of c, editable c and editable amino acids, respectively. Two standard deviations from the mean account for about 98%, 97% and 95% of the set for each distribution, respectively.
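The three per-sequence quantities whose distributions are analyzed here can be computed as in the Python sketch below. It rests on assumptions that the text does not spell out: the standard genetic code is used, a c is counted as editable when changing that c alone yields a different amino acid, an amino acid is counted as editable when some combination of c → t edits in its codon yields a different amino acid, and edits that create a stop codon are not counted. The function name is illustrative.

```python
from itertools import product

BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def editable_fractions(orf):
    """Return (nr. c / length, nr. editable c / length,
    nr. editable amino acids / nr. amino acids) for a nucleotide ORF
    whose length is a non-zero multiple of 3."""
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    editable_c = 0
    editable_aa = 0
    for codon in codons:
        original = CODON_TABLE[codon]
        # Editable c: editing this c alone changes the amino acid.
        for pos, base in enumerate(codon):
            if base == "c":
                single = codon[:pos] + "t" + codon[pos + 1:]
                if CODON_TABLE[single] not in (original, "*"):
                    editable_c += 1
        # Editable amino acid: some combination of edits changes it.
        options = [("c", "t") if b == "c" else (b,) for b in codon]
        reachable = {CODON_TABLE["".join(v)] for v in product(*options)}
        if reachable - {original, "*"}:
            editable_aa += 1
    return (orf.count("c") / len(orf),
            editable_c / len(orf),
            editable_aa / len(codons))


print(editable_fractions("cccgct"))
```

For the toy input cccgct (Proline, Alanine), four of the six nucleotides are c, three of them are editable, and both amino acids are editable.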

Interestingly, looking at Figure 4, we observe that the amino acid sequences are more sensitive to editing substitutions than the original candidate ORF sequences from which they were obtained. Indeed, the curve fitting editable amino acids is shifted along the x-axis by approximately a factor of 3 with respect to the curve corresponding to editable c. We also observe that, in some cases, editing substitutions involve more than 40% of an amino acid sequence, thus potentially causing significant variations with respect to the amino acid sequence that would have been obtained by translating the original nucleotide sequence without considering editing.

Unfortunately, in order to generate all the different amino acid sequences that can be obtained from all the possible combinations of c → u substitutions, we would have to generate a very large number of configurations, each to be then searched for possible homologies and/or transcribed sequences. In order to avoid such a blow-up in the number of candidate ORF sequences to analyze, we propose the following strategy.

Let $s_c$ be the amino acid sequence of a candidate protein, obtained according to the procedure illustrated in Section 2.1. We first try to identify some known proteins to which $s_c$ becomes homologous after undergoing suitable editing. The idea is to consider a suitable metric to compute the distance between $s_c$ and each $s_k$ belonging to a set of known proteins, in such a way as to take editing directly into account. In this way, only edited sequences that are homologous to some already known proteins are generated from $s_c$. In more detail, given a candidate protein with amino acid sequence $s_c$ and a known protein with amino acid sequence $s_k$, the distance between $s_c$ and $s_k$ is equal to $\sigma$ if there exists a set of editing substitutions transforming $s_c$ into a sequence $\hat{s}_c$ such that the distance between $\hat{s}_c$ and $s_k$ is $\sigma$. We consider the homology between $\hat{s}_c$ and $s_k$ significant if $\sigma$ is less than a fixed threshold $\sigma_{th}$. In cases where no such homolog can be singled out, we keep the “original” $s_c$ (identified by the Editing ORF Simulation Module) for further analysis. Otherwise, we choose one among those $\hat{s}_c$ scoring both the lowest $\sigma$ and the smallest set of editing substitutions.

Input: A nucleotide sequence $s$;
Output: A set of amino acid sequences $P_{ORF}$;
1. $P_{ORF} = \emptyset$;
2. for each of the three possible reading frames fr of $s$
3.   repeat
4.     repeat
5.       read a triplet t from fr;
6.     until t is a start codon or by editing t a start codon is achieved;
7.     set $c_{START}$ to t;
8.     repeat
9.       read a triplet t from fr;
10.    until t is a stop codon or by editing t a stop codon is achieved;
11.    set $c_{STOP}$ to t;
12.    let $n$ be the number of nucleotides between $c_{START}$ and $c_{STOP}$;
13.    if $c_{STOP}$ is an edited stop codon
14.      if $n < 300$
15.        skip $c_{STOP}$ and goto step 8;
16.      end if
17.    end if
18.    if $n \geq 300$
19.      extract the nucleotide subsequence $s'$ between $c_{START}$ and $c_{STOP}$;
20.      translate $s'$ into an amino acid sequence $p$;
21.      $P_{ORF} = P_{ORF} \cup \{p\}$;
22.    end if
23.  until the end of fr is reached;
24. end for
25. return $P_{ORF}$;

Figure 3: The Editing ORF Simulation Module.

Figure 4: Distribution of the number of c, editable c, and editable amino acids per unit of length in $P_{ORF}$. Plotted series: nr. C / length, nr. editable C / length, nr. editable Aa / nr. aa (x-axis from 0.1 to 0.4; values annotated on the curves: 0.12, 0.03, 0.32, 0.06, 0.21, 0.05).
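One possible way to make a distance "take editing into account" is sketched below in Python. It is an illustration under stated assumptions, not the authors' metric: the candidate is given as a nucleotide ORF, each codon is expanded into the set of amino acids reachable by c → t editing under the standard genetic code, and a Levenshtein-style dynamic program charges nothing for a substitution whenever the known protein's residue belongs to that set. The result is the minimum Levenshtein distance over all edited versions of the candidate. The tie-breaking rule on the smallest set of editing substitutions is not implemented.

```python
from itertools import product

BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reachable_amino_acids(codon):
    """Amino acids obtainable from a codon with zero or more c -> t edits."""
    options = [("c", "t") if b == "c" else (b,) for b in codon]
    return {CODON_TABLE["".join(v)] for v in product(*options)}


def editing_distance(orf, known):
    """Minimum Levenshtein distance between `known` (amino acids) and any
    edited translation of `orf` (nucleotides, length a multiple of 3)."""
    choices = [reachable_amino_acids(orf[i:i + 3])
               for i in range(0, len(orf), 3)]
    prev = list(range(len(known) + 1))
    for i, options in enumerate(choices, 1):
        cur = [i]
        for j, residue in enumerate(known, 1):
            substitution = prev[j - 1] + (0 if residue in options else 1)
            cur.append(min(substitution, prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


# acg ccc gct translates to T P A without editing, but editing can
# turn it into M L A.
print(editing_distance("acgcccgct", "MLA"))  # 0
```

Since every codon is edited independently of the others, matching each position against its set of reachable amino acids gives the same result as enumerating all $2^k$ configurations, without generating them.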

We work by minimizing the Levenshtein distance (Levenshtein, 1966) between sequences, modified to take into account possible amino acid substitutions, as shown in the following example.

Protein sequences in $P_{ORF}$ for the organism $O$ (e.g., Oryza) are compared against known proteins (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) by simulating editing as explained above, in order to single out interesting homologies. Some of the sequences in $P_{ORF}$ may turn out to be known $O$ proteins, in which case we discard them from further analysis. Let $P_{edited}$ be the resulting set of amino acid sequences, where the original proteins of $P_{ORF}$ are possibly replaced by the edited sequences corresponding to minimum distance configurations. We can divide $P_{edited}$ into two further subsets, $P'_{edited}$ and $P''_{edited}$: $P'_{edited}$ includes amino acid sequences for which significant homologies have been found with respect to some proteins belonging to other organisms, while $P''_{edited}$ contains the remaining ones.

Figure 5 illustrates the pseudocode for this step of our approach.

Consider again the ORF sequence discussed in Example 1. By applying the procedure explained in this section, the corresponding amino acid sequence, which did not present any significant homolog without editing, shows high similarity with V5IJ74_IXORI, a putative ATP synthase subunit of the common tick Ixodes ricinus.

Input: The set of amino acid sequences $P_{ORF}$;
       A set of known protein sequences $P_{known}$;
       A distance threshold $\sigma_{th}$;
Output: A set of edited amino acid sequences $P_{edited}$;
1. $P'_{edited}, P''_{edited} = \emptyset$;
2. for each amino acid sequence $s_c \in P_{ORF}$
3.   $s_{edit}, s_{hom} = \varepsilon$; /* null string */
4.   $\sigma_{min} = d(s_c, \varepsilon)$; /* initial distance is set to the maximum possible value */
5.   for each protein sequence $s_k \in P_{known}$
6.     find the amino acid sequence $\hat{s}_c = \varphi(s_c)$, where $\varphi$ is an operator transforming $s_c$ into $\hat{s}_c$ by applying a finite sequence of editing substitutions to minimize $d(\hat{s}_c, s_k)$;
7.     if $d(\hat{s}_c, s_k) < \sigma_{min}$
8.       $s_{edit} = \hat{s}_c$;
9.       $s_{hom} = s_k$;
10.      $\sigma_{min} = d(\hat{s}_c, s_k)$;
11.    end if
12.  end for
13.  if $s_{hom}$ does not belong to $O$ and $\sigma_{min} \leq \sigma_{th}$
14.    add $s_{edit}$ to $P'_{edited}$;
15.  if $s_{hom}$ does not belong to $O$ and $\sigma_{min} > \sigma_{th}$
16.    add $s_c$ to $P''_{edited}$;
17. end for
18. return $P_{edited} = P'_{edited} \cup P''_{edited}$;

Figure 5: The Computing Distances Module.
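The control flow of the Computing Distances Module (Figure 5) can be sketched in Python as follows. To keep the example self-contained, a plain Levenshtein distance stands in for the editing-aware distance, so the candidate itself plays the role of its edited version; the function names, the `(sequence, organism)` representation of known proteins and the toy sequences are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (stand-in for the editing-aware one)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def compute_distances(p_orf, p_known, threshold, organism, distance=levenshtein):
    """Split candidates into those with a significant homolog in another
    organism and those without; drop candidates whose closest known
    protein belongs to `organism`.

    p_known is a list of (sequence, organism) pairs."""
    with_homolog, without_homolog = [], []
    for candidate in p_orf:
        best_seq, best_org = "", None
        best = distance(candidate, "")   # maximum possible value
        for known_seq, known_org in p_known:
            d = distance(candidate, known_seq)
            if d < best:
                best_seq, best_org, best = known_seq, known_org, d
        if best_org == organism:
            continue                     # closest protein is from O itself
        if best <= threshold:
            with_homolog.append((candidate, best_seq))
        else:
            without_homolog.append(candidate)
    return with_homolog, without_homolog


known = [("MKTAYIAK", "Zea mays"), ("GGGGGG", "Oryza sativa")]
candidates = ["MKTAYIAR", "GGGGGG", "WWWWWWWW"]
print(compute_distances(candidates, known, threshold=2, organism="Oryza sativa"))
```

In this toy run the first candidate is kept with its Zea mays homolog, the second is dropped because it matches a protein of the organism under study, and the third is kept as a sequence without homologs.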

2.3 Final Predictions

The amino acid sequences in $P_{edited}$ are further analyzed by searching for possible transcripts, since their presence can be considered indicative of gene activity. In particular, DBEST (Boguski et al., 1993) is queried with each $s \in P_{edited}$ in order to detect significant homologies with known expressed sequences. Eventually, our system returns two sets of predicted proteins, $P'$ and $P''$, containing the amino acid sequences in $P'_{edited}$ and in $P''_{edited}$, respectively, for which transcripts have been found in $O$ (e.g., Oryza). As an example, the edited amino acid sequence of the ORF discussed in Example 1 has ESTs in Zea mays but not in Oryza, and has therefore been discarded.
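The final filtering step reduces to keeping the candidates with at least one transcript in the organism under study. The Python sketch below assumes that the EST search has already been run and summarized as a mapping from candidate identifiers to the organisms with matching ESTs; the identifiers, sequences and function name are hypothetical.

```python
def final_predictions(candidates, est_hits, organism):
    """Keep only candidates with at least one EST match in `organism`.

    candidates: dict mapping a candidate id to its amino acid sequence
    est_hits:   dict mapping a candidate id to the set of organisms in
                which matching ESTs were found (assumed precomputed)"""
    return {
        cid: seq
        for cid, seq in candidates.items()
        if organism in est_hits.get(cid, set())
    }


# Toy data: the first candidate mirrors the case of Example 1, whose
# ESTs were found in Zea mays but not in Oryza.
candidates = {"example_1": "MFG", "other": "MKT"}
est_hits = {"example_1": {"Zea mays"}, "other": {"Oryza sativa", "Zea mays"}}
print(final_predictions(candidates, est_hits, "Oryza sativa"))  # {'other': 'MKT'}
```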

3 RESULTS

We applied our method to Oryza sativa (rice) mtDNA with the aim of predicting possible new mitochondrial proteins. The entire mitochondrial genome of rice has been sequenced (Notsu et al., 2002); it was found to be 490,520 bp long. To date, 81 genes have been identified, 53 of which code for proteins. The automatic simulation of editing on all the potential start and stop codons of rice mtDNA leads to the generation of a total of 176 candidate ORF sequences, 138 of which involve edited start and stop codons.

In order to validate our approach, we first verified whether the two proteins that are known to be generated by RNA editing in Oryza sativa were actually recognized by our system. We found both of them: the NADH dehydrogenase subunit 1 and the NADH dehydrogenase subunit 4.

Candidate ORF sequences involving edited start and/or stop codons consist of 60 sequences with editing only on the start codon and 78 sequences with editing only on the stop codons. The latter seem to be less interesting for our analysis, since they represent subsequences of ORF sequences that can also be generated by other available ORF finder tools. In this analysis, we focus only on the former 60 candidate ORF sequences. Among them, we found 32 sequences corresponding to proteins already described in rice, 7 not known in Oryza but homologous to proteins identified in other organisms, and 21 sequences that have not been described before (see Figure 6).

The screening of the DBEST database (Boguski et al., 1993) by TBLASTN (Altschul et al., 1997) gave very interesting results: six candidate ORF sequences from the forward DNA strand and seven from the reverse strand (Table 1) showed positive matches, indicating their transcription in the organism under study. Because transcription of an open reading frame indicates gene activity, we directed our further analysis to these 13 transcribed ORFs. The first column in Table 1 contains progressive numbers indicating the considered candidate ORF sequences, while the third and fourth columns show the positions in the nucleotide sequence of the start and the stop codon of each sequence, respectively. The last column shows the organisms where the corresponding transcribed ORF has been found. Among these sequences, five (2 from the forward and 3 from the reverse strand) were homologous to proteins already known in other organisms, as reported in Table 2, but eight sequences have never been described until now. The evidence of RNA transcription from these sequences leads us to suppose that they may indeed represent new genes.

The second and third columns in Table 2 show the query coverage and percent identity of the protein BLAST results, respectively. Among the candidate ORFs showing homology with proteins already known in other organisms, four are returned by our system as hypothetical proteins. In particular, sequence 6 in Table 2 is homologous to a protein described in Zea mays (NCBI accession number AAR91184), a monocotyledonous plant, and in Trichoplax adhaerens, a placozoan. Sequence 7 shows homology with a protein described in Persephonella marina (YP_002730925) and many bacteria, sequence 10 is homologous to a protein identified in Nicotiana tabacum (YP_173435) and other plants, while sequence 12 is homologous to a protein described in Brassica napus (YP_717160). DBEST screening showed that all of them are expressed not only in Oryza sativa but in several organisms. Functional studies can clarify the nature of these proteins. Sequence 4 showed high similarity with the PG1 protein, a factor involved in transcription regulation, in several plants and many bacteria. The high similarity with the same protein in organisms that are very distant from an evolutionary point of view strongly indicates that our candidate ORF sequence of Oryza actually corresponds to the PG1 protein.

Figure 6: Classification of the discussed sequences.

4 CONCLUSION

We proposed a method to predict novel candidate proteins resulting from c → u editing substitutions in plant mitochondrial DNA. The idea is to simulate the natural RNA editing mechanism in order to generate possible Open Reading Frame sequences coding for uncharacterized proteins. The approach allowed us to identify interesting amino acid sequences in Oryza which could represent as yet unknown proteins.

As future work, we will first test the method on the mtDNA of other plant mitochondria. Then, we plan to investigate different strategies for the inner editing of the candidate sequences, for example based on the analysis of the context around the c → u substitution (Mulligan et al., 2007). Furthermore, we plan to extend the method to handle next-generation sequencing data, as already done in (Picardi and Pesole, 2013). Finally, we observe that proteins with low sequence homology often have similar functions and secondary/tertiary structures; it therefore appears sensible to compare such structures when assessing the results, possibly by means of suitable prediction techniques (see, e.g., (Palopoli et al., 2009)).

Table 1: ORF sequences with transcription in rice.

SEQ. NR.  QUERY COV.  START COD.  STOP COD.  EST
1         124         354085      354460     O. sativa, T. dactyloides, Z. mays, others
2         108         407800      408127     O. sativa, B. oldhamii, T. dactyloides, Zea, T. aestivum, S. bicolor
3         99          467635      467935     O. sativa, S. bicolor, Z. mays
4         111         283844      284180     O. sativa, Z. mays, several bacteria
5         107         362648      362972     O. sativa, B. oldhamii, Z. mays, Triticum, S. bicolor, V. vinifera, others
6         139         364454      364874     O. sativa, Z. mays, B. oldhamii, Triticum, S. bicolor, V. vinifera, A. thaliana, others
7         200         463889      463286     O. sativa, Z. mays, V. vinifera, T. aestivum, others
8         127         232370      231986     O. sativa, B. oldhamii, Z. mays, T. aestivum, C. sinensis, others
9         108         449361      449034     O. sativa, Z. mays, C. papaya, T. dactyloides, R. communis, others
10        112         314493      314154     O. sativa, Z. mays, B. oldhamii, T. dactyloides, Zea, S. bicolor, others
11        142         295218      294789     O. sativa, E. crassipes, B. oldhamii, L. tulipifera, others
12        114         201474      201129     O. sativa, Z. mays, T. dactyloides, B. oldhamii, others
13        100         105822      105519     O. sativa, T. aestivum, Petunia, T. dactyloides, B. oldhamii, others
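The coordinates in Table 1 can be cross-checked with a short Python script. The rows below are copied from the table; reading the second column as a length in amino acids is an interpretation suggested by the numbers themselves (the column header in the table is QUERY COV.), and treating a start position greater than the stop position as a reverse-strand ORF is likewise an assumption.

```python
# (sequence number, second column, start codon position, stop codon position)
TABLE_1 = [
    (1, 124, 354085, 354460), (2, 108, 407800, 408127),
    (3, 99, 467635, 467935), (4, 111, 283844, 284180),
    (5, 107, 362648, 362972), (6, 139, 364454, 364874),
    (7, 200, 463889, 463286), (8, 127, 232370, 231986),
    (9, 108, 449361, 449034), (10, 112, 314493, 314154),
    (11, 142, 295218, 294789), (12, 114, 201474, 201129),
    (13, 100, 105822, 105519),
]

consistent = True
forward = reverse = 0
for nr, second_column, start, stop in TABLE_1:
    span = abs(stop - start)  # nucleotides from start codon to stop codon
    # Every span is a whole number of codons and the second column equals
    # the number of codons in the span minus one.
    if span % 3 != 0 or second_column != span // 3 - 1:
        consistent = False
    if start < stop:
        forward += 1
    else:
        reverse += 1

min_span = min(abs(stop - start) for _, _, start, stop in TABLE_1)
print(consistent, forward, reverse, min_span)  # True 6 7 300
```

The check agrees with the six forward-strand and seven reverse-strand ORFs reported in Section 3, and the shortest span (sequence 3) is exactly 300 nucleotides, the minimum length required by the method.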

ACKNOWLEDGEMENTS

PRIN Project 20122F87B2 “Approcci composizionali per la caratterizzazione e il mining di dati omici” (to F.F., C.G. and S.E.R.), financed by the Italian Ministry of Education, Universities and Research.


Table 2: ORF sequences with homology to existing proteins (indicated by their name or NCBI accession number).

SEQ. NR.  QUERY COV.  IDENT.  HOMOLOGUE ORGANISMS                         HOMOLOGUE PROTEINS
4         89          49      Some plants, many bacteria                  PG1
6         46          98      Z. mays, T. adhaerens                       AAR91184
7         59          57      P. marina, Bacteria                         YP_002730925
10        95          89      N. tabacum, B. vulgaris, A. thaliana, other YP_173435
12        69          78      B. napus                                    YP_717160

REFERENCES

Altschul, S. F. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402.

Boguski, M. S., Lowe, T. M., and Tolstoshev, C. M. (1993). dbEST: database for Expressed Sequence Tags. Nat Genet., pages 332–333.

Brennicke, A., Marchfelder, A., and Binder, S. (1999). RNA editing. FEMS Microbiol. Rev., 23:297–316.

Bundschuh, R. (2004). Computational prediction of RNA editing sites. Bioinformatics, 20(17):3214–3220.

Burger, G., Gray, M. W., and Lang, B. F. (2003). Mitochondrial genomes: anything goes. TRENDS in Genetics, 19(12):709–716.

Gray, M. W., Hanic-Joyce, P. J., and Covello, P. S. (1992). Transcription, processing and editing in plant mitochondria. Annu. Rev. Plant Physiol. Plant Mol. Biol., 43:145–175.

Gualberto, J. M., Lamattina, L., Bonnard, G., Weil, J. H., and Grienenberger, J. M. (1989). RNA editing in wheat mitochondria results in the conservation of protein sequences. Nature, 341:660–662.

Hoch, B., Maier, R. M., Appel, K., Igloi, G. L., and Kossel, H. (1991). Editing of a chloroplast mRNA by creation of an initiation codon. Nature, 353:178–180.

Lenz, H. and Knoop, V. (2013). PREPACT 2.0: Predicting C-to-U and U-to-C RNA editing in organelle genome sequences with multiple references and curated RNA editing annotation. Bioinform Biol Insights, 7:1–19.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.

Mower, J. P. (2005). PREP-Mt: predictive RNA editor for plant mitochondrial genes. BMC Bioinformatics, 6:96.

Mulligan, R., Chang, K. L., and Chou, C. C. (2007). Computational analysis of RNA editing sites in plant mitochondrial genomes reveals similar information content and a sporadic distribution of editing sites. Mol Biol Evol, 24(9):1971–1981.

Notsu, Y. et al. (2002). The complete sequence of the rice (Oryza sativa L.) mitochondrial genome: frequent DNA sequence acquisition and loss during the evolution of flowering plants. Mol Genet Genomics, 268(4):434–445.

Palopoli, L., Rombo, S. E., Terracina, G., Tradigo, G., and Veltri, P. (2009). Improving protein secondary structure predictions by prediction fusion. Information Fusion, 10(3):217–232.

Picardi, E. and Pesole, G. (2013). REDItools: high-throughput RNA editing detection made easy. Bioinformatics, 29(14):1813–1814.

Picardi, E. and Quagliariello, C. (2005). EdiPy: a resource to simulate the evolution of plant mitochondrial genes under the RNA editing. Comput. Biol. Chem., 30(1):77–80.

Picardi, E., Regina, T. M. R., Brennicke, A., and Quagliariello, C. (2007). REDIdb: the RNA editing database. Nucleic Acids Research, 35:D173–D177.

Regina, T. M. R., Lopez, L., Picardi, E., and Quagliariello, C. (2002). Striking differences in RNA editing requirements to express the rps4 gene in magnolia and sunflower mitochondria. Gene, 286:33–41.

Schuster, W., Unseld, M., Wissinger, B., and Brennicke, A. (1990). Ribosomal protein S14 transcripts are edited in Oenothera mitochondria. Nucleic Acids Res., 18:229–233.

Takenaka, M., Verbitskiy, D., van der Merwe, J. A., Zehrmann, A., and Brennicke, A. (2008). The process of RNA editing in plant mitochondria. Mitochondrion, 8:35–46.

Wintz, H. and Hanson, M. R. (1991). A termination codon is created by RNA editing in the petunia mitochondrial atp9 gene transcript. Curr Genet, 19:61–64.
