INFERRING MOBILE ELEMENTS IN S. CEREVISIAE STRAINS

Giulia Menconi

, Giovanni Battaglia

, Roberto Grossi

, Nadia Pisanti

and Roberto Marangoni

2,3

Istituto Nazionale di Alta Matematica, Citt

a Universitaria, 00185 Roma, Italia

Dipartimento di Informatica, Universit

a di Pisa, 56127 Pisa, Italia

CNR-Istituto di Bioﬁsica, 56124 Pisa, Italia

Keywords:

Residentome, Retrotransposons, Mobile DNA elements, L-grams.

Abstract:

We aim at ﬁnding all the mobile elements in a genome and understanding their dynamic behavior. Comparative

genomics of closely related organisms can provide the data for this kind of investigation. The comparison task

requires a huge amount of computational resources, which in our approach we alleviate by exploiting the

high similarity between homologous chromosomes of different strains of the same species. Our case study

is for RefSeq and two other strains of S. cerevisiæ. Our fast algorithm, called REGENDER, is driven by data

analysis. We found that almost all the chromosomes are composed by resident genome (more than 90% is

conserved). Most importantly, the inspection of the non-conserved regions revealed that these are putative

mobile elements, thus conﬁrming that our method is useful to quickly ﬁnd mobile elements. The software tool

REGENDER is available online at http://www.di.unipi.it/∼gbattag/regender.

1 INTRODUCTION

Mobile elements play a prominent role in eukaryotic

genomes. They represent around 15–20% of the Hu-

man genome (Lewin, 2007). They are very viva-

cious, since they are able to jump over the genome,

to duplicate in different positions, and to possibly ex-

press their own genes. They can also change cells’

phenotype, when jumping within a gene sequence

(or its regulatory elements), thus altering its expres-

sion. Some complex pathologies whose molecular

mechanisms and global inheritance are hard to ex-

plain by standard inheritance laws, have found to be

correlated to mobile elements translocations (see e.g.

(Conti et al., 2006)). This scenario suggests to look

at eukaryotic genomes as evolving ecosystems, where

the resident genome (intended as the immotile DNA)

and the different elements of the mobilome (the to-

tal of mobile elements, mainly transposons) act like

different species competing for the available bioche-

mical resources (Le Rouzic et al., 2007; Venner et al.,

2009). The importance of mobile elements strongly

supports the development of tools for quickly map-

ping all of them in a genome. The “classic” strategy

to face this problem consists in taking the known mo-

bile element sequences and ﬁnding their occurrences

in the genome. However, we cannot apply this strat-

egy when we do not know a priori all the different

kinds of mobile elements. Moreover, there are further

problems. Mobile elements are subject to mutation,

fragmentation, and fusion of consecutive elements in

certain cases: hence, many false negative outcomes

are possible. Also, unresolved sequences are a pri-

mary source of difﬁculty.

A more promising strategy is to compare genomes

of closely related organisms. The rationale is that

most of the chromosomal rearrangements observed

in such datasets are probably caused by mobile ele-

ments.

Recently, a dataset suitable for this purpose has

been made available: 39 different strains of S. cere-

visiæ have been sequenced, and the relative genomes

published without annotations (Liti et al., 2009). The

coverage of the dataset is relatively low (one-to-

fourfold), and the genomic sequences contain a frac-

tion of unresolved sequences. Unfortunately, per-

forming a genome-wide alignment is computationally

demanding.

Our proposed approach exploits the high sim-

ilarity between homologous sequences of different

strains of the same species, so as to perform a simple,

but powerful ad hoc alignment. Using L-grams we

detect the non-conserved regions by identifying and

masking the long conserved ones. In other words,

we identify the resident genome and then we ana-

lyze its complementary part to infer elements of the

131

Menconi G., Battaglia G., Grossi R., Pisanti N. and Marangoni R..

INFERRING MOBILE ELEMENTS IN S. CEREVISIAE STRAINS.

DOI: 10.5220/0003137001310136

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2011), pages 131-136

ISBN: 978-989-8425-36-2

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

mobilome. Our method is very efﬁcient, since the

resident genome localization requires just few sec-

onds for chromosomal sequences of millions of bases.

What remains for non-conserved regions is very small

when compared to the length of the chromosomes at

hand. Consequently, more sophisticated and expen-

sive methods can be focused to this restricted set of

candidates.

Our method builds a map for the mobilome of

S. cerevisiæ as a case study. For any given chro-

mosome in our dataset, we classify its segments

as either conserved or non-conserved regions, using

RefSeq@SGD (SGD, 2010) as the reference genome

for this yeast (since RefSeq is the only annotated

strain). We then inspect and mark all the non-

conserved regions trying to infer putative mobile el-

ements. Since S. cerevisiæ is probably the organism

where mobile elements are best characterized, it is an

optimal benchmark for validating our approach.

We apply the following classiﬁcation by refer-

ring to the available annotations of RefSeq: (i) com-

pleted transposons, including their bounding seg-

ments called Long Terminal Repeats (LTRs), in this

case indicated as Ty; (ii) pairs of LTRs without the

transposon in between; (iii) soloLTRs, where a sin-

gle LTR element is found in each of them. Then, the

map annotates all these occurrences of putative mo-

bile elements. Interestingly, the map can be seen as a

puzzle with some missing pieces (the non-conserved

regions). A global view at its conﬁguration allows

to locate a single candidate position for several non-

conserved regions with high conﬁdence, even if their

relocation involves unresolved symbols marked by Ns.

Note that we deploy the peculiar structure of the trans-

posons that can be evinced from RefSeq: they are

long between 5 000b–6000b (bases) and are delim-

ited by two LTRs of 200b–300b. We illustrate our

approach by involving the S. cerevisiæ strains Y55

and YPS128 from the given dataset (Liti et al., 2009),

whose chromosomes are compared against RefSeq’s.

What we present in this paper can be easily extended

to the rest of the strains in the above dataset. More-

over, given its speed, the method can be applied to

the analysis of similar strains of species with possibly

much longer genomes.

2 SYSTEM AND METHODS

Preliminary Data Analysis. Our approach is driven

by data analysis. We performed a preliminary

study to understand how to grasp the high similar-

ity in our dataset. Consider a chromosomes’ pair

(ChrN

, ChrN

), where A is RefSeq and B is either

Table 1: Statistics (%) for the L-grams (L = 32) satisfying

properties (a)–(c). The length of each chromosome N in

Y55 is also given.

N bases (a) (b) (c) N bases (a) (b) (c)

1 248 261 81.32 59.92 0.43 9 467 776 89.53 74.02 0.29

2 800 992 98.54 82.55 0.14 10 770 597 94.60 76.43 0.62

3 321 691 93.82 83.74 2.13 11 693 726 97.64 78.98 0.11

4 1 522 688 96.24 77.38 0.89 12 1 067 059 95.12 78.66 2.15

5 577 152 96.33 77.66 0.39 13 923 317 96.96 84.35 0.52

6 273 660 97.56 76.05 0.19 14 781 629 98.20 82.46 0.15

7 1 113 452 95.05 78.63 0.73 15 1 105 914 95.67 81.07 0.33

8 566 494 95.47 78.70 0.84 16 946 183 96.53 83.55 0.65

Y55 or YPS128; also, 1 ≤ N ≤ 16 since the yeast

genome consists of 16 chromosomes. Examine all

the possible (overlapping) L-grams of ChrN

as can-

didates, where an L-gram is a segment of L consecu-

tive bases.

Assuming that ChrN

contains m bases, there are

m − L + 1 L-grams, accounting for possible dupli-

cates. Call valid the L-grams that do not contain any

symbol N. The common L-grams are the valid L-grams

that occur exactly (i.e. fully conserved with no muta-

tion) both in ChrN

and ChrN

In our experiments, L = 32 resulted to be a good

choice, leading to the following empirical facts that

were observed for chromosomes N = 2, 3, . . . , 16,

with chromosome N = 1 (whose percentages are

shown inside parentheses below) being an outlier.

The reported percentages are absolute, as they are ob-

tained by dividing the number of wanted L-grams by

m − L + 1.

(a) The valid L-grams are numerous: they are in

the range 89.53%–98.54% in Y55 and from 88.84%

to 97.27% in YPS128 (81.32% in Y55 and 77.29% in

YPS128 for chromosome 1).

(b) The common L-grams are also numerous:

they are between 74.02%–84.35% in Y55 and be-

tween 71.71%–83.24% in YPS128 (59.92% in Y55

and 58.47% in YPS128 for chromosome 1).

genome are the vast majority: indeed, those occur-

ring two or more times are very few, between 0.11%–

2.15% in Y55 and 0.07%–1.93% in YPS128.

A summary reporting the above percentages for

the L-grams in the 16 chromosomes of Y55 is shown

in Table 1. The implication of (a)–(c) is that we can

localize the conserved regions using the common L-

grams, as discussed next.

Conserved Regions. Our algorithm for the rapid

detection of large highly-conserved segments, called

REGENDER (REsident GENome DEtectoR), is driven

by the above data analysis. It performs a two-phase

processing of all the possible chromosomes’ pairs

(ChrN

, ChrN

), where A is RefSeq, B is either Y55

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

132

or YPS128, and N ranges from 1 to 16. In the ﬁrst

phase, REGENDER ﬁnds the common L-grams be-

tween ChrN

and ChrN

. In the second phase, RE-

GENDER aggregates consecutive L-grams in a greedy

fashion using some user-deﬁned parameters that con-

trol when the next conserved region begins in both

ChrN

and ChrN

. We provide the details of the algo-

rithm in Section 3.

REGENDER is somewhat related to the anchor-

based algorithms (Ohlebusch and Abouelhoda, 2006)

that circumvent the quadratic costs. Such algorithms

share a common mechanism. First, they build a dic-

tionary to store the fragments or seeds that are com-

mon to both ChrN

and ChrN

. Second, they extend

the fragments/seeds into longer sequences called an-

chors using dynamic programming, except chaining

algorithms (Ohlebusch and Abouelhoda, 2006). The

sequence of anchors thus found are required to be co-

linear; namely, the anchors should occur in the same

relative order inside both ChrN

and ChrN

. Third,

these algorithms apply an expensive dynamic pro-

gramming scheme to the regions of ChrN

and ChrN

that are left uncovered by the anchors. Driven by our

data analysis, REGENDER can go simpler. First, the L-

grams of ChrN

are stored in a hash table, and those

of ChrN

are searched in the table during a scan of

ChrN

. The high similarity of ChrN

and ChrN

jus-

tiﬁes our choice of exact L-grams as fragments. Se-

cond, our dataset gives almost surprisingly a natural

set of anchors: contrarily to the anchor-based algo-

rithms, we do not need any dynamic programming

or chaining techniques to enforce the colinearity and

the non-overlapping property, since there is almost a

one-to-one mapping between the occurrences of the

L-grams (see Section 2). Actually, we take advantage

of the fact the L-grams overlap and, if they are not co-

linear, we get a hint for a possible translocation. As a

result, REGENDER performs just a scan of ChrN

and

ChrN

. One execution of REGENDER takes less than

a second on a standard PC with limited amount of

memory. This is a major requirement, since we need

to execute REGENDER for all pairs of corresponding

chromosomes of ChrN

and ChrN

. Third, we remark

that we do not need a complete alignment of ChrN

or ChrN

for the purposes of the analysis performed

in this paper. A high-quality alignment of the con-

served regions in ChrN

or ChrN

is unnecessary in

our case, as illustrated by the clear patterns emerg-

ing from Fig. 1. What we really care about is the de-

scription of the dynamics of the mobilome, identify-

ing and locating all the mobile elements in the input

sequences, together with the genomic rearrangements

they are involved into. A merit of our approach is that

of being able to select a small set of candidates for the

Figure 1: A plot of the common L-grams for Chr4 (1

095 000–1 155 000) of RefSeq (top sequence) and Y55 (bot-

tom sequence), where L = 32. Each line connects the start-

ing positions of a common L-gram. Thus, the empty trian-

gles or trapezoids represent non-conserved regions. Anno-

tated mobile elements are represented by the green rectan-

gles just below the top line; unresolved sequences are the

black rectangles just above the bottom line. High resolution

plots are available online.

latter investigation, as discussed next.

Non-conserved Regions. The outcomes of our ex-

periments with REGENDER are analyzed as follows.

Graphically, we represent the two homologous

chromosomes as two horizontal straight lines, and

place A in the top and B in the bottom, as in Fig. 1.

We mark the conservations with some color. The non-

conserved regions are then detectable as non-colored

trapezoids. The action of a transposable element T

that has changed position from region X of strain A to

region Y of strain B within two homologous chromo-

somes is then represented by two triangles (Fig. 1):

we detect a white downward triangle inside region X

(marking presence of T only in region X of strain A

and absence in strain B), and an upward white tri-

angle in Y (marking presence of T only in region Y

of strain B and absence in corresponding position on

strain A). Therefore, when strain A is the referring se-

quence RefSeq, we can infer that T probably moved

from X to Y inside strain B by projecting region X of

A onto the corresponding part in B.

A picture of possible situations is shown with

some detail in Fig. 2 and described in section 4.

We followed the above conceptual scheme to col-

lect statistics for all the chromosomal rearrangements

among the 16 chromosomes’ pairs from the selected

strains (B is Y55 or YPS128) with the same chromo-

some in A=RefSeq, thus classifying any resulting re-

arrangement. We refer the reader to Section 4 for

an aggregate view of all the chromosomal differences

found and their relationships with the mobilome. We

remark that we considered signiﬁcant events that in-

volve regions containing at least 200b, since very

short indels or mutations are not linked with mo-

INFERRING MOBILE ELEMENTS IN S. CEREVISIAE STRAINS

133

bilome nor with chromosomal rearrangements.

The proposed approach allowed us to obtain a fast

and efﬁcient localization of the resident genome, by

working on a standard computer. Our results clearly

show that the signiﬁcant chromosomal indels involve

almost exclusively the mobilome. Moreover, we show

that unresolved sequences take place almost always in

the correspondence of telomeres or mobile elements.

Our approach allows us to infer putative insertions

and deletions of transposons or LTR elements also in

the presence of unresolved sequences.

3 ALGORITHM AND

IMPLEMENTATION

As previously mentioned, we exploit the high simila-

rity between genomes of different strains by running a

massive computation involving all the possible chro-

mosomes pairs (ChrN

, ChrN

), where A is RefSeq, B

is either Y55 or YPS128 strain, and N = 1, . . . , 16. We

follow a two-phase approach for REGENDER, whose

inputs are two chromosomes ChrN

and ChrN

, the

length L of the grams, and two user-deﬁned parame-

ters δ

and δ

to be used in the second phase. First,

we ﬁnd all the common L-grams between ChrN

and

ChrN

. Second, we detect highly conserved regions

by aggregating consecutive L-grams. Finally, we in-

spect the non-conserved regions that are found by RE-

GENDER, so as to infer mobilome elements. Due to

space constraint we omit the implementation details.

Phase 1 of REGENDER: Common L-grams. We

aim at ﬁnding which L-grams of ChrN

occur inside

ChrN

, where an L-gram is any sequence of L con-

secutive bases. First, we construct a dictionary for all

the L-grams in ChrN

and, then, we search for the L-

grams of ChrN

inside the dictionary. This task can

be performed in expected linear time by employing a

rolling hash approach based on cyclic polynomial, as

described in (Cohen, 1997). Note that using a gen-

eral purpose hash function would be more expensive

by a multiplicative factor of L. Also, using a trie-

based dictionary instead of hashing would guarantee

a linear-time worst-case performance, but hashing is

faster in practice.

A detailed description of the rolling hashing is be-

yond the scope of the current paper. However, it can

be easily proved that the ﬁrst phase of the algorithm

REGENDER requires O(|ChrN

| + |ChrN

|) time on

average.

The output of the ﬁrst phase is a mapping M, as-

sociating each L-gram s

of ChrN

, with its occur-

rence list occs(s

) in ChrN

. If s

does not occur in

ChrN

, occs(s

) is empty. Although not optimal in

the worst case, our hash based approach turned out

to be effective on our datasets, yielding few colli-

sions, and allowing us to compare two entire chromo-

somes in few seconds. We implemented a prototype

in Java, using the fastutil Java collections library

to reduce as much as possible the memory usage (Vi-

gna, 2006). The experiments have been performed on

an Intel Core 2 Duo P8400 notebook, with 4GB of

RAM. The code is single-threaded, and the maximum

amount of RAM available for the ﬁrst phase has been

set to 200MB. The value of the parameter L has been

set to 32, and the load factor of the hash table is set to

α = 0.75.

Phase 2 of REGENDER: Conserved Regions. Du-

ring the second phase, the information about the L-

gram occurrences, stored in the mapping M computed

in the ﬁrst phase, is used to establish a correspondence

between segments of consecutive bases in ChrN

and

ChrN

, mapping a segment I

= ChrN

, r

] into a

corresponding segment I

= ChrN

, r

]. This infor-

mation is represented by the mapping M

, and it is

graphically shown with green lines in Fig. 1.

We perform a left-to-right scan of ChrN

and

ChrN

, according to the following greedy rule. Ini-

tially, I

and I

are empty. During the scan, the cur-

rent segments I

and I

are extended when the fol-

lowing conditions are met: (a) there exists a common

L-gram s, which occurs both to the right of I

and I

and no other L-gram with this property can be found

between I

and s, and I

and s; (b) letting d

be the

number of bases between I

and s, and d

be the num-

ber of bases between I

and s, it is |d

− d

| ≤ δ

and

≤ δ

(hence, d

≤ δ

+ δ

To compute the time complexity of the second

phase of REGENDER algorithm, we observe that the

sum of the sizes of the occurrence lists in M is upper

bounded by |ChrN

| − L + 1. In other words, the size

of the mapping M is O(|ChrN

|+ |ChrN

|), hence the

REGENDER algorithm requires O(|ChrN

|+ |ChrN

time on average.

Inspection of Non-conserved Regions. The contri-

bution of REGENDER is that of reducing a potentially

huge number of candidates to very few of them, so

that the direct inspection of the non-conserved regions

is doable. We perform this crucial analysis of the re-

gions that have not been mapped into segments by M

These are the potential candidates for being mobile

elements. Following, we discuss in details the per-

formed analysis.

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

134

Table 2: Running time in seconds for each chromosome of

the Y55 strain (columns). We run each tool with its default

parameters, specifying, whenever possible, the L, δ

, δ

ar-

guments (rows). Experiments performed on an Intel Core 2

Duo P8400 notebook, with 4GB of RAM, running Ubuntu

10.04.

Chrm Avid Lagan Lastz GSAligner Mauve Murasaki REGENDER

1 5.60 11.50 5.10 2.20 5.60 14.10 1.20

2 21.30 24.10 12.30 16.50 14.50 24.10 2.90

3 7.60 8.60 5.50 3 6 10.60 1.50

4 53.40 36.90 42.60 51.90 27.70 36.50 5.60

5 29.70 12.70 12.40 11.10 11.40 28.70 2.10

6 6.10 4.30 4.40 3.50 5.60 9.50 1.40

7 30.90 30.10 29.90 28.90 21.40 31.10 3.90

8 25.50 15.10 9 8 11.40 25 2.10

9 13.90 14 7.60 4.90 8.80 15.90 1.90

10 77.60 17.10 13.10 15.50 15.10 31 2.70

11 55.10 15.30 10.70 11.80 13.80 20.60 2.60

30.10 34.40 19 24.70 20.30 31.80 3.80

13 23.20 37.60 18 21.50 17.60 25.50 3.50

14 88.60 20.60 7.80 14.30 14.80 25.80 2.80

15 30.30 33.30 23.10 24.90 21.70 36.70 3.90

16 24.50 64.40 19 20.20 18.20 34.50 3.60

4 RESULTS AND DISCUSSION

REGENDER Performances. After the preliminary

data analysis described in Section 2 that led to the

choice of L = 32 for this dataset, we performed some

experiments on all the 16 chromosomes of the two se-

lected strains against RefSeq. REGENDER has proven

to be very fast (δ

= δ

= 100): for instance, it can

process the longest pair of chromosomes (Chr4, about

2Mb) in only 6 seconds.

In Table 2 REGENDER is compared with other ex-

isting tools, reporting the running time for each chro-

mosome of the Y55 strain. Although the REGENDER

prototype has been implemented in Python and Java,

and the code has not been ﬁne tuned, it is, on average,

from four to ten times faster than the other tools. This

is not surprising, since REGENDER does not perform

a high-quality alignment of the input chromosomes as

the other tools do (as explained in Section 2 this is not

necessary in identifying and locating the mobile ele-

ments in the input sequences).

Qualitative Analysis of REGENDER Results. A lo-

cal exploration of the results obtained by REGENDER

showed the following scenario: most of the chro-

mosomes appear to be conserved, therefore they are

graphically covered by a uniform color given by the

overlapping of parallel straight lines connecting iden-

tical L-grams. A set of examples is reported in Fig. 2,

where the top line always represents a region of a

chromosome of RefSeq, while the bottom line rep-

resents the same region in either Y55 or YPS128; here

the mobile elements annotated in RefSeq are repre-

sented by green rectangles placed just below the top

line, while unresolved sequences are represented by

black rectangles placed just above the bottom line.

Chr03-RefSeq

Chr03-Y55

CONS Ty5

pCONSu

pINSu Ty-c prox-mobil pINS Ty-c

pDEL

Chr04-YPS128

pDEL Ty1

b) c) d)

Chr11-RefSeq

Chr04-RefSeq

Chr04-Y55

Chr04-RefSeq

Chr04-Y55

Chr04-Y55Chr11-YPS128

Figure 2: Features detected after REGENDER results.

Conserved regions belong to this category (includ-

ing the mobile elements, when they are annotated on

the RefSeq) and appear to be identically maintained

in the screened strain. These regions are marked as

CONS on Fig. 2(a).

This uniform coverage can be interrupted when,

for example, the screened strain has a long run of un-

resolved bases. Such unresolved sequences are gra-

phically marked by black rectangles. When the lines

connecting their ﬂanking regions are all parallel, it is

likely that this fragment contains exactly the same se-

quence as RefSeq. In this case, we have an exam-

ple of putative conservation, marked as pCONSu, that

graphically appears as shown in Fig. 2(b).

Cases in which there is a sequence on RefSeq that

has no correspondent on the homologous region of the

screened strain are putative deletions. They can oc-

cur when a mobile element is annotated in RefSeq,

in which case they are marked as pDEL-Ty or pDEL-

LTR, if they occur for transposons or LTR, respec-

tively. They are, instead, marked as pDEL when this

putative deletion is not related to mobile elements

(Fig. 2(c),(d)).

Putative insertions are more difﬁcult to categorize,

as the screened strain where they take place are not

annotated. If the sequence is resolved, we employ

standard alignment tools to search it in the RefSeq,

trying to detect whether the fragment has actually

been moved rather than deleted. On the other hand,

when the sequence is unresolved, we can explore only

two features. First, whether or not the length of the

inserted sequence is compatible with either a trans-

poson (when the length of the inserted sequence is

≥ 4000b) or an LTR (when the length of the inserted

sequence is ≤ 500b). Second, whether these inser-

tions take place in a region where in the RefSeq a

mobile element is annotated at a distance less than

200b. For example, the event marked as ”pINSu

Ty-c prox-mobil” in Fig. 2(e) accounts for an inser-

INFERRING MOBILE ELEMENTS IN S. CEREVISIAE STRAINS

135

tion (in the Chromosome 11) in YPS128 strain with

respect to RefSeq, in an unresolved sequence. Since

such insertion takes place less than 200b away from

an LTR annotated in RefSeq, we consider this event

as ”proximal” to a mobile element. This is relevant,

since several observations in the literature suggest that

transposons prefer to migrate in zones where there are

LTR. Finally, Fig. 2(f) shows an event of ”pINS Ty-c”

since the inserted sequence length is compatible with

a transposon.

These cases represent a complete spectrum of the

situations we have found in our screening. The fol-

lowing subsections report an aggregation of the data

we collected for all these categories of events.

Conserved Regions and Mobile Elements. More

than 95% in Y55 and 93% in YPS128 are conserved re-

gions. Most of these are part of the resident genome,

but not all of them. The fraction of conserved trans-

posons or LTRs (i.e. mobilome) within conserved re-

gions contains two possible elements: the truly con-

served transposons (only 1) or LTRs (in a relative low

number), which are annotated onto the RefSeq and

exactly mapped on the screened strain and the puta-

tive conservations of annotated transposons or LTRs,

which are mapped onto unresolved sequences in the

screened strain: in this case, a direct attribution is

impossible. The pCONSu are always found in the

telomeres because the presence of long repeats is a

source of noise for the assembly phase. In all cases

but one, telomeres do not involve sequences related

with mobile elements. Concerning pCONSu that are

outside the telomeres, the number of unresolved se-

quences that are located in correspondence or in prox-

imity of mobile elements, is greater than 90% for Y55

and around 70% for YPS128. This supports the hy-

pothesis that unresolved regions are often located in

correspondence of a mobile element (annotated onto

RefSeq).

Deletions. Deletions concern almost only the mo-

bilome. In Y55 strain, for instance, there are four pu-

tative deleted regions on RefSeq that do not corre-

spond to annotated Ty or soloLTR (against more than

90 pDELs corresponding to mobilome annotations).

We found that the length of the two regions is com-

patible to that of a soloLTR. This evidence strongly

suggests that in genomes closely related, the only sig-

niﬁcant (i.e., for our work, those involving sequences

at least 200 long) chromosomal rearrangements are

due to the mobilome.

Insertions. The landscape for the putative inser-

tions, without (pINS) and with (pINSu) unresolved

regions is rich. We may label the inserted sequences

by proximity to annotated Ty or soloLTR in the inser-

tion site on RefSeq: from 40% to 50% of the cases,

it is a putative mobilome-proximal insertion. As for

deletions, we may also distinguish on the basis of in-

serted sequence length: Ty-c, LTR-c or in-between.

Also in this case, the large majority of events are con-

cerned with the mobilome. Even if this result is partly

derived from our classiﬁcation methods, it supports

the same conclusion anyway.

5 FUTURE DIRECTIONS

We proposed an approach aimed at a rapid and ef-

ﬁcient localization of the resident genome through

algorithm REGENDER. We shall generalize this ap-

proach to a multiple comparison extracting all the

chromosomal rearrangements on a dataset of 39

strains of the same specie S. cerevisiæ (Liti et al.,

2009). Our long term goal is to develop a model able

to describe the dynamics of the mobilome in these

strains.

ACKNOWLEDGEMENTS

We thank Emiliano Biscardi for performing tests

whose results are reported in Table 2.

REFERENCES

Cohen, J. D. (1997). Recursive hashing functions for n-

grams. ACM Trans. Inf. Syst., 15(3):291–320.

Conti, V., Aghaie, A., Cilli, M., and et al. (2006). crv4,

a mouse model for human ataxia associated with

kyphoscoliosis caused by an mrna splicing mutation

of the metabotropic glutamate receptor 1 (grm1). Int.

J. Molec. Med., 18:593–600.

Le Rouzic, A., Boutin, T. S., and Capy, P. (2007). Long

term evolution of transposable elements. PNAS,

104:19375–19380.

Lewin, B. (2007). Genes (IX ed.). Jones and Bartlett.

Liti, G., Carter, D. M., Moses, A. M., and et al. (2009). Pop-

ulation genomics of domestic and wild yeast. Nature,

458:337–341.

Ohlebusch, E. and Abouelhoda, M. (2006). Chaining al-

gorithms and applications in comparative genomics.

Handbook of Computational Molecular Biology.

SGD (2010). SGD project. Saccharomyces Genome

Database. http://www.yeastgenome.org.

Venner, S., Feschotte, C., and Biemont, C. (2009). Dynam-

ics of transposable elements: towards a community

ecology of the genome. Trends Genet., 25:317–323.

Vigna, S. (2006). fastutil: Fast and compact type-speciﬁc

collections for java. http://fastutil.dsi.unimi.it.

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

136