GENOME HALVING BY BLOCK INTERCHANGE

Antoine Thomas, Aïda Ouangraoua and Jean-Stéphane Varré

LIFL, UMR 8022 CNRS, Université Lille 1, INRIA Lille, Villeneuve d’Ascq, France

Keywords:

Genome duplication, Genome Halving, Block interchange.

Abstract:

We address the problem of ﬁnding the minimal number of block interchanges required to transform a dupli-

cated linear genome into a tandem duplicated linear genome. We provide a formula for the distance as well as

a polynomial time algorithm for the sorting problem.

1 INTRODUCTION

Genomic rearrangements are known to play a central

role in the evolutionary history of the species. Several

operations act on the genome, shaping the sequence

of genes. A number of rearrangement operations to

sort a genome into another, and evaluate the evolu-

tionary distance between genomes, have been studied:

reversals, transpositions, translocations, block inter-

changes, fusions, ﬁssions, and more recently Double-

Cut-and-Join (DCJ). In this paper, we focus on the

block interchange operation, that consists in exchang-

ing two intervals of a genome.

Block interchange scenarios were first studied by Christie (Christie, 1996). He proposed an $O(n^2)$ time algorithm for computing the minimum number of block interchanges for transforming

a linear chromosome with unique gene content into

another one. Lin et al. (Lin et al., 2005) proposed

later the best algorithm to date in O(γn) where γ is the

minimum number of block interchanges required for

the transformation. Yancopoulos et al. (Yancopoulos et al., 2005) introduced the DCJ operation, which consists in cutting the genome at two points and joining the four resulting extremities in a different way. Interestingly, they noticed that a block interchange can be

simulated by two consecutive DCJ operations.

Another very important feature in genome evolu-

tion is that genomes often undergo genome duplica-

tion events, both segmental and whole-genome dupli-

cations. For instance, a tandem-duplication event is

a segmental duplication that duplicates a genomic sequence and results in a segment made of two consecutive occurrences of the genomic sequence, called a

tandem-duplicated segment, in the genome. Genome

duplication events are followed by other rearrangement events, which result in a scrambled genome.

The Genome Halving problem introduced by El-

Mabrouk et al. (El-Mabrouk et al., 1998) consists in

ﬁnding the sequence of rearrangement events that al-

low one to go back from the scrambled genome to the

original duplicated one.

Genome Halving has been studied under several

models: reversals (El-Mabrouk et al., 1998), translo-

cation/reversals (El-Mabrouk and Sankoff, 2003),

breakpoints (Tannier et al., 2008). Most of the re-

sults led to polynomial time algorithms. Particu-

larly, the Genome Halving by DCJ was studied in

(Warren and Sankoff, 2008; Mixtacki, 2008), and in

(Mixtacki, 2008) some useful data structures were

presented leading to a linear time algorithm for the

Genome Halving by DCJ. Following these results

on the Genome Halving by DCJ, a natural problem

to consider is the Genome Halving by block inter-

change. In this paper, we study the Genome Halving

by block interchange on a duplicated genomic seg-

ment resulting from a tandem-duplication event, fol-

lowed by block interchange events that have scram-

bled the gene content of the segment. This dupli-

cated genomic segment is represented as a linear chro-

mosome with duplicated gene content w.l.o.g, and

we search for a parsimonious scenario of block interchange operations transforming the linear chromosome into a linear tandem-duplicated chromosome. We answer yes to the following question: does there exist a parsimonious sequence of block interchange operations such that replacing each block interchange by two consecutive DCJ operations yields a parsimonious sequence of DCJ operations? Based on an adequate data structure representing potential DCJ operations and their overlapping relations, we derive a quadratic time algorithm for Genome Halving by block

interchanges. Very recently, Kováč et al. (Kováč et al., 2010) addressed the problem of reincorporating the temporary circular chromosomes induced by DCJs immediately after their creation, in the context of Genome Halving. This problem is obviously related to the problem addressed in the present paper, but the aim and results are different. We are interested in linear genomes, not in multilinear ones, and we focus on pure block interchange scenarios that can be simulated by particular types of DCJ operations called excisions and integrations, whereas Kováč et al. focused on general DCJ scenarios simulating reversals, translocations, fusions and fissions along with excisions, integrations, and block interchanges.

Thomas A., Ouangraoua A. and Varré J.
GENOME HALVING BY BLOCK INTERCHANGE.
DOI: 10.5220/0003757200580065
In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2012), pages 58-65
ISBN: 978-989-8425-90-4
© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Section 2 gives deﬁnitions. In Section 3, we ﬁrst

give a lower bound on the distance with helpful prop-

erties for the rest of the paper. In Section 4, we prove

the analytical formula for the distance. We conclude

in Section 5 with a quadratic time and space algorithm

to obtain a parsimonious scenario.

2 PRELIMINARIES: DUPLICATED GENOMES, REARRANGEMENT, GENOME HALVING PROBLEMS

In this section we give the main deﬁnitions and nota-

tions used in the paper.

2.1 Duplicated Genomes

A genome is composed of genomic markers orga-

nized in linear or circular chromosomes. A linear

chromosome is represented by an ordered sequence

of unsigned integers, each standing for a marker, sur-

rounded by two abstract markers ◦ at each end in-

dicating the telomeres. A circular chromosome is

represented by a circularly ordered sequence of un-

signed integers representing markers. For example,

(1 2 3) (◦ 4 5 6 7 ◦) is a genome constituted

of one circular and one linear chromosome. Note

that all genomes are considered unsigned in this paper

w.l.o.g, because block interchange operations do not

modify the signs of markers.

Deﬁnition 2.1. A rearranged duplicated genome is a

genome in which each marker appears twice.

In a rearranged duplicated genome, the two copies of a same marker are called paralogs. We distinguish paralogs by denoting one marker by $x$ and its paralog by $\bar{x}$. By convention $\bar{\bar{x}} = x$. For example, the following genome is a rearranged duplicated genome:

$(\circ\; 1\; \bar{1}\; 3\; 2\; 4\; 5\; 6\; \bar{6}\; 7\; \bar{3}\; 8\; \bar{2}\; \bar{4}\; \bar{5}\; 9\; \bar{8}\; \bar{7}\; \bar{9}\; \circ)$.

An adjacency in a genome is a pair of consecutive

markers. For example, the genome (◦ 1 2 ◦) (3 4 5)

has six adjacencies, (◦ 1), (1 2), (2 ◦), and

(3 4), (4 5), (5 3). The linear or circular order

of the markers in a chromosome naturally induces an

order on the adjacencies that we denote by <. For

example in the previous genome the order induced

on the adjacencies is: (◦ 1) < (1 2) < (2 ◦), and

(3 4) < (4 5) < (5 3) < (3 4).

A double-adjacency in a genome G is an adjacency $(a\; b)$ such that $(\bar{a}\; \bar{b})$ is an adjacency of G as well. Note that a genome always has an even number of double-adjacencies. For example, the four double-adjacencies in the following genome are indicated by dots:

$G = (\circ\; 1\; \bar{1}\; 3\; 2 \cdot 4 \cdot 5\; 6\; \bar{6}\; 7\; \bar{3}\; 8\; \bar{2} \cdot \bar{4} \cdot \bar{5}\; 9\; \bar{8}\; \bar{7}\; \bar{9}\; \circ)$

A consecutive sequence of double-adjacencies can be rewritten as a single marker; this process is called reduction. For example, genome G can be reduced by rewriting $2 \cdot 4 \cdot 5$ and $\bar{2} \cdot \bar{4} \cdot \bar{5}$ as $10$ and $\overline{10}$, yielding the following genome:

$(\circ\; 1\; \bar{1}\; 3\; 10\; 6\; \bar{6}\; 7\; \bar{3}\; 8\; \overline{10}\; 9\; \bar{8}\; \bar{7}\; \bar{9}\; \circ)$
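The reduction process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a linear chromosome is encoded as a list of nonzero integers without telomeres, the paralog of marker x is encoded as -x (the sign is just a label here, not an orientation), and the function name `reduce_genome` is hypothetical. The sketch renumbers all markers from 1, so the labels differ from the example above, which keeps the old labels and introduces marker 10.

```python
def reduce_genome(genome):
    """Collapse maximal runs of double-adjacencies into single markers.

    `genome` is a non-empty list of nonzero ints (telomeres omitted);
    x and -x encode the two paralogs of a marker (assumed encoding).
    """
    adjacencies = set(zip(genome, genome[1:]))

    # Split the chromosome into maximal blocks glued by double-adjacencies.
    blocks, current = [], [genome[0]]
    for a, b in zip(genome, genome[1:]):
        if (-a, -b) in adjacencies:      # (a b) is a double-adjacency
            current.append(b)
        else:
            blocks.append(tuple(current))
            current = [b]
    blocks.append(tuple(current))

    # Rename blocks: a block and its paralogous copy become x and -x.
    names, reduced = {}, []
    for block in blocks:
        paralog = tuple(-m for m in block)
        if paralog in names:
            reduced.append(-names[paralog])
        else:
            names[block] = len(names) + 1
            reduced.append(names[block])
    return reduced


G = [1, -1, 3, 2, 4, 5, 6, -6, 7, -3, 8, -2, -4, -5, 9, -8, -7, -9]
print(reduce_genome(G))
```

On the genome G above, the two runs 2 · 4 · 5 and their paralogs collapse into one pair of markers, leaving 14 markers (7 distinct ones), in line with the reduced genome shown in the text.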

Definition 2.2. A tandem-duplicated genome is a rearranged duplicated genome composed of a single linear chromosome which can be reduced to a chromosome of the form $(\circ\; x\; \bar{x}\; \circ)$.

In other words, a tandem-duplicated genome is composed of a single linear chromosome where all adjacencies, except the two containing the marker ◦ and the central adjacency, are double-adjacencies. For example, the genome $(\circ\; 1 \cdot 2 \cdot 3 \cdot 4\;\; \bar{1} \cdot \bar{2} \cdot \bar{3} \cdot \bar{4}\; \circ)$ is a tandem-duplicated genome that can be reduced to $(\circ\; 5\; \bar{5}\; \circ)$ by rewriting $1 \cdot 2 \cdot 3 \cdot 4$ and $\bar{1} \cdot \bar{2} \cdot \bar{3} \cdot \bar{4}$ as $5$ and $\bar{5}$.

Deﬁnition 2.3. A perfectly duplicated genome is a

rearranged duplicated genome such that each adja-

cency is a double-adjacency.

For example, the genome $(1\; 2\; \bar{1}\; \bar{2})\ (\circ\; 3\; 4\; \circ)\ (\circ\; \bar{3}\; \bar{4}\; \circ)$ is a perfectly duplicated genome composed of one circular chromosome and two linear chromosomes.

In other words, a tandem-duplicated genome is the

representation of a duplicated segment resulting from

a tandem-duplication of a genomic sequence, and a

perfectly duplicated genome represents the result of a

whole-genome duplication event that has duplicated

all chromosomes.

2.2 Rearrangements

A rearrangement operation on a given genome cuts a

set of adjacencies of the genome called breakpoints

and forms new adjacencies with the exposed extremi-

ties, while altering no other adjacency. In the sequel,

the adjacencies cut by a rearrangement operation are indicated in the genome by the symbol |.

An interval in a genome is a set of markers that

appear consecutively in the genome. Given two dif-

ferent adjacencies (a b) and (c d) in a genome G

such that (a b) < (c d), [b ; c] denotes the interval of

G beginning with marker b and ending with marker c.

In this paper, we consider two types of rearrange-

ment operations called block interchange (BI) and

double-cut-and-join (DCJ).

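A block interchange (BI) on a genome G is a rearrangement operation that acts on four adjacencies in G, $(a\; b) < (c\; d) \le (u\; v) < (x\; y)$, such that the intervals $[b ; c]$ and $[v ; x]$ do not overlap, and that swaps the intervals $[b ; c]$ and $[v ; x]$. For example, the following block interchange acting on adjacencies $(\bar{1}\; 2) < (6\; \bar{6}) < (\bar{3}\; 8) < (\bar{8}\; \bar{7})$ consists in swapping the intervals $[2 ; 6]$ and $[8 ; \bar{8}]$:

$(\circ\; 1\; \bar{1} \mid 2\; 3\; \bar{2}\; 4\; 5\; 6 \mid \bar{6}\; 7\; \bar{3} \mid 8\; \bar{4}\; 9\; \bar{5}\; \bar{8} \mid \bar{7}\; \bar{9}\; \circ)$

↓

$(\circ\; 1\; \bar{1}\; 8\; \bar{4}\; 9\; \bar{5}\; \bar{8}\; \bar{6}\; 7\; \bar{3}\; 2\; 3\; \bar{2}\; 4\; 5\; 6\; \bar{7}\; \bar{9}\; \circ)$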
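A block interchange is easy to express on a list representation of a linear chromosome. The following Python sketch is an illustration only: the function name `block_interchange`, the 0-based half-open index convention, and the encoding of the paralog of marker x as -x are assumptions made for this example.

```python
def block_interchange(genome, i, j, k, l):
    """Swap the blocks genome[i:j] and genome[k:l], with i < j <= k < l.

    Indices are 0-based and half-open; the part genome[j:k] between the
    two blocks (possibly empty) stays in the middle.
    """
    if not (0 <= i < j <= k < l <= len(genome)):
        raise ValueError("blocks must be non-empty and must not overlap")
    return genome[:i] + genome[k:l] + genome[j:k] + genome[i:j] + genome[l:]


# The example of the text: swap [2 ; 6] (indices 2..7) and [8 ; 8-bar]
# (indices 11..15); -x encodes the paralog of x.
G = [1, -1, 2, 3, -2, 4, 5, 6, -6, 7, -3, 8, -4, 9, -5, -8, -7, -9]
print(block_interchange(G, 2, 8, 11, 16))
```

Allowing `j == k` covers the case where the two adjacencies in the middle coincide, that is, where the two swapped blocks are adjacent.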

A double-cut-and-join (DCJ) operation on a

genome G cuts two different adjacencies in G and

glues pairs of the four exposed extremities to form two

new adjacencies. Here, we focus on two types of DCJ

operations called excision and integration.

An excision is a DCJ operation acting on a single chromosome by extracting an interval from it, making this interval a circular chromosome, and making the remainder a single chromosome. For example, the following excision extracts the circular chromosome (2 3 4):

(◦ 1 | 2 3 4 | 5 6 ◦) → (2 3 4) (◦ 1 5 6 ◦)

An integration is the inverse of an excision; it is a DCJ operation that acts on two chromosomes, one being a circular chromosome, to produce a single chromosome. For example, the following operation is an integration of the circular chromosome (2 3 4):

(2 | 3 4) (◦ 1 5 6 | ◦) → (◦ 1 5 6 3 4 2 ◦)

We now give an obvious, but very useful, property

linking BI operations to DCJ operations.

Property 2.4. A single BI operation on a linear chro-

mosome is equivalent to two DCJ operations: an exci-

sion followed by an integration.

Proof. Let (◦ 1 U 2 V 3 ◦) be a genome, where U and V are the two intervals to be swapped by a block interchange operation, and 1, 2 and 3 are the intervals constituting the rest of the genome (note that each of them may be empty).

The first DCJ operation is the excision that produces the adjacency (1 V) by extracting and circularizing the interval [U ; 2]:

(◦ 1 | U 2 | V 3 ◦) → (◦ 1 V 3 ◦) (U 2)

The second DCJ operation is the integration that produces the adjacency (U 3) by reintegrating the circular chromosome (U 2) in the appropriate way:

(◦ 1 V | 3 ◦) (U | 2) → (◦ 1 V 2 U 3 ◦).
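The two steps of the proof can be mirrored in code. The sketch below is a minimal illustration, assuming chromosomes are Python lists, a circular chromosome is stored as a list read cyclically, and the paralog of x is encoded as -x; the helper names `excision`, `integration` and `block_interchange_via_dcj` are hypothetical.

```python
def excision(linear, start, end):
    """Extract linear[start:end] as a circular chromosome.

    Returns (remaining linear chromosome, circular chromosome).
    """
    return linear[:start] + linear[end:], linear[start:end]


def integration(linear, circular, cut, at):
    """Cut the circle just before circular[cut] and insert it at index `at`."""
    return linear[:at] + circular[cut:] + circular[:cut] + linear[at:]


def block_interchange_via_dcj(genome, i, j, k, l):
    """Swap U = genome[i:j] and V = genome[k:l] (i < j <= k < l) with two DCJs."""
    # Excision: extract U together with the middle part genome[j:k].
    linear, circular = excision(genome, i, k)
    # V now occupies linear[i : i + (l - k)]. Reintegrate right after V,
    # cutting the circle between U and the middle part.
    return integration(linear, circular, cut=j - i, at=i + (l - k))


G = [1, -1, 2, 3, -2, 4, 5, 6, -6, 7, -3, 8, -4, 9, -5, -8, -7, -9]
print(block_interchange_via_dcj(G, 2, 8, 11, 16))
```

On the block interchange example given earlier in this section, the excision followed by the integration yields the same chromosome as the direct swap of the two blocks.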

A rearrangement scenario between two genomes

A and B is a sequence of rearrangement operations al-

lowing one to transform A into B.

Deﬁnition 2.5. A BI (resp. DCJ) scenario is a rear-

rangement scenario composed of BI (resp. DCJ) oper-

ations.

The length of a rearrangement scenario is the num-

ber of rearrangement operations composing the sce-

nario.

Definition 2.6. The BI (resp. DCJ) distance between two genomes A and B, denoted by $d_{BI}(A,B)$ (resp. $d_{DCJ}(A,B)$), is the minimal length of a BI (resp. DCJ) scenario between A and B.

2.3 Genome Halving

We now state the genome halving problem considered

in this paper.

Deﬁnition 2.7. Given a rearranged duplicated

genome G composed of a single linear chromosome,

the BI halving problem consists in ﬁnding a tandem-

duplicated genome H such that the BI distance be-

tween G and H is minimal.

In order to solve the BI halving problem, we use

some results on the DCJ halving problem that were

stated in (Mixtacki, 2008) as a starting point. How-

ever, unlike the BI halving problem, the aim of the

DCJ halving problem is to ﬁnd a perfectly duplicated

genome instead of a tandem-duplicated genome.

Deﬁnition 2.8. Given a rearranged duplicated

genome G, the DCJ genome halving problem consists

in ﬁnding a perfectly duplicated genome H such that

the DCJ distance between G and H is minimal.

The BI and DCJ genome halving problems lead to two definitions of halving distances: the BI halving distance (resp. DCJ halving distance) of a rearranged duplicated genome G is the minimum BI (resp. DCJ) distance between G and any tandem-duplicated genome (resp. any perfectly duplicated genome); we denote it by $d_{BI}(G)$ (resp. $d_{DCJ}(G)$).

BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms

3 LOWERBOUND FOR THE BI HALVING DISTANCE

In this section we give a lowerbound on the BI halv-

ing distance of a rearranged duplicated genome. We

use a data structure representing the genome called the

natural graph introduced in (Mixtacki, 2008).

Definition 3.1. (Mixtacki, 2008) The natural graph of a rearranged duplicated genome G, denoted by NG(G), is the graph whose vertices are the adjacencies of G and such that, for any marker $u$, there is one edge between $(u\; v)$ and $(\bar{u}\; w)$, and one edge between $(x\; u)$ and $(y\; \bar{u})$.

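Note that the number of edges in the natural graph of a genome G containing n distinct markers, each one present in two copies, is always 2n. Moreover, since every vertex has degree one or two, the natural graph consists only of cycles and paths. For example, the natural graph of genome $G = (\circ\; 1\; 2\; \bar{1}\; 4\; 3\; \bar{4}\; \bar{3}\; \bar{2}\; \circ)$ is depicted in Fig. 1.

[Figure 1 components: the path $(\circ\; 1) - (2\; \bar{1}) - (\bar{2}\; \circ)$; the cycle $(1\; 2) - (\bar{1}\; 4) - (3\; \bar{4}) - (\bar{3}\; \bar{2})$; the cycle $(4\; 3) - (\bar{4}\; \bar{3})$.]

Figure 1: The natural graph of genome $G = (\circ\; 1\; 2\; \bar{1}\; 4\; 3\; \bar{4}\; \bar{3}\; \bar{2}\; \circ)$; it is composed of one path and two cycles.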
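The natural graph of a single linear chromosome can be built directly from marker positions. The Python sketch below is an illustration under assumptions that are not part of the paper: the paralog of marker x is encoded as -x, the telomere ◦ as 0, and the function names are hypothetical. Vertex v stands for the adjacency formed by positions v and v + 1 of the telomere-extended chromosome.

```python
def natural_graph(genome):
    """Return the neighbour lists of the natural graph (a multigraph).

    Vertex v is the adjacency (seq[v], seq[v + 1]) of the chromosome
    extended with telomeres (encoded as 0); x and -x are paralogs.
    """
    seq = [0] + list(genome) + [0]
    pos = {m: p for p, m in enumerate(seq) if m != 0}
    neighbours = []
    for v in range(len(seq) - 1):
        a, b = seq[v], seq[v + 1]
        out = []
        if a != 0:
            out.append(pos[-a])       # adjacency whose first marker is a-bar
        if b != 0:
            out.append(pos[-b] - 1)   # adjacency whose second marker is b-bar
        neighbours.append(out)
    return neighbours


def cycles_and_paths(genome):
    """Return (cycle lengths, path lengths), lengths counted in edges."""
    neighbours = natural_graph(genome)
    seen, cycles, paths = set(), [], []
    for start in range(len(neighbours)):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            component.append(v)
            for w in neighbours[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        edges = sum(len(neighbours[v]) for v in component) // 2
        if all(len(neighbours[v]) == 2 for v in component):
            cycles.append(edges)      # every vertex has degree 2
        else:
            paths.append(edges)       # contains a telomere adjacency
    return cycles, paths


FIG1 = [1, 2, -1, 4, 3, -4, -3, -2]
print(cycles_and_paths(FIG1))
```

For the genome of Fig. 1 this yields one 2-path, one 4-cycle and one 2-cycle, that is, 8 = 2n edges in total.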

Deﬁnition 3.2. Given an integer k, a k−cycle (resp.

k−path) in the natural graph of a rearranged dupli-

cated genome is a cycle (resp. path) that contains k

edges. If k is even, the cycle (resp. path) is called

even, and odd otherwise.

Based on the natural graph, a formula for the DCJ

halving distance was given in (Mixtacki, 2008). Given

a rearranged duplicated genome G such that the num-

ber of even cycles and the number of odd paths in

NG(G) are respectively denoted by EC and OP, the DCJ

halving distance of G is:

$d_{DCJ}(G) = n - EC - \left\lfloor \frac{OP}{2} \right\rfloor$
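As a quick illustration of how the formula is evaluated, consider the genome $G = (\circ\; 1\; 2\; \bar{1}\; 4\; 3\; \bar{4}\; \bar{3}\; \bar{2}\; \circ)$ of Fig. 1. It has n = 4 distinct markers, and its natural graph consists of one 2-path, one 4-cycle and one 2-cycle, so EC = 2 and OP = 0:

```latex
d_{DCJ}(G) \;=\; n - EC - \left\lfloor \frac{OP}{2} \right\rfloor
          \;=\; 4 - 2 - \left\lfloor \frac{0}{2} \right\rfloor
          \;=\; 2 .
```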





In the case of the BI halving distance, some pecu-

liar properties of the natural graph need to be stated,

allowing one to simplify the formula of the DCJ halv-

ing distance, and leading to a lowerbound on the BI

halving distance.

In the following properties, we assume that G is

a genome composed of a single linear chromosome

containing n distinct markers, each one present in two

copies in G.

Property 3.3. The natural graph NG(G) contains only

even cycles and paths:

1. All cycles in the natural graph NG(G) are even.

2. The natural graph NG(G) contains only one path,

and this path is even.

Proof. First, if $(a\; x)$ is a vertex of the graph that belongs to a cycle C, then there exists an edge between $(a\; x)$ and a vertex $(\bar{a}\; y)$. These two adjacencies are the only two containing a copy of the marker a in first position. So, if we consider the set of all the first markers of the adjacencies contained in the cycle C, then each marker in this set is present exactly twice. Therefore, the cycle C is an even cycle.

Secondly, the graph contains exactly two vertices

(adjacencies) containing the marker ◦ which are both

necessarily ends of a path in NG(G). Thus there can be

only one path in the graph. Since the number of edges

in the graph is even and all cycles are even, then the

single path is also even.

We now give a lowerbound on the minimum

length of DCJ scenario transforming G into a tandem-

duplicated genome.

Lemma 3.4. Let $d^{t}_{DCJ}(G)$ be the minimum DCJ distance between G and any tandem-duplicated genome. If NG(G) contains C cycles, then a lowerbound on $d^{t}_{DCJ}(G)$ is given by:

$d^{t}_{DCJ}(G) \ge n - C - 1$

Proof. First, since all cycles of NG(G) are even and NG(G) contains no odd path, then, from the DCJ halving distance formula, the DCJ halving distance of G is $d_{DCJ}(G) = n - C$.

Now, since any tandem-duplicated genome can be transformed into a perfectly duplicated genome with one DCJ, we have $d^{t}_{DCJ}(G) + 1 \ge d_{DCJ}(G)$. Therefore, we have $d^{t}_{DCJ}(G) \ge d_{DCJ}(G) - 1 \ge n - C - 1$.

We are now ready to state a lowerbound on the BI

halving distance of a rearranged duplicated genome G.

Theorem 3.5. If NG(G) contains C cycles, then a lowerbound on the BI halving distance is given by:

$d_{BI}(G) \ge \left\lfloor \frac{n - C}{2} \right\rfloor$



Proof. We denote by $\ell(S)$ the length of a rearrangement scenario S. Let $S_{BI}$ be a BI scenario transforming G into a tandem-duplicated genome. From Property 2.4, $S_{BI}$ is equivalent to a DCJ scenario $S_{DCJ}$ such that $\ell(S_{DCJ}) = 2\,\ell(S_{BI})$. Now, suppose that $\ell(S_{BI}) < \lfloor \frac{n-C}{2} \rfloor$; then $\ell(S_{BI}) \le \lfloor \frac{n-C}{2} \rfloor - 1 \le \lceil \frac{n-C-1}{2} \rceil - 1$.

This implies $\ell(S_{DCJ}) \le 2 \lceil \frac{n-C-1}{2} \rceil - 2 \le n - C - 2 < n - C - 1$. Thus, from Lemma 3.4, we have $\ell(S_{DCJ}) < d^{t}_{DCJ}(G)$, which contradicts the fact that $d^{t}_{DCJ}(G)$ is the minimal number of DCJ operations required to transform G into a tandem-duplicated genome.

In conclusion, we always have $d_{BI}(G) \ge \lfloor \frac{n-C}{2} \rfloor$.

4 FORMULA FOR THE BI HALVING DISTANCE

In this section, we show that the BI halving distance of a rearranged duplicated genome G with n distinct markers such that NG(G) contains C cycles is exactly:

$d_{BI}(G) = \left\lfloor \frac{n - C}{2} \right\rfloor$



In other words, we show that enforcing the con-

straint that successive couples of consecutive DCJ op-

erations have to be equivalent to BI operations does not

change the distance even though it obviously restricts

the DCJ that can be performed at each step of the sce-

nario.

In the following, G denotes a rearranged duplicated genome consisting of a single linear chromosome with n distinct markers after the reduction process, and

such that NG(G) contains C cycles. We begin by recall-

ing some useful deﬁnitions and properties of the DCJ

operations that allow one to decrease the DCJ halving

distance by 1 in the resulting genome.

Definition 4.1. A DCJ operation on G producing genome $G'$ is sorting if it decreases the DCJ halving distance by 1: $d_{DCJ}(G') = d_{DCJ}(G) - 1 = n - C - 1$.

Since the number of distinct markers in $G'$ is n and $d_{DCJ}(G') = n - C - 1$, NG($G'$) contains C + 1 cycles. In other words, a DCJ operation is sorting if it increases the number of cycles in NG(G) by 1.

Given $(u\; v)$ an adjacency of G that is not a double-adjacency, we denote by DCJ$(u\; v)$ the DCJ operation that cuts adjacencies $(\bar{u}\; x)$ and $(y\; \bar{v})$ to form adjacencies $(\bar{u}\; \bar{v})$ and $(y\; x)$, making $(u\; v)$ a double-adjacency.

Property 4.2. Let $(u\; v)$ be an adjacency of G that is not a double-adjacency. Then DCJ$(u\; v)$ is a sorting DCJ operation.

Proof. DCJ$(u\; v)$ increases the number of cycles in NG(G) by 1, by creating a new cycle composed of adjacencies $(u\; v)$ and $(\bar{u}\; \bar{v})$.

Figure 2: $I(G) = \{\, ]\bar{2} ; \bar{1}[ ,\ [2 ; \bar{1}] ,\ ]2 ; \bar{3}[ ,\ [1 ; \bar{3}] ,\ ]1 ; 3[ \,\}$, the set of intervals of $G = (\circ\; 2\; 1\; \bar{2}\; 3\; \bar{1}\; \bar{3}\; \circ)$ depicted as boxes, with $I(2\; 1) = ]\bar{2} ; \bar{1}[$, $I(1\; \bar{2}) = [2 ; \bar{1}]$, $I(\bar{2}\; 3) = ]2 ; \bar{3}[$, $I(3\; \bar{1}) = [1 ; \bar{3}]$ and $I(\bar{1}\; \bar{3}) = ]1 ; 3[$. The two boxes with thick lines represent two overlapping intervals of I(G) inducing a BI which exchanges $2$ and $\bar{3}$.

Definition 4.3. Let $(u\; v)$, $(\bar{u}\; x)$, and $(y\; \bar{v})$ be adjacencies of G. The interval of the adjacency $(u\; v)$, denoted by $I(u\; v)$, is either:

• the interval $[x ; y]$ if $(\bar{u}\; x) < (y\; \bar{v})$. In this case, we denote it by $]\bar{u} ; \bar{v}[$, or

• the interval $[\bar{v} ; \bar{u}]$ if $(y\; \bar{v}) < (\bar{u}\; x)$.

For example, the intervals of the adjacencies in genome $(\circ\; 2\; 1\; \bar{2}\; 3\; \bar{1}\; \bar{3}\; \circ)$ are depicted in Fig. 2. Note that, given an adjacency $(u\; v)$ of G, if $(u\; v)$ is a double-adjacency then the interval $I(u\; v)$ is empty; otherwise DCJ$(u\; v)$ is the excision operation that extracts the interval $I(u\; v)$ to make it circular, thus producing the adjacency $(\bar{u}\; \bar{v})$.

Two intervals I(a b) and I(x y) are said to be over-

lapping if their intersection is non-empty, and none

of the intervals is included in the other. It is easy to

see, following Property 2.4, that given two adjacencies

(a b) and (x y) of G such that I(a b) and I(x y)

are non-empty intervals, the successive application of

DCJ(a b) and DCJ(x y) is equivalent to a BI opera-

tion if and only if I(a b) and I(x y) are overlapping.

Note that in this case neither (a b), nor (x y) can be

double-adjacencies in G since their intervals are non-

empty. Figure 2 shows an example of two overlapping

intervals.

The following property states precisely in which

case the successive application of DCJ(a b) and

DCJ(x y) decreases the DCJ halving distance by 2,

meaning that both DCJ operations are sorting.

Property 4.4. Given two adjacencies $(a\; b)$ and $(x\; y)$ of G such that $I(a\; b)$ and $I(x\; y)$ are overlapping, the successive application of DCJ$(a\; b)$ and DCJ$(x\; y)$ decreases the DCJ halving distance by 2 if and only if $x \ne \bar{a}$ and $y \ne \bar{b}$.

Proof. If $x \ne \bar{a}$ and $y \ne \bar{b}$, then the successive application of DCJ$(a\; b)$ and DCJ$(x\; y)$ increases the number of cycles in NG(G) by 2, by creating two new 2-cycles. Otherwise, DCJ$(a\; b)$ first creates a new cycle that is then destroyed by DCJ$(x\; y)$.

We denote by I (G), the set of intervals of all the

adjacencies of G that do not contain marker ◦.

Remark 4.5. Note that, if G contains n distinct mark-

ers, then there are 2n − 1 adjacencies in G that do not

contain marker ◦, deﬁning 2n − 1 intervals in I (G).

Definition 4.6. Two intervals $I(a\; b)$ and $I(x\; y)$ of I(G) are said to be compatible if they are overlapping and $x \ne \bar{a}$ and $y \ne \bar{b}$.
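The interval of an adjacency and the compatibility test can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the paralog of x is encoded as -x, intervals are returned as half-open ranges of positions in the telomere-extended chromosome, and the function names are hypothetical. It enumerates all pairs naively, which is quadratic and therefore not the linear-time search discussed in Section 5.

```python
from itertools import combinations


def intervals(genome):
    """Map each adjacency (u, v) without telomere to I(u v).

    I(u v) is the set of markers lying strictly between the two cuts of
    DCJ(u v); it is returned as a half-open range (start, end) of
    positions in [telomere] + genome + [telomere].
    """
    seq = [0] + list(genome) + [0]
    pos = {m: p for p, m in enumerate(seq) if m != 0}
    result = {}
    for u, v in zip(genome, genome[1:]):
        p = pos[-u]          # index of the adjacency (u-bar x)
        q = pos[-v] - 1      # index of the adjacency (y v-bar)
        result[(u, v)] = (min(p, q) + 1, max(p, q) + 1)
    return result


def overlapping(i, j):
    """Non-empty intersection, and neither interval contains the other."""
    (s1, e1), (s2, e2) = sorted([i, j])
    return s1 < s2 < e1 < e2


def compatible_pairs(genome):
    """All pairs of adjacencies whose intervals are compatible (Def. 4.6)."""
    ivs = intervals(genome)
    return [((a, b), (x, y))
            for ((a, b), i), ((x, y), j) in combinations(ivs.items(), 2)
            if overlapping(i, j) and x != -a and y != -b]


FIG2 = [2, 1, -2, 3, -1, -3]
print(intervals(FIG2))
print(compatible_pairs(FIG2))
```

On the genome of Fig. 2, the only compatible pair found is the pair of intervals [2 ; 1̄] and [1 ; 3̄], which are the two intervals inducing the block interchange that exchanges 2 and 3̄.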

In the following, we prove the BI halving distance formula by showing that if genome G contains more than three distinct markers, n > 3, then there exist two compatible intervals in I(G), and if n = 2 or n = 3 then $d_{BI}(G) = 1$ and $2 \le d_{DCJ}(G) \le 3$. This means

that there exists a BI halving scenario S such that all

BI operations in S, possibly excluding the last one, are

equivalent to two successive sorting DCJ operations.

From now on, until the end of the section, (a b)

is an adjacency of G that is not a double-adjacency,

A is a genome consisting in a linear chromosome L

and a circular chromosome C, obtained by applying the

sorting DCJ, DCJ(a b), on G.

If there exists an interval $I(x\; y)$ in I(G) compatible with $I(a\; b)$, then applying DCJ$(x\; y)$ on A consists in the integration of the circular chromosome C into the linear chromosome L such that the adjacency $(\bar{x}\; \bar{y})$ is formed. Such an integration can only be performed by cutting an adjacency $(\bar{x}\; u)$ in C and an adjacency $(v\; \bar{y})$ in L (or inversely) to produce adjacencies $(\bar{x}\; \bar{y})$ and $(v\; u)$. This means that there must be an adjacency $(x\; y)$ in either C or L such that $\bar{x}$ is in C and $\bar{y}$ in L, or inversely. Hence, we have the following property:

Property 4.7. C cannot be reintegrated into L by applying a sorting DCJ, DCJ$(x\; y)$, on A if and only if either:

(1) for any adjacency $(x\; y)$ in C (resp. L), markers $\bar{x}$ and $\bar{y}$ are in L (resp. C), or

(2) for any adjacency $(x\; y)$ in C (resp. L), markers $\bar{x}$ and $\bar{y}$ are also in C (resp. L).

Proof. If there exists no adjacency $(x\; y)$ in A such that $\bar{x}$ is in C and $\bar{y}$ in L or inversely, then A necessarily satisfies either (1) or (2).

Definition 4.8. An interval $I(a\; b)$ in I(G) is called an interval of type 1 (resp. interval of type 2) if DCJ$(a\; b)$ produces a genome A satisfying configuration (1) (resp. configuration (2)) described in Property 4.7.

For example, in genome $(\circ\; 2\; 1\; \bar{1}\; 3\; \bar{2}\; \bar{3}\; \circ)$, $I(\bar{1}\; 3)$ is of type 1 as DCJ$(\bar{1}\; 3)$ produces genome $(\circ\; 2\; 1\; \bar{3}\; \circ)\ (\bar{1}\; 3\; \bar{2})$; $I(\bar{2}\; \bar{3})$ is of type 2 as DCJ$(\bar{2}\; \bar{3})$ produces genome $(\circ\; 2\; 3\; \bar{2}\; \bar{3}\; \circ)\ (1\; \bar{1})$.

Now we give the maximum numbers of intervals of

type 1 and type 2 that can be contained in genome G.

Lemma 4.9. The maximum number of intervals of type

1 in I (G) is 2.

Proof. First, note that there cannot be two intervals I and J of I(G) such that $I \ne J$ and both I and J are of type 1. Now, if I is an interval of type 1, there can be at most two different adjacencies $(x\; y)$ and $(u\; v)$ such that $I(x\; y) = I(u\; v) = I$. In this case G necessarily has a chromosome of the form $(\ldots\ \bar{x}\; \bar{v}\ \ldots\ \bar{u}\; \bar{y}\ \ldots)$ or $(\ldots\ \bar{u}\; \bar{y}\ \ldots\ \bar{x}\; \bar{v}\ \ldots)$. Therefore, there are at most two intervals of type 1 in I(G).

Lemma 4.10. The maximum number of intervals of

type 2 in I (G) is n.

Proof. First, note that for two adjacencies $(x\; y)$ and $(\bar{x}\; z)$ in G that do not contain marker ◦, if $I(x\; y)$ is of type 2 then $I(\bar{x}\; z)$ cannot be of type 2. Now, there is only one marker u such that $(u\; \circ)$ is an adjacency of G. Let $(\bar{u}\; v)$ be the adjacency of G having $\bar{u}$ as first marker; then at most half of the intervals in I(G) − {$I(\bar{u}\; v)$} can be of type 2. Therefore, there are at most n intervals of type 2 in I(G).

Theorem 4.11. If NG(G) contains C cycles, then the BI halving distance of G is given by:

$d_{BI}(G) = \left\lfloor \frac{n - C}{2} \right\rfloor$



Proof. Since there are 2n − 1 intervals in I (G), and at

most n+2 are of type 1 or 2, then if G is a genome con-

taining more than three distinct markers n > 3, then

2n − 1 > n + 2 and there exist two compatible inter-

vals in I (G) inducing a BI operation that decreases the

DCJ distance by 2.

Next, we show that if n = 2 or n = 3, then $d_{BI}(G) = 1$ and $2 \le d_{DCJ}(G) \le 3$.

If n = 2, then the genome can be written either as $(\circ\; a\; b\; \bar{b}\; \bar{a}\; \circ)$, in which case a BI can swap a and b to produce a tandem-duplicated genome, or as $(\circ\; a\; \bar{a}\; b\; \bar{b}\; \circ)$, in which case a BI can swap $\bar{a}$ and b to produce a tandem-duplicated genome.

If n = 3, then the genome has two double-adjacencies to be constructed, of the form $(\bar{a}\; \bar{b})$, $(\bar{x}\; \bar{y})$, with $(a\; b)$ and $(x\; y)$ being two adjacencies already present in the genome such that $b = x$ or $b = \bar{x}$, and a and y are distinct markers. One can rewrite $(a\; b)$ and $(x\; y)$ as single markers since they will not be split, which makes a genome with 4 markers such that at most 2 are misplaced. Then, a single BI can produce a tandem-duplicated genome.

Now, it is easy to see that if n = 2 or n = 3, then $d_{DCJ}(G) = n - C \le 3$. Finally, if n = 2 or n = 3, then $d_{DCJ}(G) \ge 2$; otherwise we would have $d_{DCJ}(G) = 1$, which would imply, as G consists of a single linear chromosome, $d_{BI}(G) = 0$. In conclusion, if n > 3 then there exist two compatible intervals in I(G); otherwise, if n = 2 or n = 3, then $d_{BI}(G) = 1$ and $2 \le d_{DCJ}(G) \le 3$.

Therefore $d_{BI}(G) = \left\lfloor \frac{d_{DCJ}(G)}{2} \right\rfloor = \left\lfloor \frac{n - C}{2} \right\rfloor$.
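As a small worked illustration of the formula (not part of the original proof), take the genome $G = (\circ\; 1\; 2\; \bar{1}\; 4\; 3\; \bar{4}\; \bar{3}\; \bar{2}\; \circ)$ of Fig. 1, with n = 4 and C = 2 cycles in NG(G). The formula gives a distance of 1, and a single block interchange swapping the adjacent intervals $[2 ; \bar{1}]$ and $[4 ; 3]$ does produce a tandem-duplicated genome:

```latex
d_{BI}(G) = \left\lfloor \frac{n - C}{2} \right\rfloor
          = \left\lfloor \frac{4 - 2}{2} \right\rfloor = 1,
\qquad
(\circ\; 1 \mid 2\; \bar{1} \mid 4\; 3 \mid \bar{4}\; \bar{3}\; \bar{2}\; \circ)
\;\longrightarrow\;
(\circ\; 1\; 4\; 3\; 2\;\; \bar{1}\; \bar{4}\; \bar{3}\; \bar{2}\; \circ).
```

Similarly, for the genome $(\circ\; 2\; 1\; \bar{2}\; 3\; \bar{1}\; \bar{3}\; \circ)$ of Fig. 2, the natural graph is a single path (n = 3, C = 0), so the formula gives $\lfloor 3/2 \rfloor = 1$, which matches the block interchange exchanging $2$ and $\bar{3}$.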

5 SORTING ALGORITHM

In Section 4, we showed that if a genome G contains

more than three distinct markers after reduction then

there exist two compatible intervals in I (G) inducing a

BI to perform. If G contains two or three distinct mark-

ers then the BI to perform can be trivially computed.

Thus the main concern of this section is to describe

an efﬁcient algorithm for ﬁnding compatible intervals

when n > 3.

As in Section 4, in the following, G denotes a

genome consisting of n distinct markers after reduc-

tion. It is easy to show that the set of intervals I (G)

can be built in O(n) time and space complexity.

We now show that ﬁnding 2 compatible intervals in

I (G) can be done in O(n) time and space complexity.

Property 5.1. If n > 3 , then all the smallest inter-

vals in I (G) that are not of type 2 admit compatible

intervals.

Proof. Let J be a smallest interval that is not of type 2

in I (G). As J is not of type 2, then J has compatible

intervals if J is not of type 1.

Let us suppose that J is of type 1. Then, for any adjacency $(a\; b)$ such that markers a and b are not in J, $\bar{a}$ and $\bar{b}$ are in J, so $I(a\; b)$ is strictly included in J and $I(a\; b)$ cannot be of type 2. Such an adjacency does exist as there are n > 3 markers not included in J. Therefore J cannot be a smallest interval that is not of type 2.

We are now ready to give the algorithm for sorting a duplicated genome G into a tandem-duplicated genome with $\lfloor \frac{n-C}{2} \rfloor$ BI operations.

Theorem 5.2. Algorithm 1 reconstructs a tandem-duplicated genome with a BI scenario of length $\lfloor \frac{n-C}{2} \rfloor$ in $O(n^2)$ time and space complexity.

Proof. Building I(G) and finding two compatible intervals can be done in O(n) time and space complexity. It follows that the while loop in the algorithm can be computed in $O(n^2)$ time and space complexity. Finding and performing the last BI operation when 2 ≤ n ≤ 3 can be done in constant time and space complexity. Moreover, all BI operations, possibly excluding the last one, are computed as pairs of sorting DCJ operations, which ensures that the length of the scenario is $\lfloor \frac{n-C}{2} \rfloor$.

Algorithm 1: Reconstruction of a tandem-duplicated genome.

while G contains more than 3 markers do
    Construct I(G)
    Pick a smallest interval I(a b) that is not of type 2 in I(G)
    Find an interval I(x y) in I(G) compatible with I(a b)
    Perform the BI equivalent to DCJ(a b) followed by DCJ(x y)
    Reduce G
end while
if G contains 2 or 3 markers then
    Find the last BI operation and perform it
end if
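For very small genomes, the distance formula can be cross-checked against an exhaustive search. The Python sketch below is not Algorithm 1: it is a brute-force breadth-first search over all block interchanges, exponential in general and only meant as a sanity check. The encoding of the paralog of x as -x, the telomere as 0, and all function names are assumptions made for this example.

```python
from collections import deque


def count_cycles(genome):
    """Number of cycles of the natural graph of a single linear chromosome."""
    seq = [0] + list(genome) + [0]
    pos = {m: p for p, m in enumerate(seq) if m != 0}
    last = len(seq) - 2                  # index of the adjacency (. telomere)
    seen, cycles = set(), 0
    for start in range(last + 1):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            component.append(v)
            a, b = seq[v], seq[v + 1]
            nbrs = ([pos[-a]] if a else []) + ([pos[-b] - 1] if b else [])
            for w in nbrs:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        # The unique path is the component holding the telomere adjacencies.
        if 0 not in component and last not in component:
            cycles += 1
    return cycles


def is_tandem(g):
    """True if g reads m1 ... mn followed by the paralogs of m1 ... mn."""
    n = len(g) // 2
    return all(g[i + n] == -g[i] for i in range(n))


def block_interchanges(g):
    """All chromosomes obtained from the tuple g by one block interchange."""
    m = len(g)
    for i in range(m):
        for j in range(i + 1, m):
            for k in range(j, m):
                for l in range(k + 1, m + 1):
                    yield g[:i] + g[k:l] + g[j:k] + g[i:j] + g[l:]


def bi_halving_distance_bruteforce(genome):
    """Exact BI halving distance by breadth-first search (tiny inputs only)."""
    start = tuple(genome)
    if is_tandem(start):
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        g, d = queue.popleft()
        for h in block_interchanges(g):
            if h in seen:
                continue
            if is_tandem(h):
                return d + 1
            seen.add(h)
            queue.append((h, d + 1))


for genome in ([1, 2, -1, 4, 3, -4, -3, -2], [2, 1, -2, 3, -1, -3]):
    n = len(genome) // 2
    print(bi_halving_distance_bruteforce(genome), (n - count_cycles(genome)) // 2)
```

On the genomes of Fig. 1 and Fig. 2, the exhaustive search and the formula $\lfloor (n - C)/2 \rfloor$ agree.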

6 CONCLUSIONS

In this paper, we introduced the BI halving problem.

We used the DCJ model to simulate BI operations

and we showed that it is always possible to choose

two consecutive sorting DCJ operations such that they

are equivalent to a BI operation. This is an interest-

ing result as it shows that restricting the scope of al-

lowed DCJ operations under the constraint of perform-

ing only BI doesn’t affect our halving distance. We

thus provided a quadratic time and space algorithm to

obtain a most parsimonious scenario for the BI halving

problem. One direction for further studies of variants

of the BI halving problem is to consider multichro-

mosomal genomes. A further extension of these re-

sults will be a generalization to the guided BI halving

problem that consists in ﬁnding a tandem duplicated

genome that minimizes the BI distance to a given duplicated genome and a given non-duplicated genome.

REFERENCES

Christie, D. A. (1996). Sorting permutations by block-

interchanges. Inf. Process. Lett., 60(4):165–169.

El-Mabrouk, N., Nadeau, J. H., and Sankoff, D. (1998).

Genome halving. In Farach-Colton, M., editor, Pro-

ceedings of CPM’98, volume 1448 of Lecture Notes

in Computer Science, pages 235–250. Springer.

El-Mabrouk, N. and Sankoff, D. (2003). The reconstruction

of doubled genomes. SIAM J. Comput., 32(3):754–

792.

Kováč, J., Braga, M. D. V., and Stoye, J. (2010). The problem of chromosome reincorporation in DCJ sorting

and halving. In Tannier, E., editor, RECOMB-CG,

volume 6398 of Lecture Notes in Computer Science,

pages 13–24. Springer.

Lin, Y. C., Lu, C. L., Chang, H.-Y., and Tang, C. Y.

(2005). An efﬁcient algorithm for sorting by block-

interchanges and its application to the evolution of

vibrio species. Journal of Computational Biology,

12(1):102–112.

Mixtacki, J. (2008). Genome halving under DCJ revis-

ited. In Hu, X. and Wang, J., editors, Proceedings of

COCOON’08, volume 5092 of Lecture Notes in Com-

puter Science, pages 276–286. Springer.

Tannier, E., Zheng, C., and Sankoff, D. (2008). Multichro-

mosomal genome median and halving problems. In

Crandall, K. A. and Lagergren, J., editors, Proceed-

ings of WABI’08, volume 5251 of Lecture Notes in

Computer Science, pages 1–13. Springer.

Warren, R. and Sankoff, D. (2008). Genome halving with

double cut and join. In Brazma, A., Miyano, S.,

and Akutsu, T., editors, Proceedings of APBC’08,

volume 6 of Advances in Bioinformatics and Com-

putational Biology, pages 231–240. Imperial College

Press.

Yancopoulos, S., Attie, O., and Friedberg, R. (2005). Ef-

ﬁcient sorting of genomic permutations by transloca-

tion, inversion and block interchange. Bioinformatics,

21(16):3340–3346.
