Generating Multiple Alignments of Genomes of the Same Species
Jannik Olbrich (https://orcid.org/0000-0003-3291-7342), Thomas Büchler (https://orcid.org/0000-0002-9273-5439) and Enno Ohlebusch (https://orcid.org/0009-0008-3937-3652)
Institute of Theoretical Computer Science, Ulm University, Germany
Keywords:
Multiple Genome Alignment, Pangenome, Prefix-Free Parsing, Maximal Unique Matches, Suffix Array.
Abstract:
In this paper, we tackle the problem of generating a multiple alignment of assembled genomes of individuals
of the same species. Of course, a (colinear) multiple alignment cannot capture structural variants such as
inversions or transpositions, but if these are relatively rare (as, for instance, in human or mouse genomes),
it makes sense to generate such a multiple alignment. In the following, it is assumed that each assembled
genome is composed of contigs. We will show that the combination of a well-known anchor-based method
with the technique of prefix-free parsing yields an approach that is able to generate multiple alignments on a
pangenomic scale, provided that large structural variants are rare. Furthermore, experiments with real world
data show that our software tool PANAMA (PANgenomic Anchor-based Multiple Alignment) significantly
outperforms current state-of-the-art programs.
1 INTRODUCTION
A single linear reference is commonly used to repre-
sent the human genome in genomic studies and diag-
nostics. However, there are a lot of differences be-
tween the genomes of individuals of the same species
and a single reference is unable to cover them. (The
1000 Genomes Project Consortium, 2015) produced a
database that includes the genomes of 2504 different
humans and the differences between them. Includ-
ing common variations in the reference gives a more
accurate representation of the genomes of a species.
We call such a representation the pangenome of the
population. It should be pointed out that throughout
this paper we use the term “pangenome” in a narrower
sense. In a broader sense, the pangenome defines
the entire genomic repertoire of a given phylogenetic
clade, which may range from species to phylum and
beyond. Note that (Tettelin et al., 2005) coined the
term pangenome two decades ago; they evaluated the
composition of six strains of Streptococcus agalac-
tiae.
A pangenome is often constructed from a refer-
ence sequence and a VCF-file containing the varia-
tions. Since novel long read sequencing technolo-
gies allow for de novo assembly of many individu-
als of a species or population (Porubsky et al., 2021),
high-quality assemblies are becoming widely avail-
able. In this paper, we tackle the problem of gen-
erating a multiple alignment of assembled genomes
of many individuals of the same species. Of course,
a (colinear) multiple alignment cannot capture struc-
tural variants such as inversions or transpositions, but
in many species these are relatively rare. (In future
work, we intend to combine our method with tech-
niques that detect such structural variants.) Up to now,
the computation of such a chromosome-scale multi-
ple alignment was not possible because no multiple
aligner was able to deal with such a volume of data
(cf. Section EXPERIMENTAL RESULTS).
(Höhl et al., 2002) presented MGA (Multiple
Genome Aligner), the first software-tool that was able
to compute a multiple alignment of closely related
genomes. However, MGA was limited to viruses and
strains of bacteria (Chain et al., 2003). MGA and
many other software-tools for aligning multiple ge-
nomic DNA sequences use an anchor-based method
that is composed of three phases:
1. computation of fragments (segments in the
genomes that are identical or very similar),
2. computation of a highest-scoring global chain of
colinear non-overlapping fragments: these are the
anchors that form the backbone of the alignment,
3. alignment of the regions between the anchors (ei-
ther by applying the same method recursively or
by applying a different multiple sequence align-
ment program).
A recent tool that elaborates on this anchor-based
method is FMAlign2 (Zhang et al., 2024). How-
ever, FMAlign2 is still limited to datasets of a few
hundred million bases (cf. Section EXPERIMENTAL
RESULTS).
In this paper, we will show that it is possible
to generate a multiple alignment of a set of assem-
bled genomes (of the same species), where an assem-
bled genome is a set of contigs. Our new method is
depicted in Figure 1. We combine MGA's anchor-
based method with the technique of prefix-free pars-
ing (PFP), which was introduced by (Boucher et al.,
2019). This technique parses a DNA sequence (a
chromosome composed of contigs or a complete chro-
mosome) S into phrases, and two phrases have the
same identifier (meta-symbol) if and only if they are
identical on the base-level. Thus, the parse P is the
sequence of identifiers that gives S if each identifier
is replaced with its phrase. The main idea is to first
compute anchors on the parse P instead of computing
anchors on S, see Figure 1. Given m DNA sequences
from the same chromosome of different individuals
(e.g. chromosome 19 of 1000 different humans) as in-
put, our method uses the following phases, which are
explained in subsequent sections:
1. Compute the parses of the m sequences
2. On the parses, compute the backbone of the over-
all multiple alignment as follows:
(a) Compute the generalized suffix array
(b) Find multiMUMs (the fragments)
(c) Compute anchors by chaining multiMUMs
(d) Extend anchors on the base-level
3. Generate an alignment of the gaps¹
(a) While there remains a large gap, apply
Phases 2a-2c to the gap and add the frag-
ments in the resulting chain as anchors (here,
multiMUMs may be partial)
(b) Generate an alignment of the remaining (small)
gaps using FMAlign2 and MAFFT (Nakamura
et al., 2018)
In this way, we are able to generate multiple
alignments on a pangenomic scale. It should be
pointed out that our new method can only be success-
ful if the chromosomes of individuals from the same
species are very similar DNA sequences; in particular
structural chromosomal rearrangements must be rare.
Therefore, our software tool PANAMA is a special
purpose multiple sequence alignment program. On
¹ For brevity, we will use the term ‘gap’ instead of ‘region between anchors’ from now on (it should not be confused with a gap in an alignment).
Figure 1: First, the parses of the input sequences will be calculated. Then anchors are determined within the parses. Input sequences: S1 = atcaagtcgtat, S2 = caagtcgtatcgt, S3 = atcaagtcgtgcgt. Parses: P1 = ABCA, P2 = BCAC, P3 = ABCD. Dictionary: A = at, B = caagt, C = cgt, D = gcgt. The PFP multiMUM BC, which occurs exactly once in each parse, is the anchor.
the other hand, there is an urgent need for such a pro-
gram (Liao et al., 2023). Experiments with real world
data show that our program PANAMA outperforms
current state-of-the-art programs.
2 PREFIX-FREE PARSING
Prefix-free parsing is a technique invented by
(Boucher et al., 2019). In the simple version used
here, it uses a rolling hash (a hash function where
the input is hashed in a window that moves through
the input) to divide a string S into substrings, which
form the dictionary D. The name prefix-free parsing
is justified by the property that no suffix of a string
from D is a proper prefix of a suffix of any other
string from D (but this property does not play a role
in our context). In the following more detailed ex-
planation, we assume the reader to be familiar with
the Karp-Rabin-Algorithm, see e.g. (Cormen et al.,
1990). This algorithm uses a sliding window of fixed
size w and a hash function KR. For every position i in S, it computes the hash value KR(W_i) of the substring W_i = S[i..i+w−1]. Since the hash value of W_{i+1} can be computed in constant time from the previous hash value, the parsing algorithm takes only linear time. Given a fixed number p (called modulus²), the string W_i is called a trigger string if and only if KR(W_i) mod p = 0. In a left-to-right scan of S, the
parsing algorithm breaks S into substrings so that each
substring ends with a trigger string (and contains no
other trigger string). This gives the dictionary D and
the parse P. In P, phrases (elements of D) are rep-
resented by their lexicographic rank. More precisely,
the phrases are ordered lexicographically and every
phrase is identified with its rank in the sorted dictio-
nary (i.e., the ranks serve as meta-symbols). Conse-
quently, the parse P is the sequence of numbers that
gives the string S if each number is replaced with its
phrase from D. In the example of Figure 1, the window size is w = 1, t is used as trigger string, and upper case letters are used as meta-symbols.

² In the Karp-Rabin-Algorithm, the modulus is a prime number, but this is not required in our context.
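To make the parsing step concrete, the following minimal C++ sketch implements the simplified parsing described above: a Karp-Rabin-style rolling hash over a window of size w marks trigger strings (hash value 0 modulo p), S is broken after every trigger string, and each phrase receives its lexicographic rank as meta-symbol. This is an illustration only and all names are ours; the full prefix-free parsing of (Boucher et al., 2019) additionally pads the input with sentinels and lets consecutive phrases overlap by w characters.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct PfpResult {
    std::vector<std::string> phrases;  // phrases in text order (the dictionary entries)
    std::vector<int> parse;            // lexicographic rank of each phrase (meta-symbol)
};

PfpResult simpleParse(const std::string& S, std::size_t w, std::uint64_t p) {
    const std::uint64_t base = 256;
    std::uint64_t hash = 0, basePow = 1;                     // basePow = base^(w-1) modulo 2^64
    for (std::size_t i = 0; i + 1 < w; ++i) basePow *= base;

    PfpResult res;
    std::size_t start = 0;                                   // start of the current phrase
    for (std::size_t i = 0; i < S.size(); ++i) {
        if (i >= w) hash -= basePow * (unsigned char)S[i - w]; // drop the character leaving the window
        hash = hash * base + (unsigned char)S[i];              // add the new character
        if (i + 1 >= w && hash % p == 0) {                     // window ending at i is a trigger string
            res.phrases.push_back(S.substr(start, i - start + 1));
            start = i + 1;
        }
    }
    if (start < S.size())                                     // trailing phrase without a trigger string
        res.phrases.push_back(S.substr(start));

    std::map<std::string, int> rank;                          // assign lexicographic ranks as meta-symbols
    for (const auto& ph : res.phrases) rank[ph] = 0;
    int r = 0;
    for (auto& kv : rank) kv.second = ++r;
    for (const auto& ph : res.phrases) res.parse.push_back(rank[ph]);
    return res;
}

With w = 1 and a hash that selects exactly the windows equal to t, this reproduces the parses of Figure 1 (A = at, B = caagt, C = cgt, D = gcgt).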
3 PRELIMINARIES (FOR
COMPUTING multiMUMs)
Let S be a string of length n on an ordered alphabet Σ. For 1 ≤ i ≤ n, S[i] denotes the character at position i in S. For i ≤ j, S[i..j] denotes the substring of S starting with the character at position i and ending with the character at position j. Furthermore, S_i denotes the i-th suffix S[i..n] of S.
The suffix array SA of the string S is an array of integers in the range 1 to n specifying the lexicographic ordering of the n suffixes of S, that is, it satisfies S_SA[1] < S_SA[2] < ··· < S_SA[n]. The suffix array
can be built in linear time; we refer to the overview
article of (Puglisi et al., 2007) for suffix array con-
struction algorithms and to (Olbrich et al., 2024) for
newer developments.
Let S be a string of length n having the sentinel
character $ at the end (and nowhere else). We assume
that $ is smaller than any other character. The Bur-
rows and Wheeler transform (Burrows and Wheeler,
1994) converts S into the string BWT[1..n] defined
by BWT[i] = S[SA[i]−1] for all i with SA[i] ≠ 1 and
BWT[i] = $ otherwise.
The suffix array SA is often enhanced with the
so-called LCP-array containing the lengths of longest
common prefixes between consecutive suffixes in SA.
Formally, the LCP-array is an array so that LCP[1] = −1 = LCP[n+1] and LCP[i] = |lcp(S_SA[i−1], S_SA[i])| for 2 ≤ i ≤ n, where lcp(u,v) denotes the longest common prefix between two strings u and v. Like the suffix array, the LCP-array can be computed in linear time (Kasai et al., 2001). (Abouelhoda et al., 2004) introduced the concept of lcp-intervals. An interval [i..j], where 1 ≤ i < j ≤ n, in the LCP-array is called an lcp-interval of lcp-value ℓ (denoted by ℓ-[i..j]) if
LCP[i] < ℓ,
LCP[k] ≥ ℓ for all k with i+1 ≤ k ≤ j,
LCP[k] = ℓ for at least one k with i+1 ≤ k ≤ j,
LCP[j+1] < ℓ.
(Abouelhoda et al., 2004) showed that there is a one-
to-one correspondence between the set of all lcp-
intervals and the set of all internal nodes of the suf-
fix tree of S (we assume a basic knowledge of suf-
fix trees). Consequently, there are at most n−1 lcp-
intervals for a string of length n.
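As an illustration of the linear-time LCP-array computation cited above (Kasai et al., 2001), the following sketch uses the paper's 1-based conventions (a string terminated by a unique smallest sentinel, SA[1..n] given, LCP[1] = LCP[n+1] = −1). Function and variable names are our own.

#include <string>
#include <vector>

// S and SA are 1-based (index 0 is unused); returns LCP[1..n+1].
std::vector<int> kasaiLCP(const std::string& S, const std::vector<int>& SA) {
    int n = (int)SA.size() - 1;
    std::vector<int> rank(n + 1), LCP(n + 2, 0);
    for (int i = 1; i <= n; ++i) rank[SA[i]] = i;      // inverse suffix array
    int h = 0;                                         // current lcp value
    for (int p = 1; p <= n; ++p) {                     // process suffixes in text order
        if (rank[p] > 1) {
            int q = SA[rank[p] - 1];                   // lexicographic predecessor of S_p
            while (p + h <= n && q + h <= n && S[p + h] == S[q + h]) ++h;
            LCP[rank[p]] = h;
            if (h > 0) --h;                            // the lcp can shrink by at most one
        } else {
            h = 0;
        }
    }
    LCP[1] = -1;
    LCP[n + 1] = -1;
    return LCP;
}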
Let S^1, S^2, ..., S^m be strings of sizes n_1, n_2, ..., n_m, respectively. We are interested in the lexicographic order of all suffixes
S^1_1, ..., S^1_{n_1}, S^2_1, ..., S^2_{n_2}, ..., S^m_1, ..., S^m_{n_m}
of these strings. Note that two suffixes S^j_p and S^k_q with j ≠ k may coincide, i.e., S^j_p = S^k_q is possible. (In this case, it is natural to demand that the suffix with the smaller superscript shall appear before the suffix with the larger superscript.) Because the strings may share identical suffixes, we use m pairwise distinct characters #_1, #_2, ..., #_m to tell the suffixes apart. To be precise, for each j with 1 ≤ j ≤ m, we obtain the string S^j#_j of length n_j + 1 by appending the special character #_j to S^j. This ensures that each suffix can uniquely be assigned to one of the m strings: if the suffix ends with #_j, then it belongs to S^j. If we assume that #_1 < #_2 < ··· < #_m and that all other characters in the alphabet Σ are larger than these symbols, then the suffixes of the strings S^1#_1, S^2#_2, ..., S^m#_m are not only pairwise distinct, but we also have S^j_p#_j < S^k_q#_k if and only if either S^j_p < S^k_q or S^j_p = S^k_q and j < k. In the following, we tacitly assume that every string S^j (1 ≤ j ≤ m) is terminated with the character #_j, so its size is n_j + 1.
The generalized suffix array (GSA) of the strings S^1, S^2, ..., S^m consists of two arrays of size n = m + Σ_{j=1}^{m} n_j, the document array DA and the array SA, having the following properties:
For every suffix S^j_k, there is an index i so that j = DA[i] and k = SA[i].
S^{DA[i]}_{SA[i]} < S^{DA[i+1]}_{SA[i+1]} for all i with 1 ≤ i ≤ n−1.
In other words, the arrays DA and SA specify the lexi-
cographic order of all the suffixes of the m strings. An
example can be found in (the two rightmost columns
of) Figure 2, which also shows the corresponding
LCP-array and BWT. We will call the combination
of the GSA with the LCP-array the enhanced GSA of
S^1, S^2, ..., S^m. (Louza et al., 2017) have shown that the (enhanced) GSA can be constructed in O(n) time with only one special character instead of the m special characters #_1, #_2, ..., #_m (this is advantageous be-
cause it keeps the alphabet small). However, it is con-
ceptually easier to use m special characters instead of
one.
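For illustration, the small program below reproduces the DA and SA columns of Figure 2 by sorting all suffixes explicitly. This is quadratic work and therefore suitable for toy inputs only; realistic inputs require a linear-time construction such as (Louza et al., 2017). The sentinels #_1 < #_2 < #_3 are realized as characters smaller than 'A'; all names are ours.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> S = {"ABCA", "BCAC", "ABCD"};   // the parses from Figure 1
    for (std::size_t j = 0; j < S.size(); ++j)
        S[j] += char(1 + j);                                 // append #_j, with #_1 < #_2 < #_3 < 'A'

    std::vector<std::pair<int, int>> gsa;                    // (DA, SA) pairs, 1-based positions
    for (int j = 0; j < (int)S.size(); ++j)
        for (int p = 1; p <= (int)S[j].size(); ++p)
            gsa.push_back({j + 1, p});

    std::sort(gsa.begin(), gsa.end(), [&](const auto& a, const auto& b) {
        return S[a.first - 1].substr(a.second - 1) < S[b.first - 1].substr(b.second - 1);
    });
    for (auto [d, s] : gsa)
        std::cout << "DA=" << d << " SA=" << s << '\n';      // matches the two rightmost columns of Figure 2
}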
4 COMPUTATION OF multiMUMs
In order to determine anchors (on the parse-level
or on the base-level), we must first compute frag-
ments and then a highest-scoring global chain of col-
inear non-overlapping fragments. MGA (Höhl et al., 2002) as well as FMAlign2 (Zhang et al., 2024) use
multiMEMs as fragments, but like parsnp (Treangen
Figure 2: The GSA of the strings S^1 = ABCA#_1, S^2 = BCAC#_2, and S^3 = ABCD#_3 (derived from the parses of Figure 1) consists of the two arrays DA and SA. The enhanced GSA includes the corresponding LCP-array (note that the BWT can easily be computed on the fly).

 i | LCP | BWT | S^{DA[i]}_{SA[i]} | DA | SA
 1 |  −1 |  A  | #_1               |  1 |  5
 2 |   0 |  C  | #_2               |  2 |  5
 3 |   0 |  D  | #_3               |  3 |  5
 4 |   0 |  C  | A#_1              |  1 |  4
 5 |   1 | #_1 | ABCA#_1           |  1 |  1
 6 |   2 | #_3 | ABCD#_3           |  3 |  1
 7 |   1 |  C  | AC#_2             |  2 |  3
 8 |   0 |  A  | BCA#_1            |  1 |  2
 9 |   3 | #_2 | BCAC#_2           |  2 |  1
10 |   2 |  A  | BCD#_3            |  3 |  2
11 |   0 |  A  | C#_2              |  2 |  4
12 |   1 |  B  | CA#_1             |  1 |  3
13 |   2 |  B  | CAC#_2            |  2 |  2
14 |   1 |  B  | CD#_3             |  3 |  3
15 |   0 |  C  | D#_3              |  3 |  4
16 |  −1 |     |                   |    |
et al., 2014) we use multiMUMs instead. This is
because the explanation of how multiMUMs can be
computed is much easier to understand and prelimi-
nary experiments showed that multiMUMs are equally
good. Roughly speaking, a multiMUM is a string
ω occurring exactly once in each of the sequences
S^1, ..., S^m
with the property that ω cannot simulta-
neously be extended in all sequences (on either end)
without incurring a mismatch. The formal definition
reads as follows.
Definition 1. A multiple exact match in the sequences S^1, ..., S^m is an (m+1)-tuple (ℓ, p_1, ..., p_m) with ℓ > 0 and 1 ≤ p_k ≤ n_k − ℓ + 1 (1 ≤ k ≤ m) so that S^i[p_i..p_i+ℓ−1] = S^j[p_j..p_j+ℓ−1] for all i, j ∈ {1, ..., m}. In words, the length-ℓ substrings of S^1, ..., S^m starting at the positions p_1, ..., p_m coincide. A multiple exact match is left-maximal if for at least one pair (i, j) we have S^i[p_i−1] ≠ S^j[p_j−1] (for k ∈ {1, ..., m}, we define S^k[p_k−1] = #_k if p_k = 1, see the definition of the BWT). It is right-maximal if for at least one pair (i, j) we have S^i[p_i+ℓ] ≠ S^j[p_j+ℓ]. A multiple exact match is maximal if it is left-maximal and right-maximal. A multiple maximal exact match is also called multiMEM. A multiMEM (ℓ, p_1, ..., p_m) is a multiMUM (multiple maximal unique match) if for all i with 1 ≤ i ≤ m the string S^i[p_i..p_i+ℓ−1] occurs exactly once in the sequence S^i.
If there is a long multiMUM (ℓ, p_1, ..., p_m) in the sequences S^1, ..., S^m, then it is very likely that the identical substrings S^1[p_1..p_1+ℓ−1], ..., S^m[p_m..p_m+ℓ−1] appear one below the other in a multiple alignment of S^1, ..., S^m. In other words, the multiMUM serves as a potential anchor. The next lemma tells us how multiMUMs can efficiently be computed.
Lemma 2. There is a one-to-one correspondence between the set of all multiMUMs and the set of all lcp-intervals ℓ-[lb..rb] in the enhanced GSA of S^1, ..., S^m satisfying
(1) rb − lb + 1 = m.
(2) DA[i] ≠ DA[j] for all pairs (i, j) with lb ≤ i < j ≤ rb.
(3) BWT[i] ≠ BWT[j] for at least one pair (i, j) with lb ≤ i < j ≤ rb.
Proof. Let ℓ-[lb..rb] be an lcp-interval satisfying the three conditions. By conditions (1) and (2), we have {DA[k] | lb ≤ k ≤ rb} = {1, ..., m}. That is, the m suffixes in [lb..rb] belong to m different strings. Let i_1, ..., i_m be the permutation of the indices lb, lb+1, ..., rb so that DA[i_k] = k for 1 ≤ k ≤ m. Define p_k = SA[i_k] for 1 ≤ k ≤ m. We claim that (ℓ, p_1, ..., p_m) is a multiMUM. By the definition of an lcp-interval, the length-ℓ string ω = S^k[p_k..p_k+ℓ−1] is a common prefix of the suffixes in [lb..rb]. It follows that ω occurs exactly once in each of the strings S^1, ..., S^m and that (ℓ, p_1, ..., p_m) is a multiple exact match. By condition (3), it is left-maximal. Since there is at least one index q in [lb..rb] so that LCP[q] = ℓ (definition of lcp-interval), it is also right-maximal. In summary, (ℓ, p_1, ..., p_m) is a multiMUM.
Conversely, let (ℓ, p_1, ..., p_m) be a multiMUM. That is, the string ω = S^k[p_k..p_k+ℓ−1] (1 ≤ k ≤ m) occurs exactly once in each sequence S^1, ..., S^m. In combination with the right-maximality this implies that there is an lcp-interval ℓ-[lb..rb] that contains exactly the suffixes S^1_{p_1}, ..., S^m_{p_m}. This lcp-interval satisfies conditions (1) and (2). It also satisfies condition (3) because (ℓ, p_1, ..., p_m) is left-maximal.
According to the preceding lemma we can com-
pute multiMUMs as follows: Enumerate all lcp-
intervals and for each lcp-interval ℓ-[lb..rb] of size m
check whether the m indices in [lb..rb] satisfy condi-
tions (2) and (3). In the example of Figure 2, 2-[8..10]
is the only lcp-interval of size 3. This interval fulfills
conditions (2) and (3) and hence it corresponds to the
multiMUM (2, 2, 1, 2), which is the anchor in Figure 1.
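The following sketch shows how conditions (1)–(3) of Lemma 2 can be checked for a single lcp-interval of the enhanced GSA; applied to 2-[8..10] from Figure 2 it yields the multiMUM (2, 2, 1, 2). It is an illustration with our own names, not PANAMA's actual implementation (arrays are 1-based as in the paper).

#include <optional>
#include <vector>

struct MultiMUM {
    int len;               // the length ell
    std::vector<int> pos;  // pos[k-1] = p_k, the start position in sequence k
};

// ell-[lb..rb] is an lcp-interval of the enhanced GSA of m sequences.
std::optional<MultiMUM> multiMUMFromInterval(int ell, int lb, int rb, int m,
                                             const std::vector<int>& DA,
                                             const std::vector<int>& SA,
                                             const std::vector<char>& BWT) {
    if (rb - lb + 1 != m) return std::nullopt;              // condition (1)
    std::vector<int> pos(m + 1, 0);
    for (int i = lb; i <= rb; ++i) {
        if (pos[DA[i]] != 0) return std::nullopt;           // condition (2): all DA values distinct
        pos[DA[i]] = SA[i];
    }
    bool leftMaximal = false;                               // condition (3): two BWT characters differ
    for (int i = lb + 1; i <= rb; ++i)
        if (BWT[i] != BWT[lb]) { leftMaximal = true; break; }
    if (!leftMaximal) return std::nullopt;
    return MultiMUM{ell, std::vector<int>(pos.begin() + 1, pos.end())};
}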
It is well known that all lcp-intervals can be enumerated in O(n) time (Abouelhoda et al., 2004; Kasai et al., 2001). The enumeration algorithm is shown in the Appendix (Algorithm 1). An alternative method for computing multiMUMs can be found in (Ohlebusch and Kurtz, 2008). It works by (a) separately streaming each string S^j (2 ≤ j ≤ m) against the suffix tree of S^1 and
Figure 3: Schematic construction of our backbone (Step 1: multiMUM finding; Step 2: chaining, in which multiMUMs V and VI are removed; Step 3: extending). Step 1 runs on the parse, while Step 3 is done on the base-level. Note that the situation of multiMUM V is highly unlikely to occur (especially on the parse-level) and would likely indicate a transposition. This case is included solely to exhaustively display the reasons why a multiMUM may not be included in the chain. Note that the Steps 1–3 here correspond to Phases 2b–2d (see Introduction).
(b) combining the pairwise exact matches to multiple
exact matches. In fact, the method described in (Ohle-
busch and Kurtz, 2008) computes rare multiMEMs,
but it yields multiMUMs if the rareness-thresholds are
all set to 1.
5 FRAGMENT-CHAINING AND
EXTENSION OF ANCHORS
To find anchors, one must find a highest-scoring
global chain of colinear non-overlapping fragments.
In our context, a fragment f is a multiMUM (ℓ, p_1, ..., p_m). We associate a weight with each fragment, denoted by f.weight. If f is a multiMUM on the base-level, then we set f.weight = ℓ. If f is a multiMUM on the parse-level, then f.weight is the number of bases that are obtained by replacing the identifiers (meta-symbols) in S^k[p_k..p_k+ℓ−1] with their phrases from the dictionary.
Roughly speaking, two fragments f and f′ are colinear if their order is the same in all of the sequences. In Step 1 of Figure 3, for example, the fragments I and II are colinear, but II and V are not. Two fragments overlap if their segments overlap in one of the sequences (in Figure 3, the fragments I and VI are overlapping, while I and II are non-overlapping). We define a binary relation ≪ on the set of fragments so that f ≪ f′ if and only if f and f′ are colinear and non-overlapping.
Definition 3. Let f = (ℓ, p_1, ..., p_m) and f′ = (ℓ′, p′_1, ..., p′_m) be two fragments. We define f ≪ f′ if and only if p_k + ℓ − 1 < p′_k for all k with 1 ≤ k ≤ m. We then say that f precedes f′.
A chain C of colinear non-overlapping fragments (‘chain’ for short) is a sequence of fragments f_1, f_2, ..., f_j so that f_i ≪ f_{i+1} for all i with 1 ≤ i < j. The score of C is score(C) = Σ_{i=1}^{j} f_i.weight. Given k weighted fragments, the global fragment-chaining problem is to determine a chain of highest score (called optimal global chain in the following) starting at the origin 0 and ending at terminus t. (The origin 0 = (0, 0, ..., 0) and the terminus t = (0, n_1+1, ..., n_m+1) are artificial fragments with weight 0. Note that 0 ≪ f ≪ t for every fragment f with f ≠ 0 and f ≠ t.)
Let f′.score be defined as the maximum score of all chains starting at 0 and ending at f′. Then f′.score can be expressed by the recurrence: 0.score = 0 and
f′.score = f′.weight + max{f.score | f ≪ f′}.
A dynamic programming algorithm based on this recurrence takes O(mk²) time to compute an optimal global chain, where k is the number of fragments.
We can use the fact that we expect an optimal global chain to contain almost all of the fragments to reduce the expected time complexity to O(k log k + mk): We sort the fragments by increasing position in e.g. the first sequence and process them in this order. Moreover, we maintain the already processed fragments in an array sorted by score. For each fragment f′ we search for its predecessor of highest score as follows: We consider the already processed fragments (starting with the one with the highest score so far) and pick the first fragment which is actually a predecessor of f′. For almost all fragments, the number of other fragments checked will be one and the fragment will be inserted at the end of the array. Of course, this heuristic does not reduce the worst case time complexity, but it works very well in our context.
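The sketch below (our own code and naming, not PANAMA's implementation) illustrates this heuristic: fragments are processed by position in the first sequence, processed fragments are kept in an array sorted by score, and the predecessor search starts at the highest score seen so far. The ≪-test follows Definition 3.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Fragment {
    int len;                    // length ell
    std::vector<int> pos;       // start position in each of the m sequences
    std::int64_t weight = 0;    // e.g. ell (base-level) or covered bases (parse-level)
    std::int64_t score = 0;     // best chain score ending here
    int pred = -1;              // predecessor in the best chain
};

// f precedes g (f << g) iff f ends before g starts in every sequence (Definition 3).
static bool precedes(const Fragment& f, const Fragment& g) {
    for (std::size_t k = 0; k < f.pos.size(); ++k)
        if (f.pos[k] + f.len - 1 >= g.pos[k]) return false;
    return true;
}

// Returns the indices of a high-scoring chain; quadratic in the worst case,
// but near-linear when almost all fragments end up in the chain.
std::vector<int> chain(std::vector<Fragment>& frags) {
    std::vector<int> order(frags.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return frags[a].pos[0] < frags[b].pos[0];        // process by position in the first sequence
    });
    std::vector<int> byScore;                            // processed fragments, sorted by score
    for (int idx : order) {
        Fragment& f = frags[idx];
        f.score = f.weight;                              // chain consisting of f alone
        for (auto it = byScore.rbegin(); it != byScore.rend(); ++it)
            if (precedes(frags[*it], f)) {               // first hit = highest-scoring predecessor
                f.score = frags[*it].score + f.weight;
                f.pred = *it;
                break;
            }
        auto where = std::upper_bound(byScore.begin(), byScore.end(), idx,
            [&](int a, int b) { return frags[a].score < frags[b].score; });
        byScore.insert(where, idx);                      // usually an append at the end
    }
    int best = byScore.empty() ? -1 : byScore.back();    // backtrack from the highest score
    std::vector<int> result;
    for (int i = best; i != -1; i = frags[i].pred) result.push_back(i);
    std::reverse(result.begin(), result.end());
    return result;
}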
The fragments (multiMUMs) in an optimal global
chain on the parse-level are the initial anchors of
the multiple alignment of the m sequences. Such
an anchor (a multiMUM of the parses) cannot be ex-
tended on the parse-level, but in most cases it can
be extended on the base-level. This is because the
phrases corresponding to two different meta-symbols
may share a common suffix (so that a left extension
on the base-level may be possible) or prefix (ditto for
a right extension). If we replace each meta-symbol in
a multiMUM on the parse-level with its corresponding
phrase from the dictionary, then we obtain a multi-
ple exact match on the base-level. We simultaneously
extend it base-by-base to the left (and right, respec-
tively) in each of the m sequences until a mismatch
occurs. Since such a mismatch is most likely a SNP,
we try to further extend it by the same procedure.
This iterative extension ends when less than 10 bases
match exactly (simultaneously in all of the sequences)
beyond one mismatch. This is illustrated in Step 3 of
Figure 3: The anchors I, II, and IV are extended on
the base-level (anchor I is extended twice to the left).
After this extension step, we have the final anchors
(the backbone) of the overall multiple alignment.
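A minimal sketch of the right extension on the base-level follows (the left extension is symmetric): columns are extended while all sequences agree, and a single mismatching column is crossed (counted as a potential SNP) only when at least MIN_MATCH bases match exactly beyond it. The threshold of 10 comes from the description above; the names and the exact control flow are our own assumptions, not PANAMA's actual code.

#include <string>
#include <vector>

constexpr int MIN_MATCH = 10;

// 'end[k]' is the position one past the current anchor in sequence k (0-based here).
// Returns the number of columns by which the anchor was extended to the right.
int extendRight(const std::vector<std::string>& seq, std::vector<std::size_t>& end) {
    auto column_matches = [&](std::size_t offset) {
        for (std::size_t k = 0; k < seq.size(); ++k) {
            std::size_t i = end[k] + offset;
            if (i >= seq[k].size() || seq[k][i] != seq[0][end[0] + offset]) return false;
        }
        return true;
    };
    int added = 0;
    for (;;) {
        std::size_t run = 0;
        while (column_matches(run)) ++run;                  // exact extension until a mismatch (or sequence end)
        std::size_t beyond = 0;
        while (column_matches(run + 1 + beyond)) ++beyond;  // exact matches past one mismatching column
        bool crossSnp = beyond >= MIN_MATCH;
        if (crossSnp) run += 1 + beyond;                    // keep the mismatching column (a putative SNP) too
        for (auto& e : end) e += run;
        added += (int)run;
        if (!crossSnp) return added;
    }
}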
6 GENERATING ALIGNMENTS
OF THE GAPS
After the computation of the anchors, we need to
fill the gaps between them. (Note that this can triv-
ially be parallelized because the alignments of two
gaps are independent of each other.) For small gaps,
we can simply use another multiple-sequence aligner
(we use FMAlign2 (Zhang et al., 2024) and MAFFT
(Nakamura et al., 2018)). However, when there are
large gaps, e.g. because the sequences are incomplete,
those other programs take very long to run.
For this reason, we try to fill the gaps using par-
tial multiMUMs. A partial multiMUM is a multiMUM
of a non-empty subset S of the sequences. For con-
venience, we denote it by (ℓ, p_1, ..., p_m), where p_i is the position of the multiMUM in sequence i if this sequence is in S and p_i = ⊥ otherwise (where ⊥ is a special value indicating absence). We say the partial multiMUM occurs in sequence i if and only if p_i ≠ ⊥, and for all i ∈ {1, ..., m} with p_i ≠ ⊥ it is required that the string S^i[p_i..p_i+ℓ−1] occurs exactly once in S^i.³
In what follows, we describe our process of com-
pleting the alignment (Phase 3 in the Introduction) in
the six steps that are visualized in Figure 4. The Steps
1–5 correspond to Phase 3a and Step 6 equals Phase
3b.
Step 1: Find Partial multiMUMs. The first step is
to find partial multiMUMs. By definition, this can
be achieved using a trivial modification (replacing
rb lb + 1 = m with rb lb + 1 m in Lemma 2)
of the algorithm for finding multiMUMs. If the gap
is sufficiently large, we work on the parse-level, oth-
erwise we work on the base-level. If a previous at-
tempt on the parse-level of this gap did not yield any
new anchors, we also switch to the base-level. For
³ Note that this definition does not require the converse to be true (i.e. if the string S^i[p_i..p_i+ℓ−1] occurs exactly once in sequence i, the partial multiMUM must not necessarily occur in sequence i). This is because our algorithm for chaining partial multiMUMs may discard the occurrence of a partial multiMUM in a sequence in order to increase the overall score of the chain (although this is unlikely).
Figure 4: Schematic filling of a large gap (Step 1: find partial multiMUMs; Step 2: chaining; Step 3: minimap2; Step 4: splitting; Step 5: recurse into large gaps; Step 6: align small gaps externally). Note that it is possible to find multiMUMs that span all sequences (such as III) which were not discovered when considering all sequences. This can happen when the same string also appears somewhere outside the large gap, or because we compute multiMUMs on the base-level instead of on the parse-level.
algorithmic reasons (see below), we require each par-
tial multiMUM to occur in more than half of the se-
quences.
Step 2: Chain Partial multiMUMs. As in the com-
putation of the backbone, we chain the fragments
(partial multiMUMs). However, the weight of a frag-
ment must now also incorporate the number of se-
quences the fragment occurs in. As new weight, we
use the length of the fragment multiplied by the num-
ber of sequences it occurs in (i.e. the total number
of bases it covers, as does FMAlign2 (Zhang et al.,
2024)). Unfortunately, chaining partial fragments ap-
pears to be fundamentally more difficult than chaining
fragments occurring in all sequences. This is due to
the fact that the ≪-relation on partial fragments is not transitive anymore.⁴ To circumvent this issue, we use a heuristic. Specifically, we select k sequences and then chain only those partial multiMUMs occurring in all of these k sequences. The selected sequences are those with the most bases covered by the found partial multiMUMs. We do this for all k ∈ {⌊m/2⌋+1, ..., m},

⁴ All (polynomial-time) chaining algorithms known to us rely on this transitivity.
Figure 5: Since we require that each anchor occurs in more than half of the sequences, they can be linearly ordered (I ≪ II ≪ III ≪ IV ≪ V). Consider the gap in the second sequence between II and V (marked red). The anchors III and IV are between II and V in the linear order, thus we try to map them (the string they represent) to this gap using minimap2. Similarly, we would try to map II to the gap between I and III in the third sequence (marked blue), I to the gap before II in the first sequence, and V to the third sequence after IV.
and then choose the resulting chain with the most
bases covered. The restriction k > m/2 is required for every pair of partial multiMUMs to be either incompatible (i.e. overlapping or not colinear) or comparable with ≪, and thus ensures that the anchors can be
totally ordered.
The partial multiMUMs in the resulting ‘sub-
chain’ may have had occurrences in the m − k other
sequences, which should of course not be discarded.
Thus, we add them to the chain with a simple vari-
ant of the well-known algorithm for the heaviest in-
creasing subsequence (HIS) (Jacobson and Vo, 1992).
Since the order of the partial multiMUMs in the sub-
chain is fixed, this can be done separately for each
sequence.
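For completeness, the following sketch shows a standard heaviest-increasing-subsequence computation in the spirit of (Jacobson and Vo, 1992): the items are already in a fixed order (the order of the sub-chain), their keys are positions in the sequence under consideration, and a maximum-weight subset with strictly increasing keys is selected. The Fenwick-tree realization and all names are our own choices, not necessarily those used in PANAMA.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Item { int key; std::int64_t weight; };

// Returns the maximum total weight of a subsequence with strictly increasing keys.
// (To recover the chosen items, one would additionally store predecessor pointers.)
std::int64_t heaviestIncreasingSubsequence(const std::vector<Item>& items) {
    std::vector<int> keys;                                  // coordinate-compress the keys
    for (const auto& it : items) keys.push_back(it.key);
    std::sort(keys.begin(), keys.end());
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

    std::vector<std::int64_t> tree(keys.size() + 1, 0);     // Fenwick tree over prefix maxima
    auto update = [&](int i, std::int64_t v) {
        for (; i <= (int)keys.size(); i += i & -i) tree[i] = std::max(tree[i], v);
    };
    auto query = [&](int i) {                               // maximum over compressed keys 1..i
        std::int64_t best = 0;
        for (; i > 0; i -= i & -i) best = std::max(best, tree[i]);
        return best;
    };

    std::int64_t answer = 0;
    for (const auto& it : items) {
        int c = int(std::lower_bound(keys.begin(), keys.end(), it.key) - keys.begin()) + 1;
        std::int64_t best = query(c - 1) + it.weight;       // predecessors must have strictly smaller keys
        answer = std::max(answer, best);
        update(c, best);
    }
    return answer;
}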
Step 3: Extend Partial Fragments with Minimap2.
After computing such a sub-chain, we try to extend
each resulting partial fragment to the sequences it
does not occur in using the long read mapper min-
imap2 (Li, 2021). This is illustrated in Figure 5.
Recall that the fragments are linearly ordered be-
cause each occurs in more than half the sequences.
Let those fragments be f_1 ≪ ··· ≪ f_k. We consider each gap in each sequence separately. For such a gap, we determine the two fragments f_i and f_j (1 ≤ i < j ≤ k) that bound it and try to map the fragments between them ({f_{i+1}, ..., f_{j−1}}) to this gap using minimap2.
The matches reported by minimap2 are then chained
using the same HIS variant as above.
Now the fragments in the sub-chain are not
changed anymore and are considered anchors.
Steps 4–6: Complete Alignment. After adding the
new anchors, we again try to split the large gap into
smaller gaps, which are independent of each other.
If a resulting gap is sufficiently small, we use
FMAlign2 to align it, otherwise we recurse. It may
be the case that, after splitting, a gap still contains
some anchors. (This happens when anchors occur in
too few sequences and do not lead to a split.) These
anchors are kept, and we ensure that chains in lower
recursion levels are consistent with them.
If there is a sequence in a large gap that can-
not be aligned to the other sequences via (partial)
multiMUMs or minimap2, this gap cannot be split.
However, since we can assume that there is no match
anyway (because neither minimap2 nor we could find
one), we remove this sequence from this gap before
recursing and ‘align’ it immediately after the gap.
This is only done when there are anchors exceeding
a given length that span the other sequences.
7 EXPERIMENTAL RESULTS
We implemented our algorithm in an experimental
tool, called PANAMA, in C++. The source code is
publicly available.⁵ For the construction of enhanced suffix arrays we used the SDSL-lite library (Gog et al., 2014).
We evaluated our implementation on two datasets, namely data from the 1000 Genomes Project⁶ (56 GiB) and data from the draft human pangenome reference⁷ (21 GiB) (Liao et al., 2023). The generated alignments are publicly available.⁸
The first dataset stems from (Boucher et al., 2021)
and contains 1000 human haplotypes of chromosome
19. The sequences within this set all have a length of
about 59 million base pairs. It can be viewed as a set
of complete telomere-to-telomere assemblies (there is
no missing data).
The second dataset contains the phased, diploid
assemblies of 47 genetically diverse humans. In or-
der to determine the ordering and orientation of the
contigs of the assemblies, we utilized the mapping
software gedmap (Büchler et al., 2023) as a long read
mapper. For each phased assembly A, we split each contig c_A of A into 500 bp long reads and mapped these to the reference genome hs37d5 including the variants from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015). The resulting alignments were chained and the best chain was used to map c_A to the reference. In most cases, the best chain distinctly revealed the position and orientation of c_A relative to the reference. When a contig c_A could
not be placed, it was excluded from further process-
⁵ https://gitlab.com/qwerzuiop/panama
⁶ http://dolomit.cs.tu-dortmund.de/tudocomp/chr19.1000.fa.xz archived at https://www.uni-ulm.de/in/theo/research/seqana/panama
⁷ https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=working
⁸ https://www.uni-ulm.de/in/theo/research/seqana/panama
Table 1: Wall clock time in minutes of the individual phases for our two test sets and two PFP-moduli, and the maximum RAM usage in GiB. The numbering of the phases refers to the overview in the Introduction.

dataset          | c19 | c19 | c1  | c1
modulus          | 100 | 20  | 100 | 20
1: PFP           | 34  | 47  | 15  | 20
2: find backbone | 28  | 125 | 9   | 35
3: align gaps    | 22  | 23  | 40  | 40
total            | 98  | 195 | 64  | 95
RAM              | 64  | 69  | 51  | 111
ing. We selected all contigs of A that were aligned to
chromosome 1 (the largest human chromosome) and
sorted them by their position. If necessary, we re-
placed them with their reverse complement. The con-
catenation of these ordered contigs (separated by spe-
cial symbols) was used as the (partial) DNA sequence
of chromosome 1 of assembly A. In our experiment,
we included all sequences (obtained in this way) that
covered at least 80% of chromosome 1, which was
the case for 92 out of the 94 assemblies. It is thus not
surprising that the lengths of the sequences in this set
range from 203 to 250 million base pairs. This means
that we allowed up to 20% missing data per sequence,
which is an additional challenge when generating the
alignment.
For filling the small gaps (at most 10⁷ bases), we used FMAlign2 (Zhang et al., 2024) with HAlign3 (Tang et al., 2022) as backend, and fell back to MAFFT (Nakamura et al., 2018) in case FMAlign2 produced no output.⁹ We accepted a match from minimap2 when the reported MAPQ (‘mapping quality’) value of the mapping was at least 30.¹⁰ Since the
modulus used in PFP for determining trigger strings
influences the expected length of the phrases, we eval-
uated our program PANAMA with different moduli
(20 and 100). The window size for the KR-hashing
was always 10, because it seems to not significantly
affect the properties of the parsing as long as it is suf-
ficiently large (Boucher et al., 2019). Note that a win-
dow size of 10 is also the default used in (Boucher
et al., 2019). The modulus for PFP in large gaps was
always 20.
All experiments were conducted on a Linux-6.5.0
machine with an AMD EPYC 7742 (64 cores, 128
threads) processor and 256GB of RAM. We com-
piled PANAMA with GCC 11.4.0 with optimiza-
⁹ FMAlign2 crashes on some inputs. This affects mostly simple cases, e.g. where all sequences are prefixes of the same string.
¹⁰ This corresponds to a probability of the mapping being wrong of at most 0.1%. Note that after accepting matches, we still chain them.
tion flags -O3 -funroll-loops -march=native
-DNDEBUG. The reported wall clock time was mea-
sured with the utility GNU Time. The reported RAM
usage was calculated by measuring the maximum res-
ident system RAM usage during execution and sub-
tracting the resident system RAM usage before/after
execution.¹¹ The results of our experiments can be
seen in Tables 1 and 2. By far the most memory hun-
gry phase is Phase 3, which invokes external aligners
in parallel (the external aligners are responsible for
most of the maximum RAM usage here). Therefore,
the memory usage of PANAMA could greatly be re-
duced by reducing the number of concurrent threads.
There is no data missing in the chromosome 19
dataset, thus the gaps in the backbone (i.e. after ex-
tension 2d) are small. Tables 1 and 2 show that
PANAMA can generate a high quality multiple align-
ment of 1000 haplotypes of chromosome 19 within
hours. (Of course, such a high quality alignment is
possible only if the input sequences are very similar,
which is the case here.)
In contrast, on the chromosome 1 dataset, the
number of bases per sequence varies greatly because
we had to use contigs instead of whole chromosomes.
This variance leads to large gaps in the backbone. In
particular, there is one large gap for which PANAMA
could not produce a reasonable alignment. This gap is
located around the centromere (Logsdon et al., 2024),
starting at around base 1.03·10⁸ and ending at around base 1.36·10⁸.¹² The centromere is known to be
particularly difficult to align as it is highly repetitive
(Logsdon et al., 2024). Since PANAMA was unable
to generate a reasonable alignment of the centromeric
region, the coverage is ‘only’ 81%, see Table 2.
Note that the suffix array construction algo-
rithms in the SDSL-lite library (DivSufSort and
qsufsort) are not the fastest ones that exist (see e.g.
(Olbrich et al., 2024) for a recent comparison) and
are not parallelized. Moreover, the library is gener-
ally focused on low memory usage instead of runtime.
Therefore, we conjecture that the performance (espe-
cially of Phase 2a ‘compute GSA’) of our implemen-
tation can be significantly improved.
We also ran FMAlign2 (with HAlign3 as back-
end), HAlign3, and MAFFT v7.490 (all with 128
¹¹ Since FMAlign2 and PANAMA execute other external programs, measuring RAM usage with e.g. GNU Time is not sufficient.
¹² Since the reference genome is not part of our dataset, the stated positions are estimated as the number of bases preceding the gap. The start position of the gap varies by less than 10⁶ bases, but the number of bases per sequence in this gap ranges from 1.2·10⁷ to 5.7·10⁷. The reported end position is the start position plus the median of the number of bases in the sequences in this gap.
Table 2: Evaluation of the generated alignments: The number of anchors equals the number of multiMUMs in the chain computed in Phase 2c. The number of extensions equals the number of left or right extensions of anchors beyond a mismatching column made in Phase 2d. This can also be seen as the number of potentially detected SNPs at this phase. Percent coverage is the number of bases that are included in our alignment relative to the overall number of bases. Percent identity is the number of 100% identity columns (i.e. columns without gaps in which all bases are identical) relative to the alignment length. The centromeric region of chromosome 1 is included in all measurements but excluded from the computation of the identity.

dataset             | c19    | c19    | c1     | c1
modulus             | 100    | 20     | 100    | 20
# anchors           | 88788  | 305480 | 336206 | 640794
# extensions        | 335644 | 509986 | 661532 | 712606
coverage in %:      |        |        |        |
  after 2c (chain)  | 19.01  | 49.30  | 23.28  | 49.27
  after 2d (extend) | 65.84  | 92.57  | 63.98  | 70.32
  final alignment   | 100    | 100    | 81.38  | 81.38
identity in %       | 97.93  | 98.15  | 75.93  | 75.90
Table 3: Results of FMAlign2, HAlign3, and PANAMA (with PFP modulus 100) on a few haplotypes of chromosome 19. The given time is the wall clock time.

(a) FMAlign2 running on small sets of chromosome 19 haplotypes.
#haplotypes | 2  | 3  | 4   | 5   | 6   | 7
time (min)  | 36 | 78 | 123 | 167 | 216 | 270
RAM (GiB)   | 19 | 27 | 35  | 43  | 51  | 59

(b) HAlign3 running on small sets of chromosome 19 haplotypes.
#haplotypes | 2    | 4    | 16   | 24   | 26
time (min)  | 0.6  | 0.7  | 1.0  | 1.6  | 2.4
RAM (GiB)   | 10.0 | 12.2 | 26.5 | 32.2 | 32.5

(c) PANAMA running on small sets of chromosome 19 haplotypes.
#haplotypes | 2   | 4   | 16  | 24  | 26
time (min)  | 0.2 | 0.4 | 1.2 | 1.7 | 1.8
RAM (GiB)   | 0.4 | 0.5 | 0.8 | 1.0 | 1.2
threads and otherwise default options) on sets of chro-
mosome 19 haplotypes. The results are shown in
Table 3. FMAlign2 is able to compute the align-
ment of 3 to 6 haplotypes in the time PANAMA
requires for all 1000 (depending on the modulus
used), but is not suitable for the whole chromosome
19 dataset. On these datasets, HAlign3 is much
faster than FMAlign2 but still slower than PANAMA.
HAlign3 crashed on cases with more than 26 hap-
lotypes with a “NullPointerException”. MAFFT
already crashed on two chromosome 19 haplotypes
with a “TOO MANY SEGMENTS” error.
8 CONCLUSION AND FUTURE
WORK
The experiments showed that our tool PANAMA can
very efficiently generate a pangenomic-scale multiple
alignment of assembled genomes of the same species,
provided that structural variants seldom occur. It
could be argued that in the presence of large-scale
chromosomal rearrangements, a (colinear) multiple
alignment of the chromosomes does not make sense
at all. In that case, a pangenome graph is a better rep-
resentation of the genomes (Baaijens et al., 2022).
We plan to extend our work in such a way that
it can deal with large-scale structural variants. Our
approach can be modified to address this very impor-
tant topic as follows: instead of computing a multiple
alignment, it should be possible to use the syntenic re-
gions as the backbone of a pangenome graph. In the
construction of the full pangenome graph, transposi-
tions and inversions can be detected. Moreover, pro-
grams for pangenome graph generation can be used
to deal with the remaining (relatively small) gaps be-
tween anchors.
REFERENCES
Abouelhoda, M., Kurtz, S., and Ohlebusch, E. (2004). Re-
placing suffix trees with enhanced suffix arrays. Jour-
nal of Discrete Algorithms, 2(1):53–86.
Baaijens, J. et al. (2022). Computational graph pange-
nomics: a tutorial on data structures and their appli-
cations. Natural Computing, 21(1):81–108.
Boucher, C. et al. (2019). Prefix-free parsing for building
big BWTs. Algorithms for Molecular Biology, 14(1).
Boucher, C. et al. (2021). PHONI: Streamed matching
statistics with multi-genome references. In Data Com-
pression Conference (DCC), pages 193–202. IEEE.
Büchler, T., Olbrich, J., and Ohlebusch, E. (2023). Efficient short read mapping to a pangenome that is represented by a graph of ED strings. Bioinformatics, 39(5):btad320.
Burrows, M. and Wheeler, D. (1994). A block-sorting loss-
less data compression algorithm. Research Report
124, Digital Systems Research Center.
Chain, P. et al. (2003). An applications-focused review of
comparative genomics tools: Capabilities, limitations
and future challenges. Briefings in Bioinformatics,
4(2):105–123.
Cormen, T., Leiserson, C., and Rivest, R. (1990). Introduc-
tion to Algorithms. MIT Press, Cambridge, MA.
Gog, S. et al. (2014). From theory to practice: Plug and
play with succinct data structures. In Experimental
Algorithms (SEA), pages 326–337. Springer.
Höhl, M., Kurtz, S., and Ohlebusch, E. (2002). Efficient multiple genome alignment. Bioinformatics, 18:S312–S320.
Jacobson, G. and Vo, K. (1992). Heaviest increas-
ing/common subsequence problems. In Combinato-
rial Pattern Matching (CPM), pages 52–66. Springer.
Kasai, T. et al. (2001). Linear-time longest-common-prefix
computation in suffix arrays and its applications. In
Combinatorial Pattern Matching (CPM), pages 181–
192. Springer.
Li, H. (2021). New strategies to improve minimap2 align-
ment accuracy. Bioinformatics, 37(23):4572–4574.
Liao, W. et al. (2023). A draft human pangenome reference.
Nature, 617(7960):312–324.
Logsdon, G. et al. (2024). The variation and evolution of
complete human centromeres. Nature, 629:136–145.
Louza, F., Gog, S., and Telles, P. (2017). Inducing enhanced
suffix arrays for string collections. Theoretical Com-
puter Science, 678:22–39.
Nakamura, T. et al. (2018). Parallelization of MAFFT for
large-scale multiple sequence alignments. Bioinfor-
matics, 34(14):2490–2492.
Ohlebusch, E. and Kurtz, S. (2008). Space efficient com-
putation of rare maximal exact matches between mul-
tiple sequences. Journal of Computational Biology,
15(4):357–377.
Olbrich, J., Ohlebusch, E., and Büchler, T. (2024). Generic non-recursive suffix array construction. ACM Transactions on Algorithms, 20(2):18.
Porubsky, D. et al. (2021). Fully phased human genome as-
sembly without parental data using single-cell strand
sequencing and long reads. Nature Biotechnology,
39(3):302–308.
Puglisi, S., Smyth, W., and Turpin, A. (2007). A taxonomy
of suffix array construction algorithms. ACM Comput-
ing Surveys, 39(2):Article 4.
Tang, F. et al. (2022). HAlign 3: fast multiple alignment of
ultra-large numbers of similar DNA/RNA sequences.
Molecular Biology and Evolution, 39(8):msac166.
Tettelin, H. et al. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proceedings of the National Academy of Sciences of the United States of America, 102(39):13950–13955.
The 1000 Genomes Project Consortium (2015). A global
reference for human genetic variation. Nature,
526(7571):68–74.
Treangen, T. et al. (2014). The Harvest suite for rapid core-
genome alignment and visualization of thousands of
intraspecific microbial genomes. Genome Biology,
15(11):524.
Zhang, P. et al. (2024). FMAlign2: a novel fast multiple
nucleotide sequence alignment method for ultralong
datasets. Bioinformatics, 40(1):btae014.
APPENDIX
A Enumeration of LCP-Intervals
(Kasai et al., 2001) presented a linear time algorithm
to simulate the bottom-up traversal of a suffix tree
with a suffix array and its LCP-array (which, given
the suffix array, can be constructed in linear time).
The following algorithm is a slight modification of
their algorithm TraverseWithArray, cf. (Abouelhoda
et al., 2004). It computes all lcp-intervals of the
LCP-array with the help of a stack. The elements
on the stack are lcp-intervals represented by tuples
lcp,lb,rb, where lcp is the lcp-value of the inter-
val, lb is its left boundary, and rb is its right bound-
ary. In Algorithm 1, push (pushes an element onto the
stack) and pop (pops an element from the stack and
returns that element) are the usual stack operations,
while top provides a pointer to the topmost element
of the stack. Furthermore, ⊥ stands for an undefined value. We assume that array indexing starts at 1 and that LCP[1] = −1 = LCP[n + 1].
Function Enumerate(LCP):
push(⟨0, 1, ⊥⟩);
for k = 2 to n + 1 do
    lb ← k − 1;
    while LCP[k] < top().lcp do
        top().rb ← k − 1;
        interval ← pop();
        report(interval);
        lb ← interval.lb;
    end
    if LCP[k] > top().lcp then
        push(⟨LCP[k], lb, ⊥⟩);
    end
end
Algorithm 1: Given the LCP-array of a string of length n,
this algorithm enumerates all lcp-intervals.
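For readers who prefer compilable code, here is our own C++ rendering of Algorithm 1 (1-based arrays as in the paper, LCP[1] = LCP[n+1] = −1). An extra bottom element with lcp-value −1 is added so that the stack never runs empty when the outermost 0-interval is popped at k = n+1; this sentinel is not part of the pseudocode above.

#include <functional>
#include <vector>

struct LcpInterval { int lcp, lb, rb; };

void enumerateLcpIntervals(const std::vector<int>& LCP, int n,
                           const std::function<void(const LcpInterval&)>& report) {
    std::vector<LcpInterval> stack;
    stack.push_back({-1, 1, -1});                 // sentinel, not part of Algorithm 1
    stack.push_back({0, 1, -1});                  // -1 in 'rb' plays the role of the undefined value
    for (int k = 2; k <= n + 1; ++k) {
        int lb = k - 1;
        while (LCP[k] < stack.back().lcp) {
            stack.back().rb = k - 1;
            LcpInterval interval = stack.back();
            stack.pop_back();
            report(interval);
            lb = interval.lb;
        }
        if (LCP[k] > stack.back().lcp)
            stack.push_back({LCP[k], lb, -1});
    }
}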