ACCURATE LONG READ MAPPING

USING ENHANCED SUFFIX ARRAYS

Micha¨el Vyverman

, Joachim De Schrijver

, Wim Van Criekinge

, Peter Dawyndt

and Veerle Fack

Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281-S9, B-9000 Ghent, Belgium

Department of Molecular Biotechnology, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium

Keywords:

Inexact string matching, DNA sequencing, Read mapping, Enhanced sufﬁx arrays.

Abstract:

With the rise of high throughput sequencing, new programs have been developed for dealing with the align-

ment of a huge amount of short read data to reference genomes. Recent developments in sequencing tech-

nology allow longer reads, but the mappers for short reads are not suited for reads of several hundreds of

base pairs. We propose an algorithm for mapping longer reads, which is based on chaining maximal exact

matches and uses heuristics and the Needleman-Wunsch algorithm to bridge the gaps. To compute maximal

exact matches we use a specialized index structure, called enhanced sufﬁx array. The proposed algorithm is

very accurate and can handle large reads with mutations and long insertions and deletions.

1 INTRODUCTION

With the rise of the second generation sequencing

technology the size of the sequencing data has in-

creased dramatically. New applications and new chal-

lenges have arisen, including the need for new al-

gorithms and data structures to handle the increased

amount of data and new error models. The efﬁcient

mapping of sequencing reads to a reference genome

is one of the most important challenges of the new

sequencing technology, especially when considering

the reassembly of genomes and its application in the

prestiguous

1000 Genomes Project

Typical features of the new sequencing technol-

ogy are the small read length and new types of se-

quencing errors. In recent years, many mapping pro-

grams have been developed for handling short read

lengths. But, while sequencing technology advances,

the read length of second generation sequencing tech-

nology is increasing and third generation sequencing

promises read lengths of more than 1kbp. Moreover,

most mapping algorithms for short reads focus on mu-

tations as being the main error, while the longer 454

reads contain mostly insertions and deletions.

In this paper we focus on mapping reads of sev-

eral hundreds of base pairs to a medium sized refer-

ence sequence, where the reads may contain a small

number of long insertions and deletions, as well as

mutations. The proposed method is based on chain-

ing together maximal exact matches (MEMs) using

heuristics for speeding up the search. In order to

obtain the MEMs between query and reference se-

quence, we store the reference sequence in a special-

ized data structure, the so-called enhanced sufﬁx ar-

ray (ESA) (Abouelhoda et al., 2004). The MEMs are

used as anchors in the alignment and are combined

or chained in a full local alignment using a combi-

nation of the Needleman-Wunsch algorithm (Needle-

man and Wunsch, 1970) for global alignment together

with some heuristics.

Related Work. Many fast and accurate short-read

mappers exist today.

SOAP

(Li et al., 2008),

Bowtie

(Langmead et al., 2009) and

BWA

(Li and

Durbin, 2009) all perform well when used for short

reads with few mismatches. The upper bounds for

the read length are 200bp for

BWA

and 1024bp for

Bowtie

and

SOAP

. Hoffmann et al. introduced a new

short sequence mapper

segemehl

and compared its

performance with

SOAP

BWA

and

Bowtie

(Hoff-

mann et al., 2009). Their results show that most map-

ping programs are not capable of forming a correct

alignment when 4 or more differences between query

and reference sequence are present. Recently a fast

and accurate long read mapper

BWA-SW

using the

Burrows-Wheeler transform was introduced (Li and

Durbin, 2010).

Also mapping programs based on maximal ex-

act matches have already been investigated. For ex-

102

Vyverman M., De Schrijver J., Van Criekinge W., Dawyndt P. and Fack V..

ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS.

DOI: 10.5220/0003126201020107

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2011), pages 102-107

ISBN: 978-989-8425-36-2

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

ample,

AVID

(Bray et al., 2003) is a global align-

ment program that uses MEMs as anchors for a global

alignment.

AVID

further uses recursion and in a ﬁ-

nal step the Needleman-Wunsch algorithm is used in

order to ﬁll the gaps between the MEMs. MUM-

mer (Kurtz et al., 2004) is another alignment pro-

gram that uses a sufﬁx tree-based seed and extension

algorithm. MUMmer uses maximal unique matches

(MUMs) instead of MEMs.

2 PRELIMINARIES

By Σ = {A,C, G,T} we denote the alphabet of nu-

cleotides. Let S be a reference sequence over the al-

phabet Σ and let Q be a set of query sequences with

q ∈ Q. The inexact string matching problem is de-

ﬁned as ﬁnding an optimal local alignment between

the query sequences and the reference sequence S.

The queries can contain mismatches and gaps. The

differences can be the result of genetic differences, se-

quencing errors or other sources; however, the source

of the differences is not important for our purposes

and the gaps will be referred to as mutations, inser-

tions and deletions. The solution of the stated prob-

lem gives a starting position, followed by a list of the

positions of the insertions, deletions and mutations

along with their length if necessary. The edit distance

between S and a query q is deﬁned as the number of

‘errors’ (insertions, deletions or mutations) of the op-

timal local alignment found.

A maximal exact match (MEM) can intuitively be

seen as an exact match between substrings of two se-

quences that is no substring of another exact match-

ing substring between the two sequences. In this con-

text, a maximal exact match between the reference se-

quence S and the query sequence q is a triple (q

,ℓ,P),

where q

is a position in q corresponding to the start

of the match in q, ℓ is the length of the match and P

is a list of index positions p

in S that correspond to

starting positions of the match in S. When the list P

contains only one index position p we can also denote

the MEM as (q

,ℓ, p). The substrings correspond-

ing to [q

+ℓ],[p

, p

+ℓ],.. . , [p

|P|

, p

|P|

+ℓ] are ex-

act matches. The matches are maximal in the sense

that no exactly matching substring in S can be found

for the substrings corresponding to [q

− 1, ℓ] and

,ℓ + 1]. The deﬁnition of maximal exact matches

here is biased towards the query sequence. The rea-

son for this bias is that maximal exact matches will be

used to traverse the query sequence from left to right,

allowing the same substring in q to be encountered

more than once.

A sufﬁx tree (Gusﬁeld, 1997) for a string S (de-

noted by T ) is a tree-like data structure that indexes

S, by forming a one-to-one correspondence between

the sufﬁxes of the string and the leaves of the tree.

The main assets of the sufﬁx tree are its construction

cost, which is linear in the length of the string, and its

pattern matching cost, which is linear in the length of

the pattern.

Because many implementations and variants of

sufﬁx trees exist, it is important to note that T is a

sufﬁx tree with sufﬁx links. In our implementation we

use a variant of the sufﬁx tree, a so-called enhanced

sufﬁx array (ESA) (Abouelhoda et al., 2004), al-

though the idea behind the algorithm intuitively uses

a sufﬁx tree. We will use the general term index struc-

ture to refer to variants of both sufﬁx trees and sufﬁx

arrays.

3 ALGORITHM

3.1 Overview

The proposed algorithm ﬁrst builds an enhanced suf-

ﬁx array (ESA) for the reference sequence S and then

processes the query ﬁle one query at a time. Note that

the ESA data structure for S can be reused for all the

queries and in later projects. The ESA index is ﬁrst

used to calculate maximal exact matches (MEMs) be-

tween the reference sequence S and a query. After-

wards, the obtained MEMs are ﬁltered (using some

heuristics). After ﬁltering, the MEMs roughly cor-

responding to the best alignment position are joined

together. The chaining is done by traversing the list

of ﬁltered MEMs from left to right, relative to their

starting positions in the query sequence. The actual

joining process uses the Needleman-Wunsch global

alignment algorithm.

The main idea behind the algorithm is based on

ﬁrst pinpointing the exact position of the query in the

reference genome corresponding to the optimal align-

ment. Then the MEMs can be seen as a sort of an-

chors or seeds in the alignment. The chaining idea

is based on dynamic programming, where the matrix

rows correspond to the query, while the colums cor-

respond to the reference sequence and the MEMs are

diagonals in the matrix. The chaining algorithm tries

to combine diagonals that score well in order to ﬁnd

an optimal alginment.

3.2 Main Algorithm

Algorithm 1 requires as input the reference se-

quence S, a query q, the index structure T of S$, and

ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS

103

the expected edit distance k. The expected edit dis-

tance k is an important parameter but can be an ap-

proximation. The algorithm will try to ﬁnd an optimal

alignment of q in a substring of S of size |q| + 2k.

The matching process starts by ﬁnding the list

of all MEMs between S and q (line 1). In the next

step (line 2), the list is ﬁltered so that only the most

promising MEMs remain. For every MEM (q

,ℓ,P)

remaining after the ﬁltering, it holds that P contains

only one element. In most cases, the ﬁrst MEM is

part of the best alignment, but even in the worst case,

the ﬁltering process guarantees that it is part of a sub-

optimal alignment that is located within k positions of

the optimal alignment.

During the main loop (lines 9-16), the list of ﬁl-

tered MEMs is traversed. For every current MEM,

corresponding to an interval in q and an interval in S,

the closest MEM to the right of the current match is

searched. More details on the deﬁnition of ‘closest’

MEM and the steps in the inner loop (lines 11-12)

are given in the next paragraphs. Once the closest

MEM is found, the gap between the two matches is

bridged (line 14), using a mix of heuristics and the

Needleman-Wunsch global alignment algorithm.

It is possible that, due to errors in the ﬁrst and/or

last characters of q, the alignment obtained during

the main loop does not cover the entire query. The

functions dynProgBegin() (line 5) and dynProgEnd()

(line 18) are used to ﬁnd those parts of the alignment.

Algorithm 1: Main Algorithm.

Require: S, T , q ∈ Q, expected edit distance k

Ensure: an optimal local alignment of q in S

1: mems ←mems(S,q,T )

2: mems ←ﬁlter(mems, q, k)

3: mem ← mems.get(1)

4: if mem.q

> 0 then

5: sol ← dynProgBegin(S,q,mem,k)

6: else

7: sol ← mem.P {starting pos. of alignment in S}

8: i ← 2

9: while (i ≤ |mems|) do

10: j ← i

11: while j ≤ |mems|∧ distance between mem and

mems.get(j) is descending do

12: j ← j + 1

13: newMem ← mems.get( j − 1)

14: sol ← sol+ ﬁndGap(S,q,mem,newMem)

15: mem ← newMem

16: i ← j

17: if mem.q

+ ℓ < |q| then

18: sol ← sol+ dynProgEnd(S,q,mem,k)

19: return sol

3.3 Calculating the MEMs

The maximal exact matches between S and q are

found using an adaptation of the matching statistics

algorithm described in (Gusﬁeld, 1997). The main

idea behind the algorithm consists of matching every

sufﬁx of q to the sufﬁx tree of S. After a mismatch

is found for one sufﬁx, a sufﬁx link is used in or-

der to speed up the search for the next sufﬁx. Using

some tricks, the algorithm runs in linear time com-

plexity O (|q|). The main difference between their al-

gorithm and our approach is that the reference text S

and query (or pattern) q are switched. Normally, it

would be more interesting to build the index structure

for the shortest sequence (i.e. q), but in our case build-

ing an index structure once for the reference sequence

is more interesting than building an index structure

for every query, since the sum of the lengths of the

queries is much larger than the length of S. The ben-

eﬁt of this approach is that the MEMs are sorted in

increasing starting position in q. Also, the list P, cor-

respondingto the positions of a match in S, is sorted in

increasing order. In order to suppress the output of the

algorithm, a minimal length depending on k and |q| is

induced upon the MEMs.

3.4 Filtering the MEMs

The list of MEMs and their lists P of positions in S

can be large. Filtering the list of MEMs using fast

heuristics has the advantage of speeding up the main

loop in Algorithm 1, but it also allows to ﬁnd the ﬁrst

MEM (with the smallest q

) that is contained in an

optimal local alignment.

First the list of MEMs is traversed in order to ﬁnd

a match with maximal length, which is assumed to be

in the optimal alignment and is referred to as the an-

chor of the alignment. Searching for a longest match

allows some error on the choice of parameter k. Next,

the list is traversed again and ﬁltered, keeping only

MEMs that ‘have potential’ and that contain a posi-

tion p ∈ P that is ‘close to’ the anchor.

A MEM (q

,ℓ,P) is said to ‘have potential’ if the

following condition holds:

(ℓ ≥

|q|

) ∨ (|P| = 1) ∨ (ℓ = ℓ

′

where ℓ

′

is the length of the anchor. This means that

the heuristic allows long matches, matches that are

unique in S and matches with the same length of the

anchor. The last of these conditionsis important when

dealing with short reads, highly repetitive regions or

a wrong estimate of k.

The condition that decides whether a potential

match is added to the ﬁltered list, traverses the list P

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

104

and searches the ﬁrst position p such that the rela-

tive position of the current match to the anchor is

located within a window [−k,+k] of the position it

would have if there were no errors or differences in

the alignment.

Experiments show that the combination of the

above heuristics ﬁlters almost always all MEMs not

corresponding to an optimal alignment related to the

longest MEM, while keeping most of the MEMs that

do form an optimal alignment around the longest

MEM. We believe that repeating the ﬁltering step,

but using a different reference MEM instead of the

longest match, allows to ﬁnd other alignments.

3.5 Finding the ‘Closest’ MEM

The distance between two MEMs (q

,ℓ, p) and

′

,ℓ, p

′

) is deﬁned as follows. First, distances in q

and S are determined as qDist := q

+ ℓ − q

′

and

refDist := p + ℓ − p

′

. If both distances are positive,

which is the most common situation, the distance be-

tween the two MEMs is the maximum of the two dis-

tances qDist and refDist. If both are negative, result-

ing in overlapping MEMs, the resulting distance is

negative then the distance is deﬁned as the sum of the

positive distance and the absolute value of the nega-

tive distance.

If at least one of the differences is negative (qDist

or refDist), then the shortest distance between the

MEMs is a single insertion or deletion (according to

the chosen score function). The case of two positive

differences is explained in the next section.

Using this deﬁnition and the ordering on the list

of MEMs from the ﬁrst step, one can see that the

distance between the two MEMs in general only de-

scends at ﬁrst, reaches a minimum and then starts in-

creasing. This means that during the main loop of the

algorithm every MEM is checked at most twice.

3.6 Closing the Gaps

Gaps and mismatches can occur between the begin-

ning of the query and the ﬁrst MEM in the list, be-

tween two MEMs in the list and between the last

MEM and the end of the query.

The ﬁrst and last gaps are bridged by using a vari-

ant of the Needleman-Wunsch global alignment algo-

rithm, usually referred to as semi-global alignment.

The ﬁrst variant allows gaps at the beginning of S that

are not penalized and the second variant allows for

free gaps at the end of S.

In order to ﬁnd the gaps between two MEMs, the

distance and the differences from the previous para-

graph are used. First, the algorithm checks if a sin-

gle insertion, deletion or mutation ﬁlls the gap. Oth-

erwise, a global alignment is performed between the

right bounds of the ﬁrst MEM and the left bounds of

the second MEM.

The Needleman-Wunsch algorithm was always

executed with a score +2 for matches, −1 for mis-

matches and an afﬁne gap penalty with an opening

gap penalty of −2 and an extension penalty of −1.

Other scores can be used, resulting in a different

alignment, especially at the ends of the query.

3.7 Implementation and Complexity

We implemented the above algorithm in Java. An en-

hanced sufﬁx array (ESA) was used instead of a stan-

dard sufﬁx tree because an ESA has a lower memory

footprint, has a modular design and maintains the full

functionality of the sufﬁx tree. The ESA (Abouelhoda

et al., 2004) contains four arrays: the sufﬁx array, the

longest common preﬁx (lcp) table, the child table and

the sufﬁx link table. The sufﬁx array was constructed

using the difference cover algorithm of Karkkainen

and Sanders (K¨arkk¨ainen and Sanders, 2003). The

lcp table was constructed using the algorithm of Ka-

sai et al. (Kasai et al., 2001). The child table was con-

structed using the algorithms described in (Abouel-

hoda et al., 2004) and ﬁnally the sufﬁx link table was

constructed using Maaß algorithm (Maaß, 2007).

The memory footprint of the algorithm mainly de-

pends on the size of the ESA and storage of the refer-

ence sequences and query sequences. The implemen-

tation of the ESA without memory optimization takes

about 20 bytes per character,which results in an index

structure of about 100MB for a sequence like E. coli.

Since the query sequences are in general much shorter

than the reference sequence and MEMs are not too

abundant, the memory needed for storing the MEMs

is low.

The theoretical time complexity of the algorithm

depends on the construction of the index structure, the

search for MEMs, and the ﬁltering and chaining of the

MEMs using the Needleman-Wunsch algorithm. For

one query q of length m and a reference sequence S

of size n the asymptotic time complexity can be cal-

culated as follows. Constructing the ESA can be done

in O (n) time (Abouelhoda et al., 2004), while calcu-

lating the MEMs can be done in O (m) time (Gusﬁeld,

1997). Finding the positions P for every MEM can

be done in constant time using an ESA. Filtering the

MEMs can be done in O (

∑

P), where the sum runs

over the number of found MEMs, which is bounded

by m. In theory, if the lists P are large (meaning a

lot of repeats or small maximal exact matches), this

ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS

105

Table 1: Alignment of the simulated data using our algo-

rithm and

Bowtie

. The row ‘# of different alignments’ gives

the number of queries where our algorithm found an opti-

mal alignment that differed from the alignment suggested

by simulating the errors.

Our algorithm

Bowtie

time (sec.) 7.5 264

# of correct alignments 75381 24252

# of different alignments 1835

# of failed alignments 2150 55123

percentage correct 95% 31%

sum can become very large. However, in practice the

heuristics speed this up considerably. The main loop

of Algorithm 1 can be executed in O (m) time plus

the time used for the executions of the Needleman-

Wunsch algorithm. The worst case scenario is when

only one small MEM is found, which translates in a

time complexity of O ((m + 2k)

), which is the time

complexity of a Needleman-Wunsch alignment, when

the position of q in S is known. When many or long

MEMs are found and not ﬁltered out, the main loop is

very fast. Currently the bottleneck of the algorithm is

the calculation of the maximal exact matches.

4 RESULTS

4.1 Preliminary Results

We tested our algorithm on simulated BRCA1 data.

BRCA1 is a gene known to be involved in the devel-

opment of breast cancer when mutated (Friedenson,

2007) and is routineously screened in older women.

The diagnostic community is currently implementing

next-generation sequencing to carry out these screen-

ings, hence BRCA1 data was used to test and validate

the algorithm.

A reference sequence of 80kbp was used together

with a query ﬁle containing about 80000 queries rang-

ing in size from 8bp to 1021bp with an average length

of about 600bp. The total number of insertions, dele-

tions and mutations for one query was bounded by 4

errors of each type, while on average the total number

of errors was 2.5. The longest indels were 10bp long

and the average edit distance was 5.73.

The simulated queries were aligned with our al-

gorithm and

Bowtie

(Langmead et al., 2009), where

Bowtie

was executed with the default parameters.

The results of the alignments are shown in Table 1.

This includes the running time in seconds, as well as

the number of correct and failed alignments. Note that

in some cases our algorithm ﬁnds an alignment that is

different from what would be suggested by the sim-

ulation of the errors, but has the same edit distance;

these cases are also shown in the table.

The tests show that our algorithm is both faster

and more accurate than

Bowtie

. However, note that

Bowtie

was not optimized for performance, which

could explain its low speed. The large amount of

failed alignments for

Bowtie

can be explained by the

large length of the queries and the high edit distance

and/or large insertions and deletions.

The results show that our algorithm was able to

ﬁnd an optimal alignment for almost all the queries

(95% of them). The set of failed alignments con-

tains 129 queries with a larger edit distance and

2021 queries with a smaller edit distance (than the

alignment given by the simulation). Some of the

alignments with smaller edit distance are due to the

scoring in the Needleman-Wunsch algorithm. The

failed alignments are mostly caused by using wrong

MEMs in the alignment, which stem from too rigor-

ous ﬁltering of the MEMs or the greedily deﬁned dis-

tance between MEMs. However, even if part of the

alignment failed for a given query, the ‘true position’

of the query was found in all but 6 cases.

4.2 Current and Future Work

Currently we are testing our algorithm for datasets

with longer reads and more erroneous reads. We

also work on comparing its results with other long

read aligning programs, such as

BWA-SW

(Li and

Durbin, 2010),

MUMmer

(Kurtz et al., 2004) and

RazerS

(Weese et al., 2009).

In this version of the algorithm, only one possi-

ble best local alignment is returned. This is consistent

with the default option of many widely used align-

ment programs today. In future versions of the al-

gorithm, an option could be included allowing more

alignments to be returned. There can be some issues

with small reads and reference sequences containing a

lot of long repetitive regions, because a good ﬁrst an-

chor MEM cannot be found in those circumstances.

However, since the latter is not regularly screened by

sequencing, this present no big problems in practice.

When screening repeat regions, both the biology as

the informatics need some customization, which is

not the scope of this manuscript.

In spite of its linear memory usage, the index

structure used in the current implementation still has

a relatively large memory footprint (i.e. 20 bytes per

character). According to Abouelhoda et al. (Abouel-

hoda et al., 2004), the memory footprint can be scaled

down to 8 bytes per character. Moreover, the main

bottleneck in speed is ﬁnding the MEMs. Recently,

Khan et al. (Khan et al., 2009) used a sparse sufﬁx ar-

ray to compute MEMs. Also, many mapping progra-

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

106

ms use an FM-index instead of sufﬁx trees to improve

the memory requirements. In future versions, our al-

gorithm could implement a more memory-friendly in-

dex structure. A possibility to increase accuracy is

changing the anchors or seeds from MEMs to max-

imal unique matches as in

MUMmer

(Kurtz et al.,

2004) or using spaced or inexact matches.

5 CONCLUSIONS

We presented an algorithm for mapping large reads

to a reference genome which is fast and accurate in

ﬁnding an optimal local alignment. The algorithm is

easy to understand and has few parameters. Our ﬁrst

results show that the algorithm is able to map reads

with insertions, deletions and mutations and with a

length of several hundreds of base pairs successfully

to a reference sequence.

The algorithm is more accurate given longer

queries with lower error rates. The main algorithm

takes several parameters which can be tuned. These

are: the expected edit distance and the cost parameters

for the Needleman-Wunsch algorithm. The expected

edit distance can be approximated and allows some

deviation from the real edit distance. Better results

are obtained with an overestimate than with an un-

derestimate. The algorithm is devised in such a way

that an alignment is always found even if it is subop-

timal. Even if reads contain insertions, deletions or

mutations in the ﬁrst or last parts of the query, our

algorithm was able to ﬁnd an optimal alignment.

ACKNOWLEDGEMENTS

The research of MV is funded by a grant from the

N2N-MRP at Ghent University.

REFERENCES

Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. (2004). Re-

placing sufﬁx trees with enhanced sufﬁx arrays. Jour-

nal of Discrete Algorithms, 2:53–86.

Bray, N., Dubchak, I., and Patcher, L. (2003). AVID: a

global alignment program. Genome Research, 13:97–

102.

Friedenson, B. (2007). The BRCA1/2 pathway prevents

hematologic cancers in addition to breast and ovarian

cancers. BMC Cancer, 7:152.

Gusﬁeld, D. (1997). Algorithms on strings, trees, and se-

quences. Cambridge university press, 32 Avenue of

the Americas, New York, NY 10013-2473, USA, 11th

edition.

Hoffmann, S., Otto, C., Kurtz, S., Sharma, C., Khaitovich,

P., Vogel, J., Stadler, P., and Hackerm¨uller, J. (2009).

Fast mapping of short sequences with mismatches, in-

sertions and deletions using index structures. PLoS

Computational Biology, 9:e1000502.

K¨arkk¨ainen, J. and Sanders, P. (2003). Simple linear

work sufﬁx array construction. In Proceedings of

the 30th International Conference on Automata Lan-

guages and Programming, volume 2719 of Lecture

Notes in Computer Science, pages 943–955. Springer-

Verlag.

Kasai, T., Lee, G., Arimura, H., Arikawa, S., and Park, K.

(2001). Linear-time longest-common-preﬁx computa-

tion in sufﬁx arrays and its applications. In Proceed-

ings of the 12th Symposium on Combinatorial Pattern

Matching (CPM 01), volume 2089 of Lecture Notes in

Computer Science, pages 181–192. Springer-Verlag.

Khan, Z., Bloom, J., Kruglyak, L., and Singh, M. (2009). A

practical algorithm for ﬁnding maximal exact matches

in large sequence datasets using sparse sufﬁx arrays.

Bioinformatics, 13:1609–1616.

Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway,

M., Antonescu, C., and Salzberg, S. (2004). Versa-

tile and open software for comparing large genomes.

Genome Biology, 5:R12.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.

(2009). Ultrafast and memory-efﬁcient alignment of

short DNA sequences to the human genome. Genome

Biology, 10:R25.

Li, H. and Durbin, R. (2009). Fast and accurate short read

alignment with Burrows-Wheeler transform. Bioin-

formatics, 25:1754–1760.

Li, H. and Durbin, R. (2010). Fast and accurate long read

alignment with Burrows-Wheeler transform. Bioin-

formatics, 5:589–595.

Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008). SOAP:

short oligonucleotide alignment program. Bioinfor-

matics, 24:713–714.

Maaß, M. (2007). Computing sufﬁx links for sufﬁx trees

and arrays. Information Processing Letters, 101:250–

254.

Needleman, S. B. and Wunsch, C. D. (1970). A gen-

eral method applicable to the search for similarities

in the amino acid sequence of two proteins. Journal

of Molecular Biology, 48(3):443–453.

Weese, D., Emde, A.-K., Rausch, T., D¨oring, A., and Rein-

ert, K. (2009). RazerS – fast read mapping with sen-

sitivity control. Genome Research, 19:1646–1654.

ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS

107