ms use an FM-index instead of suffix trees to reduce
the memory requirements. In future versions, our algorithm
could implement a more memory-friendly index structure.
A possibility to increase accuracy is changing the anchors
or seeds from MEMs to maximal unique matches, as in MUMmer
(Kurtz et al., 2004), or using spaced or inexact matches.
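The difference between the two kinds of anchors can be illustrated with a minimal sketch. The code below is a brute-force illustration only, not the enhanced suffix array search used by our algorithm; the function names and the `min_len` parameter are hypothetical. A match is reported as a triple (reference position, query position, length), and a maximal unique match is taken to be a maximal exact match whose substring occurs exactly once in both sequences.

```python
def maximal_exact_matches(ref, query, min_len):
    """Brute-force MEMs: matches that cannot be extended to the left or right."""
    mems = []
    for i in range(len(ref)):
        for j in range(len(query)):
            if ref[i] != query[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and ref[i - 1] == query[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(ref) and j + length < len(query)
                   and ref[i + length] == query[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def maximal_unique_matches(ref, query, min_len):
    """Keep only MEMs whose substring is unique in both sequences."""
    return [(i, j, length)
            for (i, j, length) in maximal_exact_matches(ref, query, min_len)
            if count_occurrences(ref, ref[i:i + length]) == 1
            and count_occurrences(query, query[j:j + length]) == 1]


ref, query = "ACGTACGTTT", "ACGTT"
print(maximal_exact_matches(ref, query, 3))   # [(0, 0, 4), (4, 0, 5)]
print(maximal_unique_matches(ref, query, 3))  # [(4, 0, 5)]
```

In this toy example the repeat ACGT yields two candidate anchors as MEMs, whereas the uniqueness filter keeps only the anchor that covers the whole query.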
5 CONCLUSIONS
We presented an algorithm for mapping long reads
to a reference genome that is fast and accurate in
finding an optimal local alignment. The algorithm is
easy to understand and has few parameters. Our first
results show that the algorithm can successfully map
reads several hundred base pairs long, containing
insertions, deletions and mutations, to a reference
sequence.
The algorithm is more accurate given longer
queries with lower error rates. The main algorithm
has two kinds of tunable parameters: the expected edit
distance and the cost parameters for the
Needleman-Wunsch algorithm. The expected edit distance
need only be approximate, since some deviation from the
real edit distance is tolerated. Better results are
obtained with an overestimate than with an
underestimate. The algorithm is devised in such a way
that an alignment is always found, even if it is
suboptimal. Even when reads contained insertions,
deletions or mutations in the first or last parts of
the query, our algorithm was able to find an optimal
alignment.
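As an illustration of how these two kinds of parameters can interact, the following minimal sketch computes a Needleman-Wunsch global alignment score with configurable costs and an optional band. The default cost values and the use of a band width as a stand-in for the expected edit distance are assumptions made for this example; they are not necessarily how our implementation uses the expected edit distance.

```python
def needleman_wunsch_score(a, b, match=0, mismatch=-1, gap=-1, band=None):
    """Global alignment score of a and b.

    If band is given, only cells with |i - j| <= band are filled
    (hypothetical stand-in for an expected edit distance). Returns
    -inf when no alignment fits inside the band.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    if band is None:
        band = max(n, m)  # no restriction: full dynamic programming matrix
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [neg_inf] * (m + 1)
    for j in range(min(m, band) + 1):
        prev[j] = j * gap
    for i in range(1, n + 1):
        cur = [neg_inf] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]


a, b = "AACCGGTT", "ACCGGTTA"  # b is a with one deletion and one insertion
print(needleman_wunsch_score(a, b))          # -2 (unrestricted)
print(needleman_wunsch_score(a, b, band=1))  # -2 (band wide enough)
print(needleman_wunsch_score(a, b, band=0))  # -4 (band too narrow)
```

With unit costs the score is the negated edit distance. Under the assumptions of this sketch, a band that is too narrow forces a worse alignment, whereas a band that is wider than necessary only costs additional computation, which is in line with the observation that an overestimate is preferable to an underestimate.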
ACKNOWLEDGEMENTS
The research of MV is funded by a grant from the
N2N-MRP at Ghent University.
REFERENCES
Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. (2004).
Replacing suffix trees with enhanced suffix arrays.
Journal of Discrete Algorithms, 2:53–86.
Bray, N., Dubchak, I., and Pachter, L. (2003). AVID: a
global alignment program. Genome Research, 13:97–102.
Friedenson, B. (2007). The BRCA1/2 pathway prevents
hematologic cancers in addition to breast and ovarian
cancers. BMC Cancer, 7:152.
Gusfield, D. (1997). Algorithms on strings, trees, and
sequences. Cambridge University Press, 32 Avenue of
the Americas, New York, NY 10013-2473, USA, 11th
edition.
Hoffmann, S., Otto, C., Kurtz, S., Sharma, C., Khaitovich,
P., Vogel, J., Stadler, P., and Hackermüller, J. (2009).
Fast mapping of short sequences with mismatches,
insertions and deletions using index structures. PLoS
Computational Biology, 5(9):e1000502.
Kärkkäinen, J. and Sanders, P. (2003). Simple linear
work suffix array construction. In Proceedings of
the 30th International Conference on Automata,
Languages and Programming, volume 2719 of Lecture
Notes in Computer Science, pages 943–955.
Springer-Verlag.
Kasai, T., Lee, G., Arimura, H., Arikawa, S., and Park, K.
(2001). Linear-time longest-common-prefix computation
in suffix arrays and its applications. In Proceedings
of the 12th Symposium on Combinatorial Pattern
Matching (CPM 01), volume 2089 of Lecture Notes in
Computer Science, pages 181–192. Springer-Verlag.
Khan, Z., Bloom, J., Kruglyak, L., and Singh, M. (2009). A
practical algorithm for finding maximal exact matches
in large sequence datasets using sparse suffix arrays.
Bioinformatics, 25(13):1609–1616.
Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway,
M., Antonescu, C., and Salzberg, S. (2004). Versatile
and open software for comparing large genomes.
Genome Biology, 5:R12.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.
(2009). Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome
Biology, 10:R25.
Li, H. and Durbin, R. (2009). Fast and accurate short read
alignment with Burrows-Wheeler transform.
Bioinformatics, 25:1754–1760.
Li, H. and Durbin, R. (2010). Fast and accurate long read
alignment with Burrows-Wheeler transform.
Bioinformatics, 26(5):589–595.
Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008). SOAP:
short oligonucleotide alignment program.
Bioinformatics, 24:713–714.
Maaß, M. (2007). Computing suffix links for suffix trees
and arrays. Information Processing Letters,
101:250–254.
Needleman, S. B. and Wunsch, C. D. (1970). A general
method applicable to the search for similarities
in the amino acid sequence of two proteins. Journal
of Molecular Biology, 48(3):443–453.
Weese, D., Emde, A.-K., Rausch, T., Döring, A., and
Reinert, K. (2009). RazerS – fast read mapping with
sensitivity control. Genome Research, 19:1646–1654.
ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS