more difficult to map and the implementation could
still be improved to obtain higher overall performance
and a lower memory footprint.
Within a candidate region, a gapped alignment is
first built using a chain of the seeds found within
this region. The greedy chaining algorithm is currently
the major source of misalignments and could
be replaced by an optimal collinear chaining algorithm
(Abouelhoda, 2007). Other causes of misalignments
include failure to detect splice sites at the ends
of reads and failure to differentiate two consecutive
introns separated by an exon smaller than the minimum
seed length.
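To illustrate what an optimal collinear chain looks like, the sketch below selects the highest-scoring set of seeds that appear in the same order in both the read and the reference. It is a simple quadratic-time dynamic program, not the algorithm of Abouelhoda (2007) and not the chaining code of mesalina. The seed representation `(ref_start, read_start, length)`, the function name, and the scoring by total seed length are assumptions made for this example, and gap or intron penalties are ignored.

```python
from typing import List, Tuple

# Hypothetical seed representation: (ref_start, read_start, length).
Seed = Tuple[int, int, int]


def best_collinear_chain(seeds: List[Seed]) -> List[Seed]:
    """Return a chain of non-overlapping, collinear seeds with maximal total length."""
    if not seeds:
        return []
    # Sort by read position so that every possible predecessor of a seed
    # appears before it in the list.
    ordered = sorted(seeds, key=lambda s: (s[1], s[0]))
    n = len(ordered)
    score = [s[2] for s in ordered]  # best chain score ending at each seed
    prev = [-1] * n                  # back-pointers used to recover the chain
    for j in range(n):
        ref_j, read_j, len_j = ordered[j]
        for i in range(j):
            ref_i, read_i, len_i = ordered[i]
            # Seed i may precede seed j only if it ends before j starts
            # in both the read and the reference.
            precedes = read_i + len_i <= read_j and ref_i + len_i <= ref_j
            if precedes and score[i] + len_j > score[j]:
                score[j] = score[i] + len_j
                prev[j] = i
    # Trace back from the best-scoring end point.
    end = max(range(n), key=lambda k: score[k])
    chain = []
    while end != -1:
        chain.append(ordered[end])
        end = prev[end]
    chain.reverse()
    return chain


seeds = [(100, 0, 20), (5000, 10, 30), (1150, 25, 25), (1200, 55, 20)]
print(best_collinear_chain(seeds))
```

In this toy input the longest single seed, `(5000, 10, 30)`, is not part of the best chain because it conflicts with the other three seeds, which together cover more of the read. A practical implementation would also penalise gaps and bound intron lengths.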
The run time could be further decreased by select-
ing good settings for parameters, such as minimum
seed length, but also by, for example, using a bit-
parallel dynamic programming implementation in the
extension stage. The memory footprint of the index
could further be reduced by bit-encoding the refer-
ence sequence.
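The bit-encoding idea mentioned above can be sketched as follows: each nucleotide is stored in two bits, so four bases fit in one byte instead of one base per byte. This is a minimal illustration only. The function names and the bit layout are assumptions, and the sketch assumes the sequence contains only A, C, G and T; ambiguity codes such as N would need separate handling.

```python
# Assumed 2-bit codes for the four nucleotides.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytearray:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Base i lives in byte i // 4, at bit offset 2 * (i % 4).
        packed[i >> 2] |= CODE[base] << ((i & 3) * 2)
    return packed


def base_at(packed: bytearray, i: int) -> str:
    """Random access to base i without unpacking the whole sequence."""
    return BASES[(packed[i >> 2] >> ((i & 3) * 2)) & 3]


def unpack(packed: bytearray, length: int) -> str:
    """Recover the original string; the length must be stored separately."""
    return "".join(base_at(packed, i) for i in range(length))


reference = "ACGTTGCAAC"
packed = pack(reference)
print(len(reference), "bases stored in", len(packed), "bytes")
print(unpack(packed, len(reference)) == reference)
```

The trade-off is a small amount of shifting and masking on every access in exchange for a roughly fourfold reduction in the memory used by the reference text itself.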
In addition to algorithmic improvements, more
rigorous tests need to be performed on large and var-
ied data sets and experimental results need to be com-
pared to a larger set of spliced aligners, using different
parameter settings.
Finally, the current implementation of mesalina
still lacks some of the features other spliced align-
ers support, including specific algorithms for the de-
tection of micro-exons and alternative splicing, and
paired-end read mapping. We also acknowledge the
need for clear and intuitive command line options and
good portability of the tool.
ACKNOWLEDGEMENTS
The work of MV is supported by the Agency for In-
novation by Science and Technology of the Flemish
government [contract SB-101609]. All authors ac-
knowledge the support of Ghent University: MRP
Bioinformatics: from nucleotides to networks (N2N).
REFERENCES
Abouelhoda, M. (2007). A chaining algorithm for
mapping cDNA sequences to multiple genomic se-
quences. In SPIRE07, 14th international confer-
ence on String Processing and Information Retrieval.
Springer-Verlag.
Abouelhoda, M., Kurtz, S., and Ohlebusch, E. (2004). Re-
placing suffix trees with enhanced suffix arrays. Jour-
nal of Discrete Algorithms, 2:53–86.
Au, K., Jiang, H., Lin, L., Xing, Y., and Wong, W. (2010).
Detection of splice junctions from paired-end RNA-
Seq data by SpliceMap. Nucleic Acids Research,
38:4570–4578.
De Bona, F., Ossowski, S., Schneeberger, K., and Rätsch, G.
(2008). Optimal spliced alignments of short sequence
reads. BMC Bioinformatics, 9:i170–i180.
Dobin, A., Davis, C., Schlesinger, F., Drenkow, J., Zaleski,
C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T.
(2013). STAR: ultrafast universal RNA-Seq aligner.
Bioinformatics, 29:15–21.
Garber, M., Grabherr, M., Guttman, M., and Trapnell, C.
(2011). Computational methods for transcriptome annotation
and quantification using RNA-Seq. Nature Methods,
8:469–477.
Hoffmann, S., Otto, C., Kurtz, S., Sharma, C., Khaitovich,
P., Vogel, J., Stadler, P., and Hackermüller, J. (2009).
Fast mapping of short sequences with mismatches, in-
sertions and deletions using index structures. PLoS
Computational Biology, 5:e1000502.
Huang, S., Zhang, J., Li, R., Zhang, W., He, Z., Lam, T.,
Peng, Z., and Yiu, S. (2011). SOAPsplice: genome-wide
ab initio detection of splice junctions from RNA-Seq
data. Frontiers in Genetics, 2.
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R.,
and Salzberg, S. (2013). TopHat2: accurate alignment
of transcriptomes in the presence of insertions, dele-
tions and gene fusions. Genome Biology, 14:R36.
Li, W. (2012). RNASeqReadSimulator:
A Simple RNA-Seq Read Simulator.
http://alumni.cs.ucr.edu/ liw/rnaseqreadsimulator.html.
Liu, Y. and Schmidt, B. (2012). Long read alignment
based on maximal exact match seeds. Bioinformatics,
28:i318–i324.
Manber, U. and Myers, G. (1993). Suffix arrays: a new
method for on-line string searches. SIAM Journal on
Computing, 22:935–948.
Roberts, R., Carneiro, M., and Schatz, M. (2013). The
advantages of SMRT sequencing. Genome Biology,
14:405.
Trapnell, C., Pachter, L., and Salzberg, S. (2009). TopHat:
discovering splice junctions with RNA-Seq. Bioinfor-
matics, 25:1105–1111.
Vyverman, M., De Baets, B., Fack, V., and Dawyndt, P.
(2012). Prospects and limitations of full-text index
structures in genome analysis. Nucleic Acids Re-
search, 40:6993–7015.
Vyverman, M., De Baets, B., Fack, V., and Dawyndt, P.
(2013). essaMEM: finding maximal exact matches
using enhanced sparse suffix arrays. Bioinformatics,
29:802–804.
Wang, K., Singh, D., Zeng, Z., Coleman, S., Huang, Y.,
Savich, G., He, X., Mieczkowski, P., Grimm, S., and
Perou, C. (2010). MapSplice: accurate mapping of
RNA-Seq reads for splice junction discovery. Nucleic
Acids Research, 38:e178–e178.
Wu, T. and Nacu, S. (2010). Fast and SNP-tolerant detec-
tion of complex variants and splicing in short reads.
Bioinformatics, 26:873–881.
Wu, T. and Watanabe, C. (2005). GMAP: a genomic map-
ping and alignment program for mRNA and EST se-
quences. Bioinformatics, 21:1859–1875.
BIOINFORMATICS 2014 - International Conference on Bioinformatics Models, Methods and Algorithms