The local alignment problem (LAP) is the search for
a pair of subsequences, one from S
1
and the other
from S
2
, whose alignment achieves maximum score.
LAP is a well-posed optimization problem and
the
Smith-Waterman (SW) algorithm (
Smith, Waterman,
1981) solves LAP exactly, in O(|S
1
||S
2
|) time and
space. However, because of the shear length of
genome sequences and the ever-increasing number
of reference sequences available for comparisons,
the heuristic method called Basic Local Alignment
Sequence Tool (BLAST) (
Altschul, 1997), is
preferred by practitioners. BLAST solves LAP
approximately, by finding high scoring pairs (HSP)
instead of alignments. A HSP is the extension of a
word hit, which is the ungapped alignment of the
sequence with a short word whose score is greater
than or equal to a user defined threshold. Word hits
are extended with a local sequence alignment
method, such as Smith-Waterman, up until their
scores drop below another user-defined threshold.
Although BLAST returns the results much faster
than Smith-Waterman, it is still deemed too slow for
most post-genomic era processing demands.
Mapping assembly is an important post-genomic
instance of this demand. Post BLAST tools are
designed for rapidly aligning a sequence to an entire
genome, on a desktop computer. One such tool is
MUMmer (Delcher et. al., 2002). MUMmer’s speed
rests on efficient suffix tree representations of the
sequences. Suffix trees identify perfect matches,
which are more restrictive than ungapped
alignments, in linear time and space. MUMmer
aligns short reads to a genome using a specialized
routine called NUCmer. As in BLAST, NUCmer
approximate solutions are extensions of exact
matches produced with gapped or ungapped
alignment methods. In general both BLAST and
NUCmer, explore a subspace of the alignment space,
and therefore, often return a suboptimal alignment.
The tradeoff between speed and exactness is
highly sensitive when it comes to mapping
assembly. Different approximate alignments often
result in completely unrelated mapping (Li, Homer,
2010). The advantages of an exact algorithmic
solution became apparent soon after the introduction
of BLAST. Comparative studies (Shpaer et. al.,
1996) report a significantly lower number of false
positives and negatives in SW responses. Also,
several experiments have reportedly shown a
significant higher risk for BLAST to miss a
sequence alignment that is detectable by Smith-
Waterman. Approximate local alignment solutions
(Phillippy et. al., 2008) create a need for long and
exhaustive post-assembly processing (
Rahman,
2013). This motivates the exploration of accurate
mapping assembly algorithms that are time and
space efficient, as well.
This article examines some mathematical
principles behind the design of a parallel method
based on Smith-Waterman and Hirschberg’s longest
common subsequence algorithm (Hirschberg, 1975).
The idea of combining a sequence alignment method
and Hirschberg’s algorithm is not new.
Combinations of Hirschberg and Needleman-
Wunsch, a global alignment method that preceded
Smith-Waterman, have been reported without in-
depth discussions of the mathematics of their design.
The algorithms that resulted from this combination
are proved to save memory space to the cost of a
slight increase in computation time. The author is
not aware of specific reports on combinations of
Hirschberg and Smith-Waterman.
The parallel algorithm, whose principles are
discussed here, uses Hirschberg’s division phase to
partition the genome into string segments that are
distributed over a set of processors. Each processor
has a copy of the read strings. The optimal
alignments of a read and the genome segments are
computed in parallel, with Smith-Waterman. The
method compares each of the processor’s results and
returns the alignment with the maximum score.
Although this idea is similar to the one inspired by a
combination of Needleman-Wunsch and Hirschberg,
the actual design of the Smith-Waterman/Hirschberg
algorithm has at least two important differences. The
first difference is that, being a global alignment;
Needleman-Wunsch has to be recomputed at each
step of Hirschberg’s division phase. Otherwise, the
global alignment may not be retrievable from the
segments. As shown in Corollary 2, the division
phase of the Smith-Waterman/Hirschberg
combination does not require Smith-Waterman
computations, at least up to a certain depth in the
division tree. The second difference is in the conquer
phase. Unlike the Needleman-Wunsch/Hirschberg’s
conquer phase, which is just the concatenation of the
alignments that solve each sub problem; Smith-
Waterman/Hirschberg’s conquer phase does require
some extra processing. This is due to the fact that the
best local alignment might correspond to the
alignment of a read over two contiguous genome
segments. Most of the theory developed in this
article concerns the reconstruction of the local
alignment of a read with a genome from its
alignments with a pair of contiguous genome
segments.
The mathematical concepts and facts discussed
here come from observations made in the design of
BIOINFORMATICS2014-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
222