2.1 The Overlap Phase
The overlap phase is based on the AMASS algorithm
(Kim, 1997). Kim used a multipattern matching
algorithm to efficiently find possible pairwise overlaps
between fragments.
PMSGA begins by randomly selecting probes of
constant length from the fragments and their reverse
complements. A faster multipattern matching
algorithm, the BG algorithm (Salmela et al., 2006),
is then used to find the occurrences of these probes
in the fragments.
The BG algorithm is based on the BNDM algorithm
(Navarro & Raffinot, 2000) for a single pattern.
The idea is to construct a generalized pattern that
represents a group of patterns. For example, the group
of patterns acgt, aacc, and gttt can be represented
by the generalized pattern [a,g][a,c,t][c,g,t][c,t]. In
the overlap phase, PMSGA applies the BG algorithm to
a generalized pattern built from overlapping k-mers
rather than single characters of the patterns. If
k = 2 in the example above, the corresponding
generalized pattern is [ac,aa,gt][cg,ac,tt][gt,cc,tt].
Each occurrence of the generalized pattern is a
candidate for a real match: BNDM works as a filter, and the
candidate matches are checked by an exact method. In
practice, the BG algorithm is very efficient (Salmela
et al., 2006).
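The construction of the generalized pattern can be illustrated with a short Python sketch. It is a naive set-based version for illustration only, not the bit-parallel BNDM search that the BG algorithm relies on. The function names (kmers, generalized_pattern, is_candidate) are hypothetical, and all patterns are assumed to have the same length.

```python
def kmers(pattern, k):
    """Return the overlapping k-mers of a pattern, from left to right."""
    return [pattern[i:i + k] for i in range(len(pattern) - k + 1)]


def generalized_pattern(patterns, k=1):
    """Build one class (set) of allowed k-mers per position.

    Assumption: all patterns have the same length, as the probes do.
    """
    columns = zip(*(kmers(p, k) for p in patterns))
    return [set(column) for column in columns]


def is_candidate(window, gpattern, k=1):
    """True if each k-mer of the window belongs to the class at its position."""
    window_kmers = kmers(window, k)
    if len(window_kmers) != len(gpattern):
        return False
    return all(q in allowed for q, allowed in zip(window_kmers, gpattern))


probes = ["acgt", "aacc", "gttt"]
g1 = generalized_pattern(probes)        # classes of single characters
g2 = generalized_pattern(probes, k=2)   # classes of overlapping 2-mers

# "aagt" is not a probe, yet it passes the character-level filter;
# the 2-mer classes reject it.
passes_k1 = is_candidate("aagt", g1)
passes_k2 = is_candidate("aagt", g2, k=2)
```

In this sketch the string aagt is a false candidate at the character level, which shows why every candidate still has to be verified exactly, and why classes of k-mers give a sharper filter.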
To deal with small repeats, probes with too
many occurrences are filtered out. However, if
enough probes have roughly the same number of
occurrences, they are retained. This preserves large
repeats while filtering out small repeats that might cause
false overlaps between fragments.
The probe length $m$ is obtained by solving
$m = \ln F / \ln(1 - \varepsilon)$, where $F$ is the predefined average
probability of probe occurrence and $\varepsilon$ is the average
error rate. In this work, the probe length varies
between 10 and 30 base pairs, and $F = 0.4$ is used.
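The probe length formula can be evaluated directly. One way to read it, assuming independent errors, is that a probe of length m survives without error with probability (1 − ε)^m, and m is chosen so that this probability equals F. The sketch below uses a hypothetical function name and a hypothetical error rate of 0.05; rounding the result to a whole number of bases is left to the caller.

```python
import math


def probe_length(F, eps):
    """Solve (1 - eps)**m = F for m, i.e. m = ln F / ln(1 - eps)."""
    if not (0.0 < F < 1.0 and 0.0 < eps < 1.0):
        raise ValueError("F and eps must lie strictly between 0 and 1")
    return math.log(F) / math.log(1.0 - eps)


# F = 0.4 is the value used in the text; eps = 0.05 is an assumed example.
m = probe_length(F=0.4, eps=0.05)
```

With these inputs m comes out a little below 18, which falls inside the 10 to 30 base pair range quoted in the text.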
The overlap between a pair of fragments i and
j is computed as follows. The common probes of
these fragments are found using the BG algorithm.
Let probes a and b occur in i and j at positions
$(\mathrm{pos}_{a,i}, \mathrm{pos}_{a,j})$ and $(\mathrm{pos}_{b,i}, \mathrm{pos}_{b,j})$. Now, a and b are
said to be a consistent pair if
\[ \left| (\mathrm{pos}_{b,i} - \mathrm{pos}_{a,i}) - (\mathrm{pos}_{b,j} - \mathrm{pos}_{a,j}) \right| < \sigma \]
for a small threshold value $\sigma$. This threshold is needed
to deal with insertion and deletion errors in the
fragments. A set of common probe occurrences is a
consistent set if each consecutive pair of probe
occurrences in the set is a consistent pair.
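The consistency test can be sketched in a few lines of Python. Each probe hit is represented here as a tuple (position in fragment i, position in fragment j); this representation, the function names, and the choice to order hits by their position in fragment i are assumptions made for illustration.

```python
def is_consistent_pair(hit_a, hit_b, sigma):
    """Check |(pos_b_i - pos_a_i) - (pos_b_j - pos_a_j)| < sigma."""
    (a_i, a_j), (b_i, b_j) = hit_a, hit_b
    return abs((b_i - a_i) - (b_j - a_j)) < sigma


def is_consistent_set(hits, sigma):
    """True if every consecutive pair of hits is a consistent pair."""
    ordered = sorted(hits)  # assumption: order hits by position in fragment i
    return all(
        is_consistent_pair(x, y, sigma)
        for x, y in zip(ordered, ordered[1:])
    )


# Hypothetical hits: spacing differs by one base between the fragments,
# as a single insertion or deletion would cause.
good_hits = [(10, 110), (40, 141), (90, 190)]
bad_hits = [(10, 110), (40, 141), (90, 250)]
```

With a threshold of 3, the first list forms a consistent set, while the last hit of the second list is shifted by far more than an indel error would explain.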
For each consistent set $S_i$ that can be constructed
from the probe hits in the overlapping area, a score
is calculated as $\mathrm{Score}(S_i) = N_d W + N_m$, where $N_d$ is
the number of disjoint probe sets within the consistent
set, $W$ is the distance from the first probe in the set to
the last one, and $N_m$ is the total number of probes in
the set. The consistent set with the largest score is
chosen to represent the overlap, and the length of the
overlap is calculated from the selected set.
The quality of the detected overlaps is checked as
follows. Let a k-mer be a sequence of k
contiguous base pairs. Given a fragment a, the occurrences
of all possible k-mers in the overlapping area are
determined. A vector $V_a$ of length $4^k$ is constructed,
whose elements give the number of occurrences of each
k-mer. Similarly, a vector $V_b$ is constructed for
fragment b.
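A k-mer count vector of this kind can be built as follows. This is a minimal sketch: the function name, the lexicographic ordering of k-mers over the alphabet acgt, and the decision to skip k-mers containing other symbols are all assumptions, not details given in the text.

```python
from itertools import product


def kmer_vector(seq, k, alphabet="acgt"):
    """Count the occurrences of every possible k-mer in seq.

    The result has len(alphabet)**k entries (4**k for DNA), indexed in
    lexicographic order of the k-mers.
    """
    index = {"".join(p): n for n, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # assumption: ignore k-mers with ambiguous bases
            counts[index[kmer]] += 1
    return counts


# Overlapping area "acgtac" contains the 2-mers ac, cg, gt, ta, ac.
v = kmer_vector("acgtac", 2)
```

For long overlaps and larger k, most entries are zero, so a dictionary of counts would be a natural alternative to a dense vector.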
The average error probabilities for the overlapping
area of fragments a and b, denoted by $p_a$ and $p_b$, are
also calculated. To compute these probabilities, the
PHRED quality scores (Ewing & Green, 1998) of
each fragment are used.
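A PHRED quality score Q corresponds to a base-calling error probability of 10^(−Q/10). The sketch below converts scores to probabilities and averages them over the overlapping area. Taking the arithmetic mean of the per-base probabilities is an assumption, since the text does not say how the average is formed, and the function names and the half-open interval convention are hypothetical.

```python
def phred_to_error_probability(q):
    """PHRED definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def mean_error_probability(quals, start, end):
    """Average error probability over quals[start:end].

    Assumption: the overlapping area is the half-open interval [start, end)
    and the average is the arithmetic mean of the per-base probabilities.
    """
    region = quals[start:end]
    if not region:
        raise ValueError("empty overlapping area")
    return sum(phred_to_error_probability(q) for q in region) / len(region)


# Hypothetical quality values; the overlap covers the first three bases.
p_a = mean_error_probability([10, 20, 30, 40], 0, 3)
```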
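The number of k-mers that occur in only one fragment
is calculated as
\[ N_{\mathrm{miss}} = \sum_{i=1}^{4^k} \left| V_a - V_b \right|_i \tag{1} \]
where $\left| V_a - V_b \right|_i$ is the absolute difference of the
i-th elements of the two vectors.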
The total number of k-mers, $N_{\mathrm{tot}}$, is given by
$N_{\mathrm{tot}} = 2(L - k)$, where $L$ is the length of the overlap being
considered. The number of common k-mers is given
by $N_{\mathrm{hit}} = N_{\mathrm{tot}} - N_{\mathrm{miss}}$.
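The three counts can be computed from two k-mer count vectors as sketched below. The sketch reads Eq. (1) as a sum of element-wise absolute differences and takes the expression for the total number of k-mers as stated in the text; the function name and the example vectors are hypothetical.

```python
def kmer_overlap_counts(v_a, v_b, overlap_length, k):
    """Return (n_miss, n_tot, n_hit) for two k-mer count vectors."""
    if len(v_a) != len(v_b):
        raise ValueError("count vectors must have the same length")
    # Eq. (1), read as the sum of element-wise absolute differences.
    n_miss = sum(abs(x - y) for x, y in zip(v_a, v_b))
    # Total number of k-mers as given in the text: 2(L - k).
    n_tot = 2 * (overlap_length - k)
    n_hit = n_tot - n_miss
    return n_miss, n_tot, n_hit


# Hypothetical vectors for k = 1 (order a, c, g, t) and an overlap of length 6.
n_miss, n_tot, n_hit = kmer_overlap_counts([3, 1, 1, 1], [2, 2, 1, 1], 6, 1)
```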
The probability of a sequencing error in a k-mer is
given by $p_m = 1 - ((1 - p_a)(1 - p_b))^k$. The probability
of observing $N_{\mathrm{miss}}$ with the given error probability is
given by
\[ P(X \ge N_{\mathrm{miss}}) = \sum_{i=N_{\mathrm{miss}}}^{N_{\mathrm{tot}}} \binom{N_{\mathrm{tot}}}{i} \, p_m^{\,i} \, (1 - p_m)^{N_{\mathrm{tot}} - i} \tag{2} \]
A threshold $\xi$ is set to discard all overlaps for which
\[ P(X \ge N_{\mathrm{miss}}) \le \xi \tag{3} \]
Note that $P(X \ge N_{\mathrm{miss}})$ is the (regularized) incomplete beta
function $I_{p_m}(N_{\mathrm{miss}}, N_{\mathrm{hit}} + 1)$, which can be efficiently
approximated.
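The overlap test can be sketched with the binomial tail evaluated by direct summation. This is an illustration under stated assumptions rather than the implementation used in PMSGA: the function names are hypothetical, the example probabilities and threshold are made up, and the plain sum is only suitable for moderate values of the total k-mer count, since very large binomial coefficients no longer fit into floating point.

```python
import math


def kmer_error_probability(p_a, p_b, k):
    """p_m = 1 - ((1 - p_a)(1 - p_b))**k."""
    return 1.0 - ((1.0 - p_a) * (1.0 - p_b)) ** k


def binomial_upper_tail(n_miss, n_tot, p_m):
    """P(X >= n_miss) for X ~ Binomial(n_tot, p_m), by direct summation."""
    return sum(
        math.comb(n_tot, i) * p_m ** i * (1.0 - p_m) ** (n_tot - i)
        for i in range(n_miss, n_tot + 1)
    )


def keep_overlap(n_miss, n_tot, p_a, p_b, k, xi):
    """Keep the overlap unless P(X >= n_miss) <= xi."""
    p_m = kmer_error_probability(p_a, p_b, k)
    return binomial_upper_tail(n_miss, n_tot, p_m) > xi


# Hypothetical values: 1% average error in both fragments, k = 2.
p_m = kmer_error_probability(0.01, 0.01, 2)
plausible = keep_overlap(1, 20, 0.01, 0.01, 2, xi=1e-6)
implausible = keep_overlap(20, 20, 0.01, 0.01, 2, xi=1e-6)
```

For long overlaps, a library routine for the regularized incomplete beta function, such as scipy.special.betainc, would be the more robust choice, in line with the remark above.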
To validate overlaps, PMSGA counts the distinct
k-mers rather than counting errors as
Kececioglu and Myers (1995) did. Note that increasing
the value of k places greater emphasis on the correct
order of the matching substrings within the overlap,
at the expense of a smaller number of common
k-mers. Fragments that are entirely contained in other
fragments are removed at this point and are used again
in the final consensus phase.
2.2 The Layout Phase
The layout phase of PMSGA is based on string graphs
(Myers, 2005). The idea is to construct a bidirected
BIOINFORMATICS 2010 - International Conference on Bioinformatics