in BW T into the table C. It is easy to compute C as an
exclusive prefix sum of the last row of Occ. For any
single character at position i in a pattern W , the inter-
val of rows in the BWM starting with this character is
easily computed from C:
{k, l} = {C[W [i]],C[W [i] + 1]} (3)
From this initial interval, it is possible to extend the
search backward from the starting position using the
following recursive procedure:
E x a c tRe c u r (W, i , k , l )
i f i < 0 t h e n
r e t u r n
{
k, l
}
b ← W[i]
k ← C(b) + Occ(b, k − 1) + 1
l ← C(b) + Occ(b, l)
r e t u r n E xac t R e cur (W, i −1, k , l )
Listing 1: Backward Exact Match.
After the search is complete, the final BWT inter-
val is mapped back to the locations in reference using
the suffix array.
BWA (Li and Durbin, 2009) extends this algo-
rithm to allow a predetermined number of mismatches
z:
I n e xRe c u r (W, i , z , k , l )
i f z < 0 t h e n
r e t u r n ∅
i f i < 0 t h e n
r e t u r n
{
k, l
}
I ← ∅
I ← I
S
InexRecur(W, i − 1, z − 1, k, l)
f o r ea c h b ∈ Σ do
k ← C(b) + Occ(b, k − 1) + 1
l ← C(b) + Occ(b, l)
i f k ≤ l t h e n
I ← I
S
InexRecur(W, i, z − 1, k, l)
i f b = W[i] t h e n
I ← I
S
InexRecur(W, i − 1, z, k, l)
e l s e
I ← I
S
InexRecur(W, i − 1, z − 1, k, l)
r e t u r n I
Listing 2: Inexact Match.
Note that every step of the inexact match algo-
rithm consumes eight Occ values compared to two
Occ values required by the exact match. In theory,
all Occ values could be precomputed, but holding the
full Occ array for the human genome reference would
consume approximately 100GB of memory. To save
memory space, the FM-index over the DNA alphabet
is often stored using a cache-friendly approach intro-
duced in (Gog and Petri, 2014), that harkens back to
the bucket layout from the original FM-index paper
(Ferragina and Manzini, 2000). Values of Occ(∗, k)
for every k that is a multiple of 128 are stored in me-
mory followed by 128 characters of BWT in 2-bit en-
coding. Four Occ counters occupy 256 bits, as does
the BWT string. For Occ counters that are not at the
factor of 128 positions, the values must be calculated
on the fly. Furthermore, the suffix array is compres-
sed in a similar manner. Only values of SA[k] where k
is a multiple of 32 are stored in memory, while all the
values in between are recomputed using the Inverse
Suffix Array relationships:
Ψ
−1
(i) = C[BW T [i]] + Occ(BW T [i], i) (4)
SA[k] = SA[(Ψ
−1
)
j
(k)] + j (5)
It means that Equation 4 is applied over and over until
for a value of j the result comes out to be a multiple of
32, and the SA value could be constructed according
to Equation 5.
Even though the memory saving measures do not
change the asymptotic complexity of the match algo-
rithm, in reality they add hundreds of computations of
Occ to every search. Given that the search is perfor-
med multiple times for each and every read out of bil-
lions required for the alignment of a human genome,
Occ function performance becomes crucial.
3 SOLUTION
Our approach could be traced back to the algorithm
by (Vigna, 2008) that performs memory table look-
ups to count character occurrences in each byte of the
BWT string. We replace the memory lookups with
the half-byte register lookups, building on an idea first
proposed by Mula for the bit population count (Muła
et al., 2017). Note however that we do not attempt
to reduce the character counting problem to bit coun-
ting, and apply the half-byte technique directly to the
BWT string. Inputs and outputs for all eight counters,
that are required for one step of the inexact search, fit
into a single AVX512 register and are computed with
one vector pass.
The input BWT string is masked with zeros befo-
rehand for situations when the character at position
k is in the middle of the byte. The result of 256-
bit occurrence count should be corrected for the extra
127 − k A characters.
3.1 Lookup
Every byte in the BWT string is split into its higher
and lower half. Since each half byte value cannot
be greater than 15, the lookup values now fit into a
single vector register and could be retrieved via the
VPSHUFB instruction. The lookup returns all four
counters in a 2-bit format packed into a byte. Two bits
are sufficient as a half byte contains just two charac-
ters. Additionally, the OccLo result is pre-converted
BIOINFORMATICS 2019 - 10th International Conference on Bioinformatics Models, Methods and Algorithms
150