number of blocks obtained. Then, we show how to
modify the mapping function so that the resulting
block size distribution is almost uniform instead of
normal. Since the running time of the mapping
computation is O(1) per k-mer, partitioning can be
performed efficiently while data is being read from
the storage medium.
2 K-MER COUNTING BASICS
A high-level sketch of a k-mer counting algorithm
can be given as follows:
Input: Multiple files containing reads. (A “read” is a
string of characters A, C, G, T.) With the current
technology, the length of each read is about 100-200
characters. Each file contains several million reads.
Output: A histogram of distinct k-mers found in all
files of the input.
Algorithm:
1. Divide the input into N blocks.
2. For each block, in parallel, count the number of
times each k-mer is seen.
3. Merge the partial results obtained for different
blocks into one global dataset for presentation to
the user.
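The three steps above can be sketched as follows. This is a minimal, single-process illustration rather than the implementation of any particular counter: the names `kmers`, `crc32_block`, and `count_kmers` are hypothetical, CRC32 is only a stand-in mapping function, and the loop in Step 2 marks the place where a real tool would run blocks in parallel.

```python
import zlib
from collections import Counter

def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def crc32_block(kmer):
    """Stand-in mapping function (an assumption, not taken from the paper)."""
    return zlib.crc32(kmer.encode("ascii"))

def count_kmers(reads, k, n_blocks, block_of=crc32_block):
    # Step 1: divide the input into n_blocks blocks. The block index depends
    # only on the k-mer itself, so all copies of a k-mer share one block.
    blocks = [[] for _ in range(n_blocks)]
    for read in reads:
        for kmer in kmers(read, k):
            blocks[block_of(kmer) % n_blocks].append(kmer)

    # Step 2: count each block independently (parallelizable in a real tool).
    partial = [Counter(block) for block in blocks]

    # Step 3: merge the partial results into one global dataset.
    total = Counter()
    for counts in partial:
        total.update(counts)
    return total, [len(block) for block in blocks]
```

The second return value lists the block sizes, which is the quantity that must be approximately equal for the parallel threads to have equal work.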
Step 1 is the most critical step in this algorithm. This
step must satisfy two requirements:
a. All copies of a k-mer in all the input files must
be mapped to the same block. This will ensure
that k-mer counts obtained at the end of Step 2
will be universally valid.
b. Block sizes must be approximately equal. This
will ensure that parallel threads will have equal
work.
As will be discussed in the next section, the first
requirement is easily satisfied by simple mapping
schemes. However, none of the mapping methods in
the literature guarantees the second requirement.
3 EXISTING METHODS
In the literature, there are three basic methods for
k-mer partitioning:
1. Hash-based method,
2. Prefix-based method,
3. Minimizer-based method.
The hash-based method is used in the DSK algorithm
(Rizk et al., 2013). In this method, a hash function h
maps each k-mer S to a number h(S) in a fixed integer
range. If P is the desired number of blocks,
S is sent to block $h(S) \bmod P$. Obviously, all
copies of a k-mer will fall into the same block.
However, the maximum block size depends on the
hash function. The paper contains no information
about the hash function used, and provides no data
about the block size distribution obtained. While
hash functions generally do a good job of
distributing data among bins, their worst-case
performance can be quite bad.
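A minimal sketch of hash-based partitioning is given below, assuming that the block index is the hash value reduced modulo P. CRC32 is used only as a readily available stand-in; it is not the hash function of DSK, and the names `hash_block` and `block_sizes` are hypothetical.

```python
import zlib

def hash_block(kmer, num_blocks):
    """Block index of a k-mer: hash value modulo the number of blocks.

    CRC32 is an assumed stand-in hash, not the one used in DSK.
    """
    return zlib.crc32(kmer.encode("ascii")) % num_blocks

def block_sizes(kmers, num_blocks):
    """Count how many k-mer occurrences fall into each block."""
    sizes = [0] * num_blocks
    for kmer in kmers:
        sizes[hash_block(kmer, num_blocks)] += 1
    return sizes
```

Because the index depends only on the k-mer, every copy of a heavily repeated k-mer lands in the same block, so inspecting the output of `block_sizes` on real data is one way to check how far the largest block deviates from the average.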
The prefix-based method is used in KMC1
(Deorowicz et al., 2013). This method partitions k-
mers according to a fixed-length prefix. The prefix
becomes the name of the block for a k-mer. The
choice of prefix length depends on the desired
number of blocks. If the length of the prefix is p, the
number of blocks will be $4^p$. This method divides
the theoretical k-mer space into equal-sized
subspaces. However, the theoretical space is not
uniformly populated by actual k-mers. The size of a
block is sensitive to the probability distributions of
symbols in the prefix. Since nucleotides A and T are
more abundant than C and G, blocks containing A’s
and T’s in their names can be much bigger than
others.
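The sensitivity to symbol probabilities can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that the symbols of a prefix are independent and that A and T each occur with probability 0.3 while C and G each occur with probability 0.2; these frequencies and the function names are hypothetical, not measurements from any dataset.

```python
import math
from itertools import product

def prefix_block(kmer, p):
    """Name of the block for a k-mer: its first p characters."""
    return kmer[:p]

def expected_block_fraction(prefix, freq):
    """Expected fraction of k-mers in a block, assuming independent symbols."""
    return math.prod(freq[c] for c in prefix)

# Hypothetical symbol frequencies (an assumption for illustration only).
FREQ = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def all_block_fractions(p, freq=FREQ):
    """Expected fraction for each of the 4**p possible prefix blocks."""
    return {
        "".join(t): expected_block_fraction("".join(t), freq)
        for t in product("ACGT", repeat=p)
    }
```

Under these assumed frequencies and p = 3, block AAA is expected to receive a fraction 0.3^3 = 0.027 of the k-mers while block CCC receives 0.2^3 = 0.008, so the former is more than three times larger even though both cover equal-sized subspaces.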
The minimizer-based method was originally
invented to save memory space when storing k-mers
(Roberts et al., 2004). Later, it was used in k-mer
counting to partition the input data in MSP (Li et al.,
2013; Li, 2015), in KMC2 (Deorowicz et al., 2015),
and in Gerbil (Erbert et al., 2017). To explain the
method, define a p-string as a substring of length p
in a k-mer. A p-string is a minimizer for a k-mer if no
other p-string in the k-mer is lexicographically
smaller than it.
Example 1:
Read:
Consider the k-mers derived from this read. Assume
k = 10, then the k-mers are ,
, and
If p=3, then the minimizer is .
All the k-mers in this example share this minimizer.
Instead of mapping these k-mers separately, we can
map the segment of the read that contains them. In
this case, the unit of mapping is a segment of a read
rather than a k-mer. The minimizer itself becomes
the name of the block to which the segment is
mapped. In this scheme, the number of blocks is
$4^p$.
In existing algorithms, p is chosen in the range 4-10.
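The minimizer definition and the segment-based mapping can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, plain lexicographic order on the characters A, C, G, T is assumed, reverse complements are ignored, and the read in the usage note is made up rather than taken from Example 1.

```python
def minimizer(kmer, p):
    """Lexicographically smallest substring of length p in a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))

def segments(read, k, p):
    """Split a read into maximal segments whose consecutive k-mers share
    one minimizer. Returns (minimizer, segment) pairs; the minimizer is
    the name of the block to which the segment is mapped.
    Assumes len(read) >= k >= p.
    """
    out = []
    start = 0
    current = minimizer(read[:k], p)
    for i in range(1, len(read) - k + 1):
        m = minimizer(read[i:i + k], p)
        if m != current:
            # k-mers start..i-1 share `current`; they span read[start:i-1+k].
            out.append((current, read[start:i - 1 + k]))
            start, current = i, m
    out.append((current, read[start:]))
    return out
```

For the made-up read CGTTAACCGGTT with k = 10 and p = 3, all three k-mers have the minimizer AAC, so `segments` returns the whole read as a single segment mapped to block AAC. Mapping one segment instead of three k-mers is what saves space in this scheme.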
In a recent paper, Erbert et al. (2017) set out to
create the fastest possible k-mer counter by
combining the “best ideas” in earlier papers. After
extensively testing different methods, they selected
the signatures method for partitioning the input data