Authors:
Eleonora Mian; Enrico Petrucci; Cinzia Pizzi and Matteo Comin
Affiliation:
Department of Information Engineering, University of Padova, Padova, 35131, Italy
Keyword(s):
k-Mers, Gapped q-Gram, Multiple Spaced Seeds, Efficient Hashing.
Abstract:
Alignment-free analysis of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Hashing k-mers is a common operation across many alignment-free applications, and it is widely used for indexing, querying, and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, have come into routine use in place of k-mers. Spaced seeds improve sensitivity with respect to k-mers in many applications; however, hashing spaced seeds substantially increases the computational time. Moreover, using multiple spaced seeds can further increase accuracy, at the cost of running time. In this paper we address the problem of efficient multiple spaced seed hashing. The proposed algorithms exploit the similarity of adjacent spaced seed hash values in an input sequence in order to efficiently compute the next hashes. We report the results of several tests, which show that our methods significantly outperform the previously proposed algorithms, with a speedup that can reach 20x. We also apply these efficient spaced seed hashing algorithms to an application in the field of metagenomics, the classification of reads performed by Clark-S (Ounit and Lonardi, 2016), and we show that a significant speedup can be obtained, thus resolving the slowdown introduced by the use of multiple spaced seeds. Code available at: https://github.com/CominLab/MISSH.
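To make the idea of reusing adjacent hash values concrete, here is a minimal Python sketch for a single spaced seed. It is an illustration under stated assumptions, not the authors' algorithm and not the MISSH code: the 2-bit encoding `ENCODE`, the function names, and the slot-by-slot copy strategy are all hypothetical choices made for this example. A seed is written as a string of `1` (care) and `0` (don't care) positions, and the hash packs the encoded symbols at the care positions into an integer. The second function recovers a symbol from the previous window's hash whenever the seed, shifted by one position, overlaps one of its own care positions, and reads the sequence only for the remaining slots.

```python
# Hypothetical 2-bit nucleotide encoding (an assumption for this sketch).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def care_positions(seed):
    """Offsets of the '1' (care) positions in a spaced seed such as '1101'."""
    return [i for i, c in enumerate(seed) if c == "1"]


def naive_hashes(seq, seed):
    """Hash every window by reading each care position from the sequence."""
    pos = care_positions(seed)
    hashes = []
    for start in range(len(seq) - len(seed) + 1):
        h = 0
        for slot, p in enumerate(pos):
            h |= ENCODE[seq[start + p]] << (2 * slot)
        hashes.append(h)
    return hashes


def reuse_hashes(seq, seed):
    """Same hashes, copying 2-bit slots from the previous hash when possible."""
    pos = care_positions(seed)
    slot_of = {p: slot for slot, p in enumerate(pos)}
    # Slot j of the current window holds the same symbol as slot k of the
    # previous window whenever pos[k] == pos[j] + 1; None means no such slot.
    source = [slot_of.get(p + 1) for p in pos]
    hashes = []
    prev = None
    for start in range(len(seq) - len(seed) + 1):
        h = 0
        for slot, p in enumerate(pos):
            k = source[slot]
            if prev is not None and k is not None:
                code = (prev >> (2 * k)) & 3      # recovered from previous hash
            else:
                code = ENCODE[seq[start + p]]     # read from the sequence
            h |= code << (2 * slot)
        hashes.append(h)
        prev = h
    return hashes
```

For example, `naive_hashes("ACGT", "1101")` returns `[52]`, and `reuse_hashes` produces the same list as `naive_hashes` for any input. In Python the copy-based version is not faster; the sketch only shows which symbols are shared between adjacent windows, which is the redundancy that an efficient implementation would exploit with word-level shifts and masks.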