sumption and the possibility to use widespread indexing
algorithms such as the FM-Index, it also comes
with reduced query speeds. The poor locality of succinct
data structures, to which the FM-Index belongs,
leads to a less efficient use of the cache. This may
impact the execution time of YALFF, making it about
1.3 times slower than Quartz. This problem can
be mitigated with optimizations such as
using multiple cores, or extending a match on a k-mer
to its neighbors on the linear sequence, which
reduces the number of accesses to the
database.
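To make the locality issue concrete, the following is a minimal sketch of FM-Index backward search over a toy text. It is not the YALFF implementation: the function names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed, which a real succinct index would not do. Each step of the search reads the occurrence table at two positions that depend on the previous step, which is the access pattern that tends to cause cache misses.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-Index (C table and occurrence table) for `text`."""
    text += "$"  # sentinel, assumed to sort before every other symbol
    # Naive suffix array: acceptable for a sketch, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the symbol preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[c] = number of symbols in the text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for symbol in sorted(counts):
        c_table[symbol] = total
        total += counts[symbol]
    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed here).
    occ = {symbol: [0] * (len(bwt) + 1) for symbol in counts}
    for i, symbol in enumerate(bwt):
        for s in occ:
            occ[s][i + 1] = occ[s][i] + (1 if s == symbol else 0)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of `pattern` by backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for symbol in reversed(pattern):
        if symbol not in c_table:
            return 0
        # Two table lookups at unrelated positions per pattern symbol.
        lo = c_table[symbol] + occ[symbol][lo]
        hi = c_table[symbol] + occ[symbol][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGTACGA")
print(count_occurrences(index, "ACG"))  # 2
print(count_occurrences(index, "TTT"))  # 0
```

A k-mer membership query in this setting is simply `count_occurrences(index, kmer) > 0`.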
4 CONCLUSIONS AND FUTURE WORK
This work has demonstrated the feasibility of combining
a reassembly procedure with a string indexing
algorithm to produce a very compact dictionary of k-mers
that works as a drop-in replacement wherever
static k-mer hash tables are needed. Given a SNP
database, such as dbSNP or Affymetrix SNPs, it is possible
to find a set of k-mers that are uniquely associated
with a SNP. This set of k-mers can be efficiently
compressed into a string dictionary and used for quality
value compression. These k-mer dictionaries are
more informative than a single reference genome, and
they show better performance in terms of compression
ratio and genotyping accuracy while keeping
memory requirements low.
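The idea of reassembling k-mers before indexing them can be illustrated with a small sketch. It assumes a simple greedy strategy that chains k-mers overlapping by k-1 symbols; this is an illustrative assumption, not necessarily the reassembly procedure used in this work, and the names `reassemble` and `windows` are hypothetical. Because every length-k window of a merged string is one of the input k-mers, the merged strings represent exactly the same set with fewer characters, and they could then be handed to a string index.

```python
def reassemble(kmers):
    """Greedily merge k-mers that overlap by k-1 symbols into longer strings."""
    remaining = set(kmers)
    if not remaining:
        return []
    k = len(next(iter(remaining)))
    # Assumption: all k-mers share the same length k >= 2 over the alphabet ACGT.
    assert k >= 2 and all(len(x) == k for x in remaining)
    contigs = []
    while remaining:
        contig = min(remaining)  # deterministic choice of the seed k-mer
        remaining.remove(contig)
        # Extend to the right while an unused k-mer continues the string.
        while True:
            suffix = contig[-(k - 1):]
            nxt = next((suffix + b for b in "ACGT" if suffix + b in remaining), None)
            if nxt is None:
                break
            remaining.remove(nxt)
            contig += nxt[-1]
        # Extend to the left in the same way.
        while True:
            prefix = contig[:k - 1]
            prv = next((b + prefix for b in "ACGT" if b + prefix in remaining), None)
            if prv is None:
                break
            remaining.remove(prv)
            contig = prv[0] + contig
        contigs.append(contig)
    return contigs


def windows(contigs, k):
    """Return the set of all length-k substrings of the merged strings."""
    return {c[i:i + k] for c in contigs for i in range(len(c) - k + 1)}


kmers = ["ACG", "CGT", "GTA", "TTT"]
contigs = reassemble(kmers)
print(contigs)  # ['ACGTA', 'TTT']
print(windows(contigs, 3) == set(kmers))  # True
```

In this toy case the four 3-mers (12 characters) are stored as two strings of 8 characters in total, which is the kind of saving that makes the subsequent index smaller.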
One direction for future research is the construction
of a dynamic FM-Index able to add and
remove k-mers without recomputing the whole structure.
Another interesting problem is how to speed up
search queries by reducing cache misses. In the field
of metagenomic read classification, most methods are
based on k-mer indexes (Girotto et al., 2017; Marchiori
and Comin, 2017; Qian et al., 2018), and only
recently has the FM-index been applied (Břinda et al.,
2017). Similarly, the discovery of SNPs without mapping
based on the FM-index has been proposed only recently
(Prezza et al., 2018). We believe that the use of
the FM-index will be beneficial in other alignment-free
applications such as pan-genomics.
REFERENCES
Benoit, G., Lemaitre, C., Lavenier, D., Drezen, E., Dayris,
T., Uricaru, R., and Rizk, G. (2015). Reference-free
compression of high throughput sequencing data with
a probabilistic de Bruijn graph. BMC Bioinformatics,
16:288.
Břinda, K. (2016). Novel computational techniques for
mapping and classifying Next-Generation Sequencing
data. PhD thesis, Université Paris-Est.
Břinda, K., Salikhov, K., Pignotti, S., and Kucherov, G.
(2017). ProPhyle: a phylogeny-based metagenomic
classifier using the Burrows-Wheeler transform. Poster
at HiTSeq 2017.
Bonfield, J. K. and Mahoney, M. V. (2013). Compression
of FASTQ and SAM format sequencing data. PLoS ONE.
Burrows, M. and Wheeler, D. J. (1994). A block-sorting
lossless data compression algorithm. Technical report,
Digital Equipment Corporation.
Chikhi, R., Limasset, A., and Medvedev, P. (2016). Compacting
de Bruijn graphs from sequencing data quickly
and in low memory. Bioinformatics, 32(12):i201–i208.
Cánovas, R., Moffat, A., and Turpin, A. (2014). Lossy compression
of quality scores in genomic data. Bioinformatics,
30(15):2130–2136.
Comin, M., Leoni, A., and Schimd, M. (2014). Qcluster:
Extending alignment-free measures with quality val-
ues for reads clustering. In Brown, D. and Morgen-
stern, B., editors, Algorithms in Bioinformatics, pages
1–13, Berlin, Heidelberg. Springer Berlin Heidelberg.
Comin, M., Leoni, A., and Schimd, M. (2015). Clustering
of reads with alignment-free measures and quality val-
ues. Algorithms for Molecular Biology, 10(1):1–10.
Consortium, The 1000 Genomes Project (2012). An integrated
map of genetic variation from 1,092 human genomes. Nature,
491(7422):56–65.
Ewing, B., Hillier, L., Wendl, M. C., and Green, P. (1998).
Base-Calling of Automated Sequencer Traces Using
Phred. I. Accuracy Assessment. Genome Research,
8(3):175–185.
Ferragina, P. and Manzini, G. (2000). Opportunistic Data
Structures with Applications. In Proceedings of the
41st Annual Symposium on Foundations of Computer
Science, FOCS ’00, pages 390–, Washington, DC,
USA. IEEE Computer Society.
Ferragina, P. and Manzini, G. (2005). Indexing Compressed
Text. J. ACM, 52(4):552–581.
Girotto, S., Comin, M., and Pizzi, C. (2017). Higher re-
call in metagenomic sequence classification exploiting
overlapping reads. BMC Genomics, 18(10):917.
Girotto, S., Comin, M., and Pizzi, C. (2018a). Efficient
computation of spaced seed hashing with block index-
ing. BMC Bioinformatics, 19(15):441.
Girotto, S., Comin, M., and Pizzi, C. (2018b). Fsh: fast
spaced seed hashing exploiting adjacent hashes. Al-
gorithms for Molecular Biology, 13(1):8.
Greenfield, D. L., Stegle, O., and Rrustemi, A. (2016).
GeneCodeq: quality score compression and improved
genotyping using a Bayesian framework. Bioinfor-
matics (Oxford, England), 32(20):3124–3132.
HG38 (2018). Human reference genome (hg38).
http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/.
Illumina8bin (2011). Quality scores for next-generation sequencing,
Illumina Inc. Technical report, Illumina Inc.