formation coming from reads in pairs. (Indeed, we
compress the FASTQ files in a paired-end dataset
independently, as if they were single-end.) Both of
the above aspects could be analyzed in future work.
We believe the results presented in this paper can
motivate the development of new FASTQ compressors
that modify the bases and quality scores components
by taking both kinds of information into account at
the same time, achieving better compression while
keeping most of the relevant information in the
FASTQ data. As future work, we intend to investigate
the error correction problem, which needs to take
into account much more information (e.g.,
reverse-complement or paired-end information).
ACKNOWLEDGEMENTS
The authors would like to thank N. Prezza for valu-
able comments and suggestions and for providing part
of the code library, and E. Niccoli for preliminary ex-
perimental investigations on positional clustering and
compression in his bachelor’s thesis under the super-
vision of GR and VG.
This work was partially supported by the project
MIUR-SIR CMACBioSeq (“Combinatorial Methods for
Analysis and Compression of Biological Sequences”)
grant n. RBSI146R5L and by the University of Pisa
under the “PRA – Progetti di Ricerca di Ateneo”
(Institutional Research Grants) - Project no. PRA
2020-2021 26 “Metodi Informatici Integrati per la
Biomedica”.
Lossy Compressor Preserving Variant Calling through Extended BWT