
APPENDIX
Table 5: Number of k-mers for each dataset varying k ∈ {15, 17, 21, 31, 41}.
dataset           #15-mers     #21-mers     #31-mers     #41-mers
SRR001665_1     13,889,837   14,286,068   10,343,472            -
SRR001665_2     16,371,558   16,895,362   12,058,109            -
SRR061958_1    225,788,025  388,490,798  404,149,685  392,492,657
SRR061958_2    265,935,616  482,235,278  495,804,915  475,405,235
SRR062379_1    109,810,585  152,875,155  160,692,477  160,746,342
SRR062379_2    108,958,432  151,987,994  159,905,793  158,802,318
SRR10260779_1   84,250,397  113,667,728  123,624,245  127,090,699
SRR10260779_2   93,032,179  128,074,943  139,633,894  143,150,103
SRR11458718_1   89,998,269  126,431,861  137,995,280  143,397,012
SRR11458718_2   94,018,791  134,997,414  150,549,990  159,144,668
SRR13605073_1   43,488,336   54,085,000   55,764,573   54,682,553
SRR14005143_1   11,307,338   13,223,059   15,005,192   16,272,583
SRR14005143_2   23,691,810   28,456,533   31,850,681   33,872,511
SRR332538_1     10,624,064   11,404,027   11,382,816   10,666,430
SRR332538_2     18,741,106   25,674,930   28,880,136   27,477,871
SRR341725_1    132,442,790  188,913,254  185,618,107  176,391,089
SRR341725_2    136,484,353  196,035,961  192,133,588  181,970,438
SRR5853087_1   159,744,051  316,438,109  382,773,071  399,026,650
SRR957915_1    126,236,121  208,110,514  239,200,400  250,988,377
SRR957915_2    188,867,779  335,926,750  364,597,018  361,352,380
Table 6: Size of the k-mer files (k = 21) after compression with MFCompress. Compression of dataset SRR5853087_1 failed
with an error. Matchtigs could not be computed for many datasets because of out-of-memory errors.
k = 21, compression
dataset                UST        USTAR       USTAR2  Greedy Matchtigs    Matchtigs
SRR001665_1     12,641,658   12,332,551    8,728,852         8,813,736    8,845,254
SRR001665_2     15,492,263   15,109,673   10,915,321        11,003,600   10,876,474
SRR061958_1    194,173,905  185,905,825   45,510,962        45,454,536            -
SRR061958_2    235,657,588  225,975,765   50,801,622        50,486,848            -
SRR062379_1     82,713,766   79,283,723   59,070,721        58,566,163            -
SRR062379_2     80,164,746   76,708,406   57,036,189        56,882,630            -
SRR10260779_1   64,644,700   61,724,139   43,373,952        43,311,649            -
SRR10260779_2   72,772,294   69,375,320   49,077,343        48,574,622            -
SRR11458718_1   64,694,925   61,236,404   42,840,309        42,645,409            -
SRR11458718_2   68,982,466   65,438,050   45,077,154        44,708,191            -
SRR13605073_1   25,833,347   24,546,244   20,149,898        20,144,454            -
SRR14005143_1    6,419,520    6,220,215    4,222,948         4,179,654    4,194,213
SRR14005143_2   13,117,896   12,655,430    9,056,375         8,932,076    8,980,170
SRR332538_1      5,737,778    5,599,034    4,393,161         4,393,504            -
SRR332538_2     14,410,775   13,528,977    9,930,431         9,712,821            -
SRR341725_1     80,436,678   78,193,253   60,766,288        61,160,751            -
SRR341725_2     84,250,689   81,877,574   63,879,557        64,009,811            -
SRR5853087_1             -            -            -                 -            -
SRR957915_1    122,748,678  116,872,195   83,947,935        82,631,218            -
SRR957915_2    182,073,051  172,757,385  131,423,152       130,303,345            -
average         75,103,512   71,860,009   42,115,904        41,890,264            -
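The average row appears to be the arithmetic mean over the 19 datasets that compressed successfully, with the dataset that failed to compress left out. The following minimal Python sketch checks this reading on the UST column only, using the values as transcribed from the table above. The names `ust_sizes` and `column_average` are illustrative and do not come from the original work.

```python
# UST column of Table 6 (k = 21), in table order.
# Assumption: the dataset whose compression failed is excluded from the mean.
ust_sizes = [
    12_641_658, 15_492_263, 194_173_905, 235_657_588,
    82_713_766, 80_164_746, 64_644_700, 72_772_294,
    64_694_925, 68_982_466, 25_833_347, 6_419_520,
    13_117_896, 5_737_778, 14_410_775, 80_436_678,
    84_250_689, 122_748_678, 182_073_051,
]


def column_average(sizes):
    """Return the arithmetic mean of a column, rounded to the nearest integer."""
    return round(sum(sizes) / len(sizes))


ust_average = column_average(ust_sizes)
print(ust_average)  # compare with the UST entry of the average row
```

The same check can be repeated for the USTAR, USTAR2 and Greedy Matchtigs columns by substituting their values. The Matchtigs column has too few entries for a comparable average.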
BIOINFORMATICS 2024 - 15th International Conference on Bioinformatics Models, Methods and Algorithms