
3 RESULTS AND DISCUSSION
In this section, we present the results obtained from
the evaluation of SMT in comparison with the JELLYFISH
and KANALYZE algorithms. These comparisons serve to
validate the efficacy and efficiency of SMT in the
context of genomic data processing and analysis. The
comparative analysis shows that SMT outperforms both
tools in the scenarios tested and highlights the
features that contribute to its performance and
flexibility in handling k-mers. Throughout this
section, we discuss the methodological aspects of the
tests carried out, as well as the metrics employed to
evaluate the data structures and algorithms under
study. The insights derived from this analysis indicate
the potential of SMT as a robust and effective data
structure for analyzing large genomic data sets.
To evaluate the algorithms, we selected all .bed files
from the JASPAR 2022 (Castro-Mondragon et al., 2022)
repository that correspond to ChIP-seq data with more
than 10,000 sequences, totaling 131 distinct data sets.
The algorithms were executed with values of k ranging
from 5 to 30. Execution time and RAM consumption were
monitored with the command /usr/bin/time -v, available
on practically all LINUX/UNIX operating systems. All
tests were run on the Ubuntu 22.04.3 LTS operating
system, on a machine with AMD EPYC 7B12 processors,
8 GB of RAM, and a clock speed of approximately
2250 MHz. For reproducibility purposes, the following
command line instructions were used:
1. SMT: smt -i <fasta file> -k <size of kmer> -s 500
2. JELLYFISH: jellyfish count -m <size of kmer> -s 500M
3. KANALYZE: kanalyze count -k <size of kmer> -f fasta <fasta file>
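As an illustration of the measurement procedure, the following Python sketch wraps a command in /usr/bin/time -v and extracts peak RAM and wall-clock time from the report. It is a minimal sketch, not the script used in this study: the function names parse_time_v and measure are hypothetical, and it assumes the GNU time report format, in which the relevant fields are labeled "Maximum resident set size (kbytes)" and "Elapsed (wall clock) time (h:mm:ss or m:ss)".

```python
import re
import subprocess

# Field labels as printed by GNU time in verbose (-v) mode (assumed format).
_RSS = re.compile(r"Maximum resident set size \(kbytes\): (\d+)")
_WALL = re.compile(r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\): (\S+)")


def parse_time_v(report):
    """Extract peak RAM (MB) and wall-clock time (s) from a `time -v` report."""
    rss_kb = int(_RSS.search(report).group(1))
    clock = _WALL.search(report).group(1)
    # The clock is printed as m:ss.cc or h:mm:ss; fold it into seconds.
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return {"peak_ram_mb": rss_kb / 1024, "wall_seconds": seconds}


def measure(command):
    """Run `command` (a list of arguments) under /usr/bin/time -v.

    GNU time writes its report to stderr, so that stream is parsed.
    """
    proc = subprocess.run(
        ["/usr/bin/time", "-v", *command],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_time_v(proc.stderr)
```

For example, measure(["smt", "-i", "sample.fa", "-k", "5", "-s", "500"]) would time a single SMT run, where sample.fa is a placeholder file name.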
Figure 1 displays the average performance of the
algorithms across all data sets and all values of k.
The upper-left graph, which shows RAM consumption,
indicates that JELLYFISH requires considerably more
memory than the other two: about 1000 MB, against
approximately 250 MB for KANALYZE and about 50 MB for
SMT. In the upper-right graph, which shows execution
time in seconds, KANALYZE has the longest execution
time, about 8 seconds. JELLYFISH takes approximately
2 seconds, while SMT has the shortest time, close to
0.5 seconds.
In the bottom-left graph of Figure 1, JELLYFISH has a
median close to 1400 MB, with outliers exceeding
3000 MB. KANALYZE has a median around 300 MB, with
peaks reaching almost 2000 MB. SMT maintains lower
consumption, with a median near 150 MB and outliers
close to 3000 MB. In the bottom-right graph, JELLYFISH
shows low dispersion, with a median above 4 seconds.
KANALYZE exhibits high dispersion, with a median close
to 7 seconds and outliers beyond 55 seconds. SMT
retains the best performance, with a median close to
0.5 seconds and outliers nearing 18 seconds.
The interquartile range (IQR) of RAM consumption shows
how dispersed each algorithm is. JELLYFISH has the
highest IQR, 2284 MB, indicating large variation in its
RAM consumption in the central half of the data. In
contrast, KANALYZE and SMT have much lower IQRs, 133 MB
and 115 MB, respectively. This suggests that while
JELLYFISH has widely dispersed RAM consumption,
KANALYZE and SMT are consistent in this respect, at
least in the central half of their distributions. These
observations complement the analysis of execution time,
where SMT stands out with a considerably lower median
time than the others.
The interquartile range (IQR) of execution time gives
information on the variability of each method.
JELLYFISH displayed an IQR of 4.19 seconds, meaning
that the central half of its execution times spans a
range of that width. KANALYZE has a slightly higher
IQR, 4.73 seconds, suggesting slightly larger variation
in its central times. SMT, on the other hand, exhibited
an IQR of only 0.5 seconds, reflecting consistent
execution times.
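The dispersion statistics discussed above can be reproduced from raw measurements with a few lines of Python. The sketch below is illustrative only: the function name summarize is hypothetical, the quartiles use the "inclusive" method of the standard statistics module (other quartile conventions give slightly different values), and the outlier rule (points beyond 1.5 times the IQR from the quartiles) is the usual box-plot convention, assumed here rather than taken from the study.

```python
from statistics import median, quantiles


def summarize(values):
    """Return median, IQR, and box-plot outliers for a list of measurements."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: outliers lie more than 1.5 * IQR beyond Q1 or Q3.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"median": median(values), "iqr": iqr, "outliers": outliers}
```

Applied separately to the execution times or the RAM readings of each algorithm, such a summary yields one median and one IQR per algorithm and metric.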
Figure 2 illustrates the relationship between k-mer
size and two performance metrics: RAM consump-
tion and execution time for the algorithms JELLYFISH,
KANALYZE, and SMT. Regarding RAM consumption,
JELLYFISH consumes more memory as k-mer size grows,
with a sharp increase for k above 20. Surprisingly,
KANALYZE exhibits a slight decrease in RAM consumption
as k-mer size grows. SMT, in turn, maintains an almost
constant and low consumption profile, irrespective of
the k-mer size.
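A per-k view such as this one can be obtained by grouping individual measurements by algorithm and k-mer size and averaging each group. The following Python sketch shows one way to do so; the function name mean_by_tool_and_k and the record layout (tool, k, value) are assumptions made for illustration, not the format used in this study.

```python
from collections import defaultdict
from statistics import mean


def mean_by_tool_and_k(records):
    """Average a metric per (tool, k) pair.

    `records` is an iterable of (tool, k, value) tuples, one per run
    (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for tool, k, value in records:
        groups[(tool, k)].append(value)
    return {key: mean(values) for key, values in groups.items()}
```

Plotting the resulting averages against k, one curve per algorithm, gives a trend line of the kind described above.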
Figure 3 displays the behavior of the algorithms with
respect to RAM consumption, grouped by k values. In
general, SMT tends to have a more compact distribution
of RAM usage, while JELLYFISH and KANALYZE exhibit