et al., 2020) as well as for SOPANG and SOPANG2
(A.Cisłak et al., 2018; A.Cisłak and S.Grabowski,
2020) that, moreover, only detect exact matches without
allowing gaps or mismatches. We thus considered the
tool abPOA (Y.Gao et al., 2021), a C library tool that
aligns a sequence to a directed acyclic graph, also
uses partial order alignment, and supports global
alignment. It is the state of the art for base-level
exact alignments, but it cannot handle sequences as
long as 100,000b. The same holds for the tool Astarix
(P.Ivanov et al., 2020; P.Ivanov et al., 2022b;
P.Ivanov et al., 2022a).
We therefore compared our DSA with
GRAPHALIGNER (M.Rautiainen and T.Marschall,
2020; M.Rautiainen et al., 2019) and MINIGRAPH
(H.Li et al., 2020; H.Li, 2016; H.Li, 2018; H.Li,
2021) on solving STODS. Both of them are designed
to align strings on a more general graph structure
than D-strings and therefore the comparison we
show below should be viewed as a validation of
the performance of DSA in solving STODS, and
not as a claim that DSA is in general a better
tool than either of the other two. GRAPHALIGNER
uses a seed-and-extend method and the bitvector
alignment extension algorithm of (M.Rautiainen
et al., 2019). MINIGRAPH is a well-maintained and
highly optimized software tool that uses minimizers
to find strong colinear chains as the starting point
to build the alignment (H.Li et al., 2020; H.Li, 2016;
H.Li, 2018; H.Li, 2021). Therefore, among DSA,
GRAPHALIGNER, and MINIGRAPH, our DSA is the
only one that computes base-level exact alignments.
D-strings generation. We randomly generated a
D-string of width W = 100,000b by first generating a
random string of length W over {A,C,G,T}, and then
inserting² therein deg degenerate (non-solid) positions,
where deg is an input parameter given as a percentage
of W, as follows: for each such position we pick at
random a value for its size between 1 and S (another
input parameter), and a length between 1 and L (again
an input parameter). We generated D-strings using
width W = 100,000b in all tests, degeneracy frequencies
deg = 1%, 10%, maximum numbers of variants S = 2, 5,
and maximum variant lengths L = 1, 4. As a consequence,
the width of the tested D-strings is always
W = 100,000b, while their total size N depends on the
input parameters deg, S, L.
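
The generation procedure can be illustrated with a minimal Python sketch. It is not the generator used for the experiments. It assumes that a D-string is represented as a list of positions, each a list of equal-length variants, and it keeps the width at W by choosing which slots are degenerate instead of editing a base string in place. The function names (random_d_string, width, total_size) are hypothetical, and variants are drawn independently, so two variants of a position may coincide.

```python
import random

ALPHABET = "ACGT"

def random_d_string(W, deg, S, L, rng):
    """Return a D-string as a list of positions; each position is a list of
    equal-length variants (a solid position holds one single letter)."""
    n_deg = round(W * deg / 100)                 # number of degenerate positions
    lengths = [rng.randint(1, L) for _ in range(n_deg)]
    n_solid = W - sum(lengths)                   # letters left for solid positions
    if n_solid < 0:
        raise ValueError("degenerate positions do not fit in width W")
    # Assumption: degenerate positions are spread uniformly among all slots.
    deg_slots = set(rng.sample(range(n_solid + n_deg), n_deg))
    next_length = iter(lengths)
    d_string = []
    for slot in range(n_solid + n_deg):
        if slot in deg_slots:
            length = next(next_length)
            size = rng.randint(1, S)             # size range as stated in the text
            d_string.append(
                ["".join(rng.choices(ALPHABET, k=length)) for _ in range(size)]
            )
        else:
            d_string.append([rng.choice(ALPHABET)])
    return d_string

def width(d_string):
    """Sum of the variant lengths, one variant per position."""
    return sum(len(position[0]) for position in d_string)

def total_size(d_string):
    """Total number of letters over all variants (the size N)."""
    return sum(len(variant) for position in d_string for variant in position)
```

With this representation, width(d) equals W for every generated D-string, while total_size(d) grows with deg, S, and L.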
Pattern generation. From the obtained synthetic D-string
$\hat{T}$, we extracted a ground truth exact pattern $P_0$
of size W (that is, a string $P_0 \in \hat{T}$ that thus
matches $\hat{T}$ with distance 0), and we (possibly)
modified $P_0$ into the actual input query P with
different possible divergences using real .vcf files³.
The divergences we tested were (i) no divergence at all,
(ii) 0.1% SNPs, (iii) 1% SNPs, (iv) 0.1% INDELs
(percentages are of W). Hence, the size m = |P| of the
query string will be in Θ(W), and so will the distance d
between P and $\hat{T}$.
² The insertion was done forcing the width to remain W.
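
The pattern construction can be sketched in Python as follows. The sketch assumes the list-of-positions representation of a D-string and replaces the real .vcf files with uniformly random substitutions, so it only mimics the SNP case. Both function names are hypothetical.

```python
import random

def extract_exact_pattern(d_string, rng):
    """Pick one variant per position and concatenate them: the result
    matches the D-string with distance 0."""
    return "".join(rng.choice(variants) for variants in d_string)

def add_snps(pattern, rate, rng, alphabet="ACGT"):
    """Substitute round(rate * len(pattern)) distinct positions.
    The new letter is drawn from the whole alphabet, so a substitution
    may leave the base unchanged."""
    chars = list(pattern)
    n_snps = round(len(chars) * rate)
    for i in rng.sample(range(len(chars)), n_snps):
        chars[i] = rng.choice(alphabet)
    return "".join(chars)

if __name__ == "__main__":
    rng = random.Random(42)
    toy = [["A"], ["CG", "TT"], ["G"]]          # toy D-string of width 4
    p0 = extract_exact_pattern(toy, rng)
    print(p0, add_snps(p0, 0.25, rng))
```

Because a redrawn letter can equal the original one, the number of observable mismatches is at most the number of substituted positions.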
We ran experiments for all values of deg, S, L
listed in the D-strings generation paragraph,
resulting in D-strings of size ranging from
N = 101,000 (for deg = 1%, S = 2, L = 1) to
N = 160,327 (for deg = 10%, S = 5, L = 4). All tests
were run single-threaded on a laptop with an
Intel® Core™ i7-11800H × 16 and 16.0 GiB RAM. Space
and time were measured using
/usr/bin/time -f "%S\t%M" to extract system time
(seconds) and maximum resident set size (kbytes).
Time was reported as 0 when < 0.001s. In all tests
we used alignment scores a = 0, x = 1, o = 2, e = 1.
For space reasons, we only report results for two
parameter sets that are representative of the
comparative results. Table 1 shows results with a
D-string of size N = 106,147 generated with deg = 1%
of degenerate positions with up to S = 5 variants of
length up to L = 4 (low frequency of highly
degenerate positions). Table 2 shows results with a
D-string of size N = 110,000 generated with
deg = 10% of degenerate positions with up to S = 2
variants of length up to L = 1 (high frequency of
mildly degenerate positions).
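
The same two quantities can also be read from Python, as in the minimal sketch below. It is an alternative to the /usr/bin/time command, not the procedure used for the tables. It assumes a Unix-like system, where the standard resource module is available, and the helper name run_and_measure is hypothetical. The command in the usage lines is a placeholder and not the invocation of any of the compared tools.

```python
import resource
import subprocess
import sys

def run_and_measure(cmd):
    """Run cmd to completion and return (system_time_seconds, max_rss).

    System time is the kernel-mode CPU time added by this child.
    max_rss is the largest peak resident set size among all children
    waited for so far (kilobytes on Linux, bytes on macOS)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return after.ru_stime - before.ru_stime, after.ru_maxrss

if __name__ == "__main__":
    # Placeholder command: a Python interpreter that does nothing.
    stime, rss = run_and_measure([sys.executable, "-c", "pass"])
    print(f"{stime:.3f}\t{rss}")
```

Since the peak is taken over all children of the calling process, each tool should be measured from a fresh Python process to keep the memory figures separate.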
We report time and memory peak, as well as the
number of events detected in the optimal alignment:
= for matches, X for mismatches, I for insertions,
and D for deletions. With no pattern divergence, a
correct alignment must have I = D = X = 0 and
100,000 matches. When the pattern divergence
consists only of SNPs, it must have I = D = 0, and X
should be approximately equal to the number of SNPs
(since, by chance, a substitution may leave the DNA
base unchanged). For the experiments involving INDEL
divergence in the pattern, we also report the number
G of gaps that are opened: with W = 100,000b and
0.1% INDELs, a correct alignment must have G = 100.
For accuracy evaluation, the last line of each
experiment shows the ground truth.
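
The event counts can be derived mechanically from an alignment, as in the following minimal Python sketch. It assumes that the alignment is available as an extended CIGAR string over the operations =, X, I, and D, which is an assumption about the output format and not something each compared tool necessarily produces. The function name count_events is hypothetical, and every maximal run of I or D is counted as one opened gap.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([=XID])")

def count_events(cigar):
    """Count matches (=), mismatches (X), inserted (I) and deleted (D)
    letters, and opened gaps (G) in an extended CIGAR string."""
    ops = CIGAR_OP.findall(cigar)
    # Reject strings containing anything besides <number><op> runs.
    if "".join(run + op for run, op in ops) != cigar:
        raise ValueError(f"not an extended CIGAR string: {cigar!r}")
    counts = Counter({"=": 0, "X": 0, "I": 0, "D": 0, "G": 0})
    for run, op in ops:
        counts[op] += int(run)
        if op in "ID":
            counts["G"] += 1        # each I or D run opens one gap
    return counts

if __name__ == "__main__":
    events = count_events("5=1X3=2I4=1D")
    print(dict(events))
```

For the toy string in the usage lines, the runs add up to 12 matches, 1 mismatch, 2 inserted letters, 1 deleted letter, and 2 opened gaps.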
In all experiments, and for all tools, time was below
0.2 seconds. For both data sets (Tables 1, 2), in the
first experiment (no pattern divergence, that is, an
exact match), DSA always finds (also in the
experiments not shown here) the exact solution with
100,000 matches, 0 mismatches, and 0 gaps (as
GRAPHALIGNER does), taking less memory than
MINIGRAPH and much less
than GRAPHALIGNER. In the second and third exper-
³ The .vcf file format is the standard in bioinformatics
to encode variants such as SNPs (letter substitutions) and
INDELs.
BIOINFORMATICS 2023 - 14th International Conference on Bioinformatics Models, Methods and Algorithms