Backward Pattern Matching on Elastic Degenerate Strings

Petr Proch

azka, Ond

rej Cvacho, Lubo

s Kr

al and Jan Holub

Dep. of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague,

Thakurova 2700/9, Prague 6, Czech Republic

Keywords:

Elastic Degenerate Strings, Genomic Sequences, Degenerate Pattern Matching, Pan-Genomics.

Abstract:

Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced

population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms were

presented so far. Some of them provide practical implementation. We propose a new on-line EDSM algorithm

BNDM-EDS. Our algorithm combines two traditional algorithms BNDM and the Shift-And that were adapted

to the speciﬁcs needed by Elastic Degenerate Strings. BNDM-EDS is running in O(Nmd

e) worst-case time.

This implies O(Nm) time for small patterns, where m is the length of the searched pattern, N is the size of

EDS, and w is the size of the computer word. The algorithm uses O(N + n) space, where n is the length of

EDS. BNDM-EDS requires a simple preprocessing step with time and space O(m). Experimental results on real

genomic data show superiority of BNDM-EDS over state-of-the-art algorithms.

1 INTRODUCTION

Decreasing costs of DNA sequencing enabled a

huge development of projects focused on sequenc-

ing whole populations of individuals of the same

species. The projects cataloguing human genetic

variations include the 1000 Genomes Projects (Con-

sortium, The 1000 Genomes Project, 2011) and the

UK10K project (Consortium, The UK10K, 2015).

Sequenced populations pose a challenge to fast pat-

tern matching algorithms on a set of similar strings.

A common approach to represent the sequenced

population is to store genetic variations of individuals

of the population with respect to the chosen ‘reference

sequence’. This approach is very space efﬁcient since

single human individuals differ only in 0.1% bases on

average. In return, the pattern matching algorithms

performed on this kind of data are confronted by a

few obstacles given by the structure of the data.

During the last two decades, many different for-

mats representing the sequenced population on dif-

ferent level of detail were developed. A consensus

sequence drawn from an entire population in multiple

sequence alignment (MSA) is one of the most concise

forms that the sequenced population can take. The

consensus sequence can be expressed as a degener-

ate string over a degenerate alphabet. IUPAC alpha-

bet (on Biochemical Nomenclature (CBN), 1971) is

used for genomic data. Another more recent format

is EDS (Elastic Degenerate String) (Iliopoulos et al.,

2017). EDS enables to express the variants of dif-

ferent lengths within one degenerate position in the

sequence which provides more accuracy since inser-

tions and deletions can be expressed precisely. Vari-

ation graph (Marschall, 2018; Church et al., 2015;

Dilthey et al., 2015) data structure provides similar

level of accuracy as EDS. It is a graph data struc-

ture where the backbone of the graph represents the

reference genome and other alternatives (variants)

across the population are stored as additional edges.

The 1000 Genomes Projects (Consortium, The 1000

Genomes Project, 2011) stores the sequenced popu-

lations in vcf format (Danecek et al., 2011). The

variant-call-format (vcf) provides even more infor-

mation since the stored variations are addressed to

single individuals of the population.

The idea of Pan-genome poses another challenge

for efﬁcient storage of the sequenced population be-

cause the Pan-genome represents a collection of ge-

nomic sequences that are analyzed together or used

as a reference (Consortium, 2016).

A lot of work has been done on off-line search-

ing in the sequenced populations, i.e. many in-

dex data structures solving this problem were pro-

posed (Proch

azka and Holub, 2014; Maciuca et al.,

2016; Na et al., 2016; Navarro and Pereira, 2016;

Sir

en, 2016). However, the authors of (Grossi et al.,

2017) clearly state the following scenarios where on-

line version is more appropriate: (i) efﬁcient on-line

solutions can be used in combination with partial

Procházka, P., Cvacho, O., Kr

cál, L. and Holub, J.

Backward Pattern Matching on Elastic Degenerate Strings.

DOI: 10.5220/0010243600500059

In Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021) - Volume 3: BIOINFORMATICS, pages 50-59

ISBN: 978-989-758-490-9

(block-oriented) indexes as practical trade-offs; (ii)

efﬁcient on-line solutions for exact pattern matching

can be applied for fast average-case approximate pat-

tern matching; (iii) on-line solutions can be useful

when one wants to search for a set of patterns in elas-

tic degenerate texts.

Elastic Degenerate Strings (EDS) provide an

alternative to represent consensus/pan-genomic se-

quences. Implementation called ElDeS (the ﬁrst on-

line search algoritm on EDS) was provided later satis-

fying the presented upper bound O(N +αγnm), where

α is the maximum number of strings in any elastic-

degenerate symbol of the text and γ is the maximum

number of elastic-degenerate symbols spanned by any

occurrence of the pattern in the text. Constants α and

γ represent the limits of the algorithm. ElDeS com-

bines KMP algorithm (Knuth et al., 1977) and search-

ing in preprocessed sufﬁx trees for potential occur-

rences spanning the degenerate positions in the text.

The sufﬁx trees overhead and their use in recursive

extension of the potential occurrences is the possible

reason of high search time.

Later, a practical implementation of a pattern

matching algorithm in Elastic Degenerate Strings was

given in (Grossi et al., 2017). The proposed EDSM

algorithm works in O(nm

+ N) time. Border ta-

bles (Crochemore et al., 2007) are used to reveal pat-

tern preﬁxes and a sufﬁx tree S T

of the pattern to

reveal the pattern factors. Potential occurrences are

stored in lists and extended over segment borders in

left-to-right fashion. KMP is used in elements longer

than the pattern length m. The authors also present a

bit-vector version of the algorithm where the potential

occurrences are stored within a bit register. The bit-

version of the algorithm EDSM-BV works in O(Nd

time, where w is a size of the computer word (basi-

cally, size of the computer registers, usually w = 64

bits for standard computers). The practical imple-

mentation of the bit-version dominates especially for

longer patterns where the original version of the al-

gorithm suffers from the O(m

) factor. Pissis and

Retha (Pissis and Retha, 2018) presented their on-line

algorithm for a set of patterns of total length M. Their

algorithm achieves O(Nd

e) time with preprocess-

ing time and space O(M).

Recently, Cisłak et al. (Cisłak et al., 2018) pre-

sented SOPanG algorithm solving Elastic Degenerate

String matching problem in the same upper bound

O(Nd

e), however, achieving an order of magni-

tude better time than (Grossi et al., 2017) in prac-

tical experiments. The algorithm performs standard

Shift-Or pattern matching on every element of a

segment. Resulting search registers (after processing

an element) are merged across all elements in each

EDS segment. The register then stores all preﬁxes of

the potential occurrences and the algorithm starts to

process next EDS segment.

Aoyama et al. (Aoyama et al., 2018) proposed an-

other EDSM on-line algorithm based on efﬁcient sum

set computation using Fast Fourier Transform (Coo-

ley and Tukey, 1965). The proposed algorithm

improved the time complexity to O(nm

√

mlog m +

N). Bernardini at al. (Bernardini et al., 2019) de-

signed non-combinatorial O(nm

1.381

+ N)-time algo-

rithm combining three basic ingredients: a string peri-

odicity argument, Fast Fourier Transform and fast ma-

trix multiplication. In addition, a variant of EDSM al-

gorithm allowing errors was presented in (Bernardini

et al., 2017). Unfortunately, we could not compare

BNDM-EDS with the last three algorithms mentioned in

this paragraph due to missing public implementation.

1.1 Our Contribution

We propose a novel on-line algorithm BNDM-EDS solv-

ing EDSM in O(Nmd

e) worst-case time. To the

knowledge of the authors, BNDM-EDS is the ﬁrst back-

ward pattern matching algorithm optimized for EDS.

The algorithm exploits standard BNDM algorithm for

single elements of EDS text. Shift-And (D

olki,

1964) algorithm is used for crossing the borders be-

tween two consecutive EDS segments to reveal the

occurrences starting in one EDS segment and contin-

uing in the following EDS segment. BNDM-EDS is de-

signed to be efﬁcient on real-world data. For small

patterns (m ≤ w), an alphabet size σ, and variability

v, the average time complexity is O(

N(1−v)log

) +

O(Nvm(1 −v)) + O(Nvβδ), where β = O(1) repre-

sents the average number of elements within a degen-

erate EDS segment and δ < m represents the average

length of an element within a degenerate EDS seg-

ment. The assumptions about β and δ are made based

on their values in real datasets. Moreover, BNDM-EDS

beats all its competitors in experiments performed on

real genomic data.

1.2 Roadmap

The rest of the paper is organized as follows. We de-

ﬁne basic notions in Section 2. In Section 3, we pro-

vide exact deﬁnition and description of BNDM-EDS al-

gorithm. Section 4 summarizes experimental results

performed on synthetic and real genomic data. We

give the conclusion and some ideas for future work in

Section 5.

Backward Pattern Matching on Elastic Degenerate Strings

2 BASIC NOTIONS

Let x = x

..x

be a string composed of single sym-

bols x

of a ﬁnite ordered alphabet Σ. The length

of the string x is n = |x|. The size of the alpha-

bet Σ is σ = |Σ| = O(1). The start position i and

the length j deﬁne a factor (or substring) denoted

by x

i, j

= x

..x

i+ j−1

. Factor with i = 0 is called pre-

ﬁx and a factor with i + j −1 = n is called sufﬁx of the

string x. The empty string (of length 0) is denoted by

ε.

Deﬁnition 1. Elastic Degenerate String (Grossi

et al., 2017)

An Elastic Degenerate String (EDS)

X =

X[0]

X[1]...

X[n − 1] of length n on an alphabet

Σ is a ﬁnite sequence of n degenerate symbols. Every

degenerate symbol

X[i], for all 0 ≤ i < n, is a non-

empty set of strings

X[i][ j], with 0 ≤ j < |

X[i]|, where

every

X[i][ j] is a deterministic string on an alphabet

Σ. The degenerate symbols are also called segments.

The strings contained in the set corresponding to a

segment are also called elements.

N is the total size of an EDS and is deﬁned as

N =

n−1

∑

i=0

X[i]|−1

∑

j=0

X[i][ j]|

Example 1. Elastic Degenerate String.

Suppose the following multiple sequence align-

ment (MSA):

G C A A C G G G T A - - A C T

| | | | | | | | | | | | | | |

G C A A C G G G T A T A A C T

| | | × | × | | | | | | | | |

G C A C C T G G - - - - A C T

The resulting Elastic Degenerate String (EDS)

has the following properties:

T =

{

GCA

}





{

}





{

}







TATA







{

ACT

}

N =

∑

i=0

T [i]|−1

∑

j=0

T [i][ j]|= 3+2+1 +2+2+6 +3 = 19

The length of

T corresponds to the number of seg-

ments which is n = 7. The size of

T corresponds to

the sum of the length of all elements which is N = 19.

A solid string Y is a ﬁnite sequence of solid sym-

bols. It means that every position i of the solid string

Y is represented just by one symbol of alphabet Σ.

A solid string Y matches (Grossi et al., 2017) Elas-

tic Degenerate String

X[0][1]...[n −1] of length n > 1

(denoted by Y ≈

X), if and only if string Y can be

decomposed into factors y

...y

n−1

, y

∈ Σ

∗

, such that:

1. there exists a string s ∈

X[0] such that a sufﬁx of s

is y

6= ε;

2. if n > 2, there exists s ∈

X[i], for all 1 ≤ i ≤ n −2,

such that s = y

;

3. there exists a string s ∈

X[n −1] such that a preﬁx

of s is y

n−1

6= ε.

A string Y has an occurrence in an Elastic Degen-

erate String

T ending at position j if there exists a

position i < j such that Y ≈

T [i]...T [ j], or if there ex-

ists an element s ∈

T [ j] such that Y occurs in s. The

string matching in Elastic Degenerate Strings is for-

mally deﬁned as Problem 1.

Problem 1. Elastic Degenerate String Matching

(EDSM) (Grossi et al., 2017) Given a string P of

length m and an Elastic Degenerate String

T of length

n and size N ≥ m, ﬁnd all positions j in

T where at

least one occurrence of P ends.

Example 2. Elastic Degenerate String Matching.

Consider pattern P = AAC of length m = 3. Next,

consider the following Elastic Degenerate String

T of

length n = 7 and size N = 19.

T =

{

GCA

}





{

}





{

}







TATA







{

ACT

}

The occurrences are marked with red color. The

ﬁrst occurrence of P starts at position 0 and it ends

at position 2. In terms of the ﬁrst occurrence the pat-

tern P can be (according to the previous description)

decomposed into the following three factors: p

= A,

= A and p

= C. The second occurrence of P starts

at position 5 and it ends at position 6. The second oc-

currence can be decomposed into the two following

factors: p

= A and p

= AC. Notice that the pat-

tern factor p

is a sufﬁx of two elements

T [5][0] and

T [5][1]. Despite of that, we account only one occur-

rence since by EDSM deﬁnition we accept only dis-

tinct positions in EDS .

3 ALGORITHM

The idea behind BNDM-EDS is simple: to exploit

shifting capabilities of backward pattern matching

algorithms, speciﬁcally “good preﬁx shift” heuris-

tic (Melichar et al., 2005). We chose BNDM (Navarro

and Rafﬁnot, 1998) algorithm because of its sim-

plicity and its bit-parallel approach. We adapted

BNDM for EDS matching problem in the follow-

ing way. The elements

T [i][ j] of each segment

T [i]

are searched independently using a standard variant

of BNDM. Intermediate results (i.e. the discovered

pattern preﬁxes occurring in the last m −1 bases of

each element

T [i][ j]) must be merged at the end of

BIOINFORMATICS 2021 - 12th International Conference on Bioinformatics Models, Methods and Algorithms

each segment

T [i]. The registers storing the inter-

mediate results are transformed to be suitable for the

Shift-And algorithm that is used to process the bor-

ders of the two following segments

T [i] and

T [i + 1].

The Shift-And algorithm initialized with the inter-

mediate results from the previous segment processes

up to the ﬁrst m bases of each element from the fol-

lowing segment

T [i + 1]. The point is to discover all

potential occurrences spanning two consecutive seg-

ments. The Shift-And algorithm was chosen since

BNDM (generally any backward algorithm) could be

forced to process O(

∏

m−1

k=1

T [i −k]|) different strings

when crossing the borders of the segments right-to-

left for pathological EDS (i.e. EDS containing many

consecutive segments with one-base-long elements).

Algorithm 1 describes the idea of BNDM-EDS in

more detail. The algorithm assumes pattern length

smaller than computer word size (m ≤w) for simplic-

ity. The PREPROCESS function performs preprocess-

ing of the pattern P of length m. The ﬁrst while cycle

traverses the pattern from right to left and stores the

corresponding byte values (with active bits set to 1) to

byte vector B of length σ that is necessary for BNDM

part of the algorithm. Byte vector R (of length m)

stores active preﬁxes corresponding to BNDM last

position. The second while cycle traverses the pattern

from left to right and stores the corresponding byte

values to byte vector S that is used for the Shift-And

part of the algorithm. The active bits correspond to

the positions of the indexed character in the pattern P

in both cases (vector B and vector S).

The SEARCH function iterates over individual

EDS segments and their elements (see line 15 and

line 17). At the beginning of each element e,

Shift-And algorithm is performed for the ﬁrst up to

m bases (see lines 18 – 23). Elements shorter than m

bases continue to line 45 where intermediate results

(found pattern preﬁxes) are merged to the register R

BNDM algorithm is performed on elements longer

than m bases starting from the position m. The search

stores the active bits corresponding to the discovered

pattern factors. The condition at line 31 deﬁnes that

either pattern preﬁx (subcondition at line 32) or an oc-

currence (else statement at line 34) was found. Every

discovered pattern preﬁx is also transformed to the

corresponding bit position and stored to D

(see line 33) that is later used for the Shift-And algo-

rithm performed at the end of each element e. BNDM

part of the algorithm is ﬁnished when j + m −1 ex-

ceeds the element boundary (see the while condition

at line 26). At last, the Shift-And algorithm starts at

the position j + m −last and the search register is ini-

tialized with the active preﬁxes discovered in the last

Algorithm 1: BNDM-EDS preprocessing and searching phase.

1: function PREPROCESS(P, m)

2: B[1..σ] ← 0; S[1..σ] ← 0; R [1..m] ← 0;

3: i ←m; F ← 1;

4: while i > 0 do  Building BNDM mask vector

5: B[P

i,1

] ← B[P

i,1

] | F;

6: R [i] ← F;

7: F ← F  1; i ←i −1;

8: i ←1; j ← 1;

9: while i ≤m do  Building Shift-And mask vector

10: S[P

i,1

] ← S [P

i,1

] | j;

11: i ←i + 1;

12: j ← j  1;

13: function SEARCH(

T , P, m)

14: R

← 0; R

← 0; D

← 0; count ← 0; F ← 10

m−1

15: for each segment

T [i] of

T do

16: R

← R

; R

← 0;

17: for each element e ∈

T [i] do

18: j ← 1; D ← R

;

19: while j < m & j < |e| do  Shift-And part for the ﬁrst

up to m bases

20: D ← ((D  1) | 1) & S[e

j,1

];

21: if D & F then

22: count ←count + 1;

23: j ← j +1;

24: if j < m then go to elementEnd;

25: j ← 1;

26: while j + m −1 < |e| do  BNDM part

27: D

← 0; last ← m; i ←m −1; D ← ∼0;

28: while i ≥0 & D 6= 0 do

29: D ← D & B [e

j+i,1

];

30: i ←i −1;

31: if (D & F) 6= 0 then

32: if i ≥0 then

33: last ←i + 1; D

← D

| R [last];

34: else

35: count ←count + 1;

36: D ← D  1;

37: j ← j +last;

38: j ← j +m −last; D ← D

;

39: while j < |e| & D 6= 0 do  Shift-And part for the last

up to m bases

40: D ← ((D  1) | 1) & S[e

j,1

];

41: if D & F then

42: count ←count + 1;

43: j ← j +1;

44: elementEnd:

45: R

← R

| D;

46: return count;

iteration of BNDM (see line 38).

Example 3. BNDM-EDS searching. Figure 1

demonstrates individual steps of BNDM-EDS algo-

rithm. Consider EDS fragment

T and pattern P =

AATAAATA. Each step depicts: (i) the pattern (green

color), (ii) the processed element of

T (blue under-

line), (iii) search register D together with active bits

Backward Pattern Matching on Elastic Degenerate Strings

T = ...CAATAAATAA



ATA



TA...

j + m − 1

last = 7

last = 4

D :

rev

lsb

msb

lsb

msb

(a) Step 1: BNDM.

T = ...CAATAAATAA



ATA



TA...

j + m − 1

last = 7, D

= h0000 0001i

last = 4, D

= h0000 1001i

rev

D :

lsb

msb

lsb

msb

(b) Step 2: BNDM.

T = ...CAATAAATAA



ATA



TA...

D :

lsb

msb

lsb

T = ...CAATAAATAA



ATA



TA...

D :

lsb

msb

lsb

msb

(d) Step 4: Shift-And (beginning of element).

T = ...CAATAAATAA



ATA



TA...

D :

lsb

msb

lsb

msb

(e) Step 5: Shift-And (beginning of element).

T = ...CAATAAATAA



ATA



TA...

D :

lsb

msb

lsb

msb

(f) Step 6: Shift-And (beginning of element).

Figure 1: BNDM-EDS: Searching P = AATAAATA in EDS.

(red color), and (iv) preprocessed byte vectors B and

S. Search register D is depicted for each step. In case

of BNDM, the direction is right to left. In case of the

Shift-And, the direction is left to right. The least

signiﬁcant bit (lsb) is on the top and the most signif-

icant bit (msb) is at the bottom. For the sake of clar-

ity, the example starts inside of the ﬁrst segment with

BNDM part of the algorithm (line 25 of Algorithm 1).

In the ﬁrst step, BNDM discovers the longest preﬁx

AATA which implies the shift by last = 4 bases. In the

second step, BNDM reports an occurrence (see the

diagonal of red bits in the search register D). There

are two other potential occurrences since two pattern

preﬁxes A and AATA are discovered. These preﬁxes

are also stored in register D

= h0000 1001i. The

value of D

comes the initial value of search register D in the

next step. In the third step, the Shift-And algo-

rithm processes the end of the element. In the fourth

step, the Shift-And processes the element ATA and

discovers the second occurrence. The initial value

of search register D is inherited from the previous

segment. In the ﬁfth step, element A is processed

by the Shift-And. Register D values from step 4

and 5 are merged and used as the initial value in the

next step. The beginning of the last depicted element

(TA...) is processed by the Shift-And in the last

step and the last occurrence is reported.

BNDM-EDS time and space complexity can be de-

duced from Algorithm 1. The preprocessing step

consists of two while cycles iterating over bases of

the searched pattern P. This implies worst-case time

O(m) for preprocessing phase. The needed space

comprises of vectors B, S and R . For patterns smaller

that the size of the computer word (m ≤ w), this im-

plies O(σ + m) space. This space can further be re-

duced to O(m) for small alphabets where σ ≤ m.

BNDM-EDS search function consists of three steps

performed on all EDS elements. For each element, the

ﬁrst up to m bases and the last up to m bases are pro-

cessed by the Shift-And. Assuming (based on values

of real datasets) constant number of elements within

one EDS segment, this implies processing up to 2 ×

BIOINFORMATICS 2021 - 12th International Conference on Bioinformatics Models, Methods and Algorithms

n ×m = O(nm) bases for small patterns (m ≤ w) and

O(nmd

e) for large patterns. These two Shift-And

steps are represented as while cycles at lines 19 and

39 of Algorithm 1. The third step is BNDM algorithm

that needs to process all EDS elements (all bases).

This implies processing O(Nm) bases for small pat-

terns (m ≤w) and O(Nmd

e) for large patterns in the

worst case. This step is represented as a while cycle

at line 26 of Algorithm 1. Finally, the total worst-case

time is O(Nm + nm) = O(Nm) for small patterns and

O(Nmd

e+nmd

e) = O(Nmd

e) for large patterns.

We need to deﬁne a few parameters to derive the

average time complexity for BNDM-EDS. Let v ∈ [0, 1]

be the variability of a genomic sequence of length N

bases. It implies that every d

e-th base is degener-

ate. Furthermore, let β = O(1) represents the average

number of elements within a degenerate EDS segment

and let δ < m represents the average length of an el-

ement within a degenerate EDS segment. The num-

ber of EDS segments can be deﬁned by the number

of bases (which approximately equals the EDS size

N) and the variability v: n ≈ 2Nv. The probability of

solid segments of length k bases is deﬁned as: p

v(1 −v)

. Thus, we can state the number of k-base-

long solid segments as:

= Nvp

= Nv

(1 −v)

BNDM part of the algorithm processes only solid seg-

ments (since δ < m) that are longer than m bases.

Shift-And part of the algorithm processes: (i) all

solid segments, (ii) all degenerate segments and their

elements.

BNDM average time complexity is claimed to

be the same as the BDM average time complex-

ity (Navarro and Rafﬁnot, 1998). Crochemore

and Rytter proved BDM average time complex-

ity O(

nlog

) in (Crochemore and Rytter, 1994).

BNDM part of the algorithm processes Nv

(1 −v)

solid segments for k ∈ [m + 1, ∞). This implies

∑

∞

k=m+1

k(1 −v)

processed bases.

∑

∞

k=m+1

k(1 −

can by upper-bounded by

∑

∞

k=1

k(1 −v)

1−v

which means that BNDM part needs to process N(1 −

v) bases. Finally, BNDM part of the algorithm works

in O(

N(1−v)log

) time.

Shift-And part of the algorithm also processes

(1 − v)

solid segments for k ∈ [1, ∞). In ev-

ery solid segment, O(m) bases are processed which

is Nv

∑

∞

k=1

(1 −v)

= Nvm(1 −v) processed bases

overall. Next, Shift-And processes Nv degener-

ate segments which is Nvβδ processed bases over-

all. Finally, Shift-And part of the algorithm

works in O(Nvm(1 −v)) + O(Nvβδ) time. We con-

clude that BNDM-EDS average time complexity is

N(1−v)log

) + O(Nvm(1 −v)) + O(Nvβδ).

4 EXPERIMENTS

We present experimental results comparing BNDM-EDS

with other baseline algorithms performed on EDS

data. Particularly, we compare our algorithm with

ElDeS (Iliopoulos et al., 2017), EDSM-BV (Grossi

et al., 2017) and SOPanG (Cisłak et al., 2018).

All tested algorithms are optimized for EDS data

and were implemented in C/C++ programming lan-

guage. BNDM-EDS implementation is available at

https://github.com/stringology-prague/eds search.

We carried out our tests on Intel



Core

i7-

4740 3.40 GHz, 24 GB RAM, Linux Gentoo kernel

4.19.57. We used compiler gcc version 8.3.0 with

compiler optimization -O3. The tested patterns were

chosen randomly from the input text and their length

m was ranging from 8 to 32. All experiments were

run in loop 100 times and we report the mean running

time in seconds. All reported times represent mea-

sured user time + sys time. The reported times al-

ways include any necessary pattern preprocessing and

exclude the time necessary to read the data in mem-

ory.

We performed our tests on two different data sets.

We include synthetic EDS data from (Grossi et al.,

2017) to achieve better comparability. The synthetic

data set includes 5 ﬁles with n ranging from 100000

to 1 600000. The percentage of degenerate positions

is 10%. The number of elements within one segment

is random and is bounded by 10. The length of each

element at the degenerate position is random and it

is bounded by 10 bases. Table 1 illustrates the re-

sults achieved by the algorithms performed on syn-

thetic data set. The ﬁles of the data set are character-

ized by the following parameters: EDS length n, EDS

size N and variability v. The measured times prove the

dominance of BNDM-EDS and SOPanG that is given by

their simplicity. BNDM-EDS achieved slightly better re-

sults than SOPanG and this difference is growing (up to

15%) with growing pattern length. This performance

gain is caused by BNDM shifting across the searched

text at increased rates for longer patterns. The mea-

sured results provide two observations: (i) all com-

pared algorithms depend on the size of EDS by factor

O(N); (ii) ElDeS and EDSM-BV require large portion

of time for their preprocessing phase. The second ob-

servation is analyzed in more detail in Figure 2. We

measured ratio of preprocessing time and total time

of single algorithms. The results are given in % of the

total time. Figure 2 shows that ElDeS and EDSM-BV

need signiﬁcantly larger portion of the preprocessing

time in comparison to BNDM-EDS and SOPanG. More-

over, the measured ratio of preprocessing time is not

decreasing with the growing ﬁle size for ElDeS.

Backward Pattern Matching on Elastic Degenerate Strings

Table 1: Search time in seconds of BNDM-EDS, ElDeS, EDSM-BV and SOPanG for synthetic data set.

File File params Pattern length BNDM-EDS SOPanG EDSM-BV ELDeS

100000

n = 100 000

N = 361 546

v = 0.095410

m = 8 0.00118260 0.00113000 0.02016180 1.80522000

m = 16 0.00100830 0.00121400 0.02511590 1.67540000

m = 24 0.00099149 0.00113000 0.02863710 1.70129000

m = 32 0.00098746 0.00113000 0.02898240 1.76652000

200000

n = 200 000

N = 725 343

v = 0.095325

m = 8 0.00237368 0.00225000 0.03969590 3.75465000

m = 16 0.00202329 0.00224500 0.04990100 3.94397000

m = 24 0.00198232 0.00225400 0.05714240 3.60240000

m = 32 0.00197887 0.00225400 0.05706520 3.81179000

400000

n = 400 000

N = 1 443 453

v = 0.095015

m = 8 0.00471225 0.00465500 0.07850180 8.18302000

m = 16 0.00402697 0.00457400 0.09871310 7.81520000

m = 24 0.00394175 0.00460800 0.11369500 7.90823000

m = 32 0.00393171 0.00461000 0.11334600 7.92150000

800000

n = 800 003

N = 2 887 787

v = 0.095186

m = 8 0.00944679 0.01000000 0.15683100 16.57840000

m = 16 0.00804698 0.00915000 0.19761300 16.06050000

m = 24 0.00788773 0.00916900 0.22609000 15.96310000

m = 32 0.00786342 0.00915300 0.22549400 16.76730000

1600000

n = 1 600 000

N = 5 769 390

v = 0.095174

m = 8 0.01884609 0.01831300 0.31360500 33.98690000

m = 16 0.01611825 0.01824100 0.39423000 35.46980000

m = 24 0.01581450 0.01827100 0.45437200 33.81830000

m = 32 0.01576465 0.01827700 0.45616800 33.92970000

0.0001

0.001

0.01

0.1

100000 200000 400000 800000 1600000

% preprocessing time (log10 scale)

Files

BNDM-EDS

SOPanG EDSM-BV ELDeS

Figure 2: Preprocessing time / Total time ratio for syn-

thetic data set.

0.001

0.002

0.003

0.004

0.005

0.006

0.007

0.008

0.009

0.01

0.011

0.01 0.03 0.05 0.10 0.15 0.20 0.25 0.30 0.40 0.50

Time [s]

Probability of degenerate segment

BNDM-EDS

SOPanG

Figure 3: BNDM-EDS and SOPanG comparison on ﬁles

with different variation.

The other data set is composed of real-world

human genomic data obtained from 1000 Genomes

Projects (Consortium, The 1000 Genomes Project,

2011) (Phase 3 of the project). We downloaded 24

vcf ﬁles representing 24 human chromosomes (same

population of 2 504 individuals for all chromosomes)

and transformed them into EDS. The percentage of

degenerate positions is approximately 3% on average

and the average length of one element (within a de-

generate position) is 2 bases. We selected representa-

tive chromosomes/ﬁles in terms of their size and vari-

abilty and we show the measured times achieved on

these ﬁles in Table 2. Furthermore, Figure 4 provides

the general comparison of individual algorithms for

all tested chromosomes/ﬁles. Due to large difference

of measured times, the vertical axis is in log

scale.

We can observe stronger BNDM-EDS dominance. This

is given by lower percentage of degenerate positions

which implies that Shift-And part of the algorithm

is not applied so often. This reduces the overhead of

swapping between two algorithms and enables better

utilization of BNDM shifting capability (“good preﬁx

shift” heuristic). Particularly, this can be seen in case

of chromosome Y where the variability among indi-

viduals is minimal and the time difference between

BNDM-EDS and SopANG increased. BNDM-EDS achieves

43% time improvement in comparison to SopANG for

chromosome Y and pattern length m = 8. However,

for longer patterns m = 32, BNDM-EDS achieves 81%

time improvement. Generally, the time achieved for

single chromosomes is given especially by their size

that corresponds to the N element in the time com-

BIOINFORMATICS 2021 - 12th International Conference on Bioinformatics Models, Methods and Algorithms

Table 2: Search time in seconds of BNDM-EDS, ElDeS, EDSM-BV and SOPanG for real data set.

File File params Pattern length BNDM-EDS SOPanG EDSM-BV ELDeS

Chrom1

n = 248 458 108

N = 255 872 195

v = 0.025814

m = 8 0.44913196 0.53431100 7.35913000 1695.14000000

m = 16 0.37187386 0.53086700 8.80502000 1590.15000000

m = 24 0.35176826 0.53060000 9.35136000 1590.61000000

m = 32 0.35678421 0.53099200 9.68736000 1688.14000000

Chrom7

n = 158 984 673

N = 164 380 710

v = 0.029508

m = 8 0.31989103 0.34962400 5.32563000 1070.06000000

m = 16 0.30178564 0.34705100 6.40293000 1077.33000000

m = 24 0.24784241 0.34708700 6.84947000 1035.08000000

m = 32 0.25140232 0.34661600 7.09477000 1068.89000000

Chrom22

n = 50 713 670

N = 52 004 687

v = 0.021605

m = 8 0.07712649 0.10663000 1.30624000 304.21900000

m = 16 0.06338260 0.10615300 1.47754000 298.53900000

m = 24 0.05970330 0.10614700 1.62679000 277.77800000

m = 32 0.05970250 0.10603900 1.63156000 315.12800000

ChromX

n = 155 619 941

N = 157 232 908

v = 0.009292

m = 8 0.22626265 0.30025500 1.96430000 963.23200000

m = 16 0.16053621 0.29899700 2.35361000 903.40400000

m = 24 0.14193202 0.29896900 2.48494000 928.85500000

m = 32 0.14133996 0.29900700 2.55960000 943.67600000

ChromY

n = 26 622 695

N = 26 629 512

v = 0.000245

m = 8 0.02775240 0.04873900 0.08626860 134.60100000

m = 16 0.01600782 0.04860800 0.10365000 142.23600000

m = 24 0.01156795 0.04859900 0.10476000 136.01800000

m = 32 0.00933492 0.04861900 0.10436000 134.60100000

(a) m = 8.

0.01

0.1

100

1000

10000

Chrom1

Chrom2

Chrom3

Chrom4

Chrom5

Chrom6

Chrom7

Chrom8

Chrom9

Chrom10

Chrom11

Chrom12

Chrom13

Chrom14

Chrom15

Chrom16

Chrom17

Chrom18

Chrom19

Chrom20

Chrom21

Chrom22

ChromX

ChromY

Time [s] (log10 scale)

Files

BNDM-EDS

SopANG EDSM-BV ELDeS

(b) m = 16.

0.01

0.1

100

1000

10000

Chrom1

Chrom2

Chrom3

Chrom4

Chrom5

Chrom6

Chrom7

Chrom8

Chrom9

Chrom10

Chrom11

Chrom12

Chrom13

Chrom14

Chrom15

Chrom16

Chrom17

Chrom18

Chrom19

Chrom20

Chrom21

Chrom22

ChromX

ChromY

Time [s] (log10 scale)

Files

BNDM-EDS

SopANG EDSM-BV ELDeS

0.01

0.1

100

1000

10000

Chrom1

Chrom2

Chrom3

Chrom4

Chrom5

Chrom6

Chrom7

Chrom8

Chrom9

Chrom10

Chrom11

Chrom12

Chrom13

Chrom14

Chrom15

Chrom16

Chrom17

Chrom18

Chrom19

Chrom20

Chrom21

Chrom22

ChromX

ChromY

Time [s] (log10 scale)

Files

BNDM-EDS

SopANG EDSM-BV ELDeS

(d) m = 32.

0.001

0.01

0.1

100

1000

10000

Chrom1

Chrom2

Chrom3

Chrom4

Chrom5

Chrom6

Chrom7

Chrom8

Chrom9

Chrom10

Chrom11

Chrom12

Chrom13

Chrom14

Chrom15

Chrom16

Chrom17

Chrom18

Chrom19

Chrom20

Chrom21

Chrom22

ChromX

ChromY

Time [s] (log10 scale)

Files

BNDM-EDS

SopANG EDSM IKP

Figure 4: Search time of BNDM-EDS, ElDeS, EDSM-BV and SOPanG for real data set.

Backward Pattern Matching on Elastic Degenerate Strings

plexity of single algorithms.

Since BNDM-EDS gain in comparison to SopANG

clearly depends on variability in the processed EDS,

we included another experiment comparing both al-

gorithms on synthetic data with different probability

of degenerate EDS segment. Figure 3 depicts the

achieved search time in seconds. The probability of

degenerate EDS segment in single ﬁles starts from

0.01 and continues up to 0.5. The size of the ﬁles

is 1 600000 bases and we searched for a randomly

chosen pattern of length m = 16. The results show

that starting from 0.05 probability of degenerate seg-

ment, the length of processed elements (of the solid

segments) is shorter than the pattern length m = 16.

This implies that BNDM part of BNDM-EDS is not uti-

lized for most of the processed elements.

5 CONCLUSION AND FUTURE

WORK

We proposed BNDM-EDS algorithm which is the ﬁrst

backward pattern matching algorithm for Elastic De-

generate Strings. Its worst-case time is bounded

by O(Nmd

e). Moreover, for small patterns (m ≤

w), BNDM-EDS achieves average time complexity

N(1−v)log

) + O(Nvm(1 −v)) + O(Nvβδ) which

is optimal. Our goal was to design a simple (easy-to-

implement) algorithm achieving fast search times on

real-world data. This expectation was conﬁrmed by

experiments on real-world data set. BNDM-EDS proves

its superiority especially on data with lower degener-

ate rate and for longer patterns.

We plan to extend BNDM-EDS for protein alpha-

bet in future work. Furthermore, we want to focus

on processing more general variants of Elastic De-

generate Strings, such as recursive Elastic Degenerate

Strings or colored Elastic Degenerate Strings (allow-

ing to map elements to individuals of the sequenced

population).

ACKNOWLEDGEMENTS

The research was partially supported by the Czech

Science Foundation as project No. 19-20759S.

REFERENCES

Aoyama, K., Nakashima, Y., I, T., Inenaga, S., Bannai, H.,

and Takeda, M. (2018). Faster Online Elastic Degen-

erate String Matching. In Navarro, G., Sankoff, D.,

and Zhu, B., editors, Annual Symposium on Combi-

natorial Pattern Matching (CPM 2018), volume 105

of Leibniz International Proceedings in Informatics

(LIPIcs), pages 9:1–9:10, Dagstuhl, Germany. Schloss

Dagstuhl–Leibniz-Zentrum fuer Informatik.

Bernardini, G., Gawrychowski, P., Pisanti, N., Pissis, S. P.,

and Rosone, G. (2019). Even faster elastic-degenerate

string matching via fast matrix multiplication. In 46th

International Colloquium on Automata, Languages,

and Programming, ICALP 2019, July 9-12, 2019, Pa-

tras, Greece., pages 21:1–21:15.

Bernardini, G., Pisanti, N., Pissis, S. P., and Rosone, G.

(2017). Pattern matching on elastic-degenerate text

with errors. In Fici, G., Sciortino, M., and Ven-

turini, R., editors, String Processing and Information

Retrieval, pages 74–90, Cham. Springer International

Publishing.

Church, D. M. et al. (2015). Extending reference assembly

models. Genome Biology, 16(13).

Cisłak, A., Grabowski, S., and Holub, J. (2018). Sopang:

online text searching over a pan-genome. Bioinfor-

matics, page bty506.

Consortium, T. C. P.-G. (2016). Computational pan-

genomics: status, promises and challenges. Brieﬁngs

in Bioinformatics, 19(1):118–135.

Consortium, The 1000 Genomes Project (2011). A map

of human genome variation from population-scale se-

quencing. Nature, 473:544 EP –. Corrigendum.

Consortium, The UK10K (2015). The uk10k project identi-

ﬁes rare variants in health and disease. Nature, 526:82

EP –.

Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the

machine calculation of complex fourier series. Math-

ematics of Computation, 19(90):297–301.

Crochemore, M., Hancart, C., and Lecroq, T. (2007). Algo-

rithms on Strings. Cambridge University Press, New

York, NY, USA.

Crochemore, M. and Rytter, W. (1994). Text algorithms.

Oxford University Press.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks,

E., DePristo, M. A., Handsaker, R. E., Lunter, G.,

Marth, G. T., Sherry, S. T., McVean, G., and Durbin,

R. (2011). The variant call format and vcftools. Bioin-

formatics, 27(15):2156–2158.

Dilthey, A. et al. (2015). Improved genome inference in

the mhc using a population reference graph. Nature

Genetics, 47:682.

olki, B. (1964). An algorithm for syntactical analy-

sis. Computational Linguistics, 3:29–46. Hungarian

Academy of Science, Budapest.

Grossi, R., Iliopoulos, C. S., Liu, C., Pisanti, N., Pissis,

S. P., Retha, A., Rosone, G., Vayani, F., and Versari, L.

(2017). On-line pattern matching on similar texts. In

28th Symposium on Combinatorial Pattern Matching,

CPM 2017, July 4-6, 2017, Warsaw, Poland, pages

9:1–9:14.

Iliopoulos, C. S., Kundu, R., and Pissis, S. P. (2017). Ef-

ﬁcient pattern matching in elastic-degenerate texts.

In Drewes, F., Mart

ın-Vide, C., and Truthe, B., ed-

itors, Language and Automata Theory and Applica-

BIOINFORMATICS 2021 - 12th International Conference on Bioinformatics Models, Methods and Algorithms

tions, pages 131–142, Cham. Springer International

Publishing.

Knuth, D. E., Morris, J. H., and Pratt, V. R. (1977). Fast

Pattern Matching in Strings. SIAM Journal on Com-

puting, 6(2):323–350.

Maciuca, S. et al. (2016). A natural encoding of genetic

variation in a burrows-wheeler transform to enable

mapping and genome inference. In Algorithms in

Bioinformatics, pages 222–233, Cham. Springer In-

ternational Publishing.

Marschall, T. (2018). Computational pan-genomics: status,

promises and challenges. Brieﬁngs in Bioinformatics,

19(1):118–135.

Melichar, B., Holub, J., and Polcar, J. (2005). Text Search-

ing Algorithms.

Na, J. C., Kim, H., Park, H., Lecroq, T., L

eonard, M.,

Mouchard, L., and Park, K. (2016). Fm-index of

alignment: A compressed index for similar strings.

Theoretical Computer Science, 638:159 – 170. Pattern

Matching, Text Data Structures and Compression.

Navarro, G. and Pereira, A. O. n. (2016). Faster compressed

sufﬁx trees for repetitive collections. J. Exp. Algorith-

mics, 21:1.8:1–1.8:38.

Navarro, G. and Rafﬁnot, M. (1998). A bit-parallel ap-

proach to sufﬁx automata: Fast extended string match-

ing. In Proceedings of the 9th Annual Symposium

on Combinatorial Pattern Matching, CPM ’98, pages

14–33, London, UK, UK. Springer-Verlag.

on Biochemical Nomenclature (CBN), I.-I. C. (1971). Ab-

breviations and symbols for nucleic acids, polynu-

cleotides and their constituents. Journal of Molecular

Biology, 55(3):299 – 310.

Pissis, S. P. and Retha, A. (2018). Dictionary Match-

ing in Elastic-Degenerate Texts with Applications in

Searching VCF Files On-line. In D’Angelo, G., ed-

itor, 17th International Symposium on Experimental

Algorithms (SEA 2018), volume 103 of Leibniz Inter-

national Proceedings in Informatics (LIPIcs), pages

16:1–16:14, Dagstuhl, Germany. Schloss Dagstuhl–

Leibniz-Zentrum fuer Informatik.

Proch

azka, P. and Holub, J. (2014). Compressing similar

biological sequences using fm-index. In Data Com-

pression Conference, DCC 2014, Snowbird, UT, USA,

26-28 March, 2014, pages 312–321.

Sir

en, J. (2016). Indexing variation graphs. CoRR,

abs/1604.06605.

Backward Pattern Matching on Elastic Degenerate Strings