The aim of this experimental study is to identify a
suitable and preferably fast multiple pattern matching
algorithm for several problem parameters such as a
given biological database, the size of the pattern set,
and the length of the patterns.
2 EXPERIMENTAL METHODOLOGY
The experiments were executed locally on an Intel
Core 2 Duo CPU with a 3.00 GHz clock speed, 2 GB
of memory, 64 KB of L1 cache and 6 MB of L2 cache.
The Ubuntu Linux operating system was used, and only
the typical background processes ran during the
experiments. To decrease random variation, the
reported times are averages over 100 runs. All
algorithms were implemented in the ANSI C programming
language and compiled with the GCC 4.4.3 compiler
using the “-O2” and “-funroll-loops” optimization
flags.
To compare the pattern matching algorithms, the
practical running time was used as a measure.
Practical running time is the total time in seconds
an algorithm needs to find all occurrences of a
pattern in an input string, including any
preprocessing time. It was measured using the
MPI_Wtime function of the Message Passing Interface,
since it has a better resolution than the standard
clock() function.
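The measurement procedure described above can be
sketched as follows. This is only an illustration
under stated assumptions, not the code used in the
experiments: the names search_fn, timer_fn, wall_time,
naive_search and average_time are hypothetical, the
naive matcher is a placeholder for the algorithms under
test, and the C11 timespec_get function stands in for
MPI_Wtime so that the sketch compiles without an MPI
installation.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical interface: a search routine preprocesses the pattern set,
   scans the text and returns the number of occurrences found. */
typedef long (*search_fn)(const char *text, size_t n,
                          const char **patterns, size_t r, size_t m);

/* Hypothetical timer interface: returns wall-clock time in seconds. */
typedef double (*timer_fn)(void);

/* Stand-in timer (assumption): C11 timespec_get instead of MPI_Wtime. */
static double wall_time(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Placeholder matcher: compares every pattern at every text position.
   It only gives the harness something to time. */
static long naive_search(const char *text, size_t n,
                         const char **patterns, size_t r, size_t m)
{
    long count = 0;
    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; i++)
        for (size_t k = 0; k < r; k++)
            if (memcmp(text + i, patterns[k], m) == 0)
                count++;
    return count;
}

/* Average running time over `runs` repetitions. Each repetition times the
   whole call, so any preprocessing done inside `search` is included. */
static double average_time(search_fn search, const char *text, size_t n,
                           const char **patterns, size_t r, size_t m,
                           int runs, timer_fn now, long *matches_out)
{
    double total = 0.0;
    long matches = 0;
    for (int i = 0; i < runs; i++) {
        double start = now();
        matches = search(text, n, patterns, r, m);
        total += now() - start;
    }
    if (matches_out != NULL)
        *matches_out = matches;
    return total / runs;
}
```

In an MPI build, MPI_Wtime could presumably be passed
as the timer argument in place of wall_time once
MPI_Init has been called, since it also takes no
arguments and returns a double.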
The data set was similar to the ones used in (Sheik
et al., 2005) and (Kalsi et al., 2008). It consisted
of the SWISS-PROT Amino Acid sequence database
with a size of n = 182,116,687 characters and an
alphabet of size 20, the FASTA Amino Acid (FAA)
sequence of the A. thaliana genome with a size of
n = 11,273,437 characters and an alphabet of size 20,
and the FASTA Nucleic Acid (FNA) sequence of
the A. thaliana genome with a size of n = 118,100,062
characters and an alphabet of size 4.
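The two quantities reported for each data set, the
text length n and the alphabet size, can be obtained
with a single pass over a FASTA file. The sketch below
is an assumption about how such figures might be
computed, not a description of the procedure actually
used; the names seq_stats and fasta_stats are
hypothetical. Real files may contain additional
symbols (for example ambiguity codes), so the counted
alphabet can be larger than the nominal 20 or 4.

```c
#include <stdio.h>

/* Hypothetical summary of a sequence file. */
struct seq_stats {
    unsigned long length;   /* n: number of sequence characters */
    int alphabet_size;      /* number of distinct characters seen */
};

/* Read FASTA data from fp, skipping header lines (those starting with '>')
   and line terminators, and count sequence characters and distinct symbols. */
static struct seq_stats fasta_stats(FILE *fp)
{
    struct seq_stats s = {0, 0};
    int seen[256] = {0};
    int c;
    int at_line_start = 1;
    int in_header = 0;

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {            /* end of line: reset line state */
            at_line_start = 1;
            in_header = 0;
            continue;
        }
        if (at_line_start && c == '>')
            in_header = 1;          /* description line, not sequence data */
        at_line_start = 0;
        if (in_header || c == '\r')
            continue;
        s.length++;
        if (!seen[c]) {             /* fgetc yields 0..255 here */
            seen[c] = 1;
            s.alphabet_size++;
        }
    }
    return s;
}
```

Taking a FILE pointer rather than a file name keeps the
function usable on any already opened stream.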
3 EXPERIMENTAL RESULTS
In this section, the performance of the algorithms is
evaluated according to their running time for different
biological databases.
Figures 1 to 3 present the running time of the
algorithms, including preprocessing, for the
SWISS-PROT amino acid sequence database and for the
FASTA amino acid and nucleic acid databases of the
A. thaliana genome respectively, for pattern lengths
of m = 8 and m = 32 and for 100 to 100,000 patterns.
As the figures show, varying parameters such as the
size of the pattern set, the length of the patterns
and the size of the alphabet affects the performance
of the algorithms in different ways.
In the case of the SWISS-PROT database and for a
pattern length of m = 8, the SOG and BG algorithms
had the best performance when up to 10,000 patterns
were used, while the SBOM algorithm was faster for
more than 10,000 patterns. When a pattern length of
m = 32 was used, the SOG and BG algorithms had
the fastest running time for up to 20,000 patterns,
while SBOM was faster for more than 20,000 patterns.
The HG and Wu-Manber algorithms had an average
performance for both m = 8 and m = 32, while
Commentz-Walter was consistently the slowest
algorithm in terms of running time.
For the FASTA amino acid database and for a
pattern length of m = 8, the SOG and BG algorithms
were faster when up to 10,000 patterns were
used, while for more than 10,000 patterns the
Wu-Manber algorithm had the best performance, followed
by SBOM. When a pattern length of m = 32 was
used, the SOG and BG algorithms had the fastest
running time for up to 30,000 patterns. For bigger
pattern sets, Wu-Manber was the fastest algorithm.
Commentz-Walter was the algorithm with the worst
performance when m = 8 was used, while for m = 32
the Commentz-Walter and the SBOM algorithms had
the worst performance.
In the case of the FASTA nucleic acid database,
SBOM was consistently the fastest algorithm for a
pattern length of m = 8. When up to 2,000
patterns were used, Commentz-Walter was the slowest
algorithm, while for more than 2,000 patterns HG
was the algorithm with the worst performance. For
a pattern length of m = 32, Wu-Manber was the
fastest algorithm, especially when more than 20,000
patterns were used.
Specific performance comments on the algorithms
follow. Commentz-Walter was the algorithm with the
fastest running time when used on the FASTA nucleic
acid database with a pattern length of m = 32,
especially when between 10,000 and 50,000 patterns
were used. The algorithm had the worst performance
when used on the SWISS-PROT and the
FASTA amino acid databases, and thus its use is not
recommended in general on large alphabets such
as those of amino acid databases. Wu-Manber was the
fastest algorithm on the FASTA amino acid database
when more than 10,000 patterns were used and,
together with the Commentz-Walter algorithm, on the
FASTA nucleic acid database for a pattern length of
m = 32. On the SWISS-PROT database and for a pattern
length of m = 8, the algorithm had a good performance with
EXPERIMENTAL RESULTS ON MULTIPLE PATTERN MATCHING ALGORITHMS FOR BIOLOGICAL
SEQUENCES