2 RELATED WORK
The best-known algorithms for string matching
are those proposed in 1977 by R. Boyer and J.
Moore (BM) (Boyer, 1977) for single-pattern matching
and in 1975 by Aho and Corasick (AC) (Aho, 1975) for
multi-pattern matching. The BM algorithm uses two
heuristics, bad character and good suffix, that
reduce the number of comparisons relative to the
naïve algorithm. BM is not efficient for multi-pattern
matching, because it has to repeat the search
for each pattern. In (Horspool, 1980),
Horspool improved the BM algorithm by proposing
a simpler and more efficient implementation that
uses only the bad-character heuristic.
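The bad-character heuristic can be illustrated with a minimal Python sketch of a Horspool-style search. The function name `horspool_search` and the dictionary-based shift table are assumptions made for this example, not the original implementation.

```python
def horspool_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: for every character of the pattern except the
    # last one, the distance from its last occurrence to the pattern end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text character aligned with the last
        # pattern position; unknown characters allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

Searching for several patterns with this function requires one call per pattern, which is the inefficiency noted above.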
In contrast to BM, the AC algorithm is an
efficient multi-pattern matching algorithm. Based on
the finite-state automata constructed from the set of
patterns, the AC algorithm can search for all the
patterns in one pass. A flurry of works and
enhancements related to the AC algorithm have been
presented, and they are widely used in current
information and communication technology.
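The one-pass behaviour of AC can be sketched as follows. This is a compact illustrative version, assuming non-empty patterns; the names `build_automaton` and `ac_search` and the list-of-dictionaries representation are choices made for this example only.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised when a state is reached
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first traversal to compute the failure links.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def ac_search(text, patterns):
    """Return (start, pattern) pairs found in a single pass over text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Each text character is read once, whatever the number of patterns; the cost of the pattern set is paid when the automaton is built.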
In 2002 Fisk and Varghese (Fisk, 2002) designed
the Set-wise Boyer-Moore-Horspool algorithm. It is
an adaptation of BM to concurrently match a set of
rules. This algorithm is shown to be faster than both
AC and BM for medium-size pattern sets. Their
experiments suggest triggering a different algorithm
depending on the number of rules: Boyer-Moore-
Horspool if there is only one rule; Set-wise Boyer-
Moore-Horspool if there are between 2 and 100
rules, and AC for more than 100 rules. C. J. Coit, S.
Staniford, and J. McAlerney proposed the AC_BM
algorithm (Coit, 2002), which is similar to the Set-
wise Boyer-Moore-Horspool algorithm.
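The selection rule suggested by the experiments of Fisk and Varghese can be written as a small dispatch function. The function name `choose_matcher` and the returned labels are hypothetical; only the thresholds come from the text above.

```python
def choose_matcher(num_rules):
    """Pick a matching algorithm from the number of rules."""
    if num_rules < 1:
        raise ValueError("at least one rule is required")
    if num_rules == 1:
        return "Boyer-Moore-Horspool"
    if num_rules <= 100:
        return "Set-wise Boyer-Moore-Horspool"
    return "Aho-Corasick"
```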
Using the bad-character heuristic introduced in
the BM algorithm, S. Wu and U. Manber designed
in 1994 the WM multi-pattern matching algorithm
(Wu, 1994). WM builds a shift table, indexed by two
or three suffix characters, by preprocessing all
patterns. A hash table on the two-character prefix
indexes a group of patterns and is consulted
when the shift is zero. Finally, a naïve comparison
confirms whether the pattern exists in the text.
WM deals efficiently with large pattern sets, but
its performance depends on the shortest pattern,
because the maximum shift is equal to the length
of the shortest pattern minus one.
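The following simplified Python sketch shows how the shift table, the candidate lookup and the final comparison fit together. It assumes blocks of two characters (the parameter `B`) and groups candidates by suffix block and two-character prefix; this is one simplified reading of the tables described above, and the name `wu_manber_search` is hypothetical.

```python
def wu_manber_search(text, patterns, B=2):
    """Return (start, pattern) pairs; simplified Wu-Manber-style scan."""
    m = min(len(p) for p in patterns)   # length of the shortest pattern
    if m < B:
        raise ValueError("every pattern must have at least B characters")
    default = m - B + 1                 # maximum shift (m - 1 when B = 2)
    shift, buckets = {}, {}
    for p in patterns:
        # Only the first m characters of each pattern are preprocessed.
        for j in range(B, m + 1):
            block = p[j - B:j]
            shift[block] = min(shift.get(block, default), m - j)
        buckets.setdefault((p[m - B:m], p[:B]), []).append(p)
    hits = []
    pos = m                             # end (exclusive) of current window
    while pos <= len(text):
        block = text[pos - B:pos]
        s = shift.get(block, default)
        if s > 0:
            pos += s
            continue
        # Shift is zero: look up candidates, then verify naively.
        start = pos - m
        for p in buckets.get((block, text[start:start + B]), []):
            if text.startswith(p, start):
                hits.append((start, p))
        pos += 1
    return hits
```

With `B = 2` the largest possible shift is `m - 1`, so a single short pattern in the set limits how far the window can advance.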
G. Anagnostakis, E. P. Markatos, S. Antonatos,
and M. Polychronakis proposed the E2XB
algorithm. It is an exclusion-based pattern matching
algorithm (Anagnostakis, 2003) based on the fact
that mismatches are far more common than
matches. The algorithm was designed to provide
quick negatives, that is, to determine quickly that
a pattern does not occur in the input.
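The exclusion idea can be sketched in Python as follows. The text above does not specify how the exclusion test is performed; this sketch assumes two-character elements and uses Python's built-in substring test as the confirmation step. Both choices, and the name `exclusion_search`, are assumptions made for illustration.

```python
def exclusion_search(text, patterns):
    """Return the patterns that occur in text, excluding most cheaply."""
    # Record every two-character element that appears in the input.
    pairs = {text[i:i + 2] for i in range(len(text) - 1)}
    found = []
    for p in patterns:
        # Quick negative: if any element of the pattern is absent from
        # the input, the pattern cannot occur in it.
        if any(p[i:i + 2] not in pairs for i in range(len(p) - 1)):
            continue
        # All elements are present, so a real search must confirm.
        if p in text:
            found.append(p)
    return found
```

The filter never rejects a pattern that is present; it can only let through candidates that the confirmation step then discards.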
3 COMMON SUBSTRINGS PROBLEM
This section reviews the main ideas and definitions
underlying the Common Substrings Problem (CSP)
and the string classification problem. CSP is a
widely known problem on sets of strings. Indeed, the
most frequently asked question about a set of strings
is: what substrings are common to a large number of
strings? This problem is related to the problem of
finding substrings that occur repeatedly in a large
text (Gusfield, 1997). In this case, the large text
is the concatenation of all the strings in the CSP,
so the common substrings are the substrings that
occur repeatedly in the concatenated text, subject
to a distance condition. The CSP can be used in file
comparison, approximate string matching, and
biological applications such as similarity detection
in DNA sequences.
3.1 Formal Definition
The common substring problem can be derived from
the k-common substring problem, which can be
defined as follows:
Let S = {s1, s2, …, sK} be a set of K strings. For
2 ≤ k ≤ K, we have to find the longest substring
common to at least k strings, together with its
length. When k = K, we have the longest common
substring of all the strings.
Example:
S = {athe, heat, athire, athis, wiathis}; K=5
Table 1: k-common substring solution.
k    Length    Substring
2    5         athis
3    4         athi
4    3         ath
5    2         at
The longest substring common to all the strings is “at” (k = K = 5).
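The values of Table 1 can be checked with a brute-force Python sketch that enumerates every substring. It is meant only to make the definition concrete and is far slower than the methods discussed in Section 3.2; the function name `k_common_substrings` is hypothetical.

```python
def k_common_substrings(strings):
    """Map each k in 2..K to (length, longest substrings common to >= k strings)."""
    # Count, for every substring, how many distinct strings contain it.
    count = {}
    for s in strings:
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        for sub in subs:
            count[sub] = count.get(sub, 0) + 1
    result = {}
    for k in range(2, len(strings) + 1):
        common = [sub for sub, c in count.items() if c >= k]
        best = max((len(sub) for sub in common), default=0)
        result[k] = (best, sorted(sub for sub in common if len(sub) == best))
    return result


S = ["athe", "heat", "athire", "athis", "wiathis"]
for k, (length, subs) in k_common_substrings(S).items():
    print(k, length, subs)
```

For the set S of the example, the loop prints one line per value of k, matching the rows of Table 1.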
3.2 Problem Solution
We can locate the length and position of the longest
common substrings either by using the generalized
suffix tree or by dynamic programming (Gusfield,
1997). The running time is, respectively, O(n) and
O(p), where n=Σ|si| and p=∏|si|. We can note that the