Note that a large infrastructure is able to generate up to
100,000 logs per second (Chuvakin et al., 2012).
Although multi-regex matching is an extensively studied
area, e.g. in the networking domain, usable practical
implementations and libraries for common programming
languages and general-purpose processor architectures are
seriously lacking. In this paper we propose an alternative
approach to multi-pattern matching based on a trie data
structure that is very easy to implement, scales well with
respect to the number of matching patterns, and exhibits
very respectable real-world performance.
The rest of this paper is structured as follows. Sec-
tion 2 provides background for the problem at hand.
Section 3 presents our approach for practical multi-
pattern matching. Section 4 evaluates the proposed
approach. Section 5 discusses the results. Section 6
presents a limited amount of related work, and Section 7
concludes the paper.
2 BACKGROUND
In computer science, the term multi-pattern matching
traditionally refers to the goal of finding all occurrences
of a given set of keywords in the input text, e.g. as
addressed by the Aho-Corasick algorithm (Aho and Corasick,
1975). However, in the context of log abstraction the goal
is to exactly match a single pattern from a set of patterns
(message types) against the whole input (log message) and
return the captured matching groups that correspond to the
dynamic parts of the log message. The naïve approach, still
widely used in practice, is to iterate over the set of
matching patterns until the first match is found. Therefore,
in the worst case the whole set must be iterated in order to
match the last pattern. The same obviously applies when no
matching pattern can be found. For hundreds of matching
patterns this quickly becomes a serious limiting factor with
respect to performance. In this regard it is important to
note that our work is focused on results applicable to
general-purpose processor architectures based on (possibly
distributed) COTS multi-core CPUs or virtualized hardware
(i.e. not massively parallel architectures).
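The naïve approach described above can be sketched in a few
lines of Python. This is only an illustration: the message
types, the regexes, and the names PATTERNS and match_naive
are hypothetical and are not taken from any evaluated data
set. The sketch uses full matching against the whole message
and returns the captured groups of the first pattern that
matches.

```python
import re

# Hypothetical message types; names and regexes are illustrative only.
PATTERNS = [
    ("login_ok", re.compile(r"User (\w+) logged in from (\d+\.\d+\.\d+\.\d+)")),
    ("login_failed", re.compile(r"Failed login for (\w+)")),
    ("disk_full", re.compile(r"Disk (\S+) is (\d+)% full")),
]


def match_naive(message, patterns=PATTERNS):
    """Try each pattern in order and stop at the first full match.

    Returns (message_type, captured_groups), or None if nothing matches.
    """
    for name, regex in patterns:
        m = regex.fullmatch(message)  # the whole message must match
        if m:
            return name, m.groups()
    return None  # every pattern was tried without success
```

The cost of one call grows linearly with the number of
patterns: a message of the last type, or a message that
matches nothing, forces an attempt against every regex in
the list.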
A slightly modified generalization of the above-mentioned
problem is the problem of multi-regex matching, i.e. finding
all the patterns in a given set of regexes that match the
input text. Note that finding the exact boundaries of the
matches, i.e. capturing the matching groups, is an
additional problem (Cox, 2010). Formally speaking, every
regular expression can be transformed into a corresponding
finite automaton that can be used to recognize the input.
Using Thompson’s algorithm, an equivalent nondeterministic
finite automaton (NFA) can be constructed, which can in turn
be converted into a deterministic finite automaton (DFA) via
the power-set construction (Hopcroft et al., 2001).
Similarly, given a set of regular expressions, an equivalent
NFA, and subsequently a DFA, can be constructed using the
same method.
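To make the NFA-to-DFA step concrete, the following is a
minimal sketch of the power-set construction. It assumes a
hypothetical dictionary encoding of the NFA (state ->
symbol -> set of next states, with None standing for an
epsilon transition). The example automaton for the regex
a(b|c)* is written by hand rather than produced by
Thompson's algorithm, and all function names are
illustrative.

```python
from collections import deque

EPS = None  # label used for epsilon transitions (an assumption of this sketch)


def eps_closure(nfa, states):
    """Return all NFA states reachable from `states` via epsilon moves."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in nfa.get(q, {}).get(EPS, ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)


def powerset_construction(nfa, start, accepting, alphabet):
    """Build the reachable part of the DFA; each DFA state is a set of NFA states."""
    start_set = eps_closure(nfa, {start})
    dfa, queue = {}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for ch in alphabet:
            moved = {r for q in current for r in nfa.get(q, {}).get(ch, ())}
            if moved:
                target = eps_closure(nfa, moved)
                dfa[current][ch] = target
                queue.append(target)
    dfa_accepting = {s for s in dfa if s & accepting}
    return dfa, start_set, dfa_accepting


def dfa_accepts(dfa, start_set, dfa_accepting, text):
    """Run the DFA: exactly one table lookup per input character."""
    state = start_set
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:  # no transition: dead state
            return False
    return state in dfa_accepting


# Hand-written NFA for a(b|c)*: 0 -a-> 1 -eps-> 2, 2 -b/c-> 3 -eps-> 2
NFA = {0: {"a": {1}}, 1: {EPS: {2}}, 2: {"b": {3}, "c": {3}}, 3: {EPS: {2}}}
DFA, START, ACCEPTING = powerset_construction(NFA, 0, {2}, "abc")
```

For a set of regexes, the same function could be applied to
a combined NFA whose new start state has epsilon transitions
to the start states of the individual automata.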
In general, NFAs are compact in terms of storage space and
number of states; however, the processing complexity is
typically high, since possibly many concurrently active
states must be considered for each input character (all
states in the worst case). For performance-critical
applications NFAs are typically transformed into DFAs, since
DFAs exhibit constant processing complexity for each input
character. Yet, as the set of regular expressions grows, the
space requirements can grow exponentially, a problem known
as state explosion. Thus, for some large regex pattern sets
a practical implementation of DFA matching can be infeasible
due to memory limits. This conundrum still attracts a great
deal of research in many domains, e.g. genomics, biology,
and network intrusion detection. For example, in the
networking domain, two large groups of approaches can be
identified, utilizing either NFA parallelization (e.g. on
FPGAs) or DFA compression techniques (e.g. regex grouping,
alternative representation, hybrid construction, and
transition compression) (Wang et al., 2014).
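State explosion can be observed even for a single regex.
The sketch below uses the classic textbook family
(a|b)*a(a|b)^(n-1), i.e. "the n-th symbol from the end is
a", which is not an example taken from this paper. Its NFA
has n + 1 states, while the power-set construction reaches
2^n DFA states. The function name count_dfa_states and the
state numbering are assumptions of this illustration.

```python
def count_dfa_states(n):
    """Count reachable DFA states for (a|b)*a(a|b)^(n-1).

    NFA states are 0..n: state 0 loops on 'a' and 'b' and also moves
    to state 1 on 'a'; each state i with 1 <= i < n moves to i + 1 on
    either symbol; state n is accepting and has no outgoing moves.
    """
    def step(state_set, ch):
        nxt = set()
        for q in state_set:
            if q == 0:
                nxt.add(0)
                if ch == "a":
                    nxt.add(1)
            elif q < n:
                nxt.add(q + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen, stack = {start}, [start]
    while stack:  # explore all reachable subsets of NFA states
        current = stack.pop()
        for ch in "ab":
            target = step(current, ch)
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return len(seen)


for n in (1, 2, 5, 10):
    print(n + 1, "NFA states ->", count_dfa_states(n), "DFA states")
```

The DFA has to remember which of the last n symbols were a,
hence the 2^n subsets. Combining many regexes into one
automaton can trigger the same kind of growth.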
However, in terms of practical multi-regex matching
implementations suitable for log abstraction, the situation
is unsatisfactory, to say the least. A likely reason is
that, given the relative complexity and individual
limitations of the optimization techniques (NFA/DFA),
combined with the vague implementation details provided by
the researchers, the amount of implementation work needed to
build a fast, full-fledged matching library is likely to be
great. For example, pure DFA-based approaches are unable to
capture matching groups; additional NFA-based matching is
needed, increasing the complexity of the matching engine
(Cox, 2010).
To the best of our knowledge, Google’s RE2 (Cox, 2007; Cox,
2010) is the only available matching library (C++, Linux
only) that supports multi-regex matching. RE2 aims to
eliminate the traditional backtracking algorithm and uses a
so-called lazy DFA in order to limit the state explosion:
the regex is simulated via an NFA, and DFA states are built
only as needed and cached for later use. The implementation
of multi-regex matching returns a vector of matching pattern
ids, since it is possible that multiple regexes match the
input (a likely situation, as we will discuss below).
Therefore, additional logic is needed to select the best
ICSOFT-EA 2016 - 11th International Conference on Software Engineering and Applications