Parallel Finite State Machines for Very Fast Distributable Regular Expression Matching

Luis Quesada, Fernando Berzal and Francisco J. Cortijo

Department of Computer Science and Artiﬁcial Intelligence, CITIC, University of Granada, Granada 18071, Spain

Keywords:

Parallel Finite State Machine, PFSM, Regular Expressions, Automaton, Distributable.

Abstract:

Regular expressions provide a ﬂexible means for matching strings and they are often used in data-intensive

applications. They are formally equivalent to either deterministic ﬁnite automata (DFAs) or nondeterministic

ﬁnite automata (NFAs). Both DFAs and NFAs are affected by two problems known as amnesia and acalculia,

and DFAs are also affected by a problem known as insomnia. Existing techniques require an automata conver-

sion and compaction step that prevents the use of existing automaton databases and hinders the maintenance of

the resulting compact automata. In this paper, we propose Parallel Finite State Machines (PFSMs), which are

able to run any DFA- or NFA-like state machines without a previous conversion or compaction step. PFSMs

report, online, all the matches found within an input string and they solve the three aforementioned problems.

Parallel Finite State Machines require quadratic time and linear memory and they are distributable. Parallel

Finite State Machines make very fast distributed regular expression matching in data-intensive applications

feasible.

1 INTRODUCTION

A regular expression, commonly called regex, pro-

vides a ﬂexible means for matching strings (Aho,

1990) often used in data-intensive applications such

as lexical analysis in language processors (Levine

et al., 1992; Quesada et al., 2011), XML tokenization

(Hosoya and Pierce, 2007), bibliographic search (Aho

and Corasick, 1975), virus detection (Lee, 2007),

and other data mining applications (Han et al., 2006;

Gómez and Vaisman, 2008).

Regex matching relies on either deterministic ﬁ-

nite automata (DFAs) or nondeterministic ﬁnite au-

tomata (NFAs) (Hopcroft et al., 2007). Both DFAs

and NFAs suffer from amnesia and acalculia, and

DFAs also suffer from insomnia (Kumar et al., 2007).

Amnesia is the inability to consider the progress of

multiple partial matches. Acalculia is the inability to

ﬁnd subexpression occurrences. Insomnia is the in-

ability to unload from memory rarely used big por-

tions of automata when not needed.

Several techniques exist that solve those problems

by converting sets of regular expressions into com-

pact state machines (Pasetto et al., 2010; Becchi and

Cadambi, 2007; Kumar et al., 2006; Ficara et al.,

2008). The required conversion or compaction step prevents the use of existing automata that may already be available in antivirus signature databases or complex data filters. It also makes the set of regular expressions harder to modify, as the conversion and compaction steps have to be performed every time the set changes.

In this paper, we propose Parallel Finite State Ma-

chines (PFSMs). A Parallel Finite State Machine is

an automaton that can have multiple active states. It

can be used for efﬁciently ﬁnding all the matches of a

set of regular expressions in a given input string, and

it solves the amnesia and acalculia problems. It also mitigates the effect of insomnia by reducing the number of states in the resulting automaton.

Parallel Finite State Machines do not require con-

version and compaction steps and they can, therefore,

be run on existing DFAs or NFAs. They also allow the

efﬁcient addition or removal of regular expressions

from existing automata.

Parallel Finite State Machines perform regular expression matching in quadratic time and need linear memory, apart from automata storage, in the worst case. Moreover, they allow the regular expression matching process to be parallelized with almost linear scalability in practice.


Quesada L., Berzal F. and J. Cortijo F..

Parallel Finite State Machines for Very Fast Distributable Regular Expression Matching.

DOI: 10.5220/0003949901050110

In Proceedings of the 7th International Conference on Software Paradigm Trends (ICSOFT-2012), pages 105-110

ISBN: 978-989-8565-19-8

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

2 BACKGROUND

A DFA is a tuple $(\Sigma, Q, q_0, \delta, F)$, where:

• $\Sigma$ is the input alphabet (i.e. a finite, non-empty set of symbols).
• $Q$ is a finite, non-empty set of states.
• $q_0$ is the initial state, an element of $Q$.
• $\delta$ is the state-transition function: $\delta : Q \times \Sigma \to Q$.
• $F$ is the set of final states, a subset of $Q$.

In an NFA, the state-transition function is $\delta : Q \times \Sigma \to \mathcal{P}(Q)$. That is, $\delta$ returns a set of states.

NFA-like state machines can be converted into

DFA-like machines by expanding their states and re-

moving any nondeterminism (Sipser, 2005). Each

state in the DFA will correspond to a set of states in

the NFA. Although DFAs are faster than NFAs, a DFA can have up to $O(2^n)$ states, where $n$ is the number of states in the NFA.
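To illustrate this expansion, the following Python sketch applies the textbook subset construction to a small NFA. The dictionary-based encoding, the function name `determinize`, and the example automaton are assumptions made for this illustration; they are not taken from any of the cited implementations.

```python
from collections import deque

def determinize(nfa_delta, start, alphabet):
    """Subset construction: every DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    dfa_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of the NFA targets of every state in the current subset.
            target = frozenset(
                t for q in current for t in nfa_delta.get((q, symbol), ())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, dfa_delta

# Hypothetical NFA over {a, b} accepting strings whose third symbol
# from the end is 'a'; state 3 is its only final state.
nfa_delta = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
    (2, "a"): {3}, (2, "b"): {3},
}
dfa_states, dfa_delta = determinize(nfa_delta, 0, "ab")
print(len(dfa_states))  # 8 reachable DFA states for a 4-state NFA
```

In this example the DFA has to remember the last three symbols, so the 4-state NFA expands to 8 reachable DFA states; adding one more position to the pattern would double that number again.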

Amnesia can be inefﬁciently solved by using a

separate state for each combination of every poten-

tially simultaneous partial match (Kumar et al., 2007).

This is somewhat analogous to the approach of converting NFAs into DFAs, but much more memory-intensive and therefore quite impractical.

Acalculia can be solved by repeating the whole

regex matching process for each input string index as

a starting position and stopping at each possible end-

ing position (Quesada et al., 2011). This solution is

time-intensive and does not allow for the progress of

partial matches to be simultaneously considered, thus

preventing the solution of amnesia.
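The restart-based solution to acalculia can be sketched as follows. This is a minimal illustration that assumes a DFA stored as a dictionary; the function name `all_matches_by_restart` and the example DFA for "a*c" are hypothetical.

```python
def all_matches_by_restart(delta, start, finals, text):
    """Run a DFA from every starting index and report every final state hit."""
    matches = []
    for begin in range(len(text)):
        state = start
        for end in range(begin, len(text)):
            state = delta.get((state, text[end]))
            if state is None:      # dead end: no longer match is possible
                break
            if state in finals:    # report every ending position, not only the longest
                matches.append((begin, end))
    return matches

# Hypothetical DFA for the regular expression a*c.
delta = {(0, "a"): 0, (0, "c"): 1}
matches = all_matches_by_restart(delta, 0, {1}, "aacacab")
print(matches)  # [(0, 2), (1, 2), (2, 2), (3, 4), (4, 4)]
```

The input is rescanned once per starting position and only one partial match is tracked at any time, which is why this approach is time-intensive and leaves amnesia unsolved.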

When the number of states in an automaton in-

creases, as a result of converting a NFA into a DFA

or considering simultaneous partial matches, mem-

ory use skyrockets and insomnia has to be taken into

consideration. Insomnia can be solved by swapping

rarely used portions of automata out to the hard drive

when they have not been used for a while (Kumar

et al., 2007). However, that approach presents two

major drawbacks. When those portions are needed again, there may be a swapping delay. Also, the automata may be so large that even swapping them out to the hard drive is impractical. The most common approach in practice consists of avoiding the expansion

of states (Kumar et al., 2006; Ficara et al., 2008; Bec-

chi and Cadambi, 2007), thus eliminating the major

cause of insomnia.

DotStar (Pasetto et al., 2010) solves amnesia and

acalculia by converting sets of regular expressions

into compact state machines pertaining to a subclass

of DFA with status bits associated to their states.

The technique proposed in (Becchi and Cadambi,

2007) reduces the memory requirement of DFAs by

merging non-equivalent states and labelling their in-

put and output transitions.

D²FAs (Kumar et al., 2006) reduce the memory

requirement of DFAs by assuming default transitions

for all the states. Default transitions differ from ep-

silon transitions in that, in case a state cannot ﬁnd

a transition for a speciﬁc input symbol, the default

transition will be followed and a transition with that

speciﬁc input symbol will be looked for in the target

state. Several default transitions could be followed for

processing a single symbol.
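The lookup through default transitions can be sketched as follows. The table layout, the function name `step`, and the three-state automaton are assumptions for illustration only, and the sketch assumes that every chain of default transitions ends in a state that stores a transition for every symbol.

```python
def step(labeled, default, state, symbol):
    """Consume one symbol, following default transitions until a labeled
    transition for that symbol is found. Returns (next_state, hops)."""
    hops = 0
    while (state, symbol) not in labeled:
        state = default[state]   # default transitions consume no input
        hops += 1
    return labeled[(state, symbol)], hops

# Hypothetical automaton: state 0 stores every transition, whereas
# states 1 and 2 store only those that differ from their default target.
labeled = {
    (0, "a"): 1, (0, "b"): 0, (0, "c"): 0,
    (1, "b"): 2,
    (2, "c"): 0,
}
default = {1: 0, 2: 1}

print(step(labeled, default, 1, "b"))  # (2, 0): stored transition, no default hop
print(step(labeled, default, 2, "a"))  # (1, 2): two default hops for one symbol
```

The second call shows the cost of the memory saving: a single input symbol may require several table lookups.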

δFA (Ficara et al., 2008) is an extension of D²FA that determines some of the results of chaining default transitions and non-default transitions and makes them explicit in the automaton for optimization. It also uses local memory to store the set of transitions going out from a state that is the source of a default transition. If the default transition of the target state leads back to the source state, the transition set is known without having to follow the default transition back to the source state.

These techniques require a conversion or com-

paction step that makes them unable to run on existing

uncompacted DFA- or NFA-like state machines that

could be already available in automaton databases.

Moreover, compaction prevents regular expressions from being added to or removed from the state machine directly.

Therefore, compaction hampers the maintainability of

the automata. It should be noted that the only way to

add or remove regular expressions from the state ma-

chine would involve converting and compacting the

whole new regular expression set, which is costly in

terms of time when the set of regular expressions is

complex.

Lamb (Quesada et al., 2011) is a lexical analyzer

that partially solves acalculia by greedily matching

every pattern starting at every position of the input

string. However, it does not ﬁnd all the possible

submatches, as greedily-matched regular expressions

ﬁnd the longest possible matching and discard any

shorter matchings. That is, even though Lamb ﬁnds

matchings starting at any position in the input string,

it does not find matchings ending at different positions in the input string for the same starting position. This implies that some matchings may indeed

be missed (i.e. Lamb still suffers from acalculia).

3 PARALLEL FINITE STATE MACHINE

In order to solve the amnesia, the acalculia, and the

insomnia problems, we propose the use of Parallel Fi-

nite State Machines (PFSMs).

ICSOFT2012-7thInternationalConferenceonSoftwareParadigmTrends

106

3.1 Deﬁnition

A PFSM is a concurrent automaton in which several

states may be active at the same time and ﬁnal states

include associated labels. This feature enables match-

ing an input string at several starting positions with a

set of regular expressions in order to ﬁnd all the pos-

sible matches. A PFSM generalizes all existing com-

binations of DFA- and NFA-like state machines.

A PFSM is a tuple $(\Sigma, Q, q_0, \delta, F, L, l)$, where:

• $\Sigma$ is the input alphabet.
• $Q$ is a finite, non-empty set of states.
• $q_0$ is the initial state, an element of $Q$.
• $\delta$ is an NFA-like state-transition function: $\delta : Q \times \Sigma \to \mathcal{P}(Q)$.
• $F$ is the set of final states, a subset of $Q$.
• $L$ is the set of labels that identify the different regular expressions.
• $l$ is the state-label function: $l : F \to L$.

A set of DFAs or NFAs representing different reg-

ular expressions can be put together by using an an-

cillary initial state with epsilon transitions to each of

the initial states of the different automata, and by assigning each final state a label that identifies the regular expression it represents. The resulting automata, together with the ancillary initial state and the added epsilon transitions, make up the PFSM. It

should be noted that NFAs can be considered instead

of DFAs for some regular expressions in order to trade

off processing time for memory use.
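This construction can be sketched in a few lines of Python. The sketch assumes that each automaton is given as a tuple of initial state, transition dictionary, and final states; the function name `build_pfsm`, the ancillary state name `"qinit"`, and the state numbering of the three example automata are hypothetical.

```python
def build_pfsm(automata):
    """Combine labeled automata under an ancillary initial state.

    `automata` maps a label (the regular expression) to a tuple
    (initial_state, transitions, final_states). Transitions map
    (state, symbol) to a set of targets, so DFAs and NFAs can be mixed.
    """
    init = "qinit"
    epsilon = {init: set()}   # epsilon transitions out of the ancillary state
    delta = {}
    label_of = {}             # state-label function l : F -> L
    for label, (start, transitions, finals) in automata.items():
        # States are renamed to (label, state) to keep the automata disjoint.
        epsilon[init].add((label, start))
        for (state, symbol), targets in transitions.items():
            delta[((label, state), symbol)] = {(label, t) for t in targets}
        for state in finals:
            label_of[(label, state)] = label
    return init, epsilon, delta, label_of

automata = {
    "a*c": (0, {(0, "a"): {0}, (0, "c"): {1}}, {1}),
    "ac": (0, {(0, "a"): {1}, (1, "c"): {2}}, {2}),
    "a(ca)*b": (0, {(0, "a"): {1}, (1, "c"): {2},
                    (2, "a"): {1}, (1, "b"): {3}}, {3}),
}
init, epsilon, delta, label_of = build_pfsm(automata)
```

Because the automata are only placed side by side, adding or removing a regular expression amounts to adding or removing one entry and its epsilon transition; the other automata are left untouched.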

For example, an automaton that considers the reg-

ular expressions “a*c”, “ac”, and “a(ca)*b” is shown

in Figure 1. When trying to match the input string

“aacacab”, that automaton returns all the matches

shown in Figure 2.

[Figure 1 diagram: states qinit, aq0, aq1, bq0, bq1, bq2, cq0, cq1, cq3.]

Figure 1: A PFSM that compiles the “a*c”, “ac”, and “a(ca)*b” regular expressions.

INPUT: "aacacab"
"a*c" matches [0-2]: "aac"
"a*c" matches [1-2]: "ac"
"ac" matches [1-2]: "ac"
"a*c" matches [2-2]: "c"
"a*c" matches [3-4]: "ac"
"ac" matches [3-4]: "ac"
"a*c" matches [4-4]: "c"
"a(ca)*b" matches [1-6]: "acacab"
"a(ca)*b" matches [3-6]: "acab"
"a(ca)*b" matches [5-6]: "ab"

Figure 2: Results of matching the “aacacab” input string using the PFSM in Figure 1.

As the whole automaton does not need to be converted into a single DFA, it does not have to be expanded with too many states, which cuts down its memory requirements, thus mitigating the impact of insomnia.

Also, as amnesia is solved by allowing several si-

multaneously active states that can represent parallel

partial matchings, it is unnecessary to expand the au-

tomaton with states representing different simultane-

ous partial matchings. This drastically reduces the

impact of insomnia.

Furthermore, as the automaton does not need to

be compacted, regular expressions can be efﬁciently

added or removed from a PFSM. This makes PFSM

maintenance easier.

3.2 Algorithm

A PFSM is initialized by activating the initial state

at the start of the input string. It is also reinitialized

every time an input symbol is going to be processed,

in order to consider simultaneous partial matchings

starting at different positions in the input string.

Whenever the automaton is initialized or reinitial-

ized, the initial state is activated and it is annotated

with a tag that represents the current input string po-

sition, which will be used when a match is completed

in order to know the starting position of the matched

substring. The tag will be propagated unmodiﬁed to

other states when applying transitions, and a state can

include several tags at any given time, which will cor-

respond to multiple simultaneous matches starting at

different input string positions. It should be noted

that, whenever the automaton is reinitialized, any ac-

tive states are kept active.

Using a PFSM, each input symbol is processed in

four steps:

1. The automaton is initialized or reinitialized by ac-

tivating the initial state and tagging it with the cur-

rent input string position.

2. All epsilon transitions from the active states (e.g.

the initial ancillary state) are iteratively applied

and the target states of those transitions are

tagged.

ParallelFiniteStateMachinesforVeryFastDistributableRegularExpressionMatching

107

3. The input symbol is consumed and, from all the

transitions from the active states, only those that

match the current symbol are traversed.

4. The traversed transitions are applied and the target

states of those transitions are tagged.

Figure 3 shows how the PFSM from Figure 1 con-

sumes the two symbols in the “ac” input string.

Whenever a transition is applied, in the second or

fourth step of the algorithm above, the target state is

checked. When it is ﬁnal, a matching has been found.

The matching starting position is given by the ﬁnal

state tag (or tags). The matching ending position is

the current input string position. The matched regular

expression is identiﬁed by the label of the ﬁnal state.
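The four steps above can be sketched in Python as follows. This is an illustrative reading of the algorithm rather than the authors' implementation: the function name `pfsm_matches`, the dictionary encoding, and the state names are assumptions, and the sketch only reports matches after symbol transitions (step 4), which assumes that no regular expression accepts the empty string.

```python
def pfsm_matches(init, epsilon, delta, label_of, text):
    """Return (label, start, end) for every match found in `text`."""
    matches = []
    active = {}   # active state -> set of tags (starting positions)
    for position, symbol in enumerate(text):
        # Step 1: (re)initialize; states that are already active stay active.
        active.setdefault(init, set()).add(position)
        # Step 2: apply epsilon transitions until no new tag is propagated.
        pending = list(active)
        while pending:
            state = pending.pop()
            for target in epsilon.get(state, ()):
                tags = active.setdefault(target, set())
                if not active[state] <= tags:
                    tags |= active[state]
                    pending.append(target)
        # Steps 3 and 4: consume the symbol and tag the target states.
        following = {}
        for state, tags in active.items():
            for target in delta.get((state, symbol), ()):
                following.setdefault(target, set()).update(tags)
        # A tagged final state is a match that ends at the current position.
        for state, tags in following.items():
            if state in label_of:
                matches.extend((label_of[state], start, position) for start in tags)
        active = following
    return matches

# Hypothetical encoding of the PFSM for "a*c", "ac", and "a(ca)*b".
init = "qinit"
epsilon = {"qinit": {"A0", "B0", "C0"}}
delta = {
    ("A0", "a"): {"A0"}, ("A0", "c"): {"A1"},   # a*c
    ("B0", "a"): {"B1"}, ("B1", "c"): {"B2"},   # ac
    ("C0", "a"): {"C1"}, ("C1", "c"): {"C2"},   # a(ca)*b
    ("C2", "a"): {"C1"}, ("C1", "b"): {"C3"},
}
label_of = {"A1": "a*c", "B2": "ac", "C3": "a(ca)*b"}

matches = pfsm_matches(init, epsilon, delta, label_of, "aacacab")
for label, start, end in sorted(matches, key=lambda m: (m[2], m[1], m[0])):
    print(f'"{label}" matches [{start}-{end}]')
```

On the input string "aacacab" the sketch reports the ten matches listed in Figure 2. Each state carries a set of tags instead of being duplicated per starting position, which is what keeps all partial matches alive at once without expanding the automaton.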

3.3 Efﬁciency Analysis

Given an alphabet with $s$ symbols and $r$ regular expressions, where $m$ is the maximum number of states of an automaton representing a regular expression, a PFSM will contain at most $mr + 1$ states. A maximum of $n^2 r$ matchings will be reported in an input string of length $n$.

If regular expressions are implemented as DFAs, the resulting PFSM automaton may contain up to $mr$ states and $mrs$ transitions. At most, $r$ states may be active when considering a single symbol of the input string. As matchings starting at different input string positions are considered, $O(nr)$ states may be active in the PFSM at any given time. The PFSM processes one symbol at a time, so the process is repeated for the $n$ symbols in the input string. Therefore, the efficiency of PFSMs is $O(n^2 r)$ in time and $O(nr + mrs)$ in space.

If regular expressions are implemented as NFAs, the resulting PFSM automaton may contain up to $mr$ states and $m^2 rs$ transitions. At most, $mr$ states may be active when considering a single symbol of the input string. As matchings starting at different input string positions are considered, $O(nmr)$ states may be active in the PFSM at any given time. The PFSM processes one symbol at a time, so the process is repeated for the $n$ symbols in the input string. Therefore, the efficiency of PFSMs is $O(n^2 mr)$ in time and $O(nmr + m^2 rs)$ in space.
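The quadratic time bound can be made explicit with a short derivation. The sketch below assumes, as in the DFA case, that every starting position contributes at most $r$ active states; the symbols $T(n)$ and $S(n)$ for the total work and the memory use are introduced here only for illustration.

```latex
% When the i-th symbol is consumed, i starting positions have been tagged,
% so at most i*r (state, tag) pairs are active and have to be advanced.
\begin{align*}
T(n) &= \sum_{i=1}^{n} O(i\,r)
      = O\!\left(r \cdot \frac{n(n+1)}{2}\right)
      = O(n^2 r), \\
S(n) &= \underbrace{O(n r)}_{\text{tags}}
      + \underbrace{O(m r s)}_{\text{transitions}} .
\end{align*}
```

For NFAs, each starting position can keep up to $mr$ states active, which replaces $r$ by $mr$ in the sum and yields $O(n^2 mr)$.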

Our suggested implementation of PFSMs allows

trading off time for space by expressing some of the

regular expressions as NFAs instead of DFAs.

It should be noted that PFSMs, as described here,

can be implemented as Moore-like machines. A

Mealy-like implementation would yield no reduction

in the number of states and would decrease the efficiency of the PFSM implementation both in terms of

time and space.

3.4 Parallelization

The PFSM is a parallelizable automaton with an al-

most linear scalability.

3.4.1 Regular Expression Partitioning

PFSMs can be partitioned by distributing the set of

regular expressions among the different processors.

Each processor will ﬁnd the matches within a subset

of regular expressions, which reduces the automaton

size at a given processor. By following this approach,

the memory use at a given processor can be reduced

just to the space that is necessary to store a single regular expression. Although the idea of PFSM partitioning is trivial, it is the fact that PFSMs do not compact the regular expression set that allows this kind of parallelization on the fly, in contrast to existing techniques, which need to distribute the regular expression set and compact each subset separately.

Furthermore, when some regular expressions are

expressed as NFAs in order to save memory, it is pos-

sible to distribute both NFA and DFA regular expres-

sions among processors in such a way that memory

use and processing time are balanced across processors, as proposed in (Sun et al., 2010).
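Regular expression partitioning can be sketched as follows. The sketch assumes a simplified PFSM run in which the epsilon transitions of the ancillary state are represented by a list of start states, so that giving a worker a subset of the regular expressions amounts to giving it a subset of that list. The names `run` and `partitioned_run`, the round-robin distribution, and the use of threads as stand-ins for processors are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run(starts, delta, label_of, text):
    """Simplified PFSM run; `starts` are the epsilon targets of the initial state."""
    matches, active = set(), {}
    for position, symbol in enumerate(text):
        for state in starts:                       # reinitialization
            active.setdefault(state, set()).add(position)
        following = {}
        for state, tags in active.items():         # consume the symbol
            for target in delta.get((state, symbol), ()):
                following.setdefault(target, set()).update(tags)
        for state, tags in following.items():      # report tagged final states
            if state in label_of:
                matches.update((label_of[state], s, position) for s in tags)
        active = following
    return matches

delta = {
    ("A0", "a"): {"A0"}, ("A0", "c"): {"A1"},   # a*c
    ("B0", "a"): {"B1"}, ("B1", "c"): {"B2"},   # ac
    ("C0", "a"): {"C1"}, ("C1", "c"): {"C2"},   # a(ca)*b
    ("C2", "a"): {"C1"}, ("C1", "b"): {"C3"},
}
label_of = {"A1": "a*c", "B2": "ac", "C3": "a(ca)*b"}
starts = ["A0", "B0", "C0"]

def partitioned_run(text, workers):
    # Round-robin distribution of the regular expressions among the workers.
    subsets = [starts[k::workers] for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda s: run(s, delta, label_of, text), subsets))
    return set().union(*parts)

result = partitioned_run("aacacab", workers=2)
```

No recompaction is needed to form the subsets. For brevity every worker shares the full transition table here; in a distributed setting each worker would only load the transitions reachable from its own start states.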

3.4.2 Data Partitioning

PFSMs can be partitioned by distributing the input

string among the different processors. Two different

approaches are proposed:

• Lazy Data Partitioning. The input string is scat-

tered among the different processors. Each pro-

cessor will ﬁnd the matches that start within its

segment of the input string and end either in or

after that segment. To achieve this, the automa-

ton reinitialization is disabled after surpassing the

ending position of the segment, but the process-

ing does not stop until the end of the input string

is reached or there are no active states. Each pro-

cessor may need to ask other processors for their

segment in order to ﬁnish its processing.

• Chained Data Partitioning. The input string is

scattered among the different processors. Each

processor will ﬁnd the matches that start and end

within its segment. The tagged active states at the

end of the processing are sent to the next processor, a different set of tagged active states is received from the previous processor and the whole

cycle is repeated (without reinitializations) until

no more tagged active states are received from the

previous processor.

ICSOFT2012-7thInternationalConferenceonSoftwareParadigmTrends

108

[Figure 3 panels: (a) Initialization (input[0] is 'a'). (b) Application of epsilon transitions. (d) Application of transitions. (e) Reinitialization (input[1] is 'c'). (f) Application of epsilon transitions. (g) Transitions matching 'c'. (h) Application of transitions. Three matches ending at index 1 are found.]

Figure 3: Example PFSM running for two full cycles. The final state labels specify which regular expression that state corresponds to. Active states and transitions are shown in bold. The numbers between parentheses are the valid starting positions for the states.

ParallelFiniteStateMachinesforVeryFastDistributableRegularExpressionMatching

109

Data partitioning linearly reduces memory use

with respect to the number of processors. Assuming a

maximum match length of twice the segment length,

data partitioning also reduces processing time linearly

with respect to the number of processors.
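Lazy data partitioning can be sketched as follows. Each call to the hypothetical `run_segment` function plays the role of one processor: it reinitializes the automaton only inside its own segment and keeps consuming symbols past the segment until no state is active. The simplified run (start states instead of explicit epsilon transitions), the names, and the sequential loop over segments are assumptions for illustration; for simplicity every call sees the whole input string instead of requesting later segments on demand.

```python
def run_segment(starts, delta, label_of, text, lo, hi):
    """Find the matches that start in text[lo:hi], wherever they end."""
    matches, active = set(), {}
    for position in range(lo, len(text)):
        if position < hi:          # reinitialize only inside the segment
            for state in starts:
                active.setdefault(state, set()).add(position)
        elif not active:           # past the segment and nothing pending
            break
        following = {}
        for state, tags in active.items():
            for target in delta.get((state, text[position]), ()):
                following.setdefault(target, set()).update(tags)
        for state, tags in following.items():
            if state in label_of:
                matches.update((label_of[state], s, position) for s in tags)
        active = following
    return matches

def lazy_partition(starts, delta, label_of, text, workers):
    size = max(1, -(-len(text) // workers))        # ceiling division
    bounds = [(lo, min(lo + size, len(text))) for lo in range(0, len(text), size)]
    parts = [run_segment(starts, delta, label_of, text, lo, hi) for lo, hi in bounds]
    return set().union(*parts)

delta = {
    ("A0", "a"): {"A0"}, ("A0", "c"): {"A1"},   # a*c
    ("B0", "a"): {"B1"}, ("B1", "c"): {"B2"},   # ac
    ("C0", "a"): {"C1"}, ("C1", "c"): {"C2"},   # a(ca)*b
    ("C2", "a"): {"C1"}, ("C1", "b"): {"C3"},
}
label_of = {"A1": "a*c", "B2": "ac", "C3": "a(ca)*b"}
starts = ["A0", "B0", "C0"]

result = lazy_partition(starts, delta, label_of, "aacacab", workers=3)
```

With three segments, the match of "a(ca)*b" at [1-6] starts in the first segment and ends in the last one; it is found by the first call alone, because that call keeps running past its segment while a tagged state is still active.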

4 CONCLUSIONS AND FUTURE WORK

Regular expressions provide a ﬂexible means for

matching strings and they are often used in data-

intensive applications.

The implementation of regular expression match-

ing typically relies on deterministic ﬁnite automata

(DFAs) or nondeterministic ﬁnite automata (NFAs).

Both DFAs and NFAs are affected by amnesia and

acalculia, and DFAs also suffer from insomnia.

Techniques exist that solve those problems by

converting sets of regular expressions into compact

state machines. This approach, however, presents two

major drawbacks: it prevents the use of existing automata that may already be available in antivirus signature databases or complex data filters; and it hinders

the maintenance of the resulting automata, since the

whole set has to be converted and compacted again

whenever a regular expression has to be added or re-

moved from the set.

We have proposed Parallel Finite State Machines

(PFSMs), which allow multiple active states and efﬁ-

ciently ﬁnd all the matches of regular expressions in

an input string, solving amnesia and acalculia. PF-

SMs also mitigate the effect of insomnia by reducing

the number of states in the resulting automaton.

As PFSMs do not require any conversion or com-

paction step, they are able to run on existing DFA-

or NFA-like machines, and they allow the addition or

removal of regular expressions with zero downtime,

making the maintenance of the automaton easier.

PFSMs can perform regular expression matching

in quadratic time and have linear memory space re-

quirements, apart from automata storage, in the worst

case. On top of that, PFSMs support three different

approaches for parallelization with almost linear scalability in practice.

Therefore, PFSMs make very fast distributed reg-

ular expression matching in data-intensive applica-

tions feasible.

We plan to apply PFSMs to improve the Lamb

lexical analyzer and cure it of its partial acalculia. We also plan to apply PFSMs to data-intensive pattern matching applications.

ACKNOWLEDGEMENTS

Work partially supported by research project

TIN2009-08296.

REFERENCES

Aho, A. V. (1990). Algorithms for ﬁnding patterns in

strings, volume A, pages 255–300. MIT Press, Cam-

bridge, MA, USA.

Aho, A. V. and Corasick, M. J. (1975). Efﬁcient string

matching: An aid to bibliographic search. Commu-

nications of the ACM, 18(6):333–340.

Becchi, M. and Cadambi, S. (2007). Memory-efﬁcient reg-

ular expression search using state merging. In Proc.

of the IEEE INFOCOM’2007, pages 1064–1072.

Ficara, D., Giordano, S., Procissi, G., Vitucci, F., Antichi,

G., and Pietro, A. D. (2008). An improved DFA for

fast regular expression matching. ACM SIGCOMM

Computer Communication Review, 38(5):31–40.

Gómez, L. I. and Vaisman, A. A. (2008). RE-SPaM: Using regular expressions for sequential pattern mining in trajectory databases. In Proc. of the IEEE ICDMW’2008, pages 395–398.

Han, J., Kamber, M., and Pei, J. (2006). Data Mining: Con-

cepts and Techniques. Morgan Kaufmann, 2nd edi-

tion.

Hopcroft, J. E., Motwani, R., and Ullman, J. D. (2007). In-

troduction to Automata Theory, Languages, and Com-

putation. Pearson Education, 3rd edition.

Hosoya, H. and Pierce, B. (2007). Regular expression

pattern matching for XML. In Proc. of the ACM

SIGPLAN-SIGACT POPL’2007, volume 36 of ACM

SIGPLAN Notices, pages 67–80.

Kumar, S., Chandrasekaran, B., Turner, J., and Varghese, G. (2007). Curing regular expressions matching algorithms from insomnia, amnesia and acalculia. In Proc. of the ACM/IEEE ANCS’2007, pages 155–164.

Kumar, S., Turner, J., and Williams, J. (2006). Advanced al-

gorithms for fast and scalable deep packet inspection.

In Proc. of the ACM/IEEE ANCS’2006, pages 81–92.

Lee, T.-H. (2007). Generalized Aho-Corasick algorithm for

signature based anti-virus applications. In Proc. of the

ICCCN’2007, pages 792–797.

Levine, J. R., Mason, T., and Brown, D. (1992). lex & yacc.

O’Reilly, 2nd edition.

Pasetto, D., Petrini, F., and Agarwal, V. (2010). Tools

for very fast regular expression matching. Computer,

43(3):50–58.

Quesada, L., Berzal, F., and Cortijo, F. J. (2011). Lamb: A

lexical analyzer with ambiguity support. In Proc. of

the ICSOFT’2011, volume 1, pages 297–300.

Sipser, M. (2005). Introduction to the Theory of Computa-

tion. Course Technology, 2nd edition.

Sun, Y., Liu, H., Valgenti, V. C., and Kim, M. S. (2010).

Hybrid regular expression matching for deep packet

inspection on multi-core architecture. In Proc. of the

ICCCN’2010, pages 1–7.

ICSOFT2012-7thInternationalConferenceonSoftwareParadigmTrends

110