2000) consider the existence of implicit relationships
between words, symbols, or characters that are close
together in strings. These models need intensive
corpus-based training and they produce results with
associated implicit probabilities. Even though they
can perform well in natural language processing, their
training requirement is impractical for programming
or data representation languages, especially when the
syntactic rules provide all the needed context information
to unequivocally identify tokens. Furthermore,
the results are prone to interpretation errors that
would render the analysis unusable.
The semi-syntactic lexer proposed in (Shyu, 1986)
considers context information found in syntactic
rules, but is not able to capture syntactic ambiguities
for their further consideration.
3 LAMB
In contrast to the aforementioned techniques, Lamb is
able to recognize and capture lexical ambiguities.
Our proposed algorithm takes as input the string to
be scanned and a list of tokens associated with their
corresponding regular expressions. It produces a lexical
analysis graph in which each token is connected to its
following and preceding tokens in the input sequence.
Our algorithm consists of two steps: the scanning
step, which recognizes all the possible tokens in the
input string; and the graph generation step, which
computes the sets of preceding and following tokens
for each token and builds the lexical analysis graph.
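As an illustration of the output structure, a node of the lexical analysis graph could be sketched in Python as follows. The names Token, link, preceding, and following are hypothetical and chosen for this sketch only; the field names id, type, text, start, and end mirror the token constructor used in Figure 1.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based hashing, so tokens can be stored in sets
class Token:
    id: int
    type: str
    text: str
    start: int  # index of the first character in the input string
    end: int    # index of the last character (inclusive)
    # Graph edges: tokens that may come right before / right after this one.
    preceding: set = field(default_factory=set, repr=False)
    following: set = field(default_factory=set, repr=False)


def link(a: Token, b: Token) -> None:
    """Record in the graph that token b follows token a."""
    a.following.add(b)
    b.preceding.add(a)


# Hypothetical ambiguous input "&&": one AND token, or two AMP tokens.
and_tok = Token(0, "AND", "&&", 0, 1)
amp1 = Token(1, "AMP", "&", 0, 0)
amp2 = Token(2, "AMP", "&", 1, 1)
link(amp1, amp2)  # the two-token reading is one path through the graph
```

Both readings of the input stay in the graph, so a later stage can choose between them.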
3.1 The Scanning Step
The algorithm in Figure 1 takes as input a string and
a list of matchers, and produces a list of found tokens
sorted by starting position. These tokens may overlap
in the input string.
Each matcher consists of a regular expression and
its corresponding match method, a priority value, and
a next value.
The match method attempts a match on the input
string at a given starting position and returns the
matched string, or null if there is no match.
The priority value specifies the matcher priority.
The value −1 is reserved for ignored patterns, which
represent irrelevant text. The value 0 is reserved for
tokens that are not affected by priority restrictions.
Priority values 1 or higher represent token priorities:
the lower the value, the higher the priority.
Whenever a token is found, no lower priority tokens
will be looked for within the matched text.
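A matcher as described above could be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the class name Matcher and the constants IGNORED and UNRESTRICTED are hypothetical, regular expressions are assumed to be handled by Python's re module, and rejecting empty matches is an added assumption.

```python
import re
from dataclasses import dataclass

IGNORED = -1       # priority reserved for ignored patterns (irrelevant text)
UNRESTRICTED = 0   # priority reserved for tokens unaffected by priorities


@dataclass
class Matcher:
    type: str                 # token type produced by this matcher
    pattern: str              # regular expression
    prio: int = UNRESTRICTED  # 1 or higher: lower value = higher priority
    next: int = -1            # the next value, with its default of -1

    def match(self, text: str, start: int):
        """Return the string matched at position start, or None."""
        m = re.compile(self.pattern).match(text, start)
        # Assumption: an empty match is treated as no match.
        if m is None or m.end() == start:
            return None
        return m.group(0)


integer = Matcher("Integer", r"[0-9]+", prio=1)
blank = Matcher("Blank", r"\s+", prio=IGNORED)
```

For example, integer.match("ab 42", 3) returns "42", while integer.match("ab 42", 0) returns None.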
for i in 0..input.length()-1:
  prio = -2
  if search[i] == SEARCH:
    anymatch = false
    for each matcher m in matcherlist:
      if (prio == -2 || prio >= m.prio ||
          m.prio == 0) && (prio != -1 &&
          m.next < i):
        match = m.match(input,i)
        if match != null:
          anymatch = true
          prio = m.prio
          end = i+match.length()-1
          if search[end+1] == SKIP:
            search[end+1] = SEARCH
          if m.prio == -1: //ignored pattern
            for k in i..end:
              search[k] = NEVER
          else: //not ignored pattern
            t = new token(id=id,text=match,
                type=m.type,start=i,end=end)
            tokenlist.add(t)
            id++
    if !anymatch:
      if search[i+1] == SKIP:
        search[i+1] = SEARCH
Figure 1: Pseudocode of the scanning step in our lexical
analysis algorithm.
The next value specifies the input position up to which
the matcher is not applied: a match is only attempted at
string indexes greater than it. It defaults to −1.
The search array determines whether each input string
index has to be scanned (SEARCH), skipped (SKIP), or
never scanned (NEVER, i.e. an ignored pattern that
contains it was found). It defaults to SEARCH for
position 0 and SKIP for the rest of them.
The prio variable represents the last priority that
has been matched at the current input position. Its
value is −2 if no match has been found yet, −1 if an
ignored pattern has been matched, and a higher
value if a token of that priority has been found.
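The scanning step of Figure 1 could be transcribed into runnable Python roughly as follows. This is a sketch under stated assumptions rather than the reference implementation: the names scan, Token, and Matcher are hypothetical, Python's re module stands in for the regular expression engine, empty matches are rejected, the search array gets one extra slot so that the index end+1 is always valid, and the next value of every matcher is left at its default because the figure does not show how it is updated.

```python
import re
from dataclasses import dataclass
from typing import NamedTuple

SEARCH, SKIP, NEVER = "SEARCH", "SKIP", "NEVER"


class Token(NamedTuple):
    id: int
    type: str
    text: str
    start: int
    end: int


@dataclass
class Matcher:
    type: str
    pattern: str
    prio: int = 0
    next: int = -1

    def match(self, text, start):
        m = re.compile(self.pattern).match(text, start)
        # Assumption: an empty match counts as no match.
        return m.group(0) if m and m.end() > start else None


def scan(text, matchers):
    n = len(text)
    # Assumption: one extra slot keeps search[end + 1] in bounds.
    search = [SKIP] * (n + 1)
    search[0] = SEARCH
    tokens = []
    for i in range(n):
        if search[i] != SEARCH:
            continue
        prio = -2  # no match found yet at position i
        anymatch = False
        for m in matchers:
            allowed = prio == -2 or prio >= m.prio or m.prio == 0
            if not (allowed and prio != -1 and m.next < i):
                continue
            matched = m.match(text, i)
            if matched is None:
                continue
            anymatch = True
            prio = m.prio
            end = i + len(matched) - 1
            if search[end + 1] == SKIP:
                search[end + 1] = SEARCH  # resume right after the match
            if m.prio == -1:  # ignored pattern
                for k in range(i, end + 1):
                    search[k] = NEVER
            else:  # not an ignored pattern
                tokens.append(Token(len(tokens), m.type, matched, i, end))
        if not anymatch and search[i + 1] == SKIP:
            search[i + 1] = SEARCH
    return tokens


matchers = [
    Matcher("Id", r"[a-z]+"),
    Matcher("Letter", r"[a-z]"),
    Matcher("Blank", r"\s+", prio=-1),
]
tokens = scan("ab c", matchers)
```

With these hypothetical matchers, the input "ab c" yields overlapping tokens: both Id "ab" and Letter "a" start at position 0, which is the kind of lexical ambiguity the graph generation step later organizes.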
This step has a theoretical order of efficiency of
$O(n^2 \cdot l)$, where n is the input string length and
l is the number of matchers in the lexer.
3.2 The Graph Generation Step
The algorithm in Figure 2 goes through the identified
token list in reverse order and efficiently computes the
sets of preceding and following tokens for every token.
The sets of preceding and following tokens of the
token x are defined in Equation 1, where a, b, and c are
tokens and $x_{start}$ and $x_{end}$ are the starting and
ending positions of the token x in the input string.
ICSOFT 2011 - 6th International Conference on Software and Data Technologies