Instrumenting a Context-free Language Recognizer

Paulo Roberto Massa Cereda and João José Neto

Escola Politécnica, Departamento de Engenharia de Computação e Sistemas Digitais, Universidade de São Paulo,

Av. Prof. Luciano Gualberto, s/n, Travessa 3, 158, CEP: 05508-900, São Paulo, SP, Brasil

Keywords:

Context-free Language, Structured Pushdown Automaton, Instrumentation.

Abstract:

Instrumentation plays a crucial role when building language recognizers, as collected data provide a basis for

achieving better performance and model improvements, thus offering a balance between time and space, as

demanded by practical applications. This paper presents a simple yet functional semiautomatic approach for

generating an instrumentation-aware context-free language recognizer, enhanced with hooks, from a grammar

written using the Wirth syntax notation. The entire process is aided by a set of command line tools, freely

available for download. We also introduce the concept of an instrumentation layer enclosing the underlying

recognizer, acting as an observer of each computational step and collecting data for later use.

1 INTRODUCTION

Instrumentation is the capability of monitoring or

measuring performance of a device, as well as trac-

ing information during its life cycle (Wert et al.,

2015). Such metrics allow an accurate understanding of the device’s inner workings and provide a basis

for improvements on the model (Paul and Vahren-

hold, 2013; Ball and Larus, 1994). In general, it

is advisable to combine different metrics in order to

obtain a more comprehensive representation of the

device’s collected data, in an attempt to reduce bias

(which might cause misjudgement of the model as a

whole) (Wert et al., 2015).

Language recognition devices are mechanisms capable of reading strings built from a set Σ of symbols (also known as the language alphabet) and deciding

whether such strings are in the language they de-

scribe (Aho and Ullman, 1995). These devices play

an important role in several areas, including program-

ming languages; context-free language recognizers

are widely used to design parsers (syntactic analy-

sers), which work out the grammatical structure of

strings according to a set of rules. It is highly ad-

visable to have deterministic devices, although that is

not always possible (Sebesta, 2013).

Recognizers need to be reasonably efﬁcient, in

time and space, when analysing a string. Practical ap-

plications demand a balance between these two fac-

tors (Cooper and Torczon, 2011). Hence, understand-

ing the inner workings of such devices and particu-

lar features of the languages for which they are con-

structed is crucial to achieving better performance

and providing model improvements (Ball and Larus,

1994). Designing instrumentation-aware recognizers

allows performance monitoring and information trac-

ing, as well as gathering potential ﬁndings about the

languages themselves and their formation rules.

We present a simple yet functional semiautomatic approach for generating an instrumentation-aware context-free language recognizer from a grammar written using the Wirth syntax notation, as well

as querying the recognizer and collecting instrumen-

tation data based on a set of metrics. The entire pro-

cess is aided by a set of command line tools, freely

available for download.

This paper is organized as follows: Section 2 in-

troduces the basic concepts of a context-free language

recognizer, the Wirth syntax notation used to describe

programming languages, and a process to automate

the generation of a structured pushdown automaton

given a WSN grammar. Section 3 presents the instru-

mentation layer, the set of metrics and its operational

semantics. Conclusions are presented in Section 4.

2 BACKGROUND

In this paper, we will use a structured pushdown au-

tomaton as our recognizer for context-free languages.

We aim at generating a recognizer instance from a

language grammar written using the Wirth syntax notation and then instrumenting it later. The generation will be aided by a set of command line tools written for this purpose. Before we proceed, let us formally introduce the concepts.

Cereda, P. and Neto, J. Instrumenting a Context-free Language Recognizer. DOI: 10.5220/0006212002030210. In Proceedings of the 19th International Conference on Enterprise Information Systems (ICEIS 2017) - Volume 2, pages 203-210. ISBN: 978-989-758-248-6.

The structured pushdown automaton (SPA) (José

Neto and Magalhães, 1981; José Neto, 1993) is a kind

of pushdown automaton composed of a set of mutu-

ally recursive ﬁnite automata, also known as subma-

chines. Unlike the traditional pushdown automaton,

the stack is only used to store references to return

states on each submachine call. Calls and returns consist of transferring control from one submachine to

another; this special transition uses the input symbol

to make a decision on which transition should be ex-

ecuted (the symbol is then consumed in the next tran-

sition) (José Neto, 1993; José Neto, 1994).

A structured pushdown automaton $M$ is defined as $M = (Q, A, \Sigma, \Gamma, P, q_0, Z_0, F)$, in which $Q$ is the set of states, $A$ is the set of submachines (defined below), $\Sigma$ is the automaton alphabet, corresponding to the non-empty set of input symbols, $\Gamma$ is the set of stack symbols, $P$ is the transition relation, $q_0 \in Q$ is the initial state (of the first submachine), $Z_0$ is a special symbol acting as an empty stack marker, and $F \subseteq Q$ is the set of accepting states (of the first submachine) (José Neto and Magalhães, 1981; José Neto, 1993).

A submachine $a_i \in A$ is defined as a traditional finite automaton $a_i = (Q_i, \Sigma_i, P_i, q_{i,0}, F_i)$, in which $Q_i \subseteq Q$ is the set of states of $a_i$, $\Sigma_i \subseteq \Sigma$ is the set of input symbols of $a_i$, $q_{i,0}$ is the entry state of $a_i$, $P_i \subseteq P$ is the transition relation of $a_i$, and $F_i \subseteq F$ is the set of return states of $a_i$.

The transition relation $P$ is defined as $P \subseteq \Gamma \times Q \times \Sigma \times \Gamma \times Q$, in the form $(\gamma g, e, s\alpha) \to (\gamma g', e', \alpha)$, in which $e, e'$ are the current and target states, respectively, $s$ is the consumed symbol, $\alpha$ is the remainder of the input string, $g$ is the current top of the stack, $g'$ is the new top of the stack, and $\gamma$ is the remainder of the stack. A configuration is an element of $Q \times \Sigma^{*} \times \Gamma^{*}$, and a relation $\vdash$ between successive configurations is defined as follows:

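– Symbol consumption: $(q, \sigma w, uv) \vdash (p, w, xv)$, with $p, q \in Q$, $u, x \in \Gamma$, $v \in \Gamma^{*}$, $\sigma \in \Sigma \cup \{\varepsilon\}$, $w \in \Sigma^{*}$, if $\sigma$ was consumed, $x = u$, and $(\gamma, q, \sigma\alpha) \to (\gamma, p, \alpha) \in P$.

– Submachine call: $(q, w, uv) \vdash (r, w, xv)$, with $q, r \in Q$, $u \in \Gamma$, $v, x \in \Gamma^{*}$, $w \in \Sigma^{*}$, $x = pu$, with a call to the submachine $R$, initial state $r$, return in $p$, and $(\gamma, q, \alpha) \to (\gamma p, r, \alpha) \in P$.

– Submachine return: $(q, w, uv) \vdash (p, w, v)$, with $p, q \in Q$, $u, x \in \Gamma$, $v \in \Gamma^{*}$, $w \in \Sigma^{*}$, $u = p$, with submachine return to $p$, and $(\gamma g, q, \alpha) \to (\gamma, g, \alpha) \in P$.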

The language recognized by a structured pushdown automaton $M$ is given by $L(M) = \{ w \in \Sigma^{*} \mid (q_0, w, Z_0) \vdash^{*} (f, \varepsilon, Z_0),\ f \in F \}$.
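The three kinds of step defined above (symbol consumption, submachine call, submachine return) can be read as a small search over configurations. The Python sketch below is not the authors' tooling; it is a minimal simulator under stated assumptions: state names are globally unique, the empty tuple stands for a stack holding only $Z_0$, ε-consumption is left out, and the machine `AE` (states `q0` to `q3`) is a hypothetical hand-built SPA for sums of `a` with nested parentheses.

```python
def accepts(spa, word):
    """Search for a sequence of steps from (entry, word, empty stack)
    to (accepting state, empty input, empty stack)."""
    symbols = word.replace(" ", "")          # blanks are ignored in this sketch
    pending = [(spa["entry"], 0, ())]        # (state, input position, return-state stack)
    seen = set()
    while pending:
        config = pending.pop()
        if config in seen:
            continue
        seen.add(config)
        state, pos, stack = config
        if pos == len(symbols) and not stack and state in spa["accepting"]:
            return True
        # Symbol consumption: read one symbol, stack unchanged.
        if pos < len(symbols):
            target = spa["consume"].get((state, symbols[pos]))
            if target is not None:
                pending.append((target, pos + 1, stack))
        # Submachine call: push the return state, jump to the callee's entry state.
        if state in spa["calls"]:
            callee_entry, return_state = spa["calls"][state]
            pending.append((callee_entry, pos, stack + (return_state,)))
        # Submachine return: pop the return state and resume there.
        if stack and state in spa["returns"]:
            pending.append((stack[-1], pos, stack[:-1]))
    return False


# Hypothetical single-submachine SPA (assumption, not taken from the paper's figures).
AE = {
    "entry": "q0",
    "accepting": {"q1"},
    "returns": {"q1"},
    "consume": {
        ("q0", "a"): "q1",
        ("q0", "("): "q2",
        ("q3", ")"): "q1",
        ("q1", "+"): "q0",
    },
    "calls": {"q2": ("q0", "q3")},   # from q2, call AE (entry q0) and return to q3
}
```

For instance, `accepts(AE, "a + (a + a)")` holds while `accepts(AE, "(a")` does not. The search keeps a set of visited configurations, but it may still fail to terminate for machines that can call themselves repeatedly without consuming input; the example machine only issues a call after reading an opening parenthesis.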

A submachine call can be graphically represented by a transition with double lines, as illustrated in Figure 1. Note that, from a state of submachine $a_i$, execution is transferred to the submachine $a_j$, and the address of the return state is inserted into the top of the stack. In the example, the current state becomes $q_{j,0}$, which is the initial state of the submachine $a_j$.

Figure 1: Example of call to the submachine $a_j$.

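It is important to note that, as a matter of model organization, it is assumed that, for any $a_i, a_j \in A$ with $i \neq j$, $a_i = (Q_i, \Sigma_i, P_i, q_{i,0}, F_i)$ and $a_j = (Q_j, \Sigma_j, P_j, q_{j,0}, F_j)$, we have $Q_i \cap Q_j = \emptyset$ and $P_i \cap P_j = \emptyset$, i.e. sets of states and mappings of submachines are disjoint.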

Automata are devices that, based on a set of for-

mation rules of a language, can decide whether an in-

put string is a valid sentence, i.e. the input string is a

element of the set of all sentences in that language. In

the late 1970s, Wirth (Wirth, 1977) presented a met-

alanguage for describing programming languages, in

an attempt to provide a simplified notation as an alternative to existing initiatives, especially the Backus-Naur

Form (BNF); such metalanguage became known as

Wirth syntax notation (WSN) and has the following

properties:

i) The notation shows a clear distinction between

metasymbols, terminal and nonterminal symbols.

Existing metasymbols are =, ., (, ), [, ], {, },

| and ". A nonterminal symbol is denoted by an

identiﬁer, i.e. one letter followed by zero or more

letters and digits (as a usual variable definition

in a programming language), while the terminal

symbol is expressed by a string enclosed in dou-

ble quotes.

ii) There is no restriction regarding the use of metasymbols as symbols of the language being described.

For example, the metasymbol | differs from ter-

minal symbol "|".

iii) The notation avoids heavy use of recursion to ex-

press simple repetitions by having a construct to

express explicit iteration. Repetition is denoted

by curly brackets.

iv) There is no need to use an explicit symbol to

represent the empty string, such as <empty> in

BNF or ε, because the notation already has con-

structs that address this situation. Optionality is

expressed by square brackets.


According to Wirth (Wirth, 1977), the repeti-

tion is denoted by curly braces, i.e. { a } repre-

sents ε | a | aa | aaa | ... (Kleene star). Op-

tional elements are expressed through square brack-

ets, i.e. [ a ] represents a | ε. Parentheses are used

to represent grouping, i.e. ( a | b ) c represents

ac | bc. Terminal symbols are expressed enclosed

in double quotes; if the double quotes appear as literal

symbols, these are duplicated. Some alternative rep-

resentations express literal double quotes like "\""

instead of """"; Wirth’s original article prefers dupli-

cating double quotes.
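As a small illustration of the constructs just described, the following Python sketch pairs each WSN expression with a regular expression that denotes the same set of strings, and decodes terminals written with doubled quotes. It is only an aid to reading the notation: the names `WSN_AS_REGEX`, `derives` and `terminal_value` are hypothetical, the mapping covers only the three examples from the text, and the decoder does not check that inner quotes are properly doubled.

```python
import re

# WSN expression -> equivalent Python regular expression (single-letter terminals only).
WSN_AS_REGEX = {
    "{ a }": r"(?:a)*",          # repetition: zero or more occurrences
    "[ a ]": r"(?:a)?",          # optionality: zero or one occurrence
    "( a | b ) c": r"(?:a|b)c",  # grouping: ac | bc
}


def derives(wsn_expression, sentence):
    """Check whether the sentence is denoted by one of the example expressions."""
    return re.fullmatch(WSN_AS_REGEX[wsn_expression], sentence) is not None


def terminal_value(literal):
    """Decode a WSN terminal that uses doubled quotes for a literal quote."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("a terminal must be enclosed in double quotes")
    return literal[1:-1].replace('""', '"')
```

With these definitions, `derives("{ a }", "")` and `derives("( a | b ) c", "bc")` both hold, and `terminal_value('""""')` yields a single double-quote character.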

The simplicity of the Wirth syntax notation al-

lows a trivial representation of the grammar ele-

ments as internal and external transitions of the SPA

(symbol consumption and submachine calls, respec-

tively) (José Neto, 1987; José Neto et al., 1999).

Given a grammar written in WSN, we can use the

SPA presented in Figure 2 in order to obtain a result-

ing SPA that recognizes sentences from the language

expressed in the provided grammar (José Neto, 1987;

José Neto et al., 1999; Cereda and José Neto, 2015).

Semantic actions associated with transitions are de-

scribed in Figure 3.

The resulting SPA is potentially nondeterminis-

tic; however, as each submachine is in itself a ﬁ-

nite automaton, the automaton could be translated to

an equivalent deterministic SPA using classic subset

construction algorithms (Cooper and Torczon, 2011;

Sebesta, 2013). Also, each submachine could be re-

duced to an equivalent automaton with a minimum

number of states through minimization (Hopcroft,

1971).
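For readers unfamiliar with the subset construction mentioned above, here is a minimal Python sketch of the textbook algorithm applied to a single nondeterministic finite automaton. It is not claimed to be what wsn2spa implements; the dictionary-based encoding and the function names are assumptions made for illustration.

```python
from collections import deque


def epsilon_closure(states, eps):
    """All states reachable from `states` through empty transitions."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in eps.get(state, ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def subset_construction(start, accepting, delta, alphabet, eps=None):
    """Build a DFA whose states are sets of NFA states.

    delta maps (state, symbol) to a set of states; eps maps state to a set of states.
    """
    eps = eps or {}
    start_set = epsilon_closure({start}, eps)
    dfa, queue = {}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            moved = {t for s in current for t in delta.get((s, symbol), ())}
            if moved:
                target = epsilon_closure(moved, eps)
                dfa[current][symbol] = target
                queue.append(target)
    dfa_accepting = {subset for subset in dfa if subset & set(accepting)}
    return start_set, dfa_accepting, dfa


def dfa_accepts(start_set, dfa_accepting, dfa, word):
    """Run the deterministic automaton on a word."""
    current = start_set
    for symbol in word:
        current = dfa[current].get(symbol)
        if current is None:
            return False
    return current in dfa_accepting


# Example NFA (hypothetical): strings over {a, b} that end in "ab".
NFA_DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
DFA = subset_construction(0, {2}, NFA_DELTA, "ab")
```

Here `dfa_accepts(*DFA, "aab")` holds and `dfa_accepts(*DFA, "abb")` does not. Minimization would be a separate pass over the resulting deterministic automaton and is not shown.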

We will use a command line tool named wsn2spa

in order to automate the SPA generation from a gram-

mar written in WSN; there are options for determin-

istic translation and state minimization as well. The

tool is written in Java and it is released under GPLv3

(the GNU General Public License 3.0). The default

output is a DOT (plain text graph description lan-

guage) ﬁle, but we are also interested in the secondary

format, a YAML (human-friendly data serialization

standard) ﬁle, which provides a textual, structural rep-

resentation of the resulting SPA. We will discuss the

usage later on, in the next section.

3 INSTRUMENTING A

RECOGNIZER

Consider the automation flow presented in Figure 4. From a grammar, written in WSN, representing arithmetic expressions (for simplicity purposes, we are only considering addition and nested parentheses), wsn2spa generates a SPA spec. The language is clearly context-free; valid sentences include a, a + a, (a + a), a + (a + a), and so on. The graphical representation of this specific SPA spec is presented in Figure 5.

Official repository: https://goo.gl/pULqpm

Note that the call to wsn2spa shown in Figure 4

included two optional ﬂags, -c and -m. As the output

indicates, the generated SPA had the submachine AE

translated to its equivalent minimized deterministic finite automaton. The tool also generated a DOT file

representing the submachine AE (and each additional

operation applied to it); the ﬁle can be compiled with

the dot command (from GraphViz). If the SPA had

more submachines, the tool would generate a set of

DOT and YAML ﬁles representing each submachine.

Once we have the SPA spec (possibly composed of individual submachine specs), we can use another helper tool, named spa2run, to submit string queries to the automaton and check whether they are valid sentences in the language the SPA recognizes. The tool is also written in Java and released under GPLv3, just as wsn2spa. The input is a list of submachine specs written in YAML (the first item in the list being the main submachine); once this list is provided, the tool generates executable code on the fly and opens a shell session for querying the automaton. The user can abort the session at any time by pressing a certain combination of keys or by entering the reserved keyword :quit as the input string. Figure 6 shows spa2run in action, as it instantiates the SPA spec generated in Figure 4 into a proper automaton and allows querying it.
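The query session described above follows a familiar read-answer loop. The Python sketch below only mimics that interaction pattern and says nothing about how spa2run is implemented or what it prints; `query_session` and `toy_recognizer` are hypothetical names, and the stand-in recognizer accepts only parenthesis-free sums of `a`.

```python
def toy_recognizer(sentence):
    """Stand-in recognizer (assumption): accepts a, a+a, a+a+a, ... without parentheses."""
    parts = sentence.replace(" ", "").split("+")
    return all(part == "a" for part in parts)


def query_session(recognizer, lines):
    """Answer each query in turn; stop at the reserved keyword ':quit'."""
    answers = []
    for line in lines:
        query = line.strip()
        if query == ":quit":
            break
        answers.append((query, recognizer(query)))
    return answers
```

Passing an iterable of lines instead of reading from the terminal keeps the loop easy to test; an interactive variant could feed it lines read from standard input.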

Now that we have the means to describe a context-free language through WSN, generate its corresponding recognizer (namely, a SPA) and query an automaton instance to check whether an input string is a valid sentence of that language, we are able to go further and instrument the recognizer. But first, let us formally introduce the operational semantics of our instrumentation.

Let us define a set $B = \{ b \mid b : P \to \mathbb{R} \}$ of instrumentation metrics, i.e. a set of functions, each of which takes an element of the transition relation and returns a real value. This approach allows us to simultaneously apply several metrics to the very same recognition instance.
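To make the idea of metrics as functions concrete, the following Python sketch models each metric as a function from a transition to a real number and applies several of them to the same run. The `Transition` encoding, the three example metrics and the additive accumulation (one total per metric, starting at zero) are assumptions of this sketch; the update rule actually used by the instrumentation layer is not fixed by the definition of $B$ alone.

```python
from collections import namedtuple

# Hypothetical encoding of an element of the transition relation P.
Transition = namedtuple("Transition", "kind source symbol target")


def count_steps(transition):
    """Every executed transition counts as one step."""
    return 1.0


def count_calls(transition):
    """Only submachine calls contribute."""
    return 1.0 if transition.kind == "call" else 0.0


def count_consumed(transition):
    """Only symbol-consuming transitions contribute."""
    return 1.0 if transition.kind == "consume" else 0.0


METRICS = {"steps": count_steps, "calls": count_calls, "consumed": count_consumed}


def instrument(metrics, executed):
    """Apply every metric to every executed transition.

    Assumption: each metric value is added to a per-metric total that starts at zero.
    """
    totals = {name: 0.0 for name in metrics}
    for transition in executed:
        for name, metric in metrics.items():
            totals[name] += metric(transition)
    return totals


# Hypothetical trace for the sentence "(a)".
TRACE = [
    Transition("consume", "q0", "(", "q2"),
    Transition("call", "q2", None, "q0"),
    Transition("consume", "q0", "a", "q1"),
    Transition("return", "q1", None, "q3"),
    Transition("consume", "q3", ")", "q1"),
]
```

Calling `instrument(METRICS, TRACE)` evaluates all three metrics over the same trace, which is the point of keeping the metrics as independent functions.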

In order to keep track of each metric, the instrumentation layer (presented in Figure 7) provides a list $V$ of real variables, $|V| = |B|$, such that each variable $v_i \in V$ is associated with a function $b \in B$. At first, $\forall v_i \in V$, $v_i \leftarrow 0$. For each computational step in the

Ofﬁcial repository: https://goo.gl/MvCnQs


Figure 2: SPA that generates another SPA given a grammar written in WSN. (Diagram not reproduced; its recoverable labels are the submachines Grammar and Expression, transitions on nonterminal, terminal and ε, the metasymbols (, [, {, ), ], }, and calls to Expression.)

Semantic action 1

stack.empty(); current := 0; counter := 1

Semantic action 2

stack.push(pair(current, counter))

counter := counter + 1

Semantic action 3

stack.empty()