Data Parsing using Tier Grammars

Alexander Sakharov

and Timothy Sakharov

Verizon, Waltham, Massachusetts, U.S.A.

Northeastern University, Boston, Massachusetts, U.S.A.

Keywords:

Data Preprocessing, Unstructured Data, Data Languages, LL(1) Grammars, Predictive Parsing.

Abstract:

Parsing turns unstructured data into structured data suitable for knowledge discovery and querying. The complexity of grammar notations and the difficulty of grammar debugging limit the availability of data parsers. Tier grammars are defined by simply dividing terminals into predefined classes and then splitting elements of some classes into multiple layered sub-groups. The set of predefined terminal classes can be easily extended. Tier grammars and their extensions are LL(1) grammars. Tier grammars are a tool for big data preprocessing.

1 INTRODUCTION

Knowledge discovery methods focus on structured data such as databases, semi-structured data such as XML, and natural language (NL) documents. Information retrieval is also well researched for structured data (SQL), XML (XQuery), and for NL (search engines). In reality, plenty of data are unstructured, and they are not precisely NL documents. Examples of such unstructured data include log files, dump files, and documents combining NL, codes/abbreviations, references, and numeric data. NL processing methods cannot be efficiently used for these data because the NL that they contain is usually short and mixed with numeric and encoded values. These unstructured data need to be preprocessed to become usable for knowledge discovery or search.

Parsing turns unstructured data into structured data and can also serve as an information extraction utility. Hard-coded parsers are typically used for processing unstructured data. Their implementation is costly and error-prone. These parsers require software updates with every change in data format. Due to these implementation problems, documents combining NL with other kinds of data may even be treated as NL, and the other data, including numbers, become noise. The declarative programming of parsers using grammars partially solves these problems and is a good fit for data preprocessing.

The output of grammar-based parsers is an abstract syntax tree (AST) (Aho et al., 2006), which contains syntactic information extracted from the source. ASTs can be represented as DOM trees or can be converted to XML or JSON. The XML or JSON generated from ASTs is structured data because the set of node tags and the schema are predefined. For the same reason, ASTs can be loaded into relational database tables. Following parsing, knowledge can be extracted from DOM, database tables, XML, or JSON, and queries can be executed. ASTs may provide leverage for information extraction (Tari et al., 2012). Note that fielded search (http://lucene.apache.org) and XQuery and XPath Full Text (http://www.w3.org/TR/xpath-full-text) search can be applied to the transformed data originating from documents including NL fragments.

Context-free grammars (CFGs) are an excellent mechanism for specifying the syntax of and parsing programming languages (Aho et al., 2006), but they are rarely used for data parsing because of the complexity of their notation. Few software developers have experience with CFGs. Creating an unambiguous CFG is a challenge even for experts. The power of available grammar inference methods (Sakakibara, 1997) is not sufficient to handle real-world problems. Note also that the inferred grammars have to be analyzed and their nonterminals have to be mapped to meaningful constructs for further data processing, which is a non-trivial task.

There exist ample differences between programming languages and data languages. In contrast to programming languages, data languages normally have a limited variety of constructs. Data languages mostly consist of aggregation constructs and references. The former represent structures with named fields or sets including maps, i.e. key-value pairs. Data languages are less constraining and strict than programming languages. Almost always, some portions of data somehow diverge from any given standard. Therefore, grammars for defining the syntax of data should be inclusive in order to avoid undesirable exceptions when processing these data. In contrast to programming languages, data formats are plentiful and evolve all the time. It is important, especially for big data, to be able to easily modify data grammars without the danger of compromising their properties. It is also important to be able to parse data using an incomplete grammar because the exact syntax of big data may not be known. Moreover, big data are often syntactically incoherent, and grammars apply to data fragments only. Therefore, a family of tiny grammars may be needed to specify the syntax of data.

Sakharov, A. and Sakharov, T. Data Parsing using Tier Grammars. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 1: KDIR, pages 463-468. ISBN: 978-989-758-158-8

An adequate notation for defining the syntax of data languages should be on par with regular expressions in terms of simplicity and comprehensibility. Unfortunately, regular expressions themselves are not a good choice for defining data languages because of their limited expressiveness and because they do not help build informative ASTs. The use of such notation should not require sophisticated tools for parser generation, and parsing should be feasible in linear time. We introduce a grammar notation that satisfies the above criteria.

This notation has no nonterminals, no grammar productions, and no formulas. A language is defined in this notation by simply dividing terminals into predefined classes. Each class has its role. There could be multiple layered sub-groups within a class. Note that the choice of terminal classes in this notation is not motivated by theoretical considerations but rather is driven by the intent to cover more constructs used in practice, while maintaining a clear meaning for every terminal class.

Our notation defines a subset of LL(1) languages, which makes predictive parsing possible (Aho et al., 2006). These languages are unambiguous, and are devised to be very inclusive. We give a simple characterization of strings belonging to these languages. This notation is rich enough for specifying data formats of various kinds of documents, including machine-generated documents. Our notation facilitates the definition of constructs representing data aggregates and references. This notation is especially beneficial for big data tasks because it enables the quick and easy specification of multiple data formats as well as the modification and augmentation of these specifications. We call this notation tier grammars because their constructs stack according to the priorities of layered terminal groups. Tier grammars can be easily combined and extended in a variety of ways without compromising the LL(1) property.

2 DEFINITION OF TIER LANGUAGES

Following the tradition for programming languages, it is assumed that lexical analysis using regular expressions is done before parsing. The output of lexical analysis is a sequence of tokens whose names are terminals for parsing. As usual, the longest lexeme is selected in case of conflicts (Aho et al., 2006). If the syntax is known for portions of the input, then regular expressions are also used to select these fragments before parsing them.
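The lexical analysis step can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the token classes (`INT`, `FLOAT`, `WORD`, `COMMA`, `NEWLINE`) and the function name `tokenize` are assumptions chosen for a CSV-like input. It shows the longest-lexeme rule: when both `INT` and `FLOAT` match at a position, the longer match wins.

```python
import re

# Hypothetical token classes for a CSV-like input; names are illustrative only.
TOKEN_SPECS = [
    ("INT", r"[0-9]+"),
    ("FLOAT", r"[0-9]+\.[0-9]+"),
    ("WORD", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("COMMA", r","),
    ("NEWLINE", r"\n"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPECS]


def tokenize(text):
    """Return (terminal, lexeme) pairs, choosing the longest lexeme at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t":  # skip blanks between lexemes
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier-listed class when two lexemes tie in length.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no token class matches at offset {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

The token names produced here play the role of terminals for the parsing stage; which class each terminal belongs to is decided separately by the tier grammar.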

Suppose the set of terminals $T$ is a union of disjoint sets $T_b$, $T_r$, $T_e$, $T_m$, $T_s$, $T_p$, $T_c$. $T_b$ is the set of base terminals. Terminals from $T_r$ and $T_e$ define bracketed constructs. Terminals from $T_r$ are opening brackets, and terminals from $T_e$ are closing ones. Terminals from $T_m$ are called markers. These terminals are split into disjoint groups by their priority. Their role is to serve as delimiters that combine items to the left and right of them in groups.

Terminals from $T_s$ are called postfixes, and act as postfix operators. Terminals from $T_p$ are called prefixes; these are unary prefix operators. Terminals from $T_c$ are connectives that serve either as binary operators in expressions or as separators, such as in the comma-separated values format. Prefixes, postfixes, and connectives are also split into disjoint groups by their priority. They share the range of priorities but only one kind of terminals is allowed for a given priority. Let $q$ be the highest priority for markers and $k$ be the highest priority for postfixes, prefixes, and connectives. We use $\bar{\imath}$ to denote the number of distinct markers, postfixes, prefixes, or connectives of priority $i$.

The tier language $\Lambda(T)$ for $T = \{T_b, T_r, T_e, T_m, T_s, T_p, T_c\}$ is defined recursively by the following rules. Understanding these rules does not require any knowledge of CFGs, but tier languages can still be expressed via CFGs. We give CFG productions along with the rules in order to demonstrate how the rules map to them. $S$ will denote the start nonterminal of the corresponding CFG. Symbol $\varepsilon$ will denote the empty string. $T_m^i$, $T_s^i$, $T_p^i$, $T_c^i$ will denote the respective terminals of priority $i$. Note that only one of $T_s^i$, $T_p^i$, $T_c^i$ may be non-empty for any $i$.

1. If $b \in T_b = \{b_1, ..., b_n\}$, then $b \in \Lambda(T)$.

$A \to b_1 | ... | b_n$

2. If $a \in \Lambda(T)$, $r \in T_r = \{r_1, ..., r_u\}$, $e \in T_e = \{e_1, ..., e_v\}$, then $rae \in \Lambda(T)$.

$B \to F S H$; $F \to r_1 | ... | r_u$; $H \to e_1 | ... | e_v$

3. Let either $c_1, ..., c_n \in T_c^i = \{c_1^i, ..., c_{\bar{\imath}}^i\}$ (connective), $p \in T_p^i = \{p_1^i, ..., p_{\bar{\imath}}^i\}$ (prefix), or $s \in T_s^i = \{s_1^i, ..., s_{\bar{\imath}}^i\}$ (postfix). If $a_1, ..., a_n, a_{n+1} \in \Lambda(T)$, $a_1, ..., a_n, a_{n+1}$ are defined by rules 1, 2, or this rule for terminals of higher priority, $a' \in \Lambda(T)$, $a'$ is defined by rules 1, 2, or this rule for terminals of the same or higher priority, then $a_1 c_1 ... a_n c_n a_{n+1} \in \Lambda(T)$, $p a' \in \Lambda(T)$, or $a_1 s \in \Lambda(T)$.

$C \to A | B$

postfix:

$E_i \to E_{i+1} G_i$ (for $i = 1, ..., k-1$); $E_k \to C G_k$

$G_i \to \varepsilon | s_1^i | ... | s_{\bar{\imath}}^i$

prefix:

$E_i \to E_{i+1} | p_1^i E_i | ... | p_{\bar{\imath}}^i E_i$ (for $i = 1, ..., k-1$)

$E_k \to C | p_1^k E_k | ... | p_{\bar{k}}^k E_k$

connective:

$E_i \to E_{i+1} L_i$ (for $i = 1, ..., k-1$); $E_k \to C L_k$

$L_i \to \varepsilon | c_1^i E_{i+1} L_i | ... | c_{\bar{\imath}}^i E_{i+1} L_i$ (for $i = 1, ..., k-1$)

$L_k \to \varepsilon | c_1^k C L_k | ... | c_{\bar{k}}^k C L_k$

4. If $m_1, ..., m_n \in T_m^i = \{m_1^i, ..., m_{\bar{\imath}}^i\}$, $a_0, ..., a_n \in \Lambda(T)$, then $\varepsilon \in \Lambda(T)$, $a_0 ... a_n \in \Lambda(T)$, $a_0 m_1 a_1 ... m_n a_n \in \Lambda(T)$ provided that this string follows the beginning of the input string, a terminal from $T_r$, or a marker of lower priority, and precedes the end of the input string, a terminal from $T_e$, or a marker of lower priority.

$Q_i \to Q_{i+1} R_i$ (for $i = 1, ..., q-1$); $Q_q \to D R_q$

$R_i \to \varepsilon | m_1^i Q_{i+1} R_i | ... | m_{\bar{\imath}}^i Q_{i+1} R_i$ (for $i = 1, ..., q-1$)

$R_q \to \varepsilon | m_1^q D R_q | ... | m_{\bar{q}}^q D R_q$

$D \to \varepsilon | E_1 D$

Now we only need to add one more production to complete the definition of the corresponding CFG: $S \to Q_1$. The above context-free productions have to be slightly modified when some terminal sets are empty. In case the sets of connectives, postfixes, and prefixes are all empty: $E_1 \to C$. In case the set of markers is empty: $Q_1 \to D$.

These terminal classes are suitable for various representations of data aggregates and references: prefixes, postfixes, and connectives for named fields in structures; brackets for structures, including recursive ones; markers and connectives for sets, including multi-dimensional arrays; connectives for key/value pairs; prefixes and brackets for references. Rule applications define parse trees for tier languages. Applications of rule 1 constitute the terminal nodes of these parse trees. Every application of all other rules corresponds to a nonterminal node of the parse tree.

Parse trees for tier grammars are similar to the ASTs of the underlying CFG. One difference is that one node in a tier grammar parse tree combines all associated connectives or markers. Tier grammar parse trees are very compact, which is especially important for big data. This translates into compact XML or database representations with simple schemas. XPath is widely used for expressing queries and information extraction wrappers (Dalvi et al., 2011) for HTML. It can fulfill the same purposes for tier grammar parse trees due to their simplicity.
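As a rough illustration of this point, the sketch below converts a small hand-built parse tree into XML and queries it with the XPath subset supported by Python's standard `xml.etree.ElementTree`. The tuple layout `(kind, text, children)`, the element names, and the `op` attribute are assumptions made for this example; the paper does not prescribe a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree for two comma-separated records split by a newline marker.
# One node holds all connectives (or markers) of a group, as described above.
tree = ("marker", "\\n", [
    ("connective", ",", [("base", "id", []), ("base", "name", [])]),
    ("connective", ",", [("base", "1", []), ("base", "Ann", [])]),
])


def to_xml(node):
    """Turn a (kind, text, children) tuple into an XML element."""
    kind, text, children = node
    elem = ET.Element(kind)
    if children:
        elem.set("op", text)  # the terminal that groups the children
        for child in children:
            elem.append(to_xml(child))
    else:
        elem.text = text      # leaf: the lexeme of a base terminal
    return elem


root = to_xml(tree)
# Second field of every record, selected with a positional XPath predicate.
second_fields = [e.text for e in root.findall("./connective/base[2]")]
```

Because every tier grammar yields the same small set of node kinds, a query such as `./connective/base[2]` stays short regardless of the data format being parsed.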

3 EXAMPLES

Typical data dump formats such as CSV and other formats for multidimensional arrays can be easily specified as tier grammars. The same applies to the output of many Unix commands and of many command-line tools. Here are a couple of other simple examples of data formats that can be parsed using tier grammars. In these examples, \b denotes a space and \n denotes a new line character.

1. BibTeX format (please see its specification at http://www.bibtex.org)
Base terminals: words and quoted strings
Brackets: { }
Prefix: words starting with @ (priority 4)
Connectives: # (priority 3); = (priority 2); , (priority 1)

2. Documents with numbered sections
Markers: .\b ?\b !\b .\n ?\n !\n (priority 3); \n\n (priority 2); section numbers defined as lexemes \n[0-9]* (priority 1)
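A tier grammar of this kind is just a classification of terminals, so it can be held in a small data structure. The following Python sketch is one possible encoding, not a format defined in the paper: the class name `TierGrammar`, its field names, and the token name `AT_WORD` (standing for words starting with @) are all assumptions. It encodes the BibTeX example and checks the two constraints stated in Section 2: the classes are disjoint, and prefixes, postfixes, and connectives never share a priority.

```python
from dataclasses import dataclass, field


@dataclass
class TierGrammar:
    """Terminal classes of a tier grammar; operator tables map priority -> terminals."""
    open_brackets: set = field(default_factory=set)
    close_brackets: set = field(default_factory=set)
    markers: dict = field(default_factory=dict)
    prefixes: dict = field(default_factory=dict)
    postfixes: dict = field(default_factory=dict)
    connectives: dict = field(default_factory=dict)

    def validate(self):
        """Raise ValueError if classes overlap or a priority hosts two operator kinds."""
        seen = set()
        groups = [self.open_brackets, self.close_brackets]
        for table in (self.markers, self.prefixes, self.postfixes, self.connectives):
            groups.extend(table.values())
        for group in groups:
            overlap = seen & group
            if overlap:
                raise ValueError(f"terminal in two classes: {sorted(overlap)}")
            seen |= group
        # Prefixes, postfixes, and connectives share one priority range.
        used = set()
        for table in (self.prefixes, self.postfixes, self.connectives):
            clash = used & set(table)
            if clash:
                raise ValueError(f"priority with two operator kinds: {sorted(clash)}")
            used |= set(table)

    def classify(self, terminal):
        """Return (class name, priority); anything unlisted is a base terminal."""
        if terminal in self.open_brackets:
            return ("open", None)
        if terminal in self.close_brackets:
            return ("close", None)
        tables = (("marker", self.markers), ("prefix", self.prefixes),
                  ("postfix", self.postfixes), ("connective", self.connectives))
        for name, table in tables:
            for priority, terminals in table.items():
                if terminal in terminals:
                    return (name, priority)
        return ("base", None)


# The BibTeX example; token names such as "AT_WORD" are made up for this sketch.
bibtex = TierGrammar(
    open_brackets={"{"},
    close_brackets={"}"},
    prefixes={4: {"AT_WORD"}},
    connectives={3: {"#"}, 2: {"="}, 1: {","}},
)
bibtex.validate()
```

Treating every unlisted terminal as a base terminal mirrors the inclusive intent of the notation: adding or removing a terminal from a class is a one-line change to the specification.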

Machine-generated human-readable files are the main source of examples of tier languages. The output of Apache's ReflectionToStringBuilder (http://commons.apache.org) is one example. Let us look at some code fragments that generate log files. The following code patterns demonstrate why log files or some parts contained therein are usually tier languages.

print(<opening bracket>);
loop: { ... print(<data>); ... }
print(<closing bracket>);

function f(...) { print(<opening bracket>); ... f(...); ... print(<closing bracket>); return; }

loop: { ... case ...: print(<prefix>); print(<data>); ... }

loop: { ... print(<data>); if ( ... ) print(<postfix>); ... }

loop: { ... print(<data>); print(<connective>); print(<data>); ... }

loop: { if ( !first ) print(<connective>); ... print(<data>); ... }

loop: { loop: { loop: { ... print(<data>); ... } ... print(<high priority marker>); ... } ... print(<low priority marker>); ... }

4 ANALYSIS

Proposition 1. Tier grammars define LL(1) languages.


The availability of matching LL(1) grammars makes table-driven predictive parsing (Aho et al., 2006) possible for tier languages. Predictive parsing has a linear time complexity. The uniformity of tier languages with respect to predictive parsing is an essential benefit because most questions about CFGs are undecidable. Note that $S \Rightarrow^{*} N$ for every nonterminal $N$ from tier grammar parse trees. This is an indication of the inclusiveness of tier grammars. Tier languages are unambiguous.

LL(1) parsing does not require any parser generator tools. A parser can be implemented as a couple of library functions. One of them builds a parsing table, and the other parses the input. In the case of gigantic documents, parsing can be implemented via callbacks, as is done in the SAX API for XML in Java (http://www.saxproject.org):

void parse(LexemeStream stream, EventHandler handler);

where class EventHandler has callback methods for base terminals and brackets (terminals from $T_b$, $T_r$, $T_e$), as well as for prefixes, postfixes, connectives, and markers. The latter methods are called when the corresponding nonterminal is popped from the stack.
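The callback style can be illustrated with a deliberately reduced Python sketch. It is not the parser described above: it handles only base terminals, brackets, and markers, it reports each token as it is read instead of reporting operator nodes when a nonterminal is popped, and the names `EventHandler`, `parse`, and `MaxDepth` and the `(kind, lexeme)` token shape are assumptions of this example. What it does show is the key property of the SAX approach: the input is consumed as a stream and no tree is kept in memory.

```python
class EventHandler:
    """Callback interface; subclasses override the events they care about."""

    def base(self, lexeme):
        pass

    def open_bracket(self, lexeme):
        pass

    def close_bracket(self, lexeme):
        pass

    def marker(self, lexeme):
        pass


def parse(stream, handler):
    """Consume (kind, lexeme) pairs one at a time and report them to the handler."""
    depth = 0
    for kind, lexeme in stream:
        if kind == "base":
            handler.base(lexeme)
        elif kind == "open":
            depth += 1
            handler.open_bracket(lexeme)
        elif kind == "close":
            if depth == 0:
                raise ValueError("closing bracket without an opening one")
            depth -= 1
            handler.close_bracket(lexeme)
        elif kind == "marker":
            handler.marker(lexeme)
        else:
            raise ValueError(f"unsupported token kind: {kind}")
    if depth != 0:
        raise ValueError("unclosed opening bracket")


class MaxDepth(EventHandler):
    """Example handler: counts base tokens and tracks the deepest bracket nesting."""

    def __init__(self):
        self.depth = 0
        self.max_depth = 0
        self.base_count = 0

    def base(self, lexeme):
        self.base_count += 1

    def open_bracket(self, lexeme):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def close_bracket(self, lexeme):
        self.depth -= 1
```

A handler such as `MaxDepth` needs memory proportional only to what it chooses to record, which is the reason callbacks suit very large documents.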

The set of tier languages is a proper subset of

LL(1) languages. It includes languages that are not

regular. For instance, the language {a

|n ≥ 0} is

one such example. Since tier languages are designed to be as inclusive as possible, they do not even include some restrictive regular languages. For instance, the language defined by the regular expression $(ab)^{*}$ and any language with a finite set of distinct strings are not tier languages. If a tier grammar does not have brackets, then it defines a regular language.

Since the tier grammar notation does not involve any kind of formulas, terminals can only serve as tags giving a particular syntactic meaning to neighboring items or to strings starting or ending with them. Prefixes give a syntactic meaning to the item to the right. Postfixes do the same for the item to the left. A connective glues together the two items adjacent to it. Markers group items on the left and on the right. Brackets define construct borders. Altogether, they cover the more important cases.

The following simple characterization of tier languages shows that every tier language includes a wide variety of strings. This helps avoid parsing exceptions.

Proposition 2. A string belongs to a given tier language if and only if the following conditions hold:
- brackets are balanced, i.e. the number of opening brackets in the string is equal to the number of closing brackets, and the number of opening brackets is greater than or equal to the number of closing brackets in any prefix substring
- every postfix follows a base token, closing bracket, or another postfix of a higher priority
- every prefix precedes a base token, opening bracket, or prefix of the same or higher priority
- every connective follows a base token, closing bracket, or postfix of a higher priority and precedes a base token, opening bracket, or prefix of a higher priority
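The conditions of Proposition 2 translate directly into a membership test over classified tokens. The Python sketch below is an illustration under stated assumptions: each token is a `(kind, priority)` pair with `kind` one of `base`, `open`, `close`, `marker`, `prefix`, `postfix`, `connective`, a larger number means a higher priority, and the function and helper names are made up here. It checks exactly the four listed conditions and nothing else.

```python
def _follows_ok(prev, priority):
    """Left neighbour allowed for a postfix or connective of the given priority."""
    if prev is None:
        return False
    kind, prev_priority = prev
    return kind in ("base", "close") or (kind == "postfix" and prev_priority > priority)


def _precedes_ok(nxt, priority, allow_equal):
    """Right neighbour allowed for a prefix (allow_equal=True) or a connective."""
    if nxt is None:
        return False
    kind, next_priority = nxt
    if kind in ("base", "open"):
        return True
    if kind != "prefix":
        return False
    return next_priority > priority or (allow_equal and next_priority == priority)


def in_tier_language(tokens):
    """Check the conditions of Proposition 2 on a list of (kind, priority) tokens."""
    depth = 0
    for kind, _ in tokens:
        if kind == "open":
            depth += 1
        elif kind == "close":
            depth -= 1
            if depth < 0:      # more closing than opening brackets in a prefix
                return False
    if depth != 0:
        return False
    for i, (kind, priority) in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if kind in ("postfix", "connective") and not _follows_ok(prev, priority):
            return False
        if kind == "prefix" and not _precedes_ok(nxt, priority, allow_equal=True):
            return False
        if kind == "connective" and not _precedes_ok(nxt, priority, allow_equal=False):
            return False
    return True


# Token classes of: @article { key , title = "a" # "b" }
bibtex_entry = [("prefix", 4), ("open", None), ("base", None), ("connective", 1),
                ("base", None), ("connective", 2), ("base", None),
                ("connective", 3), ("base", None), ("close", None)]
```

The check runs in one pass over the bracket counts and one pass over neighbouring tokens, so it is linear in the length of the input.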

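Corollary 1. If $T'_r$, $T'_e$, $T'_m$, $T'_s$, $T'_p$, $T'_c$ are subsets of $T_r$, $T_e$, $T_m$, $T_s$, $T_p$, $T_c$, respectively, $s \in \Lambda(\{T_b, T_r, T_e, T_m, T_s, T_p, T_c\})$, and terminals from $T_r \setminus T'_r$ are balanced with terminals from $T_e \setminus T'_e$ in $s$, then $s \in \Lambda(\{T_b \cup T'_r \cup T'_e \cup T'_m \cup T'_s \cup T'_p \cup T'_c,\ T_r \setminus T'_r,\ T_e \setminus T'_e,\ T_m \setminus T'_m,\ T_s \setminus T'_s,\ T_p \setminus T'_p,\ T_c \setminus T'_c\})$.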

This corollary guarantees that parsing with incomplete syntax will work. The extension of syntax usually amounts to assigning other roles to some of the base terminals. Another corollary of Proposition 2 is that all strings belong to every tier language containing only base terminals and markers.

5 EXTENDING AND COMBINING TIER GRAMMARS

If the expressiveness of tier grammars is not sufﬁ-

cient, they can be easily extended. One extension

is the addition of preﬁxes of arity more than one.

This extension is introduced by the following context-

free production: E

→ E

i+1

|pE

...E

where the num-

ber of E

is the arity of p. Another typical exten-

sion is a construct deﬁned by a terminal pair: E

→

i+1

. It may also be useful to add other

types of preﬁxes. These other preﬁxes share pri-

orities with markers, as opposed to connectives and

postﬁxes. They are introduced by the following pro-

duction: Q

→ Q

i+1

|...|p

. One more ex-

ample is a construct with three constituents, where

the third one is optional. This construct is deﬁned

by the following productions: E

→ p

i+1

→ ε|p

i+1

We specify a class of productions that can be added to tier grammars to form extensions. All aforementioned examples belong to this class. The four following types of productions guarantee that any extension defined by them is an LL(1) grammar. Let α denote a string of terminals and/or nonterminals.

1. $E_i \to \alpha_1 | ... | \alpha_n$
where $E_i$, $E_{i+1}$, and $S$ are the only nonterminals that may occur in any $\alpha_j$, all $\alpha_j$ start with a terminal or $E_{i+1}$, not more than one $\alpha_j$ starts with $E_{i+1}$, and every occurrence of $S$ in $\alpha_j$ should be preceded and followed by terminals.

2. E

→ α

→ ε|α

|...|α

(or L

→ ε|α

|...|α

)

where α

starts with E

i+1

or a terminal, other α

start

with a terminal, E

, E

i+1

, and S are the only nontermi-

nals that can occur in α

, and every occurrence of S

and E

in any α

should be preceded and followed by

terminals.

3. $Q_i \to \alpha_1 | ... | \alpha_n$
where $Q_i$, $Q_{i+1}$, and $S$ are the only nonterminals that may occur in $\alpha_j$, all $\alpha_j$ start with a terminal or $Q_{i+1}$, not more than one $\alpha_j$ starts with $Q_{i+1}$, every occurrence of $S$ and $Q_i$ in any $\alpha_j$ should be preceded and followed by terminals, and every two consecutive occurrences of $Q_{i+1}$ in any $\alpha_j$ should be separated by a terminal.

4. Q

→ α

→ ε|α

|...|α

(or R

→ ε|α

|...|α

)

where α

starts with Q

i+1

or a terminal, other α

start

with a terminal, Q

, Q

i+1

, and S are the only nontermi-

nals that can occur in α

, every occurrence of S and Q

in any α

should be preceded and followed by termi-

nals, and every two consecutive occurrences of Q

i+1

in any α

should be separated by a terminal.

All terminals from $\alpha_j$ are distinct from terminals from $T$. No terminal may occur more than once in all productions. As with basic tier grammars, one extension production defines a class of terminals. This production can be used as a template for the introduction of multiple instances of this production, each having distinct terminals and a priority. The first two types of extension productions add new priorities to those of postfixes, prefixes, and connectives. The last two types add new priorities to the priorities of markers. The priorities of the original tier grammar should be shifted accordingly.

Proposition 3. Extended tier grammars define LL(1) languages.

If the ﬂexibility of a single tier grammar is not

sufﬁcient, then multiple tier grammars can be com-

bined so that every source grammar applies only to

a relevant portion of a document. The advantage of

combining multiple tier grammars vs CFGs is that the

simplicity of the notation is not compromised. Note

that the terminals of combined grammars may inter-

sect. If the set of terminals of tier grammar Γ

does

not include e

, then Γ

can be combined with Γ by

modifying the B production of Γ to the following:

B → FSH|r

e+1

where S

is the start nonterminal of Γ

. If the set

of terminals of Γ

is disjoint with {T

, T

, ..., T

4i−1

}

for Γ, then Γ

can be combined with Γ by adding a

marker tier. Here is the R

production for this tier:

→ ε|m

The combined grammars deﬁne LL(1) languages.

6 RELATED WORK

Several alternatives to the notation of CFGs have been developed (Ford, 2004; Aho et al., 2006; Berstel and Boasson, 2002). With the exception of regular expressions, none of these alternatives really simplified the task of creating and debugging grammars. Stochastic CFG parsers (Chappelier and Rajman, 1998) have a prohibitive time complexity for data that may be much bigger than programs.

Despite the remarkable research in the area of formal grammars, its applications to data parsing are few and far between (Underwood, 2012; McCann and Chandra, 2000; Back, 2002; Fisher and Gruber, 2005; Xi and Walker, 2010; Powell et al., 2011). An overview of data description languages can be found in (Fisher et al., 2006). None of these data description languages are on par with tier grammars in terms of the simplicity of specification for data formats.

Regular expressions have been used for information extraction tasks (Appelt and Onyshkevych, 1998), particularly for entity recognition. Google uses regular expressions and LL(1) grammars for entity recognition in their Search Appliance. There exist techniques for learning regular expressions and CFGs utilized in entity recognition (Li et al., 2008; Viola and Narasimhan, 2005). Grammars for the purpose of entity recognition should be strict, unlike tier grammars. Tier grammars capture the overall syntactic structure.

Grammar inference methods are basically limited to regular languages and other simple languages (Sakakibara, 1997). RoadRunner (Crescenzi and Mecca, 2004) infers union-free regular grammars that are used to extract information from large web sites. A method of learning CFG productions that specify the syntax of web server access logs is presented in (Thakur et al., 2013). The log format considered in that paper is a very simple regular language. It is not clear if this inference method will work for more complex languages.

7 CONCLUSION AND FUTURE WORK

Grammars enable the declarative programming of data preprocessors that extract syntactic information from unstructured sources and generate structured data that, in turn, serve as input for knowledge discovery and querying. Specifying a grammar by splitting terminals into meaningful disjoint subsets is one of the easiest ways to describe syntax. It is even simpler than regular expressions. The family of tier grammars presented and investigated here has sufficient expressive power to describe the syntax of many data languages. Tier grammars can be extended and combined, and predictive parsing is possible for all of them. Tier grammars have the qualities that are important for data parsing, particularly for parsing big data. The idea behind tier grammars that leads to LL(1) conditions is considering nonterminals as an ordered set and limiting productions to the forms in which forward references in the right-hand sides are always to the next nonterminal and backward references are bracketed by terminals.

Tier grammars can be embedded into LL(1) grammars. This gives a mechanism for defining multiple variants of syntactically complex languages. The LL(1) grammar part takes care of the syntactic difficulties whereas the tier part enables easy syntax modifications with the guarantee of predictive parsing. Defining stochastic tier grammars is easier than defining stochastic CFGs. Probabilities are given for terminal membership in classes/sub-groups rather than for productions. Tier grammar inference from positive examples can be formulated as a discrete optimization problem. Further investigation of all these topics is beyond the scope of this paper.

REFERENCES

Aho, A. V., Lam, M. S., Sethi, R., and Ullman, J. D. (2006). Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

Appelt, D. E. and Onyshkevych, B. (1998). The common pattern specification language. In Proceedings of a Workshop on Held at Baltimore, Maryland: October 13-15, 1998, TIPSTER '98, pages 23–30, Stroudsburg, PA, USA. Association for Computational Linguistics.

Back, G. (2002). Datascript - a specification and scripting language for binary data. In Generative Programming and Component Engineering, pages 66–77. Springer.

Berstel, J. and Boasson, L. (2002). Balanced grammars and their languages. In Formal and Natural Computing - Essays Dedicated to Grzegorz Rozenberg [on occasion of his 60th birthday, March 14, 2002], pages 3–25.

Chappelier, J.-C. and Rajman, M. (1998). A generalized cyk algorithm for parsing stochastic cfg. In Proceedings of Tabulation in Parsing and Deduction (TAPD'98), pages 133–137, Paris, France.

Crescenzi, V. and Mecca, G. (2004). Automatic information extraction from large websites. J. ACM, 51(5):731–779.

Dalvi, N., Kumar, R., and Soliman, M. (2011). Automatic wrappers for large scale web extraction. Proc. VLDB Endow., 4(4):219–230.

Fisher, K. and Gruber, R. (2005). Pads: A domain-specific language for processing ad hoc data. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, pages 295–304, New York, NY, USA. ACM.

Fisher, K., Mandelbaum, Y., and Walker, D. (2006). The next 700 data description languages. In Conference Record of the 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '06, pages 2–15, New York, NY, USA. ACM.

Ford, B. (2004). Parsing expression grammars: A recognition-based syntactic foundation. In Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '04, pages 111–122, New York, NY, USA. ACM.

Li, Y., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., and Jagadish, H. (2008). Regular expression learning for information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 21–30. Association for Computational Linguistics.

McCann, P. J. and Chandra, S. (2000). Packet types: Abstract specification of network protocol messages. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '00, pages 321–333, New York, NY, USA. ACM.

Powell, A., Beckerle, M., and Hanson, S. (2011). Data format description language (dfdl). Technical report, Open Grid Forum.

Sakakibara, Y. (1997). Recent advances of grammatical inference. Theoretical Computer Science, 185(1):15–45.

Tari, L., Tu, P. H., Hakenberg, J., Chen, Y., Son, T. C., Gonzalez, G., and Baral, C. (2012). Parse tree database for information extraction. IEEE Transactions on Knowledge and Data Engineering, 24(1):86–99.

Thakur, R., Jain, S., and Chaudhari, N. S. (2013). User behavior analysis using alignment based grammatical inference from web server access log. International Journal of Future Computer and Communication, 2(6):543.

Underwood, W. (2012). Grammar-based specification and parsing of binary file formats. International Journal of Digital Curation, 7(1):95–106.

Viola, P. and Narasimhan, M. (2005). Learning to extract information from semi-structured text using a discriminative context free grammar. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 330–337. ACM.

Xi, Q. and Walker, D. (2010). A context-free markup language for semi-structured text. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '10, pages 221–232, New York, NY, USA. ACM.

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval
