data that, in turn, serve as input for knowledge dis-
covery and querying. Specifying a grammar by split-
ting terminals into meaningful disjoint subsets is one
of the easiest ways to describe syntax. It is even
simpler than regular expressions. The family of tier
grammars presented and investigated here has suffi-
cient expressive power to describe the syntax of many
data languages. Tier grammars can be extended and
combined, and predictive parsing is possible for all of
them. Tier grammars have the qualities that are im-
portant for data parsing, particularly for parsing big
data. The idea behind tier grammars that leads to the
LL(1) conditions is to treat the nonterminals as an
ordered set and to restrict productions to forms in
which every forward reference in a right-hand side is
to the next nonterminal and every backward reference
is bracketed by terminals.
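The shape restriction summarized above can be illustrated with a small check. The following Python sketch is only one reading of that summary, not the formal definition used in the paper: it assumes that a reference to a nonterminal of equal or lower rank counts as "backward", and that "bracketed by terminals" means a terminal immediately precedes and immediately follows the reference. The function name check_tier_shape and the example grammars are hypothetical. The check covers only the form of the productions; it does not verify the LL(1) conditions themselves.

```python
from typing import Dict, List, Sequence


def check_tier_shape(
    order: Sequence[str],
    productions: Dict[str, List[List[str]]],
) -> List[str]:
    """Return a list of shape violations (empty if none are found).

    `order` lists the nonterminals from first to last tier; every key of
    `productions` is assumed to appear in it. Any right-hand-side symbol
    that is not in `order` is treated as a terminal.
    """
    rank = {nt: i for i, nt in enumerate(order)}
    problems: List[str] = []
    for head, alternatives in productions.items():
        for rhs in alternatives:
            for pos, sym in enumerate(rhs):
                if sym not in rank:
                    continue  # terminal symbol, nothing to check
                if rank[sym] > rank[head]:
                    # Forward reference: must point to the next nonterminal.
                    if rank[sym] != rank[head] + 1:
                        problems.append(
                            f"{head} -> {' '.join(rhs)}: forward reference "
                            f"to {sym} skips a tier"
                        )
                else:
                    # Backward (or self) reference: assumed to need a
                    # terminal directly on both sides.
                    left_ok = pos > 0 and rhs[pos - 1] not in rank
                    right_ok = pos + 1 < len(rhs) and rhs[pos + 1] not in rank
                    if not (left_ok and right_ok):
                        problems.append(
                            f"{head} -> {' '.join(rhs)}: backward reference "
                            f"to {sym} is not bracketed by terminals"
                        )
    return problems


# Hypothetical three-tier example: Record < Field < Value.
ORDER = ["Record", "Field", "Value"]

GOOD = {
    "Record": [["Field"]],
    "Field": [["Value"]],
    "Value": [["atom"], ["(", "Record", ")"]],  # backward reference in brackets
}

BAD = {
    "Record": [["Value"]],         # forward reference skips Field
    "Value": [["Record", "atom"]],  # backward reference lacks a left terminal
}

if __name__ == "__main__":
    print(check_tier_shape(ORDER, GOOD))
    for problem in check_tier_shape(ORDER, BAD):
        print(problem)
```

In this sketch the first grammar passes because its only backward reference, to Record, sits between the terminals "(" and ")", while the second grammar is reported twice: once for skipping a tier and once for an unbracketed backward reference.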
Tier grammars can be embedded into LL(1) gram-
mars. This gives a mechanism for defining multi-
ple variants of syntactically complex languages. The
LL(1) grammar part takes care of the syntactic diffi-
culties whereas the tier part enables easy syntax mod-
ifications with the guarantee of predictive parsing.
Defining stochastic tier grammars is easier than defining
stochastic CFGs. Probabilities are given for terminal
membership in classes/sub-groups rather than for
productions. Tier grammar inference from positive
examples can be formulated as a discrete optimiza-
tion problem. Further investigation of all these topics
is beyond the scope of this paper.