Data Parsing using Tier Grammars

Alexander Sakharov, Timothy Sakharov

Abstract

Parsing turns unstructured data into structured data suitable for knowledge discovery and querying. The complexity of grammar notations and the difficulty of grammar debugging limit the availability of data parsers. Tier grammars are defined by simply dividing terminals into predefined classes and then splitting elements of some classes into multiple layered sub-groups. The set of predefined terminal classes can be easily extended. Tier grammars and their extensions are LL(1) grammars. Tier grammars are a tool for big data preprocessing.
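
The abstract does not give the formal definition of a tier grammar, so the following is only an illustrative sketch of one possible reading. It assumes that delimiters form one of the terminal classes and that this class is split into layered sub-groups (tiers), with each tier separating the units of the tier below it. The function name `parse_tiers` and the chosen delimiters are hypothetical and are not taken from the paper.

```python
from typing import List, Union

# A parsed value is either a leaf string or a list of parsed sub-values.
Tree = Union[str, List["Tree"]]


def parse_tiers(text: str, tiers: List[str]) -> Tree:
    """Split text on the outermost delimiter tier, then recurse on the inner tiers.

    `tiers` lists the delimiters from the outermost layer to the innermost one
    (an assumed encoding of layered sub-groups, not the authors' notation).
    """
    if not tiers:
        # No delimiter tiers left: the remaining text is a leaf.
        return text
    outer, inner = tiers[0], tiers[1:]
    return [parse_tiers(part, inner) for part in text.split(outer)]


# Semicolons separate records, commas separate the fields inside a record.
print(parse_tiers("a,b;c,d", [";", ","]))  # prints [['a', 'b'], ['c', 'd']]
```

Under this reading, extending the grammar to a deeper format only requires adding another delimiter to the list, which is one way to picture the low notational overhead the abstract claims.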


Paper Citation


in Harvard Style

Sakharov A. and Sakharov T. (2015). Data Parsing using Tier Grammars. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015) ISBN 978-989-758-158-8, pages 463-468. DOI: 10.5220/0005632804630468


in Bibtex Style

@conference{kdir15,
author={Alexander Sakharov and Timothy Sakharov},
title={Data Parsing using Tier Grammars},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)},
year={2015},
pages={463-468},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005632804630468},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)
TI - Data Parsing using Tier Grammars
SN - 978-989-758-158-8
AU - Sakharov A.
AU - Sakharov T.
PY - 2015
SP - 463
EP - 468
DO - 10.5220/0005632804630468