UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION

Christopher Scaffidi

Abstract

One common approach to validating data such as email addresses and phone numbers is to check whether values conform to a desired data format. Unfortunately, users may need to learn a specialized notation such as regular expressions to specify the format, and even after learning the notation, specifying formats may take substantial time. To address these problems, this paper introduces Topei, a system that infers a format from an unlabeled collection of examples (which may contain errors). The generated format is presented as understandable English, so users can review and customize it. In addition, the format can be used to check data automatically and find outliers that do not match. Topei shows substantially higher precision and recall than an alternative algorithm (Lapis) on test data. Topei’s usefulness is demonstrated by integrating it with spreadsheet, database, and web services systems.
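
To make the idea of inferring a format from unlabeled examples more concrete, the following is a minimal sketch and not Topei's actual algorithm, which this abstract does not describe. It assumes a much simpler technique: each value is reduced to runs of character classes, the most common run pattern is taken as the format, and values that do not share it are reported as outliers. The function names (`signature`, `infer_format`, `describe`, `find_outliers`) and the sample phone numbers are hypothetical.

```python
from collections import Counter
from itertools import groupby


def char_class(ch):
    """Map a character to a coarse class: 'digit', 'letter', or the literal itself."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return ch  # punctuation and spaces are kept as literals


def signature(value):
    """Collapse a value into runs of character classes with their lengths."""
    return tuple((cls, len(list(run))) for cls, run in groupby(value, key=char_class))


def infer_format(examples):
    """Assumption: the most common signature is the intended format (majority vote)."""
    counts = Counter(signature(v) for v in examples)
    return counts.most_common(1)[0][0]


def describe(fmt):
    """Render an inferred signature as a short English phrase."""
    parts = []
    for cls, n in fmt:
        if cls in ("digit", "letter"):
            parts.append(f"{n} {cls}{'s' if n != 1 else ''}")
        else:
            parts.append(f"'{cls * n}'")
    return ", then ".join(parts)


def find_outliers(examples, fmt):
    """Return the values whose signature differs from the inferred format."""
    return [v for v in examples if signature(v) != fmt]


# Hypothetical sample data: three well-formed values and two erroneous ones.
phones = ["412-555-0101", "412-555-0199", "724-555-0142", "4125550107", "412-555-01x3"]
fmt = infer_format(phones)
print(describe(fmt))
print(find_outliers(phones, fmt))
```

A majority vote over exact signatures is deliberately crude: it tolerates a minority of erroneous examples but cannot represent formats with optional or variable-length parts, which a real inference system would need to handle.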

Paper Citation


in Harvard Style

Scaffidi C. (2007). UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION. In Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 5: ICEIS, ISBN 978-972-8865-92-4, pages 236-241. DOI: 10.5220/0002347902360241


in Bibtex Style

@conference{iceis07,
author={Christopher Scaffidi},
title={UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION},
booktitle={Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 5: ICEIS},
year={2007},
pages={236-241},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002347902360241},
isbn={978-972-8865-92-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 5: ICEIS
TI - UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION
SN - 978-972-8865-92-4
AU - Scaffidi C.
PY - 2007
SP - 236
EP - 241
DO - 10.5220/0002347902360241