Author: Christopher Scaffidi

Affiliation: Institute for Software Research, School of Computer Science, Carnegie Mellon University, United States

Keyword(s): Data integration, unsupervised learning, outlier finding, data formats, spreadsheets, databases, web services.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Enterprise Information Systems ; HCI on Enterprise Information Systems ; Human-Computer Interaction ; Intelligent User Interfaces

Abstract: One common approach to validating data such as email addresses and phone numbers is to check whether values conform to some desired data format. Unfortunately, users may need to learn a specialized notation such as regular expressions to specify the format, and even after learning the notation, specifying formats may take substantial time. To address these problems, this paper introduces Topei, a system that infers a format from an unlabeled collection of examples (which may contain errors). The generated format is presented as understandable English, so users can review and customize the format. In addition, the format can be used to automatically check data against the format and find outliers that do not match. Topei shows substantially higher precision and recall than an alternate algorithm (Lapis) on test data. Topei’s usefulness is demonstrated by integrating it with spreadsheet, database, and web services systems.
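The idea in the abstract, inferring a format from unlabeled examples and then flagging values that do not match it, can be illustrated with a minimal sketch. This is not Topei's algorithm, which the abstract does not describe. It is a simple majority-vote stand-in over coarse character-class shapes, and the names `signature`, `infer_format`, and `find_outliers` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def signature(value: str) -> str:
    """Reduce a string to a coarse shape: digit runs -> '9+', letter runs -> 'a+'.

    Any other character (punctuation, spaces) is kept, with repeats collapsed.
    """
    classes = ("9" if ch.isdigit() else "a" if ch.isalpha() else ch for ch in value)
    # groupby collapses consecutive identical classes into a single entry.
    return "".join(f"{cls}+" if cls in ("9", "a") else cls for cls, _ in groupby(classes))


def infer_format(examples: list[str]) -> str:
    """Pick the most common shape among the examples (assumed to be mostly valid)."""
    if not examples:
        raise ValueError("need at least one example to infer a format")
    counts = Counter(signature(v) for v in examples)
    return counts.most_common(1)[0][0]


def find_outliers(examples: list[str], fmt: str) -> list[str]:
    """Return the values whose shape differs from the inferred format."""
    return [v for v in examples if signature(v) != fmt]


if __name__ == "__main__":
    phones = ["412-555-0100", "412-555-0101", "4125550102", "412-555-0103"]
    fmt = infer_format(phones)
    print(fmt)
    print(find_outliers(phones, fmt))
```

A sketch like this tolerates a minority of erroneous examples only because it takes a majority vote; it produces a terse shape string rather than the English description the paper's system generates for user review.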

CC BY-NC-ND 4.0

Paper citation in several formats:
Scaffidi, C. (2007). UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION. In Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 4: ICEIS; ISBN 978-972-8865-92-4; ISSN 2184-4992, SciTePress, pages 236-241. DOI: 10.5220/0002347902360241

@conference{iceis07,
author={Christopher Scaffidi},
title={UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION},
booktitle={Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 4: ICEIS},
year={2007},
pages={236-241},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002347902360241},
isbn={978-972-8865-92-4},
issn={2184-4992},
}

TY - CONF
JO - Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 4: ICEIS
TI - UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION
SN - 978-972-8865-92-4
IS - 2184-4992
AU - Scaffidi, C.
PY - 2007
SP - 236
EP - 241
DO - 10.5220/0002347902360241
PB - SciTePress