UNSUPERVISED INFERENCE OF DATA FORMATS

IN HUMAN-READABLE NOTATION

Christopher Scaffidi

Institute for Software Research, School of Computer Science, Carnegie Mellon University

5000 Forbes Ave, Pittsburgh, PA, 15217, United States

Keywords: Data integration, unsupervised learning, outlier finding, data formats, spreadsheets, databases, web services.

Abstract: One common approach to validating data such as email addresses and phone numbers is to check whether

values conform to some desired data format. Unfortunately, users may need to learn a specialized notation

such as regular expressions to specify the format, and even after learning the notation, specifying formats

may take substantial time. To address these problems, this paper introduces Topei, a system that infers a for-

mat from an unlabeled collection of examples (which may contain errors). The generated format is pre-

sented as understandable English, so users can review and customize the format. In addition, the format can

be used to automatically check data against the format and find outliers that do not match. Topei shows sub-

stantially higher precision and recall than an alternate algorithm (Lapis) on test data. Topei’s usefulness is

demonstrated by integrating it with spreadsheet, database, and web services systems.

1 INTRODUCTION

Information systems typically use semantics-free

character strings to store small semi-structured data,

including postal codes, person names, and URLs.

The usual approach to ensuring strings’ validity is to

write programs such as database triggers to check if

values match a desired format.

From a human-computer interaction standpoint,

two major challenges interfere with this approach.

First, users must learn specialized notation to

write a format. Regular expressions are probably the

most common notation for specifying simple for-

mats such as email addresses, but the notation is

hard for users to master (Blackwell, 2001). Even

experienced programmers recognize that regular

expressions are hard to learn. For example, the top

few results in a Google search for “regular expres-

sions” will return comments such as, “Do not worry

if the above example or the quick start make little

sense” and, “Sometimes you have a programming

problem and it seems like the best solution is to use

regular expressions; now you have two problems.”

Second, even after a user learns a notation, speci-

fying simple constraints can be time-consuming and

complex. For example, IPv4 addresses have four

integers, separated by periods, and each integer can

range from 0 through 255. Just the regular expres-

sion for an integer from 0 through 255 is lengthy:

25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?

The Tope system begins to address these chal-

lenges by presenting formats in understandable Eng-

lish, so users do not need to learn a specialized nota-

tion (Scaffidi, 2007). To date, the system’s main

weakness is its limited support for helping users to

get started: Although users do not need to learn spe-

cialized notation in order to specify a format, there is

still the cognitive work of examining data, breaking

it into parts, and representing the parts in the format.

This paper addresses this limitation by presenting

Topei, an algorithm that examines examples and

infers a format. The process involves three steps:

1. The user provides examples to Topei, which in-

fers a format.

2. The user reviews the format in the Tope sys-

tem’s editor tool. Although the format is inter-

nally represented in an XML notation, the user

never sees this, as the editor presents the format

to the user as a set of statements in English. Af-

ter any customization by the user, the editor

saves the format in a file.

3. The user validates strings by passing them

through the Tope system’s parser tool, which at-

tempts to parse each string using an augmented

236

Scafﬁdi C. (2007).

UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION.

In Proceedings of the Ninth International Conference on Enterprise Information Systems - HCI, pages 236-241

DOI: 10.5220/0002347902360241

 SciTePress

context-free grammar (CFG) automatically de-

rived from the format.

In short, Topei provides automatic inference of

formats, saving users time and cognitive load, while

retaining the advantage of editability. This is an im-

portant advantage, since every inference algorithm

can make errors, so users should be able to under-

stand the inferred format and make corrections if

needed. In particular, the sample data may not ex-

emplify some aspects of the format, so users must

manually add these aspects.

An important aspect of data validation is outlier

finding, which involves identifying invalid data val-

ues that are mixed among a set of valid values. This

paper evaluates Topei as an outlier finder by compar-

ing it to the Lapis system (Miller & Myers, 2001),

revealing that Topei has higher precision and recall.

Section 2 reviews related work on inference and

data formats. Section 3 describes the integration of

Topei with spreadsheets, databases, and web

services. Section 4 discusses the inference algorithm

and evaluates its accuracy and limitations.

2 RELATED WORK

This paper’s contribution, Topei, has these traits:

• It infers a data format.

• It is unsupervised (uses unlabeled examples).

• Its formats permit soft constraints.

• Its formats are presentable in English.

• Its formats are useful for outlier finding.

Other algorithms share some traits but lack others:

Many machine learning algorithms train a recog-

nizer to notice certain features, enabling the recog-

nizer to identify outliers lacking those features

(Mitchell, 1997). However, such algorithms gener-

ally do not infer a human-editable data format.

Other algorithms generate formats in specialized

notation. These include (Blackwell, 2001), (Lerman

& Minton, 2000), (Lieberman et al., 2001), and

(Nardi et al., 1998), all of which use regular expres-

sions or CFGs. One last system, Lapis, has a special-

ized notation; it is like Topei in that it is intended to

be highly intuitive (Miller & Myers, 2001). Yet us-

ing its notation still takes practice, since expressing

simple constructs can be cumbersome. For instance,

the Lapis library defines a day (in a date) as:

@DayOfMonth is Number equal to

/[12][0-9]|3[01]|0?[1-9]/

ignoring nothing

In contrast, Topei infers formats that are pre-

sented to users as understandable English, and it

supports “soft” (non-mandatory) constraints on data.

Several tools recognize or manipulate some of

the same kinds of data as Topei (Hong & Wong,

2006) (Pandit & Kalbag, 1997) (Stylos et al., 2004).

However, these tools’ formats are hard-coded and

cannot be extended or customized by end users.

3 INTERFACES

Topei supports a graphical user interface (GUI), a

command-line interface (CLI), and an application

programming interface (API).

3.1 Spreadsheet Assertions

Spreadsheets are an important information reposi-

tory in organizations, but most spreadsheets contain

errors (Panko, 1998). Topei helps users validate

strings, which comprise nearly 40% of all spread-

sheet cells (Fisher, 2004). Users can create a format

from examples in one or more cells and attach the

format to those cells; if a cell’s contents do not

match the format, then the cell is highlighted as a

possible error that the user can either fix or ignore.

For instance, consider Figure 1, part of a sample

spreadsheet from Microsoft. The screenshot includes

the Topei “Patterns” toolbar (an Excel plug-in).

Figure 1: Sample Excel spreadsheet and Topei plug-in.

Though the first 38 rows each contain a person

first name in column B, some rows also have a middle

initial or name. Such outliers are semantically incon-

sistent with most cells, and the surface format reflects

this inconsistency. In particular, most outliers have an

extra uppercase letter, a space, and a period.

To find such errors, the user highlights column B

and clicks the toolbar’s “Create” button. Topei infers

a format, which the editor loads and displays (Figure

2). The user can add, edit, and remove format parts

and/or constraints, and constraints can be marked as

UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION

237

“soft constraints” (i.e.: not “always” true). For more

details on the editor, refer to (Scaffidi, 2007).

Figure 2: Editor for reviewing / customizing formats.

In this example, the user might rename “MyPat-

tern” to “First Name” and “PART1” to “name,” then

broaden the length range from 3-8 characters to 2-10.

For more complex structured data, the editor

would be initialized with more parts, and perhaps

with constraints. For example, if the data were dates

DD/MM/YYYY form, then the editor would show

three parts (the day, the month, and the year, separated

by /), and each part could have constraints (e.g., re-

stricting the day to 1-31 and the month to 1-12).

After the user saves the format, the GUI uses the

parser to test each selected cell. Any cells that do not

match the format, including all constraints, are

marked with a red triangle; mousing over the cell

brings up an error message. The plug-in integrates

with Excel’s Reviewing functionality to display this

error feedback. Thus, the user can page through er-

rors using the existing “comments” browser within

Excel and can ignore messages if desired.

3.2 Database Integration

Topei supports a CLI, demonstrated here with the

sample

Person table provided by Microsoft.

First,

the user exports example data from the database.

The SQL Server command is:

osql -d AdventureWorks -E –h-1 -o

"c:\tmp.txt" -Q "SET NOCOUNT ON

select PostalCode from Person.Address"

Next, the user loads the examples into Topei:

topei.exe c:\tmp.txt

Topei infers a format that the editor displays for

review and customization. After the user saves the

format, it can be applied to data. A SQL Server

plug-in provides a new function,

checkPattern,

which loads the format, checks the string against the

format, and returns a score between 0 and 1. Here, 1

indicates that the string satisfies the format, and 0

indicates that it does not; if the string violates soft

constraints, then

checkPattern returns a score

between 0 and 1. (By passing in a non-negative

threshold, the user can make the function throw an

exception if the score is less than the threshold.)

For example, based on examples from the

Per-

son table, Topei infers that the varchar Postal-

Code has one five-digit part. If the user saves this

format to a file called “postal.xml,” then the col-

umn’s data can be checked using the following SQL:

select PostalCode from Person.Address

where dbo.checkPattern('postal.xml',

PostalCode, -1) < 1

Executing this query on the sample Microsoft

data returns 7276 rows (out of 19614 rows in the

table). Most erroneous values contain only four dig-

its, rather than the five that a US zip code must have.

Others are British postal codes; the user could allow

for these by creating a second format and doing a

query for rows that match neither format.

The SQL shown above can be executed in an in-

teractive query window or in a command-line batch

script. A trigger could use a similar query to prevent

invalid values from entering the table.

Tope data formats are portable. For example, the

user could create a format with the GUI, and then

use it to check data with the CLI or API.

3.3 Web Service Validation

Finally, Topei supports a C# API that can be called

from programs that manipulate text, XML, or other

data. This is useful when a programmer needs to

validate data coming from a web service.

For example, in the Google Base web applica-

tion, users can create records with arbitrary attribute-

value pairs. For instance, a

Job record might have a

contact field containing a phone number. Unfortu-

nately, Google Base does not validate these values,

so they are sometimes malformed. Consequently,

when a program retrieves jobs from the Google Base

XML feed, some attributes may have invalid values.

One line of code suffices to infer a format (

xt)

from an array of example values (

ex). A second line

creates a parser and validates a string (

str):

string xt=XtopeInference.MakeXtope(ex);

float sc=Parser.init(xt).check(str);

Like the checkPattern stored procedure, the

check function returns a score from 0 to 1 to indi-

cate how well the string matches the format.

ICEIS 2007 - International Conference on Enterprise Information Systems

238

4 TOPEI: FORMAT INFERENCE

This section describes the format model, the infer-

ence algorithm, and the evaluation of Topei.

4.1 Overview of Formats

The following discussion is a summary. For details

on formats and parsing, refer to (Scaffidi, 2007).

Each format contains one or more parts, each of

which has zero or more constraints.

The Pattern constraint specifies a part’s length

(or a range of valid lengths) and the character class

of its constituents. The character class can be

0 (dig-

its),

a (lowercase letters) or A (uppercase letters), or

it can be a composite character class. For example, a

mixture of uppercase letters and digits would be ex-

pressed using the composite class

A0.

Other constraints express additional semantics:

• Wrapped: neighboring separator

• Numeric: numeric inequality or equality

• Substring: starts or ends with a string

• Literal: in a certain set of valid strings

For example, a date in

DD/MM/YY format would

have three parts, each of which would have a Pattern

constraint with length 2 and character class

0. Each

part would be restricted to a numeric range with Nu-

meric constraints, and each slash would be associated

with the preceding part using a Wrapped constraint.

Each constraint can have an attribute in the range

0 through 100 to express a confidence that the con-

straint should be true. Constraints with confidence

100 (the default) must always be true; those with

confidence 0 represent statements that should never

be true. For example, a Numeric constraint with con-

fidence 0 can be used to specify, “the day never is >

31.” Constraints with confidence between 0 and 100

are soft. If a string violates soft constraints, then the

parser returns a score between 0 and 1.

Each constraint has a straightforward English

equivalent. This equivalence is the key to presenting

formats in human-readable form, as Figure 2 depicts.

A harder challenge is to examine example data and

infer a format in the notation. This is achieved with

Topei, as described below.

4.2 Inference Algorithm

Format inference has two phases. First, Topei identi-

fies the format’s parts and each part’s composite

character class. Second, Topei identifies constraints.

The discussion below uses six example email

addresses to demonstrate the algorithm:

apple@gmail.com

banana1@hotmail.com

carrot@company.com

DATE@UNIVERSITY.edu

eggplant@firm-name.com

fig.plant@mail.univ.edu

For each example, Topei replaces each letter and

digit with its character class, and then collapses runs of

identical classes. The collapsed string is called a “signa-

ture.” The email addresses yield five unique signatures:

a@a.a

a0@a.a

A@A.a

a@a-a.a

a.a@a.a.a

To finish this phase, Topei uses separators to iden-

tify and align parts, and it abstracts each part’s charac-

ter class to the least general composite character class

that covers the examples of that part. In the email ad-

dress examples, the first three signatures have three

parts and identical separators (one @ followed by a

period), so they are aligned and abstracted:

a @ a . a

a0 @ a . a aA0 @ aA . a

A @ A . a

This concludes the first phase of inference,

which has generated a primary format (

aA0@aA.a)

with three parts. Two email addresses have signatures

that are incompatible with this format, so Topei does

not include them in the primary format. Instead, Topei

uses them for secondary formats that can be retrieved

through the API. The user can also customize the pri-

mary format in the editor so that omitted examples are

covered; whether the user decides to do this depends

on the data’s semantics.

In the next phase, Topei generates constraints for

each format’s parts. In the example signature above,

the two punctuation marks (

@ and .) are separators,

each of which corresponds to a Wrapped constraint.

Next, for each part, Topei histograms the lengths

of examples. It selects the lengths that together cover

at least 95% of the examples, then expresses this set

of lengths as a range over a composite character

class (e.g.: part 1 of the email addresses has 4-7

aA0,

part 2 has 5-10

aA, and part 3 has 3 a). This yields a

Pattern constraint.

Then, Topei histograms the examples’ text and

searches for a set of 3 or fewer strings that together

cover at least 95% of the examples (e.g.:

edu and

com cover the email addresses’ part 3). If Topei

finds such a set, it creates a Literal constraint, re-

quiring that the part’s text must be in the set.

If no Literal constraint is created, then Topei tries

constraining the part to a numeric range. It converts

the part examples to numbers and histograms them. If

Topei finds a range of numeric values that covers at

UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION

239

least 95% of all examples (including non-numeric

strings), then it constrains the part to that range (using

Numeric constraints).

If no

Literal or Numeric constraints are created,

then Topei tries “begins with” and/or “ends with”

constraints. Again using histograms, it tries to find

the longest substring that prefixes at least 95% of

examples; if such a prefix exists, and then Topei

creates a corresponding Substring constraint. Like-

wise, it creates a suffix Substring constraint if some

substring suffixes at least 95% of examples.

Topei sets the confidence to 100 for Pattern con-

straints but sets other constraints’ confidence to 60

(to help prevent over-fitting). When reviewing the

format, the user can tighten constraints to a confi-

dence of 100.

In a few details, this algorithm resembles exist-

ing algorithms. For example, Topei uses character

classes similar to those of (Lerman & Minton,

2000). However, in order to decide whether to gen-

eralize to a composite character class, Lerman &

Minton assume that string values are independently

sampled from a probability distribution. In practice,

if a value appears in one spreadsheet cell or database

row, then that value also tends to appear in nearby

cells or rows. Thus, real data violate Lerman &

Minton’s independence assumption. Therefore,

rather than presupposing the existence of a particular

probability distribution, Topei generalizes to the

least general composite character class and then uses

the editor to display the format so the user can re-

view and customize the format. (In addition, their

algorithm does not infer numeric constraints, and

their formats are not portable, as they are specialized

to matching strings mixed with HTML.)

4.3 Validation: Comparison to Lapis

Topei is evaluated by using it and Lapis 1.2 (Miller

& Myers, 2001) to locate outliers in test data, then

manually examining the data to find true outliers.

Topei shows higher precision and recall than Lapis.

Test data come from the EUSES spreadsheet cor-

pus (Fisher & Rothermel, 2004), which contains 4498

spreadsheets culled from the web. Two kinds of test

data are used—country names and American phone

numbers—in order to test Topei on unstructured and

structured data. A column of spreadsheet data is in-

cluded in the country test data if it has at least 20

cells, and at least 1 cell equals “Portugal,” and the

first cell contains the word “country.” A column is

included in the phone number test data if it has at least

20 cells with exactly 10 digits, and at least 2/3 of cells

have exactly 10 digits, and the first cell contains the

word “phone.” Visually examining the data reveals

that one phone column is a mixture of many formats;

discarding this column leaves 1124 countries in 7 col-

umns and 6288 phone numbers in 37 columns.

To locate outliers for each spreadsheet column,

Topei is used to infer a format, and the parser is used

to check each of the column’s cells against the for-

mat. If the parser returns a score less than 1 for a

cell, then it is flagged as an outlier.

To locate outliers for each column, Lapis maps

cells into a feature space, and then computes the dis-

tance of each cell from the centroid. Lapis finds the

maximum of these distances,

MAXDIST, and any cell

value that is at least

MAXDIST/2 from the centroid is

flagged as an outlier. There is one extra heuristic: if

more than half of the cells in the column would be

flagged as outliers, or if 10 or more would be

flagged, then Lapis flags none of them as outliers.

(Lapis comes pre-equipped with hand-coded country

and phone number formats, which are disabled for

this test, though the formats for numbers, words,

delimiters, and other layout remain enabled.)

To locate true outliers, each column is manually

examined. Countries are labeled as outliers if they con-

tain abbreviations, misspellings, or a different name

than the one usually used by English-speakers (e.g.:

Côte d’Ivoire is an outlier in one case; in fact, it is that

column’s only value spelled in French), though “Bra-

sil,” “US,” “USA,” “UK,” and various pre-re-

unification names for Germany are not labeled as out-

liers, as they are commonly used. Phone numbers are

labeled as outliers if their format differs from the col-

umn’s main format (e.g., insertion of extra spaces, or

using periods as separators when most cells use hy-

phens), or if their area codes or exchanges are not in the

list of valid values from www.nanpa.org. These criteria

identify 92 country names (8%) and 1669 phone num-

bers (26%) as true outliers.

In outlier finding, precision is the number of

cells correctly flagged as outliers by an algorithm,

divided by the total number of cells flagged as out-

liers. Recall is the number of cells correctly flagged

as outliers, divided by the number of true outliers.

As shown in Table 1, Topei demonstrates higher

precision and recall than Lapis.

Table 1: Precision and recall of Topei and Lapis.

Task Algorithm Precision (%) Recall (%)

Country Topei 56.5 94.6

Country Lapis 46.7 7.6

Phone Topei 97.7 99.8

Phone Lapis 44.0 2.4

The Lapis heuristics are intentionally biased to-

ward low recall, as the algorithm’s designers believed

that “highlighting a large number of outliers is un-

ICEIS 2007 - International Conference on Enterprise Information Systems

240

helpful to the user, since the user must examine each

one” (Miller & Myers, 2001). Recall can be raised by

lowering the outlier threshold from its default value

(½ of the maximal distance to the centroid) and elimi-

nating the heuristic that caps the number of outliers

per column. As Figure 3 shows, this raises Lapis’s

recall as high as 70% and 9% for the tasks, respec-

tively, with little loss of precision. Yet these scores

remain much lower than those of Topei.

In these tests, Topei generates few soft con-

straints, so the parser usually returns scores of 0 or

1. Thus, reducing Topei’s threshold from 1 only al-

ters its precision and recall by 2%.

Figure 3: Performance of modified Lapis with outlier

threshold expressed as a fraction of MAXDIST to centroid.

4.4 Limitations and Future Work

Topei performs better than Lapis at finding outliers

in spreadsheet cells. This is probably due to the al-

gorithms’ different inductive biases: the notation

used to express formats in Lapis is intended for de-

scribing regions of text in large documents. Unsur-

prisingly, Lapis performs better on unstructured data

(countries) than structured data (phone numbers). In

contrast, Topei is oriented toward single data values

(such as spreadsheet cells), particularly those with

separator-delimited parts.

Still, Topei makes mistakes. For the phone task,

most mistakes occur because Topei fails to notice

invalid area codes; in these cases, not enough exam-

ples are present to lead Topei to create soft numeric

range constraints. In the country task, most mistakes

occur because a valid country name contains two

words, and Topei currently does not infer word repe-

tition. Adding heuristics that are more sophisticated

might help to reduce these mistakes. Comparison

with other systems (besides Lapis) may inspire addi-

tional ideas for improvement.

Future work could make Topei more flexible by

adding support for non-English letters (such as let-

ters with accents). The editor and parser support

Unicode but would require interface changes.

Like Lapis, Topei is designed to have O(n) com-

putational complexity, where n is the number of ex-

amples. Preliminary tests indicate that the imple-

mentation does demonstrate O(n) performance,

though additional evaluation would be desirable.

The usability of Topei’s user interfaces has not

yet been evaluated. Such an evaluation may reveal

limitations to how well users understand Topei and

how successfully they apply formats to spreadsheets,

databases, web services, and other systems.

ACKNOWLEDGEMENTS

This work was funded in part by the National Sci-

ence Foundation (ITR-0325273) via the EUSES

Consortium and by the National Science Foundation

under Grant CCF-0438929. Any opinions, findings,

and conclusions or recommendations expressed in

this material are those of the author and do not nec-

essarily reflect the views of the sponsors.

REFERENCES

Blackwell, B., 2001. SWYN: A Visual Representation for

Regular Expressions. Your Wish is My Command:

Programming by Example, pp. 245-270.

Fisher, M., Rothermel, G., 2004. The EUSES Spreadsheet

Corpus: A Shared Resource for Supporting Experimen-

tation with Spreadsheet Dependability Mechanisms,

Tech. Report 04-12-03, Univ. Nebraska-Lincoln.

Hong, J., Wong, J., 2006. Marmite: End-User Program-

ming for the Web, Proc. CHI’06 Conf. on Human

Factors in Computing Systems, pp. 1541-1546.

Lerman, K., Minton, S., 2000. Learning the Common

Structure of Data, Proc. AAAI-2000, pp. 609-614.

Lieberman, H., Nardi, B., Wright, D., 2001. Training

Agents to Recognize Text by Example, Auton. Agents

and Multi-Agent Systems, vol. 4, no. 1, pp. 79-92.

Miller, R., Myers, B., 2001. Outlier Finding: Focusing User

Attention on Possible Errors. Proc. 14th Annual Symp.

on User Interface Software and Technology, pp. 81-90.

Mitchell, T., 1997. Machine Learning, McGraw Hill.

Nardi, B., Miller, J., Wright, D., 1998. Collaborative, Pro-

grammable Intelligent Agents. Comm. ACM, vol. 41,

no. 3, pp. 96-104.

Panko, R., 1998, What We Know about Spreadsheet Er-

rors, J. End User Computing, vol. 10, no. 2, pp. 15-21.

Pandit, M., Kalbag, S., 1997. The Selection Recognition

Agent: Instant Access to Relevant Information and

Operations. Proc. 2

Intl. Conf. on Intelligent User

Interfaces, pp. 47-52.

Scaffidi, C., Myers, B., Shaw, M., 2007. The Topes For-

mat Editor and Parser, Tech. Report CMU-ISRI-07-

104/CMU-HCII-07-100, School of Computer Science,

Carnegie Mellon University, Pittsburgh, PA.

Stylos, J., Myers, B., Faulring, A., 2004. Citrine: Providing

Intelligent Copy-and-Paste. Proc. 17

Annual Symp. on

User Interface Software and Technology, pp. 185-188.

UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION

241