helpful to the user, since the user must examine each
one” (Miller & Myers, 2001). Recall can be raised by
lowering the outlier threshold from its default value
(½ of the maximal distance to the centroid) and eliminating
the heuristic that caps the number of outliers
per column. As Figure 3 shows, this raises Lapis’s
recall as high as 70% and 9% for the tasks, respectively,
with little loss of precision. Yet these scores
remain much lower than those of Topei.
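The effect of the threshold can be illustrated with a minimal sketch. It is not Lapis's implementation: the feature vectors, the Euclidean distance, and the names find_outliers, fraction, and max_outliers are assumptions made only for illustration. A value is flagged when its distance to the centroid exceeds the given fraction of the maximal distance, and an optional cap stands in for the per-column heuristic.

```python
import math

def find_outliers(vectors, fraction=0.5, max_outliers=None):
    """Return indices of vectors farther from the centroid than
    fraction * (maximal distance to the centroid).

    Hypothetical sketch: the feature vectors and Euclidean distance
    are assumptions, not the representation used by Lapis.
    """
    if not vectors:
        return []
    dim = len(vectors[0])
    # Centroid: per-dimension mean of all vectors.
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    dists = [math.dist(v, centroid) for v in vectors]
    max_dist = max(dists)
    if max_dist == 0:
        return []  # all values identical, so nothing stands out
    cutoff = fraction * max_dist
    flagged = [i for i, d in enumerate(dists) if d > cutoff]
    # Most distant first, so an optional cap keeps the strongest outliers.
    flagged.sort(key=lambda i: dists[i], reverse=True)
    if max_outliers is not None:
        flagged = flagged[:max_outliers]
    return flagged

# Made-up feature vectors for five cells in one column.
column = [[0, 0], [0, 0], [0, 0], [1, 0], [4, 0]]
default_flags = find_outliers(column, fraction=0.5)
lowered_flags = find_outliers(column, fraction=0.25)
capped_flags = find_outliers(column, fraction=0.25, max_outliers=1)
```

With the made-up data above, lowering the fraction flags more cells, which is the mechanism by which recall rises, while the cap limits how many flags survive.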
In these tests, Topei generates few soft constraints,
so the parser usually returns scores of 0 or
1. Thus, reducing Topei’s threshold from 1 alters
its precision and recall by only 2%.
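A minimal sketch shows why a mostly 0-or-1 score distribution makes the threshold matter little. It assumes that a value is flagged as an outlier when its parser score falls below the threshold; the function names, scores, and error labels below are made up for illustration and are not taken from the experiments.

```python
def flag_outliers(scores, threshold=1.0):
    """Flag each value whose score is below the threshold (assumed rule)."""
    return [s < threshold for s in scores]

def precision_recall(flagged, is_error):
    """Compute precision and recall of the flags against true error labels."""
    pairs = list(zip(flagged, is_error))
    tp = sum(f and e for f, e in pairs)        # flagged and truly erroneous
    fp = sum(f and not e for f, e in pairs)    # flagged but actually valid
    fn = sum(e and not f for f, e in pairs)    # erroneous but not flagged
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up scores: only one value (0.5) lies strictly between 0 and 1.
scores = [1.0, 1.0, 0.0, 0.5, 1.0, 0.0]
is_error = [False, False, True, False, True, True]

strict = precision_recall(flag_outliers(scores, threshold=1.0), is_error)
relaxed = precision_recall(flag_outliers(scores, threshold=0.5), is_error)
```

Only values with intermediate scores can change status when the threshold moves, so when soft constraints are rare, precision and recall shift only slightly.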
Figure 3: Performance of modified Lapis with outlier
threshold expressed as a fraction of MAXDIST to centroid.
4.4 Limitations and Future Work
Topei performs better than Lapis at finding outliers
in spreadsheet cells. This is probably due to the
algorithms’ different inductive biases: the notation
used to express formats in Lapis is intended for
describing regions of text in large documents. Unsurprisingly,
Lapis performs better on unstructured data
(countries) than structured data (phone numbers). In
contrast, Topei is oriented toward single data values
(such as spreadsheet cells), particularly those with
separator-delimited parts.
Still, Topei makes mistakes. For the phone task,
most mistakes occur because Topei fails to notice
invalid area codes; in these cases, too few examples
are present to lead Topei to create soft numeric
range constraints. In the country task, most mistakes
occur because a valid country name contains two
words, and Topei currently does not infer word repetition.
Adding more sophisticated heuristics
might help to reduce these mistakes. Comparison
with other systems (besides Lapis) may inspire additional
ideas for improvement.
Future work could make Topei more flexible by
adding support for non-English letters (such as letters
with accents). The editor and parser already support
Unicode, but handling such letters would require
interface changes.
Like Lapis, Topei is designed to have O(n) computational
complexity, where n is the number of examples.
Preliminary tests indicate that the implementation
does run in O(n) time,
though additional evaluation would be desirable.
The usability of Topei’s user interfaces has not
yet been evaluated. Such an evaluation may reveal
limitations to how well users understand Topei and
how successfully they apply formats to spreadsheets,
databases, web services, and other systems.
ACKNOWLEDGEMENTS
This work was funded in part by the National Science
Foundation (ITR-0325273) via the EUSES
Consortium and by the National Science Foundation
under Grant CCF-0438929. Any opinions, findings,
and conclusions or recommendations expressed in
this material are those of the author and do not necessarily
reflect the views of the sponsors.
REFERENCES
Blackwell, B., 2001. SWYN: A Visual Representation for
Regular Expressions. Your Wish is My Command:
Programming by Example, pp. 245-270.
Fisher, M., Rothermel, G., 2004. The EUSES Spreadsheet
Corpus: A Shared Resource for Supporting Experimentation
with Spreadsheet Dependability Mechanisms,
Tech. Report 04-12-03, Univ. Nebraska-Lincoln.
Hong, J., Wong, J., 2006. Marmite: End-User Programming
for the Web, Proc. CHI’06 Conf. on Human
Factors in Computing Systems, pp. 1541-1546.
Lerman, K., Minton, S., 2000. Learning the Common
Structure of Data, Proc. AAAI-2000, pp. 609-614.
Lieberman, H., Nardi, B., Wright, D., 2001. Training
Agents to Recognize Text by Example, Auton. Agents
and Multi-Agent Systems, vol. 4, no. 1, pp. 79-92.
Miller, R., Myers, B., 2001. Outlier Finding: Focusing User
Attention on Possible Errors. Proc. 14th Annual Symp.
on User Interface Software and Technology, pp. 81-90.
Mitchell, T., 1997. Machine Learning, McGraw Hill.
Nardi, B., Miller, J., Wright, D., 1998. Collaborative,
Programmable Intelligent Agents. Comm. ACM, vol. 41,
no. 3, pp. 96-104.
Panko, R., 1998. What We Know about Spreadsheet Errors,
J. End User Computing, vol. 10, no. 2, pp. 15-21.
Pandit, M., Kalbag, S., 1997. The Selection Recognition
Agent: Instant Access to Relevant Information and
Operations. Proc. 2nd Intl. Conf. on Intelligent User
Interfaces, pp. 47-52.
Scaffidi, C., Myers, B., Shaw, M., 2007. The Topes Format
Editor and Parser, Tech. Report CMU-ISRI-07-104/CMU-HCII-07-100,
School of Computer Science,
Carnegie Mellon University, Pittsburgh, PA.
Stylos, J., Myers, B., Faulring, A., 2004. Citrine: Providing
Intelligent Copy-and-Paste. Proc. 17th Annual Symp. on
User Interface Software and Technology, pp. 185-188.
UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION