Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Chen, Z. and Cafarella, M. (2013). Automatic web spreadsheet data extraction. In SSW’13, page 1. ACM.
Chen, Z. and Cafarella, M. (2014). Integrating spreadsheet data via accurate and low-effort extraction. In SIGKDD’14, pages 1126–1135. ACM.
Crestan, E. and Pantel, P. (2011). Web-scale table census and classification. In WSDM’11, pages 545–554. ACM.
Eberius, J., Braunschweig, K., Hentsch, M., Thiele, M., Ahmadov, A., and Lehner, W. (2015). Building the Dresden web table corpus: A classification approach. In BDC’15. IEEE/ACM.
Eberius, J., Werner, C., Thiele, M., Braunschweig, K., Dannecker, L., and Lehner, W. (2013). DeExcelerator: A framework for extracting relational data from partially structured documents. In CIKM’13, pages 2477–2480. ACM.
Fisher, M. and Rothermel, G. (2005). The EUSES spreadsheet corpus: A shared resource for supporting experimentation with spreadsheet dependability mechanisms. In SIGSOFT’05, volume 30, pages 1–5. ACM.
Hermans, F. and Murphy-Hill, E. (2015). Enron’s spreadsheets and related emails: A dataset and analysis. In Proceedings of ICSE’15. IEEE.
Liu, H. and Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502.
Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning. MIT Press.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.
Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag New York, Inc.
Wang, Y. and Hu, J. (2002). A machine learning based approach for table detection on the web. In WWW’02, pages 242–250. ACM.
APPENDIX
Below we provide a table with the “top” 20 features ordered by mean rank, which is calculated based on the results from 5 feature selection methods. As mentioned in Section 5.1, a 10-fold cross validation was performed for each of these methods. The CfsSubset (Cfs) and ConsistencySubset (Con) methods output the number of times a feature was selected in the final best subset. The other three methods output the rank of the feature.
Table 8: Top 20 features, considering the results from 5 feature selection methods: InfoGainAttribute (IG), GainRatioAttribute (GR), ChiSquaredAttribute (Chi), CfsSubset (Cfs), and ConsistencySubset (Con).
Nr  Feature                   IG  GR  Chi  Cfs  Con   Mean
 1  IS BOLD?                   3   2    4   10   10  37.04
 2  NUM OF TOKENS#             2  17    2   10   10  35.9
 3  H ALIGNMENT CENTER?        7  19    8   10   10  34.76
 4  COL NUM#                   4  43    3   10   10  32.56
 5  ROW NUM#                   1  52    1   10   10  31.53
 6  HAS 4 NEIGHBORS?          10  32   13   10    7  30.59
 7  INDENTATIONS#             30   4   14   10    1  23.39
 8  IS STRING?                 5   6    6   10    0  22.11
 9  FORMULA VAL NUMERIC?       9  14    9   10    0  21.28
10  IS AGGRE FORMULA?         15   1    7    0    8  20.86
11  NUM OF CELLS#             24   7   11   10    0  20.65
12  REF VAL NUMERIC?          14  22   18    0   10  20.01
13  IS CAPITALIZED?           17  33   20    0   10  19
14  H ALIGNMENT DEFAULT?      12  41   16    0   10  18.89
15  IS NUMERIC?                8  26   10    0    6  18.66
16  LENGTH#                    6  56    5    0   10  18.29
17  V ALIGNMENT BOTTOM?       27  40   30    0   10  17.35
18  LEADING SPACES#           50   8   33   10    0  17.31
19  CONTAINS PUNCTUATIONS?    28  42   31    0   10  17.08
20  IS ITALIC?                52  12   34   10    0  16.88
In order to combine these otherwise incomparable results, we decided to use the geometric mean. Features having a good rank (from the attribute methods) and selected in many folds (from the subset methods) should be listed higher in the table than others. We would also like to favor those features for which the selection methods agree (i.e., the results do not vary too much). For cases where the variance is large, we would like to penalize extreme negative results more than extreme positive ones. To achieve this for the InfoGainAttribute (IG), GainRatioAttribute (GR), and ChiSquaredAttribute (Chi) methods, we invert the
rank. This means the feature that was ranked first (rank_i = 1) by the selection method will now take the largest number (the total number of features is 88). In general, we calculate the inverted rank as (N + 1) − rank_i, where N is the total number of features and rank_i is the rank of the i-th feature in the list of results from the specified feature selection method.
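As a concrete instance of this inversion, using values already given in the text and in Table 8: the IS BOLD? feature is ranked third by IG, so with N = 88 its inverted IG rank is

(N + 1) − rank_i = (88 + 1) − 3 = 86.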
Once the ranks are inverted, we proceed with the geometric mean. Since a single zero-valued term would force the geometric mean to zero, we add 1 to each of the terms before the calculation and subtract 1 from the result. Finally, we order the features in descending order of the calculated mean.
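For illustration, the following is a minimal sketch (in Python, not taken from the original work) of this aggregation, assuming N = 88 and the per-method values exactly as listed in Table 8; the function name combined_score is our own. Applied to the first row (IS BOLD?), it reproduces the reported mean of 37.04.

```python
from math import prod

N = 88  # total number of features, as stated in the text


def combined_score(ig, gr, chi, cfs, con, n_features=N):
    """Combine the five feature selection outputs into one score.

    ig, gr, chi : ranks from the attribute evaluation methods (1 = best)
    cfs, con    : number of folds (out of 10) in which the subset
                  methods selected the feature
    """
    # Invert the ranks so that better (smaller) ranks become larger numbers.
    inverted = [(n_features + 1) - r for r in (ig, gr, chi)]
    terms = inverted + [cfs, con]
    # Shift every term by +1 so a zero does not force the geometric mean
    # to zero, take the geometric mean, then shift the result back by -1.
    shifted = [t + 1 for t in terms]
    return prod(shifted) ** (1.0 / len(shifted)) - 1


# Row 1 of Table 8 (IS BOLD?): IG=3, GR=2, Chi=4, Cfs=10, Con=10
print(round(combined_score(3, 2, 4, 10, 10), 2))  # -> 37.04
```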
The information presented in Table 8 aims at highlighting promising features. In particular, the first seven features are listed high because they achieve relatively good scores in all the considered methods. However, more sophisticated techniques, such as the ones discussed in (Liu and Yu, 2005), are required to integrate results from different feature selection methods.
For completeness, in Table 9 we display the resulting F1 measures when using Random Forest with the first 7, 15, and 20 features from Table 8. Addition-