enable collective matching that incorporates the table
and context into a single similarity score.
In summary, the context of a table is frequently
recognized as an important resource in table under-
standing. However, the relevance of specific context
paragraphs with respect to a table has received only
limited attention so far.
6 CONCLUSION
To exploit the rich information stored in billions
of Web tables, additional contextual information is
needed to understand their content and intention. In
the most general case, the entire document containing
the Web table could be considered to support table
understanding. However, since most of this context is
not related to the Web table at all, this introduces
too much noise. Therefore, we proposed a novel
contextualization approach for Web tables based on text
tiling and similarity estimation to evaluate the rele-
vance of context information. We performed a de-
tailed analysis of state-of-the-art retrieval functions
such as TF-IDF, language models, Okapi BM25, and
LDA and applied them to the Web table in question
as well as to the different context paragraphs. Our
evaluation showed that language models with Dirich-
let smoothing deliver excellent results with an MRR
score of almost 0.98. We finally studied different
ranking schemes that enable us to effectively identify
the most relevant context paragraphs for a given Web
table.
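
To make the ranking step concrete, the following is a minimal sketch of query-likelihood scoring with Dirichlet smoothing, followed by the MRR computation. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names, the pre-tokenized inputs, the use of the table terms as the query, the background model estimated from the paragraphs of the same page, and the smoothing parameter mu = 2000 are all hypothetical choices.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Log-likelihood of query_terms under a Dirichlet-smoothed model of doc_terms."""
    if collection_len == 0:
        return 0.0  # no background statistics available
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # term never occurs in any paragraph; it cannot discriminate
        # Dirichlet smoothing: mix the paragraph counts with the background model.
        p = (doc_tf.get(term, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score


def rank_paragraphs(table_terms, paragraphs, mu=2000.0):
    """Return paragraph indices ordered from most to least relevant to the table."""
    # Assumption: the background model is estimated from the page's own paragraphs.
    collection_tf = Counter(term for par in paragraphs for term in par)
    collection_len = sum(collection_tf.values())
    scores = [
        dirichlet_score(table_terms, par, collection_tf, collection_len, mu)
        for par in paragraphs
    ]
    return sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)


def mean_reciprocal_rank(rankings, relevant):
    """MRR over several tables; relevant[i] is the index of the correct paragraph."""
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        if rel in ranking:
            total += 1.0 / (ranking.index(rel) + 1)
    return total / len(rankings)


# Toy input, invented for illustration only.
table = ["city", "population", "berlin"]
page = [
    ["cookie", "policy", "site"],
    ["population", "of", "city", "berlin", "grew"],
    ["weather", "in", "city"],
]
ranking = rank_paragraphs(table, page)
```

On this toy input the paragraph that shares the most terms with the table is ranked first. An MRR close to 1 means that the relevant paragraph is ranked first for nearly every table.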
Putting Web Tables into Context