Putting Web Tables into Context

Katrin Braunschweig, Maik Thiele, Elvis Koci, Wolfgang Lehner

2016

Abstract

Web tables are a valuable source of information used in many application areas. However, to exploit Web tables, it is necessary to understand their content and intention, which is impeded by their ambiguous semantics and inconsistencies. Therefore, additional context information, e.g., the text in which the tables are embedded, is needed to support the table understanding process. In this paper, we propose a novel contextualization approach that 1) splits the table context into topically coherent paragraphs, 2) provides a similarity measure that is able to match each paragraph to the table in question, and 3) ranks these paragraphs according to their relevance. Each step is accompanied by an experimental evaluation on real-world data, showing that our approach is feasible and effectively identifies the most relevant context for a given Web table.
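The three steps named in the abstract (segment the context, score each segment against the table, rank by score) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the segmentation algorithm or the similarity measure, so the sketch assumes blank-line splitting as a stand-in for topical segmentation and TF-IDF cosine similarity as a stand-in for the proposed measure. All function names (`split_paragraphs`, `rank_context`, and so on) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_paragraphs(context):
    """Assumed stand-in for topical segmentation: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", context) if p.strip()]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_context(table_text, context):
    """Return (score, paragraph) pairs, most similar to the table first."""
    paragraphs = split_paragraphs(context)
    docs = [tokenize(p) for p in paragraphs]
    query = tokenize(table_text)

    # Document frequencies over the paragraphs plus the table text.
    n = len(docs) + 1
    df = Counter()
    for tokens in docs + [query]:
        df.update(set(tokens))

    def vectorize(tokens):
        # Term frequency times a smoothed inverse document frequency.
        tf = Counter(tokens)
        return {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}

    q = vectorize(query)
    scored = [(cosine(q, vectorize(d)), p) for d, p in zip(docs, paragraphs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_context(table_text, page_text)` with the flattened cell contents of a table and the surrounding page text returns the paragraphs ordered by decreasing similarity. A paragraph sharing no terms with the table scores 0.0 and ends up at the bottom of the ranking.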



Paper Citation


in Harvard Style

Braunschweig K., Thiele M., Koci E. and Lehner W. (2016). Putting Web Tables into Context. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016) ISBN 978-989-758-203-5, pages 158-165. DOI: 10.5220/0006034701580165


in Bibtex Style

@conference{kdir16,
author={Katrin Braunschweig and Maik Thiele and Elvis Koci and Wolfgang Lehner},
title={Putting Web Tables into Context},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},
year={2016},
pages={158-165},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006034701580165},
isbn={978-989-758-203-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)
TI - Putting Web Tables into Context
SN - 978-989-758-203-5
AU - Braunschweig K.
AU - Thiele M.
AU - Koci E.
AU - Lehner W.
PY - 2016
SP - 158
EP - 165
DO - 10.5220/0006034701580165