Subtopic Ranking based on Hierarchical Headings

Tomohiro Manabe, Keishi Tajima

2016

Abstract

We propose methods for generating diversified rankings of subtopics of keyword queries. Our methods are characterized by their awareness of the hierarchical heading structure of documents. This structure consists of nested logical blocks with headings, and each heading concisely describes the topic of its block. Hierarchical headings in documents therefore reflect the hierarchy of topics the documents discuss. Based on this idea, our methods score subtopic candidates by matching them against the hierarchical headings in documents, giving higher scores to candidates that match headings associated with more content. To diversify the resulting rankings, every time our methods adopt the candidate with the best score, they exclude the blocks matching that candidate and re-score all remaining blocks and candidates. In our evaluation on the NTCIR data set, our methods generated significantly better subtopic rankings than the query completion results of major commercial search engines.
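The greedy adopt, exclude, and re-score loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Block` type, the term-containment rule in `matches`, and the content-length sum in `score` are all hypothetical stand-ins, since the abstract does not specify the actual matching or scoring functions. The sketch also re-scores only candidates, whereas the abstract mentions re-scoring blocks as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A logical block with its chain of headings (outermost first).

    Hypothetical type: content_length stands in for "amount of content".
    """
    headings: tuple[str, ...]
    content_length: int


def matches(candidate: str, block: Block) -> bool:
    """Assumed rule: every candidate term occurs in some heading of the block."""
    heading_terms = {t for h in block.headings for t in h.lower().split()}
    return all(term in heading_terms for term in candidate.lower().split())


def score(candidate: str, blocks: list[Block]) -> int:
    """Assumed rule: total content of the blocks whose headings match."""
    return sum(b.content_length for b in blocks if matches(candidate, b))


def rank_subtopics(candidates: list[str], blocks: list[Block]) -> list[str]:
    """Greedy diversified ranking: adopt the best candidate, drop its blocks, repeat."""
    remaining_candidates = list(candidates)
    remaining_blocks = list(blocks)
    ranking = []
    while remaining_candidates:
        # Re-score every remaining candidate against the remaining blocks.
        best = max(remaining_candidates, key=lambda c: score(c, remaining_blocks))
        ranking.append(best)
        remaining_candidates.remove(best)
        # Exclude the blocks already covered by the adopted candidate.
        remaining_blocks = [b for b in remaining_blocks if not matches(best, b)]
    return ranking


# Made-up example data, for illustration only.
blocks = [
    Block(("Kyoto", "Hotels"), 500),
    Block(("Kyoto", "Cheap hotels"), 450),
    Block(("Kyoto", "Temples"), 400),
]
candidates = ["kyoto hotels", "kyoto cheap hotels", "kyoto temples"]
print(rank_subtopics(candidates, blocks))
# ['kyoto hotels', 'kyoto temples', 'kyoto cheap hotels']
```

In this toy input, "kyoto cheap hotels" initially scores higher than "kyoto temples" (450 versus 400), but once the blocks covered by "kyoto hotels" are excluded its score drops to zero, so the less redundant candidate is ranked second.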



Paper Citation


in Harvard Style

Manabe T. and Tajima K. (2016). Subtopic Ranking based on Hierarchical Headings. In Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-186-1, pages 121-130. DOI: 10.5220/0005812401210130


in Bibtex Style

@conference{webist16,
author={Tomohiro Manabe and Keishi Tajima},
title={Subtopic Ranking based on Hierarchical Headings},
booktitle={Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2016},
pages={121-130},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005812401210130},
isbn={978-989-758-186-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - Subtopic Ranking based on Hierarchical Headings
SN - 978-989-758-186-1
AU - Manabe T.
AU - Tajima K.
PY - 2016
SP - 121
EP - 130
DO - 10.5220/0005812401210130