IMPROVING CONTENT-ORIENTED XML RETRIEVAL BY APPLYING STRUCTURAL PATTERNS

Philipp Dopichaj

2007

Abstract

XML is well suited to storing (mostly) textual documents in a knowledge management system; its flexibility enables users to store both highly structured data and free text in the same document. For knowledge management, it is important to be able to search the free-text parts effectively: users need to find the information that helps them solve their problem without having to wade through large amounts of irrelevant information. Content-oriented XML retrieval addresses this challenge. In contrast to traditional information retrieval, documents are not considered atomic units; that is, elements such as sections or paragraphs can be returned. One implication is that results can overlap (for example, a paragraph and the section that contains it). Although overlapping results are undesirable in the final retrieval result as presented to the user, they can help to improve its quality: we take advantage of overlaps by applying patterns to small subtrees of the retrieval result (result contexts); matching patterns adjust the retrieval status values of the nodes involved in order to promote the best results. We demonstrate on the INEX 2005 test collection that this post-processing can lead to a significant improvement in retrieval quality.
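The post-processing idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Result` type, the single parent/child pattern, the `boost` and `penalty` factors, and the simple overlap filter are all hypothetical. It only shows the general shape of the approach, in which overlapping results are kept long enough to adjust retrieval status values and are filtered afterwards.

```python
from dataclasses import dataclass


@dataclass
class Result:
    path: str   # element path, e.g. "/article[1]/sec[1]/p[2]"
    rsv: float  # retrieval status value assigned by the base engine


def parent_path(path: str) -> str:
    """Return the path of the parent element ("" for the root element)."""
    return path.rsplit("/", 1)[0]


def apply_parent_child_pattern(results, boost=1.2, penalty=0.8):
    """Hypothetical pattern: if an element and its parent are both retrieved
    and the child scores higher, promote the child and demote the parent.
    The factors are illustrative values, not values from the paper."""
    original = {r.path: r.rsv for r in results}
    adjusted = dict(original)
    for path, rsv in original.items():
        parent = parent_path(path)
        if parent in original and rsv > original[parent]:
            adjusted[path] *= boost
            adjusted[parent] *= penalty
    # Re-rank by the adjusted retrieval status values, best first.
    return sorted(
        (Result(p, s) for p, s in adjusted.items()),
        key=lambda r: r.rsv,
        reverse=True,
    )


def remove_overlap(ranked):
    """Keep a result only if no ancestor or descendant has been kept already."""
    kept = []
    for r in ranked:
        overlaps = any(
            r.path.startswith(k.path + "/") or k.path.startswith(r.path + "/")
            for k in kept
        )
        if not overlaps:
            kept.append(r)
    return kept


results = [
    Result("/article[1]/sec[1]", 0.5),
    Result("/article[1]/sec[1]/p[2]", 0.6),
    Result("/article[1]/sec[2]", 0.3),
]
ranked = apply_parent_child_pattern(results)
final = remove_overlap(ranked)
```

In this toy input, the paragraph outscores its enclosing section, so the pattern promotes the paragraph and demotes the section, and the overlap filter then drops the section from the final list.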



Paper Citation


in Harvard Style

Dopichaj P. (2007). IMPROVING CONTENT-ORIENTED XML RETRIEVAL BY APPLYING STRUCTURAL PATTERNS. In Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 978-972-8865-89-4, pages 5-13. DOI: 10.5220/0002364700050013


in Bibtex Style

@conference{iceis07,
author={Philipp Dopichaj},
title={IMPROVING CONTENT-ORIENTED XML RETRIEVAL BY APPLYING STRUCTURAL PATTERNS},
booktitle={Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2007},
pages={5-13},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002364700050013},
isbn={978-972-8865-89-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - IMPROVING CONTENT-ORIENTED XML RETRIEVAL BY APPLYING STRUCTURAL PATTERNS
SN - 978-972-8865-89-4
AU - Dopichaj P.
PY - 2007
SP - 5
EP - 13
DO - 10.5220/0002364700050013