FUZZY SEMANTIC MATCHING IN (SEMI-)STRUCTURED XML DOCUMENTS - Indexation of Noisy Documents

Arnaud Renard, Sylvie Calabretto, Béatrice Rumpler

Abstract

Semantics is currently one of the greatest challenges in the evolution of IR systems, including the (semi-)structured IR systems considered here. Addressing this challenge usually requires an additional external semantic resource related to the document collection. Comparing concepts and, more broadly, working with semantic resources requires semantic similarity measures. These measures assume that the concepts related to the terms have been identified without ambiguity, so misspelled terms interfere with the term-to-concept matching process. Existing semantics-aware (semi-)structured IR systems rely on basic concept identification and do not account for uncertainty in term spelling. We address this last aspect and propose a way to detect and correct misspelled terms through a fuzzy semantic weighting formula that can be integrated into an IR system. To evaluate the expected gains, we have developed a prototype whose first results on small datasets seem interesting.



Paper Citation


in Harvard Style

Renard A., Calabretto S. and Rumpler B. (2010). FUZZY SEMANTIC MATCHING IN (SEMI-)STRUCTURED XML DOCUMENTS - Indexation of Noisy Documents. In Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 1: WEBIST, ISBN 978-989-674-025-2, pages 253-260. DOI: 10.5220/0002807502530260


in Bibtex Style

@conference{webist10,
author={Arnaud Renard and Sylvie Calabretto and Béatrice Rumpler},
title={FUZZY SEMANTIC MATCHING IN (SEMI-)STRUCTURED XML DOCUMENTS - Indexation of Noisy Documents},
booktitle={Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 1: WEBIST},
year={2010},
pages={253-260},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002807502530260},
isbn={978-989-674-025-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 1: WEBIST
TI - FUZZY SEMANTIC MATCHING IN (SEMI-)STRUCTURED XML DOCUMENTS - Indexation of Noisy Documents
SN - 978-989-674-025-2
AU - Renard A.
AU - Calabretto S.
AU - Rumpler B.
PY - 2010
SP - 253
EP - 260
DO - 10.5220/0002807502530260