CONCEPTS EXTRACTION BASED ON HTML DOCUMENTS STRUCTURE

Rim Zarrad, Narjes Doggaz, Ezzeddine Zagrouba

Abstract

Traditional methods for automatically acquiring ontology concepts from a textual corpus, whether statistical or linguistic, usually focus on analyzing the text itself. In this paper, we extend these methods by also considering document structure, which provides useful information about the meaning conveyed by the text. Our approach exploits the structure of HTML documents to extract the most relevant concepts of a given domain. Candidate terms are extracted and then filtered by analyzing their occurrences in the titles and links of the documents and by taking into account the styles applied to them.
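To make the idea of structure-based filtering concrete, here is a minimal sketch in Python. It is not the authors' implementation: the weights, the tag groups, the stopword list, and the names `StructureScorer`, `score_terms`, and `filter_terms` are all assumptions chosen for illustration. It only shows how occurrences in titles, links, and styled text could count for more than occurrences in plain body text.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: how much one occurrence counts in each context.
WEIGHTS = {"title": 3.0, "link": 2.0, "emphasis": 1.5, "body": 1.0}
TITLE_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
LINK_TAGS = {"a"}
EMPHASIS_TAGS = {"b", "strong", "i", "em", "u"}
IGNORED_TAGS = {"script", "style"}
# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "has", "in", "of", "the"}


class StructureScorer(HTMLParser):
    """Accumulates a weighted score per word based on enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, tolerating sloppy markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        enclosing = set(self.open_tags)
        if enclosing & IGNORED_TAGS:
            return
        # Use the highest weight among the contexts that apply.
        weight = WEIGHTS["body"]
        if enclosing & EMPHASIS_TAGS:
            weight = max(weight, WEIGHTS["emphasis"])
        if enclosing & LINK_TAGS:
            weight = max(weight, WEIGHTS["link"])
        if enclosing & TITLE_TAGS:
            weight = max(weight, WEIGHTS["title"])
        for word in re.findall(r"[a-z]+", data.lower()):
            if word not in STOPWORDS:
                self.scores[word] += weight


def score_terms(html):
    """Return a Counter mapping each candidate word to its score."""
    scorer = StructureScorer()
    scorer.feed(html)
    scorer.close()
    return scorer.scores


def filter_terms(scores, threshold):
    """Keep words scoring at least `threshold`, highest score first."""
    kept = [w for w, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda w: (-scores[w], w))
```

Calling `filter_terms(score_terms(page), threshold)` on an HTML string `page` returns the candidate words whose structure-weighted score reaches the chosen threshold. This sketch scores single words only; handling multi-word terms would require a separate candidate extraction step.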



Paper Citation


in Harvard Style

Zarrad R., Doggaz N. and Zagrouba E. (2012). CONCEPTS EXTRACTION BASED ON HTML DOCUMENTS STRUCTURE. In Proceedings of the 4th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8425-95-9, pages 503-506. DOI: 10.5220/0003748305030506


in Bibtex Style

@conference{icaart12,
author={Rim Zarrad and Narjes Doggaz and Ezzeddine Zagrouba},
title={CONCEPTS EXTRACTION BASED ON HTML DOCUMENTS STRUCTURE},
booktitle={Proceedings of the 4th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2012},
pages={503-506},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003748305030506},
isbn={978-989-8425-95-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 4th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - CONCEPTS EXTRACTION BASED ON HTML DOCUMENTS STRUCTURE
SN - 978-989-8425-95-9
AU - Zarrad R.
AU - Doggaz N.
AU - Zagrouba E.
PY - 2012
SP - 503
EP - 506
DO - 10.5220/0003748305030506