HOW STATISTICAL INFORMATION FROM THE WEB CAN HELP IDENTIFY NAMED ENTITIES

Mathieu Roche

Abstract

This paper presents a Natural Language Processing (NLP) approach to filter Named Entities (NE) from a list of collocation candidates. NE are defined as the names of 'People', 'Places', 'Organizations', 'Software', 'Illnesses', and so forth. The proposed method identifies NE using statistical measures computed from Web resources. It has three stages: (1) building artificial prepositional collocations from Noun-Noun candidates; (2) measuring the "relevance" of the resulting prepositional collocations using statistical methods (Web Mining); (3) selecting prepositional collocations. An evaluation on Noun-Noun collocations from French and English corpora confirmed the relevance of our system.
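The three stages can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the preposition list, the word order of the artificial collocations, the hit-count ratio used as a "relevance" score, the threshold, and the decision rule (a Noun-Noun candidate whose prepositional variants are rare on the Web is flagged as a likely NE) are all hypothetical choices, since the abstract does not specify them. The hit counts are invented toy numbers standing in for real Web queries.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, str]          # a Noun-Noun collocation candidate
HitCounter = Callable[[str], int]    # maps a query string to a Web hit count

# Assumption: the abstract does not say which prepositions are used.
PREPOSITIONS = ["of", "in", "for"]


def build_prepositional_variants(noun1: str, noun2: str) -> List[str]:
    """Stage 1: build artificial 'noun2 PREP noun1' collocations."""
    return [f"{noun2} {prep} {noun1}" for prep in PREPOSITIONS]


def relevance(candidate: Candidate, hit_count: HitCounter) -> float:
    """Stage 2: score the prepositional variants with Web statistics.

    Assumption: relevance is the hit count of the most frequent variant
    divided by the hit count of the original Noun-Noun candidate.
    """
    noun1, noun2 = candidate
    base = hit_count(f"{noun1} {noun2}")
    if base == 0:
        return 0.0  # the candidate itself is unseen, so the score defaults to 0
    variants = build_prepositional_variants(noun1, noun2)
    return max(hit_count(v) for v in variants) / base


def likely_named_entities(
    candidates: List[Candidate],
    hit_count: HitCounter,
    threshold: float = 0.1,  # hypothetical cut-off
) -> List[Candidate]:
    """Stage 3: keep candidates whose prepositional variants are not relevant."""
    return [c for c in candidates if relevance(c, hit_count) < threshold]


# Invented toy counts, for illustration only (not real Web statistics).
TOY_COUNTS: Dict[str, int] = {
    "Windows Vista": 1000,
    "Vista of Windows": 2,
    "data base": 1000,
    "base of data": 400,
}


def toy_hit_count(query: str) -> int:
    """Stand-in for a Web search hit counter; unknown queries count as 0."""
    return TOY_COUNTS.get(query, 0)


if __name__ == "__main__":
    candidates = [("Windows", "Vista"), ("data", "base")]
    print(likely_named_entities(candidates, toy_hit_count))
```

Passing the hit counter as a function keeps the sketch independent of any particular search service: a real counter could be substituted for `toy_hit_count` without changing the three stage functions.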



Paper Citation


in Harvard Style

Roche M. (2011). HOW STATISTICAL INFORMATION FROM THE WEB CAN HELP IDENTIFY NAMED ENTITIES. In Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WTM, (WEBIST 2011) ISBN 978-989-8425-51-5, pages 685-689. DOI: 10.5220/0003473906850689


in Bibtex Style

@conference{wtm11,
author={Mathieu Roche},
title={HOW STATISTICAL INFORMATION FROM THE WEB CAN HELP IDENTIFY NAMED ENTITIES},
booktitle={Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WTM, (WEBIST 2011)},
year={2011},
pages={685-689},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003473906850689},
isbn={978-989-8425-51-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WTM, (WEBIST 2011)
TI - HOW STATISTICAL INFORMATION FROM THE WEB CAN HELP IDENTIFY NAMED ENTITIES
SN - 978-989-8425-51-5
AU - Roche M.
PY - 2011
SP - 685
EP - 689
DO - 10.5220/0003473906850689