FEATURES FOR NAMED ENTITY RECOGNITION IN CZECH LANGUAGE

Pavel Král

Abstract

This paper deals with Named Entity Recognition (NER). Our work targets an application for the Czech News Agency (ČTK). We propose and implement a Czech NER system that facilitates searching the ČTK text news databases. The choice of the feature set is crucial for the NER task. The main contribution of this work is therefore to propose and evaluate several features for named entity recognition and to assemble an “optimal” feature set. We use Conditional Random Fields (CRFs) as the classifier. The system is tested on a Czech NER corpus with nine main named entity classes. With the best feature set we reached an F-measure of 58%, which is sufficient for our target application.
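To make the two technical notions in the abstract concrete, the following minimal Python sketch shows what per-token features for a sequence classifier such as a CRF might look like, and how an F-measure can be computed from gold and predicted entities. Everything here is an assumption made for illustration: the feature names, the `token_features` and `f_measure` helpers, the class labels, and the exact-match scoring are not taken from the paper, whose actual feature set and evaluation protocol are not given in the abstract.

```python
from typing import Dict, List, Set, Tuple


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Generic, illustrative features for the token at position i.

    These are common NER features, NOT the feature set evaluated in the paper.
    """
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        # Context features: neighbouring words, with boundary markers.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# An entity is (start token index, end token index, class label).
Entity = Tuple[int, int, str]


def f_measure(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Balanced F-measure under exact span-and-label matching (an assumption)."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tokens = ["Václav", "Havel", "navštívil", "Brno"]
    print(token_features(tokens, 0))

    # Hypothetical labels: "person" and "location".
    gold = {(0, 1, "person"), (3, 3, "location")}
    predicted = {(0, 1, "person"), (2, 2, "location")}
    print(f_measure(gold, predicted))
```

Feature dictionaries of this shape, one per token, are the usual input to CRF toolkits; the F-measure is the harmonic mean of precision and recall, so it is high only when both are high.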


Paper Citation


in Harvard Style

Král P. (2011). FEATURES FOR NAMED ENTITY RECOGNITION IN CZECH LANGUAGE. In Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2011) ISBN 978-989-8425-80-5, pages 437-441. DOI: 10.5220/0003660104370441


in Bibtex Style

@conference{keod11,
author={Pavel Král},
title={FEATURES FOR NAMED ENTITY RECOGNITION IN CZECH LANGUAGE},
booktitle={Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2011)},
year={2011},
pages={437-441},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003660104370441},
isbn={978-989-8425-80-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2011)
TI - FEATURES FOR NAMED ENTITY RECOGNITION IN CZECH LANGUAGE
SN - 978-989-8425-80-5
AU - Král P.
PY - 2011
SP - 437
EP - 441
DO - 10.5220/0003660104370441