4 CONCLUSIONS AND FUTURE WORK
This paper presents a text mining method for extracting
collocation candidates and identifying Named Entities (NEs)
among them. The filtering method relies solely on a
statistical approach that combines the Dice measure with
the results returned by search engines. An NE is "stable",
that is, it tolerates little variation, which is why we built
variations of each candidate and checked their popularity
using search engines. If the variations built from a candidate
turned out to be irrelevant (i.e. they obtained a low value of
the statistical measure), the candidate was considered an NE.
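The filtering idea summarized above can be illustrated with a minimal sketch. It assumes the standard form of the Dice measure, 2 * nb(x, y) / (nb(x) + nb(y)), computed from hit counts. The hit_count callable, the threshold value, the way variations are represented, and the toy counts are all hypothetical choices made for illustration; they are not the rules or settings used in this paper.

```python
from typing import Callable, Iterable, Tuple

# A variation is (word_x, word_y, phrase): the two words and the query
# string for their joint occurrence. This representation is an assumption.
Variation = Tuple[str, str, str]


def dice(hits_x: int, hits_y: int, hits_xy: int) -> float:
    """Standard Dice coefficient from counts: 2 * nb(x, y) / (nb(x) + nb(y))."""
    total = hits_x + hits_y
    return 2.0 * hits_xy / total if total else 0.0


def is_named_entity(
    variations: Iterable[Variation],
    hit_count: Callable[[str], int],  # hypothetical stand-in for a search engine
    threshold: float = 0.01,          # hypothetical cut-off, not from the paper
) -> bool:
    """Return True if every variation of the candidate scores below the threshold.

    A "stable" candidate has only unpopular variations, so it is kept as an NE.
    """
    for word_x, word_y, phrase in variations:
        score = dice(hit_count(word_x), hit_count(word_y), hit_count(phrase))
        if score >= threshold:
            return False  # a popular variation exists: the candidate is not stable
    return True


# Toy hit counts, invented for this example (no real search engine is queried).
toy_counts = {"new": 1000, "york": 800, "york new": 2}
toy_hits = lambda query: toy_counts.get(query, 0)

print(is_named_entity([("york", "new", "york new")], toy_hits))
```

Passing the hit counter as a parameter keeps the sketch independent of any particular search engine, so cached or offline counts can be substituted.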
In the next step of our work, we plan to enrich the
rules for building variations, which is a precondition
for covering the vast majority of possible linguistic
variations. Finally, we plan to combine this approach,
which is based only on statistical knowledge, with
lexical information, particularly the use of
uppercase letters when possible.
ACKNOWLEDGEMENTS
I thank Daphne Goodfellow, who improved the
readability of this paper.
HOW STATISTICAL INFORMATION FROM THE WEB CAN HELP IDENTIFY NAMED ENTITIES