Challenges and Potentials for Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps

Christian Wartena, Montserrat Garcia Alsina



Regional Innovation Systems describe the relations between actors, structures and infrastructures in a region in order to stimulate innovation and regional development. For these systems, the collection and organization of information is crucial. In this paper we investigate the possibilities of extracting information from company websites. First, we describe regional innovation systems and the types of information needed to create them. Then we discuss how text mining and keyword extraction techniques can be used to extract this information from company websites. Finally, we describe a small-scale experiment in which keywords related to economic sectors and commodities are extracted from the websites of over 200 companies. This experiment reveals the main challenges of information extraction from websites for regional innovation systems.



Paper Citation

in Harvard Style

Wartena C. and Garcia Alsina M. (2013). Challenges and Potentials for Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: SSTM, (IC3K 2013), ISBN 978-989-8565-75-4, pages 241-248. DOI: 10.5220/0004660002410248
