A NEW APPROACH TOWARDS VERTICAL SEARCH ENGINES - Intelligent Focused Crawling and Multilingual Semantic Techniques

Sybille Peters, Claus-Peter Rückemann, Wolfgang Sander-Beuermann

Abstract

Search engines typically consist of a crawler, which traverses the web retrieving documents, and a search frontend, which provides the user interface to the acquired information. Focused crawlers refine this design by intelligently directing the crawler to predefined topic areas. Search engines today are evolving by offering more search capabilities, such as search over metadata in addition to search within the content text. Semantic web standards supply methods for augmenting webpages with metadata. Machine learning techniques are used where necessary to gather more metadata from unstructured webpages. This paper analyzes the effectiveness of techniques for vertical search engines with respect to focused crawling and metadata integration, using the field of “educational research” as an example. A search engine for these purposes, implemented within the EERQI project, is described and tested. An enhancement of focused crawling based on link analysis and anchor text classification is implemented and verified. A new heuristic score calculation formula has been developed for focusing the crawler. Full texts and metadata from various multilingual sources are collected and combined into a common format.

Paper Citation


in Harvard Style

Peters S., Rückemann C. and Sander-Beuermann W. (2010). A NEW APPROACH TOWARDS VERTICAL SEARCH ENGINES - Intelligent Focused Crawling and Multilingual Semantic Techniques. In Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 2: WEBIST, ISBN 978-989-674-025-2, pages 181-186. DOI: 10.5220/0002777901810186


in Bibtex Style

@conference{webist10,
author={Sybille Peters and Claus-Peter Rückemann and Wolfgang Sander-Beuermann},
title={A NEW APPROACH TOWARDS VERTICAL SEARCH ENGINES - Intelligent Focused Crawling and Multilingual Semantic Techniques},
booktitle={Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 2: WEBIST},
year={2010},
pages={181-186},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002777901810186},
isbn={978-989-674-025-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 2: WEBIST
TI - A NEW APPROACH TOWARDS VERTICAL SEARCH ENGINES - Intelligent Focused Crawling and Multilingual Semantic Techniques
SN - 978-989-674-025-2
AU - Peters S.
AU - Rückemann C.
AU - Sander-Beuermann W.
PY - 2010
SP - 181
EP - 186
DO - 10.5220/0002777901810186