and metadata extraction have been shown to be valuable
methods for enhancing search results. Link analysis
optimization achieved considerably better results than
anchor text optimization or no optimization. Using
link analysis, the number of necessary crawl cycles
is reduced by at least 50 %, leading to faster results
and lower resource use. The EERQI search engine,
accessible on the EERQI project website (EERQI,
2009), provides extensive search capabilities within
metadata and full-texts. It is the goal of the search
engine to gather information about a large number of
relevant “educational research” documents and pro-
vide access to information about these documents.
The first steps have been taken to achieve this goal.
A new formula (equation 1) has been developed for
focused crawling.
5 FUTURE WORK
Based on the current implementation, the next stage
of the EERQI search engine development will con-
centrate on optimized content classification (ERDD)
and metadata extraction. Further effort needs to be
put into metadata extraction from anchor texts and
full text. Preliminary tests revealed that a significant
number of anchor texts include title, author, and/or
journal names. This information may be combined with
metadata extracted from full texts. The search engine user
interface will be enhanced to improve the usability of
features such as clustering and sorting of results as well
as complex search queries.
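As an illustration of the anchor-text idea above, the following minimal sketch shows how title, author, and journal candidates might be pulled from an anchor text that happens to be formatted like a short citation. It is not the EERQI implementation: the function name extract_anchor_metadata, the pattern CITATION_LIKE, and the assumed "Author (Year). Title. Journal" layout are hypothetical and chosen only for this example. Anchor texts that do not match the assumed layout are simply skipped.

```python
import re
from typing import Dict, Optional

# Hypothetical layout assumed for this sketch only:
#   "Author(s) (Year). Title. Journal"
# Real anchor texts vary widely; this pattern is an assumption, not a finding.
CITATION_LIKE = re.compile(
    r"^(?P<author>[^()]+?)\s*"      # everything before the year in parentheses
    r"\((?P<year>\d{4})\)\.?\s*"    # four-digit year, e.g. (2008)
    r"(?P<title>[^.]+)\.\s*"        # title up to the next full stop
    r"(?P<journal>[^.]+)?\.?$"      # optional journal or publisher name
)


def extract_anchor_metadata(anchor_text: str) -> Optional[Dict[str, str]]:
    """Return metadata candidates found in an anchor text, or None."""
    # Collapse line breaks and repeated spaces left over from HTML markup.
    text = " ".join(anchor_text.split())
    match = CITATION_LIKE.match(text)
    if match is None:
        return None
    # Drop groups that did not participate in the match (e.g. no journal).
    return {key: value.strip() for key, value in match.groupdict().items() if value}
```

Candidates obtained this way would still need to be checked against metadata extracted from the full text before they are stored, since a single heuristic pattern will both miss valid anchor texts and accept some that are not citations.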
ACKNOWLEDGEMENTS
We kindly thank the partners within the EERQI
project and the colleagues at RRZN for their valuable
input and support. EERQI is funded by the European
Commission under the 7th Framework Programme,
grant 217549.
REFERENCES
Abiteboul, S., Preda, M., and Cobena, G. (2003). Adaptive
On-Line Page Importance Computation. In Proceed-
ings of the 12th international conference on World
Wide Web, pages 280–290. ACM.
Bergmark, D., Lagoze, C., and Sbityakov, A. (2002). Fo-
cused Crawls, Tunneling, and Digital Libraries. In
Proceedings of the 6th European Conference on Digi-
tal Libraries.
Chakrabarti, S., Dom, B., Raghavan, P., Rajagopalan, S.,
Gibson, D., and Kleinberg, J. (1998). Automatic Re-
source Compilation by Analyzing Hyperlink Structure
and Associated Text. In Proceedings of the Seventh
International World Wide Web Conference.
EERQI (2009). EERQI project website.
http://www.eerqi.eu.
EERQI-Annex1 (2008). EERQI Annex I - Description of
Work. http://www.eerqi.eu/sites/default/files/11-06-
2008 EERQI Annex I-1.PDF (PDF).
ERIC (2009). Education Resources Information Center
(ERIC). http://www.eric.ed.gov.
Google Scholar (2009). Google Scholar.
http://scholar.google.com.
Hadoop (2009). Apache Hadoop.
http://hadoop.apache.org/.
Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z.,
and Fox, E. (2003). Automatic Document Metadata
Extraction using Support Vector Machines. In Pro-
ceedings of the 2003 Joint Conference on Digital Li-
braries (JCDL 2003).
Kleinberg, J. (1999). Authoritative Sources in a Hyper-
linked Environment. Journal of the ACM, pages 604–
632.
Liu, B. (2008). Web Data Mining. Springer.
Lucene (2009). Apache Lucene. http://lucene.apache.org/.
Manning, C. D., Raghavan, P., and Schütze, H. (2008).
Introduction to Information Retrieval. Cambridge
University Press.
Nutch (2009). Apache Nutch. http://lucene.apache.org/
nutch/.
OAIster (2009). OAIster. http://oaister.org.
Pant, G., Tsioutsiouliklis, K., Johnson, J., and Giles, C. L.
(2004). Panorama: Extending Digital Libraries with
Topical Crawlers. In Proceedings of the 2004 Joint
ACM/IEEE Conference on Digital Libraries.
Scirus (2009). Scirus. http://www.scirus.com.
Sebastiani, F. (2002). Machine Learning in Automated Text
Categorization. ACM Computing Surveys, 34:1–47.
SVMLight (2009). SVMlight.
http://svmlight.joachims.org/.
Witten, I., Don, K. J., Dewsnip, M., and Tablan, V. (2004).
Text mining in a digital library. International Journal
on Digital Libraries.
Zheng, X., Zhou, T., Yu, Z., and Chen, D. (2008). URL Rule
Based Focused Crawler. In Proceedings of 2008 IEEE
International Conference on e-Business Engineering.
Zhuang, Z., Wagle, R., and Giles, C. L. (2005). What’s there
and what’s not? Focused crawling for missing docu-
ments in digital libraries. In Proceedings of the 5th
ACM/IEEE-CS Joint Conference on Digital Libraries.
WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies