TARANTULA - A Scalable and Extensible Web Spider

Anshul Saxena, Keshav Dubey, Sanjay K. Dhurandher, Issac Woungang

Abstract

Web crawlers today suffer from poor navigation techniques, which reduce their scalability when crawling the World Wide Web (WWW). In this paper we present Tarantula, a web crawler that is scalable, platform independent, and fully configurable. Work on the Tarantula project began with the aim of building a simple, elegant, yet efficient web crawler that offers better crawling strategies while traversing the WWW. This paper also presents a comparison with the Heritrix crawler. The structure of the crawler facilitates new navigation techniques which, when combined with existing techniques, give better crawl results. Tarantula has a pluggable, extensible architecture that further facilitates customization by the user.


Paper Citation


in Harvard Style

Saxena A., Dubey K., Dhurandher S. and Woungang I. (2009). TARANTULA - A Scalable and Extensible Web Spider. In Proceedings of the International Conference on Knowledge Management and Information Sharing - Volume 1: KMIS, (IC3K 2009), ISBN 978-989-674-013-9, pages 167-172. DOI: 10.5220/0002302001670172


in Bibtex Style

@conference{kmis09,
author={Anshul Saxena and Keshav Dubey and Sanjay K. Dhurandher and Issac Woungang},
title={TARANTULA - A Scalable and Extensible Web Spider},
booktitle={Proceedings of the International Conference on Knowledge Management and Information Sharing - Volume 1: KMIS, (IC3K 2009)},
year={2009},
pages={167-172},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002302001670172},
isbn={978-989-674-013-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Management and Information Sharing - Volume 1: KMIS, (IC3K 2009)
TI - TARANTULA - A Scalable and Extensible Web Spider
SN - 978-989-674-013-9
AU - Saxena A.
AU - Dubey K.
AU - Dhurandher S.
AU - Woungang I.
PY - 2009
SP - 167
EP - 172
DO - 10.5220/0002302001670172