gines are using. The system will also make better use of resources, since they will only be consumed when necessary.
3 DISCUSSION
According to studies in the literature, together with others we have performed ourselves, we have observed that Spam and Soft-404 pages represent 27.35% of the content of the .com domain. In addition, we have verified that the resources a search engine has indexed are obsolete more than 50% of the time, which also affects the quality of the results.
In the literature there is no search engine architecture that improves its performance and the quality of its results by detecting and ignoring “garbage” content, thereby reducing the number of resources used. The architecture of the web search engine we have proposed in this paper contains a module in charge of detecting web “garbage”, which improves the quality of the pages we index and process. Furthermore, a module for detecting modifications provides the system with information about changes in pages as users navigate through a site, which helps to improve the refresh policies. The results we have obtained indicate that a crawler using the Web Spam detection module and the Soft-404 detection module would avoid processing 22.37% of the resources in the worst case and 26.62% in the best one, which is practically the totality of the aforementioned 27.35% of “garbage pages”. Moreover, the modification detection module would let the search engine know the exact moment of change, so it does not waste resources returning to a page that has not changed and can decide the best moment to re-crawl it.
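To make the data flow concrete, the following sketch (in Python, with hypothetical names; the detectors shown are simple stand-ins, not the actual classifiers used in the architecture) illustrates how such a crawler could discard spam and Soft-404 pages before indexing and use change notifications to schedule re-crawls:

import heapq
import time


class GarbageAwareCrawler:
    """Sketch of a crawler front-end that filters "garbage" pages and
    prioritizes re-crawls based on detected modifications (hypothetical)."""

    def __init__(self, spam_detector, soft404_detector, index,
                 default_interval=7 * 24 * 3600):
        self.spam_detector = spam_detector      # stand-in for the Web Spam module
        self.soft404_detector = soft404_detector  # stand-in for the Soft-404 module
        self.index = index
        self.default_interval = default_interval
        self.queue = []  # min-heap of (next_crawl_time, url)

    def process(self, url, content):
        # Garbage filter: spam and Soft-404 pages are discarded here, so no
        # indexing or ranking resources are spent on them downstream.
        if self.spam_detector(content) or self.soft404_detector(content):
            return
        self.index[url] = content
        # Without change information, fall back to a fixed revisit interval.
        heapq.heappush(self.queue, (time.time() + self.default_interval, url))

    def on_modification(self, url):
        # Called when the modification-detection module reports a change
        # observed during user navigation: re-crawl the page promptly
        # instead of polling it blindly.
        heapq.heappush(self.queue, (time.time(), url))


crawler = GarbageAwareCrawler(
    spam_detector=lambda c: "buy cheap" in c,     # toy stand-in classifier
    soft404_detector=lambda c: "not found" in c,  # toy stand-in detector
    index={},
)
crawler.process("http://example.com/a", "useful article text")
crawler.on_modification("http://example.com/a")

Pages for which no change notification arrives simply keep their default revisit time, so crawling effort concentrates on resources that are known to have changed.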
Our future work consists of fully developing this architecture and assessing it in real environments. In parallel, we will develop the modules for a distributed architecture.
ACKNOWLEDGEMENTS
This work was supported by the Spanish government
(TIN 2009-14203).