Architecture for a Garbage-less and Fresh Content Search Engine
Víctor M. Prieto, Manuel Álvarez, Rafael López García and Fidel Cacheda
Department of Information and Communication Technologies, University of A Coruña, A Coruña, Spain
Keywords:
Web Crawling, Web Spam, Soft-404, Crawling Refresh, Crawling Performance, Crawling Architecture.
Abstract:
This paper presents the architecture of a Web search engine that integrates solutions for several state-of-the-art problems, such as Web Spam and Soft-404 detection, content updating and resource usage. To this end, the system incorporates a Web Spam detection module based on techniques presented in previous works, whose success has been assessed on well-known public datasets. For Soft-404 pages, we propose new techniques that improve on those described in the state of the art. Finally, a last module allows the search engine to detect when a page has changed by observing user interaction. The tests we have performed allow us to conclude that, with the architecture we propose, it is possible to achieve important improvements in the efficacy and the efficiency of crawling systems, which directly benefits the content provided to the users.
1 INTRODUCTION
Currently, the WWW is the biggest information repository ever built, and it is continuously growing. Due to its size, search engines are indispensable for accessing the information that is relevant to each user.
A search engine faces many challenges because of the quantity, variability and quality of the information it has to gather. Among others, we can highlight: server/client side technologies (Raghavan and Garcia-Molina, 2001) (Bergman, 2000), Web Spam (Gyongyi and Garcia-Molina, 2004) (Fetterly et al., 2004), Soft-404 pages (Bar-Yossef et al., 2004), which return a 200 OK HTTP code instead of a 404 one, with or without content, repeated content (Kumar and Govindarajulu, 2009), content refresh (Brewington and Cybenko, 2000) (Cho and Garcia-Molina, 2003), bad quality content, etc. These challenges can be summarized in one: processing the largest number of web documents of the best quality, using the fewest resources and in the shortest time.
There are studies that try to improve the performance of search engines and their crawling systems by proposing different types of architectures and by trying to reduce the resources needed in certain tasks, such as disk access or URL caching. However, to the best of our knowledge, no paper proposes a global search engine architecture that improves performance by avoiding the processing of pages whose content is not appropriate for indexing and by optimizing the refresh time of pages with relevant content. According to Ntoulas et al. (Ntoulas and Manasse, 2006), 70% of the pages of the .biz domain are Spam, as are 20% of .com pages and 15% of .net pages. Moreover, in our own studies based on random requests to existing domains, we have found that 7.35% of web servers return a Soft-404 page when a request for an unknown resource is made.
If these “garbage” pages can be detected, we can protect both the users, who would otherwise waste their time and money, and the companies that develop search engines. The latter are especially affected: not only do they lose prestige when they show Spam among their results, but they also waste resources, and therefore money, analysing, indexing and showing pages that should not be shown. Failing to protect either party means an economic loss.
We have also observed that, for domains whose PageRank is between 0 and 3, Google has been out of date 60.42% of the time, Yahoo! 70.93% and Bing 66.40%. For domains with PageRank between 3 and 6, Google has been out of date 50.52% of the time, Yahoo! 72% and Bing 70.17%.
This paper presents the architecture of a search engine whose processes are optimised to avoid irrelevant pages and to refresh contents at the most appropriate moment, minimising both unnecessary refreshing and the level of obsolescence of the indexed contents.
Figure 1: Architecture proposed for a Search Engine and its Crawling System.
For that purpose, the system incorporates a module for detecting “garbage” pages (Spam and Soft-404) and a module for detecting changes that relies on user accesses, using a collaborative architecture.
This architecture includes the Spam and Soft-404 detection techniques published by Prieto et al. (Prieto et al., 2012). We also propose a system to detect changes in Web pages in “real time”. This, together with the refresh policies on which current search engines are based, will significantly improve how up to date the web resources in the search engine are.
2 ARCHITECTURE OF THE
SEARCH ENGINE
Figure 1 shows the architecture we propose for a search engine, along with the modules for detecting web “garbage” and for detecting modifications in the pages. As an introductory overview, the search engine consists of the following elements:
Crawling Module: it consists of a group of
crawlers that are in charge of traversing the Web
and creating a repository with the contents found.
Repository: set of downloaded pages.
Updating Module: its objective is to keep the
repository up to date. This module communicates
with the modification detection system that is ex-
plained in section 2.2.
Indexing Module: it is responsible for extracting the keywords of each page and for associating them with the URLs where the system has found them.
Index: it represents the set of indices generated by the Indexing Module. These indices will be used to resolve the queries sent by the users.
Query Manager: its mission is to process the queries and to return the documents that are relevant to them.
Ranking Module: it is in charge of sorting the re-
sults the Query Manager has returned according
to the relevance of the documents.
Focusing on the architecture of the crawling mo-
dule, we can highlight the following modules:
Document Controller: it checks that the content of a web document is not repeated and not very similar to that of others; in such cases the document should not be processed and indexed again.
Garbage Detector: it is in charge of detecting Spam and Soft-404 pages. It will be explained in section 2.1.
URL Extractor and Validator: it is responsible for processing the content of each page and extracting new valid links.
Visited URL Controller: its mission consists of checking whether a URL has already been processed.
robots.txt Analyser: following the Robots Exclusion Protocol, this module determines which pages may be visited and/or indexed (a minimal sketch follows this list).
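As a rough illustration of the last point, the robots.txt check could be delegated to a standard parser of the Robots Exclusion Protocol, such as the one in Python's standard library; the user agent name and URLs below are hypothetical:

from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical crawler identity


def allowed_to_fetch(url: str, robots_url: str) -> bool:
    """Consult robots.txt before the crawler visits a URL."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses the robots.txt file
    return parser.can_fetch(USER_AGENT, url)


if __name__ == "__main__":
    # Hypothetical example: may the crawler visit this page?
    print(allowed_to_fetch("http://example.com/private/page.html",
                           "http://example.com/robots.txt"))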
2.1 Garbage Detection Module
The garbage detection module has the goal of de-
tecting both Web Spam and Soft-404 pages in or-
der to avoid their processing and indexing. It is
based on decision tree algorithms that use as input
ArchitectureforaGarbage-lessandFreshContentSearchEngine
379
the result of applying different heuristics that try to
partially characterize both Web Spam and Soft-404
pages. Regarding Web Spam, unlike other approaches in the literature, this module is designed to detect all kinds of Web Spam: Cloaking (Wu and Davison, 2005a), Link Farms (Wu and Davison, 2005b), Redirection Spam (Chellapilla and Maykov, 2007) and Content Spam (Fetterly et al., 2005). It is based on the techniques introduced in (Prieto et al., 2012). Regarding Soft-404 pages, this module detects parking pages as well as pages that notify 404 errors but whose web server does not send the corresponding HTTP code. This allows the system to use fewer resources than those used by other algorithms in the literature, which send several random requests to the domain in order to compare normal pages with Soft-404 pages.
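For reference, the following is a minimal sketch of that baseline random-request strategy, which this module avoids; the domain, path length and timeout are illustrative assumptions:

import random
import string
import urllib.error
import urllib.request


def probes_as_soft404(domain: str) -> bool:
    """Baseline check from the literature: request a random, almost certainly
    non-existent resource; a 200 OK answer suggests the server returns
    Soft-404 pages instead of real 404 errors."""
    random_path = "".join(random.choices(string.ascii_lowercase, k=20))
    url = f"http://{domain}/{random_path}.html"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.getcode() == 200  # bogus URL answered with 200 OK
    except urllib.error.HTTPError:
        return False  # a genuine error code (e.g. 404) was returned
    except urllib.error.URLError:
        return False  # network problem: no conclusion can be drawn

Each such probe costs at least one extra request per domain, which is precisely the overhead that the heuristic-based detection avoids.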
This module uses C4.5 (Quinlan, 1996) as the
decision tree algorithm, along with “bagging” and
“boosting” techniques to improve the results. The
system contains a set of decision trees to detect Web
Spam and another one for Soft-404 pages, with each
tree corresponding to a configuration that employs
more or less resources (maximizing the detection or
the performance respectively). The process of this
module for each new page is the following:
The Content Analyser extracts the results of each
heuristic.
The System Configurator chooses the decision trees that will be used in each case. This choice depends on the detection level that has been previously configured and on the resources available to the detection module at that level.
Based on the results obtained by the applied heuristics, the Web Spam and Soft-404 decision trees establish the type of page.
If the page is considered Spam or Soft-404, its processing is stopped and the domain is stored in the Spam and Soft-404 Repository, which is queried by the URL Validator. If a link points to a Spam domain, the page will be discarded. If the link belongs to a domain that returns Soft-404 pages, the page is marked as a “possible” Soft-404 and analysed later.
The Supervised Result Analyser allows the decision trees to be recalculated from time to time in order to improve the quality of the detection. Feedback is obtained through the manual labelling of Spam pages.
Finally, this module communicates with the Document Controller, since the first step in the Garbage Detection Module consists of checking whether a document has already been processed or whether it has already been detected as Spam or Soft-404.
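To make the classification step more concrete, the sketch below trains a bagged decision tree over heuristic feature vectors. The feature names and training data are purely illustrative, and scikit-learn is used as a stand-in for the C4.5 implementation with bagging and boosting on which the module is actually based:

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# One row per page: illustrative heuristic results (e.g. ratio of anchor text,
# number of popular words, compression ratio, number of redirections).
X_train = np.array([
    [0.82, 120, 0.15, 3],   # spam-like page
    [0.10,  12, 0.55, 0],   # normal page
    [0.75,  95, 0.20, 2],   # spam-like page
    [0.05,   8, 0.60, 0],   # normal page
])
y_train = np.array([1, 0, 1, 0])  # 1 = Web Spam, 0 = legitimate

# Bagging over decision trees, standing in for C4.5 with bagging/boosting.
classifier = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                               n_estimators=10, random_state=0)
classifier.fit(X_train, y_train)

# Classify a newly crawled page from its heuristic results.
new_page = np.array([[0.70, 88, 0.18, 1]])
print("Web Spam" if classifier.predict(new_page)[0] == 1 else "Legitimate")

An analogous tree, trained on its own feature set, would handle the Soft-404 case, and the System Configurator would select which of the pre-trained trees to apply for the configured detection level.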
2.2 Web Modification Detection Module
This section describes the architecture of the module we propose to improve the information about modifications employed by the Updating Module of the search engine. This module is based on a collaborative architecture similar to the one used by “Google Analytics” (http://www.google.com/analytics/) for obtaining data from the client that accesses a page. In our case, the navigation of a user through a web resource is used as a change notification agent. This module is shown on the right side of Figure 1. The creator of a web site should embed a small piece of JavaScript code in the pages for which “real-time” updating by search engines is desired.
Thanks to the JavaScript code embedded in the pages, for each user request the “Request Processor” checks in the repository the date of the most recent information about that page. If the information is recent enough, the JavaScript code is not sent, so the client's browser sends nothing back to the server. Otherwise, the code that is sent computes an MD5 digest of several parts of the page and sends it to the Web Modification Detection Module, which queues the request. Finally, the “Web Modification Detector” traverses the queues of pending operations, choosing the most up-to-date information and updating the repository.
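A minimal server-side sketch of this flow is given below. The freshness threshold, the in-memory repository and queue, and the snippet path are hypothetical assumptions; the client-side MD5 computation and the web framework are omitted:

import queue
import time

FRESHNESS_WINDOW = 3600  # seconds; hypothetical threshold for "recent enough"

# repository[page_url] = {"digest": MD5 of page parts, "timestamp": last update}
repository = {}
pending = queue.Queue()  # digests reported by the embedded JavaScript

JS_SNIPPET = "<script src='/change-notifier.js'></script>"  # illustrative only


def request_processor(page_url: str) -> str:
    """Decide whether the notification script is embedded in the response."""
    entry = repository.get(page_url)
    if entry and time.time() - entry["timestamp"] < FRESHNESS_WINDOW:
        return ""        # information is recent enough: send no JavaScript
    return JS_SNIPPET    # otherwise ask the client to report an MD5 digest


def receive_report(page_url: str, md5_digest: str) -> None:
    """Queue a digest sent by a client's browser."""
    pending.put((page_url, md5_digest, time.time()))


def web_modification_detector() -> None:
    """Drain the queue, flag pages whose digest changed and refresh the repository."""
    while not pending.empty():
        page_url, md5_digest, timestamp = pending.get()
        entry = repository.get(page_url)
        if entry is None or entry["digest"] != md5_digest:
            print(f"{page_url} changed; notify the Updating Module")
        repository[page_url] = {"digest": md5_digest, "timestamp": timestamp}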
The system we have designed is highly scalable, since a large part of the processing is done in the users' web browsers. Moreover, in situations of high load or attacks against the system (DDoS, Distributed Denial of Service), the module stops sending the JavaScript; if the load is high enough to affect the system for a long time, the system simply waits, receiving no information during that period but not affecting the final users either.
The Updating Module queries the Web Modification Detection Module to know which pages have changed. Then, the Updating Module decides whether to re-crawl each page, considering additional data (relevance, frequency of change, type of page, etc.).
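Purely as an illustration of how such a decision might combine these signals (the weights and field names are hypothetical, not part of the proposed system):

def recrawl_priority(changed: bool, relevance: float,
                     change_frequency: float, is_auto_generated: bool) -> float:
    """Toy scoring function: pages flagged as changed get priority,
    modulated by their relevance and historical change frequency."""
    if not changed:
        return 0.0
    score = relevance * (1.0 + change_frequency)
    if is_auto_generated:
        score *= 0.5  # e.g. auto-generated listings can wait a little longer
    return score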
Current re-crawling policies consist of polling web sites more or less frequently to detect changes. This approach has two problems: (1) visiting pages that have not changed, which implies wasting resources, and (2) not having the repositories up to date in “real time”. In our approach, by contrast, the crawler is notified when a page has changed, so the repository will be more up to date, with the consequent improvement in the quality of the indices that the search engines use.
The system will also improve resource usage, since resources will only be consumed when necessary.
3 DISCUSSION
According to studies existing in the literature and others we have performed, we have observed that Spam and Soft-404 pages represent 27.35% of the content of the .com domain. In addition, we have verified that the resources a search engine has indexed are obsolete more than 50% of the time, which also affects the quality of the results.
In the literature, there is no search engine architecture that improves its performance and the quality of its results by detecting and ignoring “garbage” content, thereby reducing the amount of resources used. The architecture of the web search engine proposed in this paper contains a module in charge of detecting web “garbage”, which improves the quality of the pages that are indexed and processed. Furthermore, a module for detecting modifications provides the system with information about the changes in the pages as users navigate through a site, which helps to improve the refresh policies. The results we have obtained indicate that a crawler using the Web Spam detection module and the Soft-404 detection module would avoid processing 22.37% of the resources in the worst case and 26.62% in the best one, which is practically the totality of the aforementioned 27.35% of “garbage” pages. Moreover, the modification detection module would allow the search engine to know the exact moment of change, avoiding wasted resources on revisiting pages that have not changed and allowing the system to decide the best moment to re-crawl them.
Our future work consists of fully developing this architecture and assessing it in real environments. In parallel, we will develop the modules for a distributed architecture.
ACKNOWLEDGEMENTS
This work was supported by the Spanish government
(TIN 2009-14203).
REFERENCES
Bar-Yossef, Z., Broder, A. Z., Kumar, R., and Tomkins, A.
(2004). Sic transit gloria telae: towards an understand-
ing of the web’s decay. In Proceedings of the 13th
international conference on World Wide Web, WWW
’04, pages 328–337, New York, NY, USA. ACM.
Bergman, M. K. (2000). The deep web: Surfacing hidden
value.
Brewington, B. and Cybenko, G. (2000). How dynamic is
the web? pages 257–276.
Chellapilla, K. and Maykov, A. (2007). A taxonomy of
javascript redirection spam. In Proceedings of the
3rd international workshop on Adversarial informa-
tion retrieval on the web, AIRWeb ’07, pages 81–88,
New York, NY, USA. ACM.
Cho, J. and Garcia-Molina, H. (2003). Estimating fre-
quency of change. ACM Trans. Internet Technol.,
3:256–290.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam,
damn spam, and statistics: using statistical analysis
to locate spam web pages. In Proceedings of the 7th
International Workshop on the Web and Databases:
colocated with ACM SIGMOD/PODS 2004, WebDB
’04, pages 1–6, New York, NY, USA. ACM.
Fetterly, D., Manasse, M., and Najork, M. (2005). Detect-
ing phrase-level duplication on the world wide web. In
Proceedings of the 28th Annual International ACM
SIGIR Conference on Research & Development in In-
formation Retrieval, pages 170–177. ACM Press.
Gyongyi, Z. and Garcia-Molina, H. (2004). Web spam tax-
onomy. Technical Report 2004-25, Stanford InfoLab.
Kumar, J. P. and Govindarajulu, P. (2009). Duplicate and
near duplicate documents detection: A review. Euro-
pean Journal of Scientific Research, 32:514–527.
Ntoulas, A. and Manasse, M. (2006). Detecting spam web
pages through content analysis. In Proceedings of
the World Wide Web conference, pages 83–92. ACM
Press.
Prieto, V. M., Álvarez, M., and Cacheda, F. (2012). Analy-
sis and detection of web spam by means of web con-
tent. In Proceedings of the 5th Information Retrieval
Facility Conference, IRFC ’12.
Quinlan, J. R. (1996). Bagging, boosting, and C4.5. In
Proceedings of the Thirteenth National Conference on
Artificial Intelligence, pages 725–730. AAAI Press.
Raghavan, S. and Garcia-Molina, H. (2001). Crawling the
hidden web. In Proceedings of the 27th International
Conference on Very Large Data Bases, VLDB ’01,
pages 129–138, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
Wu, B. and Davison, B. D. (2005a). Cloaking and redirec-
tion: A preliminary study.
Wu, B. and Davison, B. D. (2005b). Identifying link farm
spam pages. In Special interest tracks and posters of
the 14th international conference on World Wide Web,
WWW ’05, pages 820–829, New York, NY, USA.
ACM.
ArchitectureforaGarbage-lessandFreshContentSearchEngine
381