ON THE EVOLUTION OF SEARCH ENGINE RANKINGS
Panagiotis Takis Metaxas
Computer Science Department, Wellesley College, Wellesley, MA02481, U.S.A.
Keywords:
Search Engines, Web search, Web graph, Link structure, PageRank, HITS, Web Spam, Social Networks.
Abstract:
Search Engines have greatly influenced the way we experience the web. Since the early days of the web, users
have been relying on them to get informed and make decisions. When the web was relatively small, web
directories were built and maintained using human experts to screen and categorize pages according to their
characteristics. By the mid 1990’s, however, it was apparent that the human expert model of categorizing web
pages did not scale. The first search engines appeared and they have been evolving ever since, taking over
the role that web directories used to play.
But what makes a search engine evolve? Beyond the financial objectives, there is a need for quality in
search results. Users interact with search engines through search query results. Search engines know that the
quality of their ranking will determine how successful they are. If users perceive the results as valuable and
reliable, they will use the engine again. Otherwise, it is easy for them to switch to another search engine.
Search results, however, are not simply based on well-designed scientific principles, but they are influenced
by web spammers. Web spamming, the practice of introducing artificial text and links into web pages to affect
the results of web searches, has been recognized as a major search engine problem. It is also a serious
problem for users, because they are not aware of it and tend to confuse trusting the search engine with trusting
the results of a search.
In this paper, we analyze the influence that web spam has on the evolution of the search engines, and we identify
a strong relationship between spamming methods on the web and propagandistic techniques in society. Our analysis
provides a foundation for understanding why spamming works and offers new insight on how to address it. In
particular, it suggests that one could use social anti-propagandistic techniques to recognize web spam.
1 INTRODUCTION
Search Engines have greatly influenced the way we
experience the web. Since the early days of the web
people have been relying on search engines to find
useful information. When the web was relatively
small, Web directories were built and maintained by
human experts who screened and categorized pages
according to their characteristics. By the mid
1990’s, however, it was apparent that the human
expert model of categorizing web pages would not scale.
The first search engines appeared and they have been
evolving ever since.
But what influences their evolution? The way
a user interacts with a search engine is through the
search results to a query that he or she has issued.
Search engines know that the quality of their rank-
ing will determine how successful they are. If users
perceive the results as valuable and reliable, they will
come again. Otherwise, it is easy for them to switch
to another search engine.
Research in Information Retrieval has produced a
large body of work that, theoretically, produces high
quality search results. Yet, search engines admit that
IR theory is but one of their considerations. One of
the major issues that influences the quality of ranking
is the effect that web spam has on their results. Web
spamming is defined as the practice of manipulating
web pages in order to influence search engine rank-
ings in ways beneficial to the spammers.
Spammers aim at search engines, but target the
end users. Their motive is usually commercial, but
can also be political or religious.
One of the reasons behind the users’ difficulty
to distinguish trustworthy from untrustworthy infor-
mation comes from the success that both search en-
gines and spammers have enjoyed in the last decade.
Users have come to trust search engines as a means of
finding information, and spammers have successfully
managed to exploit this trust.
Some estimates indicate that at least 8% of all
indexed pages are spam (Fetterly et al., 2004), while
experts consider web spamming the single most difficult
challenge web searching is facing today (Henzinger
et al., 2002). Search engines typically see web spam
as an interference to their operations and would like
to restrict it, but there can be no algorithm that can
recognize spamming sites based solely on graph iso-
morphism (Bianchini et al., 2003).
First, however, we need to understand why spam-
ming works beyond the technical details, because
spamming is a social problem first, then a technical
one. In this paper we show its extensive relationship
to social propaganda, and evidence of its influence on
the evolution of search engines. Our approach can
explain the reasons why web spamming has been so
successful and suggest new ways of dealing with it.
Finally, we present a framework for the long-term ap-
proach to web spam.
1.1 Background
Web spamming has received a lot of attention lately
(Bharat et al., 2001; Bianchini et al., 2003; Fetterly
et al., 2004; Fetterly et al., 2003; Gy¨ongyi et al., 2004;
Henzinger, 2001; Henzinger et al., 2002; Introna and
Nissenbaum, 2000; Kumar et al., 1999; Lynch, 2001;
Marchiori, 1997; Pringle et al., 1998). The first papers
to raise the issue were (Marchiori, 1997; Henzinger
et al., 2002). The spammers’ success was noted in
(Bharat et al., 2001; Corey, 2001; Fetterly et al., 2004;
Fetterly et al., 2003; Hindman et al., 2003).
Characteristics of spamming sites based on di-
version from power laws are presented in (Fetterly
et al., 2004). Current tricks employed by spammers
are detailed in (Gyöngyi and Garcia-Molina, 2005).
An analysis of the popular PageRank method em-
ployed by many search engines today and ways to
maximize it in a spamming network is described in
(Bianchini et al., 2003). TrustRank, a modification
of PageRank that takes into account the evaluations of
a few seed pages by human editors (employees of a
search engine), is presented in (Gyöngyi et al., 2004).
Techniques for automatically identifying link farms
of spam pages were presented in (Wu and Davison,
2005; Benczúr et al., 2005).
A comprehensive treatment on social networks is
presented in (Wasserman and Faust, 1994). The con-
nection between the Web and social networks was ex-
plicitly noted in (Raghavan, 2002) and implicitly used
in (Brin and Page, 1998; Kleinberg, 1999). Iden-
tification of web communities was explored in (Ku-
mar et al., 1999; Flake et al., 2002). The effect that
search engines have on page popularity was discussed
in (Cho and Roy, 2004).
The rest of this paper is organized as follows. The
next section gives an overview of the problem of in-
formation reliability and web spamming. Section 3
gives a short introduction to the theory of propaganda
detection, and Section 4 discusses the relationship
between the webgraph and the trust social network.
Section 5 analyzes the evolution of search engines
as a response to spam. Finally,
the last section has the conclusions and a framework
for the long-term approach to web spam.
2 WEB SPAM
The web has changed the way we inform and get in-
formed. Every organization has a web site and peo-
ple are increasingly comfortable accessing it for in-
formation on any question they may have. The ex-
ploding size of the web necessitated the development
of search engines and web directories. Most peo-
ple with online access use a search engine to get in-
formed and make decisions that may have medical,
financial, cultural, political, security or other impor-
tant implications in their lives (Corey, 2001; Vedder,
2000; Hindman et al., 2003; Lynch, 2001). More-
over, 85% of the time, people do not look past the
first ten results returned by the search engine (Silver-
stein et al., 1999). Given this, it is not surprising that
anyone with a web presence struggles for a place in
the top ten positions of relevant web search results.
The importance of the top-10 placement has given
birth to a new “Search Engine Optimization” industry,
which claims to sell know-how for prominent place-
ment in search results and includes companies, pub-
lications, and even conferences. Some of them are
willing to bend the truth in order to fool the search
engines and their customers, by creating web pages
containing web spam (Fetterly et al., 2004).
Spammers attack search engines through text and
link manipulations:
Text Spam. This includes repeating text exces-
sively and/or adding irrelevant text on the page
that will cause incorrect calculation of page rele-
vance; adding misleading meta-keywords or irrel-
evant “anchor text” that will cause incorrect appli-
cation of rank heuristics.
Link Spam. This technique aims to change the
perceived structure of the webgraph in order to
cause incorrect calculation of page reputation.
Such examples are the so-called link farms,
page “awards” [1], and domain flooding (a plethora of
domains that re-direct to a target site), etc.

[1] With this technique, the spammer pretends to run an
organization that distributes awards for web site design or
information. The awarded site gets to display the “award”,
an image linking back to the awarding organization. The effect
is that the awarded site increases the visibility of the
spammer’s site.
Both kinds of spam aim to boost the ranking of
spammed web pages. So as not to get caught, spam-
mers conceal their actions through cloaking, content
hiding and redirection. Cloaking, for example, aims
to serve different pages to search engine robots and to
web browsers (users).
For a comprehensive treatment of the spam-
ming techniques, the interested reader is referred to
(Gyöngyi and Garcia-Molina, 2005).
Since anyone can be an author on the web, these
practices have naturally created a question of infor-
mation reliability. An audience used to trusting the
written word of newspapers and books is unable, un-
prepared or unwilling to think critically about the in-
formation obtained from the web. A recent study
(Graham and Metaxas, 2003) found that while college
students regard the web as a primary source of infor-
mation, many do not check more than a single source,
and have trouble recognizing trustworthy sources on-
line.
We have no reason to believe that the general pub-
lic will perform any better than well-educated stu-
dents. In fact, a recent analysis of internet-related
fraud by a major Wall Street law firm (Corey, 2001)
attributes the success of stock fraud cases squarely
to the questionable critical thinking skills of the
investors.
3 ON PROPAGANDA THEORY
At the outset, it may seem surprising that a technical
article discusses social propaganda. This is a subject
that has been studied extensively by social scientists
and might seem out of the realm of computing. How-
ever, the web is a social network, influenced daily by
the actions (intentional or otherwise) of millions of
people. In that respect, web researchers should be
aware of social theories and practices since they may
have applicability in their work. We believe that a ba-
sic understanding of social propaganda can be valu-
able to technical people designing and using systems
that affect our social interactions.
We offer here a brief introduction to the theory of
propaganda detection.
There are many definitions of propaganda, reflect-
ing its multiple uses over time. One working defini-
tion we will use here is:
Propaganda is the attempt to modify human be-
havior, and thus influence people’s actions in ways
beneficial to propagandists.
Propaganda has a long history in modern soci-
ety and is often associated with negative connotations.
This was not always the case, however. The term was
first used in 1622, when the Catholic Church established
a permanent Sacred Congregation de Propaganda Fide
(for the propagation of faith), a department which was
trying to spread Catholicism in non-Catholic countries.
Its current meaning comes from the successful Enemy
Propaganda Department in the British Ministry of
Information during WWI. However, it was not until
1938, on the eve of WWII,
that a theory was developed to detect propagandistic
techniques. For the purposes of this paper we are in-
terested in ways of detecting propaganda, especially
by automatic means.
First developed by the Institute for Propaganda
Analysis (Lee and Lee(eds.), 1939), classic Propa-
ganda Theory identifies several techniques that pro-
pagandists often employ in order to manipulate per-
ception.
Name Calling is the practice of giving an idea a
bad label. It is used to make people reject and con-
demn the idea without examining the evidence.
For example, using the term “miserable failure”
to refer to political leaders such as US President
George Bush can be thought of as an application
of name calling.
Glittering Generalities is the mirror image [2] of
name calling: associating an idea with a “virtue
word”, in an effort to make us accept and approve
the idea without examining the evidence. For ex-
ample, using the term “patriotic” to refer to illegal
actions is a common application of this technique.
Transfer is the technique by which the propagan-
dist carries over the authority, sanction, and pres-
tige of something respected and revered to some-
thing he would have us accept. For example, de-
livering a political speech in a mosque or a church,
or ending a political gathering with a prayer have
the effect of transfer.
Testimonial is the technique of having some re-
spected person comment on the quality of an is-
sue on which they have no qualifications to com-
ment. For example, a famous actor who plays a
medical doctor on a popular TV show tells the
viewers that she only uses a particular pain relief
medicine. The implicit message is that if a famous
personality trusts the medicine, we should too.
[2] Name calling and glittering generalities are sometimes
referred to as “word games”.
Plain Folks is a technique by which speakers at-
tempt to convince their audience that they, and
their ideas, are “of the people,” the “plain folks”.
For example, politicians sometimes are seen flip-
ping burgers at a neighborhood diner.
Card Stacking involves the selection of facts
(or falsehoods), illustrations (or distractions), and
logical (or illogical) statements in order to give an
incorrect impression. For example, some activists
refer to the Evolution Theory as a theory teaching
that humans came from apes (and not that both
apes and humans have evolved from a common
ancestor who was neither human nor ape).
Bandwagon is the technique with which the pro-
pagandist attempts to convince us that all mem-
bers of a group we belong to accept his ideas and
so we should “jump on the band wagon”. Often,
fear is used to reinforce the message. For exam-
ple, commercials might show shoppers running to
line up in front of a store before it is open.
The reader should not have much trouble identi-
fying additional examples of such techniques used in
politics or advertising. The next section discusses the
relationship of propaganda to web spam, by first de-
scribing the similarity of social networks to the web
graph.
4 THE WEBGRAPH AS A TRUST
NETWORK
The web is typically represented by a directed graph
(Broder et al., 2000). The nodes in the webgraph are
the pages (or sites) that reside on servers on the in-
ternet. Arcs correspond to hyperlinks that appear on
web pages (or sites). In this context, web spammers’
actions can be seen as altering the contents of the web
nodes (mainly through text spam), and the hyperlinks
between nodes (mainly through link spam).
The theory of social networks (Wasserman and
Faust, 1994) also uses directed graphs to represent
relationships between social entities. The nodes cor-
respond to social entities (people, institutions, ideas).
Arcs correspond to recommendations between the en-
tities they connect. In this context, propagandistic
techniques can be seen as altering the trust social net-
work by altering one or more of its components (i.e.,
nodes, arcs, weights, topology).
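As a concrete illustration of this graph view, here is a minimal Python sketch of a webgraph as an adjacency list, with text spam and link spam expressed as edits to node contents and arcs. All page names and helper functions below are ours, purely for illustration:

```python
# Illustrative toy webgraph: node -> set of nodes it links to.
webgraph = {
    "page_a": {"page_b", "page_c"},
    "page_b": {"page_c"},
    "page_c": set(),
}
contents = {"page_a": "reviews of cameras", "page_b": "camera shop"}

def add_link_spam(graph, controlled_pages, target):
    """Link spam alters the arcs: many controlled nodes all point to
    the target, faking independent 'votes of confidence'."""
    for p in controlled_pages:
        graph.setdefault(p, set()).add(target)

def add_text_spam(contents, page, stuffed_terms):
    """Text spam alters a node's contents, changing its perceived
    relevance to queries containing the stuffed terms."""
    contents[page] = contents.get(page, "") + " " + " ".join(stuffed_terms)

# A link farm and some keyword stuffing, in these terms:
add_link_spam(webgraph, ["farm_%d" % i for i in range(100)], "page_b")
add_text_spam(contents, "page_b", ["rare", "stuffed", "keywords"])
```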
To see the correspondence more clearly, we will
examine some of the propagandistic techniques that
have been used successfully by spammers: The tech-
nique of testimonials effectively adds a link between
previously unrelated nodes. Glittering generalities
change the contents of a node, effectively changing
its perceived relevance. Mislabeled anchor text is an
example of card stacking. And the technique of band-
wagon creates many links between a group of nodes,
a “link farm”. So, we define web spam based on the
spammers’ actions:
Web Spam is the attempt to modify the web (its
structure and contents), and thus influence search en-
gine results in ways beneficial to web spammers.
Table 1 has the correspondence, in graph theoretic
terms, between the web graph according to a search
engine and the trust social network of a particular per-
son. Web pages or sites correspond to social entities
and hyperlinks correspond to trust opinions. The rank
that a search engine assigns to a page or a site cor-
responds to the reputation a social entity has for the
person. This rank is based on some ranking formula
that a search engine is computing, while the reputa-
tion is based on idiosyncratic components associated
with the person’s past experiences and selective appli-
cation of critical thinking skills; both are secret and
changing.
This correspondence is more than a coincidence.
The web itself is a social creation, and both PageR-
ank and HITS are socially inspired ranking formulas.
(Brin and Page, 1998; Kleinberg, 1999; Raghavan,
2002). Socially inspired systems are subject to so-
cially inspired attacks. Not surprisingly then, the the-
ory of propaganda detection can provide intuition into
the dynamics of the web graph.
PageRank is based on the assumption that the rep-
utation of an entity (a web page in this case) can be
measured as a function of both the number and repu-
tation of other entities linking to it. A link to a web
page is counted as a “vote of confidence” to this web
site, and in turn, the reputation of a page is divided
among those it is recommending. The implicit assump-
tion is that hyperlink “voting” is taking place inde-
pendently, without prior agreement or central control.
Spammers, like social propagandists, form structures
that are able to gather a large number of such “votes
of confidence” by design, thus breaking the crucial as-
sumption of independence in a hyperlink. But while
the weights in the web graph are assigned by each
search engine, the weights in the trust social network
are assigned by each person. Since there are many
more persons than search engines, the task of a web
spammer is far easier than the task of a propagandist.
5 SEARCH ENGINE EVOLUTION
In the early 90’s, when the web numbered just a few
million servers, the first generation search engines
Table 1: Graph theoretic correspondence between the Webgraph and the Trust Social Network. There is a one-to-one corre-
spondence between each component of the two graphs. A major difference, however, is that, even though a person may feel
negative trust (distrust) for some entity, there is no negative weight for hyperlinks.
Graph Theory         | Web Graph                          | Trust Social Network
Node                 | web page or site                   | social entity
  weight             | rank (accord. to a search engine)  | reputation (accord. to a person)
  weight computation | ranking formula (e.g., pagerank)   | idiosyncratic (e.g., 2 recommenders)
                     | computed continuously              | computed on demand
Arc                  | hyperlink                          | trust opinion
  semantics          | “vote of confidence”               | “recommendation”
  weight             | degree of confidence               | degree of entrustment
  weight range       | [0 . . . 1]                        | [distrust . . . trust]
were ranking search results using the vector model of
classic information retrieval techniques: the more rare
words two documents share, the more similar they are
considered to be (Salton, 1972; Henzinger, 2001).
According to the vector model in Information Re-
trieval (Salton, 1972), documents contained in a doc-
ument collection D are viewed as vectors in term
space T. Each document vector is composed of term
weights $w_{ik}$ of term $T_k$ appearing in document $D_i$.
These weights are computed as the normalized dot
product of $tf_{ik} \cdot idf_k$, where $tf_{ik}$ is the frequency of
term $T_k$ in document $D_i$, and $idf_k$ is the inverse docu-
ment frequency of term $T_k$ in document collection D.
Typically, $idf_k$ is computed by a logarithmic formula
so that this term will not grow significantly as the
number of occurrences of $T_k$ increases. Under this for-
mulation, rare words have greater weight than com-
mon words, because they are viewed as better repre-
senting the document contents. The term weights are
then normalized to fall on a unit sphere so that longer
documents will not have an advantage over shorter
documents:

$$w_{ik} = \frac{tf_{ik} \cdot idf_k}{\sqrt{\sum_{1 \le k \le t} (tf_{ik})^2 \, (idf_k)^2}}$$
In the vector model, document similarity $sim(D_1, D_2)$
between document vectors $D_1$ and $D_2$ is represented
by the angle between them, and is computed as the
cosine-normalized dot product $\sum_{1 \le i \le t} w_{1i} \cdot w_{2i}$:

$$sim(D_1, D_2) = \frac{\sum_{1 \le i \le t} w_{1i} \cdot w_{2i}}{\sqrt{\sum_{1 \le i \le t} (w_{1i})^2 \cdot \sum_{1 \le i \le t} (w_{2i})^2}}$$

A search query Q is considered simply a short docu-
ment and the results of a search for Q are ranked ac-
cording to their (normalized) similarity to the query.
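To make the computation above concrete, here is a short Python sketch of tf-idf weighting and cosine-similarity ranking. The particular logarithmic idf variant and the toy collection are our own assumptions, since the engines kept their exact term-weight formulas secret:

```python
import math
from collections import Counter

def build_index(docs):
    """Tokenize the collection D and precompute idf_k (one common
    logarithmic form; the exact formula varied by engine)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: 1.0 + math.log(n / df[t]) for t in df}
    return tokenized, idf

def weight_vector(tokens, idf):
    """w_ik = tf_ik * idf_k, normalized to the unit sphere so that
    longer documents gain no advantage."""
    tf = Counter(tokens)
    w = {t: tf[t] * idf[t] for t in tf if t in idf}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

def sim(v1, v2):
    """Cosine similarity of two unit vectors is their dot product."""
    return sum(wt * v2.get(t, 0.0) for t, wt in v1.items())

# A query Q is treated as a short document; documents are ranked by
# their (normalized) similarity to it.
docs = ["rare terms describe a document well",
        "common terms describe very little"]
tokenized, idf = build_index(docs)
vectors = [weight_vector(toks, idf) for toks in tokenized]
query = weight_vector("rare terms".split(), idf)
ranking = sorted(range(len(docs)), key=lambda i: -sim(vectors[i], query))
```

Note how the rare term dominates the score: this is exactly the property that keyword stuffing, discussed below, exploits.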
While the exact details of the computation of term
weights were kept secret, we can say that the rank-
ing formula $R_{G1}$ in the first generation search engines
was based on the following principle: the more rare
keywords a document shares with a query, the higher
similarity it has with it, resulting in a higher ranking
score for this document:

$$R_{G1} = f(sim(p, Q)) \quad (1)$$
The first attack on this ranking came from within
the search engines. In 1996, search engines started
openly selling search keywords to advertisers (CNET-
News, 1996) as a way of generating revenue: If a
search query contained a “sold” keyword, the results
would include targeted advertisement and a higher
ranking for the link to the sponsor’s web site.
Mixing search results with paid advertisement
raised serious ethical questions, but also showed the
way to financial profits to spammers who started their
own attacks using keyword stuffing, i.e., by creat-
ing pages containing many rare keywords to obtain a
higher ranking score. In terms of propaganda theory,
the spammers employed a variation of the technique
of glittering generalities to confuse the first genera-
tion search engines (Lee and Lee(eds.), 1939, pg. 47):
The propagandist associates one or more sugges-
tive words without evidence to alter the conceived
value of a person or idea.
In an effort to nullify the effects of glittering gen-
eralities, second generation search engines started
employing additionally more sophisticated ranking
techniques. One of the more successful techniques
was based on the “link voting principle”: Each web
site s has value equal to its “popularity” $|B_s|$, which is
influenced by the set $B_s$ of sites pointing to s.
Therefore, the more sites were linking to a site
s, the higher the popularity of s’s pages. Lycos be-
came the champion of this ranking technique (Mauld-
ing, 1997) and had its own popularity skyrocket in
late 1996. Doing so, it was also distancing itself from
the ethical questions introduced by blurring advertis-
ing with ranking (CNETNews, 1996).
The ranking formula $R_{G2}$ in the second generation
search engines was a combination of a page’s similar-
ity, $sim(p, Q)$, and its site’s popularity $|B_s|$:

$$R_{G2} = f(sim(p, Q), |B_s|) \quad (2)$$
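Equation (2) leaves f unspecified, since each engine kept its combination secret. As a purely illustrative sketch, assuming the adjacency-list webgraph of the earlier sketch, a second-generation ranker might blend similarity with a damped popularity term; the linear blend, the log damping and alpha are our choices, not the engines’:

```python
import math

def in_degree(graph, site):
    """|B_s|: the number of sites linking to site s (graph is an
    adjacency list: node -> set of nodes it links to)."""
    return sum(1 for outs in graph.values() if site in outs)

def rank_g2(similarity, graph, site, alpha=0.7):
    """R_G2 = f(sim(p, Q), |B_s|), with f an assumed linear blend."""
    return alpha * similarity + (1 - alpha) * math.log(1 + in_degree(graph, site))
```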
To avoid spammers, search engines would keep their
exact ranking algorithms secret. Secrecy is no de-
fense, however, since secret rules were figured out
by experimentation and reverse engineering (e.g.,
(Pringle et al., 1998; Marchiori, 1997)).
Unfortunately, this ranking formula did not suc-
ceed in stopping spammers either. Spammers started
creating clusters of interconnected web sites that had
identical or similar contents with the site they were
promoting, a technique that subsequently became
known as link farms. The link voting principle was
socially inspired, so spammers used the well known
propagandistic method of bandwagon to circumvent
it (Lee and Lee(eds.), 1939, pg. 105):
With it, the propagandist attempts to convince us
that all members of a group to which we belong are
accepting his program and that we must therefore fol-
low our crowd and “jump on the band wagon”.
Similarly, the spammer is promoting the impres-
sion of a high degree of popularity by inter-linking
many internally controlled sites that will eventually
all share high ranking.
PageRank and HITS marked the development of
the third generation search engines. The introduc-
tion of PageRank in 1998 (Brin and Page, 1998) was
a major event for search engines, because it seemed to
provide a more sophisticated anti-spamming solution.
Under PageRank, not every link contributes equally to
the “reputation” of a page PR(p). Instead, links from
highly reputable pages contribute much higher value
than links from other sites. A page p has reputation
PR(p) which is calculated as the sum of fractions of
the reputations of the set $B_p$ of pages pointing to p.
Let $F_v$ be the set of links out of page v, $v \in B_p$. The
reputation of a page is

$$PR(p) = \frac{1 - t}{N} + t \sum_{v \in B_p} \frac{PR(v)}{|F_v|}$$
where t is the so-called “transport” factor and N is the
total number of pages in the collection. That way, the
link farms developed by spammers would not much
influence their PageRank, and Google became the
search engine of choice. HITS is another socially-
inspired ranking which has also received a lot of at-
tention (Kleinberg, 1999) and is reportedly used by
the AskJeeves search engine. The HITS algorithm di-
vides the sites related to a query into “hubs” and
“authorities”. Hubs are sites that contain many links
to authorities, while authorities are sites pointed to by
the hubs and they both gain reputation.
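The PageRank formula above can be evaluated by the standard power-iteration method. The following is a minimal sketch; the value t = 0.85 and the even redistribution for dangling pages are common conventions we assume here, not details the paper fixes:

```python
def pagerank(graph, t=0.85, iterations=50):
    """graph: node -> set of nodes it links to.
    Returns PR(p) = (1 - t)/N + t * sum_{v in B_p} PR(v)/|F_v|."""
    nodes = set(graph) | {q for outs in graph.values() for q in outs}
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        nxt = {p: (1.0 - t) / n for p in nodes}
        for v in nodes:
            outs = graph.get(v, set())
            # Dangling page: spread its reputation evenly (one common
            # convention, assumed here).
            share_to = outs if outs else nodes
            for p in share_to:
                nxt[p] += t * pr[v] / len(share_to)
        pr = nxt
    return pr
```

On such a graph, a farm of interlinked low-reputation pages contributes little to PR(p), which is why the formula initially appeared spam-resistant.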
Unfortunately, spammers again found ways of cir-
cumventing these rankings. In PageRank, a page
enjoys absolute reputation: its reputation is not re-
stricted to some particular issue. Spammers deploy
sites with expertise on irrelevant subjects, and they
acquire (justifiably) high ranking on their expert sites.
Then they bandwagon the irrelevant expert sites, cre-
ating what we call a mutual admiration society. In
propagandistic terms, this is the technique of testimo-
nials (Lee and Lee(eds.), 1939, pg. 74) often used by
advertisers:
Well known people (entertainers, public figures,
etc.) offer their opinion on issues about which they
are not experts.
Spammers were so aggressive in pursuing this
technique that they openly promoted “reciprocal
links”: Web masters controlling sites that had some
minimum PageRank were invited to join a mutual
admiration society by exchanging links, so that in the
end everyone’s PageRank would increase. HITS has
also been shown to be highly spammable by this
technique, due to the fact that its effectiveness depends
on the accuracy of the initial neighborhood calculation.
Another heuristic that third generation search en-
gines used was that of exploiting “anchor text”. It had
been observed that users creating links to web pages
would come to use, in general, meaningful descrip-
tions of the contents of a page. (Initially, the anchor
text was non-descriptive, such as “click here”, but this
changed in the late 1990’s.) Google was the first
engine to exploit this fact noting that, even though
IBM’s web page made no mention that IBM is a com-
puter company, many users linked to it with anchor
text such as “computer manufacturer”.
Spammers were quick to exploit this feature too.
In early 2001, a group of activists started using the
anchor text “miserable failure” to link to the official
White House page of American President George W.
Bush. Using what became known as “Googlebomb”
or, more accurately, link-bomb since it does not per-
tain to Google only, other activists linked the same
anchor text to President Carter, filmmaker Michael
Moore and Senator Hillary Clinton.
Using the anchor text is socially inspired, so
spammers used the propagandistic method of card
stacking to circumvent it (Lee and Lee(eds.), 1939,
pg. 95):
Card stacking involves the selection and use of
facts or falsehoods, illustrations or distractions, and
logical or illogical statements in order to give the best
or the worst possible case for an idea, program, per-
son or product.
The ranking formula $R_{G3}$ in the third generation
search engines is, therefore, some secret combination
of a number of features, primarily the page’s simi-
larity, $sim(p, Q)$, its site’s popularity $|B_s|$, and the
page’s reputation $PR(p)$:

$$R_{G3} = f(sim(p, Q), |B_s|, PR(p)) \quad (3)$$
Table 2: Changes in ranking by generations of search en-
gines, the response of the web spammers and the corre-
sponding propagandistic techniques.
S.E.’s  | Ranking           | Spamming                      | Propaganda
1st Gen | Doc Similarity    | keyword stuffing              | glittering generalities
2nd Gen | + Site popularity | + link farms                  | + bandwagon
3rd Gen | + Page reputation | + mutual admiration societies | + testimonials
        | + anchor text     | + link bombs                  | + card stacking
Search engines these days claim to have devel-
oped hundreds of little heuristics for improving their
web search results (Hansell, 2007) but no big idea that
would move their rankings beyond the grasp of spam-
mers. As Table 2 summarizes, for every idea that
search engines have used to improve their ranking,
spammers have quickly managed to counter it with
techniques that resemble propagandistic techniques
from society. Web search corporations are reportedly
busy developing the engines of the next generation
(Broder, 2002). The new techniques aim to be able
to recognize “the need behind the query” of the user.
Given the success the spammers have enjoyed so far,
one wonders how they will spam the fourth genera-
tion engines. Is it possible to create a ranking that is
not spammable? Put another way, can the web as a
social space be free of propaganda?
This may not be possible. Our analysis shows that
we are trying to create in cyberspace what societies
have not succeeded in creating in their real space.
However, we can learn to live in a web with spam as
we live in society with propaganda, given appropriate
education and technology. We touch upon it in our
concluding section.
6 CONCLUSIONS
In this paper we have argued that web spam is to cy-
berworld what propaganda is to society. As evidence
of the importance of this analogy, we have shown
that the evolution of search engines can be largely un-
derstood as the search engines’ responses in defend-
ing against spam. We do not suggest here that web
spam is the sole force behind the evolution of search
engines, but that it is a dominant one. New search
engines are developed when researchers believe they
have a good answer to spam because it directly affects
the quality of the search results.
Further, our findings suggest that anti-spamming
techniques can be developed by mimicking anti-
propagandistic methods. One way to do that is, of
course, discrediting spam networks whenever they are
recognized. Search engines are doing a fair amount
of this (Totty and Mangalindan, 2003; Fetterly et al.,
2004). A more effective way is personalizing the web
graph a user sees, effectively increasing the task dif-
ficulty of a spammer to the level of a propagandist:
As we mentioned, a spammer has an easier job than a
propagandist because he/she has to influence the web
graphs of a few search engines instead of the trust
graphs of millions of individuals. Yet another way is
to propagate distrust across a spamming network
whenever one of its members is recognized.
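We do not prescribe a mechanism for this propagation here. As one hedged illustration, assuming the adjacency-list webgraph of the earlier sketches (and a rank dictionary covering its nodes), distrust could be pushed backwards along incoming links, discounting the rank of a recognized spam node’s recommenders; the decay factor and depth are illustrative assumptions:

```python
def propagate_distrust(graph, rank, spam_node, decay=0.5, depth=2):
    """Discount the rank of pages that recommend (link to) a page
    recognized as spam, with a diminishing penalty at each step back."""
    frontier, penalty = {spam_node}, decay
    for _ in range(depth):
        # Pages whose out-links recommend something in the frontier.
        frontier = {v for v, outs in graph.items() if outs & frontier}
        for v in frontier:
            rank[v] *= (1.0 - penalty)
        penalty *= decay
    return rank
```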
But what should one do once one recognizes a
spamming network? This is a question that has not at-
tracted the necessary attention in the past. The default
approach is that a search engine would delete such
networks from its indices (Fetterly et al., 2004) or
might downgrade them by some prespecified amount
(Gyöngyi et al., 2004).
There are clearly cases where these approaches
are appropriate and effective. But in general, both
of these approaches require a universal agreement of
what constitutes spam. Such an agreement cannot ex-
ist; one person’s spam may be another person’s trea-
sure. Should the search engines determine what is
trustworthy and what is not? Willing or not, they
are the de facto arbiters of what information users
see (Totty and Mangalindan, 2003). As in a popu-
lar cartoon by Ohman & Willis, a kid responds to
the old man who has been searching his entire life
for the meaning of life: “[...] if it’s not on Google,
you probably won’t find it.” We believe that it is the
users’ right and responsibility to decide what is ac-
ceptable for them. Their browser, their window to cy-
berworld, should enhance their ability to make this
decision. User education is fundamental: without it,
people will largely trust what they see, regardless of its
credibility. People should know how search engines
work and why, and how information appears on the
web.
But they should also have a trained browser that
can help them determine the validity and trustworthi-
ness of information.
REFERENCES
Benczúr, A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
Spam Rank – Fully automatic link spam detection. In
Proceedings of the AIRWeb Workshop.
Bharat, K., Chang, B.-W., Henzinger, M. R., and Ruhl, M.
(2001). Who links to whom: Mining linkage between
web sites. In Proceedings of the 2001 IEEE Inter-
national Conference on Data Mining, pages 51–58.
IEEE Computer Society.
Bianchini, M., Gori, M., and Scarselli, F. (2003). PageRank
and web communities. In Web Intelligence Confer-
ence 2003.
Brin, S. and Page, L. (1998). The anatomy of a large-scale
hypertextual Web search engine. Computer Networks
and ISDN Systems, 30(1–7):107–117.
Broder, A. (2002). A taxonomy of web search. SIGIR Fo-
rum, 36(2):3–10.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Ra-
jagopalan, S., Stata, R., Tomkins, A., and Wiener, J.
(2000). Graph structure in the web. Comput. Net-
works, 33(1-6):309–320.
Cho, J. and Roy, S. (2004). Impact of search engines on
page popularity. In WWW 2004.
CNETNews (1996). Engine sells results, draws fire.
http://news.cnet.com/2100-1023-215491.html.
Corey, T. S. (2001). Catching on-line traders in a
web of lies: The perils of internet stock fraud.
Ford Marrin Esposito, Witmeyer & Glesser, LLP.
http://www.fmew.com/archive/lies/.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam,
damn spam, and statistics. In WebDB2004.
Fetterly, D., Manasse, M., Najork, M., and Wiener, J.
(2003). A large-scale study of the evolution of web
pages. In Proceedings of the twelfth international con-
ference on World Wide Web, pages 669–678. ACM
Press.
Flake, G. W., Lawrence, S., Giles, C. L., and Coetzee, F.
(2002). Self-organization of the web and identification
of communities. IEEE Computer, 35(3):66–71.
Graham, L. and Metaxas, P. T. (2003). “Of course it’s true;
I saw it on the internet!”: Critical thinking in the inter-
net era. Commun. ACM, 46(5):70–75.
Gyöngyi, Z. and Garcia-Molina, H. (2005). Web spam tax-
onomy. In Proceedings of the AIRWeb Workshop.
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004).
Combating web spam with TrustRank. In VLDB 2004.
Hansell, S. (2007). Google keeps tweaking its search en-
gine. New York Times.
Henzinger, M. R. (2001). Hyperlink analysis for the web.
IEEE Internet Computing, 5(1):45–50.
Henzinger, M. R., Motwani, R., and Silverstein, C. (2002).
Challenges in web search engines. SIGIR Forum,
36(2):11–22.
Hindman, M., Tsioutsiouliklis, K., and Johnson, J. (2003).
Googlearchy: How a few heavily-linked sites domi-
nate politics on the web. In Annual Meeting of the
Midwest Political Science Association.
Introna, L. and Nissenbaum, H. (2000). Defining the web:
The politics of search engines. Computer, 33(1):54–
62.
Kleinberg, J. M. (1999). Authoritative sources in a hyper-
linked environment. Journal of the ACM, 46(5):604
632.
Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins,
A. (1999). Trawling the Web for emerging cyber-
communities. Computer Networks (Amsterdam,
Netherlands: 1999), 31(11–16):1481–1493.
Lee, A. M. and Lee(eds.), E. B. (1939). The Fine Art of
Propaganda. The Institute for Propaganda Analysis.
Harcourt, Brace and Co.
Lynch, C. A. (2001). When documents deceive: trust and
provenance as new factors for information retrieval in
a tangled web. J. Am. Soc. Inf. Sci. Technol., 52(1):12–
17.
Marchiori, M. (1997). The quest for correct information on
the web: hyper search engines. Comput. Netw. ISDN
Syst., 29(8-13):1225–1235.
Maulding, M. L. (1997). Lycos: Design choices in
an internet search service. IEEE Expert, January-
February(12):8–11.
Pringle, G., Allison, L., and Dowe, D. L. (1998). What is a
tall poppy among web pages? In Proceedings of the
seventh international conference on World Wide Web
7, pages 369–377. Elsevier Science Publishers B. V.
Raghavan, P. (2002). Social networks: From the web to the
enterprise. IEEE Internet Computing, 6(1):91–94.
Salton, G. (1972). Dynamic document processing. Com-
mun. ACM, 15(7):658–668.
Silverstein, C., Marais, H., Henzinger, M., and Moricz, M.
(1999). Analysis of a very large web search engine
query log. SIGIR Forum, 33(1):6–12.
Totty, M. and Mangalindan, M. (2003). As google becomes
web’s gatekeeper, sites fight to get in. In Wall Street
Journal CCXLI(39).
Vedder, A. (2000). Medical data, new information technolo-
gies and the need for normative principles other than
privacy rules. In Law and Medicine. M. Freeman and
A. Lewis (Eds.), (Series Current Legal Issues), pages
441–459. Oxford University Press.
Wasserman, S. and Faust, K. (1994). Social Network Analy-
sis: Methods and Applications. Cambridge University
Press.
Wu, B. and Davison, B. (2005). Identifying link farm spam
pages. In WWW 2005.