Are Related Links Effective for Contextual Advertising?

A Preliminary Study

Giuliano Armano, Alessandro Giuliani and Eloisa Vargiu

Dept. of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, I09123, Cagliari, Italy

Barcelona Digital Technological Center, Carrer Roc Boronat, 117, E08018, Barcelona, Spain

Keywords:

Collaborative Filtering, Contextual Advertising, Related Links.

Abstract:

Classical contextual advertising systems suggest suitable ads for a given webpage by analyzing only its content, without relying on further information. We claim that adding information extracted from semantically related pages can improve overall performance. To this end, this paper presents an experimental study aimed at verifying to what extent the analysis of related links, i.e., inlinks and outlinks, can help contextual advertising. Experiments have been performed on about 15000 webpages extracted from DMoz. Results show that the adoption of related links significantly improves the performance of the adopted baseline system.

1 INTRODUCTION

Contextual Advertising (CA), also called Content Match, is aimed at suggesting to a webpage ads that are related to its content. State-of-the-art CA systems infer the context of a given webpage p by analyzing its content, without relying on any further information. CA is the economic engine behind a large number of non-transactional sites on the Web. A main factor for success in CA is the relevance of the ads to the surrounding scenario. As a CA task can also be viewed as a recommendation task (Armano and Vargiu, 2010), we claim that CA systems can be improved by using collaborative filtering through the extraction of suitable information from semantically related links.

Marchiori (Marchiori, 1997) states that ‘The power of the Web resides in its capability of redirecting the information flow via hyperlinks, so it should appear natural that in order to evaluate the information content of a Web object, the Web structure has to be carefully analyzed’. The benefits of link information for information retrieval have been well researched (Koolen and Kamps, 2011). Link-based ranking algorithms rely on the implicit assumption that linked documents tend to be related to each other and, therefore, that link information is potentially useful for retrieval and filtering (Cohen and Kjeldsen, 1987) (Kleinberg, 1999) (Lempel and Moran, 2001) (Shakery and Zhai, 2006).

In this paper, we present an experimental study aimed at investigating whether related links are effective for CA. To the best of our knowledge, this is the first attempt to assess the effectiveness of semantically related links in the field of CA. We consider as related links of a webpage p: its inlinks (also called backlinks), i.e., pages that link to p; and its outlinks (also called inbound or outbound links, depending on the domain they point to), i.e., pages that are linked by p. The motivation for relying on related links is that if a page q links to a page p then, at least in principle, the topics of q are related to the topics of p (Koolen and Kamps, 2011). To assess whether related links are useful for improving CA systems, we developed a suitable CA system and performed several experiments on it. Results show that adopting related links significantly improves the performance of the adopted CA system.

The rest of the paper is organized as follows: Section 2 illustrates the approach adopted to assess whether related links are effective for CA. In Section 3, we report and discuss the experiments and their results. In Section 4, related work is recalled. Section 5 ends the paper with conclusions and future work.

2 METHODOLOGY

Given a webpage p, a CA system analyzes it in order to suggest suitable ads. As sketched in Figure 1, a CA task typically involves three phases: text summarization, classification, and matching. Text summarization is aimed at generating a short representation of p with a minor loss of information. This representation is typically given in terms of a Bag of Words (BoW), in which each term is typically weighted by TFIDF (Salton and McGill, 1984). Classification is devoted to alleviating possible harmful effects of summarization. To this end, both page excerpts and ads are classified according to a relevant set of categories, usually organized in a taxonomy, giving as output the so-called Classification-based Features (CF). Matching is devoted to suggesting ads for p according to a similarity score based on both BoW and CF. Let us note that this model is compliant with most state-of-the-art systems, including those proposed in (Broder et al., 2007) (Anagnostopoulos et al., 2007) (Armano et al., 2011).

Figure 1: A generic approach to CA.

Armano G., Giuliani A. and Vargiu E.

Are Related Links Effective for Contextual Advertising? - A Preliminary Study.

DOI: 10.5220/0004135802210226

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2012), pages 221-226

ISBN: 978-989-8565-29-7

© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

To study the role of related links in CA, we devised the model depicted in Figure 2. The proposed model embodies four modules: Related Link Extractor, Text Summarizer, Classifier, and Matcher. Notably, the model is compliant with that reported in Figure 1 when only CF are considered in the matching phase. In particular, the two models coincide in everything except the Related Link Extractor.

Figure 2: The model of the adopted approach.

Let us recall that by “related links” of a webpage p we denote both inlinks and outlinks. Figure 3 gives a view of two different kinds of related links: those that involve an external domain (i.e., from A and to B in the figure) and those that stay within the same domain as the target webpage (i.e., from T1 and to T2 in the figure). Without loss of generality, in this work we consider only links belonging to different domains; in other words, we intentionally disregard inlinks that come from the same Web domain and inbound links, i.e., outlinks that point to the same domain (e.g., in the example reported in Figure 3, we do not consider T1 and T2).

Figure 3: A graphical view of the adopted related links.

Related Link Extractor. This module extracts the related links of a given webpage p, i.e., it collects the set of inlinks and the set of outlinks of the webpage. The former is built by querying a dedicated service of Yahoo! Search (http://siteexplorer.search.yahoo.com) with the URL of the page, in order to ask for its inlinks. The latter is built by parsing p, looking for the anchor tag <a>. The module selects the URLs of the first 10 inlinks and the first 10 outlinks, if available. For the sake of experimental reproducibility, we decided to consider the first 10 outlinks rather than a random selection.
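The outlink half of this module can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_outlinks`, the blocklist `GENERAL_PURPOSE`, the host-based notion of "same domain", and the removal of duplicate URLs are all assumptions made for the example. Inlinks are not covered, since they come from an external search service.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical blocklist of "general-purpose" sites; the paper only names
# Facebook, Twitter and Google+ as examples and gives no full list.
GENERAL_PURPOSE = {"facebook.com", "twitter.com", "plus.google.com"}


def _host(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


class _AnchorCollector(HTMLParser):
    """Collect the href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_outlinks(html, page_url, limit=10):
    """Return the first `limit` outlinks of a page that leave its domain."""
    collector = _AnchorCollector()
    collector.feed(html)
    page_host = _host(page_url)
    outlinks = []
    for href in collector.hrefs:
        url = urljoin(page_url, href)  # resolve relative links
        if urlparse(url).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        host = _host(url)
        if host == page_host or host in GENERAL_PURPOSE:
            continue  # same-domain or general-purpose link
        if url not in outlinks:  # assumption: duplicates are counted once
            outlinks.append(url)
        if len(outlinks) == limit:
            break
    return outlinks
```

Taking the links in document order, rather than sampling them, mirrors the reproducibility argument made in the text.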

Text Summarizer. Instead of considering the whole content of pages, we first extract the snippets of, respectively, p, its inlinks, and its outlinks by querying the Yahoo! search engine (http://www.yahoo.com). The content of each query is the URL of the link under analysis. The module then removes stop-words and stems each term with Porter’s algorithm (Porter, 1980). The output is a vector representation of the original text as a BoW, each word being weighted by TFIDF (the TFIDF score is computed over a training collection of webpage snippets).
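The BoW construction can be illustrated with a small sketch. Several choices here are assumptions, not details from the paper: the stop-word list is a toy one, stemming is left out entirely (a Porter stemmer would be applied to each token in `tokenize`), and the IDF variant is the plain log(N / df).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper does not give its list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "on"}


def tokenize(text):
    """Lower-case, split on non-letters and drop stop-words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def document_frequencies(training_snippets):
    """Count in how many training snippets each term occurs."""
    df = Counter()
    for snippet in training_snippets:
        df.update(set(tokenize(snippet)))
    return df


def tfidf_vector(snippet, df, n_docs):
    """Return a {term: tf * idf} bag of words for one snippet."""
    tf = Counter(tokenize(snippet))
    return {
        term: count * math.log(n_docs / df[term])
        for term, count in tf.items()
        if df[term] > 0  # terms unseen in the training collection are dropped
    }
```

Document frequencies are computed once over the training snippets and then reused for every new snippet, which matches the statement that TFIDF is computed over a training collection.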

Notes on the Related Link Extractor: (i) the Yahoo! service used to retrieve inlinks is currently merged into Bing Webmaster Tools (http://www.bing.com/toolbox/webmaster); (ii) to minimize the impact of “general-purpose” websites, we filter links such as Facebook, Twitter, Google+, and so on.

Classifier. To infer the topics related to each inlink, outlink, or page at hand, snippets are classified according to a given taxonomy. First, for each node of the taxonomy, we merge all its training documents into a single compound document. We then use it as a centroid for the Rocchio classifier (Rocchio, 1971) with only positive examples and no relevance feedback. Each centroid is defined as the sum of the TFIDF values of each term, normalized by the number of training documents in the class. Snippet classification is based on the cosine of the angle between the snippet and the centroid of the class. The classifier outputs the snippet-category matrix, whose generic element $w_{ij}$ reports the score given by the classifier for the similarity between the category j and the snippet i.
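A centroid-based classifier of this kind can be sketched in a few lines. The sketch assumes that snippets and training documents are already available as sparse TFIDF vectors stored in dictionaries; the function names and the category labels used in the usage note are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Sum the TFIDF vectors of a class and divide by the number of documents."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(snippets, centroids):
    """Return the snippet-category matrix as a list of {category: score} rows."""
    return [
        {category: cosine(snippet, c) for category, c in centroids.items()}
        for snippet in snippets
    ]
```

For example, with `centroids = {"camping": centroid([...]), "boating": centroid([...])}`, the call `classify([snippet_vector], centroids)` yields one row of scores per snippet and one entry per category.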

Matcher. This module is devoted to suggesting ads for the webpage, according to the given taxonomy. First, for each column of the snippet-category matrix, the matcher calculates the sum of scores as follows:

$$\sigma_j = \sum_{i=1}^{N} w_{ij} \qquad (1)$$

where N is the total number of extracted pages. Then, the Matcher selects k categories, i.e., those with the highest values of σ, k being dependent on the agreement between publisher and advertiser. Finally, for each selected category, an ad is randomly extracted from the Ads repository.
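The matching step reduces to a column sum followed by a top-k selection. The sketch below assumes the snippet-category matrix is a list of `{category: score}` rows and that the Ads repository is a dictionary from category to a list of ads; both representations, and all names, are illustrative.

```python
import random


def category_scores(w):
    """sigma_j: sum of w[i][j] over all rows i, as in Equation (1)."""
    sigma = {}
    for row in w:
        for category, score in row.items():
            sigma[category] = sigma.get(category, 0.0) + score
    return sigma


def select_categories(sigma, k):
    """Return the k categories with the highest sigma, best first."""
    return sorted(sigma, key=sigma.get, reverse=True)[:k]


def suggest_ads(w, ads_by_category, k, rng=random):
    """Pick one random ad for each of the top-k categories that has ads."""
    top = select_categories(category_scores(w), k)
    return [rng.choice(ads_by_category[c]) for c in top if ads_by_category.get(c)]
```

Passing a seeded `random.Random` instance as `rng` makes the random ad choice repeatable across runs.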

3 EXPERIMENTAL RESULTS

We propose an automatic evaluation algorithm, based on the categories associated with both the page p and the ad a (categories must belong to the adopted taxonomy). Given a page p and an ad a, the ⟨p, a⟩ pair is scored on a 1 to 3 scale, defined as follows:

1 - Relevant: a is semantically and directly related to the main subject of p (i.e., p and a belong to the same node of the taxonomy);

2 - Somewhat Relevant: (i) a is related to a topic similar to that of p (sibling nodes), (ii) a is related to the main topic of p in a more general way (generalization: the category of p is a child of the category of a), or (iii) a is related to the main topic of p in a too specific way (specification: the category of a is a child of the category of p);

3 - Irrelevant: a is unrelated to p.
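The three-level scale above can be computed directly from taxonomy positions. In this sketch a category is represented as a tuple holding its path from the root, which is an assumption about the data layout; the category names in the usage note are made up for illustration.

```python
def relevance(page_cat, ad_cat):
    """Score a (page, ad) pair on the 1-3 scale from their taxonomy paths.

    Categories are tuples such as ("Recreation", "Outdoors", "Camping").
    """
    if page_cat == ad_cat:
        return 1  # same node of the taxonomy
    # Siblings: different nodes at the same depth sharing the same parent.
    siblings = len(page_cat) == len(ad_cat) and page_cat[:-1] == ad_cat[:-1]
    # Generalization: the page category is a child of the ad category.
    generalization = page_cat[:-1] == ad_cat
    # Specification: the ad category is a child of the page category.
    specification = ad_cat[:-1] == page_cat
    return 2 if siblings or generalization or specification else 3
```

For instance, a page under `("Recreation", "Outdoors", "Camping")` and an ad under `("Recreation", "Outdoors", "Hiking")` are siblings and score 2, while an ad under `("Recreation", "Boating", "Kayaking")` scores 3.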

Note on the Ads repository: we assume that a repository of ads is available, in which company or service webpages are classified according to the given taxonomy.

According to the existing literature (e.g., (Broder et al., 2007)), we considered as True Positives (TP) ads scored as 1 or 2, and as False Positives (FP) ads scored as 3. Performance has been calculated in terms of precision at k (π@k) with k ∈ [1, 5] (i.e., the precision in suggesting k ads), as follows:

$$\pi@k = \frac{\sum_{i=1} \sum_{j=1}^{k} TP_{ij}}{\sum_{i=1} \sum_{j=1}^{k} (TP_{ij} + FP_{ij})} \qquad (2)$$

where, following the definition of π@k, j ranges over the k ads suggested for a page and i over the evaluated pages.
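Read as a micro-average over pages, Equation (2) can be sketched as below. The input layout, a list of per-page lists of 1-3 relevance scores in suggestion order, is an assumption made for the example.

```python
def precision_at_k(scores_per_page, k):
    """Precision over the first k ads of every page, pooled across pages.

    `scores_per_page[i][j]` is the 1-3 relevance score of the j-th ad
    suggested for the i-th page; scores 1 and 2 count as true positives,
    score 3 as a false positive.
    """
    tp = fp = 0
    for scores in scores_per_page:
        for score in scores[:k]:
            if score in (1, 2):
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if tp + fp else 0.0
```

With `scores = [[1, 3, 2], [2, 2, 3]]`, `precision_at_k(scores, 2)` pools the pairs (1, 3) and (2, 2), giving three true positives out of four suggestions.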

Since we rely on a graded relevance scale, to measure the effectiveness of the approach we adopt two further evaluation metrics: the Normalized Discounted Cumulative Gain (nDCG) and the Expected Reciprocal Rank (ERR). The former measures the usefulness, or gain, of the suggested ad categories based on their position in the result list (Järvelin and Kekäläinen, 2000). The latter is defined as the expected (inverse) rank at which the user will stop and click the associated ad. In our case it considers the inverse ranking of the first relevant category (Chapelle et al., 2009).
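Both metrics can be sketched under explicit assumptions, since the paper does not state how its 1-3 scale is turned into gains. Here score 1 maps to gain 2, score 2 to gain 1 and score 3 to gain 0; DCG uses the common log2(rank + 1) discount; the ideal ranking is the suggested list itself sorted by gain; and ERR uses the usual cascade-model stopping probability (2^g - 1) / 2^g_max. These are widespread formulations, not necessarily the exact variants used in the experiments.

```python
import math

# Assumed mapping from the paper's 1-3 scale to gains (not given in the paper).
GAIN = {1: 2, 2: 1, 3: 0}
MAX_GAIN = 2


def dcg(gains):
    """Discounted cumulative gain with the log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(scores):
    """DCG of the ranked list divided by the DCG of its best reordering."""
    gains = [GAIN[s] for s in scores]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


def err(scores):
    """Expected reciprocal rank under the cascade user model."""
    not_stopped = 1.0  # probability that the user has not stopped yet
    total = 0.0
    for rank, s in enumerate(scores, start=1):
        stop = (2 ** GAIN[s] - 1) / 2 ** MAX_GAIN
        total += not_stopped * stop / rank
        not_stopped *= 1.0 - stop
    return total
```

Because of these assumptions, values produced by the sketch are not expected to reproduce the figures reported in the paper.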

Experiments have been performed on a total of about 15000 webpages extracted from a subset of DMoz. In particular, we selected 18 categories, all belonging to the root Recreation. The taxonomy depth is 3. All the systems embed the same Classifier, which has first been trained by using about 100 webpages per class.

As for the ads to be suggested, we manually built a suitable repository in which ads are classified according to the given taxonomy. In that repository each ad is represented by the webpage of a product or service company (landing page). The ad repository contains 135 ads.

We performed experiments aimed at evaluating to what extent related links affect the performance of a CA system. Since a related link is either an inlink or an outlink, we also separately evaluated the contributions of inlinks and outlinks. Summarizing, experiments have been performed by taking into account:

• the webpage alone (P), i.e., the baseline system, compliant with most state-of-the-art systems;

• the set of inlinks alone (I);

• the set of outlinks alone (O);

• the webpage in conjunction with its inlinks (P+I);

• the webpage in conjunction with its outlinks (P+O);

• the set of related links (inlinks and outlinks) alone (RL);

• the webpage in conjunction with its related links (P+RL).

Figure 4 reports the precision of each system.

Figure 4: Precision at k.

Note on the dataset: DMoz is available at http://www.dmoz.org.

The first step was concerned with evaluating the performance of the baseline system (P). The second step was concerned with evaluating the contribution of inlinks. Figure 4 shows that P+I leads to an improvement with respect to both I and P. Notably, adding inlinks to the page leads to an increment of about 7.3%, whereas adding the page to the inlinks alone leads to an increment of about 0.6%.

We then evaluated the impact of outlinks. Figure 4 shows that P performs better than both O and P+O. Hence, the adoption of outlinks leads to a remarkable decrease in performance.

Finally, we evaluated the joint adoption of inlinks and outlinks (the related links). Figure 4 shows that the best performance is always obtained by P+RL. Notably, the introduction of related links with respect to P leads to an increment of about 9.2%, whereas the introduction of the page with respect to RL leads to an increment of about 1.4%.

Table 1 reports the performance of each system, in suggesting 5 ads, in terms of nDCG and ERR. It confirms the behavior of each system and highlights the contribution of related links, supporting the assumption that linked documents have related content.

Table 1: nDCG and ERR of each approach in suggesting 5 ads.

        P      I      O      RL     P+I    P+O    P+RL
nDCG    0.839  0.843  0.748  0.861  0.844  0.834  0.875
ERR     0.552  0.567  0.409  0.584  0.574  0.539  0.597
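As a worked reading of Table 1, the short sketch below computes the relative change of each system over the baseline P. The values are copied from the table; the helper names are illustrative. The percentages it yields concern nDCG and ERR only and are distinct from the precision increments discussed for Figure 4.

```python
# nDCG and ERR values copied from Table 1.
TABLE_1 = {
    "nDCG": {"P": 0.839, "I": 0.843, "O": 0.748, "RL": 0.861,
             "P+I": 0.844, "P+O": 0.834, "P+RL": 0.875},
    "ERR": {"P": 0.552, "I": 0.567, "O": 0.409, "RL": 0.584,
            "P+I": 0.574, "P+O": 0.539, "P+RL": 0.597},
}


def relative_change(metric, system, baseline="P"):
    """Relative change, in percent, of `system` over `baseline`."""
    values = TABLE_1[metric]
    return 100.0 * (values[system] - values[baseline]) / values[baseline]


def best_system(metric):
    """Name of the system with the highest value for `metric`."""
    values = TABLE_1[metric]
    return max(values, key=values.get)
```

Under both metrics `best_system` returns P+RL, while `relative_change` is negative for O and P+O, in line with the discussion of outlinks.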

Summarizing, the results clearly show that the adoption of related links increases performance. They also highlight that the main contribution is given by the introduction of inlinks. The difference between the adoption of inlinks and the adoption of outlinks could be due to the different number of examined links. In particular, for a given webpage, it is easy to find at least 10 inlinks, while it is less likely that the webpage contains at least the same number of outlinks. In fact, we analyzed the dataset in order to compute the number of outlinks per page, and we found that a webpage contains 3.6 outlinks on average.

4 RELATED WORK

4.1 Semantically Related Links

Many researchers have investigated the role of links in information retrieval; see, for example, (Marchiori, 1997). In particular, links have been used to (i) enhance document representation (Picard and Savoy, 2003), (ii) improve document ranking by propagating document scores (Frei and Stieger, 1995), (iii) provide an indicator of popularity (Brin and Page, 1998), and (iv) find hubs and authorities for a given topic (Chakrabarti et al., 1999).

It is worth noting that a key problem in this research field is how to measure the semantic relatedness of documents; see (Budanitsky and Hirst, 2006) for a survey. In this work, we only propose a preliminary study on the impact of related links in CA, without addressing this problem.

4.2 Contextual Advertising

CA solutions evolved from search advertising, where a search query is matched against the bid phrase of an ad. A natural extension of search advertising is extracting phrases from the target page and matching them with the bid phrases of ads. Yih et al. (Yih et al., 2006) proposed a system for phrase extraction to determine the importance of page phrases for advertising purposes. Since the repository of ads adopted in our work is composed of webpages of companies, we do not perform phrase extraction but rely only on extraction-based text summarization by using the snippets provided by Yahoo!.

Ribeiro-Neto et al. (Ribeiro-Neto et al., 2005) explored the use of different sections of ads as a basis for the vector, mapping both pages and ads into the same space. Since there is a discrepancy between the vocabulary used in the pages and in the ads (the so-called impedance mismatch), the authors improved the matching precision by expanding the page vocabulary with terms from similar pages. Following their work, we represent both pages and ads in a vector space and we consider the contribution of similar pages. Nevertheless, since our ads are actually webpages, we do not take into account the impedance mismatch.

Broder et al. (Broder et al., 2007) improved the performance of CA by classifying both pages and ads according to a given taxonomy and matching ads to the page falling into the same node of the taxonomy. Each node of the taxonomy is built as a set of bid phrases or queries corresponding to a certain topic. Following their results, and taking into account the effectiveness of adopting CF, we adopted the same Rocchio classifier.

Nowadays, ad networks need to deal in real time with a large amount of data, involving billions of pages and ads. Therefore, several constraints must be taken into account when building CA systems. In particular, efficiency and computational costs are crucial factors in the choice of methods and algorithms. In order to analyze the entire body of webpages on-the-fly, state-of-the-art systems use text summarization techniques (Anagnostopoulos et al., 2007) (Armano et al., 2011). Our choice for text summarization exploits the snippets provided by Yahoo!. The effectiveness of using snippets as a text summarization technique has been shown in (Armano et al., 2012).

Since bid phrases are basically search queries, another relevant approach is to view CA as a problem of query expansion and rewriting (Murdock et al., 2007) (Ciaramita et al., 2008). According to this view, we assess the performance of our approach by adopting π@k, a classical measure adopted in query search, which in this case represents the capability of the system to suggest k relevant ads to a webpage.

Since the task of suggesting an ad to a webpage can also be viewed as the task of recommending an item (the ad) to a user (the webpage), another perspective consists in addressing a CA problem as a recommendation task (Armano and Vargiu, 2010). According to this view, the system developed to perform the experiments can be thought of as a hybrid recommender system in which collaborative filtering is used in a content-based setting.

5 CONCLUSIONS AND FUTURE WORK

This paper was aimed at assessing whether related links can be effective for contextual advertising. To this end, we conducted a preliminary experimental study aimed at investigating the impact of related links in a generic contextual advertising system. Results clearly show that the adoption of related links increases performance. They also highlight that the main contribution is given by the introduction of inlinks. Hence, we can conclude that related links are effective for contextual advertising.

As for future work, we are implementing a system in which the matching is performed by taking into account the information of each ad rather than the ad category. Moreover, we are studying how to improve the proposed approach by also considering the title and the URL of each page, in conjunction with its snippet.

ACKNOWLEDGEMENTS

We wish to thank Massimo Puddu, Ugo Sirca and

Alessandro Zou for their support in developing the

system.

REFERENCES

Anagnostopoulos, A., Broder, A. Z., Gabrilovich, E., Josifovski, V., and Riedel, L. (2007). Just-in-time contextual advertising. In CIKM ’07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 331–340, New York, NY, USA. ACM.

Armano, G., Giuliani, A., and Vargiu, E. (2011). Studying the impact of text summarization on contextual advertising. In 8th International Workshop on Text-based Information Retrieval.

Armano, G., Giuliani, A., and Vargiu, E. (2012). Using snippets in text summarization: a comparative study and an application. In IIR’12: 3rd Italian Information Retrieval (IIR) Workshop.

Armano, G. and Vargiu, E. (2010). A unifying view of contextual advertising and recommender systems. In Proceedings of International Conference on Knowledge Discovery and Information Retrieval (KDIR 2010), pages 463–466.

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30:107–117.

Broder, A., Fontoura, M., Josifovski, V., and Riedel, L. (2007). A semantic approach to contextual advertising. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 559–566, New York, NY, USA. ACM.

Budanitsky, A. and Hirst, G. (2006). Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist., 32:13–47.

Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks (Amsterdam, Netherlands: 1999), 31(11–16):1623–1640.

Chapelle, O., Metzler, D., Zhang, Y., and Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 621–630, New York, NY, USA. ACM.

Ciaramita, M., Murdock, V., and Plachouras, V. (2008). Online learning from click data for sponsored search. In Proceeding of the 17th international conference on World Wide Web, WWW ’08, pages 227–236, New York, NY, USA. ACM.

Cohen, P. R. and Kjeldsen, R. (1987). Information retrieval by constrained spreading activation in semantic networks. Information Processing and Management, 23(4):255–268.

Frei, H. P. and Stieger, D. (1995). The use of semantic links in hypertext information retrieval. Inf. Process. Manage., 31:1–13.

Järvelin, K. and Kekäläinen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’00, pages 41–48, New York, NY, USA. ACM.

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of ACM, 46:604–632.

Koolen, M. and Kamps, J. (2011). Are semantically related links more effective for retrieval? In Proceedings of the 33rd European conference on Advances in information retrieval, ECIR’11, pages 92–103, Berlin, Heidelberg. Springer-Verlag.

Lempel, R. and Moran, S. (2001). SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems, 19:131–160.

Marchiori, M. (1997). The quest for correct information on the web: hyper search engines. Comput. Netw. ISDN Syst., 29:1225–1235.

Murdock, V., Ciaramita, M., and Plachouras, V. (2007). A noisy-channel approach to contextual advertising. In Proceedings of the 1st international workshop on Data mining and audience intelligence for advertising, ADKDD ’07, pages 21–27, New York, NY, USA. ACM.

Picard, J. and Savoy, J. (2003). Enhancing retrieval with hyperlinks: a general model based on propositional argumentation systems. J. Am. Soc. Inf. Sci. Technol., 54:347–355.

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Ribeiro-Neto, B., Cristo, M., Golgher, P. B., and Silva de Moura, E. (2005). Impedance coupling in content-targeted advertising. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 496–503, New York, NY, USA. ACM.

Rocchio, J. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing, chapter Relevance feedback in information retrieval, pages 313–323. Prentice-Hall.

Salton, G. and McGill, M. (1984). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.

Shakery, A. and Zhai, C. X. (2006). A probabilistic relevance propagation model for hypertext retrieval. In Proceedings of the 15th ACM international conference on Information and knowledge management, CIKM ’06, pages 550–558, New York, NY, USA. ACM.

Yih, W.-t., Goodman, J., and Carvalho, V. R. (2006). Finding advertising keywords on web pages. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 213–222, New York, NY, USA. ACM.

KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

226