Web Forums Change Analysis
Tomasz Kaczmarek and Dawid Grzegorz Węckowski
Department of Information Systems, Poznań University of Economics, Poznań, Poland
Keywords:
Web Crawler, Web Forum, Web Page Changes.
Abstract:
In this paper we present results from an experiment conducted on over 27 900 web pages gathered every 2 hours over 22 days from 16 forums (4256 independent crawls), to investigate how these web pages evolve over time. The results of the experiment became a basis for design choices for a focused incremental crawler that will be specialized in efficiently gathering documents from web forums while maintaining high freshness of the local collection of obtained pages. The data analysis shows that forums differ from generic web portals, and that identifying the places in the source navigational structure where new documents appear more often would make it possible to improve the crawler's performance and the collection freshness.
1 INTRODUCTION
This article reports on the results of an experiment conducted on 16 web forums crawled for web pages. The experiments were conducted to gather data concerning web page changes. On the one hand, we would like to validate the results obtained previously in similar experiments, but under slightly different conditions (more general web sites and less granular data). On the other hand, we are looking for clues about possible patterns present in web page change data.
Our main motivation for analysing the data is to improve on existing results in web crawler design, and in particular on approaches to scheduling web pages for revisitation in subsequent (or incremental) crawls.
The goal for crawling would be to capture new infor-
mation published on the forum as early as possible,
without overburdening the crawler infrastructure and
the target websites with unnecessary fetches.
2 RELATED WORK
The work related to ours concerns several areas that
have much in common: research on the Web structure
and its changes, methods of change detection, incre-
mental crawler design issues and websites navigation
issues including crawling specific types of websites
(focused crawling).
2.1 Web Structure and Changes
We base our work on seminal papers by Cho and Garcia-Molina (Cho and Garcia-Molina, 2000), where they made a similar investigation of web page evolution in order to design a better incremental crawler. They conducted their analysis on a large collection of web pages (over 500 000) and concluded that pages change on average every 10 days (with 1 day granularity of measurement), with significant differences between commercial and less business-oriented pages. Secondly, they found that web page changes follow the mathematical model of a Poisson process, which is used for sequences of random events happening independently at a fixed rate over time.
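For reference, under such a model with change rate λ (the expected number of changes per unit of time), the probability that a page changes at least once within a revisit interval of length t is given by the standard expression below; this is a textbook property of the Poisson process restated here for later reference, not a formula quoted from the cited paper:

    P(\text{at least one change in } [0, t]) = 1 - e^{-\lambda t}, \qquad E[\text{number of changes in } [0, t]] = \lambda t .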
The same authors continued their work to refine the mathematical model for estimating the frequency of change and to optimize the access times of an incremental crawler in (Cho and Garcia-Molina, 2003). Their experiments show that, using the right estimates, the performance of the web crawler can be significantly improved, resulting in more changes being detected.
Further work on exploiting knowledge of web page change characteristics focuses on optimisation of the crawler. The work of Baeza-Yates and colleagues (Baeza-Yates et al., 2005) discusses strategies for prioritizing web page downloads. They focus on strategies that take into account web page importance (calculated mainly using PageRank). They concluded that, using the right strategy, it is possible to find the high-quality pages sooner than with, for example, a breadth-first strategy.
Another analysis of web page change evolution
patterns was presented in (Adar et al., 2009b). The
analysis was conducted on a finer grain than the
whole-page comparison because structural elements
of web pages were analysed. Each of 55 000 URLs
sampled based on user visits in this study was crawled
on an hourly basis, which also allowed for a detailed analysis of the time of changes. The findings of this study
are that the rates of change were higher than in previ-
ously reported experiments (averages in the range of
tens of hours), with a large portion of pages chang-
ing more than hourly. Detailed content and structure
analyses identified stable and dynamic content within
each page. Using the same data set, the authors also reported in (Adar et al., 2009a) a resonance discovered between user behaviour patterns (frequency of visits or returns to a page) and page change patterns.
In (Ben Saad and Gançarski, 2011) the authors took another approach and tried to identify periodic patterns of change on the monitored web pages, in order for the crawler to be able to predict the exact time when a change is most likely to occur at a given URL. The research is based on the observation that, for some frequently updated web pages (updated several times per day), periodic activity can be detected. The approach was applied to web archiving to improve the quality of the web archive.
2.2 Methods of Change Detection
The method of change detection can be crucial for crawling web forums and adjusting the revisiting strategy to keep the local collection of web page copies up-to-date. One of the simplest solutions is to use a document digest to compare subsequent versions of a page. This method is efficient; however, the disadvantage of such an approach is that small and large changes are treated alike. The fact of modification can also be established from the Last-Modified HTTP header, although in the case of dynamically generated web pages this information can be unreliable. Several other solutions were developed for more robust change detection and description. One of the first solutions designed for comparing HTML documents is the HtmlDiff program, developed at AT&T Bell Laboratories (Douglis and Ball, 1996; Douglis et al., 1998). It takes advantage of the algorithm proposed by Hirschberg to solve the longest common subsequence (LCS) problem (Hirschberg, 1977). Rocco, Buttler and Liu (Rocco et al., 2003; Buttler et al., 2004) introduced a mechanism called Page Digest for storing and processing web documents. Page Digest introduces a clean separation of the content of a web page from the page structure, and can operate in linear time, with a 75% improvement compared to other solutions. An analysis of well-known document similarity metrics (byte-wise comparison, TF-IDF cosine distance, word distance, edit distance, and w-shingling) was also performed in (Kwon et al., 2006). On the structural level of HTML documents, (Adar et al., 2009b) presented algorithms for describing modifications of DOM tree elements as well as the persistence of structural blocks. Several recent studies make use of visual features of web pages to detect and measure changes (Saad and Gançarski, 2010; Law et al., 2012).
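As a minimal illustration of the simplest of the approaches listed above, the following Python sketch compares byte-wise digests of subsequent fetches; the function names and the treatment of the first fetch are our own illustrative assumptions, not part of any of the cited systems:

    import hashlib

    def page_digest(content: bytes) -> str:
        # Byte-wise digest: any change to the fetched content, however small,
        # yields a different value, so small and large changes are treated alike.
        return hashlib.sha256(content).hexdigest()

    def detect_change(previous_digest, content: bytes):
        # Returns (changed, new_digest) for a freshly fetched page body.
        new_digest = page_digest(content)
        if previous_digest is None:
            # The first successful fetch of a URL is treated as a change.
            return True, new_digest
        return new_digest != previous_digest, new_digest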
2.3 Incremental Crawler Design
The work by Cho and Garcia-Molina on web page evolution, as mentioned earlier (Cho and Garcia-Molina, 2000), allowed them to formulate certain clues for incremental crawler design. Significant differences in the rate of change between pages (a large number of them did not change over the whole 4-month period of the experiment) suggest that crawling should be guided by the rate of change of the web page, so as not to waste resources on pages that change less frequently. Secondly, the Poisson process model of web page changes suggests a strategy for incremental crawling which depends on estimating which pages are likely to change in a given time period, according to the process model and the history of changes, and scheduling them first for crawling.
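A sketch of how such a frequency-driven schedule could look in practice is given below; it relies on the simple estimate of the change rate (observed changes divided by observation time) rather than the refined estimator derived in (Cho and Garcia-Molina, 2003), and all names are illustrative assumptions:

    import math

    def change_rate(observed_changes: int, observation_hours: float) -> float:
        # Naive estimate of the change rate in changes per hour; Cho and
        # Garcia-Molina (2003) derive a less biased estimator, omitted here.
        return observed_changes / max(observation_hours, 1e-9)

    def change_probability(rate_per_hour: float, hours_since_visit: float) -> float:
        # Under the Poisson model: probability of at least one change since the last visit.
        return 1.0 - math.exp(-rate_per_hour * hours_since_visit)

    def revisit_order(pages):
        # pages: iterable of (url, rate_per_hour, hours_since_last_visit) tuples.
        # Returns URLs ordered by decreasing probability of having changed.
        scored = ((url, change_probability(rate, hours)) for url, rate, hours in pages)
        return [url for url, _ in sorted(scored, key=lambda item: item[1], reverse=True)]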
A number of other approaches to prioritizing pages in the crawl queue have been proposed that take into account factors other than the frequency of changes of a web page. As (Liu et al., 2011) reports, measures of page importance stemming from the web graph structure (such as in-degree or PageRank) were successfully evaluated. Web page topic, web site structure, or even direct mining of user interest from search engine logs were also proposed as factors to take into consideration when crawling the Web.
2.4 Focused Crawling
There have been several efforts aimed at developing methods and building web crawlers that focus on the retrieval of specific web pages.
A focused crawl for collecting mainly novel pages can be established with the use of the novelty measure proposed in (Toyoda and Kitsuregawa, 2006). It can be used to find emergent information within some topical range, e.g. a user's interest. A focused crawler can use the distribution of the novelty measure calculated for the current local collection of web
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
106
pages to adjust its revisitation strategy and explore the areas of the Web that most probably contain novelties.
(Baeza-Yates and Castillo, 2007) discussed the most common ways of defining the range of pages to be retrieved from the infinite Web. One can download purely static pages or use only one set of parameters in the URL; alternatively, pages can be collected until a specified total or per-domain amount is reached, or until a certain number of levels per web site is reached. The authors propose various models of random surfing on a generic, unbounded web site and show the main areas of focus for a web crawler by analyzing how deep users generally go into a web site.
Recently, many solutions have dealt with web page retrieval from web forums. (Cai et al., 2008) proposed an intelligent forum crawler, called iRobot, which is able to choose traversal paths for visiting different types of pages by analysing the content and structure of a forum web site. The crawler tries to reconstruct the forum's sitemap by preliminarily sampling web pages from the site and grouping them according to content layout characteristics. The sitemap allows iRobot to select an optimal path for web page retrieval, including only informative pages and discarding duplicates and invalid ones.
(Yang et al., 2009) studied incremental crawling of web forums. They showed that traditional approaches to crawling web pages may be insufficient for forum sites, which have different characteristics than general web sites and contain interrelationships between pages. Determining a revisiting strategy based on the change frequency of individual pages can be inefficient, because within web forums each list of threads or list of posts is split among several pages, implying that one change in a list can cause modification of all related pages. Thus, a list-wise crawling strategy was proposed: after reconstructing a forum's linking structure, posts that are spread across different pages but belong to the same thread can be concatenated, and a regression model of change can be applied to the whole thread.
FoCUS (Forum Crawler Under Supervision) was presented in (Jiang et al., 2012). The goal was to build a forum crawler that visits only relevant content with minimal overhead. The FoCUS design was based on the observation that, despite being generated by various software engines and having different styles and structures, web forums have similar, so-called implicit navigation paths, represented by specific URLs that connect entry pages and thread pages. The authors reduced the forum crawling problem to the problem of recognizing URL types, and enabled the crawler to learn regular expression patterns of the implicit navigation paths. It was shown that, with the use of a robust page classifier, even a small training set can be enough for large-scale forum crawling.
2.5 Motivation
As is evident from the above literature review, there is an abundance of methods for scheduling web pages for retrieval. However, we would like to validate the basic assumptions and see what the actual patterns of web page changes are on web forums, which are our target. The experiments are performed to further refine the scheduling methods based on web page changes, or to propose some hybrid approach which would take into account a larger number of factors, leading to even better crawler performance.
3 EXPERIMENT SETUP
3.1 What is a Change?
To analyze the data we had to define the notion of a page change. The data that we gathered at certain time points contained three types of information: the page was not available under a given URL (marked N/A), the page was available and did not change in comparison to the previous fetch of the page from a given URL even by a single byte (marked as 0), or the page was available and comparison with the previous fetch revealed a change, even if only by a single bit (marked as 1). The latter case also includes the situation when the page was collected for the first time, or was not available under a given URL previously, but at a certain point in time non-empty content was fetched from this URL.
Therefore, several situations may occur in practice:
N/A-N/A – page not available previously and in a subsequent fetch, no change
N/A-1 – page not available previously, but content fetched in a subsequent round
1-1 – page changed previously and in a subsequent fetch
1-N/A – page was available previously but disappeared in a subsequent fetch
1-0 – page was available previously and did not change
0-0 – page was available previously and still did not change
0-N/A – page was available and disappeared
0-1 – page was available and changed in a subsequent fetch
WebForumsChangeAnalysis
107
Out of all the possibilities, one (N/A-0) could not occur in our data set.
The motivation for this method of counting changes is that we wanted to distinguish between the pages that appeared and did not change later, and the pages that appeared and then disappeared, either immediately or later in time.
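To make the bookkeeping concrete, a minimal sketch of how consecutive fetch outcomes can be mapped to the transition labels defined above could look as follows; the helper names are our own, and only the N/A, 0 and 1 labels follow the convention of this subsection:

    def fetch_state(available: bool, changed: bool) -> str:
        # "N/A": no content under the URL; "1": content fetched and different from
        # the previous fetch (or fetched for the first time); "0": identical content.
        if not available:
            return "N/A"
        return "1" if changed else "0"

    def transition(previous_state: str, current_state: str) -> str:
        # e.g. transition("0", "1") == "0-1". In our data the combination "N/A-0"
        # never occurs, because a page reappearing after being unavailable is
        # always recorded as changed ("N/A-1").
        return previous_state + "-" + current_state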
3.2 Data Set Characteristics
For our studies we have chosen 16 different Polish forum sites, which were periodically crawled during the experiment. After each crawl, a new snapshot was created that consisted of the files fetched from a given web site. The files were stored in a raw form for ease of further analysis and comparison.
We wanted to achieve a reasonably high granularity of fetches, so the web pages from the specified sites were retrieved every 2 hours. As the experiment lasted for 23 days (2012-05-05 – 2012-05-27), this gave us a sample of 266 snapshots for each forum site. The total number of unique URLs that were visited is 27 958, including web pages that were discovered during the experiment, as well as pages that, after initial retrieval, were found to be unavailable later on. To determine the existence of change (as described in 3.1) we used data from the 649 705 successful fetch events that were conducted.
Similarly to (Cho and Garcia-Molina, 2000), we used an active crawling technique with a page window. In active crawling, a program robot (crawler) visits selected web pages periodically, and each version of a page is stored for further analysis and change detection. Active crawling can be too obtrusive for the web sites being monitored, especially if the fetch frequency is high. Nevertheless, we chose this approach because it provides more reliable statistics and can be customised more precisely in terms of the frequency of visits and the scope of monitoring.
In our case the selection of web pages to be visited depends on the structure of a given web site. For each web site the crawler starts with a predefined main page and conducts a breadth-first search for available web pages. All pages that can be found no deeper than on the fifth level of the web site are perceived as the page window for that site. The page window can change during consecutive crawls: some pages can be created or moved to an appropriate level, and thus can be added to the page window; others can be removed from the window as they become unavailable or as they are moved deeper into the web site structure.
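The page window construction can be summarised by the following sketch of a depth-limited breadth-first traversal; fetch_page and extract_links stand for ordinary HTTP retrieval and same-site link extraction, and are assumptions of this illustration rather than a description of our actual crawler code:

    from collections import deque

    def build_page_window(start_url, fetch_page, extract_links, max_depth=5):
        # Breadth-first traversal from the predefined main page; every URL found
        # no deeper than max_depth levels becomes part of the page window.
        window = {start_url}
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            if depth >= max_depth:
                continue
            body = fetch_page(url)                          # assumed helper: HTTP GET, body or None
            if body is None:
                continue
            for link in extract_links(body, base_url=url):  # assumed helper: same-site links
                if link not in window:
                    window.add(link)
                    queue.append((link, depth + 1))
        return window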
This method of crawling to a depth of five allowed us to finish each crawl before the time point at which the subsequent crawl had to start. Obviously, we could extend the depth of the crawl, but given our infrastructure limitations, it would require extending the time span between subsequent crawls as well. Our initial experiments indicated that, for a depth of 6 or more, the number of pages retrieved in each crawl rises significantly, which was expected.
4 RESULTS
Our studies show some interesting features of the web forum sites that were monitored. Figure 1 shows the number of new and deleted web pages that were found during the experiment period, aggregated on a daily basis. The data indicate a very high change rate of the local collection: there were more than 1000 new pages on average every day, and a similar number of web pages were found to be no longer available. As the average number of web pages in the local collection was 3028, this means that every day about one third of the collection was replaced by new pages.
It can be seen that the numbers of both new and deleted web pages show a weekly fluctuation, with a peak in the middle of the week and a decrease at the weekends.
Figure 1: Number of new URLs over the experiment period per day (daily counts of created and deleted pages, May 05 to May 26).
Figure 2 shows the results of our experiment concerning the average change frequency of the retrieved web pages. As can be seen, a significant percentage of pages change very often: 30.42% of the monitored web pages changed every 2 hours or more often on average, 50.41% at least every 4 hours on average, and 87.96% every day or more often on average.
To analyze how quickly the local collection be-
comes out of date, we measured the time from the
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
108
first fetch of each web page to its first change afterwards. A snapshot of web pages retrieved from the web forums became obsolete in a short time: after 2 hours, 40.49% of web pages were out of date; after 4 hours it was 63.6%; and it took only 12 hours for 80.56% of web pages to become outdated.
Figure 2: Percent of pages with a given change frequency rate (log scale); axes: change frequency (h), percentage of pages (%).
We also calculated the lifespan of web pages in several ranges, as the time span between the first date and the last date on which content was available under a given URL, regardless of its changes and possible disappearances in the meantime. The data presented in Table 1 indicate that a large number of pages have a relatively short lifespan (over 40% disappear within 4 hours). At the same time, almost 25% of pages live longer than one day. Obviously, our crawling approach with a page window could affect these results to an unknown extent.
Table 1: Lifespan of web pages.
Time Percentage
0-2 h 27.82%
2-4 h 14.9%
4-8 h 8.16%
8-12 h 8.54%
12-24 h 16.44%
1 day – 1 week 14.16%
> 1 week 9.95%
Finally, we analysed the relation between the frequency of changes and possible web page behaviour patterns, approximated by the standard deviation of the frequency of changes. We found that the pages that change frequently (with averages between 1 and 10 hours) present consistent behaviour in terms of their frequency of changes. The higher the average time span between subsequent changes, the less consistent behaviour a page presents: standard deviations tend to fan out for higher averages. This might be evidence of the low predictability of web page change patterns and the randomness of change occurrences. We speculate that the group of pages that change often and consistently includes the “front” of the web forum, where new information is published, while the other group of pages, which change much less often, consists of pages that are updated, removed or commented on. The relatively broad range of the most frequently changing pages also indicates that new posts pushing back the old ones create artificial change notifications perceived at the web page level. To capture actual changes of content, a more in-depth analysis of the pages is necessary.
5 SUMMARY OF THE
EXPERIMENT RESULTS
Over 30% of pages are worth visiting more often than every 2 hours, and over 50% of pages are worth visiting more often than every 4 hours. The open question is how many of these 30% change only because of the scripts generating the page, without any actual new content. However, regardless of the answer to this question, which would require another experiment with a much higher granularity of fetches, it seems that for instant monitoring of forums it is essential to identify frequently changing pages, in order to capture new content as soon as it appears and to increase the freshness of the collection.
6 IMPLICATIONS FOR
CRAWLER DESIGN
Our main motivation for conducting the experiments was to check whether existing incremental crawling approaches are justified and sufficient for dealing with web forums. The data that we obtained indicate that, if large-scale forum crawling is to be performed, the high frequency of changes of the web pages has to be taken into account. Our goal for crawling is to capture new information as early as possible. Since the large majority of web pages change within 24 hours, it is necessary for the crawler to revisit URLs much more often. Given the usual situation of limited network bandwidth and the politeness policy necessary to avoid overburdening target web servers, it is essential to guide the crawl towards the resources that change more often and are of particular relevance.
WebForumsChangeAnalysis
109
However, the randomness of the web page change process means that basing web page revisits solely on an estimate of the web page change frequency might not be sufficient. Other factors could be taken into account, such as the day of the week or the forum structure. Combining information about the frequency of changes with information about the structure of a web site as a graph could indicate the places in the graph where new information is actually published. At the same time, it would make it possible to optimize the revisit policy so that the most relevant URLs are visited on a less-than-hourly basis.
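One possible shape of such a combined revisit policy, written as an illustrative sketch rather than an implemented design, is to weight the Poisson-based change probability by structural and temporal hints; the depth discount and the day-of-week weight below are assumptions for illustration, not values fitted to our data:

    import math

    def revisit_score(rate_per_hour, hours_since_visit, depth, weekday_weight=1.0):
        # Poisson-model probability that the page has changed since the last visit...
        p_change = 1.0 - math.exp(-rate_per_hour * hours_since_visit)
        # ...down-weighted for pages deep in the site structure, where our data
        # suggest new content appears less often, and modulated by a day-of-week
        # weight reflecting the weekly fluctuation observed in the experiment.
        return p_change * weekday_weight / (1.0 + depth)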
ACKNOWLEDGEMENTS
The work published in this article was supported by
the project titled: “Ego Virtual Identity”, financed
by the Polish National Centre of Research and Devel-
opment (NCBiR), contract no. NR11-0037-10.
REFERENCES
Adar, E., Teevan, J., and Dumais, S. T. (2009a). Resonance
on the web: web dynamics and revisitation patterns.
In Proceedings of the 27th international conference on
Human factors in computing systems, CHI ’09, pages
1381–1390, New York, NY, USA. ACM.
Adar, E., Teevan, J., Dumais, S. T., and Elsas, J. L. (2009b).
The web changes everything: understanding the dy-
namics of web content. In Proceedings of the Second
ACM International Conference on Web Search and
Data Mining, WSDM ’09, pages 282–291, New York,
NY, USA. ACM.
Baeza-Yates, R. and Castillo, C. (2007). Crawling the infi-
nite web. J. Web Eng., 6(1):49–72.
Baeza-Yates, R., Castillo, C., Marin, M., and Rodriguez,
A. (2005). Crawling a country: better strategies than
breadth-first for web page ordering. In Special inter-
est tracks and posters of the 14th WWW conference,
WWW ’05, pages 864–872, New York. ACM.
Ben Saad, M. and Gançarski, S. (2011). Archiving the web
using page changes patterns: a case study. In Pro-
ceedings of the 11th annual international ACM/IEEE
joint conference on Digital libraries, JCDL ’11, pages
113–122, New York, NY, USA. ACM.
Buttler, D., Rocco, D., and Liu, L. (2004). Efficient web
change monitoring with page digest. In Proceedings
of the 13th international World Wide Web conference
on Alternate track papers & posters, WWW Alt. ’04,
pages 476–477, New York, NY, USA. ACM.
Cai, R., Yang, J.-M., Lai, W., Wang, Y., and Zhang, L.
(2008). iRobot: an intelligent crawler for web forums.
In Proceedings of the 17th WWW conference, WWW
’08, pages 447–456, New York, NY, USA. ACM.
Cho, J. and Garcia-Molina, H. (2000). The evolution of the
web and implications for an incremental crawler. In
Proceedings of the 26th International Conference on
Very Large Data Bases, VLDB ’00, pages 200–209,
San Francisco. Morgan Kaufmann Publishers Inc.
Cho, J. and Garcia-Molina, H. (2003). Estimating fre-
quency of change. ACM Trans. Internet Technol.,
3(3):256–290.
Douglis, F. and Ball, T. (1996). Tracking and viewing
changes on the web. In USENIX Technical Confer-
ence. AT&T Bell Laboratories.
Douglis, F., Ball, T., Chen, Y.-F., and Koutsofios, E. (1998).
The AT&T Internet Difference Engine: Tracking and
viewing changes on the web. World Wide Web, 1:27–
44.
Hirschberg, D. S. (1977). Algorithms for the longest com-
mon subsequence problem. J. ACM, 24(4):664–675.
Jiang, J., Yu, N., and Lin, C.-Y. (2012). FoCUS: learn-
ing to crawl web forums. In Proceedings of the 21st
WWW conference, WWW ’12 Companion, pages 33–
42, New York. ACM.
Kwon, S., Lee, S., and Kim, S. (2006). Effective crite-
ria for web page changes. In Zhou, X., Li, J., Shen,
H., Kitsuregawa, M., and Zhang, Y., editors, Frontiers
of WWW Research and Development - APWeb 2006,
volume 3841 of Lecture Notes in Computer Science,
pages 837–842. Springer Berlin / Heidelberg.
Law, M. T., Thome, N., Gançarski, S., and Cord, M.
(2012). Structural and visual comparisons for web
page archiving. In Proceedings of the 2012 ACM sym-
posium on Document engineering, DocEng ’12, pages
117–120, New York, NY, USA. ACM.
Liu, M., Cai, R., Zhang, M., and Zhang, L. (2011). User
browsing behavior-driven web crawling. In Proceed-
ings of the 20th ACM international conference on In-
formation and knowledge management, CIKM ’11,
pages 87–92, New York, NY, USA. ACM.
Rocco, D., Buttler, D., and Liu, L. (2003). Page digest for
large-scale web services. In IEEE International Con-
ference on E-Commerce, pages 381–390.
Saad, M. B. and Gançarski, S. (2010). Using visual pages
analysis for optimizing web archiving. In Proceedings
of the 2010 EDBT/ICDT Workshops, EDBT ’10, pages
43:1–43:7, New York, NY, USA. ACM.
Toyoda, M. and Kitsuregawa, M. (2006). What’s really new
on the web?: identifying new pages from a series of
unstable web snapshots. In Proceedings of the 15th
WWW conference, WWW ’06, pages 233–241, New
York, NY, USA. ACM.
Yang, J.-M., Cai, R., Wang, C., Huang, H., Zhang, L., and
Ma, W.-Y. (2009). Incorporating site-level knowledge
for incremental crawling of web forums: a list-wise
strategy. In Proceedings of the 15th ACM SIGKDD
international conference on Knowledge discovery and
data mining, KDD ’09, pages 1375–1384, New York,
NY, USA. ACM.
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
110