experiments. The Web is highly dynamic: new pages
are created at a fast rate and old pages disappear
quickly. Besides, we found that the pages on the same
host are densely linked to each other. These results
imply that the link structure within a host, and
likewise within a directory, may be quite similar. If
we sort the Web pages in a directory by PageRank and
plot each PageRank value as a dot, we can observe that
the PageRank values form several clusters. The pages
in the same cluster have similar link structure, and
thus similar numbers of in-links. Such similarly
linked pages therefore obtain similar PageRank values
and fall into one cluster once the pages are sorted
by PageRank.
In our previous work (Kao and Lin, 2007), we
predicted the “true importance score” of pages in the
future based on the clustering feature of PageRank in
a directory. The PageRank of a page at different
previous time stages grows within its cluster. Thus,
the prediction of PageRank at the next time stage can
be the average PageRank of the cluster to which the
page belongs. In this paper, we modify the original
prediction algorithm to give a more precise
prediction. In our experiments, we show that the
augmented prediction algorithm effectively reduces
the relative error of prediction in the cases that
the original method does not cover.
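The cluster-average idea described above can be illustrated with a minimal sketch. It is not the algorithm of (Kao and Lin, 2007): the text does not say how clusters are delimited, so the gap rule used here (start a new cluster when the relative jump between neighbouring sorted values exceeds `gap_ratio`) is an assumption, as are the names `split_into_clusters`, `predict_next_pagerank`, and `gap_ratio`.

```python
from statistics import mean


def split_into_clusters(sorted_scores, gap_ratio=0.5):
    """Group ascending PageRank values into clusters.

    Assumption (not from the paper): a new cluster starts whenever the
    relative jump from the previous value exceeds gap_ratio. Values are
    assumed to be positive, as PageRank scores are.
    """
    if not sorted_scores:
        return []
    clusters = [[sorted_scores[0]]]
    for prev, cur in zip(sorted_scores, sorted_scores[1:]):
        if prev > 0 and (cur - prev) / prev > gap_ratio:
            clusters.append([cur])      # large jump: open a new cluster
        else:
            clusters[-1].append(cur)    # small jump: same cluster
    return clusters


def predict_next_pagerank(pageranks, gap_ratio=0.5):
    """Predict each page's next-stage PageRank as its cluster average.

    pageranks maps a page identifier to its current PageRank within
    one directory.
    """
    pages = sorted(pageranks, key=pageranks.get)   # sort pages by PageRank
    scores = [pageranks[p] for p in pages]
    predictions = {}
    idx = 0
    for cluster in split_into_clusters(scores, gap_ratio):
        avg = mean(cluster)
        for _ in cluster:
            predictions[pages[idx]] = avg
            idx += 1
    return predictions


# Hypothetical PageRank values for five pages in one directory.
directory = {"a": 0.10, "b": 0.11, "c": 0.12, "d": 0.50, "e": 0.52}
predictions = predict_next_pagerank(directory)
```

With these illustrative values, pages a, b, and c fall into one cluster and d and e into another, so each page is assigned the mean of its own cluster.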
2 RELATED WORK
Many researchers have long investigated Web search
engines. The Information Retrieval (IR) community
has proposed many outstanding algorithms for matching
documents to a given query. These algorithms analyze
the content of the documents to find the best
matching results. The authors in (Salton and McGill,
1983) provide an overview of this traditional work.
A number of researchers have investigated the
link structure of the Web and discovered how to
utilize it to improve search results. They have also
proposed various ranking metrics. Major search
engines use the PageRank algorithm. The works in
(Abiteboul et al., 2003) and (Kamvar et al., 2003)
provide different ways to improve PageRank
computation. The authors in (Haveliwala, 2002) study
how to personalize PageRank by giving different
weights to pages. The work in (Xing and Ghorbani,
2004) shows that better search results can be
obtained by applying another weighting function to
links. The authors in (Jiang et al., 2004) found that
dividing the Web into blocks and assigning different
weights to different blocks based on some principles
can improve the performance of PageRank search
results. The authors in (Xue et al., 2005) discover an
inherent property of the Web and then propose a novel
ranking method called Hierarchical Rank to
re-estimate the PageRank of pages. The authors in
(Yates et al., 2002) propose a new method to calculate
page importance by considering the last modified
time, so that newly created pages are treated
equitably. The authors in (Eiron and McCurley, 2003)
and (Kumar et al., 2000) also investigate the
properties of the Web.
2.1 Page Quality
Cho et al. (Cho et al., 2005) proposed a new point of
view to explain the meaning of PageRank. They
believe that users determine the PageRank score of
Web pages. The quality estimator is given in
formula (1):
$$\hat{Q}(p,t) = I(p,t) + P(p,t) = \frac{n}{r}\left(\frac{dP(p,t)/dt}{P(p,t)}\right) + P(p,t) \tag{1}$$
where $\hat{Q}(p,t)$ denotes the quality of page p at
time t, and $I(p,t)$ and $P(p,t)$ represent the
increase in popularity and the popularity of page p
at time t, respectively. Moreover, n is the total
number of Web users and r is a normalization
constant. In practice, however, the time derivative
cannot be obtained directly; it can only be
approximated through the increase of PageRank
between different time points. In other words, we use
the PageRank scores at discrete time points to make
the prediction. Formula (1) is modified as follows:
$$\hat{Q}(p,t_i) = \frac{n}{r}\left(\frac{\Delta PR(p,t_i)/\Delta t_i}{PR(p,t_i)}\right) + PR(p,t_i) \tag{2}$$
where $PR(p,t_i)$ is the PageRank of p at time $t_i$,
$\Delta PR(p,t_i) = PR(p,t_i) - PR(p,t_{i-1})$, and
$\Delta t_i = t_i - t_{i-1}$.
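As a minimal sketch of the discrete estimator in formula (2), the function below computes the quality estimate from two PageRank snapshots of one page. The function name `estimate_quality` and the parameter `n_over_r` are hypothetical; the text gives no values for n and r, so the default of 1.0 is purely illustrative.

```python
def estimate_quality(pr_prev, pr_cur, t_prev, t_cur, n_over_r=1.0):
    """Discrete page-quality estimate following formula (2).

    pr_prev, pr_cur: PageRank of the page at times t_prev and t_cur.
    n_over_r: the ratio n/r from formula (2); its value is not given
              in the text, so 1.0 is only a placeholder assumption.
    """
    delta_t = t_cur - t_prev
    if delta_t <= 0:
        raise ValueError("t_cur must be later than t_prev")
    if pr_cur <= 0:
        raise ValueError("PageRank at t_cur must be positive")
    delta_pr = pr_cur - pr_prev
    # Relative popularity increase: (n/r) * (dPR / dt) / PR
    increase = n_over_r * (delta_pr / delta_t) / pr_cur
    # Quality = popularity increase + current popularity
    return increase + pr_cur


# Hypothetical snapshots: PageRank grows from 0.20 to 0.25 in one step.
q_growing = estimate_quality(0.20, 0.25, t_prev=0, t_cur=1)

# A page whose PageRank no longer changes: the estimate equals its PageRank.
q_stable = estimate_quality(0.30, 0.30, t_prev=1, t_cur=2)
```

When the PageRank stops changing, the increase term vanishes and the estimate reduces to the current PageRank.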
In their model, the quality of a page is composed
of the “increasing popularity” and the “page
popularity at current time”. In the long run, the
quality of a page converges to a stable value.
Figure 1 shows the evolution of page popularity
(see footnote 2).
The
2 This plot is taken from
http://www.seoresearcher.com/popularity-ranking-faults.htm
and is also presented in (Cho and Roy, 2004). It was
originally observed experimentally in site popularity
evolution data collected by web ranking companies.
WEBIST 2008 - International Conference on Web Information Systems and Technologies