MODELING THE WEB AS A FOREST OF TREES
Fathi Tenzakhti
Computer Science Department, University of Nizwa, Oman
Keywords: Internet, Web, Replication, Tree model, Response time, Dynamic Programming.
Abstract: This study tries to demonstrate that the World Wide Web (the Web for short) could be modeled as a forest
of trees. Each Web site has its own tree for which it is the root. Since trees were extensively studied in the
literature, many problems related to performance, fault-tolerance and availability of the Web could be
understood more easily and the existing body of knowledge about trees could be applied to solve these
problems.
1 INTRODUCTION
The World Wide Web (Web for short) is a very
important source of information, and Web services
are being promoted as the next generation
technology for Business to Business (B2B) and
Business to Consumer (B2C) E-commerce.
Unfortunately, the Web suffers from several
drawbacks, notably slow response time and a
high rate of unavailability, especially in the
East. Solving these problems in an efficient and
simple way requires understanding the topology of
the Web and how information travels from a Web
site to its clients.
In this study, we try to show that it is safe and
sound to model the Web as a forest of trees. Each
Web site is the root of a tree along which
information from the Web site to its clients travels.
The argument is that in a given period of time θ
(around 6 hours), each Web site has a set of clients
accessing it. Given that Internet routes are stable
(V. Paxson, 1997), the information traveling from the
Web site to its clients takes the same route during
this period of time θ.
Consequently, each Web site is the root of a tree
whose leaves are the clients of this Web site during
the period θ. The internal nodes of the tree are
intermediary nodes that could themselves be clients
of the Web site. Information from the Web site to its
clients travels along the paths of this tree. Since the
Web is a collection of Web sites, it is safe to say that
the Web could be modeled as a forest of trees, each
Web site being the root of its own tree.
In the rest of the paper, Section 2 presents studies
that have assumed that the Web is modeled as a
forest of trees. Section 3 presents the organization of
the Web and tries to show that the tree modeling is
safe and accurately describes how the information
flows between the Web site and its clients. Section 4
presents the benefits of such modeling and how it
affects the way the replica placement problem in the
Web is solved. Finally, Section 5 concludes the
paper.
2 RELATED WORK
To the best of my knowledge, the first study to
model the Web as a set of trees is that of (B. Li
et al., 1999). The authors assumed the Web to be a
forest of trees (the tree model for short), each tree
being rooted at the target Web server. This tree model
reduced the problem of placing Web proxies to that of
placing copies on a tree, and helped optimize a
performance measure for the target Web server
subject to system resources and traffic pattern.
Specifically, the study was interested in finding the
optimal placement of multiple Web proxies (M)
among potential sites (N) under a given traffic
pattern.
The study in (F. Tenzakhti, 2006) assumed
that the routes along which information flows from
Web servers to clients form a tree rooted at the Web
server. It then considered the problem of finding a
minimum cost residence set that optimizes the cost
of servicing access requests in a read-only
environment taking into account the capacity
constraint of the links.
Tenzakhti F. (2008).
MODELING THE WEB AS A FOREST OF TREES.
In Proceedings of the Fourth International Conference on Web Information Systems and Technologies, pages 216-219
DOI: 10.5220/0001513302160219
Copyright © SciTePress
3 WWW AS A FOREST OF TREES
The tree model that we advocate for the Web
depends on the stability of the Internet routing
methods. If the routes on the Internet are stable, the
routes used by clients to access the Web server will
form a shortest path tree (or routing tree) rooted at
the server. Existing studies (V. Paxson, 1997) have
pointed out that in practice most routes in the
Internet are stable. It has been found that 80% of
routes change at a frequency lower than once a day.
Krishnan et al. (2000) traced the routes from the
Bell Labs Web server to 13,533 destinations and
found that almost 93% of the routes were stable
during their experiments. Therefore, the stability of
Internet routing is a realistic assumption that can
reduce the arbitrary topology of the Web to a forest
of trees.
Given the stability of Internet routing (V. Paxson,
1997), an object requested by a client $c$ and located
at server $s$ travels through a path
$s \to r_1 \to r_2 \to \dots \to r_n \to c$, called a
preference path and denoted by $\pi(s, c)$. The
preference path consists of a sequence of nodes with
the corresponding routers. Routes from $s$ to the
various clients form a routing tree along which
requests are propagated. Consequently, for each
server $s$, a tree $T_s$ rooted at $s$ can be constructed
to depict the routing tree, and the entire Web can be
represented as a collection of such routing trees, each
rooted at a given Web server. Formally, a routing tree
is the union of the preference paths.
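As an illustration of a routing tree being the union of preference paths, the following is a minimal Python sketch. The function name `build_routing_tree`, the node labels, and the representation of the tree as a child-to-parent map are assumptions made for this example only; they do not come from the studies cited above.

```python
def build_routing_tree(paths):
    """Merge preference paths (lists that start at the server) into a parent map."""
    parent = {}
    for path in paths:
        for upstream, downstream in zip(path, path[1:]):
            # Under stable routing, a node is always reached through the same
            # upstream hop; a second, different hop means the paths are not a tree.
            if parent.setdefault(downstream, upstream) != upstream:
                raise ValueError(f"{downstream} has two different upstream hops")
    return parent


# Hypothetical preference paths pi(s, c1), pi(s, c2), pi(s, c3).
paths = [
    ["s", "r1", "r2", "c1"],
    ["s", "r1", "r2", "c2"],
    ["s", "r1", "r3", "c3"],
]
parent = build_routing_tree(paths)
print(parent)
```

The parent map is enough to walk from any client back to the root, which is the direction in which requests travel in the tree model.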
Each server $s$ knows the preference path from
itself to any client $c$. This information can be
extracted and periodically refreshed from the routing
database kept by the routers (Anne Benoit et al,
2006). The routing information allows the
comparison of network distances (e.g. number of
hops) among servers within a given platform. A user
issues one request at a time for a Web page, which is
fetched to the user as a single unit.
In the tree $T_s$, when a client $c$ sends a request to
access a server $s$, the request is always sent toward
the root along the preference path. If the server is
replicated and the request meets, on its way, a replica
that holds the requested object, it is served by that
replica. Otherwise it has to travel all the way to the
root, where it is serviced by $s$. Note that a replica
that is closer to the client $c$ but not en route
between $c$ and $s$ is ignored.
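The request-resolution rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is given as a hypothetical child-to-parent map, every replica is assumed to hold the requested object, and the name `serving_node` is invented for the example.

```python
def serving_node(client, parent, replicas):
    """Walk from the client toward the root; return the first replica met."""
    node = client
    while node in parent:           # the root server has no parent entry
        if node in replicas:
            return node
        node = parent[node]
    return node                     # no replica en route: the root serves


# Hypothetical tree: s -> r1 -> r2, with r2 -> r3 -> c1 and r2 -> r4 -> c2.
parent = {"r1": "s", "r2": "r1", "r3": "r2", "r4": "r2", "c1": "r3", "c2": "r4"}

# r4 is 3 hops from c1 while s is 4 hops away, yet r4 is not on the
# preference path of c1, so the request still goes to the root.
print(serving_node("c1", parent, {"r4"}))
print(serving_node("c2", parent, {"r4"}))
```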
According to this tree model, the network
topology is thus represented by a graph $G = (V, E)$,
where $m = |V|$ is the number of nodes and $E$ is the
set of edges, which represent the physical links
connecting these nodes. Nodes are routers, Web
servers or a combination of both (servers provide the
information a client is looking for). Routers are
connected via wide-area links to form the
communication network. Some routers, called
gateways, provide connections to the outside
Internet. These are the gateways through which all
requests enter the system.
4 BENEFITS OF TREE MODELLING
Replication is a technique of storing copies of shared
objects on servers where they are frequently
accessed. It is used to address the scalability
problem of popular sites (Anne Benoit et al, 2006).
Replication improves efficiency by allowing
operations to use local replicas instead of remote
ones (Anne Benoit et al, 2006, F. Tenzakhti et al,
2003).
The replica placement problem deals with many
issues: how many replicas are needed
in the replicated system, where to place these
replicas, how to route requests to the appropriate
replica, etc. (F. Tenzakhti et al, 2003, B. Li et al.,
1999). In this study, we are mainly interested in
where to place a given number of replicas. Throughout
the study, we use the word proxy to mean a
replica of the whole site. The proxies discussed in
this paper are transparent proxies. They are located
along the routes from clients to a Web server and are
transparent to the clients. A proper placement of
proxies would lead most client requests to be served
at proxies, without letting them travel further to the
server. Since the access patterns of clients and the
sizes of the trees are different, the allocation and
placement of proxies have significant impact on the
overall system performance. To formally define our
problem, we introduce the following notation. Let
$d(u, v)$ be the distance between any two vertices $u$
and $v$ in the tree graph $T_s$; $d(u, v)$ is equal to the
length of the shortest path $\pi(u, v)$:
$$d(u, v) = \sum_{(x, y) \in \pi(u, v)} d(x, y) \qquad (1)$$
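As a small illustration of Eq. (1), the sketch below sums edge lengths along the unique tree path between two vertices. The data layout (each node mapped to a pair of its parent and the length of the connecting link), the edge lengths, and the function names are assumptions chosen for the example.

```python
def distances_to_ancestors(node, parent):
    """Map every node on the path from `node` to the root to its distance from `node`."""
    dist = {node: 0}
    total = 0
    while node in parent:
        up, length = parent[node]
        total += length
        dist[up] = total
        node = up
    return dist


def tree_distance(u, v, parent):
    """d(u, v): sum of the edge lengths on the tree path between u and v."""
    du = distances_to_ancestors(u, parent)
    dv = distances_to_ancestors(v, parent)
    # The path goes through the lowest common ancestor, which is the common
    # ancestor with the smallest combined distance (edge lengths are >= 0).
    return min(du[a] + dv[a] for a in du if a in dv)


# Hypothetical tree with link lengths: node -> (parent, d(node, parent)).
parent = {"r1": ("s", 2), "r2": ("r1", 1), "c1": ("r2", 3), "c2": ("r1", 4)}
print(tree_distance("c1", "s", parent))   # 3 + 1 + 2
print(tree_distance("c1", "c2", parent))  # (3 + 1) + 4
```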
Let $p(v, s)$ be the first proxy met while traveling
from $v$ to $s$ in $T_s$. This will be referred to as the
optimal proxy. It could be $v$ itself if $v$ is a proxy,
or $s$ if no proxy is met on the way to the root
server. Let $f(v)$ be the access frequency from client
$v$ to server $s$ during a period of time θ, and $\beta(v)$
the load that node $v$ imposes on the proxy $p(v, s)$.
If $P$ is the replication scheme (the set of proxies for
the tree $T_s$ associated with the $p(v, s)$ function),
then the total distance to access the proxies is
$d(P) = \sum_{v \in T_s} d(v, p(v, s))$ and the total cost of
accessing the data is given by:
$$C(T_s, P) = \sum_{v \in T_s} f(v)\, d(v, p(v, s)) \qquad (2)$$
Any node $v$ whose optimal proxy is $u = p(v, s)$
imposes a load $\beta(v)$ on $u$. As a constraint, the set
of nodes whose optimal proxy is $u$ should not
impose a load that is greater than the capacity $\kappa_u$
of $u$. Consequently, if $P_u = \{x \in T_s : p(x, s) = u\}$,
then the following inequality must be satisfied:
$\sum_{x \in P_u} \beta(x) \le \kappa_u$. Now, for a fixed number of
proxies $k \ge 1$, let us find the optimal replication
scheme $P$ that minimizes the total access cost
$C(T_s, P, \kappa)$ over the tree $T_s$, taking into
consideration the capacity constraints $\kappa$ of the
proxies ($\kappa$ being a vector storing the node
capacities). The problem thus reduces to finding:
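$$C(T_s, k, \kappa) = \min_{P \subseteq T_s,\ |P| = k} \{ C(T_s, P, \kappa) \} \qquad (3)$$
subject to
$$\sum_{x \in P_u} \beta(x) \le \kappa_u, \quad \forall u \in P \qquad (4)$$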
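To make the objective and the capacity constraint concrete, here is a minimal brute-force sketch in Python. It is not the dynamic programming method of this paper: it simply enumerates every proxy set of size $k$ on a tiny tree. The example tree, the choice $\beta(v) = f(v)$, the capacities, and the function names are all assumptions made for illustration. The root is counted as one of the $k$ proxies, in line with the rule that a single proxy is always placed at the root.

```python
from itertools import combinations


def evaluate(parent, f, beta, kappa, proxies):
    """Total access cost for `proxies`, or None if some capacity is exceeded."""
    load = {p: 0 for p in proxies}
    cost = 0
    for v in f:
        node, dist = v, 0
        while node not in proxies:          # the root is always in `proxies`
            up, length = parent[node]
            node, dist = up, dist + length
        cost += f[v] * dist                 # f(v) * d(v, p(v, s))
        load[node] += beta[v]               # v loads its optimal proxy
    if any(load[p] > kappa[p] for p in proxies):
        return None                         # capacity constraint violated
    return cost


def best_placement(root, parent, f, beta, kappa, k):
    """Exhaustive search over all proxy sets of size k that contain the root."""
    best = None
    for extra in combinations([n for n in f if n != root], k - 1):
        proxies = {root, *extra}
        cost = evaluate(parent, f, beta, kappa, proxies)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, proxies)
    return best


# Hypothetical tree: s has children a and b; a has children c and d.
parent = {"a": ("s", 1), "b": ("s", 1), "c": ("a", 1), "d": ("a", 1)}
f = {"s": 0, "a": 1, "b": 2, "c": 5, "d": 1}
beta = dict(f)                              # assumption: load equals frequency
kappa = {n: 10 for n in f}
print(best_placement("s", parent, f, beta, kappa, 2))
```

With ample capacity the second proxy goes to the busiest client `c`. Lowering `kappa["c"]` below 5 makes that placement infeasible, and the search falls back to `a`, which shows how the capacity constraint changes the optimum.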
We will use dynamic programming to solve the
above minimization problem, and therefore find the
optimal $k$-replication scheme. Consider the tree $T_s$
rooted at $s$ with a set $V$ of vertices. Assume that the
children of each non-leaf vertex are ordered from left
to right so that, given any two siblings $u$ and $v$, we
are able to determine whether $u$ is to the left of $v$ or
vice versa. For $v \in T_s$, let $T_v$ be the subtree of $T_s$
rooted at $v$.
For any $u \in T_v$, we can partition $T_v$ into 3 subtrees:
1) Subtree $L_{v,u}$ containing all nodes to the left of $u$.
2) Subtree containing all nodes in $T_u$.
3) Subtree $R_{v,u}$ containing the rest of the nodes.
Formally, we write:
$L_{v,u} = \{x : x \in T_v,\ x \text{ is to the left of } u\}$
$T_u$ = subtree of $T_v$ rooted at $u$
$R_{v,u} = \{x : x \in T_v,\ x \notin T_u \cup L_{v,u}\}$.
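The three-way partition can be sketched as follows. The text does not spell out what "to the left of $u$" means for nodes that are not siblings of $u$, so this example assumes one common reading: $x$ is to the left of $u$ if $x$ precedes $u$ in a preorder traversal and is not an ancestor of $u$. The ordered-children map and the function names are also assumptions for illustration.

```python
def subtree(node, children):
    """Nodes of the subtree rooted at `node`, in preorder (children left to right)."""
    order = [node]
    for child in children.get(node, []):
        order.extend(subtree(child, children))
    return order


def partition(v, u, children):
    """Split T_v into (L_{v,u}, T_u, R_{v,u}) for a node u of T_v."""
    t_v = subtree(v, children)
    t_u = subtree(u, children)
    # Ancestors of u inside T_v: nodes other than u whose subtree contains u.
    ancestors = {x for x in t_v if x != u and u in subtree(x, children)}
    # Assumed meaning of "left of u": before u in preorder, not an ancestor.
    left = [x for x in t_v[: t_v.index(u)] if x not in ancestors]
    rest = [x for x in t_v if x not in t_u and x not in left]
    return left, t_u, rest


# Hypothetical tree: v has ordered children a, u, b; a has a1; u has u1.
children = {"v": ["a", "u", "b"], "a": ["a1"], "u": ["u1"]}
print(partition("v", "u", children))
```

Under this reading the root $v$ always falls in $R_{v,u}$, which matches the later use of $R_{v,u}$ as the part of $T_v$ that still contains the proxy at $v$.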
The central issue here is to divide the problem into
small-scale subproblems. For this reason, we need
to further partition $R_{v,u}$ into smaller subtrees. For
any $u' \in R_{v,u}$, we introduce
$L_{v,u,u'} = \{y : y \in R_{v,u} \text{ and } y \text{ is to the left of } u'\}$.
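Given the recursive nature of the solution, equation
(3) applied to $T_v$ yields
$$C(T_v, k, \kappa_v) = \min_{P \subseteq T_v,\ |P| = k} \{ C(T_v, P) \} \qquad (5)$$
and
$$\sum_{x \in P_u} \beta(x) \le \kappa_u, \quad \forall u \in P \qquad (6)$$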
where $C(T_v, k, \kappa_v)$ is the minimum access cost
obtained by placing $k$ proxies in $T_v$ given the load
capacity $\kappa_u$ of each node $u \in T_v$. When $k = 1$, the
only proxy is always placed at the root $v$. When $k > 1$,
we can always find a node $u$, with $u \in T_v$ and $u \ne v$,
which satisfies:
1. A proxy is placed at $u$;
2. No proxy is placed in $L_{v,u}$;
3. No proxy is placed in $\pi(u, v) \setminus \{u, v\}$.
Assuming $T_v$ is partitioned at the node $u$, and
that $k'$ proxies are placed in $T_u$, $1 \le k' \le k - 1$,
then $k - k'$ proxies are placed in $R_{v,u}$.
For all $k$ proxies, we need to consider all possible
partitioning points $u \in T_v$ and all possible values
of $k'$. Recursively, the proxies are allocated to $T_u$
and $R_{v,u}$ the same way as in $T_v$. The dynamic
programming approach can thus be formulated by
equation (8).
WEBIST 2008 - International Conference on Web Information Systems and Technologies
We can therefore write
$$C(T_v, k, \kappa_v) = A(L_{v,u}, \kappa_v) + C(T_u, k', \kappa_u) + C(R_{v,u}, k - k', \kappa'_v),$$
where
$$\kappa'_v = \kappa_v - \sum_{x \in L_{v,u}} \beta(x) \qquad (7)$$
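$$C(T_v, k, \kappa_v) = \begin{cases}
\sum_{x \in T_v} f(x)\, d(x, v) & \text{if } \sum_{x \in T_v} \beta(x) \le \kappa_v \text{ and } k = 1 \\
\min_{u \in T_v,\ 1 \le k' \le k - 1} \{ A(L_{v,u}, \kappa_v) + C(T_u, k', \kappa_u) + C(R_{v,u}, k - k', \kappa'_v) \} & \text{if } \sum_{y \in L_{v,u}} \beta(y) \le \kappa_v \text{ and } k > 1 \\
\text{Undefined} & \text{otherwise}
\end{cases} \qquad (8)$$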
In equation (8), $A(L_{v,u}, \kappa_v)$ is a constant that could
be undefined if the total load of the nodes in $L_{v,u}$ is
higher than the capacity $\kappa_v$ of the node $v$.
$C(T_u, k', \kappa_u)$ is recursively defined in $T_v$ with a
capacity constraint $\kappa_u$ of node $u$ and $\kappa_x$ of
all $x \in T_u$. $R_{v,u}$ is further partitioned into
$L_{v,u,u'}$, $T_{u'}$ and $R_{v,u'}$ around the node $u'$ where a
proxy is put. The capacity constraint of $R_{v,u}$ with
respect to proxy $v$ is the remaining capacity $\kappa'_v$
obtained by subtracting from the capacity $\kappa_v$ of $v$
the total load imposed on $v$ by the nodes in $L_{v,u}$.
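Two small ingredients of the recurrence, the remaining capacity of Eq. (7) and the enumeration of the split $k'$, can be sketched in Python as follows. This is not a full implementation of Eq. (8); the function names, the example loads, and the use of `None` to stand for the undefined case are assumptions made for illustration.

```python
def remaining_capacity(kappa_v, left_nodes, beta):
    """kappa'_v of Eq. (7); None when the nodes of L_{v,u} alone overload v."""
    rest = kappa_v - sum(beta[x] for x in left_nodes)
    return rest if rest >= 0 else None


def proxy_splits(k):
    """All pairs (k', k - k') with 1 <= k' <= k - 1."""
    return [(k_u, k - k_u) for k_u in range(1, k)]


# Hypothetical loads for the nodes of L_{v,u}.
beta = {"a": 3, "a1": 4}
print(remaining_capacity(10, ["a", "a1"], beta))  # 10 - (3 + 4)
print(remaining_capacity(5, ["a", "a1"], beta))   # overloaded: undefined case
print(proxy_splits(4))
```

Each split assigns $k'$ proxies to $T_u$ and $k - k'$ to $R_{v,u}$, so a dynamic program over Eq. (8) would try every pair returned by `proxy_splits` for every candidate node $u$.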
5 CONCLUSIONS
In this study, we have shown that it is safe and
sound to assume that the Web is a forest of trees.
Each tree is rooted at a Web site. The leaves of the
tree are the clients of the Web site during a given
period of time θ, and the interior nodes are either
clients or routers used to route the requests and
information from and to the clients. With this simple
topology, many of the problems related to the
performance and availability of the Web could be
studied easily, and simple algorithms based on the
tree topology could be developed to solve
them. In this study, we have shown
how an algorithm for replication in the Web for
performance could be easily developed once the tree
structure is assumed.
REFERENCES
Anne Benoit, V. Rehn, and Y. Robert, 2006. Impact of
QoS on Replica Placement in Tree Networks.
Research Report 2006-48, LIP, ENS Lyon,
France. Available at graal.ens-lyon.fr/~yrobert/.
B. Li, M.J. Golin, F. Italiano, X. Deng, K. Sohraby, 1999.
On the Optimal Placement of Web Proxies in the
Internet. Proc. IEEE INFOCOM, pp. 1282-1290.
F. Tenzakhti, 2006. Optimal Placement of Web Proxies in
the Internet with Link Capacity Constraints. Journal of
Digital Information Management, Vol. 4, No. 4.
F. Tenzakhti, M. Ould Khaoua, K. Day, 2003. On the
Availability of Replicated Content in the Web.
International Journal on Computing and Information
Sciences, Vol. 1, No. 1, pp. 51-60.
P. Krishnan, D. Raz, and Y. Shavitt, 2000. The Cache
Location Problem. IEEE/ACM Transactions on
Networking, Vol. 8, No. 5, pp. 568-582.
V. Paxson, 1997. End-to-End Routing Behavior in the
Internet. IEEE/ACM Transactions on Networking, Vol. 5,
No. 5, pp. 601-615.