
2 EFFICIENT WEB CACHING
2.1 Web Caching
Web caching inherits several techniques and issues
from caching in processor memory and file systems.
However, the peculiarities of the web give rise to a
number of novel issues that call for adequate
solutions (Pitkow, 1994; Brain, 2002; Duane, 1995).
Caches can be deployed in different ways, which
can be classified depending on where the cache is
located within the network. The spectrum of
possibilities ranges from caches close to the client
(browser caching, proxy server caching) to caches
close to the origin server (web server caching).
Proxy server caching provides an interface between
many clients and many servers. The effectiveness of
proxy caching relies on many users accessing the
same set of objects. Even closer to the
client we find the browser caches, which perform
caching on a per user basis using the local file
system. However, since they are not shared, they
only help the single user activity. Web server
caching provides an interface between a single web
server and all of its users. It reduces the number of
requests the server must handle, and thus improves
load balancing, scalability and availability.
2.2 Web Cache Replacement Strategies
The cache replacement strategy decides which
objects will remain in cache and which are evicted to
make space for new objects. The choice of this
strategy has an effect on the network bandwidth
demand and object hit rate of the cache (which is
related to page load time).
Caching algorithms are characterized by their
replacement strategy, which mainly consists of
ordering the cached objects according to some
parameters (arrival order, request frequency,
object size, or combinations of these); objects
are evicted according to this order.
Various cache replacement strategies have been
described and analyzed since processor memory
caching was first invented, such as:
- FIFO (First In First Out): order by arrival time.
- LFU (Least Frequently Used): order inversely to
the number of requests.
- LRU (Least Recently Used): order by last request
time. One of the most popular replacement
strategies, LRU evicts the object that has not been
accessed for the longest time. It works well when
there is high temporal locality of reference in the
workload, that is, when the most recently
referenced objects are the most likely to be
referenced again in the near future.
- SLRU: order inversely to ∆T . Size, ∆T being
the number of requests since the last request to the
object.
- LRU-K: considers both frequency and recency of
reference when selecting an object for replacement.
- LRU-MIN: when the cache has to make room for an
object of size S, it first tries to do so by evicting
objects of size S or greater in LRU order; if that
fails, it tries with objects of size S/2 or more, then
S/4 or more, and so on.
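As an illustration, the LRU strategy above can be sketched in a few lines of Python. This is a minimal sketch, not any particular proxy's implementation: the class name is our own, capacity is counted in objects rather than bytes for simplicity, and an OrderedDict stands in for the recency-ordered list.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch: evicts the least recently
    used object when the object-count capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()  # keys ordered oldest -> newest

    def get(self, key):
        if key not in self._store:
            return None  # cache miss
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the LRU object

cache = LRUCache(2)
cache.put("a.html", "<html>A</html>")
cache.put("b.html", "<html>B</html>")
cache.get("a.html")                     # touch a.html: most recent
cache.put("c.html", "<html>C</html>")   # evicts b.html, the LRU object
print(cache.get("b.html"))              # → None (evicted)
print(cache.get("a.html"))              # → <html>A</html>
```

A size-aware variant such as LRU-MIN would additionally walk this recency order restricted to objects of size at least S, then S/2, and so on, until enough space is freed.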
To summarize, an ideal cache replacement
policy should be able to accurately determine future
popularity of documents and choose how to use its
limited space in the most advantageous way. In the
real world, we develop heuristics to approximate this
ideal behaviour.
2.3 Measures of Efficiency
The most popular measure of cache efficiency is hit
rate: the fraction of requests that are served
directly from the cache. A hit rate of 70% indicates
that seven of every ten requests to the cache found
the object being requested. Another important
measure of web cache efficiency is byte hit rate.
This is the number of bytes returned directly
from the cache as a fraction of the total bytes
accessed. This measure is not often used in classical
cache studies because the objects (cache lines) are of
constant size.
However, web objects vary greatly in size, from
a few bytes to millions. Byte hit rate is of particular
interest because the external network bandwidth is a
limited resource (sometimes scarce, often
expensive). A byte hit rate of 30% indicates that
three of every ten bytes requested by clients were
returned from the cache; conversely, 70% of all
bytes had to be retrieved across the external
network link.
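Both measures can be computed directly from a request trace, as the following sketch shows. The trace format (a hit flag and an object size per request) and the figures in it are hypothetical, chosen to reproduce the 70% hit rate and 30% byte hit rate used as examples above.

```python
def cache_metrics(requests):
    """Compute hit rate and byte hit rate from a request trace.
    Each request is a (hit, size_in_bytes) pair; `hit` is True
    when the object was served from the cache."""
    total = len(requests)
    hits = sum(1 for hit, _ in requests if hit)
    total_bytes = sum(size for _, size in requests)
    hit_bytes = sum(size for hit, size in requests if hit)
    hit_rate = hits / total if total else 0.0
    byte_hit_rate = hit_bytes / total_bytes if total_bytes else 0.0
    return hit_rate, byte_hit_rate

# Hypothetical trace: 7 hits on small objects (3,000 bytes in
# total) and 3 misses on large objects (7,000 bytes in total).
trace = [(True, 500), (True, 500), (False, 3000), (True, 500),
         (True, 500), (False, 2000), (True, 400), (True, 300),
         (False, 2000), (True, 300)]
hr, bhr = cache_metrics(trace)
print(f"hit rate = {hr:.0%}, byte hit rate = {bhr:.0%}")
# → hit rate = 70%, byte hit rate = 30%
```

Note how the two measures diverge when misses fall on large objects: most requests hit the cache, yet most bytes still cross the external link.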
Other measures of cache efficiency include the
cache server CPU or I/O system utilization, which
are driven by the cache server’s implementation.
Average object retrieval latency (or page load time)
is a measure of interest to end users and others.
3 WEB USAGE MINING
3.1 Data Preparation For Web Usage
Mining
Web usage mining is the application of data mining
techniques to discover usage patterns, or models,
extracted from web log data.
ICETE 2004 - GLOBAL COMMUNICATION INFORMATION SYSTEMS AND SERVICES