One main area of application is web-site adaptation,
where information obtained from phase two is used
in conjunction with specific web page relationships
and actions. The relationships and actions are
determined and provided by the owners of the
web-site.
2 RELATED RESEARCH
Pattern analysis in the context of Web usage mining
has been the subject of numerous research projects.
Two distinct directions are generally considered in
Web usage mining research: Statistics and Artificial
Intelligence techniques. The first approach, based on
Statistics, covers a range of applications from
overall analysis to adapted versions of statistical
data mining techniques (Srivastava et al., 2000).
These data mining techniques mine for rules that
satisfy pre-defined support and confidence values.
The second approach uses Artificial Intelligence
techniques, which draw upon methods and algorithms
developed in machine learning and pattern
recognition (Srivastava et al., 2000).
2.1 Statistical Mining Techniques
The research and application of statistical techniques
in the analysis of web usage data is a wide area.
Three specific research areas are covered in this
section. These are the overall statistics of the web
usage data, probability analysis and standard data
mining techniques.
Statistical techniques are the most common
method of extracting knowledge about the users of a
web-site. Statistics has been applied to web mining
at several levels of analysis. One general level is
the use of overall statistical information, which is
reported by log analysis software. Several software
tools, such as Analog (Analog, 2004) and OLAP tools
(The OLAP Report, 2005), are available for web log
analysis.
Another statistical approach taken in past research
is the use of probability in the form of Markov chain
modeling (Borges, 2004).
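As a minimal sketch of the idea, and not the specific
model of the cited work, a first-order Markov chain can
be estimated from navigation sessions by counting how
often one page directly follows another. The session
data and the function name below are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(sessions):
    """Estimate first-order transition probabilities from page sequences."""
    # counts[a][b] = number of times page b was requested right after page a
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1

    # Normalise each row of counts so that it sums to 1.
    probabilities = {}
    for page, followers in counts.items():
        total = sum(followers.values())
        probabilities[page] = {nxt: n / total for nxt, n in followers.items()}
    return probabilities


# Hypothetical sessions, each an ordered list of requested pages.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "about"],
    ["products", "cart"],
]
probs = transition_probabilities(sessions)
```

Here probs["home"]["products"] is the estimated
probability that a visitor on the home page requests
the products page next.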
Another well-known area of Statistics that has been
applied to Web usage mining is data mining. These
techniques find association rules within a set of
data and select the rules based on their confidence
and support (Agrawal et al., 1994). One example of
work in this area is the application of the Apriori
algorithm, in which association rule mining searches
for relationships between the items in the data.
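The two measures can be illustrated with a short
sketch. It only computes the support and confidence of
a single candidate rule and is not an implementation of
the full algorithm; the transactions and function names
are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical transactions: the set of pages seen in each visit.
transactions = [
    frozenset({"home", "products", "cart"}),
    frozenset({"home", "products"}),
    frozenset({"home", "about"}),
    frozenset({"products", "cart"}),
]

# Candidate rule: products -> cart
rule_support = support(transactions, {"products", "cart"})
rule_confidence = confidence(transactions, {"products"}, {"cart"})
```

A rule is retained only if both values reach the
pre-defined minimum support and confidence.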
2.2 Artificial Intelligence Techniques
The second type of technique involves the use of
Artificial Intelligence techniques such as clustering
and classification. Clustering is a machine learning
technique that groups together items having similar
characteristics, whereas classification assigns items
to predefined classes.
In the Web usage domain, there are two kinds of
interesting clusters to be discovered: page clusters
and usage clusters. Clustering of pages discovers
groups of pages having related content. Clustering of
users, in contrast, establishes groups of users
exhibiting similar browsing patterns.
Classification is the task of mapping a data item
into one of several predefined classes. Classification
is done by using supervised inductive learning
algorithms such as decision tree classifiers, naïve
Bayesian classifiers, k-nearest neighbor classifiers,
Support Vector Machines, etc.
Mobasher et al. (1999) modeled the profile using a
clustering approach named PACT, which stands for
Profile Aggregations based on Clustering
Transactions. The goal is to effectively capture
common usage patterns from potentially anonymous
click-stream data. The data is preprocessed and then
used to create the profile.
Preprocessing of the data is done in two steps:
identifying users and determining pageviews. First,
unique users are identified from the anonymous
usage data. Erroneous or redundant references are
removed from the data. Second, pageviews are
identified. Pageview identification is the task of
determining which page file accesses contribute to a
single browser display. Relevant pageviews are
included in transaction files, and weights are
assigned to reflect the significance of each pageview.
The clustering approach, however, requires the
structure of the web-site to be known.
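The following sketch shows one plausible way to
aggregate a profile from a cluster of weighted
transactions. It assumes that the clustering has
already been performed, that a profile weight is the
mean pageview weight over the cluster, and that
pageviews below a threshold are dropped; these choices,
the data, and the names are illustrative assumptions
and are not taken from the cited work.

```python
def aggregate_profile(cluster, min_weight=0.5):
    """Average pageview weights over one cluster of transactions.

    Each transaction maps a pageview to its weight. Pageviews whose mean
    weight falls below min_weight (an assumed threshold) are left out.
    """
    totals = {}
    for transaction in cluster:
        for page, weight in transaction.items():
            totals[page] = totals.get(page, 0.0) + weight

    size = len(cluster)
    return {
        page: total / size
        for page, total in totals.items()
        if total / size >= min_weight
    }


# Hypothetical cluster of four transactions with pageview weights.
cluster = [
    {"home": 1.0, "products": 1.0},
    {"home": 1.0, "products": 0.5, "cart": 1.0},
    {"home": 1.0, "about": 0.5},
    {"home": 1.0, "products": 1.0},
]
profile = aggregate_profile(cluster)
```

With this data only the home and products pageviews
remain in the profile, since the cart and about
pageviews occur too rarely within the cluster.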
Classification is also used to determine user
preferences (Baglioni et al., 2003). Classification
algorithms require training data as input. In this
example, the input is a set of cases whereby each
case specifies values for a collection of attributes
and for a class. The output of the classification
algorithm is a model that describes or predicts the
class value of a case on the basis of the values of the
attributes of the case. The predictive accuracy of the
extracted model is evaluated on a test set for which
the actual class is known.
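The train-then-evaluate procedure can be sketched as
follows. The sketch uses a 1-nearest-neighbour
classifier purely for illustration; it is not claimed
to be the classifier used in the cited work, and the
cases, attribute values and class labels "A" and "B"
are hypothetical.

```python
def predict(training_cases, visited):
    """Return the class of the training case closest to the visited set."""
    def distance(case):
        attributes, _label = case
        # Number of sections visited in one case but not in the other.
        return len(attributes ^ visited)

    return min(training_cases, key=distance)[1]


def accuracy(training_cases, test_cases):
    """Fraction of test cases whose predicted class equals the actual class."""
    correct = sum(
        1 for visited, label in test_cases
        if predict(training_cases, visited) == label
    )
    return correct / len(test_cases)


# Each case: (set of visited site sections, class label).
training_cases = [
    (frozenset({"sports", "news"}), "A"),
    (frozenset({"sports", "cars"}), "A"),
    (frozenset({"fashion", "news"}), "B"),
    (frozenset({"fashion", "travel"}), "B"),
]
test_cases = [
    (frozenset({"sports"}), "A"),
    (frozenset({"fashion"}), "B"),
    (frozenset({"cars", "news", "sports"}), "A"),
    (frozenset({"travel"}), "A"),
]
test_accuracy = accuracy(training_cases, test_cases)
```

The last test case is misclassified, so the model is
correct on three of the four test cases.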
This method requires registered users to provide
information about themselves. The registered users’
information is split so that 67% becomes the
training set and 33% becomes the test set. The
attributes of a case consist of the site pages or
sections visited by the user, and the class is the
user’s sex. In this case the goal is to accurately