MINING CLICKSTREAM-BASED DATA CUBES
Ronnie Alves and Orlando Belo
Departament of Informatics,Schooç of Engineering, University of Minho
Campus de Gualtar, 4710-057 Braga, Portugal
Keywords: Clickstream Analysis, OLAP systems, Multidimensional Databases, On-line Analytical Mining.
Abstract: Clickstream analysis can reveal usage patter
ns on company’s web sites giving highly improved
understanding of customer behaviour. This can be used to improve customer satisfaction with the website
and the company in general, yielding a great business advantage. Such information has to be extracted from
very large collections of clickstreams in web sites. This is challenging data mining, both in terms of the
magnitude of data involved, and the need to incrementally adapt the mined patterns and rules as new data is
collected. In this paper, we present some guidelines for implementing on-line analytical mining engines
which means an integration of on-line analytical processing and mining techniques for exploring
multidimensional data cube structures. Additionally, we describe a data cube alternative for analyzing
clickstreams. Besides, we discussed implementations that we consider efficient approaches on exploring
multidimensional data cube structures, such as DBMiner, WebLobMiner, and OLAP-based Web Access
Engine.
1 INTRODUCTION
The concepts (and techniques) of data mining and
knowledge discovery could be applied efficiently on
web sites (or e-commerce sites). Recently, web
usage mining has attracted much attention from
researchers and e-business professionals, because it
offers many benefits to an e-commerce website such
as:
Targeting customers based on usage behaviour
or profile (personalization).
Adjusting web content and structure dynamically
based on page access pattern of users (adaptive
web site).
Enhancing the service quality and delivery to the
end user (cross-selling, up-selling).
Improving web server system performance based
on the web traffic analysis.
Identifying hot area/killer area of the web site.
The data needed to accomplish such tasks is
deri
ved normally from a Web server log file
almost all e-commerce applications are Web based.
Clickstream files are generated in order to represent
information that is specific to each Web access
attempt.
The recent progress and development of data
m
ining and data warehousing technologies
contribute effectively to the emergence of new data
mining and data warehousing systems (Fayyad et al.,
1998) (Chen et al., 1996), opening new doors to the
handling of very large data files like clickstreams.
In this paper we propose a new computational
p
latform specially design to support the exploration
of multidimensional data cube structures. We also
describe some of the most relevant mechanisms for
exploring data cubes, and discuss the best practices
in data cubes analysis and exploration.
2 EXPLORING DATA CUBE
STRUCTURES
The main idea of a data warehouse is to provide
decision makers with integrated information that is
organized according to their requirements. Online
Analytical Processing (OLAP) systems are the
predominant front-end tools used in these
environments. The main focus of OLAP tools is to
provide multidimensional analysis over decision-
support oriented data. To achieve this goal, these
tools employ multidimensional models for the
storage and presentation of data. In these systems,
583
Alves R. and Belo O. (2004).
MINING CLICKSTREAM-BASED DATA CUBES.
In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 583-586
DOI: 10.5220/0002631305830586
Copyright
c
SciTePress
data is organized in data cubes (or hypercubes),
which are defined over a multidimensional space
involving several dimensions of analysis.
Consequently, mining can be performed in different
portions of data cubes and at different levels of
abstraction (Zaiane et al., 1998).
2.1 Mining Functions of Data Cube
Engines
Building a clickstream data cube allows for the
application of OLAP operations such as drill-
down, roll-up, and slice and dice , to view and
analyze clickstreams from different angles, derive
ration and compute measures across many
dimensions (Kimbal, 2000). This greatly facilitates
the exploration process, since such a process should
be investigative in nature, that is, mining should be
performed at different portions of data at multiple
levels of abstraction improving Knowledge
Discovery Process in Database systems.
We believe the following mining functions are
essential for successful implementation of data cube
engines, and its uses are extremely desired for
clickstream analysis: data characterization, class
comparison, association, prediction, classification,
and time-series analysis.
2.2 Data Cube Engines
There are several projects and implementations
on data cube engines. Most of them try to
accomplish the features mentioned in the previous
section.
DBMiner
DBMiner has been developed for interactive
mining of multiple-level knowledge in large
relational databases and data warehouses (Han et al.,
1997). The system implements a wide spectrum of
data mining functions, including characterization,
comparison, association, classification, prediction,
and clustering. By incorporating several interesting
data mining techniques, including OLAP and
attribute-oriented induction, statistical analysis,
progressive deepening for mining multiple-level
knowledge, and meta rule guided mining, the system
provides a user friendly, interactive data mining
environment with good performance.
WebLogMiner
In the WebLogMiner project, the data collected in
the web logs goes through four stages. In the first
stage, the data is filtered to remove irrelevant
information and a relational database is created
containing the meaningful remaining data. This
database facilitates information extraction and data
summarization based on individual attributes like
user, resource, user’s locality, day, etc. In the second
stage, a data cube is constructed using the available
dimensions. On-line analytical processing (OLAP) is
used in the third stage to drill-down, roll-up, slice
and dice in the web log data cube. Finally, in the
fourth stage, data mining techniques are put to use
with the data cube to predict, classify, and
discoverer interesting correlations (Zaiane et al.,
1998).
An OLAM based Web Access Analysis Engine
In (Chen et al., 1999), was described a scalable
framework developed on top of an Oracle-8 based
data warehouse and a commercially available multi-
dimensional OLAP server, Oracle Express, which
they have used to develop applications for analyzing
customer calling patterns from telecom networks
and shopping transaction from e-commerce sites. In
(Chen et al., 2000), they have described a web
access analysis engine implemented on this
framework to support the collection and mining of
web log records at the high data volumes typical of
large commercial web sites.
3 A DATA CUBE ALTERNATIVE
ENGINE FOR CLICKSTREAM
ANALYSIS
Although there are many studies, implementations
and proposals for efficient and effective data mining
algorithms, data cube engines require fast response
due to its nature of interactive mining, which poses
new challenges on efficient implementation.
The main goal of this system is to develop a data
cube engine based on data mining and OLAP
techniques with abilities to analyze specific
clickstreams from specialized data cubes. In
addition, using this engine, it will be possible to:
create efficient data cube structures for effective
pattern behaviour analysis;
create efficient data cube structures for pattern
usage analysis;
ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION
584
Figure 1: An overall perspective of analysing clickstreams using data cubes.
perform efficient OLAP operations for effective
exploration of data cubes;
perform efficient mining techniques for effective
discovery and understanding of clickstream data
cubes.
In fact, we believe that these features will allow
users to carry out several web usage mining tasks
such as mentioned previously. Our motivation in this
data cube approach relies not only on the
implementation issues behind such integration of
OLAP and mining techniques, but on the
configuration, handling and deployment of data
cubes for e-commerce purposes.
3.1 Implementation Guidelines
It is expected that special attention should be paid to
the following implementation considerations to the
successful of this approach, as mentioned in (Han et
al., 1998).
Modularized design and standard APIs.
Support of OLAM by high performance data
cube technology.
Constraint-based on-line analytical mining.
Progressive refinement of data mining quality.
Layer-shared mining with data cubes.
Book marking and backtracking techniques..
Based on these considerations in the next section
we introduce our data cube alternative for analyzing
clickstreams.
3.2 The Engine
In way to attend the requirements previously
mentioned, the data cube engine has been built in
five modules (Figure 1):
Cube Definition. In this module is defined the
data source which it will be used for creating the
data cube. Then, it is possible to model the cube,
which means designing dimensions and
measures. Also, some hierarchies are defined as
well as some OLAP operations.
Cube Evaluation. It is responsible for evaluating
cubes generated, which means to verify if all the
constraints and the requirements on the cube
definition engine are satisfied.
Query Definition. This module is responsible for
describing the query for exploring data cube
using OLAP operations. And sometimes,
interchange these OLAP operations with mining
techniques.
Query Evaluation. This module acts in the same
way as the cube evaluation module. But, in this
case, it analyzes the query with constraints and
definitions of the data cube that would be
explored, checking the consistency and
presenting query results.
Cube Mining. This system’s module must be
used for mining data cube using the techniques
available inside the engine. It is also used when
some mining query is required by the query
evaluation module.
The process of analyzing clickstreams, using data
cube structures, begins with some ETL tasks on the
MINING CLICKSTREAM-BASED DATA CUBES
585
clickstream. Next, a ROLAP database is used for
storing the clickstreams. As long as the clickstream
are available on the database it is possible to perform
exploratory analysis using data cubes. Besides,
before any kind of exploration over data cubes a
multidimensional structure needs to be built. This is
supported using the data cube definition module. It
includes the following attributes: page_dimension,
time_di-mension, date_dimension, user_agent,
referrer_di-mension, request_dimension, and
session_di-mension.
Building this clickstream data cube allows the
application of OLAP operations, to view and
analyze the clickstreams from different angles,
derive ratios, or compute measures across many
dimensions. The data cube structure offers analytical
modelling capabilities, including a calculation
engine for deriving various statistics, and a highly
interactive and powerful data retrieval and analysis
environment. It is possible to use this engine to
discover implicit knowledge in the clickstream data
cube. The knowledge that can be discovered is
represented in the form of rules, tables, charts,
graphs, and other visual presentation forms for
associating or classifying data from clickstream data
cube (Figure 2).
4 CONCLUSIONS AND FUTURE
WORK
In this paper, we have discussed some
implementation issues on on-line analytical mining,
specifically on the data cube engines and its desired
functions. In fact, the observations mentioned in
(Han et al., 1998) (Chen et al., 1999) (Chen et al.,
2000) (Han et al., 1997) motivated us to study the
desired way to perform data cube mining and its
efficient implementation. As a result, we present our
data cube mining engine proposal, which contains
some guidelines and perspectives of research in
applying data cube techniques for analyzing
clickstreams.
REFERENCES
Chen, S., M., Han, J. and Yu, S., P., 1996. Data Mining:
An overview from database perspective. IEEE Trans.
Knowledge and Data Engineering, 8:866-8883.
Chen, Q., Dayal, U. and Hsu, M., 2000. An OLAP-based
Scalable Web Access Analysis Engine”. HP Labs,
Hewlett-Packard, 1501 Page Mill Road, MS 1U4, Palo
Alto, CA 94303, USA.
Chen, Q., Dayal, U. and Hsu, M., 1999. A Distributed
OLAP Infrastructure for E-Commerce. Proc. Fourth
IFCIS Conference on Cooperative Information
Systems (CoopIS’99).
Fayyad, U., M., Piatetsky-Shapiro, G., Smyth, P., and
Uthurusamy., R., 1998. Advances in Knowledge
Discovery and Data Mining. AAAI/MIT Press.
Han., J., 1998 Towards on-line analytical mining in large
databases. ACM SIGMOD Record, 27:97-107.
Han, J., Chee, S., and Chiang, J., Y., 1998. Issues for On-
line Analytical Mining of Data Warehouses,
SIGMOD’98 Workshop on Research Issues on Data
Mining and Knowledge Disvovery (DMKD’98).
Han, J., Chiang, J., Chee, S., Chen, J., Chen, Q. , Cheng,
S., Gong, W., Kamber, M., Liu, G., Koperski, K., Lu,
Y., Stefanovic, N., Winstone, L., Xia, B., Zaiane, O.,
R., Zhang, S. and Zhu H. 1997. DBMiner: A system
for data mining in relational databases and data
warehouses. In Proc. CASCON'97.
Figure 2: Discovering implicit knowledge on
clickstream data cubes.
Kimbal. R., 2000. The Data Webhouse Toolkit, Wiley.
Zaiane, O., Xin, M., and Han. J., 1998. Discovering web
access patterns and trends by applying olap and data
mining technology on web logs. In Proceedings of
Advances in Digital Libraries Conference (ADL),
pages 19—29.
ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION
586