study.
In this paper, we propose an approach to the
problems of change detection and warehouse
maintenance for an XML web warehouse system. First,
we propose an object-oriented data model for XML
web pages in the web warehouse, together with a
system architecture for change detection and
warehouse maintenance. Then, we propose a change detection
method based on mobile agent technology to ac-
tively detect changes of data sources of the web
warehouse. Finally, we propose an incremental and
deferred maintenance method to maintain XML web
pages in the web warehouse. We have implemented
an experimental prototype system for change
detection and maintenance of an XML web warehouse
and have compared our approach experimentally with
a rewriting approach. The performance evaluation
shows that our approach is efficient in terms of both
the response time and the storage space of the web
warehouse.
The remainder of this paper is organized as
follows. In Section 2 we describe the data model of the
web warehouse and the system architecture for
change detection and warehouse maintenance. In
Section 3 we present the change detection method
and algorithm. In Section 4 we present the
warehouse maintenance method and algorithm. Section 5
presents the experimental results and discusses our
observations on them. Section 6 concludes this paper
and gives some directions for future research.
2 DATA MODEL AND SYSTEM ARCHITECTURE
2.1 Data Model
Our web warehouse stores XML data from the Web.
Hence, we propose a data model, called the XML
Web Warehouse Data Model (XWWDM), for XML
web pages in the web warehouse. Because an XML web
page has a hierarchical structure, we follow the
Document Object Model (Apparao, 1998) and
decompose an XML web page into a tree structure.
In addition, the design of the XWWDM model is based
on the OEM-like model (Chawathe, 1999) and takes
into account two characteristics of a web warehouse.
First, a web warehouse is like a data warehouse in
that it can store historical data. Therefore, the data
model includes version information to keep track of
changes to the data. Second, data in a web warehouse
are sourced from remote web sites. Therefore, the data
model includes source information to identify the
source of the data. The XWWDM model is an
object-oriented model whose class definition is shown
in Figure 1.
class XML_Page
    {root: XML_Node, version: Version_Info, source: Source_Info};
class XML_Node
    {content: Node_Content, version: Version_Info};
class Node_Content
    {label: string, value: string, p-node: XML_Node, child#: integer, s-action: char};
class Version_Info
    {version#: integer, update-time: time};
class Source_Info
    {url: string, title: string};
class Update
    {content: Update_Content, source: Source_Info};
class Update_Content
    {label: string, value: string, p-node: XML_Node, detect-time: time, action: char};
Figure 1: The class definition of the XWWDM model
An XML web page is represented as an object of
the class XML_Page, which has three attributes: root,
version, and source. The attribute root records the
root node of the tree structure of the web page. The
attributes version and source record the newest
version information and the source information of the
web page, respectively. Each node of a web page is
represented as an object of the class XML_Node, which
has two attributes: content and version. The attribute
content records the content, position, and source
action of a node. The attribute version records the
version information of a node. The class
Node_Content has five attributes: label, value,
p-node, child#, and s-action. The attributes label and
value record the tag label and data content of a node,
respectively. The attributes p-node and child# record
the parent of the node and its child number under
that parent, respectively. The attribute s-action
records the source action that caused the node to be
created; its value is I (for insertion), D (for deletion),
or M (for modification). The class Version_Info has
two attributes, version# and update-time, which
record the version number and the time of the last
update, respectively. The class Source_Info has two
attributes, url and title, which record the URL and
title of the source web page, respectively.
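As an illustration only, the five classes described above could be
sketched in Python as follows. This is not the prototype's code. The
snake_case attribute names (version_no for version#, child_no for
child#, and so on), the use of None and child number 0 for the root
node, the validation of s-action, and the example URL, labels, and
dates are assumptions made for this sketch and are not part of the
XWWDM definition.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class VersionInfo:
    version_no: int        # version# in Figure 1
    update_time: datetime  # update-time: time of last update


@dataclass
class SourceInfo:
    url: str    # URL of the source web page
    title: str  # title of the source web page


@dataclass
class NodeContent:
    label: str                  # tag label of the node
    value: str                  # data content of the node
    p_node: Optional["XMLNode"]  # parent node; None for the root (assumption)
    child_no: int               # child# under the parent; 0 for the root (assumption)
    s_action: str               # source action: 'I', 'D', or 'M'

    def __post_init__(self) -> None:
        # Assumed validation: Figure 1 only declares s-action as a char.
        if self.s_action not in ("I", "D", "M"):
            raise ValueError("s_action must be 'I', 'D', or 'M'")


@dataclass
class XMLNode:
    content: NodeContent
    version: VersionInfo


@dataclass
class XMLPage:
    root: XMLNode
    version: VersionInfo  # newest version information of the page
    source: SourceInfo


def build_example_page():
    """Build a hypothetical two-node page: <catalog><book>...</book></catalog>."""
    v1 = VersionInfo(version_no=1, update_time=datetime(2004, 1, 1))
    root = XMLNode(NodeContent("catalog", "", None, 0, "I"), v1)
    child = XMLNode(NodeContent("book", "XML Basics", root, 1, "I"), v1)
    source = SourceInfo("http://example.com/catalog.xml", "Catalog")
    return XMLPage(root, v1, source), child
```

Note that, as in Figure 1, a node refers to its parent through p-node
rather than holding a list of children, so the tree is navigated
upward from each node.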
We adopt a change-centric approach to storing
all versions of an XML web page. Only the first
version is stored in full; for subsequent versions,
only deltas are stored. As shown in Figure 2,
all frames represent the same web page, and
each frame represents a specific version at time Ti.
The first frame represents the first version, in which
all nodes of the web page are stored. The other frames
represent subsequent versions, in which only the
nodes that have changed are stored. The number and
letter drawn beside a node are the child number and source
CHANGE DETECTION AND MAINTENANCE OF AN XML WEB WAREHOUSE