
subtasks, according to Etzioni (Etzioni, 1996). Each
task is discussed in the next paragraphs.
2.1 Information Retrieval
Information Retrieval deals with automatic
discovery of all relevant documents satisfying a
specific query. Most of the work on information
retrieval focuses on automatic indexing of web
documents. However, indexing web pages is not a
trivial task compared with database indexing, for which
there are well-defined techniques. The huge number
of web pages, their heterogeneity, and frequent
changes in number and content make this task very
difficult. At present there are several search engines
for querying and retrieving web documents. Each one
has its own interface and a database that covers
a different fraction of the Web. Their indexes are
created and constantly updated by web robots,
which scan millions of web pages and store an index
of the words in the documents.
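The word index described above can be illustrated with a minimal sketch of an inverted index. The page URLs and texts, and the function names build_index and search, are hypothetical and chosen only for illustration; real search engines use far more elaborate structures.

```python
from collections import defaultdict
import re


def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query (AND semantics)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages standing in for documents fetched by a web robot.
pages = {
    "http://example.org/a": "Web mining applies data mining to the Web",
    "http://example.org/b": "Data mining finds patterns in large databases",
    "http://example.org/c": "Search engines index web pages",
}
index = build_index(pages)
hits = search(index, "data mining")
```

Here hits contains the two pages that mention both query words. The index has to be rebuilt or updated whenever pages change, which is one reason the frequent changes of the Web make indexing difficult.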
2.2 Information Extraction
Once the documents have been retrieved, the
challenge is the automated extraction of knowledge
from the source without any human effort. At
present most of the work in information extraction is
carried out by wrappers built around web sources. A
wrapper is a special program, which accepts queries
about information present in the pages of the source,
extracts the requested information and returns the
result. However, it is impractical to build wrappers for web
sources by hand for several reasons: the number of
web pages is very large, new pages are added
frequently, and the format of web pages often
changes. Ashish and Knoblock (Ashish and Knoblock,
1997) propose an approach to the semi-automatic
generation of wrappers for Web sources.
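To make the notion of a wrapper concrete, the following is a minimal hand-written sketch, not the approach of Ashish and Knoblock. It assumes a hypothetical source whose pages list books in an HTML table with three columns; the class names, field names, and sample page are all illustrative assumptions.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of the <td> cells of each <tr> row in a page."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


class BookWrapper:
    """Hypothetical wrapper around one source that lists books in a table."""

    FIELDS = ("title", "author", "year")  # assumed column order of the source

    def __init__(self, page_html):
        parser = RowParser()
        parser.feed(page_html)
        self.records = [dict(zip(self.FIELDS, row))
                        for row in parser.rows if len(row) == len(self.FIELDS)]

    def query(self, **conditions):
        """Return the records whose fields equal all the given values."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in conditions.items())]


# Hypothetical page of the wrapped source.
PAGE = """
<table>
  <tr><td>Web Mining</td><td>Author A</td><td>2004</td></tr>
  <tr><td>Data Cubes</td><td>Author B</td><td>1997</td></tr>
</table>
"""
wrapper = BookWrapper(PAGE)
result = wrapper.query(year="1997")
```

The sketch also shows why hand-built wrappers are fragile: the assumed column order is hard-coded, so any change in the page format silently breaks the extraction.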
2.3 Generalization
Once the discovery and extraction processes have
been automated, the next step is to generalize
from experience. This phase
involves pattern recognition and machine learning
techniques. The biggest obstacle in learning about
the Web is the large amount of unlabelled data. Many
data mining techniques require inputs labelled as
positive or negative examples with respect to some
concept. Fortunately, clustering techniques do not
require labelled inputs and have been applied
successfully to large collections of documents.
Association rules are another technique used in this
phase. They allow the discovery of associations and
correlations among data items, where the presence of
one set of items in a transaction implies, with a
certain degree of confidence, the presence of other
items.
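The support and confidence of an association rule can be illustrated with a short sketch. The transactions below are hypothetical sets of pages visited together, and the brute-force enumeration of single-item rules is an illustrative simplification, not an efficient mining algorithm such as Apriori.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))


def pair_rules(transactions, min_support, min_confidence):
    """List rules {x} -> {y} between single items that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for x, y in ((a, b), (b, a)):
            if support({x, y}, transactions) >= min_support:
                conf = confidence({x}, {y}, transactions)
                if conf >= min_confidence:
                    rules.append((x, y, conf))
    return rules


# Hypothetical transactions: the pages visited together in four sessions.
transactions = [
    frozenset({"home", "products", "order"}),
    frozenset({"home", "products"}),
    frozenset({"home", "contact"}),
    frozenset({"products", "order"}),
]
rules = pair_rules(transactions, min_support=0.5, min_confidence=0.9)
```

With these thresholds the only rule found is order -> products with confidence 1.0: every transaction that contains "order" also contains "products". The reverse rule has confidence 2/3 and is discarded.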
2.4 Analysis
Analysis is a data-driven problem in which humans
play an important role in validating and
interpreting the results. Once patterns have been
discovered, analysts need suitable tools to
understand, visualize, and interpret these patterns.
One technique is OLAP (On-Line
Analytical Processing), which uses a data cube structure
to simplify the visualization of multidimensional
data. Others (Mobasher et al., 1997) have proposed
an SQL-like language for querying the discovered
knowledge.
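The idea behind a data cube can be sketched as aggregating a measure over any subset of the dimensions. The fact records, dimension names, and the roll_up helper below are hypothetical; OLAP systems precompute and store such aggregates rather than scanning the facts each time.

```python
from collections import defaultdict


def roll_up(facts, dims, measure):
    """Sum measure over facts, grouped by the dimensions listed in dims."""
    cube = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        cube[key] += fact[measure]
    return dict(cube)


# Hypothetical facts with three dimensions (page, day, country) and one measure.
facts = [
    {"page": "/home", "day": "Mon", "country": "IT", "hits": 5},
    {"page": "/home", "day": "Tue", "country": "IT", "hits": 3},
    {"page": "/home", "day": "Mon", "country": "FR", "hits": 2},
    {"page": "/order", "day": "Mon", "country": "IT", "hits": 1},
]
by_page = roll_up(facts, ("page",), "hits")
by_page_day = roll_up(facts, ("page", "day"), "hits")
total = roll_up(facts, (), "hits")
```

Each choice of dimensions gives one view of the cube: by_page collapses day and country, by_page_day collapses only country, and total collapses all three into a single cell.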
3 WEB MINING CATEGORIES
Web Mining comprises three areas, according to which
part of the Web is mined:
Web Content Mining (WCM),
Web Structure Mining (WSM),
Web Usage Mining (WUM).
The distinctions among the above categories are
not clear-cut; the three Web Mining tasks could be
used in isolation or combined in an application. An
overview of each category follows.
3.1 Web Content Mining
The aim of WCM is to automate the
process of information discovery and extraction
from Web documents and services. There are
two main approaches to this problem (Cooley et al.,
1997):
1. Agent Based approach: “it involves
artificial intelligence systems that can act
autonomously or semi-autonomously on
behalf of a particular user, to discover and
organize Web-Based information”.
2. Database approach: “it organizes
heterogeneous and unstructured or semi-
structured data into more structured data,
such as relational database, and using
standard database querying mechanism and
ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION