the input data and to develop an accurate model for
each class using the features presented in the data.
The class descriptions are used to classify future test
data for which the class labels are unknown. Web
document classification is an attempt to merge the
quality and user-friendliness of directories with the
popular ranked list presentation. Classifying the results increases their
readability by showing thematic groups instead of a mixture of documents
on all possible subjects matching the query.
Web Mining is an important field that aims to
make good use of the information available on the
web and find the data that was either previously
unknown or hidden. An important step in the mining
process is information retrieval and extraction. The
retrieval and extraction methods differ in which aspect of a document they
use to extract information (Lan and Bing, 2003). In general there are two
schools of thought: natural language processing techniques and techniques
that use the structure of the web. Natural language processing techniques
operate on the text of the web using string manipulation, while structural
methods build a model from the structure of the document itself.
The research in web mining also derives from the
research in other fields like natural language
processing, artificial intelligence and machine
learning. The techniques developed in these fields mostly deal with a
subset of the web pages. Efforts
to combine the content and structure of a web page
to build a model that is suitable for mining a wide
variety of web documents are few and certainly
insufficient.
2.1 Centroid Technique
In the centroid-based classification algorithm, the Web documents are
represented using the vector-space model (Salton, 1989) (Raghavan and
Wong, 1986). In this model, each Web document is represented by the
term-frequency vector of equation 1.

    W = (tf_1, tf_2, ..., tf_n)                                        (1)

where W is the Web document vector and tf_i is the frequency of the
i-th term.
A widely used refinement to this model is to
weigh each term based on its inverse document
frequency (IDF) in the Web document collection
(Salton, Wong, and Yang, 1975). The motivation
behind this weighting is that terms appearing
frequently in many Web documents have limited
discrimination power, and for this reason they need
to be de-emphasized. This is commonly done by
multiplying the frequency of each term i by log(N/df_i). This leads to
the tf-idf representation of the Web document given in equation 2.

    W_tfidf = (tf_1 log(N/df_1), tf_2 log(N/df_2), ..., tf_n log(N/df_n))    (2)

where df_i is the number of documents that contain the i-th term and N
is the total number of documents in the training set.
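As an illustration, equation 2 can be sketched in Python. This is a minimal, hypothetical implementation: the function name `tfidf_vector`, the toy corpus, and the sorted-vocabulary convention are assumptions for the example, not part of the original method.

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, corpus, vocab):
    """Build the tf-idf vector of equation 2: w_i = tf_i * log(N / df_i)."""
    N = len(corpus)                     # total number of training documents
    tf = Counter(doc_terms)             # raw term frequencies tf_i
    vec = []
    for term in vocab:
        df = sum(1 for d in corpus if term in d)  # document frequency df_i
        weight = tf[term] * math.log(N / df) if df else 0.0
        vec.append(weight)
    return vec

# Toy three-document training set (illustrative only)
corpus = [["web", "mining", "data"], ["web", "page"], ["data", "mining"]]
vocab = sorted({t for d in corpus for t in d})
v = tfidf_vector(corpus[0], corpus, vocab)
```

Terms that occur in many documents (high df_i) receive a small log(N/df_i) factor, which is exactly the de-emphasis the weighting is meant to achieve.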
In order to account for documents of different
lengths, the length of each Web document vector is
normalized so that it is of unit length. Given a set N of Web documents
and their corresponding vector representations, the centroid vector (Han
and Karypis, 2000) is defined by equation 3.

    C = (1/|N|) Σ_{W ∈ N} W_tfidf                                      (3)
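Equation 3, together with the unit-length normalization described above, can be sketched as follows. The helper names `unit` and `centroid` are hypothetical; the method itself is the averaging of equation 3.

```python
import math

def unit(vec):
    """Normalize a vector to unit length, as the text prescribes."""
    norm = math.sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm else vec

def centroid(vectors):
    """Equation 3: average the normalized tf-idf vectors of the set N."""
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dims)]

# Centroid of two orthogonal unit vectors (illustrative only)
c = centroid([unit([1.0, 0.0]), unit([0.0, 1.0])])
```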
Equation 3 is nothing more than the vector
obtained by averaging the weights of the various
terms present in the Web documents of N. N is referred to as the
supporting set for the centroid. In the vector-space model, the
similarity between two Web documents W_i and W_j is commonly measured
using the cosine function, given by equation 4.

    cos(W_i, W_j) = (W_i · W_j) / (||W_i||_2 ||W_j||_2)                (4)

where "·" denotes the dot-product of the two vectors.
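A direct transcription of equation 4 (a minimal sketch; the function name `cosine` is an assumption for the example):

```python
import math

def cosine(a, b):
    """Equation 4: dot(a, b) / (||a||_2 * ||b||_2)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

same = cosine([1.0, 0.0], [1.0, 0.0])  # identical directions
orth = cosine([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
```

The value ranges from 1 for vectors pointing in the same direction down to 0 for vectors with no terms in common (tf-idf weights are non-negative).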
The advantage of the summarization performed
by the centroid vectors is that the computational
complexity of the learning phase of this centroid-
based classifier is linear in the number of Web
documents and the number of terms in the training
set. Moreover, the amount of time required to
classify a new Web document x is at most O(km),
where k is the number of centroids and m is the
number of terms present in x .
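The O(km) classification step can be sketched as a nearest-centroid lookup. This is a hypothetical minimal implementation (the names `classify` and the toy centroids are assumptions); it compares the new document against each of the k centroids once.

```python
import math

def cosine(a, b):
    """Cosine similarity of equation 4."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(x, centroids):
    """Assign x to the class of the most similar centroid.
    One cosine per centroid gives the O(k*m) cost stated in the text."""
    best_label, best_sim = None, float("-inf")
    for label, c in centroids.items():
        sim = cosine(x, c)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Two toy class centroids (illustrative only)
label = classify([0.9, 0.1], {"sports": [1.0, 0.0], "technology": [0.0, 1.0]})
```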
2.2 Web Document Indexing
In order to reduce the complexity of the Web
documents and make them easier to handle, they
have to be transformed into vectors. The vector-space model procedure
can be divided into three steps. The first step is content extraction,
where content-bearing terms are extracted from each Web page. The second
step is term weighting, which enhances the retrieval of Web documents
relevant to the user. The last step ranks the Web documents with respect
to the query according to a similarity measure.
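The third step, ranking documents against a query, can be sketched as follows (a minimal illustration; the function name `rank` and the toy vectors are assumptions, and the query is assumed to already be in vector form):

```python
import math

def rank(query_vec, doc_vecs):
    """Order document indices by cosine similarity to the query, highest first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)

# Three toy document vectors ranked against a query (illustrative only)
order = rank([1.0, 0.0], [[0.0, 1.0], [1.0, 0.2], [0.5, 0.5]])
```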
WEBIST 2005 - WEB INTERFACES AND APPLICATIONS