Another work on web document clustering that
uses graph clustering methods is (He et al., 2001),
where spectral clustering is applied to a combination
of textual similarity, co-citation similarity and the
hyperlink structure.
1.2 Background
We start with a unified approach for both kinds of
similarity. As in standard Information Retrieval, we
describe a corpus of n documents by a set of m
attributes. The attributes are words, and each word
is either contained in a document or not. The
corresponding term-document matrix is defined as
usual: D = (d_ij), where d_ij = 1 if document i
contains term j and d_ij = 0 otherwise. The i-th row
of D is called the document vector of the i-th
document.
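A minimal sketch of this construction (not from the paper), assuming each document is already given as a list of tokens:

import numpy as np

def term_document_matrix(docs, vocabulary):
    """Binary term-document matrix D: D[i, j] = 1 iff document i contains term j."""
    index = {term: j for j, term in enumerate(vocabulary)}
    D = np.zeros((len(docs), len(vocabulary)), dtype=np.int8)
    for i, tokens in enumerate(docs):
        for token in set(tokens):  # presence only, not frequency
            j = index.get(token)
            if j is not None:
                D[i, j] = 1
    return D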
The following step describes the selection of
attributes. Usually, all words with the possible
exception of stop words are considered. This
approach ensures a description for almost any
document because a meaningful document does not
contain only stop words. But highly frequent words
introduce noise into this description: they are not
very specific, may have multiple meanings, and can
be used in very different settings. This disadvantage
is usually addressed by term weighting, but that only
reduces some of the noise. In our approach, we
drastically cut the number of attributes to fewer than
30 for a typical document. For this, we restrict the
set of terms to the low-frequency words with an
absolute frequency < f. In the experiments, we
deliberately chose f = 256, which means we ignored
the 100,000 most frequent words.
Such a rigorous reduction of the feature space is not
advisable for Information Retrieval. For clustering,
however, it helps to avoid artefacts caused by
ambiguity and speeds up processing considerably.
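A sketch of this attribute selection under the same tokenized-corpus assumption; the threshold f = 256 is the value used in the paper:

from collections import Counter

def low_frequency_terms(docs, f=256):
    """Restrict the attribute set to terms with absolute corpus frequency < f."""
    freq = Counter(token for tokens in docs for token in tokens)
    return sorted(term for term, count in freq.items() if count < f)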
As a consequence, we get a very specific description
using only very special terms. If two documents
share many such terms, this yields a very strict
notion of similarity. As will be shown in the
evaluation, the converse also holds: with high
probability, two similar documents share several
further special terms that were not used as attributes.
This approach of using fewer than 30 attributes to
describe a document is tested in the following two
settings:
1. We describe a document by the low-frequency
words contained in it.
2. We describe a web page by the link targets
found in that page.
Both approaches allow efficient calculation and give
remarkable results.
1.2.1 Document Similarity using DD^T
The similarity of two documents is usually
calculated as the dot product of the corresponding
document vectors. The product matrix S = DD^T
contains exactly these similarities. Having used only
low-frequency words as described above, we do not
need any term weighting.
The above similarity matrix can be calculated
efficiently by the following algorithm:

For each word do {
    list all pairs of documents containing this word;
    sort the resulting list of pairs;
}
For each pair (i, j) in this list, count the number
of occurrences as s_ij;
Depending on the size of the collection, s_ij > 7 (or
so) indicates some weak similarity, while s_ij > 15
(or so) is returned for very similar documents.
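A sketch of this algorithm in Python; a hash-based counter stands in for the explicit sort-and-count over the pair list, which yields the same counts. The inverted index (term to documents) is assumed as input:

from collections import Counter
from itertools import combinations

def pairwise_similarities(postings):
    """postings: dict mapping each (low-frequency) term to the set of ids
    of documents containing it. Returns s with s[(i, j)] = number of
    shared terms, i.e. the off-diagonal entries of S = DD^T."""
    s = Counter()
    for term, doc_ids in postings.items():
        # every pair of documents sharing this term gains one similarity unit
        for i, j in combinations(sorted(doc_ids), 2):
            s[(i, j)] += 1
    return s

Because only low-frequency terms are used, each posting list is short (fewer than f entries), so the quadratic inner loop stays cheap; pairs with s_ij above the thresholds given above can then be reported as similar.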
1.2.2 Co-occurrence for Words using D^T D
Using the matrix T = D^T D instead of S = DD^T, we
count the co-occurrences of pairs of terms. In
co-occurrence analysis (e.g. Krenn and Evert,
2001), there is usually an additional significance
measure that translates co-occurrence counts into
significance values. But in our case of low-frequency
words (more precisely: in the case of similar
frequencies for all terms), there is no need for this
significance measure.
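The same counting scheme with the roles of terms and documents exchanged gives the entries of T; a sketch, again assuming documents restricted to the low-frequency vocabulary:

from collections import Counter
from itertools import combinations

def term_cooccurrences(docs):
    """docs: iterable of sets of (low-frequency) terms, one set per document.
    Returns t with t[(a, b)] = number of documents containing both a and b,
    i.e. the off-diagonal entries of T = D^T D."""
    t = Counter()
    for terms in docs:
        for a, b in combinations(sorted(terms), 2):
            t[(a, b)] += 1
    return t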
From a more semantic point of view, repeated
co-occurrence of two words is known to show a
strong semantic association (Heyer et al., 2001). The
type of this association is not limited to similarity
(or, more strongly, synonymy); in fact, we will find
any semantic relation. Thresholds similar to those
above apply. For example, the co-occurring terms of
the word
Dresden (ordered by significance) are: Leipzig,
Chemnitz, Erfurt, …, Frauenkirche, München,
Technischen Universität, Hamburg, Rostock,
Magdeburg, …, Staatlichen Kunstsammlungen, …,
Semperoper, …, Sächsische Schweiz, …
These related terms are other cities near Dresden
and local organizations or tourist attractions.
1.2.3 Co-occurrence of Hyperlinks
In this section, we use the in-links as attributes for
documents. Two documents are then similar if many
sources link to both of them.
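In this view, the role of a "term" is played by a linking source: a sketch (names hypothetical) that builds the corresponding postings so that the pair-counting routine from Section 1.2.1 applies unchanged:

from collections import defaultdict

def inlink_postings(links):
    """links: iterable of (source_url, target_url) pairs. Returns a dict
    mapping each source to the set of targets it links to; feeding this to
    pairwise_similarities() above counts, for each pair of target pages,
    the number of sources linking to both."""
    postings = defaultdict(set)
    for source, target in links:
        postings[source].add(target)
    return postings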
For technical reasons, we again use co-occurrence
statistics to calculate these similarities. The URLs