Another approach is that of (Beeferman and Berger, 2000), which uses click-through data to form a bipartite graph of queries and documents. However, it does not take the content features of either queries or documents into account, which leads to ineffective clustering.
So far, several Web search results clustering
systems have been implemented. We mention four of them here. Firstly, (Cutting et al., 1992) created the Scatter/Gather system to cluster Web search results. The system is based on two clustering algorithms: Buckshot, which is fast and suited to online clustering, and Fractionation, which is more accurate and used for the initial offline clustering of the entire set. This system has
some limitations due to the shortcomings of the
traditional heuristic clustering algorithms (e.g. k-
means) they used. Secondly, (Zamir and Etzioni,
1998) proposed an algorithm named Suffix Tree
Clustering (STC) to automatically group Web search
results. STC operates on query result snippets and
clusters together documents with large common
subphrases. The algorithm first generates a suffix
tree where each internal node corresponds to a
phrase, and then clusters are formed by grouping the
Web search results that contain the same “key”
phrase. Afterwards, highly overlapping clusters are
merged. Thirdly, (Stefanowski and Weiss, 2003)
developed Carrot2, an open source search results clustering engine. Carrot2 can automatically organize documents (e.g. search results) into thematic categories. Apart from two specialized document clustering algorithms (Lingo and STC), Carrot2 provides integrated components for fetching search results from various sources, including YahooAPI, GoogleAPI, MSN Live API, eTools Meta Search, Lucene, SOLR, Google Desktop and
more. Finally, (Zhang and Dong, 2004) proposed a
semantic, hierarchical, online clustering approach
named SHOC, in order to automatically group Web
search results. Their work is an extension of O.
Zamir and O. Etzioni's work. By combining the
power of two novel techniques, key phrase
discovery and orthogonal clustering, SHOC can
generate suggestive clusters. Moreover, SHOC works for multiple languages: English as well as Asian languages such as Chinese.
3 THEORETICAL FOUNDATION
Clustering is one of the most useful tasks in the data
mining process for discovering groups and
identifying interesting distributions and patterns in
the underlying data. The clustering problem is about
partitioning a given data set into groups (clusters), so
that the data points in a cluster are more similar to
each other than to points in other clusters. The
relationship between objects is represented in a
Proximity Matrix (PM), in which rows and columns
correspond to objects. This idea is applicable in
many fields, such as life sciences, medical sciences,
engineering or e-learning.
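To make the Proximity Matrix concrete, the following minimal sketch (in Python with NumPy and SciPy, chosen here purely for illustration; the data are invented for the example) builds a PM of pairwise Euclidean distances for four toy objects:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four toy objects described by two numeric features each
# (invented data, for illustration only).
objects = np.array([[1.0, 2.0],
                    [1.1, 1.9],
                    [8.0, 8.5],
                    [7.9, 8.4]])

# Proximity Matrix (PM): rows and columns correspond to objects,
# and each entry holds a pairwise (here Euclidean) distance.
pm = squareform(pdist(objects, metric="euclidean"))
print(np.round(pm, 2))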
3.1 Clustering by Compression
In 2004, Rudi Cilibrasi and Paul Vitanyi proposed a
new method for clustering based on compression
algorithms (Cilibrasi and Vitanyi, 2005). The
method works as follows. First, it determines a
parameter-free, universal, similarity distance, the
normalized compression distance or NCD, computed
from the lengths of compressed data files (singly and
in pair-wise concatenation). Second, it applies a
clustering method.
The method is based on the fact that compression algorithms provide a good estimate of the actual amount of information contained in the data to be clustered, without requiring any prior processing.
The definition of the normalized compression
distance is the following: if x and y are the two
objects concerned, and C(x) and C(y) are the lengths
of the compressed versions of x and y using
compressor C, then the NCD is defined as:
NCD(x, y) = [C(xy) - min{C(x), C(y)}] / max{C(x), C(y)}    (1)
where C(xy) denotes the length of the compressed concatenation of x and y.
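As an illustration of formula (1), the following sketch approximates the NCD of two byte strings using a general-purpose compressor (zlib); the choice of zlib and the helper name ncd are assumptions made for this example, not part of the original method's implementation:

import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance approximated with zlib.

    C(.) is taken as the length in bytes of the zlib-compressed input,
    and C(xy) as the compressed length of the concatenation x + y.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar inputs should yield a distance closer to 0 than dissimilar ones.
s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox jumps over the lazy cat " * 20
s3 = b"an entirely different text about something else " * 20
print(ncd(s1, s2))  # relatively small
print(ncd(s1, s3))  # relatively large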
The most important advantage of the NCD over
classic distance metrics is its ability to cluster a large
number of data samples, due to the high
performance of the compression algorithms.
The NCD is not restricted to a specific
application. To extract a hierarchy of clusters from
the distance matrix, a dendrogram (a ternary tree) is built using a clustering algorithm.
Evidence of successful application has been reported
in areas such as genomics, virology, languages,
literature, music, handwritten digits, and astronomy.
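For example, assuming a pairwise NCD matrix has been computed (here with the zlib-based ncd helper sketched above), a hierarchy of clusters can be extracted with a standard agglomerative algorithm from SciPy. Note that Cilibrasi and Vitanyi build their ternary tree with a quartet-based heuristic, so the average-linkage clustering below is only an approximation of the idea, with toy data invented for the example:

import zlib
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ncd(x: bytes, y: bytes) -> float:
    # zlib-based NCD, as in the sketch above
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    return (len(zlib.compress(x + y)) - min(cx, cy)) / max(cx, cy)

# Toy "snippets"; in the Web search setting these would be result snippets.
docs = [b"apple banana fruit salad recipe",
        b"banana apple smoothie recipe",
        b"stock market prices fall sharply",
        b"stock market indexes and share prices"]

# Symmetric NCD distance matrix with a zero diagonal.
n = len(docs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = ncd(docs[i], docs[j])

# Average-linkage agglomerative clustering builds the hierarchy (dendrogram);
# cutting it into two clusters should roughly separate the two topics.
tree = linkage(squareform(dist, checks=False), method="average")
print(fcluster(tree, t=2, criterion="maxclust"))  # e.g. [1 1 2 2]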
The quality of the NCD depends on the performance of the underlying compression algorithm.
3.2 Classification Methods
The goal of classification methods is to group elements that share the same information. Classification methods can be divided into the following three categories: distance methods, character methods and quadruplet methods.
In order to achieve the objective of this work,
only the methods of distance and the methods of