repetitively when trying to guess an optimal cluster.
In contrast to the original DOC, we use a hash
structure to store previously calculated dimension
qualities. This saves a costly data scan for each
cluster quality computation. The hash is indexed by
the dimension combinations used.
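A minimal sketch of such a cache follows, assuming documents are represented as sets of terms; scan_quality is a hypothetical stand-in for the real quality function and is not part of the original implementation.

# A minimal sketch of the dimension-quality cache; scan_quality is a
# hypothetical placeholder for the costly computation that scans the data.
from typing import Dict, FrozenSet, Iterable, List, Set

DimensionSet = FrozenSet[str]                 # one combination of dimensions (phrases, words)
_quality_cache: Dict[DimensionSet, float] = {}

def scan_quality(dims: DimensionSet, data: List[Set[str]]) -> float:
    """Placeholder quality: support of the dimension combination, i.e. how
    many documents contain every dimension in `dims` (one full data scan)."""
    return float(sum(1 for doc in data if dims <= doc))

def cached_quality(dims: Iterable[str], data: List[Set[str]]) -> float:
    """Return the quality of a dimension combination, scanning the data
    only the first time this combination is requested."""
    key = frozenset(dims)                     # the cache is indexed by the combination used
    if key not in _quality_cache:
        _quality_cache[key] = scan_quality(key, data)
    return _quality_cache[key]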
For web snippets, dimension caching can save over
50% of the data scans, and the saving grows if the
DocDOC procedure is called multiple times on the
same data (e.g. to increase the probability of
finding an optimal cluster). A further advantage of
the dimension cache is that, when sorted, it yields a
ranked list of overlapping clusters identical to the
STC base clusters, except that 100% overlapping
clusters are merged. This is very important, since it
makes DocDOC identical to STC with respect to the
result achieved (when identical postprocessing, i.e.
hierarchization, is applied).
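Continuing the sketch above, the ranked list can be read directly off the cache once it is sorted by quality; entries covering exactly the same documents (100% overlap) are merged by taking the union of their dimensions. This is only an illustration of the idea, not the paper's implementation.

from typing import Dict, FrozenSet, List, Set, Tuple

DimensionSet = FrozenSet[str]

def ranked_clusters(cache: Dict[DimensionSet, float],
                    data: List[Set[str]]) -> List[Tuple[FrozenSet[int], DimensionSet, float]]:
    """Sort the dimension cache by quality, merging entries that cover
    exactly the same documents (100% overlap)."""
    merged: Dict[FrozenSet[int], Tuple[DimensionSet, float]] = {}
    for dims, quality in cache.items():
        covered = frozenset(i for i, doc in enumerate(data) if dims <= doc)
        if covered in merged:
            prev_dims, prev_quality = merged[covered]
            merged[covered] = (prev_dims | dims, max(prev_quality, quality))
        else:
            merged[covered] = (dims, quality)
    # ranked list of (covered documents, dimensions, quality), best first
    return sorted(((docs, dims, q) for docs, (dims, q) in merged.items()),
                  key=lambda item: item[2], reverse=True)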
3.5 Usage and Output
The DOC procedure returns one cluster at a time.
There are three possibilities for how to use it:
greedy – each found cluster is removed from the
collection. This relies heavily on the cluster quality
estimate given by the quality function, so valuable
information can be lost; the procedure must be run
several times to increase the guaranteed 50%+
probability of returning an optimal cluster
(empirically, the algorithm returns good clusters
every time, but may destroy the optimal one). It
runs while there is enough data left (with a
percentage of the collection size as the threshold),
and its running time may vary substantially.
overlapping – every found cluster is remembered
in a list sorted by quality and the procedure is run
a few times. This improves result stability and
speed, loses no information, and the resulting list
can be treated like merged STC base clusters, that
is, merged and rescored until no clusters overlap
sufficiently (a sketch of both modes follows this list).
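A schematic comparison of the two modes follows; run_doc stands in for one invocation of the randomized DOC/DocDOC procedure and is passed in as a parameter, and the Cluster record is an assumption of this illustration.

# Sketch of the two usage modes under the assumptions stated above.
from typing import Callable, List, NamedTuple, Set

class Cluster(NamedTuple):
    documents: Set[int]      # indices of the covered documents
    dimensions: Set[str]     # relevant dimensions (phrases, words)
    quality: float

def greedy_mode(data: List[set], run_doc: Callable[[List[set]], Cluster],
                min_fraction: float = 0.05, runs_per_cluster: int = 5) -> List[Cluster]:
    """Take the best cluster of several runs, remove its documents and
    repeat while enough data remains (a percentage of the collection size)."""
    remaining = list(data)
    clusters: List[Cluster] = []
    threshold = max(1, int(min_fraction * len(data)))
    while len(remaining) >= threshold:
        best = max((run_doc(remaining) for _ in range(runs_per_cluster)),
                   key=lambda c: c.quality)
        if not best.documents:
            break
        clusters.append(best)
        remaining = [doc for i, doc in enumerate(remaining) if i not in best.documents]
    return clusters

def overlapping_mode(data: List[set], run_doc: Callable[[List[set]], Cluster],
                     runs: int = 20) -> List[Cluster]:
    """Keep every cluster found over a few runs in a list sorted by quality;
    merging and rescoring then proceeds as with STC base clusters."""
    return sorted((run_doc(data) for _ in range(runs)),
                  key=lambda c: c.quality, reverse=True)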
The final task is creating cluster labels from the
(C, D) projective clusters. This is done by choosing
all or the best-ranking dimensions (phrases, words)
from D. As in HSTC, the clusters can be
topologically sorted to obtain a cluster hierarchy.
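Label creation from a (C, D) cluster can be as simple as taking the top-weighted dimensions; a small illustration with hypothetical weights follows.

from typing import Dict, Iterable

def cluster_label(dimensions: Iterable[str], weights: Dict[str, float],
                  top_k: int = 3) -> str:
    """Build a label from the best-ranking dimensions (phrases, words) of D."""
    best = sorted(dimensions, key=lambda d: weights.get(d, 0.0), reverse=True)[:top_k]
    return ", ".join(best)

# Hypothetical example:
# cluster_label({"web server", "apache", "http"},
#               {"web server": 2.5, "apache": 1.8, "http": 0.9})
# -> "web server, apache, http"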
4 COMPARISON OF ALGORITHMS
We have made the following observations:
The clusters found by DOC are exactly those
generated by STC, merged in the case of 100%
overlap.
Suffix arrays are five times more memory-efficient
than suffix trees.
DocDOC has a potential for more flexible
ranking than STC.
The quality function introduced in DocDOC
defines a different ordering of clusters than STC.
DocDOC is better parallelizable and more scalable.
DocDOC has a smaller memory footprint than
Lingo and possibly STC.
5 CONCLUSIONS
We proposed and implemented an improved version
of the DOC algorithm used on Web snippets in our
experiments. We have shown that it has a number of
better properties than other algorithms of this
category.
Since discovering knowledge from and about the
Web is one of the basic abilities of an intelligent
agent, the algorithm is applicable, for example, in
semantic Web services.
Although named DocDOC, the algorithm is usable
far beyond texts. If used on vector data (as opposed
to the Boolean model, but keeping the dimension
weight information), applications across various
disciplines (e.g. medicine, data mining) may benefit
from it.
ACKNOWLEDGEMENTS
This research was supported in part by GACR grant
201/09/0990.