SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE

Chris Biemann, Uwe Quasthoff

2007

Abstract

This paper presents a unified framework for clustering documents based on vocabulary overlap and in-link similarity. A small number of non-zero attributes per document, taken from a very large set of possible attributes, ensures efficient comparison procedures. We show that (A) low-frequency words are excellent attributes for textual documents and (B) sources of in-links are excellent attributes for web documents. In the case of web documents, co-occurrence analysis is used to identify similarity. The documents are represented as nodes in a graph with edges weighted by similarity. A graph clustering algorithm is applied to group similar documents together. An evaluation for textual documents against a gold standard is provided.
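The pipeline outlined in the abstract (sparse attributes, a similarity graph, graph clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the function names, the `max_doc_freq` and `min_shared` thresholds, the use of raw shared-attribute counts as edge weights, and the simple deterministic label-propagation pass that stands in for the unspecified graph clustering algorithm. In particular, the sketch does not apply the co-occurrence analysis mentioned for web documents.

```python
from collections import Counter, defaultdict
from itertools import combinations


def low_frequency_attributes(docs, max_doc_freq=2):
    """Keep, per document, only words found in at most max_doc_freq documents.

    max_doc_freq is a hypothetical threshold chosen for this toy example.
    """
    doc_freq = Counter(word for words in docs.values() for word in set(words))
    return {
        doc_id: {w for w in set(words) if doc_freq[w] <= max_doc_freq}
        for doc_id, words in docs.items()
    }


def similarity_graph(attributes, min_shared=1):
    """Weight the edge between two documents by their number of shared attributes."""
    # Inverted index: only documents that share an attribute are ever compared,
    # which is what keeps the comparison cheap when attributes are sparse.
    index = defaultdict(set)
    for doc_id, attrs in attributes.items():
        for attr in attrs:
            index[attr].add(doc_id)

    weights = Counter()
    for doc_ids in index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            weights[(a, b)] += 1

    graph = {doc_id: {} for doc_id in attributes}
    for (a, b), weight in weights.items():
        if weight >= min_shared:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def propagate_labels(graph, iterations=10):
    """Deterministic label propagation (an illustrative stand-in clustering step).

    Each node adopts the label with the highest total edge weight among its
    neighbours; ties are broken by the smallest label.
    """
    labels = {node: node for node in graph}
    for _ in range(iterations):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated documents keep their own label
            scores = Counter()
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            best = min(scores, key=lambda label: (-scores[label], label))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


if __name__ == "__main__":
    toy_docs = {
        "d1": "the graph clustering algorithm groups the documents".split(),
        "d2": "the clustering of the graph nodes".split(),
        "d3": "the recipe uses the butter and flour".split(),
        "d4": "the flour and butter cake".split(),
    }
    attrs = low_frequency_attributes(toy_docs)
    print(propagate_labels(similarity_graph(attrs)))
```

The same two graph functions would accept sets of in-link sources instead of word sets, since they only assume that each document maps to a small set of attributes. The inverted index is the design choice that matters: documents are compared only when they share at least one attribute, so the cost depends on attribute sparsity rather than on the number of document pairs.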



Paper Citation


in Harvard Style

Biemann C. and Quasthoff U. (2007). SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE. In Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-972-8865-78-8, pages 130-135. DOI: 10.5220/0001260601300135


in Bibtex Style

@conference{webist07,
author={Chris Biemann and Uwe Quasthoff},
title={SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE},
booktitle={Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2007},
pages={130-135},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001260601300135},
isbn={978-972-8865-78-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE
SN - 978-972-8865-78-8
AU - Biemann C.
AU - Quasthoff U.
PY - 2007
SP - 130
EP - 135
DO - 10.5220/0001260601300135