Authors: Chris Biemann and Uwe Quasthoff

Affiliation: University of Leipzig, Germany

Keyword(s): Document clustering, graph clustering, link similarity, content similarity.

Related Ontology Subjects/Areas/Topics: Accessibility Issues and Technology; Artificial Intelligence; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Portal Strategies; Searching and Browsing; Soft Computing; Symbolic Systems; Web Information Systems and Technologies; Web Interfaces and Applications; Web Mining

Abstract: In this paper, a unified framework for clustering documents based on vocabulary overlap and in-link similarity is presented. A small number of non-zero attributes per document, taken from a very large set of possible attributes, ensures efficient comparison procedures. We show that A) low-frequency words are excellent attributes for textual documents and B) sources of in-links are excellent attributes for web documents. In the case of web documents, co-occurrence analysis is used to identify similarity. The documents are represented as nodes in a graph with edges weighted by similarity. A graph clustering algorithm is applied to group similar documents together. An evaluation for textual documents against a gold standard is provided.
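
The pipeline outlined in the abstract (sparse attributes per document, a similarity-weighted graph, then graph clustering) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names, the `max_doc_freq` cutoff used to pick low-frequency words, the use of the shared-attribute count as edge weight, and the toy documents are all hypothetical. The abstract does not name the graph clustering algorithm, so connected components over sufficiently heavy edges are used here purely as a simple stand-in.

```python
from collections import Counter, defaultdict
from itertools import combinations


def rare_word_attributes(docs, max_doc_freq=2):
    """Return, per document, the set of words found in at most max_doc_freq documents.

    Assumption: "low-frequency" is approximated by a document-frequency cutoff.
    """
    tokenized = [set(text.lower().split()) for text in docs]
    doc_freq = Counter(word for words in tokenized for word in words)
    return [{w for w in words if doc_freq[w] <= max_doc_freq} for words in tokenized]


def similarity_graph(attributes):
    """Build edge weights: (i, j) -> number of attributes documents i and j share.

    An inverted index means only documents sharing an attribute are compared,
    which is cheap when each document has few attributes.
    """
    index = defaultdict(list)  # attribute -> ids of documents carrying it
    for doc_id, attrs in enumerate(attributes):
        for attr in attrs:
            index[attr].append(doc_id)
    weights = Counter()
    for doc_ids in index.values():
        for i, j in combinations(sorted(doc_ids), 2):
            weights[(i, j)] += 1
    return weights


def cluster(n_docs, weights, min_weight=1):
    """Group documents by connected components over edges with weight >= min_weight.

    This is a stand-in for the graph clustering step, not the paper's algorithm.
    """
    parent = list(range(n_docs))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), weight in weights.items():
        if weight >= min_weight:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for doc_id in range(n_docs):
        groups[find(doc_id)].append(doc_id)
    return sorted(groups.values())


# Toy collection (hypothetical data): two documents on graphs, two on baking.
docs = [
    "the graph clustering algorithm partitions nodes",
    "the clustering of nodes in the graph",
    "the recipe needs flour and butter",
    "the butter and flour recipe",
]
attributes = rare_word_attributes(docs, max_doc_freq=2)
graph = similarity_graph(attributes)
clusters = cluster(len(docs), graph, min_weight=2)
print(clusters)
```

In this toy collection the word "the" occurs in every document and is therefore dropped as an attribute, so only the rarer content words create edges. Raising `min_weight` demands more shared attributes before two documents are linked.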

CC BY-NC-ND 4.0

Paper citation in several formats:
Biemann, C. and Quasthoff, U. (2007). SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE. In Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 1: WEBIST; ISBN 978-972-8865-78-8; ISSN 2184-3252, SciTePress, pages 130-135. DOI: 10.5220/0001260601300135

@conference{webist07,
author={Chris Biemann and Uwe Quasthoff},
title={SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE},
booktitle={Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2007},
pages={130-135},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001260601300135},
isbn={978-972-8865-78-8},
issn={2184-3252},
}

TY - CONF

JO - Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE
SN - 978-972-8865-78-8
IS - 2184-3252
AU - Biemann, C.
AU - Quasthoff, U.
PY - 2007
SP - 130
EP - 135
DO - 10.5220/0001260601300135
PB - SciTePress