UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING

Shashank Paliwal; Vikram Pudi

Topics: Clustering and Classification Methods; Mining Text and Semi-Structured Data

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, 529-536, 2011, Paris, France

Authors: Shashank Paliwal and Vikram Pudi

Affiliation: International Institute of Information Technology Hyderabad, India

Keyword(s): Text document clustering, Document similarity, Term proximity, Term dependency, Feature weighting.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Clustering and Classification Methods; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Mining Text and Semi-Structured Data; Symbolic Systems

Abstract: Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents with the simple Bag-of-Words (BOW) model, which assumes that the terms of a document are independent of each other. Such single-term analysis completely ignores the underlying (semantic) structure of a document. Considerable effort in the literature has gone into enriching the BOW representation with phrases and n-grams such as bi-grams and tri-grams. These approaches capture dependency only between adjacent terms or within a continuous sequence of terms. However, while some dependencies exist between adjacent words, others span a greater distance. In this paper, we enrich the traditional document vector with term-pair features. A term-pair feature is a pair of terms from the same document that may be adjacent to each other or distant. We investigate the process of term-pair selection and propose a methodology for selecting potential term-pairs from a given document. Utilizing term proximity between distant terms also gives two documents some flexibility to be judged similar when they cover similar topics in different writing styles. Experimental results on a standard web document data set show that adding term-pair features substantially improves clustering performance.
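To make the term-pair idea in the abstract concrete, here is a minimal Python sketch, not the authors' method. The abstract does not say how term-pairs are selected or weighted, so the fixed `window` cut-off, the raw-count weighting, the whitespace tokenizer, and the function names `tokenize`, `term_pair_features`, `enriched_vector` and `cosine` are all assumptions made for illustration. It shows a Bag-of-Words vector extended with pairs of terms that co-occur within a few positions, and a cosine similarity computed over the combined features.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def term_pair_features(tokens, window=3):
    """Count unordered pairs of distinct terms at most `window` positions apart.

    Assumption: a fixed window is used here only for illustration; the paper
    proposes its own term-pair selection methodology.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look at the next `window` tokens, so adjacent and distant pairs count.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def enriched_vector(text, window=3):
    """Bag-of-Words counts (string keys) plus term-pair counts (tuple keys)."""
    tokens = tokenize(text)
    features = Counter(tokens)
    features.update(term_pair_features(tokens, window))
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = sqrt(sum(weight * weight for weight in a.values()))
    norm_b = sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = "text document clustering groups similar documents"
    doc_b = "clustering of text groups documents that are similar"
    print(cosine(enriched_vector(doc_a), enriched_vector(doc_b)))
```

Because pairs are stored unordered and may span intervening words, two documents that mention the same terms close together in a different order can still share pair features, which is the flexibility across writing styles that the abstract describes. A bi-gram model would miss such matches.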

CC BY-NC-ND 4.0

Paper citation in several formats:

Paliwal, S. and Pudi, V. (2011). UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2011) - SSTM; ISBN 978-989-8425-79-9; ISSN 2184-3228, SciTePress, pages 529-536. DOI: 10.5220/0003645805370544

@conference{sstm11,
author={Shashank Paliwal and Vikram Pudi},
title={UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2011) - SSTM},
year={2011},
pages={529-536},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003645805370544},
isbn={978-989-8425-79-9},
issn={2184-3228},
}

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2011) - SSTM
TI - UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING
SN - 978-989-8425-79-9
IS - 2184-3228
AU - Paliwal, S.
AU - Pudi, V.
PY - 2011
SP - 529
EP - 536
DO - 10.5220/0003645805370544
PB - SciTePress