Authors: David Pereira Coutinho 1 and Mário A. T. Figueiredo 2

Affiliations: 1 Depart. de Engenharia de Electrónica e Telecomunicações e de Computadores, Instituto Superior de Engenharia de Lisboa, Portugal ; 2 Instituto de Telecomunicações, Instituto Superior Técnico, Portugal

Abstract: Most approaches to text classification rely on some measure of (dis)similarity between sequences of symbols. Information theoretic measures have the advantage of making very few assumptions about the models that are considered to have generated the sequences, and have been the focus of recent interest. This paper compares the use of the Ziv-Merhav method (ZMM) and the Cai-Kulkarni-Verdú method (CKVM) for the estimation of relative entropy (or Kullback-Leibler divergence) from sequences of symbols when used as a tool for text classification. We briefly describe our implementation of the ZMM, based on a modified version of the Lempel-Ziv algorithm (LZ77), and also the CKVM implementation, which is based on the Burrows-Wheeler block sorting transform (BWT). Assessing the accuracy of both the ZMM and CKVM on synthetic Markov sequences shows that CKVM yields better estimates of the Kullback-Leibler divergence. Finally, we apply both methods to a text classification problem (more specifically, authorship attribution); surprisingly, CKVM performs poorly, while ZMM outperforms a previously proposed (also information theoretic) method.
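To make the idea behind the ZMM concrete, the following is a minimal Python sketch of a cross-parsing relative entropy estimate. It is not the authors' implementation: the abstract says theirs is built on a modified LZ77, whereas this sketch assumes the textbook Ziv-Merhav form, (1/n)[c(z|x) log2 n - c(z) log2 c(z)], with an LZ78-style incremental self-parsing for c(z). The function names `cross_parse_count`, `lz78_phrase_count`, and `zm_divergence` are hypothetical, and the substring search is deliberately naive.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count phrases when z is split into the longest substrings found in x.

    A symbol of z that never occurs in x still forms a one-symbol phrase.
    """
    count = 0
    i = 0
    n = len(z)
    while i < n:
        length = 1
        # Extend the phrase while the longer candidate still occurs in x.
        while i + length < n and z[i:i + length + 1] in x:
            length += 1
        i += length
        count += 1
    return count


def lz78_phrase_count(z: str) -> int:
    """Count phrases of the LZ78-style incremental self-parsing of z."""
    phrases = set()
    current = ""
    count = 0
    for symbol in z:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    if current:
        count += 1  # trailing phrase that repeats an earlier one
    return count


def zm_divergence(z: str, x: str) -> float:
    """Assumed Ziv-Merhav-style estimate of D(z || x), in bits per symbol."""
    n = len(z)
    c_cross = cross_parse_count(z, x)
    c_self = lz78_phrase_count(z)
    return (c_cross * math.log2(n) - c_self * math.log2(c_self)) / n


if __name__ == "__main__":
    z = "ab" * 50
    print(zm_divergence(z, "ab" * 50))  # reference from the same source
    print(zm_divergence(z, "cd" * 50))  # reference with a disjoint alphabet
```

For classification, one would assign a test text to the class whose reference text gives the smallest estimate. On short or highly repetitive inputs the value can come out negative, so it is better read as a relative score than as an exact divergence.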

CC BY-NC-ND 4.0

Paper citation in several formats:
Pereira Coutinho, D. and Figueiredo, M. A. T. (2008). Information Theoretic Text Classification Methods Evaluation. In Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems (ICEIS 2008) - PRIS; ISBN 978-989-8111-42-5, SciTePress, pages 77-85. DOI: 10.5220/0001740200770085

@conference{pris08,
author={David {Pereira Coutinho} and Mário A. T. Figueiredo},
title={Information Theoretic Text Classification Methods Evaluation},
booktitle={Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems (ICEIS 2008) - PRIS},
year={2008},
pages={77-85},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001740200770085},
isbn={978-989-8111-42-5},
}

TY - CONF

JO - Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems (ICEIS 2008) - PRIS
TI - Information Theoretic Text Classification Methods Evaluation
SN - 978-989-8111-42-5
AU - Pereira Coutinho, D.
AU - Figueiredo, M. A. T.
PY - 2008
SP - 77
EP - 85
DO - 10.5220/0001740200770085
PB - SciTePress