Authors: Anirban Chakraborty 1 ; Kripabandhu Ghosh 1 and Utpal Roy 2

Affiliations: 1 Indian Statistical Institute, India ; 2 Visva-Bharati, India

Keyword(s): Erroneous Text, Cooccurrence, Pointwise Mutual Information.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Clustering and Classification Methods ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Symbolic Systems

Abstract: OCR errors substantially degrade retrieval performance. Prior research has addressed the modelling and correction of OCR errors, but most existing systems rely on language-dependent resources or training texts to study the nature of the errors. Little research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose an algorithm for detecting OCR errors and improving retrieval performance on the erroneous corpus. We present two versions of the algorithm: one based on word cooccurrence and the other based on Pointwise Mutual Information. Our algorithm uses neither training data nor language-specific resources such as a thesaurus. It also uses no knowledge about the language other than that the word delimiter is a blank space. We tested our algorithm on the erroneous Bangla FIRE collection and obtained significant improvements.
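
The abstract names two word-association measures without defining them. As a rough illustration only, the sketch below computes document-level cooccurrence counts and Pointwise Mutual Information, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))), for word pairs, using whitespace as the only tokenization rule. The function name `cooccurrence_and_pmi`, the choice of the whole document as the cooccurrence window, and the toy corpus are assumptions made for this example; this is not the paper's error-detection algorithm.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_and_pmi(documents):
    """Return (cooccurrence counts, PMI scores) for word pairs.

    Hypothetical helper: two words "cooccur" if they appear in the same
    document. Probabilities are estimated as document frequencies.
    """
    n_docs = len(documents)
    doc_freq = Counter()   # number of documents containing each word
    pair_freq = Counter()  # number of documents containing both words

    for text in documents:
        # Only language assumption: words are separated by whitespace.
        words = sorted(set(text.split()))
        doc_freq.update(words)
        pair_freq.update(combinations(words, 2))

    pmi = {}
    for (w1, w2), joint in pair_freq.items():
        # log2( (joint / N) / ((df1 / N) * (df2 / N)) )
        pmi[(w1, w2)] = math.log2(joint * n_docs / (doc_freq[w1] * doc_freq[w2]))
    return pair_freq, pmi


# Toy corpus (made up for illustration).
docs = [
    "river bank water",
    "river bank flood",
    "bank loan money",
    "loan money interest",
]
counts, scores = cooccurrence_and_pmi(docs)
```

Pairs are stored as alphabetically sorted tuples, so `counts[("bank", "river")]` is the lookup for the pair regardless of word order in the text. Raw cooccurrence favours frequent words, whereas PMI normalizes by each word's own frequency, which is one plausible reason to compare the two measures.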

CC BY-NC-ND 4.0

Paper citation in several formats:
Chakraborty, A.; Ghosh, K. and Roy, U. (2014). A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2014) - KDIR; ISBN 978-989-758-048-2; ISSN 2184-3228, SciTePress, pages 450-456. DOI: 10.5220/0005157304500456

@conference{kdir14,
author={Anirban Chakraborty and Kripabandhu Ghosh and Utpal Roy},
title={A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2014) - KDIR},
year={2014},
pages={450-456},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005157304500456},
isbn={978-989-758-048-2},
issn={2184-3228},
}

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2014) - KDIR
TI - A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text
SN - 978-989-758-048-2
IS - 2184-3228
AU - Chakraborty, A.
AU - Ghosh, K.
AU - Roy, U.
PY - 2014
SP - 450
EP - 456
DO - 10.5220/0005157304500456
PB - SciTePress