Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images

Andrei Puha, Octavian Rinciog, Vlad Posea

2018

Abstract

Open data published by public institutions are among the most important resources available online. Using this public information, decision makers can improve the lives of citizens. Unfortunately, these open data are most often published as files, some of which, such as scanned PDF files, are not easily processable. In this paper we present an algorithm that enhances currently available knowledge by efficiently extracting tabular data from scanned PDF documents. The proposed workflow consists of several distinct steps: first, the PDF documents are converted into images; then the images are preprocessed using specific processing techniques. The final steps involve running an adaptive binarization of the images, recognizing the structure of the tables, applying Optical Character Recognition (OCR) to each cell of the detected tables, and exporting them as CSV. Tests on several low-quality scanned PDF documents showed that our methodology performs comparably to dedicated paid OCR software. We have integrated this algorithm as a service in our platform that converts open data into Linked Open Data.
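
The later stages of the workflow described above (adaptive binarization, table structure recognition, per-cell OCR, CSV export) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a page already rendered to a grayscale grid (a list of rows of 0-255 values), uses a simple local-mean threshold as the adaptive binarization, and detects ruling lines by counting ink pixels per row and column. All function names, the `window`, `offset` and `min_fill` parameters, and the `ocr_cell` callback (a stand-in for a real OCR engine) are hypothetical.

```python
import csv
import io


def adaptive_binarize(gray, window=3, offset=10):
    """Local-mean threshold: a pixel is ink (1) if it is darker than the
    mean of its window minus `offset` (assumed parameters)."""
    h, w = len(gray), len(gray[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [gray[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            out[y][x] = 1 if gray[y][x] < mean - offset else 0
    return out


def ruling_lines(binary, min_fill=0.9):
    """Rows and columns that are almost entirely ink are treated as table rules."""
    h, w = len(binary), len(binary[0])
    rows = [y for y in range(h) if sum(binary[y]) >= min_fill * w]
    cols = [x for x in range(w)
            if sum(binary[y][x] for y in range(h)) >= min_fill * h]
    return rows, cols


def extract_table(binary, ocr_cell):
    """Crop each cell between consecutive rules and pass it to `ocr_cell`,
    a hypothetical callable standing in for an OCR engine."""
    rows, cols = ruling_lines(binary)
    table = []
    for top, bottom in zip(rows, rows[1:]):
        if bottom - top < 2:  # adjacent rule pixels, no cell in between
            continue
        record = []
        for left, right in zip(cols, cols[1:]):
            if right - left < 2:
                continue
            crop = [line[left + 1:right] for line in binary[top + 1:bottom]]
            record.append(ocr_cell(crop))
        table.append(record)
    return table


def to_csv(table):
    """Serialize the recognized cells as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()
```

Running OCR per cell rather than on the whole page keeps each recognized string tied to a known row and column, so the CSV layout follows directly from the detected grid. The PDF-to-image conversion and preprocessing steps are outside the scope of this sketch.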


Paper Citation


in Harvard Style

Rinciog O. and Posea V. (2018). Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images. In Proceedings of the 7th International Conference on Data Science, Technology and Applications - Volume 1: DATA, ISBN 978-989-758-318-6, pages 220-228. DOI: 10.5220/0006862402200228


in Bibtex Style

@conference{data18,
author={Octavian Rinciog and Vlad Posea},
title={Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images},
booktitle={Proceedings of the 7th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2018},
pages={220-228},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006862402200228},
isbn={978-989-758-318-6},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 7th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images
SN - 978-989-758-318-6
AU - Rinciog O.
AU - Posea V.
PY - 2018
SP - 220
EP - 228
DO - 10.5220/0006862402200228