Extracting Body Text from Academic PDF Documents for Text Mining

Changfeng Yu; Cheng Zhang; Jie Wang

Topics: Information Extraction; Mining Text and Semi-Structured Data

In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - KDIR, pages 235-242, 2020

Authors: Changfeng Yu, Cheng Zhang, and Jie Wang

Affiliation: Department of Computer Science, University of Massachusetts, Lowell, MA, U.S.A.

Keyword(s): Body-text Extraction, HTML Replication of PDF, Line Sweeping, Backward Traversal.

Abstract: Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications for deeper semantic understanding. The objective is to extract the complete sentences of the body text into a txt file while preserving the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents often mix body and nonbody text. We devise and implement a system called PDFBoT that detects multiple-column layouts using a line-sweeping technique, removes nonbody text using computed text features and syntactic tagging in backward traversal, and aligns the remaining text back into sentences and paragraphs. We show that PDFBoT is highly accurate, with average F1 scores of 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text in tables, figures, and charts, over a corpus of PDF documents randomly selected from arXiv.org across multiple academic disciplines.
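
The abstract names line sweeping as the way multiple-column layouts are detected but gives no details. As a rough illustration only, and not PDFBoT's actual implementation, the sketch below sweeps a vertical line across the text area and reports the horizontal intervals that cross no text line; each such interval is a candidate column gutter. The function name `find_column_gaps`, the `(x_left, x_right)` input format, and the `step` and `min_gap` parameters are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical input: one (x_left, x_right) pair per text line on a page.
Box = Tuple[float, float]


def find_column_gaps(
    boxes: List[Box], step: float = 1.0, min_gap: float = 10.0
) -> List[Tuple[float, float]]:
    """Sweep a vertical line from the leftmost to the rightmost text edge.

    Returns the x-intervals, at least min_gap wide, where the sweeping
    line touches no text box. Page margins are excluded because the
    sweep starts and ends at the text extent.
    """
    if not boxes:
        return []
    x = min(left for left, _ in boxes)
    end = max(right for _, right in boxes)
    gaps: List[Tuple[float, float]] = []
    gap_start = None
    while x <= end:
        crossed = any(left <= x <= right for left, right in boxes)
        if not crossed:
            if gap_start is None:
                gap_start = x  # the sweep line just left all text
        elif gap_start is not None:
            if x - gap_start >= min_gap:
                gaps.append((gap_start, x))  # wide enough to be a gutter
            gap_start = None
        x += step
    return gaps


# Two columns of text separated by an empty gutter.
lines = [(50.0, 290.0), (50.0, 280.0), (310.0, 550.0), (310.0, 540.0)]
gaps = find_column_gaps(lines)
print(len(gaps) + 1, "column(s); gutters:", gaps)
```

A line that spans the full page width, such as a title, would hide every gutter in this simplified version, so a practical system would need to handle such lines separately, for example by sweeping over one vertical band of the page at a time.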

CC BY-NC-ND 4.0

Paper citation in several formats:

Yu, C., Zhang, C. and Wang, J. (2020). Extracting Body Text from Academic PDF Documents for Text Mining. In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - KDIR; ISBN 978-989-758-474-9; ISSN 2184-3228, SciTePress, pages 235-242. DOI: 10.5220/0010131402350242

@conference{kdir20,
author={Changfeng Yu and Cheng Zhang and Jie Wang},
title={Extracting Body Text from Academic PDF Documents for Text Mining},
booktitle={Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - KDIR},
year={2020},
pages={235-242},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010131402350242},
isbn={978-989-758-474-9},
issn={2184-3228},
}

TY - CONF

JO - Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - KDIR
TI - Extracting Body Text from Academic PDF Documents for Text Mining
SN - 978-989-758-474-9
IS - 2184-3228
AU - Yu, C.
AU - Zhang, C.
AU - Wang, J.
PY - 2020
SP - 235
EP - 242
DO - 10.5220/0010131402350242
PB - SciTePress