2 RELATED WORK
Existing tools extract BT text and NBT text together
without a clear distinction. These include pdftotext
(FooLabs, 2014) and PDFBox (Apache, 2017), the two most
widely used tools for extracting text from PDF, as well as
a number of other tools such as pdftohtml (Kruk, 2013), pdftoxml
(Dejean and Giguet, 2016), pdf2xml (Tiedemann,
2016), ParsCit (Kan, 2016), PDFMiner (Shinyama,
2016), pdfXtk (Hassan, 2013), pdf-extract (Ward,
2015), pdfx (Constantin et al., 2011), PDFExtract
(Berg, 2011), and Grobid (Lopez, 2017). PDFBox can extract text in
two-column layouts; some other tools extract text line
by line across columns.
Using heuristics is a common approach. For ex-
ample, the Java PDF library was used to obtain a
bounding box for each word, compute the distance
between neighboring words, connect them based on a
set of rules to form a larger text block, place them into
rhetorical categories, and connect these categories
following the order of the underlying document
(Ramakrishnan et al., 2012). However, this method fails
to align broken sentences and to identify text on formulas,
tables, or figures. Tables may also be extracted using an
intermediate HTML representation generated by pdftohtml
(Yildiz et al., 2005), or by creating text blocks that group
characters based on their relative positions (Shigarov
et al., 2016). These two methods focus only on
extracting tables.
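As a rough illustration of this kind of distance heuristic (not the rule set of any cited work), the following sketch merges word bounding boxes into text blocks when they lie on the same line and the horizontal gap between them is small. The `Word` type, the `max_gap` and `line_tol` thresholds, and the sample coordinates are assumptions made for the example; words on one line are assumed to share nearly the same baseline.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge of the bounding box
    x1: float  # right edge of the bounding box
    y: float   # baseline of the word


def group_words(words, max_gap=5.0, line_tol=2.0):
    """Greedily merge neighboring words into text blocks (hypothetical rule)."""
    blocks = []
    # Visit words line by line (by baseline), left to right within a line.
    for w in sorted(words, key=lambda w: (w.y, w.x0)):
        if blocks:
            last = blocks[-1][-1]
            same_line = abs(w.y - last.y) <= line_tol
            close = 0 <= w.x0 - last.x1 <= max_gap
            if same_line and close:
                blocks[-1].append(w)
                continue
        blocks.append([w])
    return [" ".join(w.text for w in b) for b in blocks]


sample = [
    Word("Hello", 0, 30, 100),
    Word("world", 33, 60, 100),
    Word("Far", 120, 140, 100),   # same line, but too far to the right
    Word("Next", 0, 25, 88),      # different line
]
blocks = group_words(sample)
```

A single distance rule of this kind has no notion of what a block contains, which is consistent with the limitation noted above: text on tables or figures is grouped in the same way as body text.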
Other methods include rule-based and machine-
learning models. For example, text may be placed
into predefined logical text blocks based on a set of
rules on the distances, positions, and fonts of characters,
words, and text lines (Bast and Korzen, 2017). However,
these rules also connect text on tables or figures
as BT text. A Conditional Random Field (CRF)
model has been trained (Luong et al., 2011; Romary and
Lopez, 2015) to extract text according to predefined
rhetorical categories, such as title, abstract, and
other sections in the input document. However, this
model fails to determine paragraph boundaries or
align broken sentences, among other things.
CiteSeerX (Giles, 2006), a search engine, extracts
metadata from indexed scientific articles
for search purposes, but does not focus on the
accuracy of extracting body text. PDFfigures (Clark
and Divvala, 2015) chunks text, tables, and figures
into blocks, then classifies these blocks into captions,
body text, and part-of-figure text. Recent studies have
shifted attention to extracting certain types of text,
including titles (Yang et al., 2019) (but not text on
tables or figures), and math expressions in the display
mode and the inline mode (Mali et al., 2020; Pfahler
et al., 2019; Wang et al., 2018; Phong et al., 2020).
In summary, previous methods, while achieving
some success, still fall short of the accuracy
required by text-mining applications that rely
on clean extraction of complete sentences and correct
paragraph boundaries in BT text.
3 HTML REPLICATION OF PDF
HTML technologies have been used to replicate PDF
layouts to facilitate online publishing. A PDF docu-
ment can be represented as a sequence of pages, with
each page being a DOM tree of objects with sufficient
information for an HTML viewer to display the content
(Wang and Liu, 2013). The text extracted from
PDF by pdf2htmlEX (Wang, 2014) is translated into
HTML text elements that are placed at the same positions
at which they are displayed in the PDF.
Let F denote a PDF document and f the HTML
file produced by pdf2htmlEX on F. The DOM tree
for f, denoted by T_f, is divided into four levels: document,
page, text line, and text block (TBK for short).
(1) Document Structure. T_f starts with the following
tag as the root: <div id="page-container">, and each
of its children is the root of a subtree for a page, listed
in sequence, with an id indicating its page number
and a class name indicating the width and height of
a page. For example, a child node with <div id="pf7"
class="pf w0 h0" data-page-no="7"> is the root of the
subtree for Page 7, where w0 and h0 are the width and
height of the page (specifying the printable area) with
the origin at the lower-left corner of the page.
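The document level described above can be sketched in code. The following minimal example, which uses only Python's standard `html.parser` module, lists the page nodes that are direct children of the page-container root. The class name `PageLister`, the helper `list_pages`, and the `SAMPLE` string are hypothetical names introduced for illustration; the sample is a hand-written fragment modeled on the tags quoted above, not actual pdf2htmlEX output.

```python
from html.parser import HTMLParser


class PageLister(HTMLParser):
    """Collect the <div> nodes that are direct children of #page-container."""

    def __init__(self):
        super().__init__()
        self.depth = 0                # current <div> nesting depth
        self.container_depth = None   # depth of the page-container <div>
        self.pages = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.depth += 1
        a = dict(attrs)
        if a.get("id") == "page-container":
            self.container_depth = self.depth
        elif (self.container_depth is not None
              and self.depth == self.container_depth + 1):
            # A child of the root: the root of the subtree for one page.
            self.pages.append({
                "id": a.get("id"),
                "page_no": a.get("data-page-no"),  # kept as a string
                "classes": (a.get("class") or "").split(),
            })

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.depth == self.container_depth:
            self.container_depth = None  # leaving the page container
        self.depth -= 1


def list_pages(html_text):
    parser = PageLister()
    parser.feed(html_text)
    return parser.pages


# Hand-written sample (an assumption), modeled on the tags quoted in the text.
SAMPLE = (
    '<div id="page-container">'
    '<div id="pf7" class="pf w0 h0" data-page-no="7"><div>Hello</div></div>'
    '<div id="pf8" class="pf w0 h0" data-page-no="8"></div>'
    '</div>'
)
pages = list_pages(SAMPLE)
```

Here the width and height class names (w0 and h0 in the example above) appear in the `classes` list of each page entry; resolving them to actual dimensions would require the accompanying style sheet, which this sketch does not parse.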
(2) Page Structure. Each page starts with a page node,
followed by object nodes with contents to be printed.
Each object occupies a rectangular area (a bounding
box) specified on a coordinate system of pixels. The
text of the document is divided into TBKs as leaf
nodes. Each TBK is represented by a <div> tag with
corresponding attributes, and so the text in a TBK is
either all BT text or all NBT text. Each object is identified
by coordinates (x,y) at the lower-left corner of
the bounding box relative to the coordinates of its par-
ent node. In what follows, these coordinates are re-
ferred to as the starting point of the underlying object.
In addition to the starting point, a non-textual object is
specified by a width and a height, and a TBK is speci-
fied with a height without a width, where the width is
implied by the enclosed text, font size and style, and
word spacing. The parent of each object may either be
the origin, a node for a figure or a table, or a node due
to some (probably invisible) formatting code. Thus,
the height of a page’s DOM tree could be greater than
3. Figure 1 is a schematic of page structure.
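Because each starting point is given relative to the parent node, the position of an object on the page can be recovered by accumulating starting points along the path to the root. The sketch below illustrates this under simplifying assumptions: the `Node` type and the sample values are hypothetical, and any scaling or transformation applied by the formatting code is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    x: float                          # starting point, relative to the parent
    y: float
    parent: Optional["Node"] = None   # None for the page origin


def absolute_start(node):
    """Sum relative starting points from the node up to the page origin."""
    ax, ay = 0.0, 0.0
    while node is not None:
        ax += node.x
        ay += node.y
        node = node.parent
    return ax, ay


# Hypothetical page: a TBK nested inside a figure node.
page = Node(0, 0)
figure = Node(50, 400, parent=page)
label = Node(10, 20, parent=figure)
position = absolute_start(label)
```

In this example the TBK `label` sits two levels below the page origin, which is the kind of nesting that makes the height of a page's DOM tree exceed 3.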
KDIR 2020 - 12th International Conference on Knowledge Discovery and Information Retrieval