Extracting Structure, Text and Entities from PDF Documents
of the Portuguese Legislation
Nuno Moniz
Institute of Engineering, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida, 4715-357, Porto, Portugal
Fátima Rodrigues
GECAD – Knowledge Engineering and Decision Support Research Center / Computer Engineering Department,
Institute of Engineering, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida, 4715-357, Porto, Portugal
Keywords: Information Retrieval, Text Extraction, PDF.
Abstract: This paper presents an approach to the text processing of PDF documents with a well-defined layout structure. The approach explores the font structure of PDF documents using perceptual grouping. It consists of extracting text objects from the content stream of the documents and grouping them according to a set criterion, also making use of geometric-based regions in order to achieve the correct reading order. The developed approach processes the PDF documents using logical and structural rules to extract the entities present in them, and returns an optimized XML representation of the PDF document, useful for re-use, for example in text categorization. The system was trained and tested with Portuguese Legislation PDF documents extracted from the electronic Republic’s Diary. Evaluation shows that our approach achieves good results.
1 INTRODUCTION
The daily increase of information available on the Internet creates the need for tools capable of extracting and processing it.
Important sources of information are originally created in the form of text documents. Although stored in computers, these documents contain no formal indication of the data types they hold or of their own structure. This lack of formal indication prevents the information from being manipulated to meet users’ specific needs when accessing, querying or searching it. To make that knowledge computer-processable it is necessary to understand the structure of documents, to encode their knowledge and to develop algorithms that bridge the gap between text documents and computer-processable representations.
Extracting text from a PDF document is not a direct and simple task. In our research we concluded that OCR is the technology used in most cases (Taylor et al., 1994; Klink and Kieninger, 2001; Todoran et al., 2001; Hollingsworth et al., 2005), as these works attempt to perform text extraction on documents whose structure is unknown. However, most of the cases mentioned conclude that OCR is time-consuming and prone to recognition errors.
A considerable number of public and private organizations that issue official documents regularly adopt well-defined layout structures. These standards include not only the geometric position of text but also its hierarchical structure: differentiated fonts, styles and positioning. Using a combination of hereditary and acquired knowledge, we can understand the structure of complex documents without significant effort (Hassan, 2010).
Today there is technology available to parse information directly from PDF documents. We chose to use a free and open-source library, iText, described as “a library that allows you to create and manipulate PDF documents”.
We found this and similar approaches of parsing information directly from PDF documents used or described in some of the work we surveyed (Hassan and Baumgartner, 2005; Antonacopoulos and Coenen, 1999; Rosenfeld et al., 2008; Siefkes, 2003). For grouping text objects, these approaches mainly use font size as the criterion.
In this paper we present an approach for text
processing of PDF documents with well-defined layout structures; we used the Portuguese Republic’s Diary documents. This approach uses two different extraction methods, corresponding to the two stages of document processing: document analysis and document understanding (Hassan, 2010). The criterion used for grouping the text objects was the font used in each text object.
The next section presents a general description of the system. It is followed by a section that describes the system implementation and its functionalities, as well as the general process. We then present an evaluation of the system’s performance, and the last sections present the discussion, future work and conclusions.
2 GENERAL DESCRIPTION
PDF uses a structured binary file format described by a derivation of the PostScript page description language. Objects are the basic data structure in a PDF file; for the purposes of this paper we describe some of them. The content stream is a stream object that contains the sequence of instructions describing the graphical elements of the page. A dictionary object is an associative table containing key/value pairs of objects. A name object is an atomic symbol uniquely defined by a sequence of characters (Adobe Systems Incorporated, 2008).
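For illustration, a minimal page dictionary in PDF syntax might look like the sketch below (written in the style of the specification’s examples); /Type, /Contents and /Font are name objects used as dictionary keys, and 4 0 R is an indirect reference to the page’s content stream:

    << /Type /Page
       /MediaBox [ 0 0 612 792 ]
       /Contents 4 0 R
       /Resources << /Font << /F13 23 0 R >> >>
    >>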
PDF document processing can be divided into two phases, corresponding to the two structures in a document: document analysis, in order to extract the layout structure, and document understanding, for mapping the layout structure into a logical structure (Klink et al., 2000). Our approach is divided into three phases: the two previously described and a third that combines their outputs.
2.1 Document Analysis
The first step in document analysis is layout analysis, or segmentation. It consists of parsing a document into atomic blocks. In our research we found two approaches to segmentation: top-down and bottom-up.
The top-down approach is an OCR simulation that usually makes use of whitespace density graphs or similar techniques. It consists of parsing the documents along the x and y axes in order to find whitespace areas. We found reports (Hassan and Baumgartner, 2005) of block recognition problems in certain layouts.
The bottom-up approach can be described as a process of parsing and grouping the smallest segments that share a set of common characteristics, such as font size (Hassan and Baumgartner, 2005).
In terms of region comparison, we based our discussion on the research by Antonacopoulos and Coenen (1999), which describes two categories of methods for region comparison: pixel-based and geometric. The geometric category is described as the best approach, but the authors openly state their reservations about it due to the need for accurate descriptions of the regions.
Regarding segmentation, our intended output is not a hierarchical structure but only the coarse-grained regions of each page, which in our approach represent the two halves of the document page, as shown in Figure 1. We elaborate on this option in Section 3.3. In the given example, the graphic regions are defined by a vertical ruling.
As stated, our approach is intended for known, fixed-structure documents. Therefore, we consider the top-down approach, combined with the geometric region comparison method, to be the most appropriate for this step.
The second step is to extract text from the regions resulting from segmentation. Using the iText library mentioned before, we are able to define the areas from which to extract text.
Note that the layout objects are extracted solely for the purpose of extracting text from the PDF file. The output of this phase is an array of text segments in reading order but without any explicit logical structure.
Figure 1: Resulting regions of segmentation process.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
124
2.2 Document Understanding
According to Todoran et al. (2001), document understanding can be divided into two further phases: the process of grouping the layout document objects in order to classify the logical objects, and the process of determining their logical relations.
For the first phase, the best criterion for grouping the layout objects is similar to the perceptual grouping referred to by Rosenfeld et al. (2008), who used spatial knowledge to aggregate primitive text objects and create groups of text (lines, paragraphs and columns). In our approach we used the font of the text segments as the criterion.
As mentioned before, in the structure of a PDF we can find the fonts dictionary in the resources dictionary. An example from Adobe Systems Incorporated (2008) is shown in Figure 2.
Figure 2: Example of Font Dictionary.
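In outline, the resources of a page map a font resource name such as /F13 to a font dictionary object along these lines (a minimal sketch in the style of the specification’s examples; the object number 23 is illustrative):

    23 0 obj            % the font object referenced as /F13 in /Resources
      << /Type /Font
         /Subtype /Type1
         /BaseFont /Helvetica
      >>
    endobj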
As in the work of Giuffrida et al. (2000), Hu et al. (2005) and Hassan and Baumgartner (2005), fonts are used in our approach, although the criterion for grouping objects is different. We implemented a similar approach, but defined the criterion as the font itself, as declared in the content stream of a document.
In the text objects of a PDF content stream we can find both object types, name and string. An example from Adobe Systems Incorporated (2008) is presented in Figure 3.
Therefore, we have the objects that are required for grouping according to our criterion. The operators BT and ET mark the beginning and the end of a text object. Using Figure 3 as an example, the second line sets the font and the fourth line prints the string.
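That text object, reproduced from the specification’s example, reads as follows; /F13 is a name object selecting the font resource at size 12, and (ABC) is the string being shown:

    BT
    /F13 12 Tf
    288 720 Td
    (ABC) Tj
    ET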
Based on this, by extracting the text objects of a PDF file we are able to group the strings by the font used. The result is then translated into XML. Note that this result is not guaranteed to be in the correct reading order.
The second phase of document understanding is integrated into the third and final phase of document processing, described as follows.
Figure 3: Text Object.
2.3 Merging Phase
At this point we have two outputs from the previous phases: a complete text description in the correct reading order, and an XML file with strings tagged and grouped by the font used.
Two processes are therefore still missing: we need to join the two outputs into an XML file that contains the tagged string groups in the correct reading order, and we need to apply the second phase of document understanding, described as the process of determining the logical relations between the groups of objects.
In this approach, one logical relation, the reading order, is dealt with from the start, as stated above. Other logical relations have to be input by the user of the system, such as the structural relationships between segments (e.g., a paragraph contains lines). Our approach is based on two sets of rules: structural and logical. Structural rules, which are syntactical, are mainly applied in order to classify and create new groups or to re-label the existing ones. Logical rules are applied in order to establish logical relations between groups. Both structural and logical rules have their own specific syntax, which we explain in detail in Section 3.
The expected output of our approach is an XML file containing the text of the PDF file, in the correct reading order, tagged accordingly and containing the logical relations set out by the user.
3 SYSTEM DESCRIPTION
In the previous section we presented the general description of our approach. In this section we describe its implementation. Note that, despite the previously presented division into phases, our approach does not implement them in the same order.
The implementation has two phases: extraction
and analysis.
ExtractingStructure,TextandEntitiesfromPDFDocumentsofthePortugueseLegislation
125
The extraction phase contains three processes: the extraction of information from the PDF’s content stream, the extraction of text using geometric positioning, and the merging of the outputs of the two previous processes into an XML file.
The analysis phase contains two processes: the application of structural rules and the application of logical rules. The system output is an XML file that contains the mapping of the layout structure to a logical structure of the PDF document.
The extraction of text within tables and the extraction of images were not implemented, but they are among our future work objectives.
Before describing the phases and processes, we would like to map our processes to previous research.
3.1 Background
In order to be clear about the influence of the studied approaches, Table 1 maps our processes to what we consider to be the corresponding steps in the two descriptions that follow.
Niyogi (1994) presents a description of a computational model for extracting the logical structure of a document, described as follows:
1. a procedure for classifying all the distinct blocks in an image;
2. a procedure for grouping these blocks into logical units;
3. a procedure for determining the reading order of the text blocks within each logical unit;
4. a control mechanism that monitors the above processes and creates the logical representation of the document;
5. a knowledge base containing knowledge about document layout and structure; and
6. a global data structure that maintains the domain and control data.
Taylor et al. (1994) present four phases in their implementation:
1. Physical Analysis
2. Logical Analysis
3. Functional Analysis
4. Topical Analysis
We acknowledge that this mapping is not an exact match, but it gives a general idea of the correspondence between the processes in our approach and previous research.
Giuffrida et al. (2000) used spatial knowledge of a given domain to encode a rule-based system for automatically extracting metadata from research papers; spatial knowledge was used to create the rules, the metadata was extracted from PostScript files, and formatting information was used.

Table 1: Mapping of implemented processes with previous research.

Processes                          Niyogi (1994)   Taylor et al. (1994)
Extraction from Content Stream     1) and 2)       1)
Extraction from Layout             3)              1)
XML output                         3)              2)
Application of structural rules    4)              3)
Application of logical rules       4)              4)
Hu et al. (2005) proposed a machine learning approach to title extraction from general documents; tests were made with Word and PowerPoint documents. This method mainly utilizes formatting information, such as font size, in its models.
Both approaches use formatting information, such as the font used. We use the font as declared in the content stream of PDF documents as the criterion for perceptual grouping.
We chose this option due to the frequent presence of different styles within text segments of the same font size. Usually such a style change represents an entity; therefore, using the content stream font description as the criterion instead of the font size enables a better information extraction process.
In the following sections we will describe the
processes of our system’s implementation.
3.2 Extraction from Content Stream
As explained in Section 2, a PDF document is composed of objects. In this section we refer only to text objects.
The objective of this process is to extract strings labelled with the font resource declared for their rendering. This is done by sequentially parsing the content stream, extracting each text object and parsing its font and string. Sequential strings that share the same font are grouped. As the results are obtained, they are appended to an XML structure.
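A minimal sketch of this grouping, assuming the iText 5.x parser API (the class and names are ours; for brevity we key the groups on the PostScript font name, where our system uses the font resource declared in the content stream, and we print the groups instead of building the XML structure):

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.ImageRenderInfo;
    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
    import com.itextpdf.text.pdf.parser.RenderListener;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: group sequential strings of the content stream by font.
    public class FontGroupingListener implements RenderListener {

        public static class Group {
            public final String font;
            public final StringBuilder text = new StringBuilder();
            Group(String font) { this.font = font; }
        }

        private final List<Group> groups = new ArrayList<Group>();

        public void beginTextBlock() { }
        public void endTextBlock() { }
        public void renderImage(ImageRenderInfo info) { } // images are ignored

        public void renderText(TextRenderInfo info) {
            String font = info.getFont().getPostscriptFontName();
            // Start a new group whenever the font changes; otherwise append
            // to the current group, mirroring the sequential grouping above.
            if (groups.isEmpty() || !groups.get(groups.size() - 1).font.equals(font)) {
                groups.add(new Group(font));
            }
            groups.get(groups.size() - 1).text.append(info.getText());
        }

        public static void main(String[] args) throws IOException {
            PdfReader reader = new PdfReader(args[0]);
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                FontGroupingListener listener = new FontGroupingListener();
                parser.processContent(page, listener);
                for (Group g : listener.groups) {
                    System.out.printf("<%s>%s</%s>%n", g.font, g.text, g.font);
                }
            }
            reader.close();
        }
    }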
After this, two procedures are called: one to
extract explicit entities and another to clean the
XML.
In the first procedure, as explained before, we use the single criterion of the font used. By analysing the fonts used in the Portuguese Republic’s Diary, we found that the italic style is most often used to refer to an entity. Therefore, this procedure consists of extracting these explicit entities and re-labelling them.
In the second procedure, cleaning operations are performed, e.g. removing empty tags and joining two consecutive objects with the same tag. Tables are also removed in this procedure; however, before this operation, a regular expression for entity recognition is applied to the text within them, in order to extract the entities present in the documents’ tables.
In Figure 4 we present an excerpt of the auxiliary XML file created to store this information, alongside the respective PDF document.
3.3 Extraction from Layout
In this process we extract text from the PDF document using the region filters present in the iText library; we therefore extract text from known locations. The documents we target are organized in two columns, so we extract the text by setting a vertical ruling in the middle of the page.
The output of this operation consists of arrays of strings, which are joined in order to produce a single array. This array is a sequential list that follows the reading order of the text; each string of the array contains one line of a column.
The sole purpose of this process is to extract text
in the correct reading order.
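A sketch of this process, again assuming the iText 5.x API (the class is ours; coordinates are in PDF user space and the vertical ruling is placed at half the page width):

    import com.itextpdf.text.Rectangle;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
    import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
    import com.itextpdf.text.pdf.parser.RenderFilter;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch: extract text from the two halves of each page, defined by a
    // vertical ruling in the middle, and join the columns in reading order.
    public class ColumnExtractor {

        public static List<String> extractLines(String path) throws IOException {
            PdfReader reader = new PdfReader(path);
            List<String> lines = new ArrayList<String>();
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                Rectangle size = reader.getPageSize(page);
                float middle = (size.getLeft() + size.getRight()) / 2;
                // Left column first, then right column: the reading order.
                Rectangle[] columns = {
                    new Rectangle(size.getLeft(), size.getBottom(), middle, size.getTop()),
                    new Rectangle(middle, size.getBottom(), size.getRight(), size.getTop())
                };
                for (Rectangle column : columns) {
                    RenderFilter filter = new RegionTextRenderFilter(column);
                    TextExtractionStrategy strategy = new FilteredTextRenderListener(
                            new LocationTextExtractionStrategy(), filter);
                    String text = PdfTextExtractor.getTextFromPage(reader, page, strategy);
                    // Each extracted string holds one line of a column.
                    Collections.addAll(lines, text.split("\n"));
                }
            }
            reader.close();
            return lines;
        }
    }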
3.4 XML Output
This is the final process of the extraction phase, run after both extractions. It consists of sequential comparisons between their outputs.
For each line obtained from the extraction from layout, a lookup is made in the auxiliary XML produced by the extraction from the content stream. The resulting matches are appended to an XML file.
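A simplified sketch of this lookup (the names are ours; the real procedure matches against the auxiliary XML and copes with the minor incompatibilities discussed in Section 4, whereas here the groups are modelled as a plain map and XML escaping is omitted):

    import java.util.List;
    import java.util.Map;

    // Sketch of the merging step: each layout line (already in reading
    // order) is looked up in the font groups extracted from the content
    // stream and emitted under the tag of the group that contains it.
    // "fontGroups" maps a font tag (e.g. TT8) to the concatenated text
    // extracted under that font.
    public class XmlMerger {

        public static String merge(List<String> layoutLines, Map<String, String> fontGroups) {
            StringBuilder xml = new StringBuilder("<document>\n");
            for (String line : layoutLines) {
                String tag = "untagged";
                for (Map.Entry<String, String> group : fontGroups.entrySet()) {
                    if (group.getValue().contains(line.trim())) { // naive lookup
                        tag = group.getKey();
                        break;
                    }
                }
                xml.append("  <").append(tag).append(">")
                   .append(line.trim())
                   .append("</").append(tag).append(">\n");
            }
            return xml.append("</document>\n").toString();
        }
    }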
Figure 4: Excerpt of the auxiliary XML.
In Figure 5 an excerpt of the auxiliary XML (not in the correct reading order) is presented, and in Figure 6 an excerpt of the XML output. It is possible to see that the numbers in the TT8 tag are not sequential. We do not have the information necessary to specify why the iText library is unable to parse the text objects from the content stream in the correct reading order. However, we assume this could be due either to the content stream not holding all of its text objects in a sequential manner, or to misleading character recognition caused by the use of vectors in that process.
Figure 5: Bit of auxiliary XML.
Figure 6: Bit of XML output.
3.5 Application of Structural and
Logical Rules
These are the processes of the analysis phase. Although the two processes are separate, they are implemented using the same paradigm: a rule-based system applied according to the syntax defined for each set of rules (structural and logical).
Applying these rules requires user input, which is given by declaring rules in four text files containing the respective structural and logical rules.
Our system embeds operations that enable the application of these rules. The operations are the result of the knowledge acquired from the analysis of the auxiliary XML file, the output of the previous phase.
ExtractingStructure,TextandEntitiesfromPDFDocumentsofthePortugueseLegislation
127
3.5.1 Structural Rules
The application of structural rules obeys pre-defined
types of operations. The structural rules are defined
in two separate files.
The first file contains rules for operations that include the recognition of structure entities (articles, lines, chapters, sections and others), the deletion of structure entities and the recognition of entities. The second file contains rules that alter the original XML tag names into meaningful tag names.
This list of operations is not static, and we believe it will grow as different PDF documents are processed.
The syntax for the specification of rules in the first file is as follows: ‘RegExp:::operation’.
The operation part represents an internal process encoded in our system. As mentioned, our domain knowledge is based solely on the Portuguese Republic’s Diary documents. Some examples of these internal processes are the insertion of a tag after or before the present one and the recognition of structural elements within text objects.
In Figure 7 we present an XML output without the application of any rule. In Figure 8 the same case is presented after the application of an example rule from the first structural rules file, related to the recognition of chapter elements: ‘(CAPÍTULO\s[IVXLCM]+):::chapter’.
As an important remark, the extraction of entities is processed at this point through the application of a structural rule. The successful results obtained have no implication for the structure of the text or document; they are stored separately.
The second file of structural rules consists of a list of rules that deal solely with altering the initial XML tag names into the tag names desired by the user.
The syntax for the specification of rules in the second file is as follows: ‘RegExp:::previoustag:::newtag’. The rule may or may not contain a regular expression. In Figure 9, an example rule from the second file is applied to the excerpt previously presented in Figure 5, where tag <TT6> is replaced by tag <govEntity>: ‘^\s?[A-ZÁ-Ü]{2}.+$:::TT6:::govEntity’.
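A sketch of how one such re-labelling rule can be parsed and applied (the class and the element model are ours; we assume a rule without a regular expression simply leaves the RegExp field empty):

    import java.util.regex.Pattern;

    // Sketch: one re-labelling rule of the form "RegExp:::previoustag:::newtag",
    // as used by the second structural rules file.
    public class TagRule {

        private final Pattern pattern;     // null when the rule carries no RegExp
        private final String previousTag;
        private final String newTag;

        public TagRule(String ruleLine) {
            String[] parts = ruleLine.split(":::", 3);
            this.pattern = parts[0].isEmpty() ? null : Pattern.compile(parts[0]);
            this.previousTag = parts[1];
            this.newTag = parts[2];
        }

        // Returns the tag an element should carry after the rule is applied:
        // the tag is replaced only when both the tag and the text match.
        public String apply(String tag, String text) {
            boolean tagMatches = tag.equals(previousTag);
            boolean textMatches = pattern == null || pattern.matcher(text).matches();
            return (tagMatches && textMatches) ? newTag : tag;
        }

        public static void main(String[] args) {
            // The paper's example rule: relabel <TT6> elements that look
            // like an upper-case heading into <govEntity>.
            TagRule rule = new TagRule("^\\s?[A-ZÁ-Ü]{2}.+$:::TT6:::govEntity");
            System.out.println(rule.apply("TT6", "MINISTÉRIO DAS FINANÇAS")); // govEntity
            System.out.println(rule.apply("TT6", "artigo 2.º"));              // TT6
        }
    }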
Figure 7: XML output without structural rules.
Figure 8: XML output with structural rules.
Figure 9: XML output with tag structure rules.
3.5.2 Logical Rules
These rules are likewise defined in two separate files. The logical rules are intended to structure the final XML file so as to replicate the information hierarchy present in the original PDF document.
This process requires a prior analysis by the user in order to specify the correct options. For our example, in terms of information hierarchy we find that the Legislation Entity is the most important element in the Portuguese Republic’s Diary; each Legislation Entity may or may not have a Sub-Entity; these Entities issue Legislation Documents; a Legislation Document may or may not have a Description; a Legislation Document may or may not be organized into Articles; and so on.
In order to reproduce that hierarchy we require two types of processing: a first process in which a specific tag appends all the following objects until a similar tag is found, and a second process that appends the objects of a specific tag onto another tag preceding it.
The first logical rules file holds the rules applied in the first process; the second file contains the rules applied in order to perform the second process.
The first logical rules file represents a top-down approach to aggregation: it appends every tag onto a specific user-defined tag. The syntax for these rules is as follows: ‘firstTag:::aggregationTag’. The firstTag field represents the parent tag, and the aggregationTag field represents the tag to which the following objects will be appended.
This process is used primarily with the objects that have higher importance in the structure or that contain most of the text (for example, Legislation Entities and Legislative Documents).
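A sketch of this aggregation pass under one plausible reading of the rule semantics (the node model and class are ours): elements tagged with the aggregationTag collect every following element until the next aggregationTag or the next firstTag is reached.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the first, top-down aggregation pass over a flat element list.
    public class TopDownAggregator {

        public static class Node {
            public final String tag;
            public final List<Node> children = new ArrayList<Node>();
            public Node(String tag) { this.tag = tag; }
        }

        public static List<Node> apply(List<Node> flat, String firstTag, String aggregationTag) {
            List<Node> result = new ArrayList<Node>();
            Node open = null; // the element currently collecting followers
            for (Node n : flat) {
                if (n.tag.equals(firstTag)) {
                    open = null;              // a new parent closes the open group
                    result.add(n);
                } else if (n.tag.equals(aggregationTag)) {
                    open = n;                 // start collecting under this element
                    result.add(n);
                } else if (open != null) {
                    open.children.add(n);     // appended in document order
                } else {
                    result.add(n);
                }
            }
            return result;
        }
    }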
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
128
In Figure 10 we present an excerpt of an XML output after the application of an example rule from the first logical rules file, ‘LexEntity:::LexDocument’.
In that figure we can observe the application of a rule that follows what was stated concerning the information hierarchy.
The second logical rules file contains rules whose objective is to append the objects with a specific tag onto another user-defined tag. The syntax for these rules is as follows: ‘parentTag:::tagToAppend’.
The parentTag field represents the tag onto
which the objects will be appended; the second field
represents the tag to be appended.
In Figure 11 we present an excerpt of an XML output after the application of the rule ‘line:::paragraph’ from the second logical rules file.
Figure 10: XML output with logical rules.
Figure 11: XML output with logical rules.
This is the final process of the analysis phase. Its output is the final XML file containing the mapping of the layout structure to the logical structure of a PDF document.
4 PERFORMANCE
We tested our system with a group of 40 Portuguese Republic’s Diary PDF documents, chosen randomly in terms of size and date. For this performance test we did not include the Diaries’ supplements. The documents span the period from the 1st of January 2009 to the 19th of March 2012. The documents in our sample were accessed in an online environment (remote access).
For each document in our sample we confirmed whether the text extraction was done correctly and successfully. The confirmation was based on a manual comparison between the original text in the PDF documents and the XML output. We also confirmed the extraction of entities, based on a one-by-one evaluation of each extracted entity.
The documents were graded, as percentages, according to their accuracy in both processes: extracting text and extracting entities. We searched for unsuccessful text extractions and for non-entities that were flagged as correct entities.
In our experiments we used two measures, Text Extraction Accuracy (TEA) and Entity Extraction Accuracy (EEA), defined as follows:
TEA = 1 – ( UTE / TTE ) (1)
EEA = 1 – ( UEE / TEE ) (2)
Here, UTE and UEE denote Unsuccessful Text Extractions and Unsuccessful Entity Extractions; TTE and TEE denote the Total of Text Extractions and the Total of Entity Extractions.
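For instance, in a document with 1,000 text extractions of which 2 are unsuccessful, TEA = 1 – (2 / 1000) = 99.8% (illustrative figures, not taken from our sample). The following table presents the results.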
Table 2: Results of evaluation.

Period                 TEA       EEA
Jan 2009 – Dec 2009    99.82%    93.55%
Jan 2010 – Dec 2010    99.53%    92.55%
Jan 2011 – Dec 2011    99.68%    94.31%
Jan 2009 – Mar 2012    99.73%    93.61%
For both confirmations, partial results were considered wrong. As for the first confirmation (TEA), the incorrect extractions were promptly flagged by the system. Nonetheless, some results flagged as incorrect were accepted due to the previously stated expectations: since we expect the system to ignore text inside a table, such results were considered correct. In the second confirmation (EEA), however, we had to observe and classify each entity one by one. Entities that were incomplete, or that had incorrect phrasing or minor errors, were considered wrong.
In the course of this evaluation, despite the well-defined layout structure, we found the use of different and unique combinations of fonts, which caused some of the text extraction errors. Most of the text extraction errors were due to minor incompatibilities (a misplaced space character, for example) between the content stream extraction and the layout extraction. At this point we are improving this situation through trial and error. We are also considering different approaches for extracting the text from the PDF documents in the correct reading order using only the content stream.
To complete this performance evaluation, we point out some global indicators obtained during this process. They are presented in the following table.
Table 3: Additional evaluation indicators.

Indicator                               Result
Average PDF size                        696.5 KB
Average final XML size                  101.5 KB
Average number of pages per PDF         23
Average processing time per PDF         12 s
Average processing time per PDF page    0.5 s
5 DISCUSSION AND FUTURE
WORK
The main objective of our work was to build a system for extracting structure, text and entities from PDF documents that would be simple, fast and able to receive inputs from the user. Simple, because we need a flexible solution; fast, because the volume of PDF documents involved requires a system able to process a large number of documents; and user-guided, because the system is directed at cases where there is more specific knowledge than general knowledge (Klink and Kieninger, 2001), and that specific knowledge is static throughout every document of the same type.
There are some immediate subjects to improve or develop in order to achieve a stronger result.
Tests have shown that, due to the frequent use of unexpected fonts in the text, results can be misleading. However, they also showed that although this reduces the ability to classify the text through a rule-based approach, the system still generally recognizes it as valid text strings.
We did not consider the use of an ontology-based component instead of the rule-based one we developed. Nonetheless, this is an inevitable question for the future, given the present growth of the Semantic Web (Hendler et al., 2002).
We think a wider and more diverse evaluation of the system, using different types of documents, will be necessary; this should be critical for developing the user-input operability and for increasing the error-solving capability.
The application of rules and the extraction of entities are still matters for improvement. Although we obtained good results, we observed certain recurrent errors that we should address. At this point we have dismissed the processing of images and tables; the entities inside tables are, however, processed.
6 CONCLUSIONS
We presented the problem of text extraction from PDF documents with known and fixed layout structures, and presented a grouping-based approach as a possible solution. Furthermore, this solution is capable of extracting the entities present in the text.
This approach enables the creation of XML files containing the text and a representation of the structure of the PDF documents. The main contribution of our work is the development of a user-guided system for text and entity extraction using methods based on our research. By not using OCR technologies, and by using geometric-based region representations for segmentation, it requires little storage space and little processing time.
We consider that we have been able to show that this goal was achieved with some success. Although some improvements still have to be made, our preliminary results were encouraging. Nonetheless, we reckon the system still requires an extended period of experiments in order to evolve through the processing of more sets of documents.
ACKNOWLEDGEMENTS
The authors would like to thank the Knowledge Engineering and Decision Support Research Center for all the support provided.
REFERENCES
Hassan, T. 2010. User-Guided Information Extraction from Print-Oriented Documents. Dissertation, Vienna University of Technology.
Taylor, S., Dahl, D., Lipshitz, M. et al. 1994. Integrated Text and Image Understanding for Document Understanding. Unisys Corporation.
Klink, S., Kieninger, T. 2001. Rule-based Document Structure Understanding with a Fuzzy Combination of Layout and Textual Features. German Research Center for Artificial Intelligence.
Todoran, L., Worring, M., Aiello, M., Monz, C. 2001. Document Understanding for a Broad Class of Documents. ISIS Technical Report Series, Vol. 2001-15.
Hollingsworth, B., Lewin, I., Tidhar, D. 2005. Retrieving Hierarchical Text Structure from Typeset Scientific Articles - a Prerequisite for E-Science Text Mining. University of Cambridge Computer Laboratory.
Hassan, T., Baumgartner, R. 2005. Intelligent Text Extraction from PDF. Database & Artificial Intelligence Group, Vienna University of Technology, Austria.
Antonacopoulos, A., Coenen, F. P. 1999. Region Description and Comparative Analysis Using a Tesseral Representation. Department of Computer Science, University of Liverpool.
Rosenfeld, B., Feldman, R., Aumann, Y. et al. 2008. Structural Extraction from Visual Layout of Documents. CIKM ’02.
Siefkes, C. 2003. Learning to Extract Information for the Semantic Web. Berlin-Brandenburg Graduate School in Distributed Information Systems, Database and Information Systems Group, Freie Universität Berlin.
Adobe Systems Incorporated. 2008. Document management - Portable Document Format - Part 1: PDF 1.7.
Klink, S., Dengel, A., Kieninger, T. 2000. Document Structure Analysis Based on Layout and Textual Features. DAS 2000: Proceedings of the International Workshop on Document Analysis Systems.
Niyogi, D. 1994. A Knowledge-Based Approach to Deriving Logical Structure from Document Images. PhD thesis, State University of New York at Buffalo.
Hendler, J., Berners-Lee, T., Miller, E. 2002. Integrating Applications on the Semantic Web. Journal of the Institute of Electrical Engineers of Japan, Vol. 122(10), October 2002, pp. 676-680.
Hu, Y., Li, H., Cao, Y., Meyerzon, D., Zheng, Q. 2005. Automatic Extraction of Titles from General Documents using Machine Learning. JCDL ’05.
Giuffrida, G., Shek, E., Yang, J. 2000. Knowledge-Based Metadata Extraction from PostScript Files. DL ’00.