errors were considered as wrong.
While developing this evaluation we found that,
despite the well-defined layout structure, the
documents used many different and unusual
combinations of fonts. This caused some of the text
extraction errors. Most of the text extraction errors,
however, were due to minor incompatibilities (a
misplaced space character, for example) between the
content stream extraction and the layout extraction.
We are currently improving this situation through
trial and error. We are also considering different
approaches to extracting the text from the PDF
documents in the correct reading order using only
their content streams.
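The kind of minor incompatibility mentioned above can be illustrated with a short sketch. It assumes, hypothetically, that each extraction path yields a plain string for the same text block; the function name `differs_only_in_whitespace` and the sample strings are illustrative and are not part of the system described here.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character from the string."""
    return re.sub(r"\s+", "", text)


def differs_only_in_whitespace(stream_text: str, layout_text: str) -> bool:
    """Return True when the strings differ, but only in their whitespace."""
    if stream_text == layout_text:
        return False
    return strip_whitespace(stream_text) == strip_whitespace(layout_text)


# Hypothetical outputs of the two extraction paths for the same block.
from_stream = "Average processing time"
from_layout = "Average proces sing time"

print(differs_only_in_whitespace(from_stream, from_layout))
```

A check of this sort could help separate spacing mismatches from genuine content differences when counting extraction errors, although it says nothing about which of the two strings is the correct one.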
To complete this performance evaluation we
would like to point out some global indicators that
were obtained during this process. They are
presented in the following table.
Table 3: Additional evaluation indicators.
Indicator                              Result
Average PDF size                       696.5 KB
Average final XML size                 101.5 KB
Average number of pages per PDF        23
Average processing time per PDF        12 s
Average processing time per PDF page   0.5 s
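As a rough consistency check on the indicators in Table 3, the per-page processing time and the size of the XML output relative to the PDF input can be derived from the reported averages. The sketch below uses only the values in the table; the variable names are illustrative, and dividing one average by another is an approximation, since the mean of per-document ratios need not equal the ratio of the means.

```python
# Values taken from Table 3 (averages over the evaluated PDF documents).
avg_pdf_kb = 696.5
avg_xml_kb = 101.5
avg_pages = 23
avg_seconds_per_pdf = 12

# Ratio of the averages, used as an estimate of the per-page time.
seconds_per_page = avg_seconds_per_pdf / avg_pages

# Size of the final XML relative to the original PDF.
xml_to_pdf_ratio = avg_xml_kb / avg_pdf_kb

print(f"{seconds_per_page:.2f} s per page")
print(f"XML is {xml_to_pdf_ratio:.1%} of the PDF size")
```

The estimate comes to about 0.52 s per page, in line with the 0.5 s reported, and the final XML is roughly 15% of the size of the source PDF.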
5 DISCUSSION AND FUTURE WORK
The main objective of our work was to build a
system for extracting structure, text and entities
from PDF documents that would be simple, fast and
able to receive input from the user. Simple, because
we still need a solution that is flexible; fast, because
the volume of PDF documents involved requires a
system able to process a large number of documents;
and user-guided, because the system is aimed at
cases where there is more specific knowledge than
general knowledge (Klink and Kieneger, 2001), and
that specific knowledge is static across every
document of that type.
There are some immediate points to improve or
develop in order to achieve better results.
Tests have shown that, because unexpected fonts are
often used in the text, results can be misleading.
However, they also showed that, although this
reduces the ability to classify the text through a
rule-based approach, the system still generally
recognizes it as valid text strings.
We did not consider using an ontology-based
component in place of the rule-based one we
developed. Nonetheless, this is an inevitable question
for the future, given the current growth of the
Semantic Web (Hendler et al., 2002).
We think a wider and more diverse evaluation of
the system, using different types of documents, will
be necessary; this should be critical for developing
the handling of user inputs and for increasing the
error-solving capability.
The application of rules and the extraction of
entities still need improvement. Although we
obtained good results, we observed certain recurrent
errors that we should address. At this point we are
not processing images and tables. However, the
entities inside the tables are processed.
6 CONCLUSIONS
We presented the problem of text extraction from
PDF documents with known and fixed layout
structures, and a grouping-based approach as a
possible solution. This solution is also able to extract
entities present in the text. The approach enables the
creation of XML files containing the text and a
representation of the PDF document's structure. The
main contribution of our work is the development of
a user-guided system for text and entity extraction
using methods based on our research. By not using
OCR technologies and by using geometric-based
region representations for segmentation, the system
requires low storage space and low processing time.
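A minimal sketch of how geometric regions might be used to segment text on a fixed layout and emit XML is given below. It is not the implementation evaluated in this paper: the `Region` and `Fragment` types, the coordinates, the region names and the XML element names are all assumptions made for illustration, and the fragments are assumed to arrive with page coordinates already extracted from the PDF.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Region:
    """User-defined rectangle on the page (hypothetical representation)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class Fragment:
    """A piece of text with the position of its origin point."""
    text: str
    x: float
    y: float


def segment(fragments, regions):
    """Collect, for each region, the fragments whose origin lies inside it."""
    root = ET.Element("document")
    for region in regions:
        inside = [f for f in fragments if region.contains(f.x, f.y)]
        # Sort top-to-bottom, then left-to-right (y grows downwards here).
        inside.sort(key=lambda f: (f.y, f.x))
        node = ET.SubElement(root, "region", name=region.name)
        node.text = " ".join(f.text for f in inside)
    return root


# Illustrative layout and fragments; none of these values come from the paper.
regions = [Region("title", 0, 0, 600, 100), Region("body", 0, 101, 600, 800)]
fragments = [
    Fragment("world", 120, 40),
    Fragment("Hello", 50, 40),
    Fragment("Body text", 50, 300),
]
print(ET.tostring(segment(fragments, regions), encoding="unicode"))
```

In this sketch each region is described by a name and four numbers, which is consistent with the low storage requirements noted above, and assigning a fragment to a region takes only a few comparisons.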
We consider that we have been able to show that
this goal was achieved with some success. Although
some improvements have to be made, our
preliminary results are encouraging. Nonetheless,
we believe the system still requires an extended
period of experiments in order to evolve through the
processing of more sets of documents.
ACKNOWLEDGEMENTS
The authors would like to thank the Knowledge
Engineering and Decision Support Research Center
for all the support provided.
REFERENCES
Hassan, T., 2010. User-Guided Information Extraction
from Print-Oriented Documents. Dissertation. Vienna
University of Technology.
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval