
 
errors were considered as wrong. 
During this evaluation we found that, despite the 
well-defined layout structure, the documents use 
different and unusual combinations of fonts. This 
caused some of the text extraction errors. Most of 
the text extraction errors were due to minor 
mismatches (a misplaced space character, for 
example) between the content stream extraction and 
the layout extraction. We are currently reducing 
these mismatches through trial and error. We are 
also considering other approaches for extracting 
the text from the PDF documents in the correct 
reading order using only the content stream. 
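The kind of mismatch described above (a misplaced space character) can be illustrated with a small check that compares two extractions while ignoring whitespace. This is only a minimal sketch and not the method used in the system: the function names and the sample strings are hypothetical, and it assumes that both extractions are available as plain strings.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character so only visible glyphs are compared."""
    return re.sub(r"\s+", "", text)


def same_modulo_spacing(stream_text: str, layout_text: str) -> bool:
    """Return True when the two strings differ, at most, in their spacing.

    stream_text and layout_text are hypothetical names for the text obtained
    from the content stream and from the layout extraction, respectively.
    """
    return strip_whitespace(stream_text) == strip_whitespace(layout_text)


# Hypothetical sample: the layout extraction misplaced one space character.
from_stream = "Knowledge Discovery"
from_layout = "Knowledge Disc overy"

print(same_modulo_spacing(from_stream, from_layout))
```

A check of this kind would separate pure spacing mismatches from genuine character-level errors, which need a different treatment.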
To complete this performance evaluation, we report 
some global indicators that were obtained during 
this process. They are presented in Table 3. 
Table 3: Additional evaluation indicators. 
Indicator                              Result 
Average PDF size                       696.5 KB 
Average final XML size                 101.5 KB 
Average number of pages per PDF        23 
Average processing time per PDF        12 s 
Average processing time per PDF page   0.5 s 
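The indicators in Table 3 can be cross-checked with simple arithmetic. The sketch below recomputes the per-page processing time from the per-document figures. It also derives two quantities that are not reported in the paper and are shown only for illustration: the XML-to-PDF size ratio and the number of documents processed per hour, the latter assuming strictly sequential processing.

```python
# Values copied from Table 3.
avg_pdf_size_kb = 696.5
avg_xml_size_kb = 101.5
avg_pages_per_pdf = 23
avg_seconds_per_pdf = 12

# Per-page time implied by the per-document averages.
seconds_per_page = avg_seconds_per_pdf / avg_pages_per_pdf

# Derived, illustrative quantities (not reported in the paper).
xml_to_pdf_ratio = avg_xml_size_kb / avg_pdf_size_kb
pdfs_per_hour = 3600 / avg_seconds_per_pdf  # assumes sequential processing

print(f"Processing time per page: {seconds_per_page:.2f} s")
print(f"XML size relative to PDF size: {xml_to_pdf_ratio:.1%}")
print(f"Documents per hour: {pdfs_per_hour:.0f}")
```

The recomputed per-page time, about 0.52 s, agrees with the rounded value given in the table, and the final XML is on average roughly 15% of the size of the source PDF.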
5 DISCUSSION AND FUTURE WORK 
The main objective of our work was to build a 
system for extracting structure, text and entities 
from PDF documents that is simple, fast and able to 
receive input from the user. Simple, because we 
still need a solution that is flexible; fast, because 
the volume of PDF documents involved requires the 
ability to process a large number of documents; and 
user-guided, because the system is aimed at cases 
where there is more specific knowledge than general 
knowledge (Klink and Kieneger, 2001), and that 
specific knowledge is the same for every document 
of a given type. 
Some aspects need to be improved or developed in 
the short term in order to achieve better results. 
Tests have shown that the frequent use of 
unexpected fonts in the text can make results 
misleading. They also showed, however, that 
although such fonts reduce the ability of the 
rule-based approach to classify the text, the system 
generally still recognizes it as valid text strings. 
We did not consider using an ontology-based 
component in place of the rule-based component we 
developed. Nonetheless, this is an unavoidable 
question for the future, given the current growth of 
the Semantic Web (Hendler et al., 2002). 
We think a wider and more diverse evaluation of 
the system, using different types of documents, will 
be necessary; this will be critical for developing the 
handling of user input and for increasing the 
error-solving capability. 
The application of rules and the extraction of 
entities still need improvement. Although we 
obtained good results, we observed certain recurrent 
errors that we should address. At this point we do 
not process images or tables; the entities inside 
tables, however, are processed. 
6 CONCLUSIONS 
We presented the problem of text extraction from 
PDF documents with known and fixed layout 
structures, and a grouping-based approach as a 
possible solution. This solution is also able to 
extract entities present in the text. The approach 
enables the creation of XML files containing the 
text and a representation of the PDF document's 
structure. The main contribution of our work is the 
development of a user-guided system for text and 
entity extraction using methods based on our 
research. Because it does not use OCR technologies 
and uses geometry-based region representations for 
segmentation, the system requires little storage 
space and processing time. 
We consider that we have shown that this goal 
was achieved with some success. Although some 
improvements have to be made, our preliminary 
results are encouraging. Nonetheless, we believe the 
system still requires an extended period of 
experiments, processing more sets of documents, in 
order to evolve. 
ACKNOWLEDGEMENTS 
The authors would like to thank the Knowledge 
Engineering and Decision Support Research Center 
for all the support provided.
 
REFERENCES 
Hassan, T. 2010. User-Guided Information Extraction 
from Print-Oriented Documents. Dissertation. Vienna 
University of Technology. 
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval