
is an image represented by binary pixels of intensity 0 or 1, usually white and black respectively. This technique is used in character recognition to highlight the text relative to the background and the material on which it is printed or written.
• Layout analysis: at this stage the document is separated into areas of interest for analysis. In the case of this research, the text is extracted from the documents and separated from images, drawings, graphics and other non-textual elements.
• Text analysis and recognition: once the text has been properly filtered and separated, with the best possible highlighting, a recognition algorithm is applied to each character found in the document. The technique used in this research applies machine learning to recognize handwritten text.
• Text description prediction: after recognizing each character, a statistical value between 0 and 1 is assigned, corresponding to the probability that the character is a particular letter of the Portuguese alphabet; adjustments are made where necessary to form words correctly and preserve the original content of the document (see the sketch after this list). This stage results in the fully digitized text of the initial manuscript.
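As an illustration of this scoring step, the sketch below maps raw network scores for one character position to per-letter probabilities with a softmax. The alphabet and the scores are hypothetical; kraken's decoder performs the equivalent normalization internally.

```python
import numpy as np

# Hypothetical alphabet and raw network scores for one character position;
# accented letters are included because the target language is Portuguese.
alphabet = list("abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõú")
logits = np.random.randn(len(alphabet))

# Softmax maps the scores to probabilities between 0 and 1 that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

best = int(np.argmax(probs))
print(f"predicted '{alphabet[best]}' with confidence {probs[best]:.2f}")
```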
In these more specific stages, related to optical character recognition, two main technologies from the Kraken ecosystem were used: eScriptorium and Ketos. First, after choosing the documents, the steps that make up the construction of the database are carried out in the eScriptorium web application, which pre-processes, analyzes and recognizes the text and also creates transcription files for each word and letter found. Figure 5 shows how the stages are divided according to each technology.
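Under the hood, eScriptorium drives the kraken engine; as a rough sketch only, the same pre-processing, layout analysis and recognition steps can be reproduced through kraken's legacy Python API. The file names and the recognition model below are placeholders, and attribute names vary between kraken versions.

```python
from PIL import Image
from kraken import binarization, pageseg, rpred
from kraken.lib import models

im = Image.open("manuscript_page.png")           # placeholder scan
bw = binarization.nlbin(im)                      # pre-processing: binarization
seg = pageseg.segment(bw)                        # layout analysis: text lines
net = models.load_any("pt_handwriting.mlmodel")  # placeholder recognition model
for line in rpred.rpred(net, bw, seg):           # recognition, line by line
    print(line.prediction, line.confidences)     # text and per-character scores
```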
Next, the ALTO files are compiled by Ketos into an Arrow file, which the program uses to train faster than it could by working directly with the images and the XML files containing coordinates and transcriptions. This compilation stage divides the data into custom splits for training, validation and testing, which in this work follow the proportions of 0.75, 0.15 and 0.15 respectively. Finally, after compilation and division, training is carried out using the Ketos training command with a pre-trained base neural network. We also used the SGD optimizer, a learning rate of 0.001 and the tool's own data augmentation.
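Both Ketos steps can be scripted as below. This is a sketch assuming a recent kraken release: the flag spellings (--random-split, --fixed-splits, -q early and so on) should be checked against the installed ketos version, and the base model file name is a placeholder.

```python
import glob
import subprocess

# Compile ALTO XML and image pairs into one binary Arrow dataset with
# fixed train/validation/test splits (the proportions used in this work).
subprocess.run(
    ["ketos", "compile", "-f", "xml",
     "--random-split", "0.75", "0.15", "0.15",
     "-o", "dataset.arrow", *sorted(glob.glob("alto/*.xml"))],
    check=True,
)

# Fine-tune a pre-trained base model with SGD, learning rate 0.001, the
# built-in data augmentation and early stopping on the validation metric.
subprocess.run(
    ["ketos", "train", "-f", "binary", "--fixed-splits",
     "-i", "base_model.mlmodel",  # placeholder base network
     "--optimizer", "SGD", "-r", "0.001",
     "--augment", "-q", "early",
     "dataset.arrow"],
    check=True,
)
```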
For the number of training epochs, an early stopping strategy was chosen, which halts training once validation accuracy has dropped a set number of times at the end of successive epochs. This technique avoids overfitting (a common problem in machine learning and neural networks in which a model fits the training data too closely; the model becomes very specific to the training data and is often unable to generalize to new, unseen data, leading to poorer performance in real-world situations) with the data used, and it works well in cases of fine-tuning (the process of adjusting a pre-trained machine learning model to a specific task or dataset, taking advantage of the model's prior knowledge).
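A minimal sketch of this stopping rule, with placeholder training and validation callbacks (ketos implements the same logic internally through its early-stopping options):

```python
def train_with_early_stopping(train_one_epoch, validate, patience=5):
    """Stop once validation accuracy fails to improve `patience` epochs in a row."""
    best_acc, bad_epochs, epochs = 0.0, 0, 0
    while bad_epochs < patience:
        epochs += 1
        train_one_epoch()                  # one pass over the training split
        acc = validate()                   # accuracy on the validation split
        if acc > best_acc:
            best_acc, bad_epochs = acc, 0  # improvement: reset the counter
        else:
            bad_epochs += 1                # another drop or stall in accuracy
    return best_acc, epochs
```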
5.1 Database and Transcripts
An extremely important factor in the effectiveness of a neural network is the data on which it is trained. For this research, we could not find a dataset with an available catalog and transcriptions of historical documents handwritten in Portuguese. Therefore, it was necessary to create a database of transcriptions to test how the model would adapt to Old Manuscript Portuguese.
The first step was to choose the documents needed to carry out the procedures. The criteria favored files with good reading quality and few defects, in order to facilitate and speed up the data cataloging process. The main objective of the research is not to create a large, complex database for use by other OCR training models; although that would be possible, it would require expanding the overall scope of the research.
Using the search engines and filters of the libraries' websites, a small set of documents with good reading quality was chosen so that there would be no doubts at the time of transcription, thus avoiding divergences caused by incorrect classification.
After the selection, the documents were processed to remove artifacts, deterioration, stains and other problems that make it difficult to extract only the text. This step is handled by non-linear binarization, which uses methods that take into account the relationship between neighbouring pixels and the distribution of intensities in the image, rather than a fixed threshold value. This stage highlights the letters written on the page and eliminates most of the problems of historical manuscripts, such as ghosting and bleed-through.
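For contrast with a fixed global threshold, the following is a small sketch using kraken's non-linear binarizer; the input path is a placeholder.

```python
from PIL import Image
from kraken.binarization import nlbin

im = Image.open("page.png").convert("L")  # placeholder path to a scanned page

# A fixed global threshold compares every pixel to a single value and tends
# to preserve stains, ghosting and bleed-through on historical pages.
fixed = im.point(lambda p: 255 if p > 127 else 0)

# kraken's nlbin instead estimates foreground and background locally, from
# neighbouring pixels and the page's intensity distribution.
adaptive = nlbin(im)
adaptive.save("page_bin.png")
```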
After the binarization stage, the images go through a layout analysis and segmentation process in order to eliminate the remaining non-textual components and ensure that only the text is catalogued and used in the transcriptions. This is done by another neural network, internal to eScriptorium and pre-trained, which has a very interactive abstraction on the website and requires no configuration other than file selection and segmentation initialization. The infor-