word search takes into account possible small OCR
errors, i.e. it allows flexible similarity matching
(see Ha (2019) for details). The data annotation
module searches for structural data types such as a
date, VAT number, or legislation reference using reg-
ular expressions. In each contract, entity mentions
(e.g. an organization (ORG), a person (PER), or a lo-
cation (LOC)) play an important role, especially in
contract party detection. OCRMiner currently uses a
named entity recognition module based on the Slavic
BERT model for 4 languages (Bulgarian, Czech, Pol-
ish, and Russian) (Arkhipov et al., 2019), which ex-
tends the multilingual BERT model by adding a CRF
layer tuned for Slavic languages using Wikipedia and
news articles. To improve address recognition, an
extra module based on the global address parser Libpostal
(Barrentine et al., 2020) is used to detect parts
of addresses, such as road/street name, postcode, city,
state, or country.
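The first two annotation steps can be illustrated with a minimal Python sketch. It is not OCRMiner's implementation: the regular expressions, the function names `annotate` and `fuzzy_find`, and the similarity threshold are assumptions made for this example, and the similarity measure is the standard library's `difflib.SequenceMatcher` rather than the matching described in (Ha, 2019).

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns for illustration only; the expressions actually
# used by OCRMiner are not given in the text.
DATE_RE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b")  # e.g. 12. 3. 2019
VAT_RE = re.compile(r"\bCZ\d{8,10}\b")  # Czech VAT number: CZ + 8 to 10 digits


def annotate(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    spans = []
    for label, pattern in (("DATE", DATE_RE), ("VAT_NUMBER", VAT_RE)):
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end(), match.group()))
    return sorted(spans, key=lambda span: span[1])


def fuzzy_find(keyword, words, threshold=0.8):
    """Return indices of words similar enough to the keyword.

    The threshold is an assumed value; it tolerates a small number of
    character-level OCR errors such as 'I' recognized instead of 'l'.
    """
    target = keyword.lower()
    return [
        index
        for index, word in enumerate(words)
        if SequenceMatcher(None, target, word.lower()).ratio() >= threshold
    ]
```

For instance, `fuzzy_find("smlouva", ["Tato", "smIouva", "je"])` still locates the keyword although the OCR output contains a capital "I" in place of "l".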
After annotation, the logical structure analysis assigns
each block a block type, applying a set of logical rules
to the information gained in the preceding steps.
These rules are human-readable and easy to edit. The
reasoning mimics the decisions a human would make
when visually inspecting the document.
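As an illustration of such a rule, the condition for the “party” block type, which is stated later in this section, could be written as follows. This is a sketch under assumptions: the identifiers, the representation of a block as a collection of keyword groups plus a list of entity labels, and the fallback type "other" are hypothetical, and the two required entities are counted across all listed classes.

```python
# Keyword groups and entity classes follow the "party" rule given in the
# text; the identifiers and the data layout are assumptions of this sketch.
PARTY_KEYWORD_GROUPS = {
    "organization", "address", "contact_person",
    "company_id", "vat_number", "bank_information",
}
PARTY_ENTITY_CLASSES = {"PER", "ORG", "LOC", "CITY", "COUNTRY", "VAT_NUMBER"}


def is_party_block(keyword_groups, entity_labels):
    """True if the block has a party keyword or at least two party entities."""
    has_keyword = bool(PARTY_KEYWORD_GROUPS & set(keyword_groups))
    entity_count = sum(1 for label in entity_labels if label in PARTY_ENTITY_CLASSES)
    return has_keyword or entity_count >= 2


# Rules are kept as (block type, predicate) pairs so that they stay easy
# to read and edit; the first matching rule determines the type.
RULES = [("party", is_party_block)]


def assign_block_type(keyword_groups, entity_labels, default="other"):
    for block_type, rule in RULES:
        if rule(keyword_groups, entity_labels):
            return block_type
    return default
```

Keeping the rules in a plain list means that a new block type can be added by appending one more pair, without touching the dispatch logic.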
The information extraction module concludes the
processing and presents the identified pieces of information.
For each extracted item, the module first looks
for the item “anchor” in the text, i.e. the corresponding
keywords or blocks. Then, in the surroundings
of the keyword position, the algorithm searches for
the appropriate data type, e.g. a “date” for the invoice
date item. The search area is limited to the text next
to the keyword on the same text line, the text line
to the right, or the text line below it. The exact position
of the item value is decided by a score weighting function
based on the criteria that the block/line contains the
data type and does not contain other keywords. Some
types of data, such as ORG(anization), PER(son), VAT
number, or legislation references, can be found without
keywords. Contract parties are extracted only
from blocks identified as the block type “party”,
i.e. a block containing at least one keyword from the
groups organization, address, contact person, company
id, VAT number, or bank information, or at least
two named entity entries of the corresponding classes
(PER, ORG, LOC, CITY, COUNTRY, VAT NUMBER).
Before parsing a party’s information in a block,
text blocks that may belong to the same party but that
are separated either by physical distance or by covered
lines in the block are joined together using logical
rules. The principle here is that if consecutive
blocks contain non-overlapping parts of a party’s information,
then they should be merged together. Each
extracted party is assigned a confidence score corresponding
to the amount of identified labeled information
(ORG, PER, VAT number, company id, or role)
in the block.

Table 1: Text statistics of the evaluation contract dataset.
                   dev         test        total
documents           10          102          112
pages               36          589          625
blocks             430        8,451        8,881
lines           16,587    2,426,298    2,442,885
words          147,154    4,911,953    5,059,107
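The keyword-anchored lookup of an item value described above can be sketched as follows. The sketch is illustrative only: the `Line` structure, the position weights, the tolerances, and the halving penalty for lines that contain other keywords are all assumptions, since the text does not specify the actual score weighting function. Coordinates are assumed to grow to the right and downwards.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Line:
    """A text line with its position and annotations (hypothetical layout)."""
    text: str
    x: float  # left edge
    y: float  # top edge, growing downwards
    types: frozenset = frozenset()     # annotated data types, e.g. {"DATE"}
    keywords: frozenset = frozenset()  # item keywords found in the line


# Assumed weights: the keyword's own line is preferred, then the line to
# the right, then the line below.
POSITION_WEIGHTS = {"same_line": 3.0, "right": 2.0, "below": 1.0}


def position_of(anchor, line, y_tol=5.0, x_tol=20.0, max_gap=40.0) -> Optional[str]:
    """Classify a line relative to the anchor, or return None if out of range."""
    if line is anchor:
        return "same_line"
    if abs(line.y - anchor.y) <= y_tol and line.x > anchor.x:
        return "right"
    if 0 < line.y - anchor.y <= max_gap and abs(line.x - anchor.x) <= x_tol:
        return "below"
    return None


def find_value(anchor: Line, lines: Sequence[Line], data_type: str) -> Optional[Line]:
    """Return the best-scoring line that holds the requested data type."""
    best, best_score = None, 0.0
    for line in lines:
        position = position_of(anchor, line)
        if position is None or data_type not in line.types:
            continue
        # Lines carrying keywords of other items are less likely to hold
        # the value of this item, so their score is halved (assumed penalty).
        other_keywords = line.keywords - anchor.keywords
        score = POSITION_WEIGHTS[position] * (0.5 if other_keywords else 1.0)
        if score > best_score:
            best, best_score = line, score
    return best
```

With an anchor line containing the invoice date keyword, a date line to its right, and a date line below that also carries a due date keyword, `find_value` returns the line on the right, because it has both the higher position weight and no competing keyword.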
4 EXPERIMENTS
4.1 Dataset
The dataset used for development and evaluation of
the contract analysis module of OCRMiner comes
from the official state registry of Czech public contracts².
The data obtained from the website include
contract texts (in PDF) and metadata files (in XML).
The registry contains not only contracts but also ap-
pendices, price lists, invoices, etc. Therefore, a 2-step
filter is applied to select contracts only. The first
step automatically filters out documents based on the
filename and the text content. The filename usually
reflects the content, so files with names containing
‘obj’ (“objednávka”, order), ‘ceník’ or ‘cenová
nabídka’ (price list), or ‘příloha’ (appendix) have been
removed. The remaining files have then been converted
into OCR text. If the text does not contain the keyword
‘smlouva’ (contract), the document is also
filtered out. The second step involves a manual check.
Finally, 112 contracts were selected randomly
to be annotated (by one annotator) as the gold standard
data for the thorough evaluation. Ten documents are used
as a development set and the remaining ones form the
test set. Text statistics of the final datasets are listed
in Table 1.
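The automatic first step of this filter can be sketched in a few lines of Python. The substrings and the required keyword are taken from the description above; everything else is an assumption of this sketch, in particular the function names, the `ocr` callable standing in for the OCR conversion, and the case folding and removal of diacritics, which are added here so that names such as "priloha_1.pdf" are also caught.

```python
import unicodedata

# Substrings and keyword from the text, written without diacritics because
# all strings are normalized before comparison (an assumption of this sketch).
EXCLUDED_NAME_PARTS = ("obj", "cenik", "cenova nabidka", "priloha")
REQUIRED_KEYWORD = "smlouva"


def normalize(text):
    """Case-fold the text and strip combining diacritical marks."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def passes_filename_filter(filename):
    name = normalize(filename)
    return not any(part in name for part in EXCLUDED_NAME_PARTS)


def passes_text_filter(ocr_text):
    return REQUIRED_KEYWORD in normalize(ocr_text)


def select_candidates(filenames, ocr):
    """Return files that pass both filters.

    `ocr` is a hypothetical callable mapping a filename to its OCR text;
    it is only invoked for files that survive the filename filter.
    """
    return [
        name for name in filenames
        if passes_filename_filter(name) and passes_text_filter(ocr(name))
    ]
```

The files returned by `select_candidates` would then go to the manual check of the second step.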
Although the contract metadata are available, a
further step is still needed to prepare the gold standard
data for evaluation. Firstly, the metadata do
not contain all the information that is to be extracted,
such as a representative person or the role of a contract
party. Secondly, since the registry metadata were entered
manually through the available forms, they are
in different formats compared to the contract text, especially
the dates and addresses. Thirdly, some pieces
of information appear in the metadata but not in the
² https://smlouvy.gov.cz/