Contract Metadata Identification in Czech Scanned Documents

Hien Thi Ha, Aleš Horák and Minh Tuan Bui

NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic

Le Quy Don Technical University, Vietnam

Keywords:

Information Extraction, Scanned Documents, Document Metadata, Contract Metadata Extraction, Czech.

Abstract:

Although digital-born documents are generally prevalent nowadays, the exchange of business documents often consists in processing their scanned image form as a general human-readable format with one-to-one correspondence to paper documents. Bulk processing of such scanned documents then requires human intervention to extract and enter the main document metadata. In this paper, we present the design and evaluation of a contract processing module in the OCRMiner system. The information extraction process combines layout properties with text analysis as input to a rule-based extraction with confidence score propagation. The first results are evaluated with public Czech contract documents, reaching an item extraction accuracy of almost 88%.

1 INTRODUCTION

A contract is a legally binding document that recognizes and governs the rights and duties of the parties to an agreement (Ryan, 2006). Organizations such as companies, institutions, or governmental offices must monitor and handle contracts for a wide range of tasks (Milosevic et al., 2004). These include checking whether obligations binding on a party, e.g. payments, are fulfilled, tracking taxation duties of valuable contracts, or notifying about the effects of legislation amendments. An important part of such tasks can be automated by extracting contract metadata such as the parties involved, dates, or legislation references. However, these pieces of information are mostly entered into management systems manually, which is costly and time-consuming.

In a previous work (Ha et al., 2018), the OCR-

Miner system designed to process scanned invoices

based on the combination of layout and text analysis

was presented. In the current work, we adapt the sys-

tem to extract metadata elements from contracts based

on a small development set. We also offer an evalua-

tion with detailed analysis of errors.

The next section gives an overview of the state of the art in the legal document processing domain. Section 3 presents a description of the system components with the adaptation to contractual documents. In Section 4, we offer a detailed evaluation of the system with a Czech contract dataset.

https://orcid.org/0000-0001-6348-109X

2 RELATED WORKS

Research in legal document content classiﬁcation re-

cently focuses on extracting and classifying clauses,

particularly deontic clauses (obligations, prohibitions,

and permissions). (Neill et al., 2017) classify deontic

clauses using an ensemble of bidirectional long short-

term memory networks (BiLSTMs) with the inputs

of Google news embeddings. They trained speciﬁc

legal domain word and phrase embeddings and com-

pared the result with other neural and non-neural clas-

siﬁers. In a similar task, (Chalkidis et al., 2018) use

word embeddings and part-of-speech (POS) tag em-

beddings trained on an English contract dataset and

pre-trained token shape embeddings. The network is

also based on BiLSTM but in a hierarchical architec-

ture along with self-attention mechanism to improve

training time and accuracy of the classiﬁers.

In terms of information extraction, (Kwok and Nguyen, 2006) proposed a general template-based framework to extract data from PDF contracts. A pattern for each contract data item in a contract type includes a data tag, the number of words and the location (page, paragraph, line, and word numbers). A document type, which is determined by the beginning and ending patterns, identifies a pattern matrix and a list of contract data tags. The paper gives no specific example of either a contract data pattern or a document type pattern to illustrate the idea.

Ha, H., Horák, A. and Bui, M.
Contract Metadata Identification in Czech Scanned Documents.
DOI: 10.5220/0010243807950802
In Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART 2021) - Volume 2, pages 795-802
ISBN: 978-989-758-484-8

In (Winter and Rinderle-Ma, 2018) and (Drag-

oni et al., 2016), natural language processing (NLP)

techniques are used to detect constraints and their

relations, or rules in legal documents. In the for-

mer, constraints are detected by modal verbs (shall,

should, must). These constraints are grouped by either term frequencies or related subjects based on sentence structure or external information. In each group, the similarity between each pair of constraints is computed to detect redundant, subsumed, and conflicting constraints using the cosine distance between the word vectors of the two constraints. (Dragoni et al., 2016) use NLP tools to

extract rules from legal text. First, they identify deon-

tic components (prohibition, permission, obligation)

using a deontic lightweight ontology. Then, these

components are combined to create rules using a pat-

tern based model.

The most closely related works are (Chalkidis et al., 2017; Chalkidis and Androutsopoulos, 2017). In these works, the authors approach the extraction of contract elements such as contract title, clause headings, parties, dates, values, or legislation references as a sequence labeling task, similar to e.g. named entity recognition (NER). A separate sliding window classifier is used for each element type to classify each token of pre-defined extraction zones as positive if it is a part of a contract element and negative otherwise. In the former work, they use Logistic Regression or linear Support Vector Machine (SVM) models. The features involve word embeddings and POS tag embeddings, both pre-trained on a contract dataset, plus hand-crafted features. With the same approach, but using BiLSTM-based models instead of linear ones and with the hand-crafted features replaced by token-shape embeddings, the latter work improves the previous result. Their best macro-averaged F1 score is 0.88 using a relaxed match. For the contract parties, only the organization name is extracted. The extraction zones, which span up to 20 tokens before and after a specified keyword, are explicitly marked in each training and test contract. The system also needs a large amount of annotated data for training the models.

3 METHODOLOGY

The OCRMiner system pipeline is illustrated in Figure 1. Modules specific for invoice analysis and information extraction were introduced and evaluated in (Ha et al., 2018). Each piece of information is extracted based on a weighted combination of layout and text analysis. The text analysis involves a series of annotations to detect keywords and data types based on either patterns or learning models. Firstly, the contract image is processed by an OCR tool to obtain words and word positions (bounding boxes). Then, the physical layout, including hierarchical elements (lines and blocks) of the pages, block positions in the page and relative positions with respect to neighboring blocks, is built by the layout analysis module.

Figure 1: The processing pipeline.

From this point, annotations are added by annotation modules. They involve the title, keywords, structural data types, named entities, and parts of addresses. For example, the title text is detected by three characteristics: the biggest font size, usually center alignment, and the presence of the keyword ‘contract’ (‘smlouva’ or its variants in Czech). The first two features are based on layout attributes. For the last one, the text lines are parsed to obtain words and their base forms (lemmata) before searching for the title keyword. Each characteristic increases the confidence score of the item detection. Finally, the candidate with the highest confidence score is marked as the title. If more than one candidate has the same confidence score, then the first candidate in the reading order, i.e. the one closest to the top of the page, is selected.
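The title selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Line` structure, the equal weight of one point per characteristic, and the function names are hypothetical and are not taken from the OCRMiner implementation.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    lemmas: list        # lemmata of the words in the line
    font_size: float
    centered: bool
    order: int          # position in reading order (0 = top of the page)


def title_score(line, max_font_size, keyword="smlouva"):
    """Count satisfied title characteristics (equal weights are an assumption)."""
    score = 0
    if line.font_size == max_font_size:   # biggest font on the page
        score += 1
    if line.centered:                     # center alignment
        score += 1
    if keyword in line.lemmas:            # title keyword found among lemmata
        score += 1
    return score


def pick_title(lines):
    """Return the best-scoring line; ties go to the first line in reading order."""
    if not lines:
        return None
    max_font = max(l.font_size for l in lines)
    return min(lines, key=lambda l: (-title_score(l, max_font), l.order))
```

Sorting by the pair (negative score, reading order) implements the tie-breaking rule directly: among candidates with the same confidence score, the one closest to the top of the page wins.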

Keyword annotation looks for markers of the desired data, for example ‘contract number’ (‘smlouva číslo’), ‘date’ (‘dne’), ‘address’ (‘se sídlem’), etc. The list of keywords is prepared based on the most frequent words and word bigrams of the contract dataset, adapted using the development set. The keyword search takes into account possible small OCR errors, i.e. it allows a flexible similarity matching (see (Ha, 2019) for details); the open source OCR system Tesseract (Smith, 2007; Smith et al., 2020) is currently used in OCRMiner. The data annotation module searches for structural data types such as a date, VAT number, or legislation reference using regular expressions. In each contract, entity mentions (e.g. an organization (ORG), a person (PER), or a location (LOC)) play an important role, especially in contract party detection. OCRMiner currently uses a named entity recognition module based on the Slavic BERT model for 4 languages (Bulgarian, Czech, Polish, and Russian) (Arkhipov et al., 2019), which extends the multilingual BERT model by adding a CRF layer tuned for Slavic languages using Wikipedia and news articles. To improve address recognition, an extra module based on the global address parser Libpostal (Barrentine et al., 2020) is used to detect parts of addresses, such as road/street name, postcode, city, state, or country.
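To illustrate how a keyword search can tolerate small OCR errors, the following sketch compares each word window of a text line with the keyword list and accepts the best candidate above a similarity threshold. It is not the matching method of (Ha, 2019): the similarity measure (`difflib.SequenceMatcher` from the Python standard library), the threshold of 0.8, and the names `KEYWORDS` and `find_keyword` are assumptions made for this example.

```python
from difflib import SequenceMatcher

# Keyword examples from the text; the label names are hypothetical.
KEYWORDS = {
    "contract_number": ["smlouva číslo"],
    "date": ["dne"],
    "address": ["se sídlem"],
}


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_keyword(text, threshold=0.8):
    """Return (label, keyword, score) of the best fuzzy match, or None."""
    words = text.split()
    best = None
    for label, variants in KEYWORDS.items():
        for keyword in variants:
            n = len(keyword.split())
            # Slide a window with as many words as the keyword has.
            for i in range(len(words) - n + 1):
                window = " ".join(words[i:i + n])
                score = similarity(window, keyword)
                if score >= threshold and (best is None or score > best[2]):
                    best = (label, keyword, score)
    return best
```

With this sketch, an OCR output such as "se sidlem Brno" (with the diacritic lost) is still assigned to the address keyword, while unrelated text yields no match.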

After the annotations, each block is assigned a

block type in the logical structure analysis based on

the information gained in the preceding steps using a

set of logical rules. These rules are human readable

and easy to edit. The reasoning here mimics the hu-

man decisions based on visual inspection of the doc-

ument.

The information extraction module concludes the processing to present the identified pieces of information. For each extracted item, the module first looks for the item “anchor” in the text, i.e. the corresponding keywords or blocks. Then, in the surroundings of the keyword position, the algorithm searches for the appropriate data type, e.g. a “date” for the invoice date item. The surroundings are limited to the text next to the keyword on the same text line, the text line to the right, or the text line below it. The exact position of the item value is decided by a score weighting function that checks that the block/line contains the data type and does not contain other keywords. Some types of data can be found without keywords, such as ORG(anization), PER(son), VAT number, or legislation references. Contract parties are extracted only from blocks identified as the block type “party”, i.e. a block containing at least one keyword from the group of organization, address, contact person, company id, VAT number, or bank information, or at least two named entity entries of the corresponding classes (PER, ORG, LOC, CITY, COUNTRY, VAT NUMBER). Before parsing a party’s information in a block, text blocks that may belong to the same party but that are separated either by physical distance or by covered (blacked-out) lines in the block are joined together using logical rules. The principle here is that if consecutive blocks contain non-overlapping parts of a party’s information, then they should be merged together. Each extracted party is assigned a confidence score corresponding to the amount of identified labeled information (ORG, PER, VAT number, company id, or role) in the block.

Table 1: Text statistics of the evaluation contract dataset.

               dev       test      total
documents       10        102        112
pages           36        589        625
blocks         430      8,451      8,881
lines       16,587  2,426,298  2,442,885
words      147,154  4,911,953  5,059,107
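The “party” block rule and the party confidence score from this section can be sketched as follows. The label names, the function names, and the normalisation of the score to the range 0 to 1 are assumptions for illustration; the text only states that the score corresponds to the amount of identified labeled information.

```python
# Keyword groups and entity classes named in the text (identifiers are hypothetical).
PARTY_KEYWORD_GROUPS = {
    "organization", "address", "contact_person",
    "company_id", "vat_number", "bank_information",
}
PARTY_ENTITY_CLASSES = {"PER", "ORG", "LOC", "CITY", "COUNTRY", "VAT_NUMBER"}
SCORED_LABELS = ("ORG", "PER", "VAT_NUMBER", "COMPANY_ID", "ROLE")


def is_party_block(keyword_groups, entity_labels):
    """A block is a party block if it has at least one party keyword
    or at least two named entity entries of the listed classes."""
    has_keyword = any(g in PARTY_KEYWORD_GROUPS for g in keyword_groups)
    n_entities = sum(1 for e in entity_labels if e in PARTY_ENTITY_CLASSES)
    return has_keyword or n_entities >= 2


def party_confidence(found_labels):
    """Fraction of the scored label types found in the block
    (the normalisation is an assumption of this sketch)."""
    present = set(found_labels) & set(SCORED_LABELS)
    return len(present) / len(SCORED_LABELS)
```

For example, a block with an organization name and a person name but no keyword still qualifies as a party block, and it receives a score of 0.4 because two of the five scored label types are present.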

4 EXPERIMENTS

4.1 Dataset

The dataset used for development and evaluation of the contract analysis module of OCRMiner comes from the official state registry of Czech public contracts. The data obtained from the website include contract texts (in PDF) and metadata files (in XML). The registry contains not only contracts but also appendices, price lists, invoices, etc. Therefore, a 2-step filter is applied to select contracts only. The first step automatically filters out documents based on the filename and the text content. The filename usually reflects the content, so files with names containing ‘obj’ (“objednávka” – order), ‘ceník’ or ‘cenová nabídka’ (price list), or ‘příloha’ (appendix) have been removed. The remaining files have then been converted into text by OCR. If the text does not contain the keyword ‘smlouva’ (contract), then the document is also filtered out. The second step involves a manual check. Finally, 112 contracts were selected randomly and annotated (by one annotator) as the gold standard data for the thorough evaluation. Ten documents are used as a development set and the remaining ones form the test set. Text statistics of the final datasets are listed in Table 1.

Although the contract metadata are available, a further step is still needed to prepare the gold standard data for evaluation. Firstly, the metadata do not contain all the information that is to be extracted, such as a representative person or the role of a contract party. Secondly, since the metadata in the registry (https://smlouvy.gov.cz/) were entered manually through the available forms, they are in different formats compared to the contract text, especially the dates and addresses. Thirdly, some pieces of information appear in the metadata but not in the contract text. For example, contract numbers in some cases are not stated in the original contract but in the metadata only. Moreover, in many contracts, private information is covered, such as an account number or contact details. So, after converting the registry metadata file into the desired format, the data are manually examined before becoming the ground truth for the evaluation.

Table 2: Identified items in the contract texts.

Item             in dev  in test  Example
title                 9      102  “Smlouva o poskytování služeb” (supply of services contract)
contract type        10      100  “poskytování služeb” (supply of services)
legislation          33      547  “§ 1746 a násl. zákona č. 89/2012 Sb., občanský zákoník” (§ 1746 et seq. Act No. 89/2012 Coll., Civil Code)
contract number       7       58  “VODA/ZA20-4023”
contract date         8       78  10.1.2020
company name         13      175  “TESCO SW a.s.”
representative       13      164  “Josefem Tesaříkem” (by Josef Tesařík)
address              21      194  “tř. Kosmonautů 1288/1, Hodolany, Olomouc, PSČ 779 00”
vat number           10      102  “CZ699000785”
company id           19      191  “25892533”
bank name             6       56  “Česká spořitelna, a.s.”
account number        4       49  “1303699319/0800”
role                 19      194  “poskytovatel” (supplier)

4.2 Information to Extract

The detected and extracted pieces of the contract information are summarized in Table 2. Specifically, the contract date is the date by which all parties have signed the contract. Usually, it appears at the end of the contract, before the signatures. If the signature dates are different, then the later one is extracted. A contract party is a group of information involving the organization, address, company id, VAT number, a company representative, the party role in the contract, bank name, or an account number. The party role is often stated at the beginning of the party text block, e.g. ‘zhotovitel’/contractor and ‘objednatel’/customer, or after the keyword ‘dále jen’/hereinafter. A full example of information extracted from the first page of a contract is illustrated in Figure 4 in the Appendix.
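The rule for the contract date (take the later date when the signature dates differ) can be illustrated with a short sketch. The regular expression for the Czech day.month.year notation and the function names are assumptions of this example, not the patterns used in OCRMiner.

```python
import re
from datetime import date

# Czech numeric dates such as "10.1.2020" or "15. 1. 2020" (assumed pattern).
DATE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b")


def parse_dates(text):
    """Return all valid calendar dates found in the text."""
    found = []
    for day, month, year in DATE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Skip impossible dates, e.g. produced by OCR errors.
            pass
    return found


def contract_date(signature_text):
    """Return the latest signature date, or None if no date is found."""
    dates = parse_dates(signature_text)
    return max(dates) if dates else None
```

Applied to a signature area containing "dne 10.1.2020" for one party and "dne 15. 1. 2020" for the other, the sketch returns 15 January 2020.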

4.3 Results

Within the evaluation process, each piece of extracted information is evaluated as a match, a partial match, or a mismatch. For all fields except organization and address, a match means an exact match. For these two exceptions, a match may ignore a piece of the gold standard information which is not crucial for the identification of the company or address.

For example, the following occurrences of a full official organization name are considered a match:

ground truth: Czech Airlines Technics, a.s.
extracted: Czech Airlines Technics

Unlike the previous use case of invoice information extraction, where the parties can always be classified into a seller and a buyer, in contracts the number of parties is not predetermined. Therefore, the extraction process needs to take each text block containing an organization’s information as possible contract party information. In the evaluation phase, each gold standard contract party is compared to each extracted party. The result is recorded for the party having the most information in common. This means that the evaluation does not search for each piece of individual contract party information in the extracted data as a whole and return a match if such a piece is found. The evaluation is strict here in the sense that even if a sought piece of information is extracted but in a different party block, the result is a mismatch. Some works use a relaxed match, i.e. if the extracted information matches the ground truth above a threshold, e.g. 80%, then it is considered a (relaxed) match. The importance of the missing piece is ignored there. To give an example, if a contract date ground truth is “1.12.2019” and the extracted date is “31.12.2019”, that makes only a 10% difference. However, in the context, the second date was meant as the payment due date, so it should be considered a mismatch instead of a match. Due to such complications, the evaluation is first pre-processed automatically using approximate comparisons based on the Levenshtein distance and then examined manually.

Table 3: Test set evaluation results.

Result          Items  Percentage
Match           1,631      81.14%
Partial match     137       6.82%
Mismatch          242      12.04%
Total           2,010     100.00%
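The date example above can be reproduced with a plain Levenshtein distance. The sketch below shows why a purely relaxed match is risky and why close candidates are passed on to manual examination; the normalisation by the longer string, the threshold of 0.2, and the function names are assumptions of this example.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def difference_ratio(gold, extracted):
    """Edit distance normalised by the length of the longer string."""
    longest = max(len(gold), len(extracted))
    return levenshtein(gold, extracted) / longest if longest else 0.0


def preclassify(gold, extracted, relaxed_threshold=0.2):
    """Automatic pre-processing step; close candidates go to manual examination."""
    if gold == extracted:
        return "match"
    if difference_ratio(gold, extracted) <= relaxed_threshold:
        return "needs manual check"
    return "mismatch"
```

For “1.12.2019” and “31.12.2019” the edit distance is 1 and the longer string has 10 characters, which gives the 10% difference mentioned in the text. A relaxed match at 80% similarity would accept this pair, whereas the sketch only flags it for manual examination.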

The evaluation results of the OCRMiner contract

module with the test set are presented in Table 3. Al-

together, almost 88% of gold standard information

was extracted, with 81% in the exact expected form

and approx. 7% with minor differences. Just 12% of

items were not identiﬁed or identiﬁed wrongly.
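The percentages above follow directly from the item counts in Table 3, as the following short computation shows (the variable names are chosen for this example only).

```python
# Item counts from Table 3 (test set).
counts = {"match": 1631, "partial match": 137, "mismatch": 242}

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Items extracted exactly or with minor differences.
extracted = round(100 * (counts["match"] + counts["partial match"]) / total, 2)
```

The total is 2,010 items, and the share of matches plus partial matches is 1,768 / 2,010, i.e. 87.96%, which is the "almost 88%" reported in the text.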

A detailed evaluation of the individual item types

is illustrated in Figure 2. Addresses and contract types

have the highest accuracy of 94.3% and 93% respec-

tively. In contrast, contract dates and party roles dis-

play the highest number of mismatches with 30.8%

and 26.3%. The legislation reference ﬁeld contains

the highest number of minor errors (partial matches)

of 15.7%.

In the following section, a detailed error analysis

of 50 contracts in the test set identiﬁes and explains

the causes of both minor and major mismatches.

4.4 Error Analysis

In the OCRMiner extraction pipeline, the data to ex-

tract are identiﬁed by keywords, data format or text

position. If a keyword is found, then the extraction

module looks for the appropriate data item around

the keyword based on the visual layout, especially in

relation to the keyword position. Non-keyword data

are detected by a pattern (e.g. the VAT number) or a

pre-trained model (e.g. organization or person name).

Therefore, the error causes are classiﬁed into differ-

ent categories: OCR errors, keyword error (there is

either no keyword in the text or a new keyword which

did not appear in the development set), layout error,

named entity recognition (NER) error, block misiden-

tiﬁcation (extracted in another block), and others.

Figure 2: Evaluation of each field: a) by item, and b) by percentage.

Table 4: Error analysis of partial matches.

Error type    items    in %
OCR error        18   39.13
NER               9   19.57
Multi-lines       5   10.87
Pattern           9   19.57
Other             5   10.87
Total            46  100.00

Table 5: Error analysis of mismatches.

Error type         items    in %
In another block       7    6.31
OCR error             31   27.93
Keyword               27   24.32
Layout                10    9.01
NER                   12   10.81
Pattern                7    6.31
Title                  6    5.41
Other                 11    9.91
Total                111  100.00

A layout error means the keyword is found but the data text line is not found in the expected relative position, either due to a typing error or the layout match criteria. In the detection of a company name, a keyword is often elided, thus the extraction relies on the NER annotation or the company name’s ending. However, the dataset originates in the public sector where many parties are public organizations of a specific area (e.g. city, village, etc.). In consequence, NER recognizes only part of the organization name as a location instead of the whole chunk as an organization, for example in ‘Město Hostinné’ (Hostinné town) or ‘Služby města Náměště nad Oslavou’ (Town services of Náměšť nad Oslavou). As mentioned above, in the evaluation of parties the comparison is made for the whole group instead of searching for each piece of information separately, which causes a mismatch when a piece of information is correctly extracted but assigned to a different block. Furthermore, as described in Section 3, the contract title is extracted using 3 criteria involving the font size, the central alignment and a keyword. However, in some cases, the title can be left aligned, or the biggest font is a part of the text in the logo or another line. The combination of these errors falls into the title category. Legislation references are identified by flexible patterns consisting of 3 parts: the section mark (§), the paragraph ID (‘odst. X’), and the act or law ID. Not all 3 parts are obligatory. The act or law is identified by number, e.g. 89/2012, or name (‘občanský zákoník’/Civil Code). Full examples are:

• § 1746 odst. 2 zákona č. 89/2012 Sb., občanský zákoník (§ 1746 par. 2 of Act No. 89/2012 Coll., Civil Code)
• § 2586 a násl. zák. č. 89/2012 Sb., občanského zákoníku (§ 2586 et seq. Act No. 89/2012 Coll., Civil Code)
• § 92a zákona o dani (§ 92a of the Tax Act)
• č. 340/2015 Sb. (No. 340/2015 Coll.)

These patterns are based on findings in the development set with extra flexibility; however, some test set cases still remained uncovered, such as more than one section in a legislation reference (§27 a 31 zákona 134/2016 Sb.), or the connection word ‘násl.’/seq. written in the full form (‘následujících’/sequentes).
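A flexible pattern of this kind might look like the regular expression below. It is a simplified sketch written for the four examples listed above, not the pattern set used in OCRMiner: the group names, the restriction of act names to the form "o" plus one word, and the filtering rule in `find_legislation` are all assumptions.

```python
import re

LEGISLATION = re.compile(
    r"""
    (?:§\s*(?P<section>\d+[a-z]?)\s*          # section mark and number, e.g. "§ 92a"
       (?:odst\.\s*(?P<paragraph>\d+)\s*)?    # optional paragraph ID, e.g. "odst. 2"
       (?:a\s+násl\.\s*)?                     # optional "a násl." (et seq.)
    )?
    (?:(?:zákona|zák\.)\s*)?                  # optional "zákona" / "zák." (Act)
    (?:č\.\s*(?P<act>\d+/\d{4})\s*Sb\.        # act number, e.g. "č. 89/2012 Sb."
      |(?P<name>o\s+\w+)                      # or a short act name, e.g. "o dani"
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_legislation(text):
    """Return the recognised parts of each legislation reference in the text."""
    refs = []
    for m in LEGISLATION.finditer(text):
        # A bare act name without a section or an act number is too ambiguous.
        if m.group("section") or m.group("act"):
            refs.append({k: v for k, v in m.groupdict().items() if v})
    return refs
```

The sketch requires an act number or a short act name after the optional section part, so it shares the limitations named in the text: a reference with two sections (§27 a 31) or with ‘následujících’ written in full is not covered and would need additional alternatives.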

The error analysis of each category is summarized

in Tables 4 and 5. In the partial match section, al-

most 40% of errors are due to OCR errors, usually in

characters sharing similar shapes, e.g. 4-A, Z-7, or O-

0. The cases where NER and pattern did not detect

full organization or full legislation reference caused

the same number of errors (19.57%), leaving 10.87%

for multiple lines and for other reasons.

In the mismatch section, OCR errors caused more than a quarter of the mismatches, followed by keyword errors with another 24.32%. As we can see in Figure 2, the accuracy of the contract date item is low. In the analysed contracts, the dates usually appear stamped or hand-written when signing the contract (see Figure 3 for an example). Thus, most of the date errors happen because the OCR engine could not recognize the hand-written characters correctly. In addition, the cover of confidential information sometimes overlapped text in the surrounding areas, which caused more OCR errors. A further 10.81% of the mismatches were caused by NER recognizing a part of an organization name as a location. Layout errors account for about 9% of the mismatches, and about 6% of the items were extracted in a wrong block. The reason was either a specific layout design or distances between pieces of information in the group created by a covering black line (as we can see in the example in Figure 4 in the Appendix). Pattern errors also account for about 6%, followed by title errors (5.41%) and other reasons (9.91%).

Figure 3: A contract date example.

5 CONCLUSIONS

In the paper, we have presented the first version of the OCRMiner system module for information extraction from scanned contract documents. The design and the architecture of the module have been described in detail.

A new dataset for the evaluation of contract information extraction has been built and used in a thorough evaluation of the contract analysis modules. The evaluation results show that the system is able to identify almost 88% of the contract metadata items, 81% in the exact expected form. A detailed error analysis describes and classifies the reasons for the current mismatches. Although some modules (e.g. keyword detection) are language dependent, the pipeline is easily adaptable to other languages.

With the presented analysis, the current test set can be seen as an extended development set for a future evaluation of the system on a new and larger test set, which should confirm its generalization capabilities. This version also serves as a strong baseline for further work, where we plan to employ state-of-the-art NLP techniques such as tuning a pre-trained BERT model on the contract dataset or grammar induction techniques for the layout and content analysis.

ACKNOWLEDGEMENTS

This work has been partly supported by Konica Mi-

nolta Business Solution Czech within the OCR Miner

project, the Ministry of Education of CR within the

LINDAT-CLARIAH-CZ project LM2018101 and the

Speciﬁc University Research of Masaryk University.

REFERENCES

Arkhipov, M., Trofimova, M., Kuratov, Y., and Sorokin, A. (2019). Tuning multilingual transformers for language-specific named entity recognition. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 89–93, Florence, Italy. Association for Computational Linguistics.

Barrentine, A. et al. (2020). Libpostal. https://github.com/openvenues/libpostal.

Chalkidis, I. and Androutsopoulos, I. (2017). A deep learning approach to contract element extraction. In JURIX, pages 155–164.

Chalkidis, I., Androutsopoulos, I., and Michos, A. (2017). Extracting contract elements. In Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, pages 19–28.

Chalkidis, I., Androutsopoulos, I., and Michos, A. (2018). Obligation and prohibition extraction using hierarchical RNNs. arXiv preprint arXiv:1805.03871.

Dragoni, M., Villata, S., Rizzi, W., and Governatori, G. (2016). Combining NLP approaches for rule extraction from legal documents.

Ha, H. T. (2019). Approximate string matching for detecting keywords in scanned business documents. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2019, pages 49–54.

Ha, H. T., Nevěřilová, Z., Horák, A., et al. (2018). Recognition of OCR Invoice Metadata Block Types. In Text, Speech, and Dialogue. TSD 2018, pages 304–312. Springer, Cham.

Kwok, T. and Nguyen, T. (2006). An automatic method to extract data from an electronic contract composed of a number of documents in PDF format. In The 8th IEEE International Conference on E-Commerce Technology and The 3rd IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services (CEC/EEE’06), pages 33–33. IEEE.

Milosevic, Z., Gibson, S., Linington, P. F., Cole, J., and Kulkarni, S. (2004). On design and implementation of a contract monitoring facility. In Proceedings. First IEEE International Workshop on Electronic Contracting, 2004., pages 62–70. IEEE.

Neill, J. O., Buitelaar, P., Robin, C., and Brien, L. O. (2017). Classifying sentential modality in legal language: a use case in financial regulations, acts and directives. In Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, pages 159–168.

Ryan, F. (2006). Round Hall nutshells Contract Law. Thomson Round Hall.

Smith, R. (2007). An overview of the Tesseract OCR engine. In Document Analysis and Recognition, Ninth International Conference on, volume 2, pages 629–633. IEEE.

Smith, R. et al. (2020). Tesseract OCR. https://github.com/tesseract-ocr/tesseract.

Winter, K. and Rinderle-Ma, S. (2018). Detecting constraints and their relations from regulatory documents using NLP techniques. In OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”, pages 261–278. Springer.

APPENDIX

An example of a scanned contract with the extracted metadata information is presented in Figure 4.


Figure 4: A contract example: extracted information is in the red boxes.
