tences with a variable middle part, e.g. there are dif-
ferent possible subjects or objects for a clause. The
list items are not individual sentences but form one
whole sentence because they do not conclude any ac-
tion. All of those list variations may start with a sen-
tence concluded by a colon which is then individually
annotated. Lists of lists are treated in a recursive way
and the decision on the sentence boundary is based
on the sub-entries. If an entry consists of multiple
sentences, each is individually annotated. These
rules follow de Maat (de Maat, 2012) who states that
lists can form “single sentences [...] [though] the list
items are often referenced separately”. This finer-grained
segmentation is justified even though it can
produce incomplete sentences.
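The recursive treatment of nested lists can be illustrated with a minimal sketch. It assumes that each list entry has already been split into its own sentences and that sub-lists are stored as child entries. The names `Entry` and `annotate` and the example text are hypothetical and only serve illustration; they are not taken from the annotation tooling used in this work.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    # Sentences contained in this entry, assumed to be split already.
    sentences: List[str]
    # Nested sub-entries, i.e. a list inside a list item.
    children: List["Entry"] = field(default_factory=list)


def annotate(entry: Entry) -> List[str]:
    """Return all segments of an entry in document order.

    Every sentence of the entry becomes its own segment, then the
    same rule is applied recursively to each sub-entry.
    """
    segments = list(entry.sentences)
    for child in entry.children:
        segments.extend(annotate(child))
    return segments


# Invented example: an introductory sentence ending in a colon,
# followed by a list whose second item contains a nested list.
example = Entry(
    sentences=["The contract ends if:"],
    children=[
        Entry(["(1) the tenant gives notice,"]),
        Entry(
            ["(2) the landlord:"],
            children=[
                Entry(["(a) sells the property, or"]),
                Entry(["(b) dies."]),
            ],
        ),
    ],
)

for segment in annotate(example):
    print(segment)
```

In this sketch the introductory sentence and every list item end up as separate segments, including items that are not complete sentences on their own.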
Definitions are a combination of a headline and
(multiple) sentences, thus their individual parts are
annotated accordingly. Data fields which hold the
information about dates, people, locations, etc. and
case names are similar. Endnotes/Footnotes can also
be seen as a combination of headline and sentences.
Page numbers are, if not already removed from
the text, only annotated if they are found between two
sentences.
Mixed cases of all the previous structures are also
possible. The described textual units are only the ba-
sic building blocks found in a legal document. There
is a large amount of variation even within the same
legal document type. In such cases, the finer segmentation
is chosen, e.g. in a list which combines
an enumeration and a list, the individual parts of the
enumeration would also be annotated as sentences to
avoid any inconsistency within the list.
The aim of this paper is not to create the perfect
definition for a sentence in the legal domain but to cre-
ate a logical, consistent and practical rule set to work
with. This rule set sometimes produces segments
that are not grammatically complete sentences
because they lack essential parts of speech.
Those segments are nevertheless needed
for further processing steps, as overall they reduce the
complexity of the given text part.
Table 1 provides a few examples of the different
types of special sentences based on our taxonomy. As
such examples can be quite complex, we refer to
selected norms from the German Civil
Code (BGB).
3 DATASET
In order to get a broad overview of the performance
of SBD in the (German) legal domain, a diverse
set of documents was collected. The
main focus was on judgments and laws with around
20,000 sentences each, but privacy policies (PRVs)
and terms of service (ToS) are also part of the collection.
The dataset consists of approximately 52,000
sentences. The judgments and laws are used for train-
ing and the other document types mainly for valida-
tion and testing.
The dataset contains the BGB, the German Constitution
(GG), the Code of Social Law (SGB) 1 to 3 and
the Criminal Code (StGB), with the BGB being by
far the longest text. We deliberately selected these
laws to capture a broad variety of
areas of law with their distinct linguistic characteristics.
The judgments were collected from the website
Bayern.Recht¹, which is a collection of many judgments
from Bavarian courts. They are from different
legal domains and courts (Verfassungs-, Ordentliche,
Verwaltungs-, Finanz-, Arbeits- and Sozialgerichtsbarkeiten,
i.e. constitutional, ordinary, administrative, fiscal,
labor and social jurisdictions) and thus vary widely in
structure and content. The PRVs of major tech companies
were collected and the ToS of online shops gathered
by Koller (Koller, 2019) are also used. Detailed
statistics on the documents and the number of sentence
boundaries can be seen in Table 2.
The documents were annotated first with an
existing online annotation tool developed by
Savelka (not published yet) and later with a newly
created graphical user interface which is also used to
display and analyze the classification results. The last
non-punctuation token or word in every sentence is
annotated as the end of the sentence. Each document
is saved as a separate JSON file containing the text
and the sentence boundary annotations with their po-
sition in the text. The annotations, based on the taxon-
omy described in Section 2, were performed by three
researchers. For this purpose, an annotation guideline
based on this taxonomy was used. Each
researcher annotated 33.33% of the whole
corpus. In addition, we randomly selected 10.00% of each
annotator's share and assigned it to the
other two annotators as well. As a result, each annotator annotated
36.66% of the complete corpus and 9.99% was annotated
three times. Table 3 reports on the numbers
which are relevant for the inter-annotator agreement
for this threefold annotated part of the corpus.
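The storage format described above can be sketched as follows. The field names `text`, `annotations`, `start` and `end` are assumptions made for illustration, since the exact JSON schema is not specified here, and the helper `split_sentences` is hypothetical. The sketch assumes that each annotation marks the character span of the last non-punctuation token of a sentence and that trailing punctuation belongs to the same sentence.

```python
import json
import string

# Assumed schema: the document text plus the character spans of the
# last non-punctuation token of every sentence (end offset exclusive).
RAW = """
{
  "text": "Die Frist beginnt. Sie endet.",
  "annotations": [
    {"start": 10, "end": 17},
    {"start": 23, "end": 28}
  ]
}
"""


def split_sentences(doc):
    """Reconstruct sentences from end-of-sentence token annotations."""
    text = doc["text"]
    sentences = []
    start = 0
    for ann in sorted(doc["annotations"], key=lambda a: a["end"]):
        end = ann["end"]
        # Attach punctuation that directly follows the annotated token.
        while end < len(text) and text[end] in string.punctuation:
            end += 1
        sentences.append(text[start:end].strip())
        start = end
    return sentences


doc = json.loads(RAW)
for sentence in split_sentences(doc):
    print(sentence)
```

Annotating the last token rather than the punctuation mark keeps the annotation well defined for segments such as headlines or list items that end without any punctuation.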
Annotator A is a data science student with three
years of experience in NLP applied to the legal do-
main. Annotator B is a PhD student in NLP with a research
focus on the application of legal court rulings.
Annotator C is a legal expert with
10 years of working experience. For this work, we
report the inter-annotator agreement by means of the
¹ See https://www.gesetze-bayern.de/
ICAART 2021 - 13th International Conference on Agents and Artificial Intelligence