out in the edition. At the end of the publication process,
37 volumes will be available, approximately 50,000 pages
of documents, offering an inside view of more than three
decades of East German history as seen through the eyes of
the Stasi. The cooperation with the Natural Language
Processing Group, Department of Computer Science at the
University of Leipzig, is an attempt to open up further
innovative, long-term approaches to these important data.
2 THE COOPERATION
The perspective of the BStU is mainly historical and
political, aiming at the reappraisal of the past, whereas
the NLP group provides methods and technical resources to
process text automatically and to extract information from
it. These different focuses leave complementary gaps. On
the part of the BStU, we see a lack of automation and of
information systems in the editorial process; on the part
of the NLP group, a lack of historical expertise and a
constant need for data to which classic automatic language
processing methods can be applied. We believe that by
combining the competences of each partner we are able to
create added value on both sides.
The cooperation is beneficial in two directions. On the
one hand, there are improvements to the existing workflow.
The introduction of automation and support systems during
the editorial process can lead to a more efficient workflow
and can also reduce errors and redundancies. On the other
hand, there are applications that would not be possible
without the use of NLP methods, e.g. when the effort would
be too high for human resources alone. Furthermore, the NLP
group is able to develop new methods and to solve new
problems on real-world data. Therefore, we believe that
cooperation is necessary for advancing both the reappraisal
of the past and NLP. In the following, we present first
results of processing these data with advanced NLP methods.
In particular, applying methods for latent semantic
indexing to the analysis of recipients yields highly
interesting and promising results.
3 DATA AND STRUCTURE
The digitization process yields digital documents, but this
does not mean that they are structured in a way that is
best for NLP. The original documents are stored in an
XML-like format that is optimized for printing and
therefore also contains layout information. This format can
be considered semi-structured in the common XML sense. The
tags used do not capture the semantics of the document
structure but those of layout and printing. Retrieving
information from those documents is therefore associated
with high search costs. In addition to the documents'
contents, metadata are stored for each document. These
include the date, the subject, a list of recipients of that
document and other information. To create a format that is
easier to query and flexible enough for a wider range of
applications, we parsed the original documents and designed
a new XML schema to which we transferred them. Transferring
the documents to a database seems natural; however, we
found designing a new XML schema to be the first step in
that direction, since it makes further processing easier.
The whole collection is categorized by year. Table 1 lists,
for each year, the available XML-formatted documents we
extracted from the original format.
Table 1: Currently available Stasi documents by year.
Year #Documents
1953 198
1961 260
1976 320
1977 337
1988 279
Total 1394
The transformed documents are structured in such a way that
the document contents, possible attachments and the
different metadata elements have their own tags. New
annotations, for example POS tags or named entities, can be
added easily and then be used as features for advanced NLP
methods. Some problems still remain after the
transformation. The list of recipients is a plain string
which can contain information annotated during the
editorial process. This makes it difficult to split the
list into individual recipients, which is needed for tasks
like automatic indexing. We extracted the recipients from
the list by applying regular expressions to the string,
which works in most cases. However, it was not possible to
use exactly the same regular expressions on the documents
of every volume, so a more refined approach is required to
improve the quality of the extracted data. Despite the
remaining challenges, the transformation was a first step
towards structuring the data; further work is necessary.
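The transformation and the recipient extraction described above can be illustrated with a minimal Python sketch. It is not the schema or the regular expressions actually used in the project, which are not reproduced here. The tag names (document, metadata, recipients, recipient, content, attachment), the assumption that editorial annotations appear in square brackets or parentheses, the assumption that recipients are separated by commas or semicolons, and the sample values are all invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: editorial annotations are enclosed in [...] or (...).
ANNOTATION = re.compile(r"\[[^\]]*\]|\([^)]*\)")
# Assumption: individual recipients are separated by commas or semicolons.
SEPARATOR = re.compile(r"\s*[;,]\s*")


def split_recipients(raw):
    """Split a plain recipient string into a list of individual recipients."""
    cleaned = ANNOTATION.sub("", raw)        # drop editorial annotations
    parts = SEPARATOR.split(cleaned)         # cut at the separators
    # Normalize inner whitespace and discard empty fragments.
    return [" ".join(part.split()) for part in parts if part.strip()]


def build_document(date, subject, raw_recipients, content, attachments=()):
    """Build a document element in which content, attachments and each
    metadata element have their own (hypothetical) tags."""
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "metadata")
    ET.SubElement(meta, "date").text = date
    ET.SubElement(meta, "subject").text = subject
    recipients = ET.SubElement(meta, "recipients")
    for name in split_recipients(raw_recipients):
        ET.SubElement(recipients, "recipient").text = name
    ET.SubElement(doc, "content").text = content
    for text in attachments:
        ET.SubElement(doc, "attachment").text = text
    return doc


# Invented sample values, not taken from the edition.
example = build_document(
    date="1953-01-01",
    subject="Sample subject",
    raw_recipients="Recipient A [editorial note], Recipient B; Recipient C (copy)",
    content="Sample report text.",
)
print(ET.tostring(example, encoding="unicode"))
```

Storing every recipient in its own element makes later steps such as automatic indexing a simple element lookup. A single pair of patterns like the one above is unlikely to cover every volume, which is in line with the observation that the same regular expressions could not be reused across all documents.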
Since there have been no previous attempts to digitally
analyse the ZAIG reports, little is known about the
quantity, coverage and relations of certain topics and
about the feasibility of employing certain NLP methods.
This can be illustrated by Figure 1. It shows the
frequency of the words “Sozialismus” (Engl. socialism)
The GDR Through the Eyes of the Stasi - Data Mining on the Secret Reports of the State Security Service of the former German Democratic Republic