machine learning purposes. The pseudonymization
can be improved and extended by implementing cus-
tom libraries, but this is a manual process. Our goal is
to automate improvement processes of entity recog-
nition and anonymization by including self-learning
components that are able to improve the automated
detections based on the users’ manual corrections of
the results of the analysis tools, thus realizing the
bootstrapping cycle outlined above.
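The feedback loop described above can be illustrated with a minimal sketch. It is not Textominado's implementation: the class SelfLearningPseudonymizer, its methods, and the sample names are hypothetical, and a simple term list stands in for a trained entity recognition model. The sketch only shows how manual corrections could feed back into later detections.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SelfLearningPseudonymizer:
    """Toy detector (hypothetical): a term list that grows from user corrections."""

    # Maps a known surface form to its entity label, e.g. "Anna Schmidt" -> "PERSON".
    known_entities: dict = field(default_factory=dict)

    def detect(self, text):
        """Return sorted (start, end, label) spans for all known terms in text."""
        spans = []
        for term, label in self.known_entities.items():
            for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
                spans.append((match.start(), match.end(), label))
        return sorted(spans)

    def pseudonymize(self, text):
        """Replace every detected span with a placeholder such as [PERSON]."""
        parts, pos = [], 0
        for start, end, label in self.detect(text):
            if start < pos:  # skip spans that overlap an earlier replacement
                continue
            parts.append(text[pos:start])
            parts.append(f"[{label}]")
            pos = end
        parts.append(text[pos:])
        return "".join(parts)

    def learn(self, missed=(), false_positives=()):
        """Apply manual corrections: add missed entities, drop wrong detections."""
        for term, label in missed:
            self.known_entities[term] = label
        for term in false_positives:
            self.known_entities.pop(term, None)


# One pass of the cycle: detect, let a user correct, detect again.
pseudonymizer = SelfLearningPseudonymizer({"Anna Schmidt": "PERSON"})
document = "Anna Schmidt met Dr. Weber in Dortmund."
first_pass = pseudonymizer.pseudonymize(document)

# Assumed user feedback: two entities were missed in the first pass.
pseudonymizer.learn(missed=[("Weber", "PERSON"), ("Dortmund", "LOCATION")])
second_pass = pseudonymizer.pseudonymize(document)
```

In this sketch the first pass replaces only the initially known name, while the second pass also replaces the two corrected entities. A real system would presumably retrain a statistical model on the corrected annotations rather than extend a term list.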
6 CONCLUSION AND OUTLOOK
In this work we have described the challenges of per-
forming applied research in NLP and AI with real un-
structured data while remaining compliant with the
GDPR and other data protection laws.
We have shown the need to anonymize and share
data in three application areas: court decisions, healthcare,
and insurance fraud detection. From practical experience
in research projects, we have outlined challenges
and possible solutions for obtaining data to
develop research prototypes. Based on these experi-
ences, we have defined a bootstrap challenge: AI and
NLP can be used to automate data anonymization for
research, but anonymized data is needed to create AI
and NLP anonymization solutions in the first place.
The resulting research question is how to solve this
bootstrap problem while lessening manual effort for
anonymization.
We have outlined a possible solution architec-
ture, which incrementally improves domain-specific
pseudonymization in a bootstrap cycle, thus solving
the bootstrap challenge, and shown Textominado, a
prototype for pseudonymization and anonymization
of unstructured documents.
Since the work discussed in this paper is still
ongoing, our prototype has not yet been evaluated.
However, discussions with various companies and
public organizations revealed a clear need for
practicable ways of anonymizing unstructured
textual data. In future research, we plan to use
Textominado to acquire anonymized data from real-
world organizations for use in AI and NLP research
projects. In this process, we plan to extend Textomi-
nado in order to implement the outlined solution ar-
chitecture and investigate its feasibility.
ACKNOWLEDGEMENTS
This work was partly supported by the project Smar-
tAIwork, which is funded by the Federal Ministry of
Education and Research (BMBF) under the funding
number 02L17B00ff. We would like to thank our contacts at
the court and in the healthcare and insurance industry
as well as the students working in student projects for
their efforts.
ICEIS 2019 - 21st International Conference on Enterprise Information Systems