ANONIMYTEXT: ANONIMIZATION OF UNSTRUCTURED
DOCUMENTS
Rebeca Perez-Lainez, Ana Iglesias and Cesar de Pablo-Sanchez
Universidad Carlos III de Madrid, Avenida de la Universidad 30, 28911 Leganes (Madrid), Spain
Keywords: Anonymization, Medical Records, Sensitive Data, Private Data, De-identification, Clinical Notes.
Abstract: The anonymization of unstructured texts is a task of great importance in several text mining
applications. Medical record anonymization is needed both to preserve the privacy of personal health
information and to enable further data mining efforts. The ANONYMITEXT system described here is
designed to de-identify sensitive data in unstructured documents. It has been applied to Spanish clinical
notes to recognize sensitive concepts that would need to be removed if the notes are used beyond their
original scope. The system combines several medical knowledge resources with semantic dictionaries
induced from the clinical notes. An evaluation of the semi-automatic process has been carried out on a
subset of the clinical notes for the most frequent attributes.
1 INTRODUCTION
Anonymizing texts is fundamental to preserving
information security in certain application domains.
After anonymization, texts should remain legible
but must not disclose information about individuals.
For example, in the health information domain,
de-identification is an important task when medical
records are used for judicial purposes,
epidemiological studies, research, etc.
Most countries have developed their own
legislation to preserve the privacy of medical
records. In this paper, the American Health
Insurance Portability and Accountability Act
(HIPAA) of 1996 and the Spanish Law for
Protection of Personal Data (LOPD) of 1999 have
been taken into account to de-identify the clinical
documents.
De-identification is defined as the process of
identifying, selecting and removing sensitive data
that appear in a text. Sensitive data can be defined
as personal data which could be used to identify a
person and which have no explicit purpose in the
final application.
The de-identification task can be addressed using
Natural Language Processing (NLP) and
Information Extraction (IE). It remains a challenge
because medical records are often unstructured and
ungrammatical and usually contain misprints, all of
which make de-identification harder.
The main objective of ANONYMITEXT is to
de-identify clinical notes used in a Spanish hospital.
The system acquires sensitive data from text by
using the dictionary induction technique.
2 RELATED WORK
Several research groups have been working on
developing techniques to de-identify unstructured
English medical records according to HIPAA. Most
of the present approaches fall into two categories:
Dictionary Based Techniques (DBT) or Machine
Learning Techniques (MLT).
The UMLS metathesaurus (Bodenreider, 2004) is
an essential clinical resource that some authors like
(Ruch, Baud, Rassinoux, Bouillon, & Rober, 2000),
(Gupta, Saul, & Gilberston, 2004) and (Morrison &
Li, 2008) use to recognize medical terminology. The
remaining tokens are then considered candidates
for de-identification. To avoid removing too much
content, since candidates are sometimes not
accurately identified, other resources such as
dictionaries of personal names, surnames, etc. are
often used.
On the other hand, MLT and IE have been used by
authors like (Aramaki & Miyo, 2006) and (Szarvas,
Farkas, & Busa-Fekete, 2007) to extract information
from medical records, whereas (Im & Raś, 2005)
use them to minimize the number of values which
should be hidden.
Perez-Lainez R., Iglesias A. and de Pablo-Sanchez C. (2009).
ANONIMYTEXT: ANONIMIZATION OF UNSTRUCTURED DOCUMENTS.
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pages 284-287
DOI: 10.5220/0002297102840287
Copyright © SciTePress
Both techniques have advantages and
disadvantages. DBT are fast, but protection is
limited by the coverage of the dictionaries used.
Moreover, the use of DBT is hindered by the
problem of ambiguous terms, that is, terms which
could have more than one meaning (Ruch et al.,
2000). MLT, in contrast, can learn inference rules
which generalize the model beyond the training
data, but a large amount of training data is needed to
learn effective models. Besides, if the source of the
data changes, the models must be retrained to
maintain performance.
A complementary idea that has been applied in
Adaptive IE consists of acquiring or inducing
semantic dictionaries from a large collection of
documents in the same domain as the application.
This technique only requires specifying a set of
seeds, that is, concepts of interest that are prominent
in the domain. Semantic dictionary induction is
often less expensive than annotating full documents,
as it only requires related seeds to be specified.
3 ANONYMITEXT
ARCHITECTURE
The ANONYMITEXT system is designed to
recognize sensitive data in unstructured documents
to enable their de-identification. The system
combines general and domain knowledge resources
with automatically induced dictionaries. The system
input is a set of unstructured documents. The output
is the de-identified input corpus.
The system architecture is composed of five
steps: Dictionary Induction, Tagger, Adviser, Expert
Revision and Anonymizer (Figure 1).
In the Dictionary Induction step, a domain expert
selects a set of seed examples that are frequent in
the corpus, such as person names, hospital names,
etc. The dictionaries are created by extending the
seeds with new terms that co-occur in similar
contexts across the collection of clinical notes.
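As a rough illustration of seed expansion, the following Python sketch adds to a dictionary every token that shares enough (previous word, next word) contexts with the seeds. It is a simplification written for this text and should not be read as the induction algorithm used by the system; the function name `induce_dictionary`, the `min_shared_contexts` threshold and the toy sentences are all assumptions.

```python
from collections import defaultdict


def induce_dictionary(sentences, seeds, min_shared_contexts=2):
    """Expand a seed set with tokens that occur in the same contexts as seeds.

    A context is the pair (previous token, next token). This is a toy
    stand-in for dictionary induction, not the method used in the paper.
    """
    contexts_of = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so that every real token has two neighbours.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts_of[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

    seed_set = {seed.lower() for seed in seeds}
    seed_contexts = set()
    for seed in seed_set:
        seed_contexts |= contexts_of.get(seed, set())

    induced = set(seed_set)
    for token, contexts in contexts_of.items():
        if len(contexts & seed_contexts) >= min_shared_contexts:
            induced.add(token)
    return induced


# Hypothetical miniature corpus (invented, not taken from the clinical notes).
sentences = [
    "dr garcia revisa al paciente",
    "dr lopez revisa al paciente",
    "dr martin revisa al paciente",
    "dra garcia firma el informe",
    "dra lopez firma el informe",
    "dra martin firma el informe",
    "el paciente firma el informe",
]
induced_names = induce_dictionary(sentences, ["garcia"])
```

In this toy corpus a single seed is enough to pull in the other two surnames, because they appear between the same neighbouring words, while "paciente" is left out. Raising `min_shared_contexts` makes the expansion more conservative.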
The Tagger step performs a morphological and
semantic analysis of the text, which is tokenized,
split into sentences and enriched with part-of-speech
information. Induced dictionaries are used to add
semantic information. Moreover, other domain
resources, such as the UMLS metathesaurus, could
be used to improve semantic tagging. This phase
finds the semantic information that will be used to
de-identify sensitive information in the next phases.
In the Adviser step, the system detects sensitive
data in the documents according to the information
security laws of the country that apply to the domain
of the system. Sensitive data is marked so that, in the
next step, the expert can make the final decision on
which data will be preserved.
Figure 1: ANONYMITEXT De-Identification architecture.
An interface shows the expert the semantically
tagged documents and the source of each tag
(induced dictionary, biomedical resource, etc.). The
task of the expert is to accept or reject each
recommendation of the system, reporting the cause
of any rejection. The causes for rejection include:
1) a dictionary induction mistake; 2) a
morpho-semantic mistake, which occurs when an
ambiguous word is incorrectly tagged; 3) an advice
mistake, which occurs when the system advises
hiding all words not tagged as personal information.
Feedback data is logged and would be useful to
adjust the previous phases (Dictionary Induction,
Tagger, Adviser). The models for the different parts
could be retrained or improved using an idea similar
to active learning. The system could thus learn
continuously and become more efficient, reducing
the time that the expert spends in the Revision step.
Finally, in the Anonymizer step, sensitive data are
ciphered with a public-key algorithm or a hash
function.
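A minimal sketch of the hash-based variant of this step is given below, assuming the earlier steps yield (token, tag) pairs. It uses SHA-1 from Python's standard `hashlib` module, the hash function reported in the experiment of Section 4; the `anonymize` function, the `SENSITIVE_TAGS` set, the "O" tag for ordinary tokens and the sample sentence are hypothetical.

```python
import hashlib

# Tags whose tokens are replaced by a digest (tag names as used in Section 4).
SENSITIVE_TAGS = {"Patient_Name", "Medical_Facility"}


def anonymize(tagged_tokens):
    """Replace every sensitive token with the hex SHA-1 digest of its text."""
    output = []
    for token, tag in tagged_tokens:
        if tag in SENSITIVE_TAGS:
            digest = hashlib.sha1(token.encode("utf-8")).hexdigest()
            output.append(digest)
        else:
            output.append(token)
    return " ".join(output)


# Hypothetical tagged sentence; "O" marks a token that is not sensitive.
tagged = [
    ("Paciente", "O"),
    ("Juan", "Patient_Name"),
    ("ingresa", "O"),
    ("Juan", "Patient_Name"),
]
anonymized = anonymize(tagged)
```

Because a hash is deterministic, both occurrences of the same name map to the same 40-character digest, so references to one person stay linked after de-identification.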
4 EXPERIMENT DESIGN
The ANONYMITEXT system has been evaluated
using a corpus of 60 Spanish clinical notes from a
Spanish hospital. These clinical notes contain data
that are sensitive according to the HIPAA and LOPD
security laws, such as patient names, patient ages,
phone numbers, cities, dates, medical facility names
or doctor names.
Three domain experts participated in the
experiment, manually annotating the gold standard
corpus. These experts also collaborated in the
Dictionary Induction phase by providing frequent
seed examples. The induced dictionaries were
obtained automatically from the whole collection
(210,700 clinical notes).
Due to the low frequency of some sensitive data
in the corpus, this paper focuses on evaluating the
de-identification of doctor, medical facility and
patient names. In the corpus used for the
experiment, 172 tokens belong to the Doctor Name
class, 79 to the Patient Name class and 107 to the
Medical Facility Name class.
In the corpus, most clinical notes contain
sentences that are not tabulated and do not follow
grammar rules, so sentence analysis becomes
difficult.
The evaluation process is composed of the
following phases:
1) Corpus Annotation: Firstly, the domain
experts annotated the medical records using a
common set of tags for sensitive data. These tags
were clearly defined taking into account the HIPAA
and LOPD laws. Secondly, we checked whether the
tags were correctly defined and whether the
annotators understood them in the same way. To
ensure that the annotation process had been
correctly executed, the agreement level between
annotators was calculated with the Kappa measure
(Sim & Wright, 2005).
2) Dictionary Induction: Next, the domain
experts were asked to obtain seeds from the corpus.
These seeds were used to induce person name,
doctor name and medical facility dictionaries. The
tool used for this induction is SPINDEL (De Pablo-
Sanchez & Martínez, 2009).
3) De-Identification: This phase includes the
morpho-semantic analysis of the clinical texts and
the anonymization phase, in which sensitive data is
hidden. For the morpho-semantic analysis,
ANONYMITEXT uses the STILUS tool (Villena,
González, & González, 2002). STILUS includes
resources for semantically classifying a token as
person, organization or location. To tag sensitive
tokens, two alternatives have been taken into
account: A) The token is searched for in an induced
dictionary; if it is found, it is tagged. B) If
STILUS tags a word as organization, location or
person, the word is searched for in the induced
dictionary. If the semantic category in the induced
dictionary matches the semantic category assigned
by STILUS, the word is tagged; otherwise it is not.
STILUS includes few biomedical terms, so it has
been necessary to use specific biomedical resources
such as a Spanish health acronyms dictionary
(Yetano & Alberola, 2003), an active principles
dictionary (Cantalapiedra, 1989) and the SNOMED
metathesaurus (Spackman, Campbell, & Cote,
1997).
Once the medical records have been analyzed,
the tokens tagged as Patient_Name or
Medical_Facility are ciphered using the SHA-1
algorithm (De Cannière & Rechberger, 2006).
4) Evaluation: In this phase, we compare the
annotations provided by ANONYMITEXT with the
manually annotated documents. Precision, Recall
and F0.5-Measure have been calculated at the token
level (β=0.5 weights precision twice as much as
recall).
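The two tagging alternatives of the De-Identification phase can be sketched as follows. The sketch assumes the induced dictionary is a plain mapping from lowercased token to class, and it takes the category assigned by the general-purpose tagger as an argument rather than calling STILUS, whose interface is not described here. The correspondence between induced classes and general categories (`INDUCED_TO_GENERAL`) is likewise an assumption, as are the function names and the sample dictionary.

```python
# Assumed correspondence between induced classes and general tagger categories.
INDUCED_TO_GENERAL = {
    "Person_Name": "person",
    "Medical_Facility": "organization",
}
GENERAL_CATEGORIES = {"person", "organization", "location"}


def tag_alternative_a(token, induced):
    """A) Tag the token whenever it is found in the induced dictionary."""
    return induced.get(token.lower())


def tag_alternative_b(token, induced, general_category):
    """B) Tag only when the general category and the induced class agree."""
    if general_category not in GENERAL_CATEGORIES:
        return None  # the general tagger did not flag the word at all
    induced_class = induced.get(token.lower())
    if induced_class is None:
        return None  # the word is not in the induced dictionary
    if INDUCED_TO_GENERAL.get(induced_class) == general_category:
        return induced_class
    return None  # the two sources disagree, so the word is left untagged


# Hypothetical induced dictionary.
induced = {"garcia": "Person_Name", "gregorio": "Medical_Facility"}
```

Alternative B is the stricter of the two: `tag_alternative_b("Gregorio", induced, "person")` returns `None`, because the dictionary lists the token as a facility while the general category is person, whereas `tag_alternative_a("Gregorio", induced)` tags it as `Medical_Facility`.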
5 RESULTS
Table 1 shows a summary of the main results
obtained in the experiment.
Table 1: Results for ANONYMITEXT.
                   Precision   Recall   F-Measure
Person_Name        89.5        67.85    84.15
Medical_Facility   26.21       23.68    25.66
Overall            67.22       53.6     63.97
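The F-Measure column can be recomputed from the precision and recall columns with the standard F-beta formula, F = (1 + β²)·P·R / (β²·P + R). The helper below is a minimal sketch; the name `f_beta` is an illustrative choice, and the inputs are the rounded values printed in Table 1.

```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall as printed in Table 1 (percentages).
table_1 = {
    "Person_Name": (89.5, 67.85),
    "Medical_Facility": (26.21, 23.68),
    "Overall": (67.22, 53.6),
}
recomputed = {name: f_beta(p, r) for name, (p, r) in table_1.items()}
```

With β = 0.5, the rounded table values give about 25.66 for Medical_Facility and 63.97 for Overall, matching the table. Person_Name comes out at about 84.13, marginally below the reported 84.15, presumably because the precision and recall in the table are themselves rounded.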
Because STILUS classifies tokens in the same
way as the induced dictionaries, precision, recall and
F-Measure reach good values for the Person_Name
class. However, precision is not 100% because
STILUS does not allow certain tokens to be split,
such as a surname followed by a punctuation sign.
This is one of the limitations of STILUS.
On the other hand, the system did not achieve
good results for the de-identification of
Medical_Facility names. Analysing the results, two
main causes were found: 1) Semantic ambiguity of
terms, that is, polysemous words which, depending
on the context, may or may not refer to sensitive
data. Moreover, a great majority of Spanish medical
facility names contain a person name, which
generates a semantic ambiguity between tokens
belonging to the Person_Name class and the
Medical_Facility class. 2) Acronyms of medical
facility names make the de-identification process
more difficult. Later analysis of the human-tagged
results has shown that clinicians write some medical
facilities as acronyms. However, the Spanish health
acronyms dictionary used during the experiments
only contained acronyms related to diseases and
medical concepts. Since acronyms are common in
medical records, it would be necessary to upgrade
SPINDEL in order to include acronyms in the
induced dictionary.
Overall results are calculated by micro-averaging
over all semantic concepts. They indicate that the
system is not yet ready to work automatically. Even
so, in its current configuration the system offers a
fast way to identify sensitive data that would make
the task easier and more effective.
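Micro-averaging pools the token-level counts of all classes before computing the measures, so frequent classes weigh more than rare ones. A minimal sketch follows; the function name `micro_average` is illustrative, and the per-class counts are invented for the example and are not the counts of the experiment.

```python
def micro_average(counts):
    """Return (precision, recall) computed from counts pooled over all classes.

    `counts` maps a class name to its true positives (tp), false
    positives (fp) and false negatives (fn).
    """
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical token counts per class (not the experiment's figures).
example_counts = {
    "Person_Name": {"tp": 8, "fp": 2, "fn": 2},
    "Medical_Facility": {"tp": 2, "fp": 3, "fn": 3},
}
micro_precision, micro_recall = micro_average(example_counts)
```

In this example the pooled counts are 10 true positives, 5 false positives and 5 false negatives, so both measures equal 10/15, which lies closer to the larger class than a plain mean of the per-class values would.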
6 CONCLUSIONS AND FUTURE
WORK
The main difference between ANONYMITEXT and
previous approaches is the combination of medical
resources with the dictionary induction technique.
The main advantage of this approach lies in the
minimal effort required from the human annotator,
who only needs to provide some seeds from a subset
of the corpus.
Both the Dictionary Induction and Revision
stages allow the inclusion of tagging rules, new
induced dictionaries and new system steps. These
stages therefore make system scalability possible.
Moreover, ANONYMITEXT preserves the
integrity and confidentiality of documents, because
it replaces sensitive data with ciphered information.
We are currently working towards a more
comprehensive evaluation of the tool, including a
larger number of documents and representative
categories of sensitive data. We are also working
on improving dictionary acquisition techniques.
Finally, we are developing the framework that will
allow expert feedback to be used to improve the
final results.
ACKNOWLEDGEMENTS
This work has been partially supported by MAVIR
(S-0505/TIC-0267) and by the TIN2007-67407-
C03-01 project BRAVO.
REFERENCES
Aramaki, E., & Miyo, K. (2006). Automatic De-
identification by Using Sentence Features and Label
Consistency. I2b2 Workshop on Challenges in Natural
Language Processing for Clinical Data.
Bodenreider, O. (2004). The Unified Medical Language
System (UMLS): integrating biomedical terminology.
Nucleic Acids Res. 3, 267-270.
Boletín Oficial del Estado. (1999, December 14).
http://www.boe.es/boe/dias/1999/12/14/index.php
Cantalapiedra, J. (1989). Diccionario de excipientes de las
especialidades farmacéuticas españolas. Madrid:
Ministerio de Sanidad y Consumo.
De Cannière, C., & Rechberger, C. (2006). Finding SHA-1
Characteristics: General Results and Applications. In
Advances in Cryptology – ASIACRYPT 2006 (pp. 1-
20). Springer Berlin / Heidelberg.
De Pablo-Sanchez, C., & Martínez, P. (2009). Building a
Graph of Names and Contextual Patterns for Named
Entity Classification. In Advances in Information
Retrieval (pp. 530-537). Springer Berlin / Heidelberg.
Gupta, D., Saul, M., & Gilberston, J. (2004). Evaluation of
a de-identification (De-Id) software engine to share
pathology reports and clinical documents for research.
American Journal of Clinical Pathology, 176-186.
Im, S., & Raś, Z. W. (2005). Ensuring Data Security
Against Knowledge Discovery in Distributed
Information Systems. In Rough Sets, Fuzzy Sets, Data
Mining, and Granular Computing (pp. 548-557).
Springer Berlin / Heidelberg.
Morrison, F. P., & Li, L. (2008). Repurposing the Clinical
Record: Can an Existing Natural Language Processing
System De-identify Clinical Notes? J Am Med Inform
Assoc, 37-39.
Ruch, P., Baud, R. H., Rassinoux, A.-M., Bouillon, P., &
Rober, G. (2000). Medical Document Anonymization
with a Semantic Lexicon. AMIA Annu Symp Proc,
729-733.
Spackman, K. A., Campbell, K. E., & Cote, R. A. (1997).
SNOMED RT: a reference terminology for health
care. Proceedings of the AMIA Fall Symposium, (pp.
640-644).
Sim, J., & Wright, C. C. (2005). The Kappa Statistic in
Reliability Studies: Use, Interpretation, and Sample
Size Requirements. In Physical Therapy (pp. 257-
268).
Szarvas, G., Farkas, R., & Busa-Fekete, R. (2007). State-
of-the-art in anonymization of medical records using
an iterative machine learning framework. Journal of
the American Medical Informatics Association (pp.
574-580).
Thelen, M. (2002). Simultaneous Generation of
Domain-Specific Lexicons for Multiple Semantic Categories.
U.S. Department of Health & Human Services. (2006).
http://www.hhs.gov/ocr/privacy/index.html
Villena, J., González, J., & González, B. (2002). STILUS:
Sistema de revisión lingüística de textos en castellano.
Procesamiento del Lenguaje Natural núm 29, 305-306.
Yetano, J., & Alberola, V. (2003). Diccionario de siglas
médicas y otras abreviaturas, epónimos y términos
médicos relacionados con la codificación de las altas
hospitalarias.