automatically by using the whole collection
(210,700 clinical notes).
Due to the low frequency of some sensitive data
in the corpus, this paper focuses on the evaluation
of the de-identification of doctor, medical facility,
and patient names. In the corpus used for the
experiment, 172 tokens belong to the Doctor Name
class, 79 to the Patient Name class, and 107 to the
Medical Facility Name class.
In the corpus, most clinical notes contain
sentences that are not well structured and do not
follow grammatical rules, which makes sentence
analysis difficult.
Therefore, the evaluation process consists of
the following phases:
1) Corpus Annotation: First, the domain
experts annotated the medical records using a
common set of tags for sensitive data. These tags
were clearly defined taking into account the HIPAA
and LOPD laws. Second, we checked whether the tags
were correctly defined and whether they were
understood in the same way by all annotators. To
ensure that the annotation process had been carried
out correctly, the agreement level between annotators
was calculated with the Kappa measure (Sim & Wright, 2005).
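As a purely illustrative aid, the following Python sketch shows one way token-level agreement could be computed with Cohen's kappa; the annotator labels and tag names are hypothetical and are not taken from the corpus.

from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' token-level labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical token-level annotations (labels are illustrative only).
ann1 = ["DOCTOR", "O", "PATIENT", "O", "FACILITY", "O"]
ann2 = ["DOCTOR", "O", "PATIENT", "FACILITY", "FACILITY", "O"]
print(cohen_kappa(ann1, ann2))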
2) Dictionary Induction: Next, the domain
experts were asked to provide seeds from the corpus.
These seeds were used to induce person name,
doctor name, and medical facility dictionaries. The
tool used for this induction was SPINDEL (De Pablo-
Sanchez & Martínez, 2009).
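The sketch below is not SPINDEL itself, whose actual algorithm and interface are not reproduced here; under strong simplifying assumptions it only illustrates the general idea of bootstrapping a dictionary from a few seeds by learning the local contexts in which they appear.

def induce_dictionary(corpus_sentences, seeds, iterations=2):
    """Very simplified seed bootstrapping: learn one-word left contexts of
    known entries, then accept capitalized tokens seen in those contexts."""
    dictionary = set(seeds)
    for _ in range(iterations):
        contexts = set()
        for sent in corpus_sentences:
            tokens = sent.split()
            for i in range(1, len(tokens)):
                if tokens[i] in dictionary:
                    contexts.add(tokens[i - 1].lower())
        for sent in corpus_sentences:
            tokens = sent.split()
            for i in range(1, len(tokens)):
                if tokens[i - 1].lower() in contexts and tokens[i][0].isupper():
                    dictionary.add(tokens[i])
    return dictionary

# Hypothetical seeds and sentences (illustrative Spanish clinical-style text).
seeds = {"García", "Martínez"}
corpus = ["Atendido por Dr. García en urgencias",
          "Valorado por Dr. López tras la intervención"]
print(induce_dictionary(corpus, seeds))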
3) De-Identification: This phase includes the
morpho-semantic analysis of the clinical texts and
the anonymization step, in which sensitive data are
hidden. For the morpho-semantic analysis,
ANONYMITEXT uses the STILUS tool (Villena,
González, & González, 2002). STILUS includes
resources for semantically classifying a token as a
person, organization, or location. To tag sensitive
tokens, two alternatives have been considered:
A) the token is searched in an induced dictionary
and, if found, it is tagged. B) If STILUS tags a word
as an organization, location, or person, the word is
searched in the induced dictionary; the word is
tagged only if the semantic category in the induced
dictionary matches the semantic category assigned
by STILUS, otherwise it is not.
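A minimal Python sketch of the two alternatives follows. STILUS is a commercial analyzer, so stilus_category is only a hypothetical stand-in for its output, and INDUCED_DICT stands in for an induced dictionary; both names and contents are assumptions made for illustration.

# Hypothetical stand-ins: an induced dictionary mapping surface forms to a
# coarse semantic class, and a function emulating STILUS output.
INDUCED_DICT = {"García": "person", "Hospital Clínico": "organization"}

def stilus_category(token):
    # Placeholder for the real analyzer; returns "person", "organization",
    # "location", or None.
    return {"García": "person", "Madrid": "location"}.get(token)

def tag_alternative_a(token):
    # A) Tag the token if it is found in an induced dictionary.
    return INDUCED_DICT.get(token)

def tag_alternative_b(token):
    # B) Tag only when STILUS assigns person/organization/location AND the
    #    induced dictionary assigns the same category.
    category = stilus_category(token)
    if category in ("person", "organization", "location"):
        if INDUCED_DICT.get(token) == category:
            return category
    return None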
STILUS includes few biomedical terms, so it has
been necessary to use specific biomedical resources
such as a Spanish health acronyms dictionary (Yetano &
Alberola, 2003), an active principles dictionary
(Cantalapiedra, 1989), and the SNOMED
metathesaurus (Spackman, Campbell, & Cote,
1997).
Once the medical records have been analyzed,
the tokens tagged as Patient_Name or
Medical_Facility are ciphered using the SHA-1
algorithm (De Cannière et al., 2006).
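As a rough illustration (not the actual ANONYMITEXT code), replacing a tagged token by its SHA-1 digest can be done with Python's standard hashlib module:

import hashlib

def cipher_token(token):
    """Replace a sensitive token (e.g. one tagged Patient_Name or
    Medical_Facility) with its SHA-1 hexadecimal digest so the
    original value is hidden."""
    return hashlib.sha1(token.encode("utf-8")).hexdigest()

# Illustrative example: a surname becomes an opaque 40-character hex digest.
print(cipher_token("García"))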
4) Evaluation: In this phase, we compare the
annotations produced by ANONYMITEXT with the
manually annotated documents. Precision, Recall,
and F0.5-Measure have been calculated at the token
level (β = 0.5 weights precision twice as much as
recall).
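A small sketch of the token-level metrics follows; the counts are invented solely for illustration and do not correspond to the experiment.

def f_beta(precision, recall, beta=0.5):
    """Generic F-beta; beta = 0.5 weights precision twice as much as recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Token-level counts (illustrative numbers only).
true_positives, false_positives, false_negatives = 120, 14, 57
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(precision, recall, f_beta(precision, recall))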
5 RESULTS
Table 1 shows a summary of the main results obtained
in the experiment.
Table 1: Results for ANONYMITEXT.

                    Precision   Recall   F-Measure
  Person_Name           89.50    67.85       84.15
  Medical_Facility      26.21    23.68       25.66
  Overall               67.22    53.60       63.97
Because STILUS classifies tokens in the same
way as the induced dictionaries, precision, recall, and F-
Measure reached good values for the Person_Name
class. However, precision is not 100% because
STILUS cannot split certain tokens, such as
a surname followed by a punctuation mark; this is
one of the limitations of STILUS.
On the other hand, the system did not achieve
good results for the de-identification of Medical_
Facility names. Analysing the results, two main
causes were found: 1) semantic ambiguity of terms,
that is, polysemous words which, depending on the
context, may or may not refer to sensitive data.
Moreover, a large majority of Spanish medical
facility names contain a person name, which
generates semantic ambiguity between tokens
belonging to the Person_Name class and those
belonging to the Medical_Facility class. 2) Acronyms of medical
facility names make the de-identification process
more difficult. A later analysis of the human-tagged
results showed that some medical facilities are
written as acronyms by clinicians. However, the
Spanish health acronyms dictionary used during the
experiments only contained acronyms related to
diseases and medical concepts. Unfortunately, the
use of acronyms in medical records is common, so it
would be necessary to upgrade SPINDEL in order to
include acronyms in the induced dictionaries.
Overall results are calculated by micro-averaging
over all semantic concepts. They indicate that the
system is not yet ready to work automatically,