ANONIMYTEXT: ANONIMIZATION OF UNSTRUCTURED DOCUMENTS

Rebeca Perez-Lainez, Ana Iglesias, Cesar de Pablo-Sanchez

Abstract

The anonymization of unstructured text is an important task in several text mining applications. Anonymizing medical records is necessary both to protect the privacy of personal health information and to enable further data mining. The ANONYMITEXT system described here is designed to de-identify sensitive data in unstructured documents. It has been applied to Spanish clinical notes to recognize sensitive concepts that would need to be removed if the notes were used beyond their original scope. The system combines several medical knowledge resources with semantic dictionaries induced from the clinical notes. The semi-automatic process has been evaluated on a subset of the clinical notes, covering the most frequent attributes.



Paper Citation


in Harvard Style

Perez-Lainez R., Iglesias A. and de Pablo-Sanchez C. (2009). ANONIMYTEXT: ANONIMIZATION OF UNSTRUCTURED DOCUMENTS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009) ISBN 978-989-674-011-5, pages 284-287. DOI: 10.5220/0002297102840287


in Bibtex Style

@conference{kdir09,
author={Rebeca Perez-Lainez and Ana Iglesias and Cesar de Pablo-Sanchez},
title={ANONIMYTEXT: ANONIMIZATION OF UNSTRUCTURED DOCUMENTS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={284-287},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002297102840287},
isbn={978-989-674-011-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - ANONIMYTEXT: ANONIMIZATION OF UNSTRUCTURED DOCUMENTS
SN - 978-989-674-011-5
AU - Perez-Lainez R.
AU - Iglesias A.
AU - de Pablo-Sanchez C.
PY - 2009
SP - 284
EP - 287
DO - 10.5220/0002297102840287