Authors:
Navya Martin Kollapally (1) and James Geller (2)
Affiliations:
(1) Department of Computer Science, New Jersey Institute of Technology, Newark, U.S.A.
(2) Department of Data Science, New Jersey Institute of Technology, Newark, U.S.A.
Keyword(s):
Natural Language Processing, Redaction, Re-identification of EHR Entries, Large Language Models, Privacy-Preserving Machine Learning, HIPAA, Social Determinants of Health.
Abstract:
Research on privacy-preserving Machine Learning (ML) is essential to prevent the re-identification of health data, ensuring the confidentiality and security of sensitive patient information. In this era of unprecedented use of large language models (LLMs), these models carry inherent risks when applied to sensitive data, especially because they are trained on trillions of words from the internet without a global standard for data selection. This lack of standardization in LLM training poses a significant risk in health informatics, as it may result in the inadvertent release of sensitive information even when context-aware redaction is available. The research goal of this paper is to determine whether sensitive information can be re-identified from electronic health records during Natural Language Processing (NLP) tasks, such as text classification, without using any dedicated re-identification techniques. We performed zero-shot and 8-shot learning with the quantized LLMs FLAN, Llama2, Mistral, and Vicuna to classify social context data extracted from MIMIC-III. In this text classification task, our focus was on detecting potential re-identification of sensitive data and the generation of misleading or abusive content during the fine-tuning and prompting stages, along with evaluating classification performance.
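To make the described setup concrete, the following is a minimal Python sketch of zero-shot and k-shot prompting of a 4-bit-quantized causal LLM for social-context classification, using the Hugging Face transformers API. The model name, label set, and prompt wording are illustrative assumptions, not the authors' exact configuration; FLAN-T5, being a sequence-to-sequence model, would require AutoModelForSeq2SeqLM instead of AutoModelForCausalLM.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative choice; Llama2 or Vicuna checkpoints would be loaded the same way.
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit quantization, matching the paper's use of quantized LLMs.
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=quant, device_map="auto"
)

# Hypothetical label set for social determinants of health; the paper's
# actual categories may differ.
LABELS = ["housing insecurity", "food insecurity", "social isolation", "none"]

def classify(note, shots=()):
    """Zero-shot when `shots` is empty; k-shot when k (text, label) pairs are given."""
    prompt = ("Classify the social context in the clinical note into one of: "
              + ", ".join(LABELS) + ".\n")
    for text, label in shots:  # eight exemplars reproduce the 8-shot condition
        prompt += f"Note: {text}\nLabel: {label}\n"
    prompt += f"Note: {note}\nLabel:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    # Decode only the newly generated tokens, i.e., the predicted label.
    return tok.decode(out[0][inputs.input_ids.shape[1]:],
                      skip_special_tokens=True).strip()

Calling classify(note) yields the zero-shot prediction; passing eight exemplar (text, label) pairs in shots corresponds to the 8-shot condition. Inspecting the raw generations from such a loop is one way to check for leaked identifiers or abusive content alongside the classification output.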