Safeguarding Ethical AI: Detecting Potentially Sensitive Data Re-Identification and Generation of Misleading or Abusive Content from Quantized Large Language Models

Navya Martin Kollapally, James Geller

2024

Abstract

Research on privacy-preserving Machine Learning (ML) is essential to prevent the re-identification of health data and to ensure the confidentiality and security of sensitive patient information. In this era of unprecedented usage of large language models (LLMs), such models carry inherent risks when applied to sensitive data, especially because they are trained on trillions of words from the internet without a global standard for data selection. This lack of standardization in LLM training poses a significant risk to health informatics, as it can lead to the inadvertent release of sensitive information, even when context-aware redaction of sensitive information is available. The research goal of this paper is to determine whether sensitive information can be re-identified from electronic health records during Natural Language Processing (NLP) tasks such as text classification, without using any dedicated re-identification techniques. We performed zero-shot and 8-shot learning with the quantized LLM models FLAN, Llama2, Mistral, and Vicuna to classify social context data extracted from MIMIC-III. In this text classification task, we focused on detecting potential sensitive data re-identification and the generation of misleading or abusive content during the fine-tuning and prompting stages, along with evaluating classification performance.
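The zero-shot versus 8-shot settings mentioned in the abstract differ only in whether labelled demonstrations precede the target snippet in the prompt. The following minimal sketch illustrates that prompt construction; the label names and example snippets are invented for illustration and are not the paper's actual prompts or MIMIC-III data.

```python
# Hypothetical sketch of zero-shot vs. 8-shot prompt construction for
# social-context classification. Labels and snippets are illustrative
# assumptions, not taken from the paper or from MIMIC-III.

LABELS = ["housing_insecurity", "social_isolation", "substance_use", "none"]

def build_prompt(text, few_shot_examples=None):
    """Build a classification prompt for a (quantized) LLM.

    few_shot_examples: optional list of (snippet, label) pairs; passing
    8 pairs yields the 8-shot setting, passing None the zero-shot one.
    """
    lines = [
        "Classify the social-context snippet into one of: "
        + ", ".join(LABELS) + "."
    ]
    # Each labelled demonstration is appended before the target snippet.
    for snippet, label in (few_shot_examples or []):
        lines.append(f"Snippet: {snippet}\nLabel: {label}")
    # The target snippet ends with an open "Label:" for the model to fill.
    lines.append(f"Snippet: {text}\nLabel:")
    return "\n\n".join(lines)

# Zero-shot: only the instruction and the target snippet.
zero_shot = build_prompt("Patient reports living in a shelter.")

# 8-shot: eight labelled demonstrations precede the target snippet.
demos = [(f"example snippet {i}", LABELS[i % len(LABELS)]) for i in range(8)]
eight_shot = build_prompt("Patient reports living in a shelter.", demos)
```

The same prompt string would then be passed to whichever quantized model is under evaluation; monitoring the completion for leaked identifiers or abusive content is the detection step the paper studies.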



Paper Citation


in Harvard Style

Martin Kollapally N. and Geller J. (2024). Safeguarding Ethical AI: Detecting Potentially Sensitive Data Re-Identification and Generation of Misleading or Abusive Content from Quantized Large Language Models. In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF; ISBN 978-989-758-688-0, SciTePress, pages 554-561. DOI: 10.5220/0012411900003657


in Bibtex Style

@conference{healthinf24,
author={Navya Martin Kollapally and James Geller},
title={Safeguarding Ethical AI: Detecting Potentially Sensitive Data Re-Identification and Generation of Misleading or Abusive Content from Quantized Large Language Models},
booktitle={Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF},
year={2024},
pages={554-561},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012411900003657},
isbn={978-989-758-688-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF
TI - Safeguarding Ethical AI: Detecting Potentially Sensitive Data Re-Identification and Generation of Misleading or Abusive Content from Quantized Large Language Models
SN - 978-989-758-688-0
AU - Martin Kollapally N.
AU - Geller J.
PY - 2024
SP - 554
EP - 561
DO - 10.5220/0012411900003657
PB - SciTePress
ER -