Language text, called clinical notes. Clinical notes
occasionally mention the social context of a patient,
such as high-risk behaviors, family details, and
unemployment. Such notes can be used in
community-based research, for example to investigate
the origins of non-communicable diseases (Munir &
Ahmed, 2020). In our research, we target the
classification of notes extracted from the MIMIC-III
de-identified medical records to determine whether
they express social context, using the Social
Determinants of Health Ontology (SOHO)
(Kollapally & Geller, 2023). Our goal is to detect the
potential release of sensitive information when the
knowledge embedded in large language models is
combined with the context of notes from MIMIC-III.
We have been able to potentially re-identify private
data, including names of people. In order not to
commit the same offenses that we are censuring in
this paper, all sensitive data, especially names, are
replaced by [*tag*] here; however, we retain the
original data in our private repository.
2 BACKGROUND
Large Language Models (LLMs) have received
substantial attention since November 2022, when
OpenAI released ChatGPT, which generated
unprecedented interest, attracting over one million
unique users within five days. By November 2023, this
number had grown to 180 million users. The introduction of
Generative Pre-trained Transformer-4 (GPT-4) in
March 2023 marked a breakthrough in utilizing large
language models in multi-modal disciplines,
especially medicine. Numerous research articles are
published daily, employing these models to analyze
pathology reports, MRI scans, X-rays, microscopy
images, dermoscopy images, and other modalities (Yan,
2023). However, the continuous release of various
LLMs and chatbots makes it challenging to conduct
thorough red-teaming for each model to assess and
analyze the LLM's responses, behavior, and
capabilities. Therefore, it is imperative to establish
robust regulatory, ethical, and technological
safeguards to ensure the responsible use of LLMs in
healthcare and other critical domains.
LLMs demand comprehensive contextual data to
execute NLP tasks effectively, highlighting the need
to handle lengthy input sequences during the
inference process. As a solution, quantization
techniques have gained popularity as a way to run
LLMs efficiently. The key idea is to convert each
parameter from a 32-bit/16-bit float to a 4-bit/8-bit
representation. This enables downloading and
running LLMs on local machines without
GPUs. Recent quantization methods such as QLoRA
(Dettmers, Pagnoni, Holtzman, & Zettlemoyer,
2023), LoRA, and Parameter Efficient Fine-Tuning
(PEFT) (Ding, 2023) can reduce the memory
footprint of LLMs considerably. QLoRA introduces
4-bit NormalFloat quantization and double
quantization, which yield results comparable to
full 16-bit fine-tuning.
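As a minimal sketch of this idea (assuming the Hugging Face transformers and bitsandbytes libraries; the model identifier below is only a placeholder), a model can be loaded with 4-bit NormalFloat and double quantization as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# QLoRA-style configuration: 4-bit NormalFloat (NF4) weights plus double
# quantization of the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

In QLoRA, low-rank adapter (LoRA) weights are then trained on top of the frozen 4-bit base model, which is what keeps the memory footprint small.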
In this work, we are using Ollama (Ollama, 2023)
to download and create quantized LLMs in the
GPT-Generated Unified Format (GGUF) file format,
supporting zero-shot and few-shot learning tasks. The
GGUF format was specifically designed for LLM
inference (Hugging Face, 2023). It is an extensible
binary format for AI models. GGUF also packages a
model into a single file, which simplifies distribution
and allows models to be loaded with little code.
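For illustration, a zero-shot prompt can be sent to a locally running Ollama server through its REST API; the model name, prompt wording, and note placeholder below are illustrative and do not reproduce our exact experimental setup.

import requests

prompt = (
    "Does the following clinical note mention a social determinant of health "
    "(e.g., unemployment, housing instability, high-risk behavior)? "
    "Answer yes or no.\n\n"
    "Note: [*de-identified note text*]"
)

# Ollama listens on http://localhost:11434 by default; /api/generate returns
# the full completion when streaming is disabled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": prompt, "stream": False},
    timeout=120,
)
print(response.json()["response"])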
2.1 Models
Google’s FLAN (Wei, 2022) LLM is based on the
LaMDA-PT 137B (billion) parameter pre-trained
language model, instruction-tuned on over 60 NLP
datasets. The base model was pre-trained on a
collection of web documents, dialog data, and
Wikipedia pages, tokenized into 2.49T BPE (Byte Pair
Encoding) tokens with a 32k vocabulary using the
SentencePiece library.
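As an illustration of this kind of subword tokenization (not FLAN’s actual tokenizer; the corpus file and sample sentence below are placeholders), a 32k BPE vocabulary can be trained and applied with the SentencePiece library:

import sentencepiece as spm

# Train a BPE model with a 32k vocabulary on a plain-text corpus (placeholder file).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe32k",
    vocab_size=32000,
    model_type="bpe",
)

# Load the trained model and split a sample sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="bpe32k.model")
print(sp.encode("The patient reports long-term unemployment.", out_type=str))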
Meta’s Llama2 (Touvron, 2023) is an updated
version of Llama. According to Meta, the training
corpus of Llama2 includes a mix of data from
publicly available sources, excluding data from Meta’s
products and services. Meta also claims that an effort has been
made to remove data from certain sites known to
contain high volumes of personal information about
individuals.
The Mistral model by Mistral AI (Jiang, 2023)
was developed with customized training, tuning, and
data processing techniques. It leverages grouped-
query attention (GQA) and sliding window attention
(SWA) mechanisms. GQA accelerates inference
and reduces the memory requirements during
decoding, allowing for larger batch sizes and hence
higher throughput. The Mistral 7B–
Instruct model was developed by fine-tuning Mistral-
7B on datasets publicly available on the Hugging
Face repository.
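To illustrate the intuition behind GQA (a minimal PyTorch sketch, not Mistral’s actual implementation), several query heads share each key/value head, so fewer key/value projections need to be computed and cached during decoding:

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_query_heads, seq_len, head_dim)
    # k, v: (batch, n_kv_heads, seq_len, head_dim), where n_query_heads is a
    # multiple of n_kv_heads; each key/value head serves a group of query heads.
    group_size = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group_size, dim=1)  # expand shared KV heads
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 8 query heads attend over only 2 cached key/value heads.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)  # shape: (1, 8, 16, 64)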
Vicuna (Peng, 2023), developed by the Large Model
Systems Organization (LMSYS), is an open-source
chatbot trained by fine-tuning Llama on user-shared
conversations collected from ShareGPT. It utilizes
700K instruction-tuning samples extracted from
ShareGPT.com (ShareGPT, 2023) via its public APIs.
It is an