based methods (Li, Jamieson, & DeSalvo, 2017), etc.
All these techniques have a search space defined by the
choice and range of parameters under consideration.
The goal of this paper is to identify, from a large
database of clinical text (the MIMIC-III database;
Johnson, 2016), text samples that express an SDoH
sentiment about the described patient (but not about
people related to the patient). This is achieved in a
two-step process. First, we extract text samples with
a regular expression that looks for concepts from the
SDoH ontology (SOHO) in the input text. However,
some text samples use a SOHO term only incidentally,
without referring to an SDoH issue of the patient.
Second, to classify a text sample as SDoH text or not,
we use a neural network pipeline that combines a
Clinical BioBERT (Alsentzer, Murphy, & Boag, 2019)
model with a neural network classifier framework. To
achieve better performance, we optimized selected
hyperparameters of the model using a genetic
algorithm (GA). We considered three ML optimizers,
namely AdamW (Zhang, 2018), Adafactor (Shazeer,
2018), and LAMB (You, 2020). In addition to the GA
operations of n-bit crossover and random bit-flip
mutation, we used roulette-wheel selection to obtain
the optimal candidate solution.
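In outline, the GA loop can be sketched as follows; the binary encoding, population size, and placeholder fitness function are illustrative rather than the exact settings used in our experiments:

import random

# Illustrative search space: each chromosome is a bit string indexing
# into these hyperparameter choices (the real encoding may differ).
OPTIMIZERS = ["AdamW", "Adafactor", "LAMB"]
LEARNING_RATES = [1e-5, 2e-5, 3e-5, 5e-5]

def decode(bits):
    # Map a 4-bit chromosome to a concrete hyperparameter setting.
    optimizer = OPTIMIZERS[(bits[0] * 2 + bits[1]) % len(OPTIMIZERS)]
    learning_rate = LEARNING_RATES[bits[2] * 2 + bits[3]]
    return {"optimizer": optimizer, "learning_rate": learning_rate}

def fitness(bits):
    # Stand-in fitness: in practice, train the Clinical BioBERT
    # classifier with the decoded settings and return its validation
    # accuracy. A deterministic dummy score keeps the sketch runnable.
    config = decode(bits)
    score = 0.5 + (0.1 if config["optimizer"] == "AdamW" else 0.0)
    return score + config["learning_rate"] * 1e3

def roulette_select(population, scores):
    # Roulette-wheel selection: pick a parent with probability
    # proportional to its fitness.
    pick = random.uniform(0, sum(scores))
    running = 0.0
    for individual, score in zip(population, scores):
        running += score
        if running >= pick:
            return individual
    return population[-1]

def crossover(a, b, n_points=1):
    # n-point crossover between two parent bit strings.
    points = sorted(random.sample(range(1, len(a)), n_points)) + [len(a)]
    child, swap, prev = a[:], False, 0
    for p in points:
        if swap:
            child[prev:p] = b[prev:p]
        swap, prev = not swap, p
    return child

def mutate(bits, rate=0.1):
    # Random bit-flip mutation.
    return [1 - b if random.random() < rate else b for b in bits]

def run_ga(pop_size=10, chrom_len=4, generations=20):
    population = [[random.randint(0, 1) for _ in range(chrom_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        population = [mutate(crossover(roulette_select(population, scores),
                                       roulette_select(population, scores)))
                      for _ in range(pop_size)]
    return decode(max(population, key=fitness))

print(run_ga())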
We also framed this problem as an entity recognition
task, using recent advancements in large language
models (LLMs), specifically Universal NER (Zhou &
Zhang, 2023). Universal NER uses a smaller model
with far fewer parameters, learned from its teacher
LLM, gpt-3.5-turbo-0301, by applying targeted
distillation. Additionally, we employed the state-of-
the-art hyperparameter optimization framework
Optuna (Akiba, Sano, & Yanase, 2019) and compared
its results with our model's. Optuna is among the
latest advancements in this field and is distinguished
by its define-by-run and pruning strategies. The
comparison of our model with Universal NER and
Optuna is presented in the Discussion section.
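To illustrate Optuna's define-by-run style, the sketch below searches over the same optimizer and learning-rate choices as our GA; the objective body is a placeholder rather than our actual training run:

import optuna

def dummy_validation_accuracy(optimizer, learning_rate):
    # Stand-in for training Clinical BioBERT and scoring it on
    # held-out data; deterministic so the sketch runs end to end.
    bonus = 0.05 if optimizer == "AdamW" else 0.0
    return 0.8 + bonus - abs(learning_rate - 3e-5) * 100

def objective(trial):
    # Define-by-run: the search space is declared inside the objective,
    # so later suggestions can depend on earlier ones. Pruning would
    # report intermediate scores via trial.report() and stop weak
    # trials early with trial.should_prune().
    optimizer = trial.suggest_categorical(
        "optimizer", ["AdamW", "Adafactor", "LAMB"])
    learning_rate = trial.suggest_float(
        "learning_rate", 1e-5, 5e-5, log=True)
    return dummy_validation_accuracy(optimizer, learning_rate)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)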
2 METHODS
2.1 Model Architecture
The Clinical BioBERT model architecture is a multi-
layer bidirectional transformer encoder
implementation. The input data is converted into token
embeddings, each a 768-dimensional vector
representation; 768 is the standard hidden size of the
BERT architecture. The input embeddings are first
passed through a multi-head self-attention
mechanism, which generates a set of attention weights
used to weigh the importance of each token in the
input sequence. The resulting context vector is passed
through a position-wise feed-forward neural network,
which further transforms it. The classification layer
takes the CLS token of the last layer and predicts the
class of the text sample. This layer is made up of two
linear layers separated by two dropout layers. Figure
1 shows the model architecture
of Clinical BioBERT for SDoH text classification.
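A minimal sketch of this architecture follows, assuming the Hugging Face transformers implementation of Clinical BioBERT (emilyalsentzer/Bio_ClinicalBERT); the ordering of the dropout layers, the intermediate layer size, and the dropout rate are illustrative assumptions:

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SDoHClassifier(nn.Module):
    # Clinical BioBERT encoder followed by the classification head
    # described above: two linear layers with dropout layers.
    def __init__(self, hidden=768, n_classes=2, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(
            "emilyalsentzer/Bio_ClinicalBERT")
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden),   # intermediate size is illustrative
            nn.Dropout(dropout),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # CLS token of the last layer
        return self.head(cls)

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = SDoHClassifier()
batch = tokenizer(["Patient lives alone and reports food insecurity."],
                  return_tensors="pt", truncation=True, padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])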
2.2 Dataset
We utilized the SOHO ontology (Kollapally, Chen, Xu,
& Geller, 2022), available in BioPortal, as a reference
terminology for extracting concepts from MIMIC-III
v1.4. The concepts in the SOHO branch “Social
determinants of health” were used for concept
extraction from MIMIC-III clinical notes. MIMIC-III
contains data associated with 53,423 distinct hospital
admissions of patients aged 16 years and above,
admitted to critical care units between 2001 and 2012. We
specifically utilized clinical notes available in the
Note_events table, which is a 4GB data file.
The NLTK library was used for text pre-processing.
After stop-word removal and conversion of the text to
lower case, the clinical notes from MIMIC-III were
fed to a regex-based Python script to extract text
fragments containing SDoH concepts. Whenever a
regular expression found a string in the Note_events
file that matched a concept in the SOHO ontology, we
extracted the preceding four sentences and the
succeeding four sentences from Note_events.
Preliminary observations showed that this window is
typically sufficient to capture the SDoH context.
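A sketch of this extraction step follows; the SOHO term list shown is a small illustrative subset of the ontology branch, and NLTK's sentence tokenizer stands in for our pre-processing details:

import re
import nltk

nltk.download("punkt", quiet=True)   # sentence tokenizer models

# Illustrative subset of the SOHO "Social determinants of health"
# branch; the full term list comes from the ontology in BioPortal.
SOHO_TERMS = ["homeless", "unemployed", "food insecurity",
              "social isolation"]
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SOHO_TERMS)) + r")\b", re.IGNORECASE)

def extract_windows(note, window=4):
    # Return the +/- four-sentence context around every SOHO term match.
    sentences = nltk.sent_tokenize(note)
    fragments = []
    for i, sentence in enumerate(sentences):
        if PATTERN.search(sentence):
            start = max(0, i - window)
            fragments.append(" ".join(sentences[start:i + window + 1]))
    return fragments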
Not all rows of data returned by the Python regex
script expressed a strong SDoH sentiment about the
patient under consideration. Hence, we performed a
manual review of a subset of approximately 1500 rows
of extracted text and annotated 1054 of them with the
label "1" for training the Clinical BioBERT
architecture; these sentences describe SDoH
statements about the patient. Negative training
samples (1130 rows), which do not describe SDoH
statements about the patient, were extracted from
admission labs, discharge labs, and discharge
instructions, and labelled "0." The resulting 2184 rows
of data were split into 80% training and 20% test data.
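The split can be reproduced along the following lines (a scikit-learn sketch; the placeholder rows, random seed, and use of stratification are illustrative):

from sklearn.model_selection import train_test_split

# Placeholder rows standing in for the 2184 annotated fragments
# (label 1 = SDoH statement about the patient, label 0 = not).
texts = ["fragment %d" % i for i in range(2184)]
labels = [1] * 1054 + [0] * 1130

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels)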
2.3 Choice of Optimizers
Adaptive optimization algorithms such as Adam tend
to perform better than Stochastic