
ensuring patients’ continuity of treatment even if they
are moved to another facility.
EHRs encompass an expansive array of data types,
spanning numerical data (such as blood pressure),
categorical data (such as pain scale assessments),
textual information (such as prescription details), and
temporal data indicating the timing of measurements.
This variety of data types contributes to the
heterogeneity of EHR data. On the
other hand, most ML algorithms are primarily de-
signed to handle numerical data and face difficulties
when dealing with non-numerical types like categor-
ical data, which can be categorised into nominal data
(without any inherent order) and ordinal data
(characterised by a specific order). Although categorical
features are significant for enhancing the interpretability
of ML models, they pose challenges. Conventional
techniques can convert these features into numerical
variables; however, as the number of unique values
increases, the resulting feature matrix becomes
high-dimensional and computationally costly, especially
when used with computationally demanding models.
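The distinction between ordinal and nominal data can be illustrated with a minimal sketch. The level names and helper function below are hypothetical and chosen only for illustration; they are not taken from the dataset used in this study.

```python
# Ordinal feature: the levels have a meaningful order, so rank codes preserve it.
# The level names below are hypothetical, chosen only for illustration.
PAIN_ORDER = ["none", "mild", "moderate", "severe"]
PAIN_RANK = {level: rank for rank, level in enumerate(PAIN_ORDER)}


def encode_ordinal(values, rank_map):
    """Map each ordinal level to its integer rank."""
    return [rank_map[v] for v in values]


pain_codes = encode_ordinal(["mild", "severe", "none"], PAIN_RANK)

# Nominal feature: integer codes are arbitrary labels. Enumerating the levels
# in a different order changes the codes, so arithmetic on them is meaningless.
transport = ["ambulance", "walk in", "helicopter"]
codes_a = {level: i for i, level in enumerate(sorted(transport))}
codes_b = {level: i for i, level in enumerate(sorted(transport, reverse=True))}
```

Because the two nominal codings are equally valid yet disagree, a model that treats such codes as magnitudes would learn an ordering that does not exist in the data.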
In recent studies, particularly in the field of
medicine, there is a growing trend of using a subset
of values extracted from nominal features (NFs) to
strike a balance between optimising data utility
and managing the dimensionality of the dataset.
Nonetheless, this approach has potential downsides,
including the risk of losing valuable information and
a heavy reliance on domain expertise to select the most
relevant values. Consequently, in numerous applications,
these features are either disregarded altogether or
reduced, using domain knowledge, to a subset of
their distinct values. In this study,
our contributions are to:
• Tackle the challenges associated with NFs in
EHRs by employing the proposed target encoding
preprocessing framework (TE-PrepNet).
• Optimise the handling of high-cardinality NFs by
minimising dependency on domain experts while
maximising the integration of embedded values.
This optimisation is accomplished by incorporating
TE-PrepNet.
• Assess two distinct ED-based prediction tasks:
prediction of hospital admission at the time of
triage in the ED, and prediction of reattendance to
the ED within 72 hours after discharge.
We applied the target encoding approach to a
set of chosen NFs (race, arrival transport mode, and
chief complaint), encompassing both high- and
low-cardinality features, which are extracted from
the Medical Information Mart for Intensive Care
IV Emergency Department (MIMIC-IV-ED) dataset
(Johnson et al., 2021). The results highlighted the
performance enhancements and effectiveness of using
TE-PrepNet on both of the aforementioned
prediction tasks. In particular, for hospital admission
prediction, random forest with target encoding achieved
an AUROC of 0.8458, outperforming the baseline AUROC
of 0.7520. Furthermore, in predicting 72-hour
reattendance, XGBoost with target encoding
achieved an AUROC of 0.6975, an improvement
over the baseline AUROC of 0.6166.
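The internals of TE-PrepNet are not described in this section, so the following is only a minimal sketch of generic smoothed target (mean) encoding, not the authors' implementation. The function names, the `smoothing` parameter, and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed mean-target value for every category level."""
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if not targets:
        raise ValueError("cannot fit on empty data")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    # Shrink each level's mean toward the global mean; rare levels shrink most.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Replace each level with its learned value; unseen levels get the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Toy, made-up training rows: (arrival transport mode, admitted?).
train_transport = ["AMBULANCE", "WALK IN", "AMBULANCE", "WALK IN", "WALK IN", "HELICOPTER"]
train_admitted = [1, 0, 1, 0, 1, 1]

mapping, prior = fit_target_encoding(train_transport, train_admitted, smoothing=2.0)
# "TAXI" is a hypothetical level that never appeared during fitting.
encoded = apply_target_encoding(["AMBULANCE", "WALK IN", "TAXI"], mapping, prior)
```

Each NF becomes a single numerical column regardless of how many distinct levels it has. In practice the mapping should be fitted on training data only (or within cross-validation folds), since it is derived from the target and would otherwise leak label information into evaluation.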
2 RELATED WORK
Given the continual influx of data into EHRs, the in-
tegration of ML holds promise in facilitating com-
prehensive analysis. By discerning trends, detecting
patterns, and offering predictions pertaining to a pa-
tient’s well-being, ML can play a pivotal role in en-
hancing healthcare. In recent years, ML models have
capitalised on the potential offered by EHRs to un-
dertake a spectrum of predictions pertinent to the ED.
These efforts include predictions of hospital
admission (Barak-Corren et al., 2017; Xie et al., 2022;
Hong et al., 2018; Graham et al., 2018; Al Shalabi
et al., 2006), early prediction of sepsis or septic
shock in the ED (Wardi et al., 2021), prediction of
the length of stay within the ED (Gurazada
et al., 2022; Rahman et al., 2020), and prediction of
the length of stay specifically for COVID-19 patients
within the ED (Etu et al., 2022). Implementing early
prediction models for patient admissions can therefore
help to address long boarding times, expedite resource
allocation, and enhance overall patient care efficiency.
Conventional techniques, such as one-hot encod-
ing (or dummy encoding), have been employed in the
handling of nominal variables with a limited number
of distinct values (Hancock and Khoshgoftaar, 2020).
These methods effectively convert a nominal variable
with N unique values into N new variables (or N − 1
variables in the case of dummy encoding) to cap-
ture its categorical nature. However, their effective-
ness significantly decreases when dealing with NFs
with many distinct values, primarily due to the inher-
ent challenge of high dimensionality. The escalation
in dimensionality poses computational and interpre-
tational difficulties, limiting the applicability of these
methods in scenarios where NFs exhibit a multitude
of unique values.
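The N versus N − 1 column counts described above can be illustrated with a minimal pure-Python sketch. The helper name `one_hot_encode`, its `drop_first` flag, and the toy values are assumptions introduced here for illustration only.

```python
def one_hot_encode(values, drop_first=False):
    """Return (column_names, rows) for a nominal feature.

    With drop_first=False each of the N levels gets its own 0/1 column
    (one-hot). With drop_first=True the first level becomes the implicit
    reference and only N - 1 columns are produced (dummy encoding).
    """
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Toy, made-up chief complaints (illustrative only).
complaints = ["chest pain", "fever", "chest pain", "fall", "fever"]

onehot_cols, onehot_rows = one_hot_encode(complaints)
dummy_cols, dummy_rows = one_hot_encode(complaints, drop_first=True)
```

Here three distinct levels yield three one-hot columns or two dummy columns. The column count grows linearly with the number of distinct levels, which is why this approach becomes impractical for high-cardinality NFs.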
Apart from one-hot encoding, clustering techniques
are another option.
These techniques involve grouping individual values
into K sets. Although this approach results in fewer
ICT4AWE 2024 - 10th International Conference on Information and Communication Technologies for Ageing Well and e-Health