Using Under-Represented Subgroup Fine Tuning to Improve Fairness for
Disease Prediction
Yanchen Wang¹, Rex Bone¹, Will Fleisher¹, Carole Roan Gresenz¹, Jean Mitchell¹,
Wilbert van der Klaauw², Crystal Wang² and Lisa Singh¹
¹Georgetown University, Washington, DC, U.S.A.
²Federal Reserve Bank of New York, New York, NY, U.S.A.
Keywords:
Machine Learning Fairness, Disease Prediction, Multivariate Sensitive Attribute, Model Fine Tuning.
Abstract:
The role of artificial intelligence is growing in healthcare and disease prediction. Because of its potential
impact and demographic disparities that have been identified in machine learning models for disease predic-
tion, there are growing concerns about transparency, accountability and fairness of these predictive models.
However, very little research has investigated methods for improving model fairness in disease prediction,
particularly when the sensitive attribute is multivariate and when the distribution of sensitive attribute groups
is highly skewed. In this work, we explore algorithmic fairness when predicting heart disease and Alzheimer’s
Disease and Related Dementias (ADRD). We propose a fine tuning approach to improve model fairness that
takes advantage of observations from the majority groups to build a pre-trained model and uses observations
from each underrepresented subgroup to fine tune the pre-trained model, thereby incorporating additional
specific knowledge about each subgroup. We find that our fine tuning approach performs better than other
algorithmic fairness fixing methods across all subgroups even if the subgroup distribution is very imbalanced
and some subgroups are very small. This is an important step toward understanding approaches for improving
fairness for healthcare and disease prediction.
1 INTRODUCTION
Algorithmic decision making that relies on artificial
intelligence is increasingly impacting people’s daily
lives in areas such as credit approval, job hiring, and
criminal justice. We are also seeing its growth with respect to disease prediction and clinical decision making (Jiang et al., 2017; Secinaro et al., 2021; Yu
et al., 2018). Given the importance of such deci-
sion making, there are growing concerns about the
transparency, accountability, and fairness of the pre-
dictive models being designed (Binns, 2018; Center
for Democracy & Technology, 2024). In September
2022, the U.S. Food and Drug Administration (FDA)
issued a guidance for Clinical Decision Support Soft-
ware. The guidance mentions potential risks associ-
ated with software intended to provide recommenda-
tions to a healthcare provider about prevention, diag-
nosis, or treatment of a disease or condition. It de-
scribes an automation bias where human tends to over
rely on suggestions from automated systems. They
recommend that algorithmic decision making not re-
place or direct the judgment of healthcare profession-
als (FDA, 2022).
The guidance from the FDA is not unfounded.
Researchers have identified demographic disparities
in disease diagnosis and treatment usage (Barthold
et al., 2020; Straw et al., 2024; Allen et al., 2020).
Researchers have also identified demographic dispar-
ities in machine learning models predicting disease
(Yuan et al., 2023; Davoudi et al., 2024; Fazelpour
and Danks, 2021). For example, Davoudi and col-
leagues study machine learning models predicting the
risk of hospitalization and emergency department vis-
its in home healthcare patients and identify signif-
icant disparities in model performance across both
racial and gender subgroups (Davoudi et al., 2024).
These examples suggest that members of marginal-
ized groups unfairly receive worse predictions than
those in advantaged groups.
While researchers have continued to identify dis-
parities in drug use and disease prediction models,
comparatively little research has focused on devel-
oping mitigation algorithms to improve the model
fairness for health care applications. This focus on healthcare is important since the design and application of fairness may differ from that in other domains. For example, one important difference is that
demographic variables are expected to be part of the
model since risk factors associated with many differ-
ent health conditions are known to vary based on pa-
tient demographics. Given this, a technical consideration is that there may be multiple different sensitive attributes, some of which are multivariate. Since most algorithmic fairness literature to date
focuses on corrections for binary sensitive attributes,
more work is needed to develop technical solutions
for handling this situation.
Disparities in model performance for people in
different social groups can have significant conse-
quences for the well-being of people in marginal-
ized groups. Inaccurate or failed diagnoses, for in-
stance, can result in people failing to receive appro-
priate medical care. If a diagnostic system is only
accurate for members of advantaged social groups,
advantaged group members are likely to receive bet-
ter care and a disproportionate amount of medical re-
sources. If these disparities are the result of failures
in the collection of data, or at other points in the ma-
chine learning pipeline, then the use of machine learn-
ing models can contribute to an unfair distribution of
medical care and resources. Our goal in this project
is to develop methods that help mitigate these dispar-
ities with respect to the accuracy of machine learning
models used for healthcare decisions.
To that end, this paper focuses on improving
the algorithmic fairness of machine learning models
for disease prediction. We explore existing strate-
gies and propose a new strategy for reducing model
bias. We demonstrate its effectiveness on a heart
disease data set and synthetic data designed for de-
tecting Alzheimer’s Disease and Related Dementias
(ADRD). Our new approach focuses on pretraining a
model using available (possibly skewed) initial data
to provide sufficient contextual insight for the model,
and then fine tuning the pretrained model on exam-
ples from subgroups that are not well represented in
the training data set. This results in a set of models
that are tailored to different under-represented subgroups.
Our main contributions can be summarized as fol-
lows. (1) We propose a novel fine tuning approach for
improving model fairness across all subgroups, even
in the presence of an imbalanced group distribution.
(2) Our approach considers multivariate sensitive at-
tributes with highly skewed, imbalanced group distri-
bution, where previous literature has focused on bi-
nary sensitive attributes and/or more balanced sensi-
tive attribute distributions. (3) We develop and release
the source code for a synthetic data generator that can
generate temporal data sets containing variables fol-
lowing a range of distributions, thereby allowing re-
searchers to easily generate synthetic data for disease
prediction applications so private health data does not
need to be shared.
2 RELATED LITERATURE
2.1 Disease Prediction Using Machine
Learning
In the past decade, we have seen a growth in disease
prediction research that uses machine learning mod-
els. Some recent surveys have discussed the strengths
and limitations of different methods for specific pre-
diction tasks (Shah et al., 2020; Singh and Kumar,
2020; Fatima and Pasha, 2017). To date, heart dis-
ease prediction is the task that has received the most
attention in the literature. The models being used
for heart disease prediction tend to be developed us-
ing classic machine learning methods, including ran-
dom forest, decision trees, logistic regression, and
support vector machine (SVM). Recently, Xie and
colleagues conducted a survey of disease prediction
research that uses deep learning models (Xie et al.,
2021). They showed that deep learning models are
outperforming classic models, especially when the
data are not in tabular form. For example, when the
data are image-based, e.g., X-ray, computed tomog-
raphy (CT), and magnetic resonance imaging (MRI)
scans, deep learning neural network models, includ-
ing convolutional neural networks (CNNs) and recur-
rent neural networks (RNNs) have performed better.
When the data are in tabular format, artificial neural
networks (ANNs) sometimes still perform better than
the classic models.
Machine Learning for Predicting Alzheimer’s Dis-
ease and Related Dementias Diagnosis. Existing
machine learning models predicting ADRD mostly
use features from medical images (Rathore et al.,
2017) and medical exams, e.g. blood plasma spec-
troscopy (Paraskevaidi et al., 2018; Doecke et al.,
2012). While these machine learning models can
achieve relatively high accuracy using features ex-
tracted from these data sources, the data sets are ex-
pensive, making it hard to scale to population lev-
els (Frisoni, 2001). In recent years, researchers have
started to understand the relationship between ADRD
and personal finances. For example, Gresenz and col-
leagues find that early-stage ADRD can negatively af-
fect household financial worth (Gresenz et al., 2020).
Agarwal and Muckley similarly find that older persons' money management difficulties can help iden-
tify individuals with early-stage dementia (Agarwal
and Muckley, 2024).
In this work, we focus on fairness related to heart
disease prediction, because of its prevalence in the lit-
erature, and ADRD prediction, because of its link to
personal financial management. Our team has exper-
tise in economics and finance. This allows us to gen-
erate realistic, synthetic financial data that we use in
this paper.
2.2 Bias Mitigation Algorithms
While fairness metrics for multivariate sensitive at-
tributes are similar to binary ones, bias mitigation al-
gorithms for multivariate sensitive attributes are very
different (Kang et al., 2022; Ma et al., 2021; Chen
et al., 2024), even though they have also been ap-
plied at different stages of the modeling process: pre-
processing (Kamiran and Calders, 2012; Feldman
et al., 2015; Chakraborty et al., 2021), in-processing
(Tarzanagh et al., 2023; Shui et al., 2022; Chen et al.,
2022; Peng et al., 2022) and post-processing (Hardt
et al., 2016; Pleiss et al., 2017).
Beginning with a pre-processing method, Wang
and Singh propose a resampling approach to im-
prove fairness. This method achieves statistical in-
dependence between the sensitive attributes and the
outcome (Wang and Singh, 2021). Chakraborty
and colleagues propose Fair-SMOTE (Chakraborty
et al., 2021). It utilizes the existing Synthetic Mi-
nority Over-sampling Technique (SMOTE) algorithm
(Chawla et al., 2002), which uses k-nearest neighbors (KNN) to generate new observations that are close to existing observations.
Chen and colleagues (Chen et al., 2022) pro-
pose MAAT, an in-processing method for improving
the ML fAirness-performAnce Trade-off. Their approach
trains two models, one that optimizes performance
and one that optimizes fairness using the training
data. The fairness optimization model corrects selec-
tion bias by undersampling over-represented groups
to improve fairness and the performance optimiza-
tion model uses a classic machine learning model, such as random forest or logistic regression, to opti-
mize performance.
Finally, Hardt and colleagues (Hardt et al., 2016)
propose equalized odds processing, a post-processing
method to improve fairness. In this method, the au-
thors utilize the decision probability from the classi-
fier to determine a different probability threshold for
each subgroup, instead of the traditional 50/50 split.
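As an illustration of this post-processing idea, the following is a minimal sketch (our own illustration, not the exact procedure of Hardt and colleagues) in which each group's threshold is set so that the group reaches a common target true positive rate; the function names and the target TPR value are assumptions made for the example.

```python
import numpy as np

def per_group_thresholds(y_true, y_score, groups, target_tpr=0.80):
    """Pick a separate decision threshold per sensitive-attribute group so that
    every group reaches (approximately) the same true positive rate."""
    thresholds = {}
    for g in np.unique(groups):
        pos_scores = y_score[(groups == g) & (y_true == 1)]
        # the (1 - target_tpr) quantile of the positives' scores yields ~target_tpr TPR
        thresholds[g] = np.quantile(pos_scores, 1 - target_tpr) if pos_scores.size else 0.5
    return thresholds

def predict_with_group_thresholds(y_score, groups, thresholds):
    # apply each observation's group-specific threshold instead of a fixed 0.5
    return np.array([int(p >= thresholds[g]) for p, g in zip(y_score, groups)])
```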
None of these proposed methods address a major
challenge that arises when we have a multivariate sen-
sitive attribute: limited training data for multiple, spe-
cific subgroups, not just a single minority class. To
address this problem, recent research uses multi-level
modeling. Shui and colleagues (Shui et al., 2022) pro-
pose a bi-level objective model. In the lower level, subgroup-specific predictors are trained using a small amount of data from each subgroup; in the upper level, the model takes feedback from each of the lower-level results and updates the model to be close to all
subgroup specific predictors. However, this approach
tends to overfit the data, especially when the distribu-
tion of data for each subgroup is different (Tarzanagh
et al., 2023).
Fine Tuning Approach. Fine tuning is a process of adapting a pre-trained deep learning model to a specific machine learning task. One can view this
as adding domain-specific knowledge to a more gen-
eral knowledge-base. Specifically, the process up-
dates parameters in a neural network model using the
domain-specific training examples, thereby adjusting
the pre-trained model to perform better on the spe-
cific learning task of interest. It accomplishes this
using a very small amount of training data. This ap-
proach has been shown to be effective for natural lan-
guage processing (NLP) models. Specific examples
include fine tuning Bidirectional Encoder Represen-
tations from Transformers (BERT) (Sun et al., 2019;
Liu et al., 2019), and more recent GPT models devel-
oped by OpenAI (Houlsby et al., 2019; Howard and
Ruder, 2018; Min et al., 2023). In this paper, we pro-
pose a fine tuning approach that focuses on fairness
improvement instead of general model performance
improvement. We will further discuss our approach
in Section 3.2.
2.3 Machine Learning Fairness on
Disease Prediction
In many areas such as hiring and credit approval, de-
cision makers are prohibited by law from using de-
mographic features of individuals to make decisions.
The goal is to remove the influence of demograph-
ics to reduce the likelihood of bias, e.g. sexism or
racism. In disease prediction, demographic features
such as sex, race and age can provide necessary in-
sight into possible risk factors for specific diseases
and are therefore, often included as training features
in machine learning models (Grampurohit and Sagar-
nal, 2020; Arumugam et al., 2023). This means that
we expect differences in predictive accuracy across subpopulations and can easily determine which subpopulations have higher predictive accuracy and which do not. For those that do not, our goal is to develop strategies for raising predictive accuracy to levels similar to those of the best-performing subpopulations when additional training data are not available.
Existing Work to Identify and Reduce Bias in Dis-
ease Prediction Models. Research on fairness in
disease prediction is fairly limited and mostly focuses
on fairness for binary sensitive attributes (Li et al.,
2023; Chae et al., 2023; Davoudi et al., 2024; Feng
et al., 2024; Grote and Keeling, 2022; Raza et al.,
2023; Chen et al., 2023). Li and colleagues study
the bias in machine learning models for cardiovascu-
lar disease prediction and compare the performance
of various bias mitigation strategies. They use binary
sensitive attributes, sex and race, with both classic
machine learning and deep learning models. Their
bias mitigation algorithms include resampling using
two approaches: (1) resampling the training data by
group, e.g., sampling the binary sex feature so that
each group has the same number of observations, and
(2) resampling the training data by label and remov-
ing the sensitive attribute from training data during
model training and prediction (Li et al., 2023).
Chae and colleagues build time series risk models
using the AutoGluon model and the LightGBM model
to predict emergency department visits and hospital-
izations for patients with heart failure. Features include patient demographics, vital signs, medical history, and notes from prior visits (Chae et al., 2023). A year later, the same team investigates the fairness gaps in the same two models with respect to race and
gender. They use error rate balance and predictive
parity as fairness metrics and find that there are signif-
icant disparities in model performance across demo-
graphic subgroups. However, in the paper, the authors
do not propose or use any existing bias mitigation al-
gorithms to improve fairness (Davoudi et al., 2024).
We extend the literature in the following ways: (1)
we consider cases when the sensitive attribute is mul-
tivariate, (2) we propose an approach that fine tunes
the model using subpopulations that are not well rep-
resented (resampling by group) to improve fairness of
the multivariate sensitive attribute, and (3) unlike (Li et al., 2023), who remove the sensitive attribute to improve fairness, our approach includes sensitive demo-
graphic attributes in the training data because these
attributes are associated with risk factors that are im-
portant for building robust disease prediction models.
3 FINE TUNING FOR FAIRNESS
This section begins with definitions and notation. We
then describe our proposed fine tuning approach.
3.1 Preliminaries
Let $Y = \{y_1, y_2, \cdots, y_n\}$ be the set of binary labels we want to predict, where for the $i$th observation $y_i \in \{0, 1\}$, with $y_i = 1$ indicating being diagnosed with the disease and $y_i = 0$ indicating not being diagnosed with the disease. Similarly, let $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_n\}$, where $\hat{y}_i \in \{0, 1\}$ is the predicted label for the $i$th observation. Let $X = \{x_1, x_2, \cdots, x_n\}$ be the set of training features. A machine learning model $M$ is trained using $X$ and $Y$. Using the model $M$, we obtain a set of predicted labels $\hat{Y} = M(X)$. Let $S = \{s_1, s_2, \cdots, s_n\}$, where $S \subseteq X$ and $s_i \in \{\text{sensitive attribute groups}\}$, be the sensitive attribute for the $i$th observation.
The task we are interested in is to train our machine learning model $M$ so that it is fair for the subgroups in $S$. We measure the model performance and fairness using $X$, $Y$, $\hat{Y}$ and $S$.
3.2 Proposed Approach
There are multiple approaches to improve machine
learning fairness. One approach is to build one model
for the entire data set and use various bias mitiga-
tion algorithms to improve the fairness and make the
model fair across all subgroups. This approach works
well when the number of sensitive attribute groups is
small and each group is well represented in the train-
ing data set. However, when there are more sensitive
attribute groups and the subgroup distribution of the
training data is very imbalanced, one model may not
perform well for all subgroups. Another approach to
improve fairness is to build a separate model for each subgroup so that each model is optimized for its own group.
If we build models from scratch using training data
from each subgroup, any model that is built using sub-
groups with limited training data may perform poorly.
There are over-sampling techniques that can be used to increase the size of the training sample. However, because the generated samples follow a similar distribution to the original
observations, the overall model performance may not
improve. We propose addressing these limitations by
using a fine tuning approach that is designed to adjust
the model to improve performance on subgroups with
fewer training observations (see Section 2.2 for fine
tuning background).
Consider a disease prediction task where we have
a large amount of data and the subgroup distribution
is very imbalanced. Our approach, similar to oth-
ers, begins by training a general pre-trained model
using a large number of examples so that the pre-
trained model has good overall performance on our
task and learns important background about our pre-
diction task. We can view this step as being simi-
lar to state of the art large language models such as
BERT and GPT. These models were trained on large
data sets that provide general information that can be
useful for specific prediction tasks. We then use train-
ing examples specific to the smaller subgroups to fine
tune the pre-trained model. The fine tuning step en-
ables us to utilize the general knowledge from the
pre-trained model and achieve higher performance for
smaller subgroups using a small number of training
examples.
Figure 1 illustrates our approach. In step one,
we first split the available labeled data into three
non-overlapping partitions using stratified sampling:
training, hold out for fine tuning and hold out for val-
idation. The hold out set for validation has the same
distribution as the raw data and the hold out set for
fine tuning is a stratified sample of the raw data, where
the number of observations for each subpopulation is
the same. The remaining data serves as the training
data set. In step 2, during model training, we train
a large neural network model from scratch using the
training data. Because this model is developed us-
ing a large data set, it is likely biased toward accurate
prediction for the groups that have more training ex-
amples. We then use this base model as a pre-trained
model and fine tune it on minority sensitive attribute
groups, those having relatively low performance and
fairness scores. We use the hold out set for fine tun-
ing for this step. During fine tuning, the model pa-
rameters are updated. In step 3, after the fine tuning
process, we have a model for each sensitive attribute
group and we use the holdout for validation set to
measure model performance and fairness.
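To make steps 2 and 3 concrete, the following is a minimal PyTorch sketch of pre-training and per-subgroup fine tuning. The tensors X_train/y_train (training partition), X_finetune/y_finetune/s_finetune (hold out for fine tuning), and the list subgroups_to_fix are assumed to already exist; the architecture and hyperparameters are illustrative and not the exact configuration used in our experiments.

```python
import copy
import torch
import torch.nn as nn

def make_model(n_features):
    # small feed-forward network producing one logit per observation
    return nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                         nn.Linear(64, 32), nn.ReLU(),
                         nn.Linear(32, 1))

def train(model, X, y, epochs, lr):
    # full-batch training with binary cross-entropy on the logit output
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y.float())
        loss.backward()
        opt.step()
    return model

# Step 2a: pre-train a base model on the (possibly skewed) training partition.
base = train(make_model(X_train.shape[1]), X_train, y_train, epochs=200, lr=1e-3)

# Step 2b: fine tune one copy of the base model per under-represented subgroup,
# using the stratified hold-out-for-fine-tuning partition and a smaller learning rate.
models = {}
for g in subgroups_to_fix:
    mask = (s_finetune == g)
    models[g] = train(copy.deepcopy(base),
                      X_finetune[mask], y_finetune[mask],
                      epochs=50, lr=1e-4)

# Step 3: at evaluation time, route each observation to its subgroup's fine tuned
# model, falling back to the base model for subgroups that were not fine tuned.
```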
4 SYNTHETIC DATA
GENERATOR
Given the concerns around health data privacy, we
have developed a synthetic data generator to allow
researchers to generate data sets that can be used as
training data to develop initial models before devel-
oping a final one in a privacy-preserving environment
with the actual patient data. Our data generator al-
lows researchers to generate temporal and non-time
varying records, vary the distribution for different fea-
tures, and use group level statistics to generate indi-
vidual records. Example statistics include mean, me-
dian, min, max, skewness, and covariance. This en-
ables safeguarding individual data while optimizing
models. We make our open-source synthetic generator available to the research community at https://github.com/GU-DataLab/healthinf_synthetic.
If the data generator is used to create data sets
with temporal variables, the researcher begins by se-
lecting a distribution for the temporal period. For ex-
ample, if the researcher wants to generate data from
2010 to 2020 with one observation per year, the re-
searcher needs to decide on the underlying population
distribution at the first observation period, e.g., 2010
in this case. The first row of Table 1 shows all the
available distributions in the synthetic data generator.
It contains both univariate and multivariate distribu-
tions. The distribution selection can be determined
using the Kolmogorov-Smirnov goodness of fit test
(Massey Jr, 1951) on the aggregated data set. For temporal data, the researcher also needs to select a trend for the temporal variables. The trend measures how
observations change over time. There are four dif-
ferent trends available: linear, uniform, polynomial
and exponential. The parameters for each trend can
be chosen using regression analysis on the aggregated
data. With each trend, the researcher can specify
the amount of noise (random or non-random) for the
trend.
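As a concrete illustration of these parameters, the sketch below generates one temporal feature with a log-normal baseline, a linear trend, and random noise. The function and parameter names are our own illustration and do not mirror the released generator's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_temporal_feature(n_individuals, n_periods,
                              mean=0.0, sigma=0.5,        # log-normal parameters for the first period
                              slope=0.1, noise_sd=0.05):  # linear trend and per-period noise
    """Draw a baseline value per individual from a log-normal distribution,
    then evolve it over time with a linear trend plus Gaussian noise."""
    baseline = rng.lognormal(mean, sigma, size=n_individuals)
    trend = slope * np.arange(n_periods)                   # linear temporal trend
    noise = rng.normal(0.0, noise_sd, size=(n_individuals, n_periods))
    return baseline[:, None] + trend[None, :] + noise      # shape: (individuals, periods)

# e.g. yearly observations from 2010 to 2020 (11 periods) for 1000 individuals
panel = generate_temporal_feature(1000, 11)
```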
5 EXPERIMENTAL DESIGN
The discussion of the experimental design is broken
down as follows: the data sets, the evaluation mea-
sures, the machine learning models, and the bias mit-
igation algorithms.
5.1 Data Sets
We use the following two data sets: the UCI Machine
Learning Repository heart disease data set and a syn-
thetic data set that has a similar distribution to a subset
of financial data used to predict ADRD.
Figure 1: Fine tuning for fairness steps.
Table 1: Synthetic data generator parameters.
Type of distribution
gamma, beta, normal, uniform, weibull, log normal,
multivariate normal, multivariate log normal
Temporal trend linear, uniform, polynomial, exponential
Type of data categorical, continuous numeric
5.1.1 UCI Heart Disease Data
The heart disease data from the UCI Machine Learn-
ing Repository contains 304 observations with 14 fea-
tures, including age, sex, and medical record infor-
mation, e.g. blood pressure and heart rate. The task
is to predict whether the individual has heart disease.
This data set was obtained from the Cleveland Clinic
and it has been extensively used by machine learn-
ing researchers (Janosi et al., 1988). We consider two
features as sensitive attributes: sex (a binary variable)
and age (multivariate, binned, attribute). For the sex
variable, there are 207 male observations and 97 fe-
male observations. Age is a numeric continuous vari-
able in the original data set. We create three imbal-
anced bins with 25%, 50%, and 25% of the data in
each bin.
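For illustration, one way such quantile-based bins could be constructed from the continuous age column is sketched below (our own example, not necessarily the exact code we used).

```python
import numpy as np

def bin_age(age):
    """Split a continuous age variable into three imbalanced bins holding
    roughly 25%, 50%, and 25% of the observations."""
    q25, q75 = np.quantile(age, [0.25, 0.75])
    return np.digitize(age, bins=[q25, q75])   # 0, 1, 2 -> bin 1, bin 2, bin 3
```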
5.1.2 Synthetic Data for ADRD Prediction
Our second task is predicting ADRD. As described
in Section 2.1, researchers have identified a link be-
tween money management ability and ADRD (Gre-
senz et al., 2020; Agarwal and Muckley, 2024). We
use that knowledge to generate financial features and
basic health information that maps to existing credit
and health data sets.
The credit data that we map our distribution to comes from Equifax. These data are the basis of
the Federal Reserve Bank of New York’s Consumer
Credit Panel. These data have been merged at the in-
dividual level using a unique common identifier (So-
cial Security number) with Medicare enrollment and
claims data for fee-for-service (FFS) enrollees. The
merged data set only contains individuals older than
65 years old who are enrolled in Medicare fee-for-
service. This data set encompasses quarterly obser-
vations on financial features for each quarter between
1999 and 2017 (Gresenz et al., 2024). In addition to
financial features, the merged data set contains ba-
sic demographic information of individuals such as
age, sex, and race (the demographic information is from the Medicare data; no explicit race or ethnicity data is obtained from Equifax). We use the race feature as the
sensitive attribute and there are 6 sensitive attribute
groups: White, Black, Asian, Hispanic, Native Amer-
ican and others/unknown. We construct a random
subsample of the data set that contains 5 percent of
the merged data with a subset of 12 features from ap-
proximately 1000 features in the merged data. The 12
features are training features about financial indicators such as number of accounts, balance amounts, and past-due amounts. In the rest of the paper, we refer to this five percent merged set as the Merged Subsample
of Consumer Credit Panel data (MS-CCP data). The
synthetic data we generate is based on the statistical
properties of the MS-CCP data.
5.2 Evaluation
5.2.1 Fairness Metrics
In our fairness evaluation, we use two metrics to
quantify fairness, equal opportunity (Hardt et al.,
2016) and accuracy disparity (Berk et al., 2021; Zafar
et al., 2017). There are many other fairness metrics
available. Feng and colleagues published a survey of
machine learning fairness in healthcare that presents nine different fairness metrics used in the literature (Feng
et al., 2024). To the best of our knowledge, no single best metric has been adopted by the health research community. In this work, we use equal opportunity and ac-
curacy disparity to show the effectiveness of our pro-
posed method. We expect the results will generalize
to other fairness metrics and demonstrating this is one
of our future research directions.
The formal equal opportunity metric—also called
true positive rate parity—is inspired by eponymous
principles of justice from political philosophy, which
have broad acceptance (Barocas et al., 2023; Loi
et al., 2021). Equality of opportunity principles re-
quire, roughly, that people have the same chances of
obtaining some good outcome, regardless of which
social groups they belong to. The formal metric be-
low represents such a requirement as it applies to the
good outcome of an accurate decision from a machine
learning model. Accuracy disparity is a generaliza-
tion of the equality of opportunity metric, and is jus-
tified on the same grounds.
Equal Opportunity. Equal opportunity measures
the difference of true positive rate (TPR) across all
sensitive attribute groups. TPR for each sensitive at-
tribute group is defined as:
$$P(\hat{Y} = 1 \mid S = s, Y = 1), \quad \forall s \in \{\text{sensitive attribute groups}\}$$
For a classifier that perfectly satisfies equal oppor-
tunity, the TPRs are the same across all sensitive at-
tribute groups.
Accuracy Disparity. Accuracy disparity is similar
to equal opportunity. It measures the difference of ac-
curacy across all sensitive attribute groups. Accuracy
is defined as:
$$\frac{P(\hat{Y} = 1, Y = 1, S = s) + P(\hat{Y} = 0, Y = 0, S = s)}{P(S = s)}, \quad \forall s \in \{\text{sensitive attribute groups}\}$$
For a classifier that perfectly satisfies accuracy
disparity, the accuracy scores should be the same
across all groups.
Computing Fairness Measures. In this work, we
focus on multivariate sensitive attributes and we use
the deviation from mean to compute equal opportu-
nity and accuracy disparity. It is defined as:
$$\frac{1}{n}\sum_{i=1}^{n} \left| f_i - \bar{F} \right|$$
where $n$ is the number of sensitive attribute groups, $f_i$ is the fairness score (TPR or accuracy) for the $i$th sensitive attribute group, and $\bar{F}$ is the mean value of the fairness score (TPR or accuracy) across all sensitive attribute groups. For both fairness metrics, we want the deviation to be as small as possible, indicating that the machine learning model performs equally well on each group.
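Both deviation-from-mean metrics can be computed directly from the definitions above; the following is a minimal sketch assuming numpy arrays y_true, y_pred, and groups.

```python
import numpy as np

def deviation_from_mean(scores):
    """Mean absolute deviation of per-group fairness scores (TPR or accuracy)."""
    scores = np.asarray(list(scores))
    return np.mean(np.abs(scores - scores.mean()))

def group_scores(y_true, y_pred, groups):
    # per-group true positive rate and accuracy
    tpr, acc = {}, {}
    for g in np.unique(groups):
        m = groups == g
        pos = m & (y_true == 1)
        tpr[g] = (y_pred[pos] == 1).mean() if pos.any() else np.nan
        acc[g] = (y_pred[m] == y_true[m]).mean()
    return tpr, acc

tpr, acc = group_scores(y_true, y_pred, groups)
equal_opportunity = deviation_from_mean(tpr.values())
accuracy_disparity = deviation_from_mean(acc.values())
```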
5.2.2 Model Performance and Fairness
Measurement
For both data sets, we use the train/test split approach
to measure the model performance and fairness. We
randomly select 20% of the raw data as the hold out
set for testing and the remaining 80% as the training
data. In the fine tuning step, we further conduct strat-
ified sampling on the training data as the hold out set
for fine tuning. In particular, we randomly select a
fixed number of individuals from each subgroup to
fine tune the pre-trained model.
To evaluate the model performance and fairness,
we compute the metrics on the hold out set. For model
performance, we use accuracy and true positive rate
(TPR) for each subgroup. We use true positive rate
in addition to accuracy because in disease prediction,
getting a wrong predictive outcome may lead to a mis-
diagnosis (overdiagnosis or underdiagnosis). In heart
disease and ADRD prediction, underdiagnosis could
be more harmful since it can lead to a delayed diagno-
sis and lost opportunity to have an early intervention
(Ginsberg et al., 2014; Wenger, 2012). Compared to
accuracy, TPR can better measure underdiagnosis of
disease. For fairness, we use equal opportunity and
accuracy disparity.
5.2.3 Machine Learning Models
Because our focus is on fairness, we present dif-
ferent fairness measures on an artificial neural net-
work model. We tested some classic models (decision
tree, random forest, logistic regression and SVM), but
none of them performed as well. Due to space constraints, we focus our analysis on the artificial neural
network models.
5.2.4 Bias Mitigation Algorithms
Table 2: Subgroup distribution in MS-CCP data.
Proportion
White 0.8
Black 0.1
Hispanic < 0.05
Asian < 0.02
Others/unknown < 0.01
Native American < 0.01
To evaluate the effectiveness of the proposed fine tuning approach, we compare it against
two existing bias mitigation algorithms: resampling
and training a separate model for each group.
Resampling. Wang and Singh propose resampling
to improve fairness with binary sensitive attribute
Table 3: Model performance on UCI heart disease data with binary sex sensitive attribute.
Without fixing Fine tuning Resampling Training by group
Accuracy TPR Accuracy TPR Accuracy TPR Accuracy TPR
Male 0.78 0.673 0.804 0.786 0.829 0.814 0.805 0.728
Female 0.95 0.857 0.839 0.767 0.858 0.776 0.75 0.767
Overall 0.836 0.745 0.822 0.786 0.829 0.796 0.786 0.741
(Wang and Singh, 2021). We extend the resampling method from Wang and Singh described in Section 2.2 to the multivariate setting. The goal of resampling is to remove selection bias, which occurs when some groups in the sample are oversampled and others are undersampled. If the data set is unbiased, we should observe statistical independence between the sensitive attribute $S$ and the outcome $Y$:
$$P_{\text{exp}}(S = s, Y = y) = P(S = s) \times P(Y = y), \quad \forall s \in \{\text{sensitive attribute groups}\}, \; y \in \{0, 1\}$$
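The following is a minimal sketch of resampling toward this independence condition, assuming a pandas DataFrame with a sensitive attribute column and an outcome column; it illustrates the idea rather than reproducing Wang and Singh's exact procedure.

```python
import pandas as pd

def resample_to_independence(df, s_col, y_col, random_state=0):
    """Resample each (subgroup, label) cell so that its share matches the
    product of the marginals P(S=s) * P(Y=y), removing selection bias."""
    n = len(df)
    p_s = df[s_col].value_counts(normalize=True)
    p_y = df[y_col].value_counts(normalize=True)
    pieces = []
    for (s, y), cell in df.groupby([s_col, y_col]):
        target = int(round(n * p_s[s] * p_y[y]))   # expected cell size under independence
        pieces.append(cell.sample(n=target, replace=target > len(cell),
                                  random_state=random_state))
    return pd.concat(pieces).reset_index(drop=True)
```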
Separate Model for Each Subgroup. Another
strategy is to individually train a separate model for
each subgroup instead of building a single model
containing training examples from all the subgroups.
This second baseline uses the available training data for each subgroup to train a separate model for that subgroup.
5.2.5 MS-CCP Synthetic Data Generation
One advantage of the synthetic data generator is the abil-
ity to manipulate the sample distribution. To test the
effectiveness of different bias mitigation algorithms,
including the proposed fine tuning approach, we cre-
ate three synthetic data sets with different sample dis-
tributions: the original, balanced by outcome label
only (Y ), and balanced by subgroup only (S).
The original data set has the same distribution
as the 5% random sample of the MS-CCP merged
data. The overall ratio of non-ADRD to ADRD is
approximately 8 to 1. Table 2 shows the racial sub-
group distribution in the MS-CCP data.
3
We see that
the sample is very imbalanced with White individ-
uals having the vast majority of samples and some
subgroups (such as Asian, Native American and oth-
ers/unknown) each having less than 2% representa-
tion in the sample. We generate synthetic data sets
using 14 features having a multivariate log normal
distribution with a linear trend on 12 temporal win-
dows.
The balanced by label data set has a balanced la-
bel distribution, making the number of individuals di-
agnosed with ADRD the same as the number of in-
dividuals not diagnosed with ADRD. The subgroup
distribution is not altered. The balanced by subgroup
data set, balances the data across racial groups. In
other words, the number of individuals in each racial
group is the same and the outcome label distribution
is the same as the original sample. The two addi-
tional balanced data sets are less biased than the orig-
inal data set. While these are less realistic, we in-
clude them here to evaluate the effectiveness of the
proposed fine tuning approach on input data with dif-
ferent levels of bias.
6 EXPERIMENTS
Table 4: Model fairness on UCI heart disease data set with
binary sex sensitive attribute.
Accuracy disparity    Equal opportunity
Without fixing 0.09 0.092
Fine tuning 0.013 0.01
Resampling 0.014 0.015
Training by group 0.02 0.02
6.1 Heart Disease Prediction
Table 3 shows the neural network model performance
on the UCI heart disease data. Recall, that the sensi-
tive attribute for this data set is a binary sex variable.
Each row shows the results for a specific subgroup.
Comparing the overall results across fixing methods,
we see that the model without fixing is 1% to 5% more accurate than after applying a fixing method,
while the TPR is higher for the resampling and fine
tuning fixing methods. In general, the training by
group performs worse than the other two fixing meth-
ods, particularly on the female subgroup because of
the imbalanced training data. Table 4 shows the fair-
ness scores on the UCI heart disease data with the sex
sensitive attribute. Each row shows the fixing method
and each column shows a fairness metric: accuracy
disparity and equal opportunity. We see that all three
methods have similar fairness scores that are large im-
provements (7% improvement in accuracy disparity
and 7-8% improvement in equal opportunity) over the
fairness scores when fixing is not applied. The im-
Table 5: Model performance on UCI heart disease data with multivariate age sensitive attribute.
Without fixing Fine tuning Resampling Training by group
Accuracy TPR Accuracy TPR Accuracy TPR Accuracy TPR
Age bin 1 0.84 0.4 0.833 0.75 0.817 0.692 0.791 0.625
Age bin 2 0.853 0.789 0.842 0.784 0.819 0.741 0.675 0.688
Age bin 3 0.696 0.696 0.796 0.783 0.773 0.744 0.671 0.727
Overall 0.836 0.715 0.828 0.775 0.813 0.73 0.687 0.672
Table 6: Model fairness on UCI heart disease data with mul-
tivariate age sensitive attribute.
Accuracy disparity    Equal opportunity
Without fixing 0.067 0.152
Fine tuning 0.009 0.015
Resampling 0.017 0.022
Training by group 0.052 0.037
provements are statistically significant.
Table 5 shows the neural network model perfor-
mance on UCI heart disease data set when the sensi-
tive attribute is multivariate age. Similar to binary sex,
fixing using fine tuning or resampling yields accuracy similar to the baseline model, whereas fixing using the training by subgroup approach has the worst model performance, with an accuracy that is 15% lower and a TPR that is 5% lower. Again, this results from the imbalanced
data, with some subgroups having an even smaller
sample size than the sex sensitive attribute. Table 6
shows the fairness scores when applying different fix-
ing methods for the age sensitive attribute. Among
the three bias mitigation approaches, fine tuning has the best fairness score with a 6% improvement in
accuracy disparity and a 13% improvement in equal
opportunity. Resampling has similar results as fine
tuning but the improvement is smaller. Training by
group has little improvement (1.5%) in accuracy dis-
parity, but the improvement in equal opportunity is
11%. This improvement is statistically significant
over without fixing. When comparing all three bias
mitigation algorithms, fine tuning has the best perfor-
mance in terms of fairness improvement and overall
model performance.
6.2 ADRD Prediction
We now present results on ADRD prediction using
three synthetic data sets with different distributions
based on a random 5% sample that uses approxi-
mately 2% of the feature set.
6.2.1 Original Distribution
Table 7 shows the model performance on synthetic
data with the same subgroup and label distribution as
the MS-CCP data (Table 2). All three bias mitigation algorithms have similar overall accu-
racy (within 1% difference) and a relatively close TPR
(within 4%). However, if we look at the subgroups,
the resampling and the training by group methods
have much worse performance on small subgroups
(Asian, Native American and others/unknown pop-
ulation), whereas the proposed fine tuning approach
has more consistent accuracy and TPR across all sub-
groups, small and large. In resampling, when the
subgroup is very small, the algorithm replicates the
same individuals multiple times. The model is un-
likely to learn much new information about the sub-
group from repeated entries and may overfit the train-
ing data. When using the training by group approach,
for the smaller groups, the sample size is too small
to train a good model, leading to model underfitting.
For example, fewer than 1% of the sample is Native American, decreasing the likelihood
of building an effective neural network model. On
the other hand, the proposed fine tuning approach is
able to first build a good pre-trained model contain-
ing knowledge about predicting ADRD using a large
random sample of all the observations. Then this pre-
trained model is fine tuned using samples from each subgroup with less representation. This improves
each model’s understanding of the specific subgroup
of interest.
Table 8 shows the fairness scores on the synthetic
data mapping to the original distribution. Fine tuning
has the best fairness score with an improvement in
accuracy disparity of over 3% and an improvement in
equal opportunity of 9%. Fixing using resampling has
a less than 1% improvement in accuracy disparity and
a 4% improvement in equal opportunity. Fixing using
the training by group method has the worst accuracy
disparity (about 2% worse) and no change in equal
opportunity. The improvement on fairness in the fine
tuning approach is statistically significant over the
fairness without fixing and the other two baseline bias
mitigation algorithms.
6.2.2 Balanced by Label Sample
In this sample, the number of individuals with and
without the disease is the same. Tables 9 and 10 show
the model performance and fairness for this experi-
Table 7: Model performance on synthetic data with original distribution.
Without fixing Fine tuning Resampling Training by group
Accuracy TPR Accuracy TPR Accuracy TPR Accuracy TPR
White 0.799 0.482 0.787 0.529 0.794 0.502 0.796 0.519
Black 0.822 0.415 0.816 0.555 0.811 0.447 0.808 0.529
Asian 0.702 0.085 0.812 0.522 0.692 0.299 0.626 0.302
Hispanic 0.698 0.263 0.792 0.542 0.741 0.392 0.699 0.497
Native American 0.702 0.313 0.798 0.513 0.712 0.348 0.602 0.311
Others/unknown 0.731 0.317 0.826 0.538 0.791 0.376 0.645 0.339
Overall 0.804 0.481 0.795 0.526 0.794 0.506 0.796 0.492
Table 8: Model fairness on synthetic data with original dis-
tribution.
Accuracy disparity    Equal opportunity
Without fixing 0.049 0.102
Fine tuning 0.012 0.011
Resampling 0.041 0.062
Training by group 0.067 0.099
ment, respectively. The results are very similar to the
previous experiment in terms of model performance.
All the fixing methods have a similar accuracy and
TPR. But the subgroup performance has more vari-
ation since the sample is balanced by outcome label
only and the subgroup distribution is still very im-
balanced. In terms of fairness, similar to the orig-
inal sample, the fine tuning approach performs the
best, but the improvement is not as significant (2%
improvement) for accuracy disparity and equal oppor-
tunity. Training by group performs the worst in terms
of accuracy disparity and about the same for equal op-
portunity.
6.2.3 Balanced by Subgroup Sample
The number of individuals in each racial subgroup is
the same and the label distribution remains the same
as the MS-CCP data in this sample. Tables 11 and 12
show the model performance and fairness. In terms
of model performance, the baseline, fixing using fine
tuning and resampling have similar accuracies (within
1%) and TPR (within 4%). Fixing using training by
group has a much lower accuracy (4-5% lower) be-
cause each subgroup has the same number of obser-
vations, leading to insufficient data for all the groups.
In terms of fairness scores, all the methods have ap-
proximately the same fairness score (less than 1% dif-
ference).
6.3 Discussion
Our results for the heart disease and ADRD prediction
tasks show that the proposed fine tuning approach ef-
fectively increases the fairness of the machine learn-
ing models. It performs particularly well when the
sensitive attribute is multivariate and the subgroup
distribution is very imbalanced. The design of the fine
tuning approach takes advantage of the large amount
of training data from the majority subgroups to build a
pre-trained model that has reasonable overall knowl-
edge about the prediction task, and the fine tuning step
allows the pre-trained model to learn more specific in-
formation about each subgroup. Other advantages of
the fine tuning approach include its flexibility to work
with any number of subgroups, any type of loss func-
tion, and any type of deep learning model.
This work focuses on machine learning model
fairness for disease prediction with a multivariate sensitive attribute. We acknowledge that the fine tuning approach with deep learning is not clinically explainable or interpretable and may not be appropriate to ap-
ply in real applications. We consider a clinical analy-
sis that considers information about treatments, costs,
model explainability, etc., an important next step.
As we mentioned in Section 5.2.1, we only con-
sider two fairness metrics, accuracy disparity and
equal opportunity. While an important first step, fu-
ture work should evaluate the performance of the proposed fine tuning approach on other fairness metrics.
Also, in our analysis, we used the original data set
sizes to determine the number of observations used
for fine tuning. Future work can consider analyzing
the minimum amount of data required for fine tuning
each underrepresented subgroup. Another direction
would combine information from multiple sensitive
attributes to ensure fairness across all of them. Within
that setting, there are many more subgroups and the
minority groups will have even smaller amounts of
available training data. Therefore, this fine tuning
approach may not be as effective without combining
subgroups that are similar.
Lastly, there are two types of bias to consider
when working on disease prediction: the algorithmic
bias we study in this paper and bias associated with
the probability of diagnosis. Demographic disparities
in the probability of diagnosis are common when di-
agnosing different diseases and would be an impor-
Table 9: Model performance on synthetic data with balanced by label sample.
Without fixing Fine tuning Resampling Training by group
Accuracy TPR Accuracy TPR Accuracy TPR Accuracy TPR
White 0.854 0.794 0.858 0.802 0.851 0.778 0.852 0.791
Black 0.817 0.822 0.859 0.812 0.809 0.802 0.813 0.812
Asian 0.925 0.771 0.917 0.808 0.891 0.773 0.718 0.706
Hispanic 0.856 0.761 0.869 0.784 0.846 0.758 0.789 0.752
Native American 0.934 0.859 0.901 0.863 0.899 0.856 0.662 0.754
Others/unknown 0.919 0.724 0.909 0.822 0.904 0.719 0.735 0.741
Overall 0.857 0.799 0.859 0.823 0.852 0.782 0.836 0.808
Table 10: Model fairness on synthetic data with balanced
by label sample.
Accuracy disparity    Equal opportunity
Without fixing 0.042 0.037
Fine tuning 0.024 0.018
Resampling 0.032 0.024
Training by group 0.057 0.028
tant extension to our work.
7 ETHICAL CONSIDERATIONS
Given the fundamental importance of health to peo-
ple’s well-being, disease prediction is one of the
most consequential areas for algorithmic decision-
making. An inaccurate prediction concerning a med-
ical diagnosis can have dire consequences. A per-
son might miss out on treatment they need, or be
given unnecessary treatment that carries damaging
side effects. Moreover, many societies, including the
United States, have a long history of inequality and
injustice in the distribution of medical care and re-
sources (Chen et al., 2023). Historically, people from
marginalized groups have received fewer resources and sub-
par care. Models that inherit bias from the historical
data they are trained on—or from other sources in the
machine learning development pipeline—can com-
pound or exacerbate existing injustices in the distribu-
tion of medical care. When there are limitations in the
available data for marginalized or minority groups,
this too can result in subpar performance and thus
have similarly unjust results.
This paper offers a novel technical method for
building fairer machine learning models. The goal is
to lessen the degree to which algorithmic decision-
making contributes to unjust disparities in healthcare.
The fine tuning method presented in this paper pro-
motes a more equal distribution of accuracy and error
faced by decision subjects receiving disease predic-
tions from a machine learning model.
However, the use of technical solutions—
including the appeal to formal fairness metrics—
raises concerns of “techno-solutionism” and “ethics-
washing”. Specifically, there is a concern that tech-
nical solutions will be used exclusively, and perhaps
used as an excuse to avoid other efforts to combat
injustice (Grote and Keeling, 2022; Fazelpour and
Danks, 2021). However, while technical fairness
methods are insufficient for ensuring justice in dis-
ease prediction, they are still one valuable tool among
many. Achieving a just distribution of healthcare ac-
cess and medical resources is a complex and difficult
problem. Any full solution to that problem will re-
quire a variety of efforts, potentially including signif-
icant changes to existing social institutions and power
structures. The fine tuning methods discussed here
promote a more equal distribution of the risk of error
across important subpopulations. They are only one piece
of a complex solution to equitably improve health out-
comes for all subgroups.
8 FINAL THOUGHTS
This work studies machine learning fairness in dis-
ease prediction when the sensitive attribute is mul-
tivariate and the training data for different sensitive
attribute subgroups is imbalanced. Our method im-
proves fairness by fine tuning a pretrained model
using examples from subgroups that are underrep-
resented in the base model. We demonstrate the
effectiveness of our approach on heart disease and
Alzheimer’s Disease and Related Dementias (ADRD)
prediction using real and synthetic data. We also in-
troduce a synthetic data generator that uses basic ag-
gregated statistics such as mean, median, and standard
deviation to generate temporal synthetic data with
varying levels of sparsity. This is particularly impor-
tant in the health domain where patient data needs to
remain private.
On four data sets, the UCI heart disease data and
three synthetic data sets with different distributions,
we find that the fine tuning approach can effectively
improve machine learning model fairness, especially
when the subgroup distribution is very imbalanced.
Table 11: Model performance on synthetic data with balanced by subgroup sample.
Without fixing Fine tuning Resampling Training by group
Accuracy TPR Accuracy TPR Accuracy TPR Accuracy TPR
White 0.782 0.398 0.789 0.452 0.769 0.439 0.742 0.403
Black 0.789 0.374 0.783 0.439 0.781 0.398 0.751 0.386
Asian 0.813 0.358 0.825 0.399 0.799 0.387 0.749 0.369
Hispanic 0.782 0.377 0.788 0.402 0.764 0.4 0.736 0.385
Native American 0.858 0.419 0.848 0.431 0.851 0.433 0.78 0.422
Others/unknown 0.849 0.356 0.797 0.389 0.839 0.398 0.801 0.378
Overall 0.812 0.385 0.805 0.419 0.801 0.41 0.76 0.391
Table 12: Model fairness on synthetic data with balanced
by subgroup sample.
Accuracy disparity    Equal opportunity
Without fixing 0.024 0.016
Fine tuning 0.018 0.019
Resampling 0.02 0.015
Training by group 0.025 0.017
Due to privacy considerations of financial and health
data, we are not releasing the parameters of the real
data or the generated synthetic data. However, we re-
lease the code for our synthetic data generator to help
researchers working with private data create data sets
that can be used to improve models for disease pre-
diction.
ACKNOWLEDGMENTS
This research was funded by the National Institute
on Aging of the National Institutes of Health under
award #R01AG080623, and the Massive Data Insti-
tute (MDI) at Georgetown University. We thank our
funders for supporting this work.
The content of and views expressed in this paper
are ours alone and do not necessarily represent the
official views of the National Institutes of Health or
the Federal Reserve Bank of New York or the Federal
Reserve System.
REFERENCES
Agarwal, S. and Muckley, C. B. (2024). Money manage-
ment difficulties and older people: Detection of early-
stage dementia in financial data. Michael J. Brennan
Irish Finance Working Paper Series.
Allen, A., Mataraso, S., Siefkas, A., Burdick, H., Braden,
G., Dellinger, R. P., McCoy, A., Pellegrini, E., Hoff-
man, J., Green-Saxena, A., et al. (2020). A racially
unbiased, machine learning approach to prediction of
mortality: algorithm development study. JMIR public
health and surveillance, 6(4):e22400.
Arumugam, K., Naved, M., Shinde, P. P., Leiva-Chauca, O.,
Huaman-Osorio, A., and Gonzales-Yanac, T. (2023).
Multiple disease prediction using machine learning al-
gorithms. Materials Today: Proceedings, 80:3682–
3685.
Barocas, S., Hardt, M., and Narayanan, A. (2023). Fairness
and Machine Learning: Limitations and Opportuni-
ties. MIT Press.
Barthold, D., Joyce, G., Ferido, P., Drabo, E. F., Marcum,
Z. A., Gray, S. L., and Zissimopoulos, J. (2020). Phar-
maceutical treatment for alzheimer’s disease and re-
lated dementias: utilization and disparities. Journal of
Alzheimer’s Disease, 76(2):579–589.
Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A.
(2021). Fairness in criminal justice risk assessments:
The state of the art. Sociological Methods & Research,
50(1):3–44.
Binns, R. (2018). Fairness in machine learning: Lessons
from political philosophy. In Conference on Fair-
ness, Accountability and Transparency, pages 149–
159. PMLR.
Center for Democracy & Technology (2024).
AI & Machine Learning.
Chae, S., Davoudi, A., Song, J., Evans, L., Hobensack, M.,
Bowles, K. H., McDonald, M. V., Barrón, Y., Rossetti,
S. C., Cato, K., et al. (2023). Predicting emergency de-
partment visits and hospitalizations for patients with
heart failure in home healthcare using a time series
risk model. Journal of American Medical Informatics
Association, 30(10):1622–1633.
Chakraborty, J., Majumder, S., and Menzies, T. (2021). Bias
in machine learning software: Why? how? what to
do? In Conference and Symposium on the Founda-
tions of Software Engineering, pages 429–440.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2002). Smote: synthetic minority over-
sampling technique. Journal of Artificial Intelligence
Research, 16:321–357.
Chen, R. J., Wang, J. J., Williamson, D. F., Chen, T. Y.,
Lipkova, J., Lu, M. Y., Sahai, S., and Mahmood, F.
(2023). Algorithmic fairness in artificial intelligence
for medicine and healthcare. Nature Biomedical En-
gineering, 7(6):719–742.
Chen, Z., Zhang, J. M., Sarro, F., and Harman, M. (2022).
Maat: a novel ensemble approach to addressing fair-
ness and performance bugs for machine learning soft-
ware. In Conference and Symposium on the Founda-
tions of Software Engineering, pages 1122–1134.
Chen, Z., Zhang, J. M., Sarro, F., and Harman, M.
(2024). Fairness improvement with multiple protected
attributes: How far are we? In Conference on Soft-
ware Engineering.
Davoudi, A., Chae, S., Evans, L., Sridharan, S., Song,
J., Bowles, K. H., McDonald, M. V., and Topaz, M.
(2024). Fairness gaps in machine learning models for
hospitalization and emergency department visit risk
prediction in home healthcare patients with heart fail-
ure. Journal of Medical Informatics, page 105534.
Doecke, J. D., Laws, S. M., Faux, N. G., Wilson, W., Burn-
ham, S. C., Lam, C.-P., Mondal, A., Bedo, J., Bush,
A. I., Brown, B., et al. (2012). Blood-based pro-
tein biomarkers for diagnosis of alzheimer disease.
Archives of neurology, 69(10):1318–1325.
Fatima, M. and Pasha, M. (2017). Survey of machine learn-
ing algorithms for disease diagnostic. Journal of In-
telligent Learning Systems and Applications, 9(01):1–
16.
Fazelpour, S. and Danks, D. (2021). Algorithmic bias:
Senses, sources, solutions. Philosophy Compass,
16(8):e12760.
FDA (2022). Clinical decision support software: guidance
for industry and food and drug administration staff.
FDA Digirepo. NLM. NIH.
Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C.,
and Venkatasubramanian, S. (2015). Certifying and
removing disparate impact. In Conference on Knowl-
edge Discovery and Data Mining, pages 259–268.
Feng, Q., Du, M., Zou, N., and Hu, X. (2024). Fair machine
learning in healthcare: A survey. IEEE Transactions
on Artificial Intelligence.
Frisoni, G. (2001). Structural imaging in the clinical diag-
nosis of alzheimer’s disease: problems and tools.
Ginsberg, Y., Quintero, J., Anand, E., Casillas, M.,
and Upadhyaya, H. P. (2014). Underdiagnosis of
attention-deficit/hyperactivity disorder in adult pa-
tients: a review of the literature. The Primary Care
Companion for CNS Disorders, 16(3):23591.
Grampurohit, S. and Sagarnal, C. (2020). Disease predic-
tion using machine learning algorithms. In Confer-
ence for Emerging Technology, pages 1–7. IEEE.
Gresenz, C. R., Mitchell, J. M., Marrone, J., and Federoff,
H. J. (2020). Effect of early-stage alzheimer’s disease
on household financial outcomes. Health Economics,
29(1):18–29.
Gresenz, C. R., Mitchell, J. M., Rodriguez, B., Turner, R. S.,
and Van der Klaauw, W. (2024). The financial conse-
quences of undiagnosed memory disorders. The Fed-
eral Reserve Bank of New York.
Grote, T. and Keeling, G. (2022). Enabling fairness in
healthcare through machine learning. Ethics and In-
formation Technology, 24(3):39.
Hardt, M., Price, E., and Srebro, N. (2016). Equality of op-
portunity in supervised learning. Advances in Neural
Information Processing Systems, 29.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B.,
De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and
Gelly, S. (2019). Parameter-efficient transfer learning
for nlp. In Conference on Machine Learning, pages
2790–2799. PMLR.
Howard, J. and Ruder, S. (2018). Universal language model
fine-tuning for text classification. arXiv preprint
arXiv:1801.06146.
Janosi, A., Steinbrunn, W., Pfisterer, M., and Detrano, R.
(1988). Heart Disease. UCI Machine Learning Repos-
itory. DOI: https://doi.org/10.24432/C52P4X.
Jiang, F., Jiang, Y., Zhi, H., Dong, Y., Li, H., Ma, S., Wang,
Y., Dong, Q., Shen, H., and Wang, Y. (2017). Arti-
ficial intelligence in healthcare: past, present and fu-
ture. Stroke and Vascular Neurology, 2(4).
Kamiran, F. and Calders, T. (2012). Data preprocessing
techniques for classification without discrimination.
Knowledge and Information Systems, 33(1):1–33.
Kang, J., Xie, T., Wu, X., Maciejewski, R., and Tong, H.
(2022). Infofair: Information-theoretic intersectional
fairness. In Conference on Big Data, pages 1455–
1464. IEEE.
Li, F., Wu, P., Ong, H. H., Peterson, J. F., Wei, W.-Q.,
and Zhao, J. (2023). Evaluating and mitigating bias
in machine learning models for cardiovascular dis-
ease prediction. Journal of Biomedical Informatics,
138:104294.
Liu, X., He, P., Chen, W., and Gao, J. (2019). Multi-task
deep neural networks for natural language understand-
ing. arXiv preprint arXiv:1901.11504.
Loi, M., Herlitz, A., and Heidari, H. (2021). Fair equality
of chances for prediction-based decisions. Economics
& Philosophy, pages 1–24.
Ma, J., Deng, J., and Mei, Q. (2021). Subgroup generaliza-
tion and fairness of graph neural networks. Advances
in Neural Information Processing Systems, 34:1048–
1061.
Massey Jr, F. J. (1951). The kolmogorov-smirnov test for
goodness of fit. Journal of the American statistical
Association, 46(253):68–78.
Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen,
T. H., Sainz, O., Agirre, E., Heintz, I., and Roth, D.
(2023). Recent advances in natural language process-
ing via large pre-trained language models: A survey.
ACM Computing Surveys, 56(2):1–40.
Paraskevaidi, M., Morais, C. L., Halliwell, D. E., Mann,
D. M., Allsop, D., Martin-Hirsch, P. L., and Mar-
tin, F. L. (2018). Raman spectroscopy to diagnose
alzheimer’s disease and dementia with lewy bodies
in blood. ACS Chemical Neuroscience, 9(11):2786–
2794.
Peng, K., Chakraborty, J., and Menzies, T. (2022). Fair-
mask: Better fairness via model-based rebalancing of
protected attributes. Transactions on Software Engi-
neering, 49(4):2426–2439.
Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Wein-
berger, K. Q. (2017). On fairness and calibration. Ad-
vances in Neural Information Processing Systems, 30.
Rathore, S., Habes, M., Iftikhar, M. A., Shacklett, A., and
Davatzikos, C. (2017). A review on neuroimaging-
based classification studies and associated feature ex-
traction methods for alzheimer’s disease and its pro-
dromal stages. NeuroImage, 155:530–548.
Raza, S., Pour, P. O., and Bashir, S. R. (2023). Fairness in
machine learning meets with equity in healthcare. In
AAAI Symposium Series, volume 1, pages 149–153.
Secinaro, S., Calandra, D., Secinaro, A., Muthurangu, V.,
and Biancone, P. (2021). The role of artificial in-
telligence in healthcare: a structured literature re-
view. BMC Medical Informatics and Decision Mak-
ing, 21:1–23.
Shah, D., Patel, S., and Bharti, S. K. (2020). Heart dis-
ease prediction using machine learning techniques. SN
Computer Science, 1(6):345.
Shui, C., Xu, G., Chen, Q., Li, J., Ling, C. X., Arbel, T.,
Wang, B., and Gagné, C. (2022). On learning fair-
ness and accuracy on multiple subgroups. Advances
in Neural Information Processing Systems, 35:34121–
34135.
Singh, A. and Kumar, R. (2020). Heart disease prediction
using machine learning algorithms. In Conference on
Electrical and Electronics Engineering, pages 452–
457. IEEE.
Straw, I., Rees, G., and Nachev, P. (2024). Sex-based perfor-
mance disparities in machine learning algorithms for
cardiac disease prediction: Exploratory study. Journal
of Medical Internet Research, 26:e46936.
Sun, C., Qiu, X., Xu, Y., and Huang, X. (2019). How to
fine-tune bert for text classification? In Chinese com-
putational linguistics, pages 194–206. Springer.
Tarzanagh, D. A., Hou, B., Tong, B., Long, Q., and Shen,
L. (2023). Fairness-aware class imbalanced learning
on multiple subgroups. In Uncertainty in Artificial In-
telligence, pages 2123–2133. PMLR.
Wang, Y. and Singh, L. (2021). Analyzing the impact of
missing values and selection bias on fairness. In-
ternational Journal of Data Science and Analytics,
12(2):101–119.
Wenger, N. K. (2012). Women and coronary heart disease:
a century after herrick: understudied, underdiagnosed,
and undertreated. Circulation, 126(5):604–611.
Xie, S., Yu, Z., and Lv, Z. (2021). Multi-disease prediction
based on deep learning: a survey. Computer Modeling
in Engineering & Sciences, 128(2):489–522.
Yu, K.-H., Beam, A. L., and Kohane, I. S. (2018). Artificial
intelligence in healthcare. Nature Biomedical Engi-
neering, 2(10):719–731.
Yuan, C., Linn, K. A., and Hubbard, R. A. (2023). Al-
gorithmic fairness of machine learning models for
alzheimer disease progression. JAMA Network Open,
6(11):e2342203–e2342203.
Zafar, M. B., Valera, I., Rodriguez, M., Gummadi, K., and
Weller, A. (2017). From parity to preference-based
notions of fairness in classification. Advances in Neu-
ral Information Processing Systems, 30.