Towards Fairness in Machine Learning: Balancing Racially Imbalanced
Datasets Through Data Augmentation and Generative AI
Anthonie Schaap, Sofoklis Kitharidis and Niki van Stein
Leiden Institute of Advanced Computer Science, Leiden University, Einsteinweg 55, 2333 CC Leiden, Netherlands
a.l.j.schaap@issc.leidenuniv.nl, {s.kitharidis, n.van.stein}@liacs.leidenuniv.nl
ORCIDs: https://orcid.org/0009-0007-4961-749X (A. Schaap), https://orcid.org/0009-0005-8404-0724 (S. Kitharidis), https://orcid.org/0000-0002-0013-7969 (N. van Stein)
Keywords:
Fairness, Fair Machine Learning, Racial Bias, Inclusive AI, GANs, Dataset Balancing, Ethnicity
Classification.
Abstract:
Existing AI models trained on facial images are often heavily biased towards certain ethnic groups due to
training data containing unrealistic ethnicity splits. This study examines ethnic biases in facial recognition
AI models, resulting from skewed dataset representations. Various data augmentation and generative AI tech-
niques were evaluated to mitigate these biases, employing fairness metrics to measure improvements. Our
methodology included balancing training datasets with synthetic data generated through Generative Adver-
sarial Networks (GANs), targeting underrepresented ethnic groups. Experimental results indicate that these
interventions effectively reduce bias, enhancing the fairness of AI models across different ethnicities. This re-
search contributes practical approaches for adjusting dataset imbalances in AI systems, ultimately improving
the reliability and ethical deployment of facial recognition technologies.
1 INTRODUCTION
As artificial intelligence (AI) continues to evolve and
integrate into daily life, the need for transparent and
fair AI systems becomes increasingly critical. Cur-
rent AI systems can exhibit unwanted biases, often
due to biases present in the datasets they are trained
on (Srinivasan and Chander, 2021). These biases can
significantly impact real-world applications; human-like biases often arise from
imbalances in the ethnicity, age, or gender distribution within the training data,
as discussed in (Wang et al., 2019). It is essential to tackle the
problem of model bias, which can have real-life im-
plications, such as in AI-driven hiring processes (Bo-
gen and Rieke, 2018) or risk assessments predicting
the likelihood of a defendant reoffending (Mehrabi
et al., 2022). In this paper, ethnicity bias in datasets
and AI models is investigated. To detect ethnicity
bias in datasets, an AI model is first trained to clas-
sify different ethnicities. This classifier itself must be trained with as little
ethnicity bias as possible, so that bias does not distort its classification accuracy.
Datasets for ethnicity classification can contain many unwanted biases and
are often unbalanced, as shown in Section 3.1. This re-
search aims to explore the presence of ethnicity bias
in datasets not specifically used for ethnicity classifi-
cation but for other purposes, such as age classifica-
tion and person re-identification. When heavy ethnic-
ity biases are present in a dataset but remain undetected, researchers may
unknowingly develop racially biased AI models that classify the better-represented
ethnicities more accurately than others. Historically, such failures, for example
the Google Photos mislabeling incident, have disproportionately affected people of
color, mislabeling and dehumanizing them. This research addresses the follow-
ing questions:
1. To what degree do existing face datasets and AI
systems show biases related to ethnicity?
2. How can fairness within facial datasets be im-
proved through diverse modifications to the train-
ing data?
3. Which method of adding facial images to the
training data best mitigates unwanted bias and im-
proves fairness?
Following this introduction of the key concepts, related
areas of research are examined in Section 2. After a review of several
state-of-the-art papers on ethnicity bias and explainability, the methods pro-
posed are outlined in Section 3, where the datasets
used in our experiments are also introduced. In Sec-
tion 4, the different experimental setups are described,
including the training of the ethnicity classification
model and the use of a generative AI model in com-
parison with several state-of-the-art approaches. The
results of these experiments, which compare our pro-
posed method to existing methods, are also presented
within this section. A discussion of the findings follows in Section 5.
Lastly, conclusions are drawn and limitations and suggestions for future work
are given in Section 6.
2 RELATED WORK
2.1 Bias in AI and Facial Recognition
Extensive research has highlighted the prevalence of
ethnic biases in AI systems. (Angwin et al., 2016)
found that algorithms used in the criminal justice
system disproportionately predicted more severe out-
comes for Black individuals, although (W Flores
et al., 2016) disputed these findings, citing statistical flaws in the analysis.
In facial recognition, biases often arise from
dataset imbalances (Islam, 2023), where racial stereo-
types skew the representation of certain groups.
Techniques such as detecting segregation patterns in
datasets have been proposed to reduce disparities
among different ethnic groups (Benthall and Haynes,
2019). Additionally, (Alvi et al., 2019) introduced a
joint learning and unlearning algorithm that improved
fairness and accuracy by addressing biases related to
gender, age, and ethnicity in neural networks.
Advanced facial recognition frameworks such as
DeepFace (Serengil and Ozpinar, 2021), VGG-Face2
(Cao et al., 2018), and FaceNet (Schroff et al., 2015)
have been used for ethnicity classification tasks. Pre-
trained models like ResNet50 have also demonstrated
robust performance in classifying ethnicity and gen-
der attributes (Acien et al., 2019).
2.2 Methods for Mitigating Bias in AI
Mitigating bias in AI requires proactive measures to
identify and reduce biases in models (Howard and
Borenstein, 2018). (Raji et al., 2020) proposed a
framework to improve AI accountability by detecting
biases, while (Danks and London, 2017) emphasized
that model architecture can also harbor biases, neces-
sitating careful design and testing. Additionally, tech-
niques like synthetic faces can reveal unnoticed biases
related to appearance (Balakrishnan et al., 2020).
Generative Adversarial Networks (GANs) are a
promising tool for mitigating bias by generating sam-
ples representing underrepresented groups (Maluleke
et al., 2022). Advances like StyleGAN have improved
the generation of photorealistic images, allowing re-
searchers to manipulate key features such as pose and
expression (Karras et al., 2019). Building on this,
(Nitzan et al., 2020) introduced a method to separate
facial identity from other attributes, which we utilize
in our research to create diverse labeled data and im-
prove fairness in model training.
3 PROPOSED METHODOLOGY
Three methods are proposed and evaluated to miti-
gate bias by altering the training data during or before
model training. This section also outlines the datasets
critical for our methods.
3.1 Datasets
Experiments are conducted using various datasets tai-
lored for ethnicity classification. The first dataset,
UTKFace (Zhang et al., 2017), comprises over 20,000
images labeled with age, gender, and ethnicity. This
dataset is primarily used for training our ethnicity de-
tection model, featuring diverse labels. Labels in-
clude age (0 to 116 years), gender (0 for male, 1
for female), and race (0 for White, 1 for Black, 2
for Asian, 3 for Indian, 4 for Others). In particular,
our focus is on addressing label imbalance, aiming for a dataset that is
evenly distributed across all racial categories.
The racial composition of several face datasets is illustrated in Figure 1.
Figure 1: Different racial compositions in face datasets, from (Kärkkäinen and Joo, 2019).
As depicted in Figure 1, the UTKFace dataset
does not have an even split of racial labels, but its split is
far better than that of other state-of-the-art face datasets,
excluding the FairFace dataset (Kärkkäinen and Joo,
2019). Bias in training data, such as group imbal-
ances, typically leads to biased outcomes in AI mod-
els. Having a perfectly unbiased dataset is nearly im-
possible, but the goal is to mitigate the bias as much
as possible.
Additionally, Figure 1 highlights that, except
for FairFace, a disproportionately high number of
‘White’ images dominate most advanced datasets.
This imbalance can cause numerous issues; for in-
stance, when estimating the ages of individuals in a
dataset, the AI model is likely to perform better at
classifying images with the ‘White’ racial label due
to the greater volume of training data available for
this group, possibly resulting in poorer performance
for other racial groups.
To address this issue, experiments are also conducted with the FairFace dataset
(Kärkkäinen and Joo, 2019), which provides an almost even split across the
racial labels White, Black, Latino, East Asian, Southeast Asian, Indian, and
Middle Eastern. The FairFace dataset contains 108,501 images, each labeled with
race, gender, and age. The racial compositions of the UTKFace and FairFace
datasets are presented in Table 1. For clarity in presentation and analysis,
the categories are grouped in our tables in the following manner: ‘White’,
‘Black’, ‘Asian’ (combining East Asian and Southeast Asian), ‘Indian’, and
‘Others’ (encompassing Latino and Middle Eastern).
Table 1: Ethnicity split in the UTKFace and FairFace datasets.

Ethnicity   UTKFace   FairFace
White        8080      16527
Black        3621      12233
Asian        2759      23082
Indian       2146      12319
Others       1358      22583
3.2 Methodology
The following methods are proposed to sample or
generate additional data for underrepresented ethnic
groups.
Data Sampling Strategies
To use the datasets and train the model, some data
engineering is needed to match the data structures of
the different image datasets. Data preparation follows
the UTKFace dataset structure outlined in Section 3.1.
First, the FairFace dataset needs to be in the same
structure as the UTKFace dataset, where the image
labels are in the name of the images. After restruc-
turing, images are sorted into ethnicity-specific fold-
ers to facilitate targeted training on each class. This
splitting process is also applied to the images newly created through data
augmentation and generative networks. For the generative images, each StyleGAN
output is saved into the folder corresponding to its ethnicity label. After the
created images have been split by ethnicity, every image is renamed to match
the UTKFace naming style: the filename receives an age value and the appropriate
ethnicity label, and timestamps are assigned with slight delays to ensure unique
filenames for each image. Each implementation can be used as a separate dataset
or for additional training on specific ethnicity classes.
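For illustration, the sketch below shows one way such a restructuring step could look in Python. It is not the authors' pipeline: the FairFace label CSV columns, the race and gender mappings, and the UTKFace-style filename pattern [age]_[gender]_[race]_[timestamp].jpg are assumptions.

```python
# Illustrative sketch (not the authors' pipeline): rename FairFace images to a
# UTKFace-style name [age]_[gender]_[race]_[timestamp].jpg and sort them into
# one folder per ethnicity code. CSV column names and label values are assumptions.
import csv
import shutil
import time
from pathlib import Path

RACE_MAP = {"White": 0, "Black": 1, "East Asian": 2, "Southeast Asian": 2,
            "Indian": 3, "Latino_Hispanic": 4, "Middle Eastern": 4}
GENDER_MAP = {"Male": 0, "Female": 1}

def restructure_fairface(label_csv: str, src_dir: str, dst_root: str) -> None:
    """Copy FairFace images into per-ethnicity folders with UTKFace-style names."""
    with open(label_csv, newline="") as f:
        for row in csv.DictReader(f):                # assumed columns: file, age, gender, race
            race = RACE_MAP[row["race"]]
            gender = GENDER_MAP[row["gender"]]
            age = row["age"].split("-")[0].rstrip("+")   # FairFace ages are ranges; keep lower bound
            folder = Path(dst_root) / str(race)
            folder.mkdir(parents=True, exist_ok=True)
            stamp = time.strftime("%Y%m%d%H%M%S") + f"{time.time_ns() % 1000:03d}"
            shutil.copy(Path(src_dir) / row["file"],
                        folder / f"{age}_{gender}_{race}_{stamp}.jpg")
            time.sleep(0.001)                        # slight delay keeps timestamps unique
```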
Data Augmentation
To increase both the size and quality of our datasets,
data augmentation is used to generate additional train-
ing data. Data Augmentation for image data is done
by adjusting the image slightly, such that the model
cannot recognize the image from its original state
and can use the image to improve its learning pro-
cess. Image adjustments include moving, rotating,
flipping, cropping, and shifting to diversify training
data. Color adjustments are also applied to prevent the
model from learning biases associated with specific
color schemes. Our aim is to keep augmented images
realistic, avoiding excessive vertical flips or rotations.
Additionally, the augmented images are saved to al-
low for manual quality control and future reuse. Table
2 displays the data augmentation methods used in this
research.
Table 2: Augmentation methods used to generate more realistic images.

Aug. Method       Value           Description
Shear Range       0.05            Shear image by 5%
Zoom Range        [1.0, 1.2]      Zoom in up to 20%
Rotation          5               Rotate max. 5 deg.
Horizontal Flip   [True, False]   50% flip chance
The augmentation values are chosen at random within the ranges
shown in Table 2. The images are not zoomed out, because that would create
borders around the image that are not part of the original image. Image
shifting is avoided for the same reason, preventing border creation and
information loss. The augmented images are labeled in the same way as the
original images, except with a new creation timestamp.
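The paper does not name the augmentation library; the following is a minimal sketch assuming Keras' ImageDataGenerator, whose parameters map directly onto the values in Table 2. The directory paths and image size are placeholders.

```python
# Minimal sketch of the Table 2 settings, assuming Keras' ImageDataGenerator
# (the paper does not name the library used); paths and sizes are placeholders.
import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    shear_range=0.05,         # shear, as in Table 2
    zoom_range=[1.0, 1.2],    # zoom range, as in Table 2
    rotation_range=5,         # rotate at most 5 degrees
    horizontal_flip=True,     # flipped with 50% probability
    fill_mode="nearest",
)

# Write augmented copies to disk so they can be inspected and reused later,
# mirroring the per-ethnicity folder layout described above.
os.makedirs("data/utkface_augmented", exist_ok=True)
flow = augmenter.flow_from_directory(
    "data/utkface_by_ethnicity", target_size=(200, 200), batch_size=32,
    save_to_dir="data/utkface_augmented", save_format="jpg")
next(flow)  # each call produces and saves one augmented batch
```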
Generating Training Data Using StyleGAN
Another method to enhance our training dataset is by
generating new images using the attributes of exist-
ing ones. This is done using an implementation of
StyleGAN (Nitzan et al., 2020), where we use the in-
ference function to create new facial images based on
the identity and attributes of images from the training
set. For every combination of identity and attribute
image, a new image is created. For each classification
label, 2000 images are generated using 10 randomly
chosen attribute images and 200 identity images. This
expansion of the training set improves its stability and quality, helping to
prevent overfitting. When handling small datasets, using StyleGAN to expand
the training set can be particularly valuable, because the generated images
differ from those in the original dataset. Generated images retain the
ethnicity label of the original identity image, preserving label integrity.
In this research, the attribute image is also taken from the same ethnicity
label, increasing the likelihood that the generated image belongs to that
particular label class.
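The generation procedure amounts to a nested loop over identity and attribute images. The sketch below is illustrative only: generate_face is a hypothetical wrapper around the inference function of (Nitzan et al., 2020) and is assumed to return a PIL image; it is not their actual API.

```python
# Illustrative only: generate_face is a hypothetical wrapper around the inference
# function of (Nitzan et al., 2020), assumed to return a PIL image; the real API
# of their implementation is not reproduced here.
import random
from pathlib import Path

N_IDENTITIES, N_ATTRIBUTES = 200, 10        # 200 x 10 = 2000 images per label

def expand_label(label_dir: str, out_dir: str, generate_face) -> None:
    """Create 2000 synthetic faces for one ethnicity label, keeping that label."""
    images = list(Path(label_dir).glob("*.jpg"))
    identities = random.sample(images, N_IDENTITIES)
    attributes = random.sample(images, N_ATTRIBUTES)    # same label, so ethnicity stays consistent
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, identity_img in enumerate(identities):
        for j, attribute_img in enumerate(attributes):
            face = generate_face(identity_img, attribute_img)   # identity from one, attributes from the other
            face.save(out / f"gen_{i:03d}_{j:02d}.jpg")         # label inherited from the identity image
```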
Among the various generative models evaluated,
StyleGAN was selected for its robustness and ease
of use. Other GANs were considered but not used
due to practical challenges such as unavailable model
weights, expired resource links, and substantial im-
plementation errors, which impeded reproducibil-
ity. StyleGAN’s capacity to manipulate attribute and
identity vectors allows for the generation of high-
quality, diverse images, facilitating the study’s focus
on enhancing dataset fairness.
The StyleGAN implementation (Nitzan et al.,
2020) utilizes a variety of pre-trained models devel-
oped using diverse datasets, such as FFHQ at resolu-
tions of 256x256 and 1024x1024 with 70,000 images
each, VGGFace2 with 3.31 million images, Celeb-
500K with 500,000 images, and 300W with 300 im-
ages. These pre-trained models enable efficient gen-
eration of new, diverse images, examples of which are
showcased in Figure 2.
3.3 Fairness Metrics for Performance
Evaluation
Various fairness metrics can be used to evaluate the
performance and fairness of the model. Models can
be evaluated based on their prediction accuracy for
each class, allowing for the determination of how
many true positives were correctly classified for each
class. However, when classes are imbalanced, accuracy alone is not an
informative metric; the false positives, true negatives, and false negatives
in the predictions must also be considered. This section describes the
fairness metrics used in this paper to evaluate fairness.
Figure 2: Examples of images generated with the StyleGAN model using UTKFace images: (a) White, (b) Black, (c) Asian, (d) Indian.
F1 Score
The F1 score, a harmonic mean of Precision and Re-
call, assesses classification performance, providing a
balance between the precision and recall for each la-
bel (Goutte and Gaussier, 2005):
Precision = TP / (TP + FP),    Recall = TP / (TP + FN)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, with 1 indicating
perfect classification without error.
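As a concrete example, the per-class scores defined above can be obtained with scikit-learn; the toy labels below are placeholders and the authors' own evaluation code may differ.

```python
# Per-class precision, recall and F1 as defined above, computed with scikit-learn;
# the toy labels are placeholders, not results from the paper.
from sklearn.metrics import precision_recall_fscore_support

class_names = ["White", "Black", "Asian", "Indian", "Others"]
y_true = [0, 1, 2, 3, 4, 0, 2, 1]    # toy ground-truth ethnicity codes
y_pred = [0, 1, 2, 2, 4, 0, 2, 0]    # toy model predictions

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=list(range(len(class_names))), zero_division=0)
for name, p, r, f in zip(class_names, prec, rec, f1):
    print(f"{name:7s} precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```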
Equalized Odds
The Equalized Odds metric (Hardt et al., 2016) evalu-
ates model fairness by comparing True Positive Rates
(TPR) and False Positive Rates (FPR) across classes:
TPR = TP / (TP + FN),    FPR = FP / (FP + TN)
The metric focuses on minimizing the maximum dif-
ference between the TPRs or FPRs across classes,
thereby enhancing fairness. It is given by:
Eq. Odds = max(|TPR1 − TPR2|, |FPR1 − FPR2|)
This research calculates Equalized Odds for all
class combinations, aiming to reduce the discrepan-
cies in recall and FPR, which could indicate bias. The
average of these differences is taken to represent the
model’s overall fairness in handling class disparities.
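A brief sketch of this pairwise computation is given below, using one-vs-rest TPR and FPR per class derived from a multiclass confusion matrix; this is an illustrative reconstruction, not the authors' exact implementation.

```python
# Sketch of the pairwise Equalized Odds computation described above: for every pair
# of classes, take the larger of the TPR gap and the FPR gap, then average per class.
from itertools import combinations
import numpy as np
from sklearn.metrics import confusion_matrix

def one_vs_rest_rates(y_true, y_pred, n_classes):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - tp - fn - fp
    return tp / (tp + fn), fp / (fp + tn)    # one-vs-rest TPR and FPR per class

def equalized_odds_per_class(y_true, y_pred, n_classes):
    tpr, fpr = one_vs_rest_rates(y_true, y_pred, n_classes)
    eo = np.zeros(n_classes)
    counts = np.zeros(n_classes)
    for a, b in combinations(range(n_classes), 2):
        gap = max(abs(tpr[a] - tpr[b]), abs(fpr[a] - fpr[b]))
        eo[[a, b]] += gap        # attribute the pairwise gap to both classes
        counts[[a, b]] += 1
    return eo / counts           # average gap per class, as used in the experiments
```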
4 EXPERIMENTS
4.1 Ethnicity Detection Baseline Model
Experiments were conducted using a baseline model
for ethnicity detection, trained on the UTKFace
dataset. During 10 training epochs, the model
achieved a validation accuracy of about 91% and a
loss of 0.30.
The primary goal is not just high accuracy but
ensuring balanced accuracy across all ethnic groups.
This is critical for the model’s applicability across di-
verse datasets. A model may show high overall ac-
curacy but under-perform significantly on certain eth-
nicity labels, like ’Indian’, making it unfit for detect-
ing ethnicity bias. It is also essential that the model
avoids defaulting to a single ethnicity label when pre-
dictions are inaccurate.
Complete implementation details are available in
our code repository (Anonymous, 2024).
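For orientation only, a comparable Keras classifier could look like the sketch below; the architecture, input size, and optimizer are assumptions rather than the authors' exact baseline, which is available in the repository.

```python
# Orientation only: a comparable Keras classifier; architecture, input size and
# optimizer are assumptions, not the authors' exact baseline (see the repository).
from tensorflow.keras import layers, models

def build_ethnicity_classifier(n_classes: int = 5, input_shape=(200, 200, 3)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_ethnicity_classifier()
# model.fit(train_ds, validation_data=val_ds, epochs=10)   # 10 epochs, as in Section 4.1
```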
4.2 Effect of Additional Training
Methods
In this experiment, the equal FairFace dataset is used, in which each label
has 4000 images. The goal is to create a fair baseline, meaning that every
label can be classified with similar accuracy. Accuracies are considered
similar when they differ by at most 2%, such that no label performs more than
2% better than any other label. Ideally, the training set should be
ethnicity-balanced to achieve similar accuracy statis-
tics. In practice, this is often not the case, and ad-
ditional training is needed to balance the model. In
this experiment, we investigate the effect of adding
2000 images of a certain ethnicity label to the dataset,
such that the model can improve its learning process
on a particular label. Table 3 presents the performance
statistics of the baseline model on the UTKFace and
FairFace test sets.
Table 3: Baseline statistics for the Equal FairFace dataset, tested on the FairFace and UTKFace test sets.

            FairFace            UTKFace
          F1-score   Acc      F1-score   Acc
White       0.59    52.0%       0.66    51.0%
Black       0.75    74.4%       0.81    76.2%
Asian       0.83    86.2%       0.83    85.9%
Indian      0.55    66.1%       0.68    66.2%
Others      0.52    47.8%       0.25    65.9%
As illustrated in Table 3, the baseline model trained on
the FairFace dataset is evaluated using two test sets.
This approach mitigates biases inherent in using a
test set derived from the same dataset. In UTKFace, most images are labeled
”White”, and the same ethnicity split is used in its test set, which makes the
test accuracy far higher than it would be on a test set with a balanced ethnicity
split. Evaluating on a second dataset also ensures that dataset-specific patterns
learned during training do not unreasonably inflate test set accuracy. On the
FairFace test set, the F1-score is lowest for the ”Indian” label, excluding the
”Others” label. To address this, 2000 data-augmented ”Indian” images were added
to the original baseline training data and training was continued. The results
are presented in Table 4.
Table 4: Additional training on the ”Indian” label - FairFace and UTKFace test set results.

            FairFace            UTKFace
          F1-score   Acc      F1-score   Acc
White       0.62    84.2%       0.86    84.2%
Black       0.72    58.3%       0.70    54.4%
Asian       0.80    90.2%       0.78    88.4%
Indian      0.52    64.3%       0.66    88.2%
Others      0.11     6.1%       0.05     2.7%
When this data is added, it becomes evident that the model stops trying to
classify the ”Others” label and focuses on the four more recognizable labels.
The ”Others” category consists of images that could equally be placed in one
of the four other ethnicity labels, which makes it a weak category.
Consequently, we exclude the ”Others” category from subsequent experiments to
avoid its negative impact.
Equalized Odds on Real Data
A different method to evaluate the model involves us-
ing the Equalized Odds fairness metric from Section
3.3, aiming for the lowest possible Equalized Odds
value. A low Equalized Odds value signifies that True
Positive Rates (TPR) and False Positive Rates (FPR)
are similar across classes, enhancing fairness. After
calculating the Equalized Odds value between each
pair of classes, the average value is taken for each
class. Classes with the highest Equalized Odds values
receive additional training, as these indicate a need for
improved fairness. Figure 3 displays the outcomes
of the Equalized Odds experiment using real training
data.
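The feedback loop behind these retraining experiments can be summarised as follows. The sketch is an illustrative reconstruction, not the authors' code: load_extra_images is a placeholder that returns real FairFace images, augmented images, or GAN-generated images depending on the experiment, and eo_fn is a metric function such as the equalized_odds_per_class helper sketched in Section 3.3.

```python
# Illustrative reconstruction of the retraining loop: after each step, the class with
# the highest average Equalized Odds value receives extra images. load_extra_images
# and eo_fn are placeholders supplied by the caller.
import numpy as np

def fairness_driven_training(model, x_train, y_train, x_test, y_test,
                             load_extra_images, eo_fn, n_classes=4,
                             steps=5, images_per_step=2000):
    history = []
    for _ in range(steps):
        model.fit(x_train, y_train, epochs=1, verbose=0)
        y_pred = np.argmax(model.predict(x_test, verbose=0), axis=1)
        eo = eo_fn(y_test, y_pred, n_classes)            # per-class Equalized Odds values
        history.append(eo)
        worst = int(np.argmax(eo))                       # least fair class so far
        x_extra, y_extra = load_extra_images(worst, images_per_step)
        x_train = np.concatenate([x_train, x_extra])
        y_train = np.concatenate([y_train, y_extra])
    return history
```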
Equalized Odds on Augmented Data
In this experiment, we change the additional training data to the augmented
data from the UTKFace dataset. The other hyperparameters, such as the number
of training steps and the number of additional images added per step, remain
the same.
Figure 3: Retraining the model using data derived from the
FairFace dataset. The additional training data is selected
based on the class with the highest Equalized Odds value,
aiming to enhance fairness across all ethnicity classes.
Figure 4: Retraining using augmented data derived from
UTKFace based on the highest Equalized Odds value. Af-
ter the initial training step, it is observed that the Indian
class exhibits the highest value, which means that this class
receives extra images based on data augmentation for the
next training step.
Figure 4 presents the results of this experiment.
Equalized Odds on GAN Data
In this experiment, images generated with a GAN are used as new training data
for the model. This technique can be very helpful when dealing with very small
datasets, where only a limited number of images is available. The Equalized
Odds value for each ethnic-
ity class is calculated to demonstrate the effect of
adding additional training data to the baseline UTK-
Face training set. The results are displayed in Figure
5. Additional training is based on the highest Equal-
ized Odds value, aiming to balance the Equalized
Odds values over all classes. After several epochs,
Figure 5: Model retraining with GAN-generated data from
the UTKFace dataset, selected to improve fairness by tar-
geting classes with the highest Equalized Odds values.
it is observed that the Equalized Odds values of all
classes come fairly close to each other.
4.3 Balancing Fairness Using Different
Input Data
This section explores the impact of the ethnicity splits of different input
datasets on model training. The original UTKFace and FairFace models differ,
and images can be added using data augmentation, generative adversarial
networks, or simply data taken from another dataset.
Baseline UTKFace
A method to balance fairness in machine learning
models is to adjust the input data the model uses for
training. This experiment examines the effect of alter-
ing the ethnicity split in the input data on the fairness
of the model. This is calculated using the Equalized
Odds fairness metric. As shown in Figure 6, the UTKFace dataset without the
‘Others’ category performs reasonably well, except for the ‘Indian’ class,
which exhibits the highest Equalized Odds value. An exam-
ination of the ethnicity split of the UTKFace dataset
in Table 1 reveals that the ‘Indian’ category contains
only 2146 images, whereas other classes have more
images. To address this imbalance, the next exper-
iment involves equalizing the image count for each
class to balance the dataset.
Equalizing Image Count in UTKFace and
FairFace
For the next experiment, images were added to the Black, Asian, and Indian
classes to match the White class, which, with some 8063 images, has the most
of any class.
Figure 6: Equalized Odds values of UTKFace dataset with-
out the ”Others” category. It indicates that the ”Black” class
has the lowest average and max EQodds values, making it
the most ”fair” class in terms of Equalized Odds.
By using
an equalized image count for all classes, the poten-
tial for bias arising from an unbalanced ethnic split
in the input dataset is mitigated. This approach also
addresses the issue of the model potentially favoring
classes with more images. To augment the UTKFace
dataset and enable an equitable comparison with the
FairFace dataset, 4000 images of each class were se-
lected, totaling 16,000 images. This required adding
images to the Black, Asian, and Indian categories, as
detailed in Table 1, and removing excess images from
the White category by randomly selecting 4080 im-
ages for removal. After integrating randomly chosen
images from the FairFace dataset, each class was bal-
anced to exactly 4000 images. The test set, derived
from the original test set of the UTKFace dataset, was
also balanced by selecting 675 images of each class,
ensuring an equal number of test images for each pos-
sible classification. The results of this experiment are
displayed in Figure 7.
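A minimal sketch of this balancing step is shown below; the folder layout (one sub-folder per ethnicity code) follows the restructuring described in Section 3.2, and the function name is illustrative rather than the authors' code.

```python
# Minimal sketch of the balancing step: keep at most 4000 UTKFace images per class
# and top up under-represented classes with randomly chosen FairFace images.
import random
from pathlib import Path

def balanced_selection(utk_root: str, fairface_root: str,
                       per_class: int = 4000, classes=(0, 1, 2, 3)):
    selection = {}
    for c in classes:
        utk = list(Path(utk_root, str(c)).glob("*.jpg"))
        random.shuffle(utk)
        chosen = utk[:per_class]                         # trims the surplus, e.g. for 'White'
        if len(chosen) < per_class:                      # top up from FairFace
            extra = list(Path(fairface_root, str(c)).glob("*.jpg"))
            chosen += random.sample(extra, per_class - len(chosen))
        selection[c] = chosen
    return selection
```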
4.4 Improving Fairness Using Additional Training by Adding Generated Images
This experiment examines the effect of adding images from a different dataset,
augmented images, or images generated using a GAN. This is done by adding 1000
images of the class with the lowest Equalized Odds value, aiming for fairness
results that are as balanced as possible in terms of Equalized Odds. The experiment
is conducted over a single epoch during which the
adjusted UTKFace model is trained. After the first
epoch, the performance of the model is tested and the
fairness metrics are calculated. Based on the lowest
Figure 7: Equalized Odds values of the balanced UTKFace dataset. The values are
averaged over 10 runs of the model, with the maximum values shown by the lighter
bars of the respective colors.
Figure 8: Equalized Odds values when performing different
additional training methods to the UTKFace dataset. Each
experiment is averaged over 10 runs. The area around the
lines is the confidence interval calculated by taking the stan-
dard deviation over the 10 runs.
Equalized Odds value, 1000 images from three pos-
sible separate datasets are added. The first dataset
comprised original FairFace images; the second fea-
tured UTKFace images augmented as described in
Section 3.2. The third dataset included images gen-
erated by StyleGAN from original UTKFace images.
All images from these three datasets are split per class, so that images from
a single class can be added to the original dataset. The experiment consists
of three separate runs, each using its own dataset for additional training.
Each of the three experiments is averaged over 10 runs to reduce the effect of
randomness. The outcomes are depicted in Figure 8.
Figure 9: Equalized Odds values calculated over 10 different runs using
different input training datasets. The UTKFace dataset has its original number
of images for each ethnicity. The other datasets have 4000 images for each
class, totaling 16,000 images per dataset. This balances the dataset but
discards some of the original images with the ”White” label.
4.5 Balancing Training Data by Adding
Images Using Various Techniques
Section 4.4 described adding data to the model based on training performance.
In this experiment, different adaptations of the UTKFace dataset are used to
balance the training data and improve the Equalized Odds values of each
ethnicity. By balancing the datasets, every class has the same number of
labeled images, so that each class receives an equal amount of training. The
goal of this experiment is to determine which of these methods can create
fairness in the UTKFace dataset. In practice, extra data is not always
available, so data augmentation or generating new images with StyleGAN can
still be used to balance a dataset. Figure 9 displays the results.
5 DISCUSSION
The experiments indicate that extending training with
new image data improves Equalized Odds across
classes, as shown in Figures 4 and 5. The StyleGAN-
generated data performed particularly well when orig-
inal images were limited. However, initial additional
training on specific classes (Indian and White, as
per Figure 3) led to worse Equalized Odds for other
classes. Comparable Equalized Odds values were
achieved for all classes when training on augmented
data (Figure 4), and notably improved for the Indian
and Black classes. Using an equalized image count
in UTKFace and FairFace datasets (Figure 7) demon-
strated similar Equalized Odds for all classes, albeit
”Indian” and ”White” showed consistent bias. Equal-
izing image count effectively curbed the model’s ten-
dency to favor labels with a higher image count.
Additional training using varying techniques
(Section 4.4) initially decreased Equalized Odds, but
these values increased again after four epochs (Figure 8).
Yet, it failed to bring the Indian class’s Equalized Odds down to the level of
the other classes, due to the model’s bias towards larger image counts.
Results suggest that GAN-generated data, FairFace data, and augmented data perform simi-
larly. To better understand the effect of GAN data, the different data is also
added before training to achieve balanced training images (Section 4.5).
Finally, balanced training data using various tech-
niques revealed the UTK+FairFace dataset as the
worst performer, possibly owing to the StyleGAN
model’s training on unbalanced data. Meanwhile,
data-augmented images performed similarly to the
UTKFace dataset, suggesting data augmentation as a
viable option for balancing and bias reduction. Further experimentation with
different fairness metrics is needed to better understand the impact of these
techniques.
6 CONCLUSION
This paper investigates techniques to enhance fairness
in facial image datasets and models. Through our
literature review, we found models often use biased
image datasets, limiting their global usability. In re-
sponse, the study explores adding data during train-
ing, with a focus on the class exhibiting the poor-
est fairness. Our results show that additional training
and balanced training data effectively improve fair-
ness. As shown in Figure 9, adding StyleGAN-generated images yields the worst
fairness results, likely due to the inherent bias of the pre-trained StyleGAN
model. Future work should
examine whether newer GAN implementations can
enhance fairness. Enhancing fairness will likely in-
crease the usage of these models in real applications
where decisions should not be based on bias. This
emphasizes the need for further exploration and im-
provements in fairness as AI continues to permeate
real-world applications.
REFERENCES
Acien, A., Morales, A., Vera-Rodriguez, R., Bartolome, I.,
and Fierrez, J. (2019). Measuring the gender and eth-
nicity bias in deep models for face recognition. In
Vera-Rodriguez, R., Fierrez, J., and Morales, A., edi-
tors, Progress in Pattern Recognition, Image Analysis,
Computer Vision, and Applications, pages 584–593,
Cham. Springer International Publishing.
Alvi, M., Zisserman, A., and Nellåker, C. (2019). Turning
a blind eye: Explicit removal of biases and variation
from deep neural network embeddings. In Leal-Taixé,
L. and Roth, S., editors, Computer Vision – ECCV
2018 Workshops, pages 556–572, Cham. Springer In-
ternational Publishing.
Angwin, J., Larson, J., Mattu, S., and Kirchner,
L. (2016). Machine bias: There’s software
used across the country to predict future crim-
inals. and it’s biased against blacks. ProP-
ublica. https://www.propublica.org/article/
machine-bias-risk-assessments-in-criminal-sentencing.
Anonymous (2024). Towards Fairness In Machine Learning
- experiments, code and data. https://doi.org/10.5281/
zenodo.12672289.
Balakrishnan, G., Xiong, Y., Xia, W., and Perona, P. (2020).
Towards causal benchmarking of bias in face analy-
sis algorithms. In Vedaldi, A., Bischof, H., Brox, T.,
and Frahm, J.-M., editors, Computer Vision – ECCV
2020, pages 547–563, Cham. Springer International
Publishing.
Benthall, S. and Haynes, B. D. (2019). Racial categories in
machine learning. In Proceedings of the Conference
on Fairness, Accountability, and Transparency, FAT*
’19, page 289–298, New York, NY, USA. Association
for Computing Machinery.
Bogen, M. and Rieke, A. (2018). Help wanted: an exami-
nation of hiring algorithms, equity, and bias.
Cao, Q., Shen, L., Xie, W., Parkhi, O. M., and Zisserman,
A. (2018). Vggface2: A dataset for recognising faces
across pose and age.
Danks, D. and London, A. (2017). Algorithmic bias in au-
tonomous systems. pages 4691–4697.
Goutte, C. and Gaussier, E. (2005). A probabilistic inter-
pretation of precision, recall and f-score, with impli-
cation for evaluation. In Losada, D. E. and Fernández-
Luna, J. M., editors, Advances in Information Re-
trieval, pages 345–359. Springer Berlin Heidelberg.
Hardt, M., Price, E., Price, E., and Srebro, N. (2016). Equal-
ity of opportunity in supervised learning.
Howard, A. and Borenstein, J. (2018). The ugly truth about
ourselves and our robot creations: The problem of bias
and social inequity. Science and Engineering Ethics,
24.
Islam, A. U. (2023). Gender and Ethnicity Bias in Deep
Learning. PhD thesis.
Kärkkäinen, K. and Joo, J. (2019). Fairface: Face attribute
dataset for balanced race, gender, and age. ArXiv,
abs/1908.04913.
Karras, T., Laine, S., and Aila, T. (2019). A style-based
generator architecture for generative adversarial net-
works.
Maluleke, V. H., Thakkar, N., Brooks, T., Weber, E., Dar-
rell, T., Efros, A. A., Kanazawa, A., and Guillory, D.
(2022). Studying bias in gans through the lens of race.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and
Galstyan, A. (2022). A survey on bias and fairness in
machine learning.
Nitzan, Y., Bermano, A., Li, Y., and Cohen-Or, D. (2020).
Face identity disentanglement via latent space map-
ping. ACM Transactions on Graphics (TOG), 39:1–14.
Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru,
T., Hutchinson, B., Smith-Loud, J., Theron, D., and
Barnes, P. (2020). Closing the ai accountability gap:
Defining an end-to-end framework for internal algo-
rithmic auditing.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015).
Facenet: A unified embedding for face recognition
and clustering. In 2015 IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR). IEEE.
Serengil, S. I. and Ozpinar, A. (2021). Hyperextended light-
face: A facial attribute analysis framework. In 2021
International Conference on Engineering and Emerg-
ing Technologies (ICEET), pages 1–4. IEEE.
Srinivasan, R. and Chander, A. (2021). Biases in ai systems.
Commun. ACM, 64(8):44–49.
W Flores, A., Bechtel, K., and Lowenkamp, C. (2016).
False positives, false negatives, and false analyses:
A rejoinder to “machine bias: There’s software used
across the country to predict future criminals. and it’s
biased against blacks.”. Federal probation, 80.
Wang, T., Zhao, J., Yatskar, M., Chang, K.-W., and Or-
donez, V. (2019). Balanced datasets are not enough:
Estimating and mitigating gender bias in deep image
representations.
Zhang, Z., Song, Y., and Qi, H. (2017). Age progression
regression by conditional adversarial autoencoder.