Debiasing Sentence Embedders Through Contrastive Word Pairs
Philip Kenneweg, Sarah Schröder, Alexander Schulz and Barbara Hammer
CITEC, University of Bielefeld, Inspiration 1, 33615 Bielefeld, Germany
Keywords:
NLP, Bias, Transformers, BERT, Debias.
Abstract:
Over the last few years, various sentence embedders have been an integral part of the success of current machine learning approaches to Natural Language Processing (NLP). Unfortunately, multiple sources have shown that the bias inherent in the datasets upon which these embedding methods are trained is learned by them. A variety of different approaches to remove biases in embeddings exists in the literature. Most of these approaches are applicable to word embeddings and in fewer cases to sentence embeddings. Problematically, most debiasing approaches are transferred directly from word embeddings and therefore fail to take into account the nonlinear nature of sentence embedders and the embeddings they produce. It has been shown in the literature that bias information is still present if sentence embeddings are debiased using such methods. In this contribution, we explore an approach to remove linear and nonlinear bias information for NLP solutions without impacting downstream performance. We compare our approach to common debiasing methods on classical bias metrics and on bias metrics which take nonlinear information into account.
1 INTRODUCTION
In the last couple of years, the transformer architecture pioneered by (Vaswani et al., 2017) has enabled large pre-trained neural networks to efficiently tackle previously difficult NLP tasks with relatively few training examples. A common tool deployed to facilitate fast transfer of knowledge is the sentence embedder, which produces a vectorized representation of a given text input, where the original text T can be of arbitrary length. However, longer text passages produce more diluted embeddings.
Sentence embedders produce a summarization of the content that is easy to process for a multitude of different tasks by, e.g., shallow neural networks.
Many common architectures need a fine-tuning step on a specific task to achieve good performance (Devlin et al., 2018; Liu et al., 2019). Recent literature has shown that language models are inherently biased with regard to different protected attributes (Bolukbasi et al., 2016; Caliskan et al., 2017). Commonly investigated bias attributes are religion, gender, etc. (May et al., 2019).
Though the most common source of bias is the
training data, other factors can have a mitigating or
magnifying effect.
Many approaches to reduce bias in word/sentence embeddings exist (Cheng et al., 2021; Ravfogel et al., 2020; Manzini et al., 2019), though most of these approaches were designed for word embeddings, which produce linearly combinable embeddings. Sentence embeddings are based upon the vastly more complex transformer architecture; furthermore, they are often processed by nonlinear multilayer networks. Many of the most common debiasing methods and metrics to evaluate bias in word/sentence embeddings are based upon the assumption of linearity in the embedding space. As this can no longer be guaranteed for sentence embeddings, new debiasing methods and ways to measure bias have to be considered.
In this paper we propose a new additional training objective that can debias sentence embeddings provided with only a few contrastive words that implicitly define the bias direction. Contrastive objectives, which are the basis for popular recent models such as CLIP (Radford et al., 2021) and DALL-E 2 (Ramesh et al., 2022), which builds upon it, have proven to be particularly promising. Our proposed debiasing objective can be applied during the standard fine-tuning procedure required for many tasks or during pre-training, and, as we demonstrate, provides better results than other debiasing procedures, especially in the case of nonlinear bias.
2 RELATED WORK
The most commonly used architecture for sentence
embedders is the transformer architecture, which is
pre-trained on large textual datasets. Common objec-
tives for transformer based pre-training are masked
language modeling (MLM) and next sentence predic-
tion (NSP) (Devlin et al., 2018).
A variety of previous works have shown that language models capture biases in their training data, which can manifest in text representations produced by these models (Bolukbasi et al., 2016; Caliskan et al., 2017). This can further lead to problems in downstream tasks. For instance, Abid et al. (Abid et al., 2021) showed that GPT-3 produces texts inheriting Muslim-violence biases (e.g. in prompt completion). Such findings motivated many works to develop measures for biases in language models such as word and sentence embeddings (Bolukbasi et al., 2016; Caliskan et al., 2017) and debiasing algorithms (Bolukbasi et al., 2016; Liang et al., 2020).
2.1 Measuring Bias in Sentence
Embeddings
Multiple different approaches to measuring bias in
sentence embeddings exist. Some focus on the ge-
ometric relations of words in the embedding space
(Caliskan et al., 2017; May et al., 2019; Bolukbasi
et al., 2016; Manzini et al., 2019), others on the in-
fluence of bias on classification, clustering or other
downstream tasks (Gonen and Goldberg, 2019; Zhao
et al., 2018). In the course of our work, we use the
classification and clustering test by (Gonen and Gold-
berg, 2019). No universally agreed upon test to de-
termine bias exists, and many tests used for bias mea-
surement in sentence embeddings are adapted from
word embeddings. This is accomplished by inserting
words which are defining the bias space into carefully
chosen neutral sentences. Since the sentence em-
bedding space is in many cases highly nonlinear and
more complex in contrast to many classical word em-
bedding counterparts (e.g. word2vec (Mikolov et al.,
2013) uses a single layer without an activation func-
tion to produce the word embedding), it is not clear if
the most common bias measuring methods applied in
the word case can be applied to the sentence embed-
ding case. Further, even in the word embedding con-
text there exists criticism towards many bias metrics
(Gonen and Goldberg, 2019; Schröder et al., 2023).
(Gonen and Goldberg, 2019) propose a classifica-
tion test for bias, where a classifier is trained to dis-
criminate theoretically neutral words by stereotypical
associations (in their case with gender). If the classi-
fier can generalize these associations onto unseen em-
beddings, they are considered biased. The authors use
an RBF-kernel SVM for classification. The test can be easily expanded to sentence embeddings by inserting such words into neutral sentences and then classifying these sentences. In our experiments, we use a list of occupations used in the work of (Bolukbasi et al., 2016) as theoretically neutral words compared to gender attributes. Furthermore, the choice of classifier influences which kinds of biases are detected: linear classifiers can only detect linear biases, whereas nonlinear classifiers can be used to detect more complex biases. The classification test can only detect the presence of bias and its relative amount, but it cannot guarantee that no bias is present.
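For illustration, the following minimal sketch shows how such a classification test can be run on sentence embeddings. The mean-pooling embedder, the neutral template and the toy occupation labels are assumptions made for the example, not the exact setup of (Gonen and Goldberg, 2019) or of this paper.

```python
# Illustrative sketch of the classification bias test on sentence embeddings.
# Template sentence, toy occupation list and stereotype labels are assumptions.
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    # Mean-pooled BERT token representations as a simple sentence embedding.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

occupations = ["nurse", "engineer", "librarian", "carpenter", "stylist", "plumber"]
stereotype = [1, 0, 1, 0, 1, 0]  # 1 = stereotypically female, 0 = male (toy labels)

sentences = [f"My neighbour works as a {job}." for job in occupations]  # neutral template
X, y = embed(sentences), np.array(stereotype)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
# High accuracy on held-out occupations means the stereotype information
# generalizes in the embedding space, i.e. the embeddings carry the bias.
print("bias classification accuracy:", clf.score(X_test, y_test))
```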
2.2 Removing Bias in Sentence
Embeddings
With regard to their bias, word embeddings are well researched in comparison to sentence embeddings. Since most word embedding methods are inherently linear, the approaches used for debiasing them cannot be directly applied to sentence embeddings.
Most recent work on removing bias in sentence embeddings treats the neural network as a black box and only applies the debiasing procedure post-hoc on the sentence embeddings (Liang et al., 2020; Cheng et al., 2021). Other work tries to debias sentence embedders by retraining them on unbiased data (Zhao et al., 2019). Obtaining large quantities of unbiased data, however, proves difficult.
In this paper, we will directly retrain the network
using a custom loss function, assuming that the ca-
pabilities of transformers to understand complex rela-
tions also make them perform well at debiasing.
(Liang et al., 2020) propose the Sent-Debias approach. It utilizes PCA to capture the gender dimension in a large variety of sentences by replacing gender-sensitive words with their counterparts and computing the difference of the produced sentence embeddings. Furthermore, many different naturally occurring sentences from a text corpus are utilized, thus capturing more of the complexity of sentence embeddings than by using purely simple sentence templates.
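As a rough illustration (not the exact implementation of (Liang et al., 2020)), the PCA step can be sketched as follows. The swapped sentence pairs and the use of a single principal component are placeholder assumptions, and embed() refers to the mean-pooling helper from the sketch in Section 2.1.

```python
# Sketch of a Sent-Debias-style linear debiasing step: principal components of
# embedding differences between attribute-swapped sentences define a bias
# subspace that is projected out. Pairs and component count are illustrative.
import numpy as np
from sklearn.decomposition import PCA

swapped_pairs = [
    ("He is a doctor.", "She is a doctor."),
    ("The boy went home.", "The girl went home."),
    ("My father cooks dinner.", "My mother cooks dinner."),
]  # in practice: many naturally occurring corpus sentences

diffs = np.array([embed([a])[0] - embed([b])[0] for a, b in swapped_pairs])
bias_directions = PCA(n_components=1).fit(diffs).components_  # shape (1, dim)

def linear_debias(vector):
    # Remove the projection of an embedding onto the estimated bias subspace.
    return vector - bias_directions.T @ (bias_directions @ vector)
```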
Null-It-Out (Ravfogel et al., 2020) approaches the problem differently by looking at an SVM classifier and a corresponding bias-related task. Their so-called iterative nullspace projection algorithm is then performed, which results in the SVM classifier no longer producing meaningful predictions. They also highlight that the proposed approach only works for linear classifiers and that the corresponding information can be easily recovered by a nonlinear model. We will compare
to this model in our experiments.
FairFil (Cheng et al., 2021) has a similar approach
to generating contrastive sentences as our approach,
but uses only a comparatively small set of manually
selected sentences. Moreover, the neural network is
treated as a black box on which the generated sen-
tence embeddings are debiased using an extra filter.
3 PROPOSED DEBIASING
APPROACH
Approaches like Sent-Debias (Liang et al., 2020) or Null It Out (Ravfogel et al., 2020) are only capable of removing linear biases and dependencies in the sentence embeddings. Simple three-layer neural networks are able to recover most bias from examples that are debiased using these methods, see Tables 1, 2 and 3. For this reason we try to remove both linear and nonlinear information regarding the bias present in the embeddings, while retaining performance on a variety of classification tasks. The literature (Liang et al., 2020) highlights the importance of using a wide variety of different sentences in a semi-supervised fashion to achieve better debiasing. We follow this concept in the present paper. In contrast to (Liang et al., 2020), who use linear projections calculated by a PCA to perform their debiasing, we focus in this work on potential improvements from training the whole network with an additional cost function during fine-tuning, pre-training, or both.
3.1 Definition
First we choose word pairs K_1, K_2, ..., K_n that contrastively define the bias subspace. For example: K_1 = [men, women], K_2 = [boy, girl], K_3 = [muslim, christian, jew].
The words chosen should only differ in their meaning by the targeted bias.
In the next step a large text data set D is searched for occurrences of any of these words k ∈ K_i for all i. Whenever a sentence S is found in which one of these words k occurs, the original sentence S_o and a variation S_c, where the word k ∈ K_i is replaced by one of its counterparts in K_i, are added as a pair (S_o, S_c) to the debias training examples. If multiple counterparts in K_i are available, as is the case in our religion example, one is selected at random. Accordingly, we propose the following loss objective upon which the network is additionally trained:
L(S_o, S_c) = ‖E(S_o) − E(S_c)‖    (1)

And the overall debiasing loss for one training epoch is:

L = ∑_{(S_o, S_c)} L(S_o, S_c)    (2)
Thereby, E(·) is the embedding function produced by the network and ‖·‖ is the Euclidean norm. The intuition behind this objective is to penalize the network for producing different/biased embeddings if only the gender information differs in the sentences. Overall this incentivizes the network to not convey any bias-related information in its embedding.
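For concreteness, a minimal PyTorch sketch of this loss is given below. It assumes mean-pooled BERT token representations as the embedding function E(·); the pooling choice and batching are implementation assumptions rather than part of the definition above.

```python
# Minimal sketch of the debiasing loss of Eqs. 1 and 2: penalize the distance
# between the embeddings of a sentence and its attribute-swapped counterpart.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(last_hidden, mask):
    mask = mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(1) / mask.sum(1)

def debias_loss(original_sentences, counterfactual_sentences):
    # E(S_o) and E(S_c): embeddings of the original and attribute-swapped sentences.
    batch_o = tokenizer(original_sentences, padding=True, return_tensors="pt")
    batch_c = tokenizer(counterfactual_sentences, padding=True, return_tensors="pt")
    e_o = mean_pool(encoder(**batch_o).last_hidden_state, batch_o["attention_mask"])
    e_c = mean_pool(encoder(**batch_c).last_hidden_state, batch_c["attention_mask"])
    # Eq. 1 per pair, Eq. 2 summed over the batch.
    return torch.linalg.norm(e_o - e_c, dim=-1).sum()

loss = debias_loss(["The men went to work."], ["The women went to work."])
loss.backward()  # gradients flow into the whole encoder
```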
The additional objective function we propose can be applied during fine-tuning on a final classification task, or during pre-training of the model.
It is always necessary to perform another training objective (for example the fine-tuning task, or a variety of different pre-training tasks) concurrently, since a trivial solution to minimize the loss described in Equation 2 is to produce the same embedding for every input (or to shrink all embeddings towards zero).
Our additional objective is semi-supervised in order to leverage the capability of transformers to understand complex relations from large amounts of data. All in all, we propose three schemes for debiasing by augmenting the training with Equation 2:
- include the proposed cost function during pre-training, further referred to as pre_p
- include the proposed cost function during fine-tuning, further referred to as fine_p
- include the proposed cost function during pre-training and fine-tuning, further referred to as prefine_p
We utilize the subscript p to refer to our approach, pairwise contrastive bias reduction.
4 EXPERIMENTAL APPROACH
In this section we detail the experimental design employed to investigate the effects of our proposed approaches fine_p, pre_p and prefine_p with respect to the reduction of bias and the performance on downstream tasks. We utilize the Huggingface Transformers library (Wolf et al., 2020) for the implementation and the pre-trained BERT model ('bert-base-uncased') for all experiments in order to reduce training time.
4.1 Generating Contrastive Sentences
In order to produce comparable results, we follow a large line of work in the literature and employ gender attributes for the debiasing evaluation. The word pairs we use to define the gender dimensions could, for example, be: women - men and girl - boy.
A larger selection of 11 word pairs is used for our experiments. For each bias attribute, a large number of sentences (20,000 in our case) in which the bias definition words are present was found in the Multi-News news corpus (Fabbri et al., 2019). These are then utilized as sentences S_o and S_c for our proposed approaches.
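A simple sketch of this pair-generation step is given below. The word groups, the whole-word matching and the (ignored) handling of capitalization are simplifying assumptions for illustration; the real setup uses 11 pairs and roughly 20,000 Multi-News sentences per attribute.

```python
# Sketch of contrastive pair generation: scan a corpus for sentences containing
# an attribute word and create a copy with a randomly chosen counterpart.
import random
import re

word_groups = [["men", "women"], ["boy", "girl"], ["he", "she"]]  # illustrative subset

def make_pairs(corpus_sentences):
    pairs = []
    for sentence in corpus_sentences:
        for group in word_groups:
            for word in group:
                if re.search(rf"\b{word}\b", sentence, flags=re.IGNORECASE):
                    counterpart = random.choice([w for w in group if w != word])
                    swapped = re.sub(rf"\b{word}\b", counterpart, sentence,
                                     flags=re.IGNORECASE)
                    pairs.append((sentence, swapped))  # (S_o, S_c)
    return pairs

print(make_pairs(["The boy plays football.", "Men often commute by car."]))
```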
4.2 Datasets
The GLUE dataset by (Wang et al., 2019) is a collection of other datasets and is widely used to evaluate common natural language processing capabilities of a variety of networks. All datasets used are the versions provided by tensorflow-datasets 4.0.1.
4.3 Implementation Details
Occupation Task: (Gonen and Goldberg, 2019) propose a classification test to determine bias, where a classifier C is trained to discriminate, for embeddings of occupations E(w), whether E(w) is typically male or female. A high accuracy in this setting corresponds to a stereotype present in the embedding.
This test can be directly expanded to the sentence
embedding context by inserting such words into neu-
tral sentences and then classifying these sentences. In
our experiments, we utilize a list of occupations used
in the work of (Bolukbasi et al., 2016) as theoretically
neutral words compared to gender attributes. Since
these occupations are rated by (Bolukbasi et al., 2016)
by how stereotypical they are male/female, the classi-
fication task from (Bolukbasi et al., 2016) is modified
to be a regression to these ratings.
This test is highly relevant for the purposes of this paper since it can easily be modified to capture only linear bias information (by using a single layer without an activation function) or also nonlinear information (by using an MLP). Furthermore, the reported results of this test have low standard deviations compared to the SEAT (Liang et al., 2020) test; additionally, the results of the SEAT test have low statistical significance as reported in (Gonen and Goldberg, 2019).
In our work, we implement the occupation task using the BERT sentence embedder, which was pre-trained and/or fine-tuned using the parameters supplied by the BERT paper. During the training of the bias regressor this part of the network is kept frozen. The produced embeddings are then fed into a Multi Layer Perceptron (MLP). In the linear case the MLP consists of just a single neuron with a sigmoid activation function. In the nonlinear case it consists of 3 dense layers with 20 neurons each and Rectified Linear Unit (ReLU) activation functions in between. Again, a single neuron with a sigmoid activation function acts as the output.

Figure 1: Number of pre-training steps per debiasing step vs. detected bias. Higher occupation task loss denotes less bias present in the embedding.
Each model is trained for 50 epochs on the train-
ing data or until the validation accuracy does not im-
prove for 5 epochs. The optimizer used is SGD with
a learning rate of 0.01. The utilized loss function to
compute the regression loss is the Mean Squared Er-
ror (MSE).
Finally, the resulting loss on the test set after training is the score used in Figures 1 and 2 and Tables 1, 2 and 3.
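The two bias regressors can be sketched as follows. The probe sizes match the description above, while the training details shown (data handling, early stopping) are illustrative assumptions.

```python
# Sketch of the linear and nonlinear bias regressors for the occupation task:
# sentence embeddings stay frozen, only the probe is trained with MSE against
# the stereotype ratings of Bolukbasi et al. (2016).
import torch
import torch.nn as nn

embedding_dim = 768  # BERT base sentence embedding size

linear_probe = nn.Sequential(nn.Linear(embedding_dim, 1), nn.Sigmoid())

nonlinear_probe = nn.Sequential(
    nn.Linear(embedding_dim, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 1), nn.Sigmoid(),
)

def train_probe(probe, embeddings, ratings, epochs=50, lr=0.01):
    # embeddings: frozen sentence embeddings of occupation sentences,
    # ratings: gender stereotype scores in [0, 1].
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(embeddings).squeeze(-1), ratings)
        loss.backward()
        opt.step()
    return loss.item()  # a lower held-out MSE would mean more recoverable bias
```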
4.3.1 Pre-Training
During the additional pre-training of pre_p and prefine_p we use the pre-trained BERT model and train it on a news corpus using the MLM task concurrently with our debiasing method for gender for 400,000 training steps. Using an already pre-trained model is done to speed up convergence.
If the debias loss is balanced with the MLM loss correctly, only the bias-relevant information should be removed from the model, while retaining high performance on the MLM task and possible downstream tasks. In order to estimate a good balance between these two objectives, we evaluate different ratios of MLM steps per debias step on the occupation task (see Figure 1). The debiasing performance clearly drops when the number of pre-training steps per debias step increases. However, the MLM pre-training loss is not affected (not depicted). Accordingly, we use a ratio of 1:1 for the training of pre_p and prefine_p.
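An illustrative training loop for pre_p with this 1:1 ratio could look as follows. The data loading, masking and optimizer settings are assumptions and not the exact configuration used for our experiments.

```python
# Sketch of the pre_p scheme: alternate one MLM step on news text with one
# debiasing step on a batch of contrastive pairs, updating the same encoder.
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(mlm_model.parameters(), lr=2e-5)

def sentence_embed(sentences):
    # Mean-pooled embeddings from the shared BERT encoder inside the MLM model.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = mlm_model.bert(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)

def training_step(mlm_batch, pair_batch):
    # One MLM step (mlm_batch already contains masked input_ids and labels) ...
    optimizer.zero_grad()
    mlm_model(**mlm_batch).loss.backward()
    optimizer.step()
    # ... followed by one debiasing step on contrastive pairs (1:1 ratio).
    originals, counterfactuals = pair_batch
    optimizer.zero_grad()
    debias = torch.linalg.norm(
        sentence_embed(originals) - sentence_embed(counterfactuals), dim=-1
    ).sum()
    debias.backward()
    optimizer.step()
```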
4.3.2 Fine-Tuning
During fine-tuning of prefine_p and fine_p we employ the pre-trained BERT model provided by Huggingface (Wolf et al., 2020) or pre_p as the starting point.
Figure 2: Number of fine-tune steps per debiasing step vs
detected bias. Higher bias classifier loss denotes less bias
present in the embedding. The red line denotes the accuracy
on the downstream task. The blue line shows the bias loss
of the linear model and the yellow line denotes the bias loss
of the nonlinear model. The bias scores and accuracies are
averaged over 5 runs.
In order to estimate the same balance between our debiasing objective and the fine-tuning objective, we perform a search over the relative number of steps on the GLUE task vs. steps on the debiasing task. Since we are in a fine-tuning scenario, we can now incorporate the accuracy on a downstream task into our search. The result is shown in Figure 2. As the number of fine-tuning steps per debiasing step increases, a clear downward trend can be observed for the nonlinear bias score (a lower score equals more bias present in the embedding) and, less clearly, for the linear one. While the accuracy does seem to be negatively affected in this case, the effect is rather minor (in the range of 1-2%). Following this, the same number of steps with our custom training objective is performed as on the classification task during fine-tuning (1:1).
For each fine-tuning task, the pre-trained BERT architecture is used, a single new fully connected layer is added after the sentence embeddings, and then the whole network is trained on the fine-tuning task. All fine-tuning runs are trained for 5 epochs with 7,200 training examples per epoch. Longer training times and hyperparameter tuning could have resulted in better performance, but since the topic of this paper is debiasing these models, only the relative performance of the different training setups matters. The hyperparameters used are the same as recommended in the original BERT paper (Devlin et al., 2018). All fine-tuning experiments are performed 5 times and their averages are reported. This is done to reduce variation in results, since BERT fine-tuning is very sensitive to small changes. The same run can vary in performance by up to 5% just due to non-deterministic training on the graphics card.
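A minimal sketch of this fine-tuning model is shown below. Using the pooled [CLS] output as the sentence embedding is an implementation assumption; during fine_p and prefine_p training, steps on the classification loss and on the debiasing loss of Equation 2 are alternated 1:1 as described above.

```python
# Sketch of the fine-tuning model: pre-trained BERT encoder plus one new fully
# connected layer on top of its sentence embedding.
import torch.nn as nn
from transformers import AutoModel

class FineTuningClassifier(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)  # logits for the downstream task
```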
4.4 Experimental Setup
In the following, we evaluate the performance of our debiasing approach and investigate possible performance drops on downstream tasks.
For the GLUE tasks, we consider a pre-trained BERT model which we fine-tune on one downstream task together with different debiasing approaches: pre-trained and fine-tuned BERT without debiasing (referred to as original Bert), pre-trained and fine-tuned BERT with subsequent debiasing by the Sent-Debias approach (referred to as Sent-Debias), pre-trained and fine-tuned BERT with subsequent debiasing by Null-It-Out (referred to as Null-It-Out), pre-trained BERT together with our proposed fine_p debiasing (which involves fine-tuning; referred to as fine_p) and pre-trained BERT together with our proposed prefine_p debiasing (which involves fine-tuning; referred to as prefine_p). Each of these combinations is, after performing the debiasing (see the descriptions in Sec. 4.3), evaluated on a test set of the corresponding downstream task, giving the accuracy values shown in the tables. Subsequently, we employ the occupation task in order to estimate the amount of bias present in the models. This is done by adding (a) new layer(s) whose parameters are trained to estimate the gender of a job embedded in a sentence (see Secs. 4.1 and 4.3 for more details). According to the number of these new layers we refer to linear bias (one layer) or nonlinear bias (multiple layers, with activation functions in between); high values imply low bias. The results of these experiments are depicted in Section 5.1.
In a second series of experiments we investigate
debiasing capabilities on pre-trained models that are
not fine-tuned. These are displayed in section 5.2.
5 RESULTS
This section displays the results of the four experi-
mental series described previously.
5.1 GLUE
For evaluation, 3 Glue (Wang et al., 2019) tasks
(CoLA, SST2, QNLI) are considered. The results are
described in the following three subsections.
Table 1: Classification accuracies, linear and nonlinear bias (loss on the occupation task) on the Corpus of Linguistic Acceptability task after fine tuning. All fine tuning runs were performed five times and scores averaged. Higher bias scores indicate lower bias present in the embeddings. Baseline bias score (average prediction) equals 0.0197.
method          accuracy          linear bias             nonlinear bias
original Bert   0.783 ± 0.0072    0.00677 ± 0.00106       0.00315 ± 0.00093
Sent-Debias     0.781 ± 0.0095    0.00705 ± 0.00085       0.00258 ± 0.00085
Null-It-Out     0.765 ± 0.0154    0.01635 ± 0.00116       0.00593 ± 0.00524
fine_p          0.765 ± 0.0084    0.00806 ± 0.00375       0.00393 ± 0.00386
prefine_p       0.771 ± 0.0147    0.01100 ± 0.00452       0.01310 ± 0.00617
Table 2: Classification accuracies, linear and nonlinear bias on the Stanford Sentiment Treebank task after fine tuning. All fine tuning runs were performed five times and their scores averaged. Baseline bias score (average prediction) equals 0.0197.
method          accuracy          linear bias             nonlinear bias
original Bert   0.870 ± 0.0092    0.00473 ± 0.00075       0.00259 ± 0.00044
Sent-Debias     0.851 ± 0.0084    0.00482 ± 0.00072       0.00254 ± 0.00047
Null-It-Out     0.864 ± 0.0247    0.01539 ± 0.00105       0.00110 ± 0.00055
fine_p          0.858 ± 0.0122    0.00553 ± 0.00269       0.00288 ± 0.00187
prefine_p       0.851 ± 0.0434    0.00989 ± 0.00101       0.00856 ± 0.00236
5.1.1 Corpus of Linguistic Acceptability
The results for the CoLA data set are summarized in Table 1. The accuracy on the CoLA task does not change strongly when fine-tuning or pre-training with our additional debiasing objective. The best performing model is the original BERT. While Sent-Debias is able to increase performance on the linear occupation task (i.e. reduce bias), it actually performs worse than the original BERT on the nonlinear occupation task. Null-It-Out significantly decreases the linear bias detected and even decreases the nonlinear bias found. Both proposed debiasing approaches are able to decrease the evaluated bias significantly compared to the original BERT model. Especially the prefine_p method achieves significantly higher debias scores in the nonlinear domain, whereas the fine_p method only decreases the measured bias slightly.
5.1.2 Stanford Sentiment Treebank
The results for the SST2 data set are displayed in Table 2. The best accuracy on this task is achieved by the original BERT. prefine_p performs 1.9% worse in accuracy while achieving an almost 100% better linear debias score and a more than 100% improvement in the nonlinear debias score. Interestingly, prefine_p performs better on the occupation task than fine_p, while Sent-Debias along with the original BERT perform worse than prefine_p. Null-It-Out performs the best on the linear bias score, but decreases performance on the nonlinear bias score.
5.1.3 Stanford Question Answering Dataset
The results for the QNLI data set are given in Table 3. Here, the best accuracy is achieved by the original BERT. The best debiasing performance, both linear and nonlinear, is achieved by prefine_p. fine_p and Sent-Debias performed very similarly to the original BERT. While Null-It-Out performs well on the linear bias score, it fails to debias the occupations nonlinearly.
5.2 Pre-Trained Models
A variety of machine learning models try to use sentence embeddings without any fine-tuning. Hence, we investigate the performance of various debiasing methods without any fine-tuning to a specific task.
In Table 4, pre_p performs the best with regard to the bias score. As no task is associated with pure sentence embeddings, no accuracy can be given to compare the performance of the embeddings. For a performance comparison see Tables 1, 2 and 3, where pre_p is further trained with the additional debias loss to produce prefine_p. It can be seen that the performance of prefine_p is lower than the original BERT performance on the GLUE tasks by 1.3% on average.
5.2.1 Summary
Overall, the additional debiasing objectives (our proposed ones and Sent-Debias) do not significantly impact the GLUE task performance compared to the original BERT model.
Table 3: Classification accuracies, linear and nonlinear bias on the Stanford Question Answering Dataset task after fine tuning. Baseline bias score (average prediction) equals 0.0197.
method          accuracy          linear bias           nonlinear bias
original Bert   0.817 ± 0.0060    0.0103 ± 0.0023       0.0070 ± 0.0041
Sent-Debias     0.803 ± 0.0056    0.0108 ± 0.0025       0.0049 ± 0.0024
Null-It-Out     0.814 ± 0.0073    0.0166 ± 0.0015       0.0029 ± 0.0047
fine_p          0.807 ± 0.0153    0.0089 ± 0.0032       0.0053 ± 0.0032
prefine_p       0.807 ± 0.0100    0.0202 ± 0.0020       0.0191 ± 0.0009
Table 4: Linear and nonlinear bias of the pre-trained embedding models without any fine-tuning to a specific task. Higher bias scores indicate lower bias present in the embeddings. Baseline indicates the average prediction.
method          linear bias    nonlinear bias
original Bert   0.0047         0.0018
Sent-Debias     0.0046         0.0020
Null-It-Out     0.0157         0.0021
pre_p           0.0159         0.0159
baseline        0.0197         0.0197
Models where our debiasing objective was applied during fine-tuning, or during fine-tuning and pre-training, report only minor accuracy losses.
In summary, the best performing version of our debiasing algorithms is prefine_p. It beats the original BERT and Sent-Debias in all cases regarding the bias score on the occupation task, while the accuracy on the GLUE tasks is mostly retained. While Null-It-Out performs well on the linear debiasing measure, it completely fails to reduce the nonlinear measure. fine_p often improves the bias score (by a smaller margin than prefine_p) and is a valid option due to being simpler and computationally cheaper than prefine_p.
To further gain an intuition into the proposed debiasing method and its effects on the embedding space, we visualize the average sentence embeddings of the occupations from the bias classification task using DeepView (Schulz et al., 2020), see Figure 3. In the linear case it can be seen that the lighter areas, denoting more uncertainty on the part of the classifier, have grown compared to the original example and that the distinction between the two classes is not as large in the debiased case. Interestingly, the performance of the nonlinear classifier on the classification task has decreased significantly after debiasing, from 99.84% to 83.75% accuracy (a baseline classifier that always outputs the male class achieves an accuracy of 83.33%), even though the visualization does not seem to have changed extensively.
Figure 3: DeepView visualization of the original BERT embeddings vs. pre_p embeddings. Dark areas denote regions of space where the predictor has a high confidence value, whereas lighter areas indicate greater uncertainty. Light blue points indicate female occupations and dark blue points indicate male occupations.

6 CONCLUSIONS

In this paper we presented an easy-to-implement additional training objective that can be applied during pre-training and/or fine-tuning of the network. It
decreases the bias measured by the occupation task clearly, while not impacting the accuracy on downstream tasks. Furthermore, we show that, using this method, we can strongly reduce nonlinear gender information, which could otherwise easily be recovered by a multilayer perceptron, in contrast to most other debiasing algorithms.
Overall, further research on this topic is required,
as there is still no consensus regarding suitable met-
rics for detecting and comparing biases in sentence
embeddings. Even though our debiasing approach
seems promising in erasing bias related information
in sentence embeddings, it is not able to reliably erase
all information present. Thus debiasing algorithms
which are able to completely, precisely and reliably
erase linear and nonlinear information targeting a cer-
tain bias concept are still needed.
All code for this paper can be found at https://github.com/TheMody/Debiasing-Sentence-Embedders-through-contrastive-word-pairs.
ACKNOWLEDGEMENTS
We gratefully acknowledge funding by the BMWi
(01MK20007E) in the project AI-marketplace.
REFERENCES
Abid, A., Farooqi, M., and Zou, J. (2021). Persistent anti-
muslim bias in large language models. In Proceed-
ings of the 2021 AAAI/ACM Conference on AI, Ethics,
and Society, AIES ’21, page 298–306, New York, NY,
USA. Association for Computing Machinery.
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and
Kalai, A. T. (2016). Man is to computer programmer
as woman is to homemaker? debiasing word embed-
dings. In Lee, D., Sugiyama, M., Luxburg, U., Guyon,
I., and Garnett, R., editors, Advances in Neural Infor-
mation Processing Systems, volume 29, pages 4349–
4357. Curran Associates, Inc.
Caliskan, A., Bryson, J. J., and Narayanan, A. (2017). Se-
mantics derived automatically from language corpora
contain human-like biases. Science, 356(6334):183–
186.
Cheng, P., Hao, W., Yuan, S., Si, S., and Carin, L. (2021).
Fairfil: Contrastive neural debiasing method for pre-
trained text encoders. CoRR, abs/2103.06413.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018).
BERT: pre-training of deep bidirectional transformers
for language understanding. Proceedings of the 2019
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, abs/1810.04805.
Fabbri, A. R., Li, I., She, T., Li, S., and Radev, D. R. (2019).
Multi-news: a large-scale multi-document summa-
rization dataset and abstractive hierarchical model.
Gonen, H. and Goldberg, Y. (2019). Lipstick on a pig: De-
biasing methods cover up systematic gender biases in
word embeddings but do not remove them. In Pro-
ceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies.
Liang, P. P., Li, I. M., Zheng, E., Lim, Y. C., Salakhutdinov,
R., and Morency, L.-P. (2020). Towards debiasing sen-
tence representations.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized BERT pre-
training approach. CoRR, abs/1907.11692.
Manzini, T., Yao Chong, L., Black, A. W., and Tsvetkov,
Y. (2019). Black is to criminal as caucasian is to po-
lice: Detecting and removing multiclass bias in word
embeddings. In Proceedings of the 2019 Conference
of the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1.
May, C., Wang, A., Bordia, S., Bowman, S. R., and
Rudinger, R. (2019). On measuring social biases in
sentence encoders.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., et al. (2021). Learning transferable visual models
from natural language supervision. In International
Conference on Machine Learning, pages 8748–8763.
PMLR.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen,
M. (2022). Hierarchical text-conditional image gener-
ation with clip latents. OpenAI papers.
Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M., and Gold-
berg, Y. (2020). Null it out: Guarding protected at-
tributes by iterative nullspace projection. In Proceed-
ings of the 58th Annual Meeting of the Association for
Computational Linguistics.
Schröder, S., Schulz, A., Kenneweg, P., and Hammer, B. (2023). So can we use intrinsic bias measures or not? In International Conference on Pattern Recognition Applications and Methods.
Schulz, A., Hinder, F., and Hammer, B. (2020). Deep-
view: Visualizing classification boundaries of deep
neural networks as scatter plots using discriminative
dimensionality reduction. Proceedings of the Twenty-
Ninth International Joint Conference on Artificial In-
telligence.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L. u., and Polosukhin,
I. (2017). Attention is all you need. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fer-
gus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and
Bowman, S. R. (2019). GLUE: A multi-task bench-
mark and analysis platform for natural language un-
derstanding. In the Proceedings of ICLR.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C.,
Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz,
M., Davison, J., Shleifer, S., von Platen, P., Ma, C.,
Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S.,
Drame, M., Lhoest, Q., and Rush, A. M. (2020). Hug-
gingface’s transformers: State-of-the-art natural lan-
guage processing.
Zhao, J., Wang, T., Yatskar, M., Cotterell, R., Ordonez, V.,
and Chang, K.-W. (2019). Gender bias in contextual-
ized word embeddings.
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang,
K.-W. (2018). Gender bias in coreference resolution:
Evaluation and debiasing methods. In Proceedings of
the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Hu-
man Language Technologies, Volume 2.