An Explorative Guide on How to Detect Forged Car Insurance Claims with Language Models

Quentin Telnoff 1,2, Emanuela Boros 1, Mickael Coustaty 1, Fabrice Crohas 2, Antoine Doucet 1 and Frédéric Le Bars 2

1 University of La Rochelle, L3i, F-17000, La Rochelle, France
2 Itesoft, F-30470, Aimargues, France
Keywords:
Forgery Detection, Tabular Data, Language Models.
Abstract:
Detecting forgeries in insurance car claims is a complex task that requires detecting fraudulent or overstated
claims related to property damage or personal injuries after a car accident. Building predictive models for
detecting them raises several issues (e.g. imbalance, concept drift) that cannot only depend on the frequency or
timing of the reported incidents. The difficulty of tackling this type of task is further intensified by the static
tabular data generally used in this domain, while submitted insurance claims largely consist of textual data. We,
thus, propose an explorative guide for detecting forged car insurance claims with language models. Specifically,
we investigate two transformer-based frameworks: supervised (where the model is trained to differentiate
between forged and non-forged cases) and self-supervised (where the model captures the standard attributes of
non-forged claims). For handling static tabular data and unstructured text fields, we inspect various forms of
data row modelling (table serialization techniques), different losses, and two language models (one general and
one domain-specific). Our work highlights the challenges and limitations of existing frameworks.
1 INTRODUCTION
Financial fraud is the act of gaining improper advantages or financial benefits through illegal and fraudulent methods (Abdallah et al., 2016). This type of fraud can
be committed in different areas, such as insurance,
banking, taxation, and corporate sectors (Kirlidog and
Asuk, 2012; Peng et al., 2006). Specifically, insurance
fraud is a common phenomenon committed against in-
surance companies. According to the Insurance Fraud
Bureau of Australia (IFBA), the cost of fraudulent
claims incurred by the industry is more than $2 billion
annually and represents 10% of reported claims (Itri
et al., 2019a; Subudhi and Panigrahi, 2020). The rapid
advancement of digital processes, which became pri-
marily adopted with the COVID-19 pandemic, offered
great potential for forgeries (insurance professionals believed that 20% of claims could contain fraud, according to https://www.friss.com/insight/insurance-fraud-report-2022/).
When detecting forged car insurance claims, three
main challenges stand out: the data imbalance, the
concept drift, and the tabular format of the data.
First, the number of fraudulent financial transac-
tions is far fewer than non-fraudulent ones, and this
problem of imbalanced data distribution across classes
generally affects the efficiency of machine learning
models (Abdallah et al., 2016; Tennyson and Salsas-
Forn, 2002).
Second, the fraud types change over time, and the
effectiveness of these methods may diminish due to
concept drift, necessitating frequent model retraining
and rebalancing, which can be challenging in real-
world situations (Ryman-Tubb et al., 2018).
Finally, when a claim is submitted to an insurance
company, the information is converted into a tabu-
lar format to align with the structure of information
systems (Ali et al., 2022). This format, commonly
found in public datasets, is not readily conducive to
processing and analysing the full extent of valuable
information. These databases are composed of highly
heterogeneous data, especially when it comes to un-
structured free text, categorical and numerical data
(Borisov et al., 2023). This limitation can hinder the ef-
fectiveness of traditional machine-learning approaches
to fraud detection.
These approaches require text or categorical encod-
ing in order to use mathematical models. However,
these encodings lose information from the original
data. For example, with one-hot encoding (Jiao and Zhang, 2021), all texts and categories are at the same distance from each other; textual semantic information is therefore lost, and the representation suffers from the curse of dimensionality.
In this explorative study, we systematically address
the detection of forged car insurance claims. Our work
involves:
- The investigation of different transformations of heterogeneous tabular data into a sentence, in order to standardize content and use a large language model to handle the tabular format of the data;
- The exploration of a self-supervised and a supervised framework based on BERT, both in general and domain-specific variations;
- The use of different types of loss functions and the design of an optimized threshold to tackle data imbalance.
The rest of the article is organized as follows. Sec-
tion 2 reviews the related work about car insurance
forgery detection, with tabular data modelling using
supervised and unsupervised techniques. Section 3
describes the methodology used in the paper. Section
4 describes the experimental setup, i.e. the studied data with the preprocessing step, the metrics,
and the parameters used in the experiments. Section 5
provides the description of the experiments, the results,
and the analysis. Section 6 concludes this study with
our main findings.
2 RELATED WORK
Fraud detection challenges, such as concept drift,
skewed distribution, and data imbalance, were gen-
erally approached with fraud detection systems based
on supervised approaches (Abdallah et al., 2016; Ali
et al., 2022; Ryman-Tubb et al., 2018), such as support
vector machines (SVMs) (Kirlidog and Asuk, 2012),
XGboost and light gradient-boosting machine (LGBM)
(Kate et al., 2023; Majhi et al., 2019), rule-induction
techniques, decision trees, logistic regression, and
meta-heuristics (e.g. genetic algorithms) (Ali et al.,
2022; Sithic and Balasubramanian, 2013).
With regard to the car insurance claims datasets uti-
lized in this study, several approaches were proposed
that are in line with previous research (Abdallah et al.,
2016; Tennyson and Salsas-Forn, 2002), i.e. addressing data imbalance. This challenge has generally been tackled with
supervised learning and data rebalancing techniques
such as upsampling and oversampling methods (Gupta
et al., 2021; Hassan and Abraham, 2016; Aslam et al.,
2022), synthetic minority oversampling techniques
(SMOTE) (Soufiane et al., 2022; Kate et al., 2023),
undersampling approaches (e.g. fuzzy c-means cluster-
ing) (Subudhi and Panigrahi, 2020; Majhi et al., 2019;
Nian et al., 2016). These methods not only demon-
strate the significance of eliminating noisy and redun-
dant samples from the majority class of highly skewed
imbalanced datasets but also prove efficiency in terms
of lowered false alarms while simultaneously control-
ling the imbalanced class distribution and systematic
identification of fraudulent cases (Sundarkumar and
Ravi, 2015).
In regard to concept drift, when a fraud detec-
tion system is set up, the models cannot be static, as
the environment will evolve because fraud types will
vary over time (Gama et al., 2014). Several methods
were proposed to counter this phenomenon, such as
re-training the model when the drift of a concept is de-
tected, followed by removing minor relevant examples
(Dal Pozzolo et al., 2015) or modelling the distribution
of non-fraudulent data, which is likely to vary less in
time (Krawczyk and Woźniak, 2015). Finally, cluster-
ing techniques can detect suspicious healthcare frauds
from large databases (Peng et al., 2006).
Other research proposed deep learning (DL) mod-
els to gain pragmatic insights into the behaviour of
an insured person using unsupervised variable impor-
tance. For example, variational autoencoders were
trained to reconstruct non-fraud cases with minimal
reconstruction error, with impressive results (Gomes
et al., 2021). Finally, similar to our study, LogBERT (Guo et al., 2021) is a self-supervised approach to anomaly detection in logs based on BERT: it learns the underlying patterns of normal log sequences generated by online systems and detects deviations from these patterns.
3 METHODOLOGY
We explore tabular data modelling with a transformer-
based language model, which is decomposed into three
steps presented in the workflow overview in Figure 1.
3.1 Table Serialization
Table serialization is a method for representing 2-
dimensional tabular data into a 1-dimensional se-
quence of tokens suitable and understandable for a
transformer-based model (Badaro and Papotti, 2022).
It involves converting the rows and fields of a table
into a linear sequence of tokens, such as words or sub-
words. This allows the model to learn the structure and
relationships between the various elements of the table.

Figure 1: Workflow overview. It begins with information extraction from financial documents in tabular format (not covered in the article). Each row in the table represents a document. First, a transformation is applied to the rows of a table composed of fields in order to create an understandable and suitable input for a large language model (Step 1). Finally, in the two last steps, we fine-tune the BERT-based models (Step 2) in order to detect forged cases from authentic cases (Step 3).
More formally, let $n$ and $p$ be two non-zero integers, and let us consider a given table with $n$ rows and $p$ fields. The field names are noted $\{F_j\}_{j=1}^{p}$. Let $i \le n$ and $j \le p$; we define $c_i^j$, the cell of coordinates $i$ and $j$, as the intersection of the $i$-th row and the $j$-th field. Depending on the tokenizer, a cell can be composed of
one word, multiple words, or multiple sub-words. We,
thus, propose three table serialization transforms.
Cell Concatenation Transform (CC). The first
transform consists of the concatenation of the tokens
separated by tokenizer-specific special markers
[CLS]
and [SEP] (Eq. 1).
$$\mathrm{CC}_i = \text{``[CLS] } c_i^1 \text{ [SEP] } c_i^2 \text{ [SEP] } \ldots \text{ [SEP] } c_i^p \text{ [SEP]''} \tag{1}$$
Field & Cell Concatenation Transform (FCC). In
the second transformation (Eq. 2), the field names
are added at the beginning of each cell value and are
separated by “|”. This transformation makes the link be-
tween the field name and the cell value. Instead of only
modelling the relation between cells of a specific row,
this transformation also models the relation between
field names and their cell value.
$$\mathrm{FCC}_i = \text{``[CLS] } F_1 \,|\, c_i^1 \text{ [SEP] } \ldots \text{ [SEP] } F_p \,|\, c_i^p \text{ [SEP]''} \tag{2}$$
Text Template Transform (TT). The idea behind
the third transformation (Eq. 3) is to use a pre-trained
language model, which is trained on large amounts of
text data, to represent the sentence in a semantic space
(Borisov et al., 2022), in order to obtain a tabular representation in the semantic space, i.e., to capture the relations between the features of the tabular data.
$$\mathrm{TT}_i = \text{``[CLS] the } F_1 \text{ is } c_i^1 \text{, } \ldots \text{, the } F_p \text{ is } c_i^p \text{ [SEP]''} \tag{3}$$
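To make the three transforms concrete, the sketch below shows one possible implementation over a pandas row; the helper names and the example field values are illustrative and not the exact code used in the study. In practice, the [CLS] and [SEP] markers can also be left to the tokenizer, which adds them automatically.

```python
import pandas as pd

def cc_transform(row: pd.Series) -> str:
    # Cell Concatenation (CC, Eq. 1): cell values separated by [SEP], prefixed by [CLS].
    cells = " [SEP] ".join(str(v) for v in row.values)
    return f"[CLS] {cells} [SEP]"

def fcc_transform(row: pd.Series) -> str:
    # Field & Cell Concatenation (FCC, Eq. 2): "field | value" pairs separated by [SEP].
    pairs = " [SEP] ".join(f"{field} | {value}" for field, value in row.items())
    return f"[CLS] {pairs} [SEP]"

def tt_transform(row: pd.Series) -> str:
    # Text Template (TT, Eq. 3): a pseudo-natural sentence "the <field> is <value>".
    parts = ", ".join(f"the {field} is {value}" for field, value in row.items())
    return f"[CLS] {parts} [SEP]"

# Illustrative row; the field names are examples, not the full schema of the dataset.
row = pd.Series({"Month": "January", "Make": "Porsche", "AgeOfVehicle": "3 years"})
print(tt_transform(row))
# [CLS] the Month is January, the Make is Porsche, the AgeOfVehicle is 3 years [SEP]
```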
3.2 Pre-Trained Models
The architecture of the pre-trained model chosen is
BERT (Devlin et al., 2019). In this study, we experi-
ment with two pre-trained BERT models: BERT (base, uncased), available at https://huggingface.co/bert-base-uncased and trained on BookCorpus (Zhu et al., 2015) and English Wikipedia, and FinBERT (Yang et al., 2020), available at https://huggingface.co/yiyanghkust/finbert-pretrain, a pre-trained finance-specific language model trained on financial corpora composed of Corporate Reports 10-K & 10-Q, analyst reports, and earnings call transcripts.
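Both checkpoints are hosted on the Hugging Face Hub at the URLs above; a minimal loading sketch, assuming the transformers library, could look as follows.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# General-domain and finance-specific checkpoints used in this study.
CHECKPOINTS = {
    "bert": "bert-base-uncased",
    "finbert": "yiyanghkust/finbert-pretrain",
}

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS["finbert"])
model = AutoModelForMaskedLM.from_pretrained(CHECKPOINTS["finbert"])

# A serialized row (Section 3.1) is then tokenized before fine-tuning.
inputs = tokenizer("the Month is January, the Make is Porsche", return_tensors="pt")
```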
3.3 Fine-Tuning Strategies
This section explores two frameworks: self-supervised
and supervised, with different fine-tuning strategies.
3.3.1 Self-Supervised Framework
The key idea behind this framework is to model the
non-forged data distribution and to detect the forged
data when the data deviates from the modelled dis-
tribution according to a criterion. First, we divide
the dataset into forged cases and non-forged cases.
Then, we fine-tune a BERT model only on non-forged
rows in order to model the non-forged row distribution with two self-supervised tasks. The first one is named the Whole Cell Masking
(WCM), and the second one is the Volume of Hyper-
sphere Minimization (VHM).
Task#1: Whole Cell Masking. Similar to (Herzig
et al., 2020), we use the whole cell masking that prac-
tically masks the entire cell instead of only a token
(word) (Devlin et al., 2019). Let $i \le n$ and consider $r_i$, a selected non-forged row in our dataset. Let $j \le p$ be a selected column in our dataset. Consider the
cell $c_i^j \in r_i$. $c_i^j$ can be composed of one word, sub-words, or words, depending on the tokenizer used. The whole cell masking strategy is to replace all its tokens with [MASK] tokens. Let $t \in c_i^j$, and consider $h_i^t$ the embedding vector of the masked token $t$ given by the output of BERT. Eq. 4 gives the probability distribution over the entire tokenizer vocabulary.
$$\hat{y}_i^t = \log(\mathrm{Softmax}(W h_i^t + b)) \tag{4}$$
where $W$ and $b$ are the classifier parameters. The loss function is then the negative log-likelihood (Eq. 5).
$$\mathcal{L}_{wcm} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{M} \frac{1}{\mathrm{Card}(c_i^{\sigma_i(j)})} \sum_{t \in c_i^{\sigma_i(j)}} y_i^t \cdot \hat{y}_i^t \tag{5}$$
where $M$ is the number of masked cells, $\sigma_i$ is an element of the permutation group of $p$ elements (it selects which cells of row $i$ are masked), and $\mathrm{Card}$ is the number of elements of a set, here the number of tokens in a cell. $y_i^t$ is a one-hot encoding vector whose value is 1 at the coordinate of the original token that has been masked.
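One possible way to realize whole cell masking with a word-piece tokenizer is sketched below: each cell is tokenized separately so that all of its sub-word positions can be masked together, and unmasked positions are excluded from the loss via the usual -100 label convention. This is an assumption about the implementation, not the authors' exact code.

```python
import random
import torch

def whole_cell_masking(cells, tokenizer, n_masked_cells=5):
    """Serialize one row (CC-style) while masking every token of randomly chosen cells."""
    masked = set(random.sample(range(len(cells)), k=min(n_masked_cells, len(cells))))
    input_ids, labels = [tokenizer.cls_token_id], [-100]
    for j, cell in enumerate(cells):
        ids = tokenizer(str(cell), add_special_tokens=False)["input_ids"]
        if j in masked:
            input_ids += [tokenizer.mask_token_id] * len(ids)  # mask the whole cell
            labels += ids                                       # predict the original tokens
        else:
            input_ids += ids
            labels += [-100] * len(ids)                         # ignored by the loss
        input_ids.append(tokenizer.sep_token_id)
        labels.append(-100)
    return torch.tensor([input_ids]), torch.tensor([labels])

# With a BertForMaskedLM model, the masked-cell negative log-likelihood (Eq. 5, up to
# the per-cell normalization) is then: model(input_ids=ids, labels=labels).loss
```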
Task#2: Volume of Hypersphere Minimization.
Similarly to (Guo et al., 2021), we use the training
task named volume of hypersphere minimization. This
task uses the contextual embedding of the [CLS] token, noted $h_i^{[CLS]}$, which represents the entire row and is given by the output of BERT. The volume of hypersphere minimization loss is given by Eq. 6.
$$\mathcal{L}_{VHM} = \frac{1}{n}\sum_{i=1}^{n} \left\| h_i^{[CLS]} - c \right\|_2^2 \quad \text{where} \quad c = \frac{1}{n}\sum_{i=1}^{n} h_i^{[CLS]} \tag{6}$$
First, the hypersphere’s centre,
c
, is computed by av-
eraging all contextual vectors of the non-forged rows.
Then, the task is to minimize the volume of the hy-
persphere by averaging the square Euclidean distance
between the previous hypersphere’s centre and the
contextual vector of the non-forged rows.
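The VHM term can be computed directly from the batch of [CLS] embeddings; a minimal PyTorch sketch, assuming the centre is estimated on the non-forged training rows, is given below.

```python
import torch

def hypersphere_center(cls_embeddings: torch.Tensor) -> torch.Tensor:
    # c = mean of the [CLS] contextual vectors of the non-forged rows (Eq. 6, right).
    return cls_embeddings.mean(dim=0)

def vhm_loss(cls_embeddings: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    # Average squared Euclidean distance to the centre (Eq. 6, left).
    return ((cls_embeddings - center) ** 2).sum(dim=1).mean()

# Example: cls_embeddings has shape (batch_size, hidden_size), e.g. the hidden state
# at position 0 of a BERT encoder.
cls_embeddings = torch.randn(32, 768)
c = hypersphere_center(cls_embeddings)
loss = vhm_loss(cls_embeddings, c)
```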
Loss. We use as the loss function a linear combination of the WCM loss and the VHM loss (Eq. 7).

$$\mathcal{L}_{final} = \mathcal{L}_{wcm} + \alpha\, \mathcal{L}_{VHM} \tag{7}$$
The motivation for choosing this loss is that each
task has a crucial role to play in modelling the distribu-
tion of non-forged data. On the one hand, the first task
models the relationship and the coherence between
cells of a specific row and, sometimes, between cells
and field names depending on the table serialization
presented in Section 3.1. On the other hand, the sec-
ond task gathers non-forged rows close to each other
in the contextual space.
Forgery Criterion. After the fine-tuning step, the
model is trained on non-forged rows. Here, we evalu-
ate how the forged rows deviate from the non-forged
rows distribution thanks to a loss-based criterion. The
strategy is to use the WCM method over each tabular
cell and use the trained model to get the output of the
masked cell. Hence, we obtain $p$ predicted cell distributions (PCD). The $j$-th PCD of the $i$-th row is given by Eq. 8.
$$\mathrm{PCD}_i^j = \{(\hat{y}_i^t,\, y_i^t) \mid t \in c_i^j\} \tag{8}$$
where $\hat{y}_i^t$ is the predicted distribution, and $y_i^t$ is the label.
Finally, the forgery detection criterion is based on the whole cell masking loss values without mean reduction: we sum all the errors the trained model makes on a specific row (Eq. 9).
$$\mathcal{L}_i = -\sum_{j=1}^{p}\;\sum_{(\hat{y}_i,\, y_i)\,\in\, \mathrm{PCD}_i^j} y_i \cdot \hat{y}_i \tag{9}$$
After obtaining all loss values from each row, we use
an optimized threshold, noted $thresh_{opt}$, in order to maximize the micro-F1 score on the forged class.
$$\mathrm{Criterion}(r_i) = \begin{cases}\text{Forged} & \text{if } \mathcal{L}_i > thresh_{opt}\\ \text{Non-forged} & \text{otherwise.}\end{cases} \tag{10}$$
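The optimized threshold can be obtained by a simple sweep over candidate values on the validation split; the sketch below assumes the per-row losses (Eq. 9) and the binary labels are already available, and uses scikit-learn's F1 score as the criterion.

```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_threshold(row_losses: np.ndarray, labels: np.ndarray) -> float:
    """Pick the threshold on L_i that maximizes the F1 score of the forged class."""
    best_thresh, best_f1 = 0.0, -1.0
    for thresh in np.unique(row_losses):
        preds = (row_losses > thresh).astype(int)   # 1 = forged, 0 = non-forged
        f1 = f1_score(labels, preds, zero_division=0)
        if f1 > best_f1:
            best_thresh, best_f1 = thresh, f1
    return best_thresh

# Criterion (Eq. 10): a row is flagged as forged if its loss exceeds the threshold.
# thresh_opt = optimize_threshold(val_losses, val_labels)
# is_forged = row_loss > thresh_opt
```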
3.3.2 Supervised Framework
The key idea behind this framework is to fine-tune
a BERT model for text classification to classify car
insurance claims as forged or non-forged. More pre-
cisely, let $i \le n$, $r_i$ a selected row in our dataset, and $y_i$ its label ($y_i = 1$ is the forged class and $y_i = 0$ is the non-forged class). Then, we use table serialization
transformation on $r_i$ and obtain a list of tokens, noted $t_i$. Regardless of the table serialization transformation, $t_i$ begins with the [CLS] token. Hence, we use $h_i^{[CLS]}$, the contextual embedding output of the [CLS] token given by the BERT model, for the classification task. The classifier part is a linear layer followed by a sigmoid activation function. The output of our framework for $r_i$ is given by Eq. 11.
$$\hat{y}_i = \mathrm{Sigmoid}(W h_i^{[CLS]} + b) \tag{11}$$
Loss. The model is fine-tuned using the task of min-
imization of the binary cross entropy (BCE) given by
Eq. 12.
$$\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right] \tag{12}$$
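A minimal sketch of the supervised framework, assuming a Hugging Face BERT encoder, is shown below: the [CLS] contextual embedding goes through a linear layer and a sigmoid (Eq. 11) and is trained with the BCE loss (Eq. 12). The class name and defaults are illustrative.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ForgeryClassifier(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = out.last_hidden_state[:, 0]                        # [CLS] contextual embedding
        return torch.sigmoid(self.classifier(h_cls)).squeeze(-1)   # Eq. 11

# Eq. 12: binary cross entropy between predictions and forged/non-forged labels.
criterion = nn.BCELoss()
# loss = criterion(model(input_ids, attention_mask), labels.float())
```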
Table 1: Cell reconstruction results (weighted average). The highest values are in bold.
Method R@1 P@1 F1@1 R@3 P@3 F1@3 R@5 P@5 F1@5
w/o VHM
BERT + CC 64.9 ± 0.9 67.4 ± 0.3 64.3 ± 0.3 90.8 ± 0.8 89.1 ± 0.3 88.0 ± 0.7 94.9 ± 0.6 94.7 ± 0.1 94.1 ± 0.3
BERT + FCC 63.2 ± 1.1 66.6 ± 1.0 62.9 ± 1.5 89.6 ± 0.2 88.9 ± 0.3 88.1 ± 0.5 95.5 ± 0.6 94.6 ± 0.2 94.4 ± 0.4
BERT + TT 65.8 ± 1.2 68.1 ± 0.7 64.5 ± 0.8 90.6 ± 0.9 89.3 ± 0.3 88.4 ± 0.5 95.7 ± 0.6 94.9 ± 0.1 94.5 ± 0.4
FinBERT + CC 68.4 ± 0.9 70.6 ± 0.2 67.5 ± 0.5 92.6 ± 1.0 90.3 ± 0.1 89.6 ± 0.3 96.2 ± 0.5 95.0 ± 0.1 94.8 ± 0.2
FinBERT + FCC 67.1 ± 1.0 68.9 ± 0.7 66.1 ± 1.0 92.3 ± 0.5 89.8 ± 0.4 89.5 ± 0.7 96.5 ± 0.4 95.0 ± 0.1 95.0 ± 0.3
FinBERT + TT 66.1± 0.9 69.0 ± 0.3 65.6 ± 0.4 91.8 ± 0.3 90.1 ± 0.2 89.6 ± 0.3 96.0 ± 0.6 95.1 ± 0.1 94.7 ± 0.2
w/ VHM
BERT + CC 64.6 ± 1.3 67.6 ± 0.5 64.3 ± 0.7 90.1 ± 0.7 88.9 ± 0.4 88.1 ± 0.6 95.5 ± 0.4 94.7 ± 0.1 94.4 ± 0.3
BERT + FCC 64.1 ± 0.8 67.6 ± 0.7 63.9 ± 1.0 89.5 ± 0.7 89.0 ± 0.1 88.0 ± 0.4 94.8 ± 0.4 94.5 ± 0.1 94.0 ± 0.2
BERT + TT 65.6 ± 1.3 68.6 ± 0.3 65.4 ± 0.6 90.4 ± 1.0 89.5 ± 0.1 88.7 ± 0.4 95.5 ± 1.1 94.9 ± 0.1 94.6 ± 0.6
FinBERT + CC 68.1 ± 0.8 70.7 ± 0.3 67.6 ± 0.4 92.3 ± 1.1 90.1 ± 0.1 89.3 ± 0.4 96.2 ± 0.7 95.0 ± 0.1 94.6 ± 0.4
FinBERT + FCC 67.0 ± 0.7 69.2 ± 0.5 66.2 ± 0.6 91.4 ± 0.7 89.9 ± 0.3 89.2 ± 0.6 95.7 ± 0.8 95.0 ± 0.1 94.6 ± 0.4
FinBERT + TT 67.2± 0.8 69.4 ± 0.4 66.3 ± 0.6 91.2 ± 1.2 90.1 ± 0.3 89.5 ± 0.6 96.4 ± 0.5 95.0 ± 0.1 94.8 ± 0.3
Forgery Criterion. After the fine-tuning step, the
forgery detection criterion is based on the probability
of the BERT output (Eq. 13).
$$\mathrm{Criterion}(r_i) = \begin{cases}\text{Forged} & \text{if } \hat{y}_i > thresh_{opt}\\ \text{Non-forged} & \text{otherwise.}\end{cases} \tag{13}$$
Usually, a threshold of 0.5 is used as a decision cri-
terion. However, in our case, the dataset used is highly
imbalanced. Thus, we decided to use an optimized threshold, noted $thresh_{opt}$, on validation data in order to maximize the micro-F1 of the forged class.
4 EXPERIMENTAL SETUP
Dataset. We base our study on the real-world car in-
surance claims dataset provided by Angoss Knowledge
Software (Phua et al., 2004), which contains 15,420 ob-
servations, with 14,497 non-fraudulent and 923 fraudu-
lent rows in tabular format (the dataset is available on Kaggle at https://www.kaggle.com/datasets/khusheekapoor/vehicle-insurance-fraud-detection). Each record, represented
by a row, contains a set of attributes of an insurance
company’s customer related to their sociodemographic
profile and the insured vehicle. There are seven numer-
ical attributes and twenty-five categorical attributes,
and each attribute is pre-sociodemographic.
Dataset Preprocessing. We standardized values for
consistent word embeddings. Abbreviations in fields
like Month were expanded (e.g., “January” for “Jan”).
For categories with numerical intervals, such as Num-
ber of supplements, we replaced “none” with “0”. Mis-
spellings, such as “Porche” in the Make category, were
corrected to “Porsche”. The PolicyType category, a
combination of BasePolicy and VehicleCategory, had
4,849 mismatches. We split PolicyType to update the
content of the other two categories and subsequently
removed it (Abakarim et al., 2023).
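These steps map naturally onto a few pandas operations; the sketch below is a rough illustration, and the exact column names, separator, and value mappings are assumptions about the Kaggle schema rather than a faithful reproduction of the authors' preprocessing.

```python
import pandas as pd

# Path and column names are illustrative assumptions about the Kaggle dataset.
df = pd.read_csv("vehicle_insurance_fraud.csv")

# Expand month abbreviations so that word embeddings see full words.
months = {"Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April",
          "May": "May", "Jun": "June", "Jul": "July", "Aug": "August",
          "Sep": "September", "Oct": "October", "Nov": "November", "Dec": "December"}
df["Month"] = df["Month"].replace(months)

# Replace "none" with "0" in interval-style categories and fix known misspellings.
df["NumberOfSuppliments"] = df["NumberOfSuppliments"].replace({"none": "0"})
df["Make"] = df["Make"].replace({"Porche": "Porsche"})

# PolicyType combines VehicleCategory and BasePolicy: split it, overwrite the two
# columns to resolve mismatches, then drop it (assumed "Category - Policy" format).
split = df["PolicyType"].str.split(" - ", expand=True)
df["VehicleCategory"], df["BasePolicy"] = split[0], split[1]
df = df.drop(columns=["PolicyType"])
```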
Training Strategy, Metrics, and Hyperparameters.
We split the dataset 80:20 and utilized 5-fold cross-
validation. Evaluation metrics include precision (P),
recall (R), specificity, and P@k, R@k, and F1@k for
tabular experiments. Models had a learning rate of $10^{-5}$, a batch size of 32, and ran for 10 epochs. In the self-supervised framework, we masked five cells randomly and used $\alpha = 0.1$ for the final loss calculation (see Eq. 7).
5 RESULTS AND ANALYSIS
In this section, we performed three experiments to eval-
uate the proposed frameworks. The first experiment
allows the evaluation of the ability of a self-supervised
framework to model the tabular data (Cell Reconstruc-
tion); the second, the capacity of language models to
detect forged data from non-forged data (Forgery De-
tection); and the third, the loss functions that mitigate
the effect of class imbalance during the fine-tuning
step (Loss Ablation Study).
5.1 Cell Reconstruction
In this section, we evaluate the ability of BERT models
to reconstruct a cell of the table when it is masked in
input based on the semantics of the surrounding cells.
This ability is subsequently used in the forgery detection criterion. All results are reported in Table 1.
Main Findings. First, we notice that the main draw-
back is the bias introduced by the training task. When
a cell is masked, it can be composed of one or sev-
eral mask tokens. This number helps the model select
specific inter-classes in the field.
More specifically, in the field PastNumberOf-
Claims, with intra-classes like “0”, “1”, “2 to 4”, and
“more than 4”, we observed token swaps between “0”
and “1”, and between “2” and “more” (Figure 2a).
Such token permutations are due to differences in
masking lengths for these categories. Similar issues appear in fields like AgeOfVehicle (Figure 2b) and others listed. Despite this, the training task helps the model recognize word order in multi-token cells, such as AgeOfPolicyholder (e.g. the numbers “18”, “21”, “26” are unrecognized as “31”). It ensures the correct sequencing of numbers within categories and consistently predicts using field-specific vocabularies.

Figure 2: First-word confusion matrices of three different fields: (a) past number of claims, (b) age of vehicle, (c) age of policyholder.
Second, even though FinBERT is pre-trained on financial documents, it still has several issues when reconstructing numerical values (e.g., the fields RepNumber, Age, DriverRating, and Year have low F1@1 scores, between 0% and 23.1%). However, these difficulties may also be due to the model's inability to infer specific fields from the surrounding fields.
Third, intra-field class imbalance significantly im-
pacts our experimental results. Fields such as DayPol-
icyClaim, DayPolicyAccident, and others listed have
an F1@1 score above 84.6%. However, the major-
ity class in these fields constitutes over 90% of the
data, skewing results. For instance, in DayPolicyClaim,
“more than 30” accounts for 99% of entries, leading to
high scores due to the disproportionate representation.
W/o VHM versus W/ VHM. We observe that the minimization of the volume of the hypersphere neither improves nor degrades cell reconstruction performance when we compare pairwise, i.e. the same model with the same table serialization, trained with and without the VHM task: the performance metrics are very close to each other.
This observation allows us to keep this learning task
because if it had degraded the quality of the modelling,
we would have had problems with the forgery detec-
tion step.
BERT versus FinBERT. Independently of table se-
rialization, FinBERT has better results than the BERT
(base). If we consider the perfect cell reconstruction
(i.e. R@1, P@1 and F1@1), FinBERT increases the
overall performance metrics by around 4%. However, the higher the tolerance threshold, the smaller FinBERT's advantage over BERT (the improvement drops to around 1%).
Table Serialization. On the one hand, the TT transformation allows BERT (base) to perform better than the other transformations. On the other hand, the CC transformation allows FinBERT to perform better than the other transformations, with an increase of around 3%.
Overall. The best framework composition is based
on FinBERT and CC transformation. It must be em-
phasized that the overall results are very close. It
reconstructs the cell with an R@1, P@1, and F1@1
of approximately 68.5%; when the tolerance threshold increases, the metric values reach around 95%. Thus, the self-supervised framework could be used to model tabular data.
5.2 Forgery Detection
In this section, we evaluate the self-supervised and
supervised frameworks’ ability to model tabular data
to detect forged data while comparing our results with
SoTA methods.
BERT versus FinBERT. Globally, whether it is the
supervised or the self-supervised framework, the pre-
trained FinBERT model provides the best results. In-
deed, in the case of supervised learning, the FinBERT
model with cell concatenation transformation obtains
the best results with high specificity and sensitivity
scores.
Fine-Tuning Strategies. When we compare the su-
pervised method and the self-supervised method, the
first greatly outperforms the second with a 14% im-
provement in specificity, a 69% improvement in sensi-
tivity, a 190% improvement in precision, and a 154% improvement in F1 (Table 2).
Table 2: Results for the supervised and self-supervised frameworks. The highest values per section (SoTA, supervised/self-
supervised w/ VHM and w/o VHM) are in bold, while the overall highest performance values are underlined.
Method Specificity Sensitivity Precision F1
SoTA
(Farquad et al., 2012) 56.22 85.18 - -
(Sundarkumar and Ravi, 2015) 58.39 91.89 - -
(Nian et al., 2016) 52.00 91.00 - -
(Itri et al., 2019b) - 23.83 19.66 21.52
(Majhi et al., 2019) 70.39 97.47 - -
(Subudhi and Panigrahi, 2020) 88.45 83.21 - -
(Kate et al., 2023) 57.5 96.0 - -
Supervised
BERT + CC 86.1 ± 6.9 47.0 ± 13.5 19.2 ± 2.3 26.5 ± 1.0
BERT + FCC 89.0 ± 7.9 42.3 ± 17.6 22.6 ± 3.7 27.4 ± 3.8
BERT + TT 86.9 ± 2.5 50.3 ± 8.8 20.3 ± 0.7 28.7 ± 1.4
FinBERT + CC 89.7 ± 2.3 46.0 ± 9.2 22.9 ± 1.2 30.2 ± 2.3
FinBERT + FCC 85.8 ± 4.4 53.3 ± 13.9 20.1 ± 1.3 28.7 ± 1.7
FinBERT + TT 87.1 ± 7.0 46.8 ± 18.0 20.5 ± 2.7 27.0 ± 3.0
Self-supervised
w/o VHM
BERT + CC 62.9 ± 14.3 43.1 ± 16.3 7.2 ± 0.2 11.9 ± 1.0
BERT + FCC 65.2 ± 8.0 42.0 ± 8.8 7.4 ± 0.4 12.5 ± 0.7
BERT + TT 70.7 ± 12.5 36.9 ± 12.8 7.9 ± 0.5 12.7 ± 0.3
FinBERT + CC 65.6 ± 20.3 40.9 ± 19.6 7.9 ± 1.1 12.5 ± 0.4
FinBERT + FCC 73.9 ± 7.9 36.8 ± 10.6 8.5 ± 0.4 13.6 ± 1.3
FinBERT + TT 64.5 ± 7.1 45.5 ± 9.0 7.8 ± 0.4 13.3 ± 0.9
w/ VHM
BERT + CC 72.0 ± 12.7 34.6 ± 14.0 7.7 ± 0.7 12.3 ± 0.4
BERT + FCC 59.4 ± 28.3 46.4 ± 25.6 7.5 ± 0.8 12.3 ± 0.8
BERT + TT 73.2 ± 10.8 31.8 ± 11.0 7.5 ± 0.7 1.8 ± 0.7
FinBERT + CC 50.5 ± 25.3 55.0 ± 22.6 7.1 ± 0.7 12.3 ± 0.6
FinBERT + FCC 63.9 ± 22.1 43.0 ± 21.6 7.7 ± 0.9 12.5 ± 0.8
FinBERT + TT 78.6 ± 7.7 27.2 ± 8.4 7.9 ± 0.7 11.9 ± 0.7
In addition, we can see the importance of the VHM task in the self-supervised framework: this task improves the specificity score by 6% but reduces the sensitivity by 26%. Even if the decrease in sensitivity is substantial, this task reduces the number of false positives, which is desirable.
Overall. The best results are obtained by the Fin-
BERT model trained in a supervised manner with cell
concatenation transformation with a specificity score
of 89.7%, sensitivity score of 46.0%, Precision score
of 22.9% and F1 score of 30.2%.
Comparison with SoTA. Generally, the methods we
explored yielded distinct results compared to others, as
seen in Table 2. While other methods prioritize detect-
ing forged data, achieving sensitivity scores between
85.18% and 97.47%, their specificity ranges only from
52.00% to 70.39%, resulting in a high false positive
rate. Given a class imbalance of 91:9, they often mis-
classify non-forged data. Our approach aligns more
with (Subudhi and Panigrahi, 2020; Itri et al., 2019a),
emphasizing specificity over sensitivity to reduce false
positives. Compared to (Itri et al., 2019a), our methods
display higher precision and F1, but sensitivity scores
lag behind those in (Subudhi and Panigrahi, 2018).
5.3 Loss Ablation Study
The main objective of these experiments is to improve
the classification rate of forged data of the supervised
framework by using losses designed to change the
contribution of each example, depending on its class,
to mitigate the effect of class imbalance. Using the
same notation as in Section 3, we study three losses, comparing their ability to reduce the effect of class imbalance against binary cross entropy (BCE) (Eq. 12) and mean square error (MSE) (Eq. 14).
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{14}$$
The first loss studied is the weighted mean square
error (WMSE) (Eq. 15). The value of the parameter $\alpha_{y_i}$ is high for the minority class and low for the majority class, to balance the importance of each class during the training phase.
$$\mathrm{WMSE} = \frac{1}{n}\sum_{i=1}^{n} \alpha_{y_i}\, (y_i - \hat{y}_i)^2 \tag{15}$$
Next, we experiment with the dice loss (Li et al.,
2019) (Eq. 16) that has as its objective to give more
importance during the training process to the minor-
ity class (forged) and less to the majority class (non-
forged), with the addition of the hyperparameter $\gamma$, used to smooth the loss and also used by the model to train on the majority class, with squared $\hat{y}_i$ and $y_i^2$ to accelerate the convergence (Eq. 16).
Table 3: Ablation study results. The highest values are in bold.

Metrics | MSE | WMSE | Dice Loss (γ=10^-1) | Dice Loss (γ=10^-2) | Dice Loss (γ=10^-3) | SADL (α=10^-1, γ=10^-1) | SADL (α=10^-1, γ=10^-2) | SADL (α=10^-1, γ=10^-3) | SADL (α=10^-2, γ=10^-1) | SADL (α=10^-2, γ=10^-2) | SADL (α=10^-2, γ=10^-3)
Specificity | 91.6 ± 0.2 | 88.8 ± 1.5 | 91.5 ± 2.4 | 89.5 ± 1.7 | 59.8 ± 17.7 | 36.3 ± 31.8 | 30.5 ± 29.0 | 18.1 ± 35.5 | 51.5 ± 28.5 | 53.9 ± 28.3 | 34.5 ± 42.2
Sensitivity | 37.2 ± 4.5 | 41.6 ± 5.2 | 32.9 ± 7.5 | 38.8 ± 3.6 | 68.7 ± 23.7 | 65.2 ± 29.8 | 78.5 ± 19.3 | 82.8 ± 33.2 | 57.0 ± 23.7 | 52.2 ± 25.9 | 67.8 ± 39.2
Precision | 21.9 ± 2.8 | 19.1 ± 1.2 | 20.1 ± 2.7 | 19.2 ± 1.2 | 10.4 ± 1.7 | 6.4 ± 0.7 | 7.3 ± 1.7 | 6.5 ± 0.8 | 7.7 ± 1.8 | 7.1 ± 0.9 | 6.9 ± 1.2
F1 | 27.6 ± 3.4 | 26.0 ± 1.2 | 24.5 ± 2.1 | 25.6 ± 0.8 | 17.4 ± 2.3 | 11.2 ± 0.7 | 13.2 ± 2.5 | 11.3 ± 0.4 | 13.1 ± 2.2 | 12.1 ± 1.0 | 11.5 ± 1.2
$$\mathrm{DL} = \frac{1}{n}\sum_{i=1}^{n}\left[ 1 - \frac{2\,\hat{y}_i\, y_i + \gamma}{\hat{y}_i^2 + y_i^2 + \gamma}\right] \tag{16}$$
Finally, the self-adjust dice loss (Eq. 17) replaces $\hat{y}_i$ by $(1 - \hat{y}_i)^{\alpha}\, \hat{y}_i$ to push down the weight of easy examples.
$$\mathrm{SADL} = \frac{1}{n}\sum_{i=1}^{n}\left[ 1 - \frac{2\,(1 - \hat{y}_i)^{\alpha}\, \hat{y}_i\, y_i + \gamma}{(1 - \hat{y}_i)^{\alpha}\, \hat{y}_i + y_i + \gamma}\right] \tag{17}$$
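For reference, the three losses can be transcribed into a few lines of PyTorch directly from Eqs. 15-17; this is a sketch of the formulas, not the exact training code, and the default parameter values are only placeholders.

```python
import torch

def weighted_mse(y_hat, y, alpha_pos=16.0, alpha_neg=1.0):
    # Eq. 15: the per-example weight depends on the class (forged vs. non-forged).
    alpha = y * alpha_pos + (1 - y) * alpha_neg
    return (alpha * (y - y_hat) ** 2).mean()

def dice_loss(y_hat, y, gamma=1e-2):
    # Eq. 16: dice loss with smoothing term gamma and squared terms in the denominator.
    return (1 - (2 * y_hat * y + gamma) / (y_hat ** 2 + y ** 2 + gamma)).mean()

def self_adjusting_dice_loss(y_hat, y, alpha=1e-1, gamma=1e-2):
    # Eq. 17: y_hat is replaced by (1 - y_hat)^alpha * y_hat to down-weight easy examples.
    adj = ((1 - y_hat) ** alpha) * y_hat
    return (1 - (2 * adj * y + gamma) / (adj + y + gamma)).mean()

# y_hat are sigmoid outputs in [0, 1]; y are the binary forged/non-forged labels.
y_hat, y = torch.rand(8), torch.randint(0, 2, (8,)).float()
print(weighted_mse(y_hat, y), dice_loss(y_hat, y), self_adjusting_dice_loss(y_hat, y))
```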
5.3.1 Results and Analysis
We examine the impact of losses to prevent overfitting
on the non-forged class and contrast these findings
with the top results from Section 5.2.
MSE vs. WMSE. We set $\alpha_0 = \frac{15420}{15420 - 923} \approx 1$ and $\alpha_1 = \frac{15420}{923} \approx 16$. Weighting the MSE loss implies that
the model will learn less on the non-forged examples
in comparison to the usual MSE, and this effect can
be observed in the results presented in Table 3. On the
one hand, the specificity score of the model trained
with WMSE decreases in comparison with that of the
model trained with MSE. The specificity score drops
from 91.6% to 88.8%. On the other hand, the sensitivity score of the model trained with WMSE increases in comparison with that of the model trained with MSE: the sensitivity score increases from 37.2% to 41.6%.
Dice Loss. We notice that reducing the hyperparam-
eter
γ
implies that the model learns less on the non-
forged examples, i.e. the specificity decreases when
gamma decreases. This score drops from 91.5% to
59.8% (Table 3). However, the model learns more
about forged examples, and this effect can be observed
in Table 3 where the sensitivity score increases from
32.9% to 68% while gamma decreases.
Self-Adjusting Dice Loss. We experiment with dif-
ferent $\gamma$ and $\alpha$ values, and we notice that $\gamma$ affects the model training in the same way as observed with the dice loss. In addition, we observe that the $\alpha$ parameter allows the model to keep learning on the non-forged class during training and to avoid over-fitting on the forged class.
Overall. The results in terms of F1 are lower than the best result given by the BCE loss in Section 5.2 (an F1 of 30.2 ± 2.3 against F1 scores between 11.3 ± 0.4 and 26.0 ± 1.2). In addition, the more the models learn
about the forged data, the more the variability level
increases (e.g. reducing gamma in the dice loss im-
plies that the standard deviations of different metrics
increase in Table 3). This highlights the model’s dif-
ficulty in learning to classify forged from non-forged
data.
6 CONCLUSIONS
Our experiments indicate that FinBERT with cell con-
catenation excels in modelling tabular data through
our self-supervised framework. For detecting forged
claims, the standout is a supervised FinBERT achiev-
ing 89.7% specificity and 46.0% sensitivity, compa-
rable to state-of-the-art results. In the self-supervised
setup, the distinction between forged and authentic
data is not pronounced, underscoring the challenge
of distinguishing them even when guiding the model
towards forged data. Moreover, these experiments allowed us to highlight the difficulty of separating the forged data from the non-forged ones when using various losses that force the model to focus on the forged data.
ACKNOWLEDGEMENTS
This work was supported by the French government in
the framework of the France Relance program and by
the Itesoft company under grant number AD 22-252.
REFERENCES
Abakarim, Y., Lahby, M., and Attioui, A. (2023). A bagged
ensemble convolutional neural networks approach to
recognize insurance claim frauds. volume 6.
Abdallah, A., Maarof, M. A., and Zainal, A. (2016). Fraud
detection system: A survey. Journal of Network and
Computer Applications, 68:90–113.
Ali, A., Abd Razak, S., Othman, S. H., Eisa, T. A. E., Al-
Dhaqm, A., Nasser, M., Elhassan, T., Elshafie, H., and
Saif, A. (2022). Financial fraud detection based on ma-
chine learning a systematic literature review. Applied
Sciences, 12(19):9637.
Aslam, F., Hunjra, A. I., Ftiti, Z., Louhichi, W., and Shams,
T. (2022). Insurance fraud detection: Evidence from
artificial intelligence and machine learning. Research
in International Business and Finance, 62:101744.
Badaro, G. and Papotti, P. (2022). Transformers for tab-
ular data representation: a tutorial on models and
applications. Proceedings of the VLDB Endowment,
15(12):3746–3749.
Borisov, V., Broelemann, K., Kasneci, E., and Kasneci, G.
(2023). Deeptlf: robust deep neural networks for het-
erogeneous tabular data. International Journal of Data
Science and Analytics, 16(1):85–100.
Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M., and
Kasneci, G. (2022). Language models are realistic tab-
ular data generators. arXiv preprint arXiv:2210.06280.
Dal Pozzolo, A., Boracchi, G., Caelen, O., Alippi, C., and
Bontempi, G. (2015). Credit card fraud detection and
concept-drift adaptation with delayed supervised in-
formation. In 2015 International Joint Conference on
Neural Networks (IJCNN), pages 1–8.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019).
BERT: Pre-training of deep bidirectional transformers
for language understanding. In Proceedings of the
2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short
Papers), pages 4171–4186, Minneapolis, Minnesota.
Association for Computational Linguistics.
Farquad, M. A. H., Ravi, V., and Raju, S. B. (2012). Analyt-
ical crm in banking and finance using svm: a modified
active learning-based rule extraction approach. Inter-
national Journal of Electronic Customer Relationship
Management.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and
Bouchachia, A. (2014). A survey on concept drift
adaptation. ACM computing surveys (CSUR), 46(4):1–
37.
Gomes, C., Jin, Z., and Yang, H. (2021). Insurance fraud
detection with unsupervised deep learning. Journal of
Risk and Insurance, 88(3):591–624.
Guo, H., Yuan, S., and Wu, X. (2021). Logbert: Log anomaly
detection via bert. In 2021 International Joint Confer-
ence on Neural Networks (IJCNN), pages 1–8. IEEE.
Gupta, R., Mudigonda, S., and Baruah, P. K. (2021). Tgans
with machine learning models in automobile insurance
fraud detection and comparative study with other data
imbalance techniques. International Journal of Recent
Technology and Engineering, 9:236–244.
Hassan, A. K. I. and Abraham, A. (2016). Modeling insur-
ance fraud detection using imbalanced data classifica-
tion. In Advances in nature and biologically inspired
computing, pages 117–127. Springer.
Herzig, J., Nowak, P. K., Müller, T., Piccinno, F., and Eisen-
schlos, J. (2020). TaPas: Weakly supervised table
parsing via pre-training. In Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 4320–4333, Online. Association for
Computational Linguistics.
Itri, B., Mohamed, Y., Mohammed, Q., and Omar, B.
(2019a). Performance comparative study of machine
learning algorithms for automobile insurance fraud de-
tection. In 2019 Third International Conference on
Intelligent Computing in Data Sciences (ICDS), pages
1–4. IEEE.
Itri, B., Mohamed, Y., Mohammed, Q., and Omar, B.
(2019b). Performance comparative study of machine
learning algorithms for automobile insurance fraud de-
tection. In 2019 Third International Conference on
Intelligent Computing in Data Sciences (ICDS), page
1–4.
Jiao, Q. and Zhang, S. (2021). A brief survey of word
embedding and its recent development. In 2021 IEEE
5th Advanced Information Technology, Electronic and
Automation Control Conference (IAEAC), volume 5,
page 1697–1701.
Kate, P., Ravi, V., and Gangwar, A. (2023). Fingan:
Chaotic generative adversarial network for analyti-
cal customer relationship management in banking
and insurance. Neural Computing and Applications,
35(8):6015–6028.
Kirlidog, M. and Asuk, C. (2012). A fraud detection ap-
proach with data mining in health insurance. Procedia-
Social and Behavioral Sciences, 62:989–994.
Krawczyk, B. and Woźniak, M. (2015). One-class clas-
sifiers with incremental learning and forgetting for
data streams with concept drift. Soft Computing,
19(12):3387–3400.
Li, X., Sun, X., Meng, Y., Liang, J., Wu, F., and Li, J. (2019).
Dice loss for data-imbalanced nlp tasks. arXiv preprint
arXiv:1911.02855.
Majhi, S. K., Bhatachharya, S., Pradhan, R., and Biswal, S.
(2019). Fuzzy clustering using salp swarm algorithm
for automobile insurance fraud detection. Journal of
Intelligent & Fuzzy Systems, 36(3):2333–2344.
Nian, K., Zhang, H., Tayal, A., Coleman, T., and Li, Y.
(2016). Auto insurance fraud detection using unsuper-
vised spectral ranking for anomaly. The Journal of
Finance and Data Science, 2(1):58–75.
Peng, Y., Kou, G., Sabatka, A., Chen, Z., Khazanchi, D., and
Shi, Y. (2006). Application of clustering methods to
health insurance fraud detection. In 2006 International
Conference on Service Systems and Service Manage-
ment, volume 1, pages 116–120. IEEE.
Phua, C., Alahakoon, D., and Lee, V. (2004). Minority
report in fraud detection: Classification of skewed data.
SIGKDD Explor. Newsl., 6(1):50–59.
Ryman-Tubb, N. F., Krause, P., and Garn, W. (2018). How
artificial intelligence and machine learning research
impacts payment card fraud detection: A survey and
industry benchmark. Engineering Applications of Arti-
ficial Intelligence, 76:130–157.
Sithic, H. L. and Balasubramanian, T. (2013). Survey of
insurance fraud detection using data mining techniques.
arXiv preprint arXiv:1309.0806.
Soufiane, E., EL Baghdadi, S.-E., Berrahou, A., Mesbah, A.,
and Berbia, H. (2022). Automobile insurance claims
auditing: A comprehensive survey on handling awry
datasets. In WITS 2020: Proceedings of the 6th Inter-
national Conference on Wireless Technologies, Embed-
ded, and Intelligent Systems, pages 135–144. Springer.
Subudhi, S. and Panigrahi, S. (2018). Detection of automo-
bile insurance fraud using feature selection and data
mining techniques. International Journal of Rough
Sets and Data Analysis (IJRSDA), 5(3):1–20.
Subudhi, S. and Panigrahi, S. (2020). Use of optimized
fuzzy c-means clustering and supervised classifiers for
automobile insurance fraud detection. Journal of King
Saud University - Computer and Information Sciences,
32(5):568–575.
Sundarkumar, G. G. and Ravi, V. (2015). A novel hybrid
undersampling method for mining unbalanced datasets
in banking and insurance. Engineering Applications of
Artificial Intelligence, 37:368–377.
Tennyson, S. and Salsas-Forn, P. (2002). Claims auditing in
automobile insurance: fraud detection and deterrence
objectives. Journal of Risk and Insurance, 69(3):289–
308.
Yang, Y., Uy, M. C. S., and Huang, A. (2020). Finbert: A pre-
trained language model for financial communications.
arXiv preprint arXiv:2006.08097.
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun,
R., Torralba, A., and Fidler, S. (2015). Aligning books
and movies: Towards story-like visual explanations by
watching movies and reading books. In Proceedings of
the IEEE international conference on computer vision,
pages 19–27.