(Dighe et al., 2018). They compare the usage pattern with the current transaction to classify it as either a fraudulent or a legitimate transaction. Among the techniques implemented are KNN, Naïve Bayes, CFLANN, M-Perceptron and DTrees.
It is stated that credit card frauds have no constant patterns (Pumsirirat and Yan, 2018); therefore, the use of unsupervised learning is necessary. They take into account that frauds are committed once through online mediums and that the techniques then change. To address this issue, they implement a deep auto-encoder model and a restricted Boltzmann machine, which reconstruct normal transactions in order to find anomalies in the patterns.
An intelligent agent can obtain a high fraud detection rate with a low false alarm rate, providing a convenient way to detect frauds (Chukwuneke, 2018). Their implementation of the intelligent agent focuses on detecting the fraud while the transaction is in progress, taking into account the customer's pattern; any deviation from the regular pattern is considered a fraudulent transaction.
2.2 Imbalanced Classification
Research work in imbalanced data classification is focused on two levels: the data level and the algorithmic level. At the data level, the objective is to balance the class distribution by manipulating the training samples, through over-sampling of the minority class, under-sampling of the majority class, or combinations of both. The authors take into account that over-sampling can lead to overfitting, while under-sampling loses valuable information from the majority class. On the other hand, the objective of algorithmic-level methods is to increase the importance of the minority class by improving the algorithms through decision threshold adjustment, cost-sensitive learning and ensemble learning (Lin et al., 2019). An alternative loss function for deep neural networks that can capture the classification errors from both the minority class and the majority class has been established (Wang et al., 2016). Another approach extracts hard samples of the minority classes and improves the bootstrapping sampling algorithm to ensure their presence in the training data of each mini-batch, using batch-wise optimization with a Class Rectification Loss function (Dong et al., 2019).
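As a concrete illustration of the data-level strategies described above, the following sketch performs plain random over-sampling and under-sampling with NumPy. The function name, the 1:1 target ratio and the binary label convention are assumptions made for the example, not the procedure of the cited works.

```python
# Random over-sampling of the minority class and random under-sampling of the
# majority class (illustrative sketch; labels and 1:1 target ratio assumed).
import numpy as np

def rebalance(X, y, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]

    # Over-sampling: duplicate randomly drawn minority samples until both
    # classes have the same number of examples.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    X_over, y_over = np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])

    # Under-sampling: keep only a random subset of the majority class of the
    # same size as the minority class (information is discarded).
    keep = rng.choice(majority, size=len(minority), replace=False)
    X_under = np.concatenate([X[keep], X[minority]])
    y_under = np.concatenate([y[keep], y[minority]])

    return (X_over, y_over), (X_under, y_under)
```

Over-sampling leaves the original data in place but repeats minority examples, which is why it can promote overfitting; under-sampling discards majority examples, which is why it can lose information.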
2.3 Reinforcement Learning in
Classification
Recently, deep reinforcement learning has achieved excellent results, because it can assist classifiers in learning important features or selecting good instances from noisy data. The authors cast the classification task as a sequential decision-making process in which multiple agents interact with the environment to obtain the optimal classification policy. However, the interaction between the agents and the environment generates an extremely high time complexity (Lin et al., 2019). A deep reinforcement learning model divided into an instance selector and a relational classifier has been established to learn relation classification in noisy text data. The instance selector implements an agent that selects high-quality sentences from the noisy data, while the relational classifier learns from the previously selected data and gives a reward to the instance selector. The final model obtains a better classifier and a high-quality data set (Feng et al., 2018). A deep reinforcement learning framework for time series data classification has also been established; this framework uses a specific reward function and a clearly formulated Markov process (Martinez et al., 2018). The information available about imbalanced data classification with reinforcement learning is quite limited.
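To make the sequential formulation concrete, the following toy environment presents one training sample per step, takes the predicted label as the action, and returns a reward of +1 for a correct prediction and -1 for an incorrect one. The episode structure and the ±1 reward are assumptions made for this sketch, not the exact formulation of the cited works.

```python
# Toy environment casting binary classification as a sequential decision
# process: one sample per step, the action is the predicted label, and the
# reward signals whether the prediction was correct.
import numpy as np

class ClassificationEnv:
    def __init__(self, X, y, seed=0):
        self.X, self.y = X, y
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Shuffle the data; an episode is one pass over the training set.
        self.order = self.rng.permutation(len(self.y))
        self.i = 0
        return self.X[self.order[self.i]]

    def step(self, action):
        correct = action == self.y[self.order[self.i]]
        reward = 1.0 if correct else -1.0
        self.i += 1
        done = self.i >= len(self.order)
        next_state = None if done else self.X[self.order[self.i]]
        return next_state, reward, done
```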
3 TECHNICAL BACKGROUND
3.1 Q-Learning
Q-Learning is one of the far-reaching reinforcement learning techniques that do not require a model of the environment to learn to execute complex tasks. Essentially, Q-Learning makes it possible for an algorithm to learn a sequential task, where rewards are released in a step-by-step fashion, until a journey called an "episode" is completed. After training, the "educated" agent develops a road-map memory called a "policy", usually represented by a matrix Q, which optimizes reward-capturing trajectories in any definable environment.
ment. Q(s
t
,a
t
) gives the value of taking action a
t
in
a state s
t
. Equation 1 is the leading actor of the Q-
learning algorithm, derived from the Bellman equa-
tion by considering the first and second term of an
infinite series (Watkins and Dayan, 1992):
Q
obs
(s
t
,a
t
) = r + γmax
a
Q(s
t+1
,a
t+1
), (1)
where $\gamma$ is the discount factor, which manages the balance between immediate and future rewards. In this equation the value $Q(s_t, a_t)$ of a state and action is given by the sum of the reward $r$ and the discounted maximum future expected reward after moving to the next state $s_{t+1}$. The value of $Q_{obs}(s_t, a_t)$ is computed by the agent and then used to update its own estimate of $Q^{*}(s_t, a_t)$ in a Q-table. The term $\max_{a} Q(s_{t+1}, a_{t+1})$ gives the maximum value over all actions available in the next state $s_{t+1}$.
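A minimal tabular sketch of this update is given below. The learning rate $\alpha$, the $\epsilon$-greedy exploration that would normally surround it, and the small discrete state space are assumptions of the example; Equation 1 itself only fixes the form of the observed target value.

```python
# Tabular Q-learning update following Equation 1: the observed value
# Q_obs = r + gamma * max_a' Q(s', a') pulls the stored estimate Q(s, a)
# toward itself at rate alpha (alpha is an assumption of this sketch).
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    q_obs = r + gamma * np.max(Q[s_next])   # observed (target) value, Eq. 1
    Q[s, a] += alpha * (q_obs - Q[s, a])    # move the table entry toward it
    return Q

# Usage: a 5-state, 2-action Q-table updated after one observed transition.
Q = np.zeros((5, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```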